Deployment plan
We planned to install OVS + DPDK. To keep the installation smooth and avoid known pitfalls, we followed the official OVS installation guide to the letter, using OVS 2.7 and DPDK 16.11.1.
Official OVS installation guide: http://docs.openvswitch.org/en/latest/intro/install/dpdk/
The installation itself went smoothly. Once it was done, an OVS bridge and port had to be added.
The commands are as follows (the bridge is br2, the port is dpdk0):
ovs-vsctl add-br br2 -- set bridge br2 datapath_type=netdev
ovs-vsctl add-port br2 dpdk0 -- set Interface dpdk0 \
    type=dpdk options:dpdk-devargs=0000:81:00.1
Right after the port was added, the terminal session died. Checking the OVS processes showed that ovs-vswitchd had crashed:
[root@10-0-192-25 src]# ps -aux | grep ovs
root 67164 0.0 0.0 17168 1536 ? Ss 11:07 0:00 ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach
root 67735 0.0 0.0 112648 964 pts/0 S+ 11:10 0:00 grep --color=auto ovs
[root@10-0-192-25 src]#
For comparison, a healthy setup looks like this:
[root@10-0-192-25 src]# ps -aux | grep ovs
root 67918 0.0 0.0 17056 1268 ? Ss 11:11 0:00 ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach
root 67975 16.0 0.0 1312112 2692 ? Ssl 11:11 0:01 ovs-vswitchd unix:/usr/local/var/run/openvswitch/db.sock --pidfile --detach --log-file
root 68015 0.0 0.0 112648 964 pts/0 R+ 11:11 0:00 grep --color=auto ovs
[root@10-0-192-25 src]#
Tracing with gdb to find where the problem occurs
After adding the OVS port the crash reproduced. It happens in the ovs_list_insert function, at the line elem->prev = before->prev;. The value elem=0x18 strongly suggests an invalid access through a NULL pointer. The call stack:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000e78be7 in ovs_list_insert (before=0x1266290 <dpdk_mp_list>, elem=0x18) at ./include/openvswitch/list.h:124
124 elem->prev = before->prev;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64
(gdb) bt
#0 0x0000000000e78be7 in ovs_list_insert (before=0x1266290 <dpdk_mp_list>, elem=0x18)
at ./include/openvswitch/list.h:124
#1 0x0000000000e78c35 in ovs_list_push_back (list=0x1266290 <dpdk_mp_list>, elem=0x18)
at ./include/openvswitch/list.h:164
#2 0x0000000000e79388 in dpdk_mp_get (socket_id=1, mtu=2030) at lib/netdev-dpdk.c:533
#3 0x0000000000e79475 in netdev_dpdk_mempool_configure (dev=0x7fff7ffc55c0) at lib/netdev-dpdk.c:570
#4 0x0000000000e7f294 in netdev_dpdk_reconfigure (netdev=0x7fff7ffc55c0) at lib/netdev-dpdk.c:3134
#5 0x0000000000da7496 in netdev_reconfigure (netdev=0x7fff7ffc55c0) at lib/netdev.c:2001
#6 0x0000000000d7450c in port_reconfigure (port=0x15f7800) at lib/dpif-netdev.c:2952
#7 0x0000000000d7527f in reconfigure_datapath (dp=0x15bdca0) at lib/dpif-netdev.c:3273
#8 0x0000000000d70d87 in do_add_port (dp=0x15bdca0, devname=0x15bc700 "dpdk0", type=0xf55942 "dpdk", port_no=2)
at lib/dpif-netdev.c:1351
#9 0x0000000000d70e7d in dpif_netdev_port_add (dpif=0x15614d0, netdev=0x7fff7ffc55c0, port_nop=0x7fffffffe1b8)
at lib/dpif-netdev.c:1377
#10 0x0000000000d7afb4 in dpif_port_add (dpif=0x15614d0, netdev=0x7fff7ffc55c0, port_nop=0x7fffffffe20c)
at lib/dpif.c:544
#11 0x0000000000d242a0 in port_add (ofproto_=0x15bc940, netdev=0x7fff7ffc55c0) at ofproto/ofproto-dpif.c:3342
#12 0x0000000000d0c726 in ofproto_port_add (ofproto=0x15bc940, netdev=0x7fff7ffc55c0, ofp_portp=0x7fffffffe374)
at ofproto/ofproto.c:1998
#13 0x0000000000cf94fc in iface_do_create (br=0x15615b0, iface_cfg=0x15f8a80, ofp_portp=0x7fffffffe374,
netdevp=0x7fffffffe378, errp=0x7fffffffe368) at vswitchd/bridge.c:1763
#14 0x0000000000cf9683 in iface_create (br=0x15615b0, iface_cfg=0x15f8a80, port_cfg=0x1565320) at vswitchd/bridge.c:1801
#15 0x0000000000cf6f2b in bridge_add_ports__ (br=0x15615b0, wanted_ports=0x1561690, with_requested_port=false)
at vswitchd/bridge.c:912
#16 0x0000000000cf6fbc in bridge_add_ports (br=0x15615b0, wanted_ports=0x1561690) at vswitchd/bridge.c:928
#17 0x0000000000cf6510 in bridge_reconfigure (ovs_cfg=0x1588a80) at vswitchd/bridge.c:644
#18 0x0000000000cfc7a3 in bridge_run () at vswitchd/bridge.c:2961
#19 0x0000000000d01c78 in main (argc=4, argv=0x7fffffffe638) at vswitchd/ovs-vswitchd.c:111
(gdb)
The offending function:
static struct dpdk_mp *
dpdk_mp_get(int socket_id, int mtu)
{
    struct dpdk_mp *dmp;

    ovs_mutex_lock(&dpdk_mp_mutex);
    LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
        if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
            dmp->refcount++;
            goto out;
        }
    }

    dmp = dpdk_mp_create(socket_id, mtu);
    /* Bug: dmp can be NULL here, but &dmp->list_node is
     * used without any check. */
    ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);

out:
    ovs_mutex_unlock(&dpdk_mp_mutex);
    return dmp;
}
Printing the return value of dmp = dpdk_mp_create(socket_id, mtu); in gdb confirmed that it is indeed NULL, and ovs_list_push_back uses its arguments without any validation.
The printed socket_id is 1, which turns out to be an important clue for the rest of the investigation: the memory is being requested on NUMA node 1.
Following this clue backwards, the next question is why no memory could be allocated.
The failure originates in dpdk_mp_create, so the first step is to see which paths in that code can return NULL.
gdb showed that the NULL comes from rte_mempool_create. This is a DPDK library function that OVS calls to create its packet mempool, so continuing from here meant reading DPDK code; there was no way around it.
Inside DPDK, progress got slower and slower, because this path runs through DPDK's memory management, which is hard code to digest. We kept tracing into rte_mempool_create.
The call chain is as follows:
rte_mempool_create
  -> rte_mempool_create_empty
  -> rte_memzone_reserve
  -> rte_memzone_reserve_thread_safe
  -> memzone_reserve_aligned_thread_unsafe
  -> malloc_heap_alloc
  -> find_suitable_element
The trace ends in find_suitable_element. When DPDK's malloc allocates heap memory, this function scans the free lists for a block that fits: it returns the address of a suitable free element, or NULL if none exists.
The function:
static struct malloc_elem *
find_suitable_element(struct malloc_heap *heap, size_t size,
        unsigned flags, size_t align, size_t bound)
{
    size_t idx;
    struct malloc_elem *elem, *alt_elem = NULL;

    for (idx = malloc_elem_free_list_index(size);
            idx < RTE_HEAP_NUM_FREELISTS; idx++) {
        for (elem = LIST_FIRST(&heap->free_head[idx]);
                !!elem; elem = LIST_NEXT(elem, free_list)) {
            if (malloc_elem_can_hold(elem, size, align, bound)) {
                if (check_hugepage_sz(flags, elem->ms->hugepage_sz))
                    return elem;
                if (alt_elem == NULL)
                    alt_elem = elem;
            }
        }
    }

    if ((alt_elem != NULL) && (flags & RTE_MEMZONE_SIZE_HINT_ONLY))
        return alt_elem;

    return NULL;
}
The free lists in this allocator are kept per socket: since OVS passed socket_id 1 into the DPDK call, heap->free_head[idx] here is the free-list head for socket 1.
To find out why that free list is empty, the next step is to look at how it gets initialized. Searching the DPDK code turns up the free-list insertion function: malloc_elem_free_list_insert.
Its call stack, traced with gdb:
main -> bridge_run -> dpdk_init -> dpdk_init__
  -> rte_eal_init -> rte_eal_memzone_init -> rte_eal_malloc_heap_init
  -> malloc_heap_add_memseg -> malloc_elem_free_list_insert
Everything up to rte_eal_init runs as ordinary ovs-vswitchd code; rte_eal_init is where the DPDK library takes over. Working forward along this chain:
rte_eal_malloc_heap_init reads the memory configuration from rte_eal_get_configuration()->mem_config. At this point the configuration holds exactly one entry, with ms->socket_id = 0 and ms->len = 1073741824, i.e. exactly 1 GB. In other words, 1 GB was reserved on NUMA node 0 and nothing on node 1, while at runtime it is node 1 that is trying to carve out the mempool.
int
rte_eal_malloc_heap_init(void)
{
    struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
    unsigned ms_cnt;
    struct rte_memseg *ms;

    if (mcfg == NULL)
        return -1;

    for (ms = &mcfg->memseg[0], ms_cnt = 0;
            (ms_cnt < RTE_MAX_MEMSEG) && (ms->len > 0);
            ms_cnt++, ms++) {
        malloc_heap_add_memseg(&mcfg->malloc_heaps[ms->socket_id], ms);
    }

    return 0;
}
Open-source projects usually ship helper tools for this kind of diagnosis. The ./dpdk-procinfo command also reports DPDK's memory layout: in a healthy dual-socket setup its output lists two 1 GB segments, one per socket_id. If only socket_id 0 has a memory segment, the memory allocation problem is confirmed.
The next step in the chain of reasoning is to find out how rte_eal_get_configuration()->mem_config gets populated.
Searching the code shows that it is filled in by rte_eal_hugepage_init().
Its call stack:
main -> bridge_run -> dpdk_init -> dpdk_init__
  -> rte_eal_init -> rte_eal_memory_init -> rte_eal_hugepage_init
rte_eal_hugepage_init creates one rtemap_xx file under /mnt/huge for each hugepage configured in hugetlbfs and mmaps each file, making sure the mapped virtual addresses line up with the physical pages.
It first maps all hugepages in the page table, using two rounds of mmap to keep the virtual and physical layouts consistent. It then calls calc_num_pages_per_socket to work out how many pages each NUMA node should actually use, and finally calls unmap_unneeded_hugepages to unmap the rest. During deployment we had reserved four 1 GB hugepages, which should have meant two per node, but in practice only one page on node 0 was in use: rte_eal_hugepage_init originally mmapped all four pages, and unmap_unneeded_hugepages later released three of them, keeping a single page on node 0, exactly as dictated by the result of calc_num_pages_per_socket. Reading calc_num_pages_per_socket shows that the per-node page counts are derived from internal_config.memory (the configured total) and the per-node amounts in internal_cfg->socket_mem[i]. internal_cfg->memory is the sum of the socket_mem entries, and socket_mem itself is filled in by eal_parse_socket_mem.
Its call stack:
main -> bridge_run -> dpdk_init -> dpdk_init__ -> rte_eal_init -> eal_parse_args -> eal_parse_socket_mem
Where does the value pushed into internal_cfg->socket_mem[i] come from? That leads back into the OVS code: dpdk_init__ calls get_dpdk_args to collect the configuration options, and socket_mem in particular is set in construct_dpdk_mutex_options.
static int
construct_dpdk_mutex_options(const struct smap *ovs_other_config,
                             char ***argv, const int initial_size,
                             char **extra_args, const size_t extra_argc)
{
    struct dpdk_exclusive_options_map {
        const char *category;
        const char *ovs_dpdk_options[MAX_DPDK_EXCL_OPTS];
        const char *eal_dpdk_options[MAX_DPDK_EXCL_OPTS];
        const char *default_value;
        int default_option;
    } excl_opts[] = {
        {"memory type",
         {"dpdk-alloc-mem", "dpdk-socket-mem", NULL,},
         {"-m", "--socket-mem", NULL,},
         "1024,0", 1   /* defaults: 1024 MB and 0 MB */
        },
    };
    ........
}
So socket_mem defaults to 1024 MB on NUMA node 0 and 0 MB on node 1.
At this point the problem is fully explained. The OVS documentation gives the following command for setting the per-node socket_mem values:
ovs-vsctl --no-wait set Open_vSwitch . \
    other_config:dpdk-socket-mem="1024,1024"
With this setting, nodes 0 and 1 each get 1024 MB. After configuring the OVS bridge and port again, the program runs normally.
Supporting evidence (printed when the process starts):
1. The socket-mem allocation is visible in the startup log.
2. The NIC used for the OVS port has PCI address 0000:81:00.1, and the startup log below shows that this NIC sits on NUMA node 1.
[root@10-0-192-25 src]# ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file
2017-04-06T03:11:50Z|00001|vlog|INFO|opened log file /usr/local/var/log/openvswitch/ovs-vswitchd.log
2017-04-06T03:11:50Z|00002|ovs_numa|INFO|Discovered 16 CPU cores on NUMA node 0
2017-04-06T03:11:50Z|00003|ovs_numa|INFO|Discovered 16 CPU cores on NUMA node 1
2017-04-06T03:11:50Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 32 CPU cores
2017-04-06T03:11:50Z|00005|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connecting...
2017-04-06T03:11:50Z|00006|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connected
2017-04-06T03:11:50Z|00007|dpdk|INFO|DPDK Enabled - initializing...
2017-04-06T03:11:50Z|00008|dpdk|INFO|No vhost-sock-dir provided - defaulting to /usr/local/var/run/openvswitch
2017-04-06T03:11:50Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd --socket-mem 1024,0 -c 0x00000001
EAL: Detected 32 lcore(s)
EAL: Probing VFIO support...
EAL: PCI device 0000:81:00.0 on NUMA socket 1
EAL:   probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:81:00.1 on NUMA socket 1
EAL:   probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:82:00.0 on NUMA socket 1
EAL:   probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:82:00.1 on NUMA socket 1
EAL:   probe driver: 8086:10fb net_ixgbe
Zone 0: name:<rte_eth_dev_data>, phys:0xfbffcec40, len:0x30100, virt:0x7fd77ffcec40, socket_id:0, flags:0
2017-04-06T03:11:52Z|00010|dpdk|INFO|DPDK Enabled - initialized
2017-04-06T03:11:52Z|00011|timeval|WARN|Unreasonably long 1699ms poll interval (146ms user, 1452ms system)
2017-04-06T03:11:52Z|00012|timeval|WARN|faults: 1482 minor, 0 major
2017-04-06T03:11:52Z|00013|timeval|WARN|context switches: 7 voluntary, 38 involuntary
2017-04-06T03:11:52Z|00014|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=edcf6a06: