OVS + DPDK: Debugging the ovs-vswitchd Crash After Configuring an OVS Port


Deployment and installation

To keep the OVS + DPDK installation smooth and avoid unnecessary pitfalls, I followed the official OVS deployment guide to the letter. The OVS version is 2.7 and the DPDK version is 16.11.1.

Official OVS installation guide: http://docs.openvswitch.org/en/latest/intro/install/dpdk/

 

The installation itself went smoothly. Once it was done, the next step was to add an OVS bridge and a port.

 

The commands are as follows (the bridge is br2 and the port is dpdk0):

ovs-vsctl add-br br2 -- set bridge br2 datapath_type=netdev

 

ovs-vsctl add-port br2 dpdk0 -- set Interface dpdk0 \
    type=dpdk options:dpdk-devargs=0000:81:00.1

 

Right after the port was added, the terminal session hung.


 

Checking the OVS-related processes showed that ovs-vswitchd had died:

[root@10-0-192-25 src]# ps -aux | grep ovs
root      67164  0.0  0.0  17168  1536 ?        Ss   11:07   0:00 ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach
root      67735  0.0  0.0 112648   964 pts/0    S+   11:10   0:00 grep --color=auto ovs
[root@10-0-192-25 src]# 

For comparison, a healthy setup looks like this:

[root@10-0-192-25 src]# ps -aux | grep ovs
root      67918  0.0  0.0  17056  1268 ?        Ss   11:11   0:00 ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach
root      67975 16.0  0.0 1312112 2692 ?        Ssl  11:11   0:01 ovs-vswitchd unix:/usr/local/var/run/openvswitch/db.sock --pidfile --detach --log-file
root      68015  0.0  0.0 112648   964 pts/0    R+   11:11   0:00 grep --color=auto ovs
[root@10-0-192-25 src]# 

Tracing with gdb: where exactly does it go wrong?

Re-adding the OVS port under gdb reproduced the problem. The crash is in ovs_list_insert(), on the line elem->prev = before->prev;, which strongly suggests an illegal access through a NULL (or near-NULL) pointer. The call stack is as follows:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000e78be7 in ovs_list_insert (before=0x1266290 <dpdk_mp_list>, elem=0x18) at ./include/openvswitch/list.h:124
124	    elem->prev = before->prev;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64
(gdb) bt
#0  0x0000000000e78be7 in ovs_list_insert (before=0x1266290 <dpdk_mp_list>, elem=0x18)
    at ./include/openvswitch/list.h:124
#1  0x0000000000e78c35 in ovs_list_push_back (list=0x1266290 <dpdk_mp_list>, elem=0x18)
    at ./include/openvswitch/list.h:164
#2  0x0000000000e79388 in dpdk_mp_get (socket_id=1, mtu=2030) at lib/netdev-dpdk.c:533
#3  0x0000000000e79475 in netdev_dpdk_mempool_configure (dev=0x7fff7ffc55c0) at lib/netdev-dpdk.c:570
#4  0x0000000000e7f294 in netdev_dpdk_reconfigure (netdev=0x7fff7ffc55c0) at lib/netdev-dpdk.c:3134
#5  0x0000000000da7496 in netdev_reconfigure (netdev=0x7fff7ffc55c0) at lib/netdev.c:2001
#6  0x0000000000d7450c in port_reconfigure (port=0x15f7800) at lib/dpif-netdev.c:2952
#7  0x0000000000d7527f in reconfigure_datapath (dp=0x15bdca0) at lib/dpif-netdev.c:3273
#8  0x0000000000d70d87 in do_add_port (dp=0x15bdca0, devname=0x15bc700 "dpdk0", type=0xf55942 "dpdk", port_no=2)
    at lib/dpif-netdev.c:1351
#9  0x0000000000d70e7d in dpif_netdev_port_add (dpif=0x15614d0, netdev=0x7fff7ffc55c0, port_nop=0x7fffffffe1b8)
    at lib/dpif-netdev.c:1377
#10 0x0000000000d7afb4 in dpif_port_add (dpif=0x15614d0, netdev=0x7fff7ffc55c0, port_nop=0x7fffffffe20c)
    at lib/dpif.c:544
#11 0x0000000000d242a0 in port_add (ofproto_=0x15bc940, netdev=0x7fff7ffc55c0) at ofproto/ofproto-dpif.c:3342
#12 0x0000000000d0c726 in ofproto_port_add (ofproto=0x15bc940, netdev=0x7fff7ffc55c0, ofp_portp=0x7fffffffe374)
    at ofproto/ofproto.c:1998
#13 0x0000000000cf94fc in iface_do_create (br=0x15615b0, iface_cfg=0x15f8a80, ofp_portp=0x7fffffffe374, 
    netdevp=0x7fffffffe378, errp=0x7fffffffe368) at vswitchd/bridge.c:1763
#14 0x0000000000cf9683 in iface_create (br=0x15615b0, iface_cfg=0x15f8a80, port_cfg=0x1565320) at vswitchd/bridge.c:1801
#15 0x0000000000cf6f2b in bridge_add_ports__ (br=0x15615b0, wanted_ports=0x1561690, with_requested_port=false)
    at vswitchd/bridge.c:912
#16 0x0000000000cf6fbc in bridge_add_ports (br=0x15615b0, wanted_ports=0x1561690) at vswitchd/bridge.c:928
#17 0x0000000000cf6510 in bridge_reconfigure (ovs_cfg=0x1588a80) at vswitchd/bridge.c:644
---Type <return> to continue, or q <return> to quit---
#18 0x0000000000cfc7a3 in bridge_run () at vswitchd/bridge.c:2961
#19 0x0000000000d01c78 in main (argc=4, argv=0x7fffffffe638) at vswitchd/ovs-vswitchd.c:111
(gdb) 

The offending function:

static struct dpdk_mp *
dpdk_mp_get(int socket_id, int mtu)
{
    struct dpdk_mp *dmp;

    ovs_mutex_lock(&dpdk_mp_mutex);
    LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
        if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
            dmp->refcount++;
            goto out;
        }
    }

    dmp = dpdk_mp_create(socket_id, mtu);
    /* Bug: the return value of dpdk_mp_create() is not checked; if dmp is
     * NULL, &dmp->list_node below is still computed and used. */
    ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);

out:
    ovs_mutex_unlock(&dpdk_mp_mutex);

    return dmp;
}

Printing dmp in gdb confirmed that dmp = dpdk_mp_create(socket_id, mtu); does indeed return NULL, and ovs_list_push_back() uses the pointer without any parameter check. That also explains the elem=0x18 in the backtrace: list_node sits at offset 0x18 inside struct dpdk_mp, so &dmp->list_node evaluates to 0x18 when dmp is NULL.
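To make that 0x18 concrete, here is a small standalone illustration. The struct below only mimics the field order of struct dpdk_mp in lib/netdev-dpdk.c and is not the real OVS header; the point is that taking &dmp->list_node with dmp == NULL yields the member offset of list_node.

/* Standalone illustration, not OVS code: mimic the field order of
 * struct dpdk_mp (lib/netdev-dpdk.c, OVS 2.7) to show where 0x18 comes from. */
#include <stddef.h>
#include <stdio.h>

struct ovs_list {                /* same shape as OVS's ovs_list */
    struct ovs_list *prev, *next;
};

struct dpdk_mp_like {            /* hypothetical stand-in for struct dpdk_mp */
    void *mp;                    /* struct rte_mempool * in the real code    */
    int mtu;
    int socket_id;
    int refcount;
    struct ovs_list list_node;   /* 8 + 4 + 4 + 4 bytes + 4 padding = 0x18   */
};

int main(void)
{
    struct dpdk_mp_like *dmp = NULL;

    /* Strictly speaking &dmp->list_node on a NULL pointer is undefined
     * behaviour, but in practice it evaluates to the member offset, which is
     * exactly the elem=0x18 that gdb printed; the SIGSEGV then happens when
     * ovs_list_insert() writes through that bogus address. */
    printf("offsetof(list_node) = 0x%zx\n",
           offsetof(struct dpdk_mp_like, list_node));
    printf("&dmp->list_node     = %p\n", (void *) &dmp->list_node);
    return 0;
}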

socket_id prints as 1, which becomes an important clue for the rest of the investigation: the mempool is being requested on NUMA node 1.

 

Following that clue backwards, the next question is why no memory could be allocated.

The failure is inside dpdk_mp_create(), so the first step is to check which code paths in that function can return NULL.

Stepping through with gdb shows that the NULL comes from rte_mempool_create(), a DPDK library function that OVS calls to create the packet mempool. At this point, going any further means digging into the DPDK code itself, so that is what I did.
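For orientation, the sketch below shows the general shape of such a call through the public DPDK 16.11 rte_mempool_create() API; the pool name and the sizes here are placeholders of my own, not the exact values OVS passes. The key detail is the socket_id argument, which pins the allocation to a single NUMA node's heap.

/* Sketch only: a socket-bound mempool creation in the style of what
 * netdev-dpdk does. Name and sizes are placeholders. */
#include <rte_mbuf.h>
#include <rte_mempool.h>

static struct rte_mempool *
create_pool_on_socket(int socket_id)
{
    struct rte_mempool *mp;

    mp = rte_mempool_create("example_mp",       /* pool name (placeholder)   */
                            8192,               /* number of mbufs           */
                            2048 + 128,         /* element (mbuf) size       */
                            256,                /* per-lcore cache size      */
                            sizeof(struct rte_pktmbuf_pool_private),
                            rte_pktmbuf_pool_init, NULL,
                            rte_pktmbuf_init, NULL,
                            socket_id,          /* NUMA node: 1 in this bug  */
                            0);
    if (mp == NULL) {
        /* With no hugepage memory reserved on this socket, DPDK's allocator
         * cannot find a suitable free element and the call fails; that NULL
         * is what propagates back up into dpdk_mp_get(). */
    }
    return mp;
}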

Progress slows down noticeably once the trail enters DPDK, because this path runs straight through DPDK's memory-management code, which is heavy reading. Continuing into rte_mempool_create(), the call flow is:

rte_mempool_create
  -> rte_mempool_create_empty
  -> rte_memzone_reserve
  -> rte_memzone_reserve_thread_safe
  -> memzone_reserve_aligned_thread_unsafe
  -> malloc_heap_alloc
  -> find_suitable_element

 

The trail ends at find_suitable_element(). When DPDK's malloc allocates heap memory, this function first looks through the free lists for a block that fits; if it finds one it returns the address of that free element, otherwise it returns NULL.

The function:

static struct malloc_elem *
find_suitable_element(struct malloc_heap *heap, size_t size,
		unsigned flags, size_t align, size_t bound)
{
	size_t idx;
	struct malloc_elem *elem, *alt_elem = NULL;

	for (idx = malloc_elem_free_list_index(size);
			idx < RTE_HEAP_NUM_FREELISTS; idx++) {
		for (elem = LIST_FIRST(&heap->free_head[idx]);
				!!elem; elem = LIST_NEXT(elem, free_list)) {
			if (malloc_elem_can_hold(elem, size, align, bound)) {
				if (check_hugepage_sz(flags, elem->ms->hugepage_sz))
					return elem;
				if (alt_elem == NULL)
					alt_elem = elem;
			}
		}
	}

	if ((alt_elem != NULL) && (flags & RTE_MEMZONE_SIZE_HINT_ONLY))
		return alt_elem;

	return NULL;
}

These free lists are maintained per socket-id. Because OVS passed socket_id = 1 down through the DPDK API, the heap being searched here belongs to NUMA node 1, and heap->free_head[idx] is the head of node 1's free lists, which turn out to be empty.
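As a reading aid, here is a heavily abridged sketch of the data structures involved. The field and macro names follow DPDK 16.11 but these are not the complete definitions, and the constants mirror the common default configuration. The heap is selected by socket_id, so if node 1's heap was never populated, find_suitable_element() has nothing to hand back.

/* Abridged illustration of DPDK's per-socket heap layout; not the complete
 * struct definitions from rte_eal_memconfig.h / malloc_heap.h. */
#include <stddef.h>

#define RTE_MAX_NUMA_NODES      8   /* default build-time config value */
#define RTE_HEAP_NUM_FREELISTS 13   /* number of size classes          */

struct malloc_elem;                 /* one free/used block of heap memory */

struct malloc_heap {
    /* One free list per size class; all of them stay empty on a socket
     * that was never given any hugepage memory. */
    struct malloc_elem *free_head[RTE_HEAP_NUM_FREELISTS];
    size_t total_size;
};

struct rte_mem_config_like {
    /* One heap per NUMA socket: malloc_heap_alloc() picks the heap using
     * the socket_id that OVS passed down (1 in this crash). */
    struct malloc_heap malloc_heaps[RTE_MAX_NUMA_NODES];
};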

To figure out why the free list is empty, the next step is to see how it gets initialized in the first place. Searching the DPDK code turns up the function that inserts elements into the free lists: malloc_elem_free_list_insert().

 

Its call stack, traced with gdb, is:

main -> bridge_run -> dpdk_init -> dpdk_init__
  -> rte_eal_init -> rte_eal_memzone_init -> rte_eal_malloc_heap_init
  -> malloc_heap_add_memseg -> malloc_elem_free_list_insert

 

Everything from rte_eal_init onwards is inside the DPDK library; everything before it is OVS code running in the ovs-vswitchd process. Working along this call path:

Looking at rte_eal_malloc_heap_init(): during initialization it walks the memory configuration stored in rte_eal_get_configuration()->mem_config. In this case there is exactly one memory segment, with ms->socket_id = 0 and ms->len = 1073741824, i.e. exactly 1 GB. In other words, 1 GB of memory was reserved on NUMA node 0 and nothing at all on node 1, yet at runtime it is precisely node 1 that the mempool is being requested from.

int
rte_eal_malloc_heap_init(void)
{
	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
	unsigned ms_cnt;
	struct rte_memseg *ms;

	if (mcfg == NULL)
		return -1;

	for (ms = &mcfg->memseg[0], ms_cnt = 0;
			(ms_cnt < RTE_MAX_MEMSEG) && (ms->len > 0);
			ms_cnt++, ms++) {
		malloc_heap_add_memseg(&mcfg->malloc_heaps[ms->socket_id], ms);
	}

	return 0;
}

Open-source projects usually ship some diagnostic tools, and DPDK is no exception: the ./dpdk-procinfo utility can also report DPDK's memory allocation. In a healthy setup it shows two memory segments, one per socket_id, each 1 GB in size. If only the socket_id:0 memory segment shows up, the memory-allocation problem is confirmed.

 

The next step is to find out how DPDK's memory configuration ends up in rte_eal_get_configuration()->mem_config.

Searching the code shows that rte_eal_get_configuration()->mem_config is populated in rte_eal_hugepage_init().

Its call stack is:

main -> bridge_run -> dpdk_init -> dpdk_init__
  -> rte_eal_init -> rte_eal_memory_init -> rte_eal_hugepage_init

 

rte_eal_hugepage_init() mainly creates, under /mnt/huge, one rtemap_xx file per hugepage configured in hugetlbfs, then mmaps each rtemap_xx file, making sure the virtual addresses obtained from mmap stay consistent with the actual physical addresses.
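The underlying mechanism is plain hugetlbfs plus mmap. The sketch below is a standalone toy (rtemap_demo is a made-up file name; the real files are rtemap_0, rtemap_1, ...): mapping a file on the hugetlbfs mount hands back hugepage-backed memory, and rte_eal_hugepage_init() does this for every configured page before remapping to get a consistent layout.

/* Toy sketch, assuming /mnt/huge is mounted with 1 GB hugepages available. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t hugepage_sz = 1024UL * 1024 * 1024;   /* one 1 GB page */
    int fd = open("/mnt/huge/rtemap_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Mapping a hugetlbfs-backed file returns hugepage memory. */
    void *va = mmap(NULL, hugepage_sz, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    if (va == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("mapped 1 GB hugepage at %p\n", va);
    munmap(va, hugepage_sz);
    close(fd);
    unlink("/mnt/huge/rtemap_demo");
    return 0;
}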

   

The function first maps all the hugepages in the page table, using two rounds of mmap to keep virtual and physical addresses consistent. It then calls calc_num_pages_per_socket() to compute how many pages each NUMA node should actually use, and finally calls unmap_unneeded_hugepages() to unmap the pages that are not needed.

During deployment, 4 x 1 GB hugepages had been reserved, which should have given NUMA nodes 0 and 1 two hugepages each. In reality only one hugepage on node 0 ended up in use: rte_eal_hugepage_init() initially mmapped all 4 pages, but unmap_unneeded_hugepages() later released them, keeping a single page on node 0 only. That trimming follows the result of calc_num_pages_per_socket(), and reading that function shows the per-node page count is derived from the total memory configured in internal_config.memory together with the amount configured for each node. internal_cfg->memory is the sum of internal_cfg->socket_mem[i], and internal_cfg->socket_mem[i] is filled in by eal_parse_socket_mem().
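A minimal model of that per-socket accounting, assuming 1 GB hugepages and the default --socket-mem 1024,0 split (this is a simplification for illustration, not the real calc_num_pages_per_socket() code):

/* Simplified model of how the requested socket memory translates into the
 * number of hugepages kept per NUMA node. The real DPDK code handles many
 * more cases (mixed page sizes, unassigned memory, rounding, ...). */
#include <stdint.h>
#include <stdio.h>

#define NUM_SOCKETS  2
#define HUGEPAGE_SZ  (1024ULL * 1024 * 1024)      /* 1 GB pages, as on this host */

int main(void)
{
    /* Default OVS EAL arguments: --socket-mem 1024,0 */
    uint64_t socket_mem[NUM_SOCKETS] = { 1024ULL * 1024 * 1024, 0 };

    for (int s = 0; s < NUM_SOCKETS; s++) {
        uint64_t pages = socket_mem[s] / HUGEPAGE_SZ;
        printf("socket %d: keep %llu hugepage(s)\n",
               s, (unsigned long long) pages);
    }
    /* Prints: socket 0 keeps 1 page, socket 1 keeps 0 pages. The remaining
     * mapped pages are the ones unmap_unneeded_hugepages() releases, which
     * is why node 1's heap ends up empty. */
    return 0;
}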

The call stack of eal_parse_socket_mem() is:

main -> bridge_run -> dpdk_init -> dpdk_init__
  -> rte_eal_init -> eal_parse_args -> eal_parse_socket_mem

Where the internal_cfg->socket_mem[i] values actually come from traces back to OVS's own code, in dpdk_init__(). That function calls get_dpdk_args() to collect the configuration parameters, and the part that sets socket_mem lives in construct_dpdk_mutex_options():

static int
construct_dpdk_mutex_options(const struct smap *ovs_other_config,
                             char ***argv, const int initial_size,
                             char **extra_args, const size_t extra_argc)
{
    struct dpdk_exclusive_options_map {
        const char *category;
        const char *ovs_dpdk_options[MAX_DPDK_EXCL_OPTS];
        const char *eal_dpdk_options[MAX_DPDK_EXCL_OPTS];
        const char *default_value;
        int default_option;
    } excl_opts[] = {
        {"memory type",
         {"dpdk-alloc-mem", "dpdk-socket-mem", NULL,},
         {"-m",             "--socket-mem",    NULL,},
         "1024,0", 1
        },
    };   // default: 1024 MB on socket 0 and 0 MB on socket 1
........
 }

So the default socket_mem configuration is 1024 MB on NUMA node 0 and 0 MB on node 1.

 

At this point the root cause is clear. The OVS documentation gives the following command to set the socket_mem value for each node:

ovs-vsctl --no-wait set Open_vSwitch . \
    other_config:dpdk-socket-mem="1024,1024"

 

With this setting, NUMA nodes 0 and 1 each get 1024 MB. After configuring the OVS bridge and port again, the process runs normally.

 

Supporting evidence (printed when the process starts):

1. The socket-mem split is visible in the EAL arguments of the startup log below (--socket-mem 1024,0).


2. The NIC used when adding the OVS port has PCI address 0000:81:00.1, and the startup log below shows that this NIC sits on NUMA node 1.

[root@10-0-192-25 src]# ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file
2017-04-06T03:11:50Z|00001|vlog|INFO|opened log file /usr/local/var/log/openvswitch/ovs-vswitchd.log
2017-04-06T03:11:50Z|00002|ovs_numa|INFO|Discovered 16 CPU cores on NUMA node 0
2017-04-06T03:11:50Z|00003|ovs_numa|INFO|Discovered 16 CPU cores on NUMA node 1
2017-04-06T03:11:50Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 32 CPU cores
2017-04-06T03:11:50Z|00005|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connecting...
2017-04-06T03:11:50Z|00006|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connected
2017-04-06T03:11:50Z|00007|dpdk|INFO|DPDK Enabled - initializing...
2017-04-06T03:11:50Z|00008|dpdk|INFO|No vhost-sock-dir provided - defaulting to /usr/local/var/run/openvswitch
2017-04-06T03:11:50Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd --socket-mem 1024,0 -c 0x00000001
EAL: Detected 32 lcore(s)
EAL: Probing VFIO support...
EAL: PCI device 0000:81:00.0 on NUMA socket 1
EAL:   probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:81:00.1 on NUMA socket 1
EAL:   probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:82:00.0 on NUMA socket 1
EAL:   probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:82:00.1 on NUMA socket 1
EAL:   probe driver: 8086:10fb net_ixgbe
Zone 0: name:<rte_eth_dev_data>, phys:0xfbffcec40, len:0x30100, virt:0x7fd77ffcec40, socket_id:0, flags:0
2017-04-06T03:11:52Z|00010|dpdk|INFO|DPDK Enabled - initialized
2017-04-06T03:11:52Z|00011|timeval|WARN|Unreasonably long 1699ms poll interval (146ms user, 1452ms system)
2017-04-06T03:11:52Z|00012|timeval|WARN|faults: 1482 minor, 0 major
2017-04-06T03:11:52Z|00013|timeval|WARN|context switches: 7 voluntary, 38 involuntary
2017-04-06T03:11:52Z|00014|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=edcf6a06:



