OVS startup
vswitchd/ovs-vswitchd.c: main-->netdev_run-->netdev_initialize-->netdev_dpdk_register
-->netdev_register_provider registers dpdk_vhost_user_class.
Adding a DPDK port triggers creation of the PMD threads:
dpif_netdev_port_add-->do_add_port-->dp_netdev_set_pmds_on_numa-->pmd_thread_main
If DPDK ports have already been added, PMD thread creation is also triggered at startup:
dpif_netdev_pmd_set-->dp_netdev_reset_pmd_threads-->dp_netdev_set_pmds_on_numa-->pmd_thread_main
dp_netdev_process_rxq_port receives the packets, then calls dp_netdev_input-->dp_netdev_input__ to perform the flow-table lookup,
and finally calls packet_batch_execute-->dp_netdev_execute_actions to execute the matched actions.
Poll mode thread
pmd_thread_main
- pmd_thread_setaffinity_cpu pins the thread to its lcore.
- An infinite for(;;) loop runs the datapath.
- An inner for loop iterates over the polled ports, calling dp_netdev_process_rxq_port on each.
- Between iterations, port and queue information is reloaded whenever the configuration changes.
dp_netdev_process_rxq_port
- Calls netdev_rxq_recv to receive packets, with cycle counting before and after.
- Calls dp_netdev_input to hand the packets to flow lookup and transmission, again with cycle counting before and after.
- netdev_rxq_recv=>netdev_dpdk_vhost_rxq_recv
- Calls the DPDK API rte_vhost_dequeue_burst to receive packets.
- Calls netdev_dpdk_vhost_update_rx_counters to update statistics.
dp_netdev_input=>dp_netdev_input__
- emc_processing extracts a key from each received packet and looks it up in the exact-match cache; matched packets are appended to their flow's batch, and the number of unmatched packets is returned.
- If there are unmatched packets, fast_path_processing searches the full flow tables; on a hit the flow is installed into the cache, otherwise the packet is sent up to the controller.
- packet_batch_execute applies each matched flow's actions to its batch of packets.
emc_processing
- Calls miniflow_extract to parse each packet into a key.
- Calls emc_lookup to search the hash table and compare the full keys.
- On a match, calls dp_netdev_queue_batches to append the packet to flow->batches; unmatched packets are compacted to the front of the batch.
- Calls dp_netdev_count_packet to count the matched packets.
fast_path_processing
- dpcls_lookup searches the subtables via the classifier. If every packet finds a matching subtable entry, the flow is installed into the cache and the packets are appended to flow->batches; unmatched packets are sent up to the controller.
- Counts matches, misses, and lost packets.
packet_batch_per_flow_execute
- Calls dp_netdev_flow_get_actions to fetch the flow's actions, then dp_netdev_execute_actions to execute them.
Action execution
dp_netdev_execute_actions=>odp_execute_actions
dp_execute_cb
- For OVS_ACTION_ATTR_OUTPUT, calls dp_netdev_lookup_port to find the port, then netdev_send to transmit the packet.
- For OVS_ACTION_ATTR_TUNNEL_PUSH, calls push_tnl_action to add the tunnel header, then dp_netdev_recirculate-->dp_netdev_input__ to re-run the lookup.
- For OVS_ACTION_ATTR_TUNNEL_POP, calls netdev_pop_header to strip the tunnel header, then dp_netdev_recirculate-->dp_netdev_input__ to re-run the lookup.
netdev_send=>netdev_dpdk_vhost_send=>__netdev_dpdk_vhost_send
- Calls the DPDK API rte_vhost_enqueue_burst in a loop to transmit the packets.
- Calls netdev_dpdk_vhost_update_tx_counters to update statistics.
DPDK receive path
- rte_vhost_dequeue_burst involves one data copy.
DPDK transmit path
rte_vhost_enqueue_burst-->virtio_dev_rx
- Prefetches the vhost_virtqueue metadata.
- Places count packets into the ring; this involves one data copy.
- Issues a memory barrier to synchronize with the peer.
- If the peer supports interrupts, notifies it via eventfd_write that packets are available; typically a virtio guest uses interrupts while virtio-user does not.
pmd_thread_main analysis
A PMD thread continuously polls the input ports in its poll list, receiving up to 32 packets at a time per port (NETDEV_MAX_BURST). Each received packet is classified against the active flow rules; the purpose of classification is to find a flow so the packet can be processed appropriately. Packets are grouped by flow, and each group executes that flow's specific actions.
pmd_thread_main is the entry point of OVS's userspace receive path, driven by the polling PMD threads.
pmd_thread_main(void *f_)
{
    struct dp_netdev_pmd_thread *pmd = f_;
    unsigned int lc = 0;
    struct polled_queue *poll_list;
    bool exiting;
    int poll_cnt;
    int i;

    poll_list = NULL;
    ...
    /* Store pmd->poll_list into poll_list and return the number of polled queues. */
    poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
reload:
    emc_cache_init(&pmd->flow_cache);
    ...
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
            dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
                                       poll_list[i].port_no);
        }
        ...
    }

    poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
    exiting = latch_is_set(&pmd->exit_latch); /* if exit_latch is set, the pmd thread terminates */
    /* Signal here to make sure the pmd finishes
     * reloading the updated configuration. */
    dp_netdev_pmd_reload_done(pmd);

    emc_cache_uninit(&pmd->flow_cache);

    if (!exiting) {
        goto reload;
    }

    free(poll_list);
    pmd_free_cached_ports(pmd);
    return NULL;
}
pmd_thread_main calls dp_netdev_process_rxq_port to handle the netdev receive path:
dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
                           struct netdev_rxq *rx,
                           odp_port_t port_no)
{
    struct dp_packet_batch batch;
    int error;

    dp_packet_batch_init(&batch);
    cycles_count_start(pmd);
    /* Call netdev_class->rxq_recv to receive packets from rx into batch. */
    error = netdev_rxq_recv(rx, &batch);
    cycles_count_end(pmd, PMD_CYCLES_POLLING);
    if (!error) {
        *recirc_depth_get() = 0;

        cycles_count_start(pmd);
        /* Hand the packets in batch to the datapath for processing. */
        dp_netdev_input(pmd, &batch, port_no);
        cycles_count_end(pmd, PMD_CYCLES_PROCESSING);
    }
    ...
}
Instances of netdev_class include NETDEV_DPDK_CLASS, NETDEV_DUMMY_CLASS, NETDEV_BSD_CLASS, and NETDEV_LINUX_CLASS.
netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch)
{
    struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq);
    struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
    struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
    int nb_rx;
    int dropped = 0;

    if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
        return EAGAIN;
    }

    /* Call the DPDK API rte_eth_rx_burst to receive up to 32 packets at once. */
    nb_rx = rte_eth_rx_burst(rx->port_id, rxq->queue_id,
                             (struct rte_mbuf **) batch->packets,
                             NETDEV_MAX_BURST);
    if (!nb_rx) {
        return EAGAIN;
    }

    /* If a policer is configured, run netdev_dpdk_policer_pkt_handle on
     * each dp_packet in the batch; the result is the packet count left
     * after metering. */
    if (policer) {
        dropped = nb_rx;
        nb_rx = ingress_policer_run(policer,
                                    (struct rte_mbuf **) batch->packets,
                                    nb_rx);
        dropped -= nb_rx;
    }

    /* Update stats to reflect dropped packets. */
    if (OVS_UNLIKELY(dropped)) {
        rte_spinlock_lock(&dev->stats_lock);
        dev->stats.rx_dropped += dropped;
        rte_spinlock_unlock(&dev->stats_lock);
    }

    batch->count = nb_rx;

    return 0;
}
After a packet enters OVS-DPDK from a physical or virtual interface, a unique identifier or hash is derived from its header fields. This identifier is matched against entries in three lookup tables: the exact match cache (EMC), the datapath classifier (dpcls), and the ofproto classifier. The packet traverses these tables in order until it matches an entry, at which point it executes all the actions that entry prescribes and is then forwarded.
The EMC provides fast processing for a limited number of entries; a packet's identifier must exactly match an entry on the IP 5-tuple. On an EMC miss, the packet goes to the dpcls, which maintains many more entries across multiple subtables and can match identifiers against wildcards. When a packet matches in the dpcls, the flow entry is installed into the EMC, so subsequent packets with the same identifier are processed quickly. If the dpcls also misses, the packet goes to the ofproto classifier to be handled according to the OpenFlow controller; a match there is likewise distributed down to the faster tables, so later packets of the same flow are processed quickly. (Translated from: https://software.intel.com/en-us/articles/open-vswitch-with-dpdk-overview)
Note: the EMC is scoped per PMD thread (each PMD has its own EMC); the dpcls is scoped per port (each port has its own dpcls); the ofproto classifier is scoped per bridge (each bridge has its own ofproto classifier).
dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
                  struct dp_packet_batch *packets,
                  bool md_is_valid, odp_port_t port_no)
{
    int cnt = packets->count;
#if !defined(__CHECKER__) && !defined(_WIN32)
    const size_t PKT_ARRAY_SIZE = cnt;
#else
    /* Sparse or MSVC doesn't like variable length array. */
    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
#endif
    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct netdev_flow_key keys[PKT_ARRAY_SIZE];
    struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
    long long now = time_msec();
    size_t newcnt, n_batches, i;
    odp_port_t in_port;

    n_batches = 0;
    /* Run every packet in the dp_packet_batch through the EMC
     * (pmd->flow_cache) and return the number of packets that must go to
     * fast_path_processing. If md_is_valid is false, this also
     * initializes the metadata from port_no. */
    newcnt = emc_processing(pmd, packets, keys, batches, &n_batches,
                            md_is_valid, port_no);
    if (OVS_UNLIKELY(newcnt)) {
        packets->count = newcnt;
        /* Get ingress port from first packet's metadata. */
        in_port = packets->packets[0]->md.in_port.odp_port;
        fast_path_processing(pmd, packets, keys, batches, &n_batches,
                             in_port, now);
    }

    /* All the flow batches need to be reset before any call to
     * packet_batch_per_flow_execute() as it could potentially trigger
     * recirculation. When a packet matching flow 'j' happens to be
     * recirculated, the nested call to dp_netdev_input__() could potentially
     * classify the packet as matching another flow - say 'k'. It could happen
     * that in the previous call to dp_netdev_input__() that same flow 'k' had
     * already its own batches[k] still waiting to be served. So if its
     * 'batch' member is not reset, the recirculated packet would be wrongly
     * appended to batches[k] of the 1st call to dp_netdev_input__(). */
    for (i = 0; i < n_batches; i++) {
        batches[i].flow->batch = NULL;
    }

    for (i = 0; i < n_batches; i++) {
        packet_batch_per_flow_execute(&batches[i], pmd, now);
    }
}
Original article: https://blog.csdn.net/qq_15437629/article/details/80873307