OVS startup
vswitchd/ovs-vswitchd.c: main-->netdev_run-->netdev_initialize-->netdev_dpdk_register
-->netdev_register_provider registers dpdk_vhost_user_class.
Adding a DPDK port triggers creation of the PMD threads:
dpif_netdev_port_add-->do_add_port-->dp_netdev_set_pmds_on_numa-->pmd_thread_main
If DPDK ports have already been added, PMD thread creation is also triggered at startup:
dpif_netdev_pmd_set-->dp_netdev_reset_pmd_threads-->dp_netdev_set_pmds_on_numa-->pmd_thread_main
dp_netdev_process_rxq_port receives the packets, then calls dp_netdev_input-->dp_netdev_input__ to perform the flow-table lookup,
and finally calls packet_batch_execute-->dp_netdev_execute_actions to execute the matched actions.
Poll mode thread
pmd_thread_main
- pmd_thread_setaffinity_cpu pins the thread to its lcore.
- An infinite for(;;) loop runs the datapath.
- An inner for loop iterates over the polled ports, calling dp_netdev_process_rxq_port on each.
- Between iterations, port and queue information is reloaded whenever the configuration changes.
dp_netdev_process_rxq_port
- Calls netdev_rxq_recv to receive packets, with cycle counting before and after.
- Calls dp_netdev_input to hand the packets to flow lookup and transmission, again with cycle counting before and after.
- netdev_rxq_recv=>netdev_dpdk_vhost_rxq_recv
- Calls the DPDK API rte_vhost_dequeue_burst to receive packets.
- Calls netdev_dpdk_vhost_update_rx_counters to update statistics.
dp_netdev_input=>dp_netdev_input__
- emc_processing extracts a key from each received packet and looks it up in the exact-match cache; matched packets are appended to their flow's batch, and the number of unmatched packets is returned.
- If there are unmatched packets, fast_path_processing searches the full flow tables; on a hit the flow is installed into the cache, otherwise the packet is sent up to the controller.
- packet_batch_execute applies each matched flow's actions to its batch of packets.
emc_processing
- Calls miniflow_extract to parse each packet into a key.
- Calls emc_lookup to search the hash table and compare the full keys.
- On a match, calls dp_netdev_queue_batches to append the packet to flow->batches; unmatched packets are compacted to the front of the batch.
- Calls dp_netdev_count_packet to count the matched packets.
fast_path_processing
- dpcls_lookup searches the subtables via the classifier. If every packet finds a matching subtable entry, the flow is installed into the cache and the packets are appended to flow->batches; unmatched packets are sent up to the controller.
- Counts matches, misses, and lost packets.
packet_batch_per_flow_execute
- Calls dp_netdev_flow_get_actions to fetch the flow's actions, then dp_netdev_execute_actions to execute them.
Action execution
dp_netdev_execute_actions=>odp_execute_actions
dp_execute_cb
- For OVS_ACTION_ATTR_OUTPUT, calls dp_netdev_lookup_port to find the port, then netdev_send to transmit the packet.
- For OVS_ACTION_ATTR_TUNNEL_PUSH, calls push_tnl_action to add the tunnel header, then dp_netdev_recirculate-->dp_netdev_input__ to re-run the lookup.
- For OVS_ACTION_ATTR_TUNNEL_POP, calls netdev_pop_header to strip the tunnel header, then dp_netdev_recirculate-->dp_netdev_input__ to re-run the lookup.
netdev_send=>netdev_dpdk_vhost_send=>__netdev_dpdk_vhost_send
- Calls the DPDK API rte_vhost_enqueue_burst in a loop to transmit the packets.
- Calls netdev_dpdk_vhost_update_tx_counters to update statistics.
DPDK receive path
- rte_vhost_dequeue_burst involves one data copy.
DPDK transmit path
rte_vhost_enqueue_burst-->virtio_dev_rx
- Prefetches the vhost_virtqueue metadata.
- Places count packets into the ring; this involves one data copy.
- Issues a memory barrier to synchronize with the peer.
- If the peer supports interrupts, notifies it via eventfd_write that packets are available; typically a virtio guest uses interrupts while virtio-user does not.
pmd_thread_main analysis
A PMD thread continuously polls the input ports in its poll list, receiving up to 32 packets at a time per port (NETDEV_MAX_BURST). Each received packet is classified against the active flow rules; the purpose of classification is to find a flow so the packet can be processed appropriately. Packets are grouped by flow, and each group executes that flow's specific actions.
pmd_thread_main is the entry point of OVS's userspace receive path, driven by the polling PMD threads.
pmd_thread_main(void *f_)
{
    struct dp_netdev_pmd_thread *pmd = f_;
    unsigned int lc = 0;
    struct polled_queue *poll_list;
    bool exiting;
    int poll_cnt;
    int i;

    poll_list = NULL;
    ...
    /* Store pmd->poll_list into poll_list and return the number of polled queues. */
    poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
reload:
    emc_cache_init(&pmd->flow_cache);
    ...
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
            dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
                                       poll_list[i].port_no);
        }
        ...
    }

    poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
    exiting = latch_is_set(&pmd->exit_latch); /* if exit_latch is set, the pmd thread terminates */
    /* Signal here to make sure the pmd finishes
     * reloading the updated configuration. */
    dp_netdev_pmd_reload_done(pmd);

    emc_cache_uninit(&pmd->flow_cache);

    if (!exiting) {
        goto reload;
    }

    free(poll_list);
    pmd_free_cached_ports(pmd);
    return NULL;
}
pmd_thread_main calls dp_netdev_process_rxq_port to handle the netdev receive path:
dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
                           struct netdev_rxq *rx,
                           odp_port_t port_no)
{
    struct dp_packet_batch batch;
    int error;

    dp_packet_batch_init(&batch);
    cycles_count_start(pmd);
    /* Call netdev_class->rxq_recv to receive packets from rx into batch. */
    error = netdev_rxq_recv(rx, &batch);
    cycles_count_end(pmd, PMD_CYCLES_POLLING);
    if (!error) {
        *recirc_depth_get() = 0;

        cycles_count_start(pmd);
        /* Hand the packets in batch to the datapath for processing. */
        dp_netdev_input(pmd, &batch, port_no);
        cycles_count_end(pmd, PMD_CYCLES_PROCESSING);
    }
    ...
}
Instances of netdev_class include NETDEV_DPDK_CLASS, NETDEV_DUMMY_CLASS, NETDEV_BSD_CLASS, and NETDEV_LINUX_CLASS.
netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch)
{
    struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq);
    struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
    struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
    int nb_rx;
    int dropped = 0;

    if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
        return EAGAIN;
    }

    /* Call the DPDK API rte_eth_rx_burst to receive up to 32 packets at once. */
    nb_rx = rte_eth_rx_burst(rx->port_id, rxq->queue_id,
                             (struct rte_mbuf **) batch->packets,
                             NETDEV_MAX_BURST);
    if (!nb_rx) {
        return EAGAIN;
    }

    /* If a policer is configured, run netdev_dpdk_policer_pkt_handle on
     * each dp_packet in the batch; the result is the packet count left
     * after metering. */
    if (policer) {
        dropped = nb_rx;
        nb_rx = ingress_policer_run(policer,
                                    (struct rte_mbuf **) batch->packets,
                                    nb_rx);
        dropped -= nb_rx;
    }

    /* Update stats to reflect dropped packets. */
    if (OVS_UNLIKELY(dropped)) {
        rte_spinlock_lock(&dev->stats_lock);
        dev->stats.rx_dropped += dropped;
        rte_spinlock_unlock(&dev->stats_lock);
    }

    batch->count = nb_rx;

    return 0;
}
After a packet enters OVS-DPDK from a physical or virtual interface, a unique identifier or hash is derived from its header fields. This identifier is matched against entries in three lookup tables: the exact match cache (EMC), the datapath classifier (dpcls), and the ofproto classifier. The packet traverses these tables in order until it matches an entry, at which point it executes all the actions that entry prescribes and is then forwarded.
The EMC provides fast processing for a limited number of entries; a packet's identifier must exactly match an entry on the IP 5-tuple. On an EMC miss, the packet goes to the dpcls, which maintains many more entries across multiple subtables and can match identifiers against wildcards. When a packet matches in the dpcls, the flow entry is installed into the EMC, so subsequent packets with the same identifier are processed quickly. If the dpcls also misses, the packet goes to the ofproto classifier to be handled according to the OpenFlow controller; a match there is likewise distributed down to the faster tables, so later packets of the same flow are processed quickly. (Translated from: https://software.intel.com/en-us/articles/open-vswitch-with-dpdk-overview)
Note: the EMC is scoped per PMD thread (each PMD has its own EMC); the dpcls is scoped per port (each port has its own dpcls); the ofproto classifier is scoped per bridge (each bridge has its own ofproto classifier).
dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
                  struct dp_packet_batch *packets,
                  bool md_is_valid, odp_port_t port_no)
{
    int cnt = packets->count;
#if !defined(__CHECKER__) && !defined(_WIN32)
    const size_t PKT_ARRAY_SIZE = cnt;
#else
    /* Sparse or MSVC doesn't like variable length array. */
    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
#endif
    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct netdev_flow_key keys[PKT_ARRAY_SIZE];
    struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
    long long now = time_msec();
    size_t newcnt, n_batches, i;
    odp_port_t in_port;

    n_batches = 0;
    /* Run every packet in the dp_packet_batch through the EMC
     * (pmd->flow_cache) and return the number of packets that must go to
     * fast_path_processing. If md_is_valid is false, this also
     * initializes the metadata from port_no. */
    newcnt = emc_processing(pmd, packets, keys, batches, &n_batches,
                            md_is_valid, port_no);
    if (OVS_UNLIKELY(newcnt)) {
        packets->count = newcnt;
        /* Get ingress port from first packet's metadata. */
        in_port = packets->packets[0]->md.in_port.odp_port;
        fast_path_processing(pmd, packets, keys, batches, &n_batches,
                             in_port, now);
    }

    /* All the flow batches need to be reset before any call to
     * packet_batch_per_flow_execute() as it could potentially trigger
     * recirculation. When a packet matching flow 'j' happens to be
     * recirculated, the nested call to dp_netdev_input__() could potentially
     * classify the packet as matching another flow - say 'k'. It could happen
     * that in the previous call to dp_netdev_input__() that same flow 'k' had
     * already its own batches[k] still waiting to be served. So if its
     * 'batch' member is not reset, the recirculated packet would be wrongly
     * appended to batches[k] of the 1st call to dp_netdev_input__(). */
    for (i = 0; i < n_batches; i++) {
        batches[i].flow->batch = NULL;
    }

    for (i = 0; i < n_batches; i++) {
        packet_batch_per_flow_execute(&batches[i], pmd, now);
    }
}
Original article: https://blog.csdn.net/qq_15437629/article/details/80873307