[Open vSwitch Source Code Analysis, Part 6] Kernel-Space Forwarding Plane: Data Structures and Flow

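Quite a few people have already written fairly detailed analyses of kernel-space packet processing, for example this SDNLAB article (http://www.sdnlab.com/15713.html); this post only summarizes them. Overall, the kernel processes a packet in three major steps: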

Extracting the packet headers
Matching a flow table entry
Executing the actions

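On header extraction: unlike traditional routers and switches, OpenFlow matches on fields from L2 through L4, so OVS defines a data structure, sw_flow_key, to hold the extracted fields.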

struct sw_flow_key {
    u8 tun_opts[255];
    u8 tun_opts_len;
    struct ip_tunnel_key tun_key;  /* Encapsulating tunnel key. */
    struct {
        u32 priority;   /* Packet QoS priority. */
        u32 skb_mark;   /* SKB mark. */
        u16 in_port;    /* Input switch port (or DP_MAX_PORTS). */
    } __packed phy; /* Safe when right after 'tun_key'. */
    u8 tun_proto;                   /* Protocol of encapsulating tunnel. */
    u32 ovs_flow_hash;      /* Datapath computed hash value.  */
    u32 recirc_id;          /* Recirculation ID.  */
    struct {
        u8     src[ETH_ALEN];   /* Ethernet source address. */
        u8     dst[ETH_ALEN];   /* Ethernet destination address. */
        __be16 tci;     /* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
        __be16 type;        /* Ethernet frame type. */
    } eth;
    union {
        struct {
            __be32 top_lse; /* top label stack entry */
        } mpls;
        struct {
            u8     proto;   /* IP protocol or lower 8 bits of ARP opcode. */
            u8     tos;     /* IP ToS. */
            u8     ttl;     /* IP TTL/hop limit. */
            u8     frag;    /* One of OVS_FRAG_TYPE_*. */
        } ip;
    };
    struct {
        __be16 src;     /* TCP/UDP/SCTP source port. */
        __be16 dst;     /* TCP/UDP/SCTP destination port. */
        __be16 flags;       /* TCP flags. */
    } tp;
    union {
        struct {
            struct {
                __be32 src; /* IP source address. */
                __be32 dst; /* IP destination address. */
            } addr;
            struct {
                u8 sha[ETH_ALEN];   /* ARP source hardware address. */
                u8 tha[ETH_ALEN];   /* ARP target hardware address. */
            } arp;
        } ipv4;
        struct {
            struct {
                struct in6_addr src;    /* IPv6 source address. */
                struct in6_addr dst;    /* IPv6 destination address. */
            } addr;
            __be32 label;           /* IPv6 flow label. */
            struct {
                struct in6_addr target; /* ND target address. */
                u8 sll[ETH_ALEN];   /* ND source link layer address. */
                u8 tll[ETH_ALEN];   /* ND target link layer address. */
            } nd;
        } ipv6;
    };
    struct {
        /* Connection tracking fields. */
        u16 zone;
        u32 mark;
        u8 state;
        struct ovs_key_ct_labels labels;
    } ct;

} __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */
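
To make the extraction step concrete, here is a minimal userspace sketch that fills a much smaller key from an untagged Ethernet/IPv4 frame. Everything prefixed demo_ is hypothetical and is not OVS code; unlike sw_flow_key, this sketch stores multi-byte fields in host byte order, and it ignores VLAN, IP options, and L4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN 6
#define DEMO_ETH_HLEN 14
#define DEMO_ETH_P_IP 0x0800

/* Hypothetical, heavily reduced counterpart of struct sw_flow_key. */
struct demo_flow_key {
    uint8_t  eth_src[DEMO_ETH_ALEN];
    uint8_t  eth_dst[DEMO_ETH_ALEN];
    uint16_t eth_type;   /* host byte order in this sketch */
    uint8_t  ip_proto;
    uint8_t  ip_ttl;
    uint32_t ipv4_src;   /* host byte order in this sketch */
    uint32_t ipv4_dst;
};

static uint16_t demo_get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t demo_get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Fill 'key' from an untagged frame; returns 0 on success, -1 if truncated. */
static int demo_key_extract(const uint8_t *frame, size_t len,
                            struct demo_flow_key *key)
{
    const uint8_t *ip;

    assert(frame != NULL && key != NULL);
    /* Zero first, so fields that do not apply compare equal between keys. */
    memset(key, 0, sizeof(*key));
    if (len < DEMO_ETH_HLEN)
        return -1;

    /* On the wire the destination MAC comes first, then the source. */
    memcpy(key->eth_dst, frame, DEMO_ETH_ALEN);
    memcpy(key->eth_src, frame + DEMO_ETH_ALEN, DEMO_ETH_ALEN);
    key->eth_type = demo_get_be16(frame + 12);
    if (key->eth_type != DEMO_ETH_P_IP)
        return 0;   /* L2-only key */

    if (len < DEMO_ETH_HLEN + 20)
        return -1;  /* shorter than a minimal IPv4 header */
    ip = frame + DEMO_ETH_HLEN;
    key->ip_ttl = ip[8];
    key->ip_proto = ip[9];
    key->ipv4_src = demo_get_be32(ip + 12);
    key->ipv4_dst = demo_get_be32(ip + 16);
    return 0;
}
```

Zeroing the whole key before filling it matters for the later steps: the lookup compares keys as raw memory, so any field left uninitialized would make otherwise identical packets look different.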

struct sw_flow {
    struct rcu_head rcu;
    struct {
        struct hlist_node node[2];
        u32 hash;
    } flow_table, ufid_table;
    int stats_last_writer;      /* NUMA-node id of the last writer on
                     * 'stats[0]'.
                     */
    struct sw_flow_key key;
    struct sw_flow_id id;
    struct sw_flow_mask *mask;
    struct sw_flow_actions __rcu *sf_acts;
    struct flow_stats __rcu *stats[]; /* One for each NUMA node.  First one
                       * is allocated at flow creation time,
                       * the rest are allocated on demand
                       * while holding the 'stats[0].lock'.
                       */
};

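The data structures involved and their details are shown above (sw_flow_key holds the extracted fields; sw_flow is one flow table entry). The concrete lookup steps are as follows: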

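Step 1: The lookup walks the mask_list member of the bridge's flow table structure (table). mask_list is the head node of a linked list whose elements are masks (the list member inside each mask is a link structure that carries no data; it exists only to chain the mask structures together). The flow lookup function begins by looping over this list, and for each mask it obtains it calls the function that performs step 2.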
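
The "link with no data, embedded in the mask" idea can be sketched in plain C. This is a simplified, singly linked stand-in for the kernel's circular struct list_head; struct demo_mask and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Link only, no payload: plays the role of the kernel's struct list_head. */
struct list_node {
    struct list_node *next;
};

/* Hypothetical mask; the real sw_flow_mask carries a key and a range. */
struct demo_mask {
    int id;                  /* illustrative payload */
    struct list_node list;   /* embedded link, like the mask's 'list' member */
};

/* Recover the enclosing structure from a pointer to its embedded link. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Append 'm' to the list whose dummy head node is 'head'. */
static void demo_mask_add(struct list_node *head, struct demo_mask *m)
{
    struct list_node *pos = head;

    assert(head != NULL && m != NULL);
    while (pos->next != NULL)
        pos = pos->next;
    m->list.next = NULL;
    pos->next = &m->list;
}

/* Walk every mask on the list; returns the count and sums the ids. */
static int demo_count_masks(struct list_node *head, int *id_sum)
{
    int count = 0;

    assert(head != NULL && id_sum != NULL);
    for (struct list_node *pos = head->next; pos != NULL; pos = pos->next) {
        struct demo_mask *m = demo_container_of(pos, struct demo_mask, list);
        *id_sum += m->id;   /* the real loop would run step 2 on each mask */
        count++;
    }
    return count;
}
```

The head node itself holds no mask, which is why the walk starts at head->next.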

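Step 2: Mask the key. A function takes the key extracted from the packet and the key stored in the mask obtained in step 1, ANDs them together, and stores the result in a separate key (masked_key). Execution then continues with step 3.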
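
The AND in step 2 can be sketched as follows. struct demo_key is a hypothetical, much smaller stand-in for sw_flow_key, and this version ANDs byte by byte over the whole key; as far as I can tell the kernel works a long at a time and only over the byte range the mask covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced key; 16 bytes with no padding. */
struct demo_key {
    uint32_t in_port;
    uint32_t ipv4_src;
    uint32_t ipv4_dst;
    uint16_t tp_src;
    uint16_t tp_dst;
};

/* dst = src & mask, applied to every byte of the key. */
static void demo_mask_key(struct demo_key *dst, const struct demo_key *src,
                          const struct demo_key *mask)
{
    const unsigned char *s = (const unsigned char *)src;
    const unsigned char *m = (const unsigned char *)mask;
    unsigned char *d = (unsigned char *)dst;

    assert(dst != NULL && src != NULL && mask != NULL);
    for (size_t i = 0; i < sizeof(*dst); i++)
        d[i] = s[i] & m[i];
}
```

A mask field of all ones means "this field must match exactly", all zeros means "wildcard", and a partial value such as 0xffffff00 on an IPv4 address gives a prefix match.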

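Step 3: The masked key from step 2 (masked_key) is passed to jhash2(). This is a classic hash function that the Linux kernel uses in many places; readers who want to dig deeper can look it up (it is mostly mathematical derivation and fairly hard going). The function turns masked_key into a hash value.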
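
jhash2() lives in the kernel and is not reproduced here. To show what step 3 achieves, the sketch below uses 32-bit FNV-1a as a stand-in; it is a different algorithm, chosen only because it is short, and demo_hash_bytes is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a over a byte range. A stand-in for jhash2(), NOT the same
 * algorithm: it only illustrates "key bytes in, 32-bit hash out". */
static uint32_t demo_hash_bytes(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;      /* FNV offset basis */

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;            /* FNV prime */
    }
    return h;
}
```

The property the lookup relies on is simply that equal masked keys always produce equal hashes, so a packet and the flow it should match land in the same bucket.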

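Step 4: The hash from step 3 is passed to find_bucket(), which hashes it once more with jhash_1word() to produce a new hash value. jhash_1word() is similar to the function in step 3 but takes different parameters: a single 32-bit word plus a seed, rather than an array of words. After these two rounds of hashing we have the final hash value.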
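
A sketch of what find_bucket() computes, assuming (as in the kernel sources I have read) that the second hash is seeded per table and then reduced with a power-of-two mask. demo_hash_1word is a hypothetical integer mixer standing in for jhash_1word(); it is not the kernel's algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for jhash_1word(): mixes one 32-bit word with a seed. */
static uint32_t demo_hash_1word(uint32_t word, uint32_t seed)
{
    uint32_t h = word ^ seed;

    h ^= h >> 16;
    h *= 0x7feb352du;
    h ^= h >> 15;
    h *= 0x846ca68bu;
    h ^= h >> 16;
    return h;
}

/* Map a flow hash to a bucket index; n_buckets must be a power of two. */
static uint32_t demo_bucket_index(uint32_t hash, uint32_t seed,
                                  uint32_t n_buckets)
{
    assert(n_buckets != 0 && (n_buckets & (n_buckets - 1)) == 0);
    return demo_hash_1word(hash, seed) & (n_buckets - 1);
}
```

Rehashing with a per-table seed means the same flows spread differently each time a table is rebuilt, and the mask works as a cheap modulo only because the bucket count is a power of two.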

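Step 5: The hash from step 4 (reduced to a bucket index) is passed to flex_array_get(), whose job is to locate the corresponding hash head. See the figure above for details: the flow table structure (table) has a buckets member, called the hash buckets. It holds some bookkeeping fields plus a flexible array parts[n], and parts[n] stores the hash head pointers we are looking for. Each hash head points to a linked list of flow entries (struct sw_flow, at the bottom of the figure), and those are the entries we will match against. (I still have some doubts about the path from the hash buckets to the flexible array and will look into it in the next post; if anything here disagrees with the source code, trust the source code.) In short, this step uses the hash to find the head pointer of a flow entry list.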
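
The two-level layout behind flex_array_get() can be sketched like this. It is a simplified userspace model, not the kernel's struct flex_array: all parts are allocated up front, parts hold a fixed tiny number of elements instead of a page's worth, and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_ELEMS_PER_PART 4   /* tiny on purpose for the example */

struct demo_flow;               /* opaque here; only pointers are stored */

struct demo_bucket_head {
    struct demo_flow *first;    /* head of one chain of flow entries */
};

struct demo_flex_array {
    size_t total;                     /* number of bucket heads */
    size_t n_parts;
    struct demo_bucket_head **parts;  /* each part is a small fixed array */
};

static void demo_flex_free(struct demo_flex_array *fa)
{
    if (fa == NULL)
        return;
    if (fa->parts != NULL)
        for (size_t i = 0; i < fa->n_parts; i++)
            free(fa->parts[i]);
    free(fa->parts);
    free(fa);
}

static struct demo_flex_array *demo_flex_alloc(size_t total)
{
    struct demo_flex_array *fa;

    assert(total > 0);
    fa = calloc(1, sizeof(*fa));
    if (fa == NULL)
        return NULL;
    fa->total = total;
    fa->n_parts = (total + DEMO_ELEMS_PER_PART - 1) / DEMO_ELEMS_PER_PART;
    fa->parts = calloc(fa->n_parts, sizeof(*fa->parts));
    if (fa->parts == NULL) {
        demo_flex_free(fa);
        return NULL;
    }
    for (size_t i = 0; i < fa->n_parts; i++) {
        fa->parts[i] = calloc(DEMO_ELEMS_PER_PART, sizeof(**fa->parts));
        if (fa->parts[i] == NULL) {
            demo_flex_free(fa);
            return NULL;
        }
    }
    return fa;
}

/* Return the bucket head at 'idx', or NULL if idx is out of range. */
static struct demo_bucket_head *demo_flex_get(struct demo_flex_array *fa,
                                              size_t idx)
{
    assert(fa != NULL);
    if (idx >= fa->total)
        return NULL;
    return &fa->parts[idx / DEMO_ELEMS_PER_PART][idx % DEMO_ELEMS_PER_PART];
}
```

Splitting the array into parts lets a large bucket table be built from many small allocations rather than one large contiguous one.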

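Step 6: Starting from the list head pointer found in step 5, walk every flow entry on the list (each one is a struct sw_flow). For each entry, first compare its mask member with the mask obtained in step 1, then compare its key member with the masked key from step 2 (masked_key). Only when both comparisons are equal is this the flow entry we are looking for, and its address is returned immediately.

If a flow is found, the userspace flow has already been installed in the kernel and the packet can take the fast path: ovs_execute_actions is called directly to execute the actions associated with this key.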
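
Steps 1 through 6 fit together roughly as in the sketch below. It is a toy model under stated assumptions: a three-field key, a fixed array of buckets instead of a flex array, a single FNV-style hash in place of jhash2() plus jhash_1word(), and hypothetical demo_ names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_N_BUCKETS 8u   /* must be a power of two */

struct demo_key { uint32_t in_port, ip_src, ip_dst; };  /* no padding */

struct demo_mask {
    struct demo_key key;      /* 1 bits mark the bits that must match */
    struct demo_mask *next;   /* next mask on the table's mask list */
};

struct demo_flow {
    struct demo_key key;      /* stored already masked */
    struct demo_mask *mask;
    int action;               /* illustrative: an output port */
    struct demo_flow *next;   /* hash-chain link */
};

struct demo_table {
    struct demo_mask *mask_list;
    struct demo_flow *buckets[DEMO_N_BUCKETS];
};

static void demo_mask_key(struct demo_key *dst, const struct demo_key *src,
                          const struct demo_key *mask)
{
    dst->in_port = src->in_port & mask->in_port;
    dst->ip_src  = src->ip_src  & mask->ip_src;
    dst->ip_dst  = src->ip_dst  & mask->ip_dst;
}

/* Stand-in for the two kernel hash rounds; any deterministic hash works. */
static uint32_t demo_bucket_of(const struct demo_key *masked)
{
    const uint32_t w[3] = { masked->in_port, masked->ip_src, masked->ip_dst };
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < 3; i++) {
        h ^= w[i];
        h *= 16777619u;
    }
    return h & (DEMO_N_BUCKETS - 1);
}

/* Insert a flow; the caller must already have put 'mask' on t->mask_list. */
static void demo_flow_insert(struct demo_table *t, struct demo_flow *f,
                             const struct demo_key *key,
                             struct demo_mask *mask, int action)
{
    uint32_t b;

    assert(t != NULL && f != NULL && key != NULL && mask != NULL);
    demo_mask_key(&f->key, key, &mask->key);
    f->mask = mask;
    f->action = action;
    b = demo_bucket_of(&f->key);
    f->next = t->buckets[b];
    t->buckets[b] = f;
}

/* Try each mask, hash the masked key, then scan that bucket's chain. */
static struct demo_flow *demo_flow_lookup(const struct demo_table *t,
                                          const struct demo_key *pkt)
{
    for (struct demo_mask *m = t->mask_list; m != NULL; m = m->next) {
        struct demo_key masked;

        demo_mask_key(&masked, pkt, &m->key);
        for (struct demo_flow *f = t->buckets[demo_bucket_of(&masked)];
             f != NULL; f = f->next) {
            if (f->mask == m &&
                memcmp(&f->key, &masked, sizeof(masked)) == 0)
                return f;
        }
    }
    return NULL;   /* miss: this is where the kernel would upcall */
}
```

Comparing the mask pointer first is cheap and rules out entries that happen to share a bucket but were installed under a different mask; the cost of the whole lookup grows with the number of masks, since each one triggers its own hash and chain scan.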

If no flow is found, the only option is to call ovs_dp_upcall and let userspace look up the flow table. That in turn calls static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb, const struct sw_flow_key *key, const struct dp_upcall_info *upcall_info)

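That function calls err = genlmsg_unicast(ovs_dp_get_net(dp), user_skb, upcall_info->portid); to send the message to userspace over netlink. In userspace a thread listens for these messages, and as soon as one arrives udpif_upcall_handler is triggered.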
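
The hit/miss split can be modeled in a few lines. This is only a toy: an exact-match cache stands in for the kernel flow table, a small array stands in for the netlink socket, and the handler that installs a flow for each miss is my assumption about what userspace does after an upcall. All demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_SIZE 4
#define DEMO_QUEUE_SIZE 8

struct demo_entry { uint32_t key; int action; int used; };

struct demo_datapath {
    struct demo_entry cache[DEMO_CACHE_SIZE];  /* stands in for the flow table */
    uint32_t upcall_queue[DEMO_QUEUE_SIZE];    /* stands in for netlink */
    size_t n_upcalls;
    int n_executed;
};

static const struct demo_entry *demo_lookup(const struct demo_datapath *dp,
                                            uint32_t key)
{
    for (size_t i = 0; i < DEMO_CACHE_SIZE; i++)
        if (dp->cache[i].used && dp->cache[i].key == key)
            return &dp->cache[i];
    return NULL;
}

/* Returns 1 for a fast-path hit, 0 if queued for userspace, -1 if full. */
static int demo_process_packet(struct demo_datapath *dp, uint32_t key)
{
    assert(dp != NULL);
    if (demo_lookup(dp, key) != NULL) {
        dp->n_executed++;   /* real code would run the flow's actions here */
        return 1;
    }
    if (dp->n_upcalls == DEMO_QUEUE_SIZE)
        return -1;
    dp->upcall_queue[dp->n_upcalls++] = key;
    return 0;
}

/* Toy userspace handler: install a flow for each queued miss. */
static void demo_handle_upcalls(struct demo_datapath *dp, int action)
{
    assert(dp != NULL);
    for (size_t i = 0; i < dp->n_upcalls; i++) {
        uint32_t key = dp->upcall_queue[i];

        if (demo_lookup(dp, key) != NULL)
            continue;       /* already installed by an earlier miss */
        for (size_t j = 0; j < DEMO_CACHE_SIZE; j++) {
            if (!dp->cache[j].used) {
                dp->cache[j] = (struct demo_entry){ key, action, 1 };
                break;
            }
        }
    }
    dp->n_upcalls = 0;
}
```

In this model the first packet of a flow misses and goes to the queue, and once the handler has installed an entry, later packets with the same key are handled on the fast path.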

飛翔的美食家
