A Detailed Look at the RFS Feature in Linux Kernel 2.6.35

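Earlier I introduced RPS, Google's patch to the kernel network stack, whose main purpose is to balance softirq load across CPUs. This time I continue with RFS (Receive Flow Steering), Google's enhancement patch for RPS. RPS maps softirq processing onto a chosen CPU, but another performance problem remains: if the CPU running the application is not the CPU handling the softirq, the CPU cache suffers badly. Note that both patches have already been merged in kernel 2.6.35.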

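OK, first let me describe how it works. The patch is actually quite simple: relative to RPS it just adds one more CPU choice, that is, the CPU on which the softirq is processed is chosen according to the CPU the application runs on. When recvmsg is called, the application's CPU is stored in a hash table whose index is computed from the socket's rxhash. This rxhash is the skb hash value computed by RPS.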

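There is a problem here, though: when multiple threads or processes read from the same socket, the recorded CPU id keeps changing, which leads to a large number of out-of-order (OOO) packets. Because the CPU id changes, the softirq processing below keeps switching between CPUs, and that produces many reordered packets.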

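So how does RFS solve this? It uses two tables, rps_sock_flow_table and rps_dev_flow_table. The first, rps_sock_flow_table, is a global hash table oriented around sockets. It maps a socket to a CPU, namely the CPU on which the application wants the softirq to run.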

struct rps_sock_flow_table {
    unsigned int mask;
    // hash table
    u16 ents[0];
};

As you can see, it has two fields: the first is a mask used to compute the index into the hash table, and ents stores the CPU for the corresponding socket.
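
To make the layout concrete, here is a minimal user-space sketch of such a table. It is not kernel code: `sock_flow_table`, `sft_create`, `sft_record`, and `sft_lookup` are hypothetical names, `NO_CPU` stands in for the kernel's `RPS_NO_CPU`, and the entry count is assumed to be a power of two so that `hash & mask` is always a valid index.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NO_CPU 0xffff /* stand-in for the kernel's RPS_NO_CPU */

/* User-space model of rps_sock_flow_table: a mask plus an array of CPU ids. */
struct sock_flow_table {
    unsigned int mask; /* entries - 1; entries is assumed to be a power of two */
    uint16_t ents[];   /* desired CPU for each hash bucket */
};

/* Allocate a table with `entries` buckets, all set to "no CPU recorded". */
static struct sock_flow_table *sft_create(unsigned int entries)
{
    assert(entries != 0 && (entries & (entries - 1)) == 0);

    struct sock_flow_table *t =
        malloc(sizeof(*t) + entries * sizeof(t->ents[0]));
    if (!t)
        return NULL;
    t->mask = entries - 1;
    for (unsigned int i = 0; i < entries; i++)
        t->ents[i] = NO_CPU;
    return t;
}

/* Record the CPU on which the application last read this flow. */
static void sft_record(struct sock_flow_table *t, uint32_t hash, uint16_t cpu)
{
    if (t && hash)
        t->ents[hash & t->mask] = cpu;
}

/* Look up the CPU the application would like this flow processed on. */
static uint16_t sft_lookup(const struct sock_flow_table *t, uint32_t hash)
{
    return t->ents[hash & t->mask];
}
```

Because only the low bits of the hash select a bucket, two different flows can share a slot in a small table. The entry is therefore only a hint, which matches the "we only give a hint" comment in the kernel code.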

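Next is rps_dev_flow_table, which is per device: every device receive queue has an rps_dev_flow_table (this table mainly records the CPU that last processed an skb of the same connection). Each element of this hash table contains a CPU id and a tail-queue counter. That counter is the key value: it is what solves the OOO problem described above, by saving the tail count of the backlog queue on which the flow's packets were last placed. We will analyze it in detail next.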

struct netdev_rx_queue {
    struct rps_map *rps_map;
    // each device queue holds an rps_dev_flow_table
    struct rps_dev_flow_table *rps_flow_table;
    struct kobject kobj;
    struct netdev_rx_queue *first;
    atomic_t count;
} ____cacheline_aligned_in_smp;

struct rps_dev_flow_table {
    unsigned int mask;
    struct rcu_head rcu;
    struct work_struct free_work;
    // hash table
    struct rps_dev_flow flows[0];
};

struct rps_dev_flow {
    u16 cpu;
    u16 fill;
    // tail counter
    unsigned int last_qtail;
};

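First, as we know, the large number of OOO packets arises because multiple processes read from the same socket at the same time, so the CPU id associated with that socket keeps switching. If the softirq side did nothing about it and simply dispatched processing to whichever CPU was recorded, packets that arrived in order would be distributed to different CPUs, and since this is SMP, many would end up out of order. RFS solves it as follows: two fields, input_queue_head and input_queue_tail, are added to softnet_data; an rps_flow_table is added to the device queue; and each rps_dev_flow element of rps_flow_table contains a last_qtail. RFS uses these three fields to control packet reordering.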

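Why are three values needed? The per-CPU counter input_queue_tail only ever increases, whereas the skbs belonging to a flow table entry of a device queue may be steered to another CPU. The last_qtail of a dev flow table entry records the tail count of the backlog queue that holds the packets this flow still needs processed. In other words, once input_queue_head is greater than or equal to last_qtail, the flow may switch CPUs; until then it must not switch to the CPU the process wants.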
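
The comparison itself can be sketched as a small predicate. `flow_drained` is a hypothetical name, not a kernel function. Both counters are unsigned and eventually wrap around, so the sketch compares them through a signed difference, the same form the kernel's get_rps_cpu uses.

```c
#include <stdbool.h>

/*
 * Returns true when the old CPU's input_queue_head has reached (or passed)
 * the flow's last_qtail, i.e. nothing of this flow is still waiting in that
 * CPU's backlog. The signed difference keeps the test correct after the
 * unsigned counters wrap around.
 */
static bool flow_drained(unsigned int input_queue_head, unsigned int last_qtail)
{
    return (int)(input_queue_head - last_qtail) >= 0;
}
```

A plain `input_queue_head >= last_qtail` would give the wrong answer right after a wrap, for example when last_qtail is close to the maximum unsigned value and the head counter has already wrapped to a small number.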

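Note, however, that it is best to bind the process to a specific CPU (together with the RPS and RFS parameter settings), since RFS and RPS are then more effective. That is why I think something like Erlang should gain a great deal of performance under RFS and RPS.
The softnet_data structure is shown below.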

struct softnet_data {
    struct Qdisc        *output_queue;
    struct Qdisc        **output_queue_tailp;
    struct list_head    poll_list;
    struct sk_buff      *completion_queue;
    struct sk_buff_head process_queue;

    /* stats */
    unsigned int        processed;
    unsigned int        time_squeeze;
    unsigned int        cpu_collision;
    unsigned int        received_rps;

#ifdef CONFIG_RPS
    struct softnet_data *rps_ipi_list;

    /* Elements below can be accessed between CPUs for RPS */
    struct call_single_data csd ____cacheline_aligned_in_smp;
    struct softnet_data *rps_ipi_next;
    unsigned int        cpu;
    // the two most important fields
    unsigned int        input_queue_head;
    unsigned int        input_queue_tail;
#endif
    unsigned            dropped;
    struct sk_buff_head input_pkt_queue;
    struct napi_struct  backlog;
};

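Now let's look at the code to see how the kernel implements this. Start with inet_recvmsg, the function the kernel calls when recvmsg is invoked. It is fairly simple: there is just one added line, a call to sock_rps_record_flow, which stores this socket and the current CPU in the rps_sock_flow_table hash table.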

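One thing to mention first: both flow tables are initialized from the sys (sysctl and sysfs) code. I won't analyze that part, because the actual logic and principles are implemented in the protocol stack.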

int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
         size_t size, int flags)
{
    struct sock *sk = sock->sk;
    int addr_len = 0;
    int err;

    // update the hash table
    sock_rps_record_flow(sk);

    err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                   flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}

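Next is sock_rps_record_flow. It fetches the global rps_sock_flow_table and then calls rps_record_sock_flow to update it, passing the socket's sk_rxhash as the hash index. sk_rxhash is simply the rxhash from the skb, which is the hash value set by RPS, computed over the 4-tuple. Using it as the index ensures that the same socket always falls into the same slot, and it also makes the table easy to access from the softirq context described below.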

struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;

static inline void sock_rps_record_flow(const struct sock *sk)
{
#ifdef CONFIG_RPS
    struct rps_sock_flow_table *sock_flow_table;

    rcu_read_lock();
    sock_flow_table = rcu_dereference(rps_sock_flow_table);
    // update the hash table
    rps_record_sock_flow(sock_flow_table, sk->sk_rxhash);
    rcu_read_unlock();
#endif
}

In fact, all the work is done in rps_record_sock_flow.

static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
                    u32 hash)
{
    if (table && hash) {
        // compute the index
        unsigned int cpu, index = hash & table->mask;

        /* We only give a hint, preemption can change cpu under us */
        // get the current cpu
        cpu = raw_smp_processor_id();
        // if the entry already holds the current cpu, it was set earlier
        if (table->ents[index] != cpu)
            // otherwise store the cpu
            table->ents[index] = cpu;
    }
}

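Everything above happens in process context: it records the CPU the process wants, using rps_sock_flow_table. Next comes the softirq context, where the RFS patch does most of its work. Before reading this code it helps to understand the RPS patch, because RFS is only a small change on top of RPS.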

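There are two main functions. The first is enqueue_to_backlog, which as we know puts an skb on the input queue of the target CPU. Here we only care about one call in it, input_queue_tail_incr_save, which increments softnet_data's input_queue_tail and saves the new value into the dev flow's last_qtail through the qtail pointer.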

        if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
            __skb_queue_tail(&sd->input_pkt_queue, skb);
            // increments sd->input_queue_tail and saves the new value
            // into the dev flow's last_qtail (via qtail)
            input_queue_tail_incr_save(sd, qtail);
            rps_unlock(sd);
            local_irq_restore(flags);
            return NET_RX_SUCCESS;
        }
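
The bookkeeping around input_queue_tail_incr_save can be modelled in user space roughly as follows. This is a sketch, not the kernel source: `sim_softnet`, `sim_flow`, `sim_enqueue`, and `sim_dequeue` are hypothetical names, and the assumption that the head counter advances once for each packet taken off the backlog is my reading of the 2.6.35 code rather than something shown in this article.

```c
/* Per-CPU counters, modelled on the two RFS fields of softnet_data. */
struct sim_softnet {
    unsigned int input_queue_head; /* packets taken off this CPU's backlog */
    unsigned int input_queue_tail; /* packets ever put on this CPU's backlog */
};

/* Per-flow state, modelled on rps_dev_flow. */
struct sim_flow {
    unsigned int last_qtail; /* tail value right after this flow's last enqueue */
};

/* Enqueue one packet of `flow` on `sd`: bump the tail, save it in the flow. */
static void sim_enqueue(struct sim_softnet *sd, struct sim_flow *flow)
{
    flow->last_qtail = ++sd->input_queue_tail;
}

/* The CPU owning `sd` has taken one packet off its backlog (assumption). */
static void sim_dequeue(struct sim_softnet *sd)
{
    sd->input_queue_head++;
}
```

If flows A, B, and A again are enqueued on an empty CPU, A ends up with last_qtail 3 and B with 2. B may be moved once the head counter reaches 2, while A has to wait until it reaches 3.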

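The second is get_rps_cpu, which as we know returns the CPU on which the softirq should run. Here we look only at the part added by RFS. It first fetches two flow tables: sock_flow_table and the device's rps_flow_table (the flow table of the device queue the skb arrived on). It then takes two CPUs: tcpu, the CPU this flow's packets were last steered to, and next_cpu, the CPU the application wants. It compares them, and if they differ and (1) tcpu is unset (equal to RPS_NO_CPU), or (2) tcpu is offline, or (3) tcpu's input_queue_head is greater than or equal to the last_qtail in rps_flow_table, the skb is steered to next_cpu.
Condition (3) means that the packets this flow previously queued on tcpu have all been processed. If it does not hold, tcpu continues to be used in order to avoid reordering.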

got_hash:
    flow_table = rcu_dereference(rxqueue->rps_flow_table);
    sock_flow_table = rcu_dereference(rps_sock_flow_table);
    if (flow_table && sock_flow_table) {
        u16 next_cpu;
        struct rps_dev_flow *rflow;

        // look up this flow's entry in the device flow table
        rflow = &flow_table->flows[skb->rxhash & flow_table->mask];
        tcpu = rflow->cpu;
        // get next_cpu, the cpu the application wants
        next_cpu = sock_flow_table->ents[skb->rxhash &
            sock_flow_table->mask];

        // conditions for switching cpu
        if (unlikely(tcpu != next_cpu) &&
            (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
             ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
              rflow->last_qtail)) >= 0)) {
            // set tcpu
            tcpu = rflow->cpu = next_cpu;
            if (tcpu != RPS_NO_CPU)
                // update last_qtail
                rflow->last_qtail = per_cpu(softnet_data,
                    tcpu).input_queue_head;
        }
        if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
            *rflowp = rflow;
            // set the returned cpu so the softirq can be rescheduled there
            cpu = tcpu;
            goto done;
        }
    }
....................................
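
The decision made by this fragment can be replayed in user space with a simplified model. This is only a sketch under stated assumptions: `flow_state` and `rfs_pick_cpu` are hypothetical names, `NO_CPU` stands in for `RPS_NO_CPU`, and the `head_of` and `online` arrays replace `per_cpu(softnet_data, cpu).input_queue_head` and `cpu_online()`.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffff /* stand-in for the kernel's RPS_NO_CPU */

/* Per-flow state, modelled on rps_dev_flow. */
struct flow_state {
    uint16_t cpu;            /* CPU that last handled this flow */
    unsigned int last_qtail; /* backlog tail recorded at the last enqueue */
};

/*
 * Decide which CPU should process the next packet of a flow.
 *   want    - CPU recorded on the application side (may be NO_CPU)
 *   head_of - input_queue_head of every CPU, indexed by CPU id
 *   online  - whether each CPU is online, indexed by CPU id
 * Returns the chosen CPU, or NO_CPU when the caller should fall back to RPS.
 */
static uint16_t rfs_pick_cpu(struct flow_state *flow, uint16_t want,
                             const unsigned int *head_of, const bool *online)
{
    uint16_t cur = flow->cpu;

    if (cur != want &&
        (cur == NO_CPU || !online[cur] ||
         (int)(head_of[cur] - flow->last_qtail) >= 0)) {
        /* Safe to move: nothing of this flow is still queued on `cur`. */
        cur = flow->cpu = want;
        if (cur != NO_CPU)
            flow->last_qtail = head_of[cur];
    }
    return (cur != NO_CPU && online[cur]) ? cur : NO_CPU;
}
```

With this model a flow sticks to its old CPU while a packet is still pending there, even if the application has already moved, and follows the application only after the old CPU's head counter has caught up with the flow's last_qtail.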

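Finally, let's look at what happens when the first packet reaches the protocol stack before the application has called recvmsg to read it. On that first pass tcpu is RPS_NO_CPU and next_cpu is RPS_NO_CPU as well, so the RFS handling is skipped and the plain RPS handling is used instead, which is the code immediately following the fragment above. That code was already analyzed in the RPS article, so I won't go through it again here.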

map = rcu_dereference(rxqueue->rps_map);
if (map) {
    tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];

    if (cpu_online(tcpu)) {
        cpu = tcpu;
        goto done;
    }
}
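
The index expression in this fallback path is worth a closer look: it scales a 32-bit hash into the range [0, len) with a multiply and a shift instead of a modulo. A minimal sketch, with `rps_map_index` as a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Map a 32-bit hash onto an index in [0, len) without a division:
 * treat hash / 2^32 as a fraction in [0, 1) and scale it by len.
 */
static unsigned int rps_map_index(uint32_t hash, unsigned int len)
{
    return (unsigned int)(((uint64_t)hash * len) >> 32);
}
```

For a map of four CPUs, hashes in the lowest quarter of the 32-bit range select entry 0, the next quarter selects entry 1, and so on, so a well-distributed hash spreads flows evenly over the map.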

This article is a repost; the original is at http://www.pagefault.info/?p=115.
