An Analysis of the Epoll Implementation

Original link: http://blog.chinaunix.net/uid-20792262-id-2909847.html

1. Overview
    Unlike select/poll, epoll is not a single call but a group of system calls.
    int epoll_create(int size);
    int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
    int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
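    The epoll system calls were introduced in Linux 2.5.44. They were designed to address the shortcomings of the traditional select/poll system calls, and the design differs substantially. The drawbacks of select/poll are:
    1. Every call has to read its arguments from user space again.
    2. Every call has to scan all the file descriptors again.
    3. At the start of every call, the current process is put on the wait queue of each file descriptor; at the end of the call, it is removed from every one of those wait queues.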
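    In real applications, select/poll may monitor a very large number of file descriptors. If each call returns only a small fraction of them, select/poll is inefficient. The idea behind epoll is to split the single select/poll operation into one epoll_create, many epoll_ctl calls, and one epoll_wait. In addition, the kernel adds a file system, "eventpollfs", for epoll. Each set of one or more monitored file descriptors has a corresponding inode in eventpollfs, whose main information is kept in a struct eventpoll. The important information about each monitored file is kept in a struct epitem. The relationship between eventpoll and epitem is therefore one-to-many.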
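    Because epoll_create and epoll_ctl have already stored the user-space information in the kernel, repeated calls to epoll_wait do not copy the arguments again, rescan the file descriptors, or repeatedly add the current process to and remove it from the wait queues. This avoids all three drawbacks listed above. The following sections look at how these calls are implemented.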
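Before turning to the kernel side, the create/ctl/wait split described above can be sketched from user space. This is only a minimal illustration, assuming a Linux system; the function name demo_epoll_basic, the MAX_EVENTS constant, and the use of a pipe as the monitored file are choices made for this example and do not come from the original article.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/epoll.h>

#define MAX_EVENTS 4

/*
 * Sketch: register one descriptor once, then wait on the epoll instance.
 * Returns the number of ready descriptors reported by epoll_wait(),
 * or -1 if any step fails.
 */
int demo_epoll_basic(void)
{
    int pipefd[2];
    int epfd;
    int n = -1;
    struct epoll_event ev;
    struct epoll_event ready[MAX_EVENTS];

    if (pipe(pipefd) == -1)
        return -1;

    /* Step 1: a single epoll_create() builds the kernel-side instance. */
    epfd = epoll_create(1);
    if (epfd == -1)
        goto out_pipe;

    /* Step 2: one epoll_ctl() per descriptor stores the interest in the kernel. */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto out_epfd;

    /* Make the read end readable so the wait below has something to report. */
    if (write(pipefd[1], "x", 1) != 1)
        goto out_epfd;

    /* Step 3: epoll_wait() takes no descriptor list, only an output buffer. */
    n = epoll_wait(epfd, ready, MAX_EVENTS, 1000);
    assert(n <= MAX_EVENTS);
    if (n == 1 && ready[0].data.fd != pipefd[0])
        n = -1;

out_epfd:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Note that the descriptor set is handed to the kernel only in step 2; a server would normally repeat step 3 in a loop without repeating the first two steps.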
2. Key data structures
/* Wrapper struct used by poll queueing */
struct ep_pqueue {
        poll_table pt;
        struct epitem *epi;
};
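    This structure is similar to struct poll_wqueues in select/poll. Because epoll has to keep a lot of state in the kernel, a bare callback function pointer is no longer enough, so a new structure, struct epitem, is introduced here.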
/*
* Each file descriptor added to the eventpoll interface will
* have an entry of this type linked to the hash.
*/
struct epitem {
        /* RB-Tree node used to link this structure to the eventpoll rb-tree */
        struct rb_node rbn;
Red-black tree node that links this epitem into the eventpoll's tree.
        /* List header used to link this structure to the eventpoll ready list */
        struct list_head rdllink;
Doubly linked list node that links this epitem onto the eventpoll's ready list.
        /* The file descriptor information this item refers to */
        struct epoll_filefd ffd;
Information about the monitored file descriptor that this structure corresponds to.
        /* Number of active wait queue attached to poll operations */
        int nwait;
Number of active wait queues attached by poll operations.
        /* List containing poll wait queues */
        struct list_head pwqlist;
Doubly linked list holding the wait queue entries of the monitored file; its role is similar to poll_table in select/poll.
        /* The "container" of this item */
        struct eventpoll *ep;
Points to the owning eventpoll; many epitems belong to one eventpoll.
        /* The structure that describe the interested events and the source fd */
        struct epoll_event event;
Records the events of interest and the associated fd.
        /*
         * Used to keep track of the usage count of the structure. This avoids
         * that the structure will desappear from underneath our processing.
         */
        atomic_t usecnt;
Reference count.
        /* List header used to link this item to the "struct file" items list */
        struct list_head fllink;
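Doubly linked list node that links this item to the struct file of the monitored descriptor. struct file has an f_ep_links member that holds all the epoll items monitoring that file.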
        /* List header used to link the item to the transfer list */
        struct list_head txlink;
Doubly linked list node used to put this item on the transfer list.
        /*
         * This is used during the collection/transfer of events to userspace
         * to pin items empty events set.
         */
        unsigned int revents;
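State of the file descriptor; used while events are collected and transferred to user space to pin items whose event set is empty.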
};
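    Each epitem describes one file descriptor associated with an epoll instance, so there is exactly one epitem per monitored file descriptor. The items are stored in the eventpoll's red-black tree, even though the kernel comment above still says "hash". Why they need to be stored is explained in detail below.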

struct eventpoll {
        /* Protect the this structure access */
        rwlock_t lock;
Read-write lock.
        /*
         * This semaphore is used to ensure that files are not removed
         * while epoll is using them. This is read-held during the event
         * collection loop and it is write-held during the file cleanup
         * path, the epoll file exit code and the ctl operations.
         */
        struct rw_semaphore sem;
Read-write semaphore.
        /* Wait queue used by sys_epoll_wait() */
        wait_queue_head_t wq;
        /* Wait queue used by file->poll() */
        wait_queue_head_t poll_wait;
        /* List of ready file descriptors */
        struct list_head rdllist;
Queue of events whose operations have completed (the ready list).
        /* RB-Tree root used to store monitored fd structs */
        struct rb_root rbr;
Holds the file descriptors monitored by this epoll instance.
};
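This structure holds the extended information for an epoll file descriptor and is stored in the private_data member of struct file. There is one eventpoll per epoll file. One epoll file usually monitors many file descriptors, so one struct eventpoll corresponds to many struct epitem.
So where does epoll keep its wait entries? See below.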
/* Wait structure used by the poll hooks */
struct eppoll_entry {
        /* List header used to link this structure to the "struct epitem" */
        struct list_head llink;
        /* The "base" pointer is set to the container "struct epitem" */
        void *base;
        /*
         * Wait queue item that will be linked to the target file wait
         * queue head.
         */
        wait_queue_t wait;
        /* The wait queue head that linked the "wait" wait queue item */
        wait_queue_head_t *whead;
};
    Compared with struct poll_table_entry in select/poll, epoll's wait queue entry structure differs only slightly. For comparison, here is struct poll_table_entry:
struct poll_table_entry {
        struct file * filp;
        wait_queue_t wait;
        wait_queue_head_t * wait_address;
};
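    Because an epitem corresponds to one monitored file, base gives convenient access to the monitored file's information. And because one file may have several events to wait for, llink chains the corresponding entries together on the epitem.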
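3. Implementation of epoll_create
    epoll_create() creates an inode in the eventpollfs file system. The work is done by ep_getfd(), which first calls ep_eventpoll_inode() to create an inode, then calls d_alloc() to allocate a dentry for it, and finally ties the file, dentry, and inode together.
    After ep_getfd() returns, epoll_create() calls ep_file_init(), which allocates a struct eventpoll and stores its pointer in the struct file, associating the eventpoll with the file.
    Note that the size argument of epoll_create() is only a hint. As long as it is greater than 0, it does not limit the number of file descriptors that can be associated with this epoll inode.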
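The remark about size can be checked from user space. The sketch below is an illustration only, assuming a Linux kernel that rejects non-positive sizes with EINVAL as described in the epoll_create manual page; the function names and the NPIPES constant are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/epoll.h>

#define NPIPES 8

/*
 * Creates an epoll instance with a size hint of 1 and tries to register
 * NPIPES descriptors with it. Returns how many registrations succeeded,
 * or -1 if the epoll instance could not be created.
 */
int demo_size_is_hint(void)
{
    int fds[NPIPES][2];
    int created = 0;
    int added = 0;
    int epfd;
    int i;
    struct epoll_event ev;

    epfd = epoll_create(1);            /* hint of 1, far fewer than NPIPES */
    if (epfd == -1)
        return -1;

    for (i = 0; i < NPIPES; i++) {
        if (pipe(fds[i]) == -1)
            break;
        created++;
        ev.events = EPOLLIN;
        ev.data.fd = fds[i][0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i][0], &ev) == 0)
            added++;
    }
    assert(added <= created);

    for (i = 0; i < created; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(epfd);
    return added;
}

/* Returns 1 if epoll_create(0) is rejected with EINVAL, otherwise 0. */
int demo_zero_size_rejected(void)
{
    int epfd;

    errno = 0;
    epfd = epoll_create(0);
    if (epfd != -1) {
        close(epfd);
        return 0;
    }
    return errno == EINVAL;
}
```

If the hint were a hard limit, the first function would stop at one successful registration; on the kernels this sketch assumes, all NPIPES registrations go through.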
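4. Implementation of epoll_ctl
    epoll_ctl performs a set of operations, such as associating a file with the inode in eventpollfs. Recall struct eventpoll: it is stored in file->private_data and records the important information about the eventpollfs inode. Its rbr member holds all the file descriptors monitored by this epoll file, organized as a red-black tree, a structure that makes lookups very efficient.
    epoll_ctl first calls ep_find() to look up the epitem in the eventpoll's red-black tree, then performs a different operation depending on the op argument. If op is EPOLL_CTL_ADD, the epitem should normally not be found in the tree, so ep_insert() is called to create an epitem and insert it into the tree.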
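The effect of this lookup is visible through the error codes that epoll_ctl returns. The following sketch assumes the documented Linux behavior (EEXIST for adding a descriptor that is already registered, ENOENT for modifying one that is not); try_ctl and demo_ctl_lookup are hypothetical helper names used only here.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/epoll.h>

/* Runs one epoll_ctl() operation; returns 0 on success or the errno value. */
static int try_ctl(int epfd, int op, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/*
 * Stores the error of a MOD on an unregistered descriptor and the error of
 * a duplicate ADD. Returns 0 if the setup and the first ADD worked, else -1.
 */
int demo_ctl_lookup(int *dup_add_err, int *mod_missing_err)
{
    int pipefd[2];
    int epfd;
    int first_add;

    assert(dup_add_err != NULL && mod_missing_err != NULL);

    if (pipe(pipefd) == -1)
        return -1;
    epfd = epoll_create(1);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* Not registered yet: the lookup finds nothing, so MOD fails. */
    *mod_missing_err = try_ctl(epfd, EPOLL_CTL_MOD, pipefd[0]);

    /* First ADD: the lookup finds nothing, so an item is inserted. */
    first_add = try_ctl(epfd, EPOLL_CTL_ADD, pipefd[0]);

    /* Second ADD: the lookup now finds the item, so ADD is refused. */
    *dup_add_err = try_ctl(epfd, EPOLL_CTL_ADD, pipefd[0]);

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return first_add == 0 ? 0 : -1;
}
```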
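    ep_insert() first allocates an epitem object, initializes it, and inserts it into the red-black tree. It also has one more job: registering on the wait queue of the monitored file. That step is done by the following code.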
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    ...
    revents = tfile->f_op->poll(tfile, &epq.pt);
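    The function first calls init_poll_funcptr to register the callback ep_ptable_queue_proc, which runs when f_op->poll is called. ep_ptable_queue_proc allocates an epoll wait queue entry, eppoll_entry, and links it in two places: onto the wait queue of the monitored file, and onto the epitem's list. It also registers ep_poll_callback as the wait queue callback. When the file operation completes, ep_poll_callback() is called before the waiting process is woken up; it puts the epitem on the eventpoll's ready list and wakes up the waiting processes.
    If, after f_op->poll returns, the monitored file turns out to be ready already, the epitem is put on the ready list right away and the processes waiting on it are woken up immediately.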
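The last case, a file that is already ready when it is added, can be observed from user space. The sketch below is a hedged illustration with a pipe standing in for the monitored file; demo_ready_before_add is a name made up for this example.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/epoll.h>

/*
 * Writes to a pipe BEFORE registering its read end, then polls with a
 * zero timeout. Returns the epoll_wait() result, or -1 on setup failure.
 */
int demo_ready_before_add(void)
{
    int pipefd[2];
    int epfd;
    int n = -1;
    struct epoll_event ev;
    struct epoll_event ready;

    if (pipe(pipefd) == -1)
        return -1;

    /* The data arrives first, so no wakeup callback can fire for it later. */
    if (write(pipefd[1], "x", 1) != 1)
        goto out_pipe;

    epfd = epoll_create(1);
    if (epfd == -1)
        goto out_pipe;

    /* Registration itself polls the file and sees that it is readable. */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto out_epfd;

    /* Timeout 0: report only what is already on the ready list. */
    n = epoll_wait(epfd, &ready, 1, 0);
    assert(n <= 1);

out_epfd:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

The event is reported even though nothing happened on the pipe after registration, which is consistent with the readiness check made at insertion time.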
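5. Implementation of epoll_wait
    epoll_wait waits for file operations to complete and returns them.
    Its core is ep_poll(). In a for loop, this function checks whether the eventpoll's ready list contains any completed events. If so, it returns the results; if not, it calls schedule_timeout() to sleep until the process is woken up again or the timeout expires.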
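The timeout branch of that loop is easy to see from user space. This is a small sketch under the assumption that no signal interrupts the call; demo_wait_timeout is a hypothetical name, and the pipe is only there to give the epoll instance something to monitor.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/epoll.h>

/*
 * Registers the read end of an empty pipe and waits for timeout_ms
 * milliseconds. Nothing is ever written, so the ready list stays empty.
 * Returns the epoll_wait() result, or -1 on setup failure.
 */
int demo_wait_timeout(int timeout_ms)
{
    int pipefd[2];
    int epfd;
    int n = -1;
    struct epoll_event ev;
    struct epoll_event ready;

    /* A negative timeout would block forever in this sketch. */
    assert(timeout_ms >= 0);

    if (pipe(pipefd) == -1)
        return -1;
    epfd = epoll_create(1);
    if (epfd == -1)
        goto out_pipe;

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto out_epfd;

    /* Ready list is empty: the caller sleeps until the timeout expires. */
    n = epoll_wait(epfd, &ready, 1, timeout_ms);

out_epfd:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

A return value of 0 means the wait ended because of the timeout rather than because an item reached the ready list.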
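6. Performance analysis
    The epoll mechanism was designed around the weaknesses of select/poll. Through the newly introduced eventpollfs file system, epoll copies the parameters into the kernel once, so they are not copied again on every poll. Splitting the operation into epoll_create, epoll_ctl, and epoll_wait avoids repeatedly traversing the monitored file descriptors. In addition, when a process calling epoll is woken up, it only has to take the completed events directly from the eventpoll's ready list, so the cost of finding completed events drops from O(N) to O(1).
    However, epoll's performance gain depends on a precondition: the number of monitored file descriptors is very large, and only a few of them complete on each call. Whether epoll significantly improves efficiency therefore depends on the actual workload, and this needs further testing.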
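The scenario described above, many monitored descriptors with only a few ready, can be illustrated with a short user-space sketch. It is not a benchmark; NFDS, demo_only_ready_reported, and the use of data.u32 to carry an array index are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>

#define NFDS 64

/*
 * Registers NFDS pipes, makes only pipe number `which` readable, and
 * waits once. Stores the index carried by the first reported event in
 * *reported. Returns the epoll_wait() result, or -1 on setup failure.
 */
int demo_only_ready_reported(int which, int *reported)
{
    int fds[NFDS][2];
    struct epoll_event ev;
    struct epoll_event ready[NFDS];
    int epfd;
    int created = 0;
    int n = -1;
    int i;

    assert(which >= 0 && which < NFDS && reported != NULL);

    epfd = epoll_create(NFDS);
    if (epfd == -1)
        return -1;

    for (i = 0; i < NFDS; i++) {
        if (pipe(fds[i]) == -1)
            goto cleanup;
        created++;
        memset(&ev, 0, sizeof ev);
        ev.events = EPOLLIN;
        ev.data.u32 = (unsigned int)i;   /* remember which pipe this is */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i][0], &ev) == -1)
            goto cleanup;
    }

    /* Only one of the NFDS descriptors becomes ready. */
    if (write(fds[which][1], "x", 1) != 1)
        goto cleanup;

    /* The caller receives just the ready entries, not all NFDS of them. */
    n = epoll_wait(epfd, ready, NFDS, 1000);
    if (n > 0)
        *reported = (int)ready[0].data.u32;

cleanup:
    for (i = 0; i < created; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(epfd);
    return n;
}
```

With select/poll the caller would have to scan all NFDS entries after each call to find the one that is ready; here the returned array contains only that entry.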
