An Analysis of the epoll Implementation (author: lvyilong316)

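The previous chapter identified the two bottlenecks that limit poll's efficiency; the question now is how to remove them. First, if 1000 fds are being monitored, every call to poll has to copy all 1000 of them into the kernel, which is wasteful. Why doesn't the kernel simply keep the fds it has already been given? That is exactly what epoll does, and its API says as much: fds are not passed in at epoll_wait time; they are handed to the kernel through epoll_ctl and then waited on together, which eliminates the redundant copying. Second, epoll_wait does not add current to each fd's device wait queue in turn. Instead, when a device wait queue is woken up, a callback function is invoked (this of course requires a "wakeup callback" mechanism) that puts the fd with the pending event on a list, and epoll_wait returns the fds on that list.
In addition, epoll implements its own special file system, the eventpoll filesystem.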
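The register-once, wait-many pattern described above can be seen from user space. The following is a minimal sketch, assuming a Linux system; the helper name epoll_pipe_demo is hypothetical, and a pipe stands in for whatever fd an application would really monitor.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register the read end of a pipe once, make it
 * readable, and wait. Returns the number of ready fds, or -1 on error. */
int epoll_pipe_demo(void)
{
    int pipefd[2];
    struct epoll_event ev = {0}, out[4];
    int epfd, n = -1;

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto close_pipe;

    /* The fd is handed to the kernel once, through epoll_ctl. */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto close_ep;

    if (write(pipefd[1], "x", 1) != 1)
        goto close_ep;

    /* epoll_wait receives no fd list, only a buffer for the results. */
    n = epoll_wait(epfd, out, 4, 1000);
    if (n == 1 && out[0].data.fd != pipefd[0])
        n = -1;
close_ep:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Note that epoll_wait is never told which fds to watch; it only reports what the kernel already knows to be ready.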

1. Kernel data structures

(1) struct eventpoll {

spinlock_t lock;

struct mutex mtx;

wait_queue_head_t wq; /* Wait queue used by sys_epoll_wait(); a process that calls epoll_wait() sleeps on this queue */

wait_queue_head_t poll_wait; /* Wait queue used by file->poll(); used when the epoll fd itself is being polled */

struct list_head rdllist; /* List of ready file descriptors; every epitem that is ready is linked here */

struct rb_root rbr; /* RB tree root used to store monitored fd structs; every monitored epitem lives here */

struct epitem *ovflist; /* Holds the epitems whose events arrive while events are being transferred to user space */

struct user_struct *user; /* Per-user information, such as the maximum number of fds that may be watched */

};

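Sockets added to the epoll descriptor through epoll_ctl, by contrast, belong to the socket filesystem; this is important to keep in mind. Each fd added for monitoring (monitoring here is not the same as the listen call) corresponds to one epitem structure. The epitems are organised as a red-black tree whose root is stored in the eventpoll structure (the rbr member). In addition, the epitems of sockets on which a monitored event has arrived are linked into a doubly linked list whose head is also stored in eventpoll (the rdllist member).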

* Each file descriptor added to the eventpoll interface will have an entry of this type linked to the "rbr" RB tree.

(2) struct epitem {

struct rb_node rbn; /* RB tree node used to link this structure to the eventpoll RB tree */

struct list_head rdllink; /* List node; every ready epitem is linked into the eventpoll's rdllist through it */

struct epitem *next;

struct epoll_filefd ffd; /* The file descriptor information this item refers to */

int nwait; /* Number of active wait queue attached to poll operations */

struct list_head pwqlist; /* List containing poll wait queues */

struct eventpoll *ep; /* The "container" of this item */

struct list_head fllink; /* List header used to link this item to the "struct file" items list */

struct epoll_event event; /* The events this epitem is interested in; copied from user space when epoll_ctl is called */

};

(3) struct epoll_filefd {

struct file *file;

int fd;};

(4) struct eppoll_entry { /* Wait structure used by the poll hooks */

struct list_head llink; /* List header used to link this structure to the "struct epitem" */

struct epitem *base; /* The "base" pointer is set to the container "struct epitem" */

wait_queue_t wait; /* Wait queue item that will be linked to the target file wait queue head. */

wait_queue_head_t *whead; /* The wait queue head that linked the "wait" wait queue item */

}; // Note: the last two members together play the role of the wait queue

(5) struct ep_pqueue {/* Wrapper struct used by poll queueing */

poll_table pt; // struct poll_table is a wrapper around a function pointer

struct epitem *epi;

};

(6) struct ep_send_events_data {

/* Used by the ep_send_events() function as callback private data */

int maxevents;

struct epoll_event __user *events;

};

The relationships among these data structures are shown in the figure below:

2. Function call analysis

Overall diagram of the epoll function call relationships:

3. Function implementation analysis

3.1 eventpoll_init

epoll is a module, so let's start with the module entry point, eventpoll_init.
[fs/eventpoll.c-->eventpoll_init()] (simplified)
static int __init eventpoll_init(void)
{
epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);

pwq_cache = kmem_cache_create("eventpoll_pwq",
sizeof(struct eppoll_entry), 0, EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);
// register a new file system called "eventpollfs"

error = register_filesystem(&eventpoll_fs_type);
eventpoll_mnt = kern_mount(&eventpoll_fs_type);
}
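Interestingly, during initialisation this module registers a new file system called "eventpollfs" (described by the eventpoll_fs_type structure) and then mounts it. It also creates two kernel caches, one for struct epitem and one for struct eppoll_entry. (In kernel programming, when small blocks of memory have to be allocated frequently, a kmem_cache should be created to serve as a "memory pool".)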

Now think about why epoll_create returns a new fd: because it creates a new file in this "eventpollfs" file system! See below:

3.2 sys_epoll_create

[fs/eventpoll.c-->sys_epoll_create()]
asmlinkage long sys_epoll_create(int size)
{
int error, fd;
struct inode *inode;
struct file *file;
error = ep_getfd(&fd, &inode, &file);
/* Setup the file internal data structure ( "struct eventpoll" ) */
error = ep_file_init(file);

}
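The function is simple. ep_getfd looks like a "get", but when epoll_create is called it actually creates a new inode, a new file and a new fd. ep_file_init creates a struct eventpoll and stores it in file->private_data. Keep this private_data in mind, because it is used again later.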
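Because the descriptor returned by epoll_create is backed by a real file, it can itself be handed to poll(), which is what the poll_wait member of struct eventpoll is for. Here is a minimal sketch under the assumption of a Linux system; the helper name poll_the_epoll_fd is hypothetical.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the revents that poll() reports for the
 * epoll fd itself, or -1 on error. */
int poll_the_epoll_fd(int make_ready)
{
    int pipefd[2], epfd, rc = -1;
    struct epoll_event ev = {0};
    struct pollfd pfd;

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out_ep;
    if (make_ready && write(pipefd[1], "x", 1) != 1)
        goto out_ep;

    /* Poll the epoll fd, not the pipe: it is readable when its ready
     * list is non-empty. */
    pfd.fd = epfd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 0) >= 0)
        rc = pfd.revents;
out_ep:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

With make_ready set to 0 the epoll fd reports nothing; with 1 it reports POLLIN, since one monitored fd is ready.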

3.3 epoll_ctl

With epoll_create done, epoll_ctl comes next. The validation code is omitted:
[fs/eventpoll.c-->sys_epoll_ctl()]
asmlinkage long
sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event)
{
struct file *file, *tfile;
struct eventpoll *ep;
struct epitem *epi;
struct epoll_event epds;
....
epi = ep_find(ep, tfile, fd); // look up the epitem in the rb-tree for the monitored fd (tfile is that fd's struct file)
switch (op) { // NULL checks on epi omitted
case EPOLL_CTL_ADD:
epds.events |= POLLERR | POLLHUP;
error = ep_insert(ep, &epds, tfile, fd);
break;
case EPOLL_CTL_DEL:
error = ep_remove(ep, epi);
break;
case EPOLL_CTL_MOD:
epds.events |= POLLERR | POLLHUP;
error = ep_modify(ep, epi, &epds);
break;
}
....
}
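So the function first does an ep_find inside a "big structure" (struct eventpoll) to see whether a struct epitem already exists, and then, depending on whether the user asked for ADD, DEL or MOD, calls the corresponding function, which adds, removes or modifies a node in the red-black tree of epitems (one node per monitored fd). Quite straightforward. But what is this "big structure"? Judging from how ep_find is called, the ep argument must point to it, and the assignment ep = file->private_data makes it clear: the "big structure" is the struct eventpoll created in epoll_create. Looking at the implementation of ep_find, the lookup goes through the rbr member of struct eventpoll (a struct rb_root), which is the root of a red-black tree, and the nodes on that tree are struct epitem.
So now it is clear: a newly created epoll file carries a struct eventpoll, which holds a red-black tree, and that tree is where the fds passed to each epoll_ctl call are stored.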
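The effect of the ep_find lookup is visible from user space through the error codes of epoll_ctl: adding an fd that is already in the tree fails with EEXIST, while modifying or deleting one that is not there fails with ENOENT. The sketch below assumes a Linux system; ctl_result is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: run one epoll_ctl operation on a pipe fd that is
 * (or is not) already registered. Returns 0 on success, else errno. */
int ctl_result(int op, int already_added)
{
    int pipefd[2], epfd, err = 0;
    struct epoll_event ev = {0};

    if (pipe(pipefd) < 0)
        return errno;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        err = errno;
        goto out_pipe;
    }

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (already_added &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0) {
        err = errno;
        goto out_ep;
    }

    /* The operation under test. */
    if (epoll_ctl(epfd, op, pipefd[0], &ev) < 0)
        err = errno;
out_ep:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return err;
}
```

For example, ctl_result(EPOLL_CTL_ADD, 1) yields EEXIST and ctl_result(EPOLL_CTL_DEL, 0) yields ENOENT.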

3.4 sys_epoll_wait

Now that the data structures are clear, let's look at the core function:
[fs/eventpoll.c-->sys_epoll_wait()]
asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events, int maxevents,

int timeout)
{
struct file *file;
struct eventpoll *ep;
/* Get the "struct file *" for the eventpoll file */
file = fget(epfd);
/*
* We have to check that the file structure underneath the fd
* the user passed to us _is_ an eventpoll file. (So passing the fd of an ordinary file here is an error.)
*/
if (!IS_FILE_EPOLL(file))
goto eexit_2;

ep = file->private_data;
error = ep_poll(ep, events, maxevents, timeout);

……

}

The same trick again: fetch the struct eventpoll from file->private_data, then call ep_poll.
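The IS_FILE_EPOLL check can be observed from user space: calling epoll_wait on a descriptor that is not an epoll file fails with EINVAL. A minimal sketch, assuming a Linux system and using a hypothetical helper name:

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: call epoll_wait on a pipe fd instead of an epoll
 * fd. Returns the resulting errno, 0 if the call unexpectedly succeeded,
 * or -1 if the pipe could not be created. */
int wait_on_non_epoll_fd(void)
{
    int pipefd[2], err = 0;
    struct epoll_event out[1];

    if (pipe(pipefd) < 0)
        return -1;
    if (epoll_wait(pipefd[0], out, 1, 0) < 0)
        err = errno;
    close(pipefd[0]);
    close(pipefd[1]);
    return err;
}
```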

3.5 ep_poll()

[fs/eventpoll.c-->sys_epoll_wait()->ep_poll()]
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents,

long timeout)
{
int res;
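wait_queue_t wait; // wait queue entry
if (list_empty(&ep->rdllist)) {
// ep->rdllist holds the fds that are already ready; if it is empty, nothing is ready yet, so the current process has to be put on the wait queue
init_waitqueue_entry(&wait, current); // initialise a wait queue entry for the current process (current)
add_wait_queue(&ep->wq, &wait); // add the entry to ep's wait queue (i.e. put the current process on the wait queue)
for (;;) {
/* Set the task state to TASK_INTERRUPTIBLE before doing the checks, so that a wakeup sent by ep_poll_callback() in between is not lost */
set_current_state(TASK_INTERRUPTIBLE);
if (!list_empty(&ep->rdllist) || !jtimeout) // leave the loop if ep->rdllist is non-empty (some fd is ready) or the timeout has expired
break;
if (signal_pending(current)) {
res = -EINTR;
break;
}
jtimeout = schedule_timeout(jtimeout); // sleep until woken up or until the timeout expires (locking omitted)
}
remove_wait_queue(&ep->wq, &wait); // remove the entry from the wait queue (take the current process off it)
set_current_state(TASK_RUNNING);
}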
....
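Another big loop, but a better one than poll's, because on closer inspection it does nothing except sleep and check whether ep->rdllist is empty! Doing nothing is of course efficient, but who makes ep->rdllist non-empty? The answer is the callback function installed in ep_insert.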
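The two exits of that loop other than a signal, a non-empty ready list or an expired timeout, have a simple user-space counterpart. The sketch below (Linux assumed; idle_wait is a hypothetical helper) waits on an epoll instance that monitors nothing, so the ready list stays empty and only the timeout can end the wait.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: wait on an empty epoll instance. With nothing
 * registered the ready list never fills, so epoll_wait returns 0 once
 * the timeout (in milliseconds) has expired. Returns -1 on error. */
int idle_wait(int timeout_ms)
{
    struct epoll_event out[1];
    int epfd = epoll_create1(0);
    int n;

    if (epfd < 0)
        return -1;
    n = epoll_wait(epfd, out, 1, timeout_ms);
    close(epfd);
    return n;
}
```

A timeout of 0 corresponds to the case where the loop breaks immediately without sleeping.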

3.6 ep_insert()

[fs/eventpoll.c-->sys_epoll_ctl()-->ep_insert()]
static int ep_insert(struct eventpoll *ep, struct epoll_event *event, struct file *tfile, int fd)
{

struct epitem *epi;
struct ep_pqueue epq; // create an ep_pqueue object
epi = EPI_MEM_ALLOC(); // allocate an epitem
/* Initialise the epitem ... */
epi->ep = ep; // point the new epitem at the struct eventpoll that was passed in

/* The next few lines set the remaining epitem fields */
EP_SET_FFD(&epi->ffd, tfile, fd); // record the monitored file and fd in the new epitem
epi->event = *event;
epi->nwait = 0;

/* Initialize the poll table using the queue callback */
epq.epi = epi; // associate the epq with the newly inserted epitem (epi)

// the next statement is equivalent to epq.pt.qproc = ep_ptable_queue_proc;

init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

revents = tfile->f_op->poll(tfile, &epq.pt); // tfile is the target file, i.e. the file being monitored; poll() returns the mask of ready events, which is stored in revents

list_add_tail(&epi->fllink, &tfile->f_ep_links); // each file links together all the epitems that monitor it

ep_rbtree_insert(ep, epi); // with everything set up, insert the epitem into its eventpoll

……

}

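The call tfile->f_op->poll(tfile, &epq.pt) invokes the poll method of the monitored file (which epoll calls the "target file"). That poll method in turn calls poll_wait (remember poll_wait? Every device driver that supports poll has to call it), which ends up calling ep_ptable_queue_proc. (Note: f_op->poll() is usually just a wrapper that calls the real poll implementation. For a UDP socket, for example, the call chain is f_op->poll(), sock_poll(), udp_poll(), datagram_poll(), sock_poll_wait().) This call relationship is fairly hard to follow because it is not a direct call at the language level but goes through function pointers. ep_insert also puts the struct epitem on the f_ep_links list in struct file so that it can be found easily; that is the job of the fllink member of struct epitem.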

3.7 ep_ptable_queue_proc

[fs/eventpoll.c-->ep_ptable_queue_proc()]
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt)
{
struct epitem *epi = EP_ITEM_FROM_EPQUEUE(pt);
struct eppoll_entry *pwq;
if (epi->nwait >= 0 && (pwq = PWQ_MEM_ALLOC())) {
init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
pwq->whead = whead;
pwq->base = epi;
add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;
} else {
/* We have to signal that an error occurred */
epi->nwait = -1;
}
}
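The code above is the most important thing ep_insert does: it creates a struct eppoll_entry, sets its wakeup callback to ep_poll_callback, and adds it to the device wait queue (the whead here is the wait queue that, as the previous chapter explained, every device driver carries). Only then will ep_poll_callback be invoked when the device becomes ready and wakes up the waiters on its wait queue. On every poll system call the operating system has to hook current (the current process) onto the wait queues of all the devices behind the fds, and with thousands of fds this is expensive. epoll_wait avoids the repetition: epoll hooks onto each device wait queue only once, at epoll_ctl time (that first time is unavoidable), and leaves each fd with the instruction "call the callback when you are ready". When a device has an event, the callback puts the fd on rdllist, and each epoll_wait call only has to collect the fds on rdllist. By making clever use of callbacks, epoll achieves a more efficient event-driven model.
By now we can guess what ep_poll_callback does: it takes the epitem (one per fd) on the red-black tree (ep->rbr) that received the event and inserts it into ep->rdllist, so that when epoll_wait returns, rdllist holds exactly the ready fds.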

3.8 ep_poll_callback

[fs/eventpoll.c-->ep_poll_callback()]
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
int pwake = 0;
struct epitem *epi = EP_ITEM_FROM_WAIT(wait);
struct eventpoll *ep = epi->ep;
/* If this file is already in the ready list we exit soon */
if (EP_IS_LINKED(&epi->rdllink))
goto is_linked;
list_add_tail(&epi->rdllink, &ep->rdllist);
is_linked:
/*
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
*/
if (waitqueue_active(&ep->wq))
wake_up(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
pwake++;
....
return 1;
}
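The essential behaviour of the callback, linking an item into the ready list at most once while still waking the waiter on every event, can be modelled in a few lines of ordinary C. This is only a toy model: every toy_ name is hypothetical, a fixed array stands in for the kernel's linked list, and the wakeup is reduced to a counter.

```c
#include <stddef.h>

#define TOY_MAX_ITEMS 8

struct toy_item {                 /* stands in for struct epitem */
    int fd;
    int linked;                   /* 1 while on the ready list, like EP_IS_LINKED */
};

struct toy_eventpoll {            /* stands in for struct eventpoll */
    struct toy_item *ready[TOY_MAX_ITEMS];   /* stands in for rdllist */
    size_t nready;
    int wakeups;                  /* counts what would be wake_up(&ep->wq) */
};

/* Same shape as ep_poll_callback: link the item only if it is not
 * already linked, then wake the waiter in either case. */
void toy_poll_callback(struct toy_eventpoll *ep, struct toy_item *it)
{
    if (!it->linked && ep->nready < TOY_MAX_ITEMS) {
        ep->ready[ep->nready++] = it;
        it->linked = 1;
    }
    ep->wakeups++;
}

/* Two device events on the same item. Returns the ready-list length and
 * stores the number of wakeups in *wakeups. */
int toy_callback_demo(int *wakeups)
{
    struct toy_eventpoll ep = {0};
    struct toy_item sock = { 5, 0 };

    toy_poll_callback(&ep, &sock);
    toy_poll_callback(&ep, &sock);
    *wakeups = ep.wakeups;
    return (int)ep.nready;
}
```

Two events on the same item leave a single entry on the ready list, which mirrors the EP_IS_LINKED shortcut in the kernel code.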

4. EPOLLET, a feature unique to epoll

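EPOLLET is a flag unique to the epoll system calls. ET stands for Edge Triggered; its exact semantics and typical usage are easy to look up. With EPOLLET, repeated events no longer keep interrupting the program's logic, which is why it is used so often. So how does EPOLLET work?
As we saw earlier, epoll attaches a callback to every fd; when the device behind an fd has something to report, the callback puts the fd on the rdllist list, so epoll_wait only needs to check rdllist to learn which fds have events. Let's look at the last few lines of ep_poll: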

4.1 ep_poll() (continued from 3.5)

[fs/eventpoll.c->ep_poll()]

/* Try to transfer events to user space. */
ep_events_transfer(ep, events, maxevents);
......
Copying the fds on rdllist to user space is the job of ep_events_transfer.

4.2 ep_events_transfer

[fs/eventpoll.c->ep_events_transfer()]
static int ep_events_transfer(struct eventpoll *ep, struct epoll_event __user *events,

int maxevents)
{
int eventcnt = 0;
struct list_head txlist;
INIT_LIST_HEAD(&txlist);
/* Collect/extract ready items */
if (ep_collect_ready_items(ep, &txlist, maxevents) > 0) {
/* Build result set in userspace */
eventcnt = ep_send_events(ep, &txlist, events);
/* Reinject ready items into the ready list */
ep_reinject_items(ep, &txlist);
}
up_read(&ep->sem);
return eventcnt;
}
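There is not much code. ep_collect_ready_items moves the fds from rdllist to txlist (after which rdllist is empty), ep_send_events copies the fds on txlist to user space, and ep_reinject_items then "returns" some of the fds from txlist to rdllist so that they can be found there again next time.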
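The maxevents argument that ep_events_transfer passes down limits how many ready items one call may hand to user space. The sketch below shows that cap from user space, assuming a Linux system; wait_with_cap and TOY_MAX_PIPES are hypothetical names.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define TOY_MAX_PIPES 8

/* Hypothetical helper: make npipes pipes readable, then call epoll_wait
 * once with the given maxevents. Returns the number of events reported,
 * or -1 on error. */
int wait_with_cap(int npipes, int maxevents)
{
    int fds[TOY_MAX_PIPES][2];
    struct epoll_event ev = {0}, out[TOY_MAX_PIPES];
    int epfd, i, made = 0, n = -1;

    if (npipes < 1 || npipes > TOY_MAX_PIPES ||
        maxevents < 1 || maxevents > TOY_MAX_PIPES)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    for (i = 0; i < npipes; i++) {
        if (pipe(fds[i]) < 0)
            goto out;
        made++;
        ev.events = EPOLLIN;
        ev.data.fd = fds[i][0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i][0], &ev) < 0)
            goto out;
        if (write(fds[i][1], "x", 1) != 1)
            goto out;
    }
    /* All npipes fds are ready; at most maxevents are reported. */
    n = epoll_wait(epfd, out, maxevents, 0);
out:
    for (i = 0; i < made; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(epfd);
    return n;
}
```

With three ready pipes and maxevents set to 2, a single call reports two events; the third fd stays ready for a later call.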
ep_send_events is implemented as follows:

4.3 ep_send_events()

[fs/eventpoll.c->ep_send_events()]
static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,

struct epoll_event __user *events)
{
int eventcnt = 0;
unsigned int revents;
struct list_head *lnk;
struct epitem *epi;
list_for_each(lnk, txlist) {
epi = list_entry(lnk, struct epitem, txlink);
revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL); // call the poll method of each monitored file to get its ready events (a mask) and store the result in revents

epi->revents = revents & epi->event.events;
if (epi->revents) {
if (__put_user(epi->revents, &events[eventcnt].events) ||
__put_user(epi->event.data, &events[eventcnt].data)) // copy the event from kernel space to user space
return -EFAULT;
if (epi->event.events & EPOLLONESHOT)
epi->event.events &= EP_PRIVATE_BITS;
eventcnt++;
}
}
return eventcnt;
}
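There is not much to see in the copying itself, but note the line that calls f_op->poll: it is sneaky in that it passes NULL as the second argument. Let's first look at how a device driver typically implements poll: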
static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
struct scull_pipe *dev = filp->private_data;
unsigned int mask = 0;
poll_wait(filp, &dev->inq, wait);
poll_wait(filp, &dev->outq, wait);
if (dev->rp != dev->wp)
mask |= POLLIN | POLLRDNORM; /* readable */
if (spacefree(dev))
mask |= POLLOUT | POLLWRNORM; /* writable */
return mask;
}
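The code above is taken from Linux Device Drivers, Third Edition, and is an absolute classic. The device first hooks current (the current process) onto the two queues inq and outq (the hooking is done by the callback function pointer carried in wait) and then waits for the device to wake it up; once woken, the event mask can be obtained through mask (note the mask variable: it is what carries the event mask back). So what does poll_wait do if wait is NULL?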

4.4 poll_wait

[include/linux/poll.h->poll_wait]
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address,poll_table *p)
{
if (p && wait_address)
p->qproc(filp, wait_address, p);
}
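There it is: if the poll_table is NULL, nothing happens. Going back to ep_send_events, the poll call that passes NULL really means "I don't want to sleep, I just want the event mask". The mask obtained this way is then copied to user space. Once ep_send_events is done, it is ep_reinject_items' turn.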
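The "mask only" trick can be illustrated with a small stand-alone model. This is not kernel code: every toy_ name below is hypothetical, an integer counter stands in for a wait queue, and the driver is reduced to a single flag.

```c
#include <stddef.h>

#define TOY_POLLIN 0x1u

struct toy_table;
typedef void (*toy_qproc)(struct toy_table *pt, int *queue_len);

struct toy_table {                /* stands in for poll_table */
    toy_qproc qproc;
};

/* Mirrors poll_wait(): invoke the hook only when a table was supplied. */
void toy_poll_wait(int *queue_len, struct toy_table *pt)
{
    if (pt && queue_len)
        pt->qproc(pt, queue_len);
}

/* A hook that "enqueues a waiter" by bumping the queue length. */
void toy_enqueue(struct toy_table *pt, int *queue_len)
{
    (void)pt;
    (*queue_len)++;
}

/* Stands in for a driver's poll method: offer to register on the wait
 * queue, then report the current event mask. */
unsigned toy_driver_poll(int *queue_len, int has_data, struct toy_table *pt)
{
    toy_poll_wait(queue_len, pt);
    return has_data ? TOY_POLLIN : 0u;
}
```

Calling toy_driver_poll with a table registers a waiter and returns the mask; calling it with NULL returns the same mask and leaves the queue untouched, which is exactly how ep_send_events uses the real poll method.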

4.5 ep_reinject_items

[fs/eventpoll.c->ep_reinject_items]
static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
{
    int ricnt = 0, pwake = 0;
    unsigned long flags;
    struct epitem *epi;
    while (!list_empty(txlist)) { // walk txlist (which at this point holds the ready epitems)
        epi = list_entry(txlist->next, struct epitem, txlink);
        EP_LIST_DEL(&epi->txlink); // remove this epitem from txlist
        if (EP_RB_LINKED(&epi->rbn) && !(epi->event.events & EPOLLET) &&
            (epi->revents & epi->event.events) && !EP_IS_LINKED(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist); // put this epitem back on ep->rdllist
            ricnt++; // number of epitems re-added to ep->rdllist
        }
    }
    if (ricnt) { // if anything was re-added, wake up the processes waiting on the wait queues again
        if (waitqueue_active(&ep->wq))
            wake_up(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    ……

}
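ep_reinject_items puts some of the fds on txlist back on rdllist. Which ones? Look at the condition above: the fds that are put back are those that are not marked EPOLLET (i.e. the default LT mode; this is the !(epi->event.events & EPOLLET) test) and whose events are still of interest (the (epi->revents & epi->event.events) test). The next epoll_wait will then naturally pick these fds up from rdllist and report them to the user again. An example: take a socket that has only been connected and has not yet sent or received any data. Its poll event mask always contains POLLOUT (see the driver example above), so every epoll_wait call returns a POLLOUT event (which is rather annoying), because its fd keeps being put back on rdllist. If someone now writes a large amount of data into the socket so that it fills up (is no longer writable), the (epi->revents & epi->event.events) test fails (POLLOUT is gone), the fd is not put back on rdllist, and epoll_wait no longer reports POLLOUT to the user. Now add EPOLLET to this socket and connect it without sending or receiving data: this time the !(epi->event.events & EPOLLET) test fails, so epoll_wait reports POLLOUT to the user only once (because the fd never returns to rdllist), and subsequent epoll_wait calls report no events at all.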
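The difference between the two modes is easy to reproduce from user space. The sketch below assumes a Linux system and uses a pipe with EPOLLIN instead of a socket with POLLOUT; second_wait is a hypothetical helper that leaves one unread byte in the pipe and calls epoll_wait twice.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: write one byte, never read it, and call
 * epoll_wait twice. Returns the result of the second call, or -1 on
 * error or if the first call did not report exactly one event. */
int second_wait(uint32_t extra_flags)
{
    int pipefd[2], epfd, n = -1;
    struct epoll_event ev = {0}, out[1];

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out_ep;
    if (write(pipefd[1], "x", 1) != 1)
        goto out_ep;

    if (epoll_wait(epfd, out, 1, 0) != 1)   /* first report */
        goto out_ep;
    n = epoll_wait(epfd, out, 1, 0);        /* data is still unread */
out_ep:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

In LT mode (extra_flags 0) the second call reports the fd again; with EPOLLET it reports nothing, because no new event has occurred since the first report.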

Summary:

Overall diagram of the epoll function call relationships:

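Note: the call diagram above raises a question. When ep_reinject_items() puts the previously ready LT epitems back on the ready list, the next ep_poll() returns immediately; doesn't that create a loop? When do these LT epitems stop being re-added to the ready list? The answer lies in 4.3, ep_send_events(). Note the poll call in that function: as analysed above, when NULL is passed, poll merely fetches the event mask, so if the user's handling of the event has changed the file's state (its revents), the change is picked up here. For example, suppose the user is monitoring for readability. Once all the data has been read the file is no longer readable, so the revents obtained in ep_send_events() no longer contains the readable event, the (epi->revents & epi->event.events) test in ep_reinject_items() fails, and the epitem is not re-added to the ready list (ep->rdllist). If only part of the data is read, however, the file's state does not change (it is still readable), so it is re-added to the ready list and user space is notified again. This is why, in LT mode, the user keeps being notified of the read event until some operation causes the file descriptor to stop being ready (for example, when sending, receiving or accepting, or sending or receiving less than the requested amount of data, results in an EWOULDBLOCK error).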
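The partial-read versus full-read behaviour described in this note can be checked with a short user-space experiment. The sketch assumes a Linux system; lt_ready_after_read is a hypothetical helper that works on a pipe in the default LT mode.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: write nwrite bytes to a pipe, wait once, read
 * nread bytes back, then wait again in LT mode. Returns the result of
 * the second epoll_wait, or -1 on error. */
int lt_ready_after_read(size_t nwrite, size_t nread)
{
    char buf[64] = {0};
    int pipefd[2], epfd, n = -1;
    struct epoll_event ev = {0}, out[1];

    if (nwrite < 1 || nwrite > sizeof buf || nread > nwrite)
        return -1;
    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN;                    /* level-triggered by default */
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out_ep;
    if (write(pipefd[1], buf, nwrite) != (ssize_t)nwrite)
        goto out_ep;
    if (epoll_wait(epfd, out, 1, 0) != 1)   /* first notification */
        goto out_ep;
    if (read(pipefd[0], buf, nread) != (ssize_t)nread)
        goto out_ep;

    /* Still readable only if some data was left behind. */
    n = epoll_wait(epfd, out, 1, 0);
out_ep:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Reading only part of the data keeps the fd on the ready list, while draining it completely ends the notifications.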

With these calls added, the function call diagram looks as follows (the additions are drawn in blue):

Overall diagram of the epoll data structure relationships:

