Overview of epoll

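epoll is one of the I/O multiplexing mechanisms in Linux. I/O multiplexing lets a single process monitor multiple file descriptors; as soon as one of them becomes ready (typically readable or writable), the program is notified and can perform the corresponding read or write. Linux offers other multiplexing mechanisms as well, namely select and poll, but this article covers the kernel implementation of epoll.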

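There are plenty of introductions to the epoll interface online, and the interface is not my focus here, but a basic understanding is still necessary. It is very simple, just three functions. Here is a description of the interface excerpted from the web: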

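int epoll_create(int size);
Creates an epoll handle. size was meant to tell the kernel how many descriptors would be monitored. It is unlike the first argument of select(), which gives the highest monitored fd plus 1. Note that once the epoll handle has been created it occupies an fd of its own, which you can see under /proc/<pid>/fd/ on Linux. After you are done with epoll you must therefore call close() on it; otherwise fds may be exhausted.
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
The event registration function of epoll. Unlike select(), which tells the kernel what kinds of events to watch at the moment it waits, epoll registers the event types to watch in advance, here. The first argument is the return value of epoll_create(). The second argument is the operation, expressed with one of three macros:
EPOLL_CTL_ADD: register a new fd with epfd;
EPOLL_CTL_MOD: modify the events monitored for an already registered fd;
EPOLL_CTL_DEL: remove an fd from epfd;
The third argument is the fd to monitor, and the fourth tells the kernel which events to monitor. struct epoll_event is defined as follows: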

struct epoll_event {
  __uint32_t events;  /* Epoll events */
  epoll_data_t data;  /* User data variable */
};

events can be a bitwise OR of the following macros:

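EPOLLIN: the file descriptor is readable (this includes an orderly shutdown by the peer socket);
EPOLLOUT: the file descriptor is writable;
EPOLLPRI: the file descriptor has urgent data to read (this should mean that out-of-band data has arrived);
EPOLLERR: an error occurred on the file descriptor;
EPOLLHUP: the file descriptor was hung up;
EPOLLET: use edge-triggered mode for this descriptor, as opposed to the default level-triggered mode;
EPOLLONESHOT: report an event only once. After that event has been delivered the descriptor stays registered but disabled; to keep monitoring it, re-arm it with epoll_ctl() and EPOLL_CTL_MOD.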

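int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);
Waits for events, similar to a select() call. events receives the set of events from the kernel, and maxevents tells the kernel how large the events array is. Older descriptions say that maxevents must not exceed the size passed to epoll_create(); in the kernel analyzed here size is not used beyond a validity check, so no such limit applies, and the kernel only requires 0 < maxevents <= EP_MAX_EVENTS. timeout is the timeout in milliseconds: 0 returns immediately, and a negative value blocks indefinitely. The function returns the number of events to handle; a return value of 0 means the timeout expired.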
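The three calls described above fit together as in the following minimal sketch. It assumes a Linux system; `wait_on_pipe` is a hypothetical helper name chosen for illustration, and a pipe stands in for whatever descriptor a real program would monitor.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* wait_on_pipe() is a hypothetical helper, not part of any API.
 * It registers the read end of a pipe, makes it readable, and returns
 * what epoll_wait() reports, or -1 if setup fails. */
int wait_on_pipe(void)
{
    int pfd[2];
    int n = -1;

    if (pipe(pfd) < 0)
        return -1;

    int epfd = epoll_create(1);           /* size only has to be > 0 */
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pfd[0] };
        struct epoll_event out[4];

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&
            write(pfd[1], "x", 1) == 1) {
            n = epoll_wait(epfd, out, 4, 1000);   /* wait up to 1 s */
            assert(n <= 4);               /* never more than maxevents */
            if (n == 1 && out[0].data.fd != pfd[0])
                n = -1;                   /* unexpected descriptor */
        }
        close(epfd);                      /* the epoll handle is an fd too */
    }
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```

Since exactly one registered descriptor is readable, the helper should return 1.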

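Advantages of epoll over select/poll:

With select/poll, every call must pass the full set of monitored fds to the system call, which means the fd list is copied from user space to kernel space each time; with many fds this is inefficient. A call to epoll_wait (the counterpart of calling select/poll) does not pass the fd list again, because epoll_ctl has already told the kernel which fds to monitor (epoll_ctl does not copy all fds each time; it only applies incremental changes). So after epoll_create, the kernel already holds a data structure in kernel space for the monitored fds, and each epoll_ctl merely performs simple maintenance on it.
A critical weakness of select/poll shows up when you have a very large set of sockets of which, because of network latency, only some are "active" at any moment: every select/poll call scans the whole set linearly, so the cost grows linearly with the size of the set. epoll does not have this problem. It only touches the "active" sockets, because the kernel implementation is driven by a callback attached to each fd.
Even after we register a million fds through epoll_ctl, epoll_wait still returns quickly and hands us the fds that have events. When epoll_create is called, the kernel creates a file object for the epoll instance (backed by an anonymous inode in the code analyzed below), sets up a red-black tree to store the fds later passed in through epoll_ctl, and also creates a list that holds ready events. epoll_wait only needs to check whether this list has entries: if it does, it returns; if not, it sleeps, and it returns once the timeout expires even if the list is still empty. That is why epoll_wait is efficient. Moreover, even when we monitor millions of fds, usually only a few are ready at a time, so epoll_wait copies only a small number of events from kernel space to user space. How is the ready list maintained? When we call epoll_ctl, besides putting the fd into the red-black tree of the epoll instance, the kernel registers a callback on the target file's wait queue. When an event occurs on that fd (for example, data arrives on a socket after the kernel has copied it from the network card into kernel memory), the callback runs and inserts the fd's entry into the ready list.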
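The point that epoll_wait reports only the ready descriptors, however many are registered, can be observed from user space. The sketch below is an illustration under assumptions: `count_ready` and `NPIPES` are made-up names, and 64 pipes stand in for the "million fds" of the text so that the example stays within a typical descriptor limit.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 64   /* illustrative size, well below the usual fd limit */

/* Hypothetical helper: registers NPIPES pipes, makes `nready` of them
 * readable, and returns how many events one non-blocking epoll_wait()
 * reports (-1 on setup failure). */
int count_ready(int nready)
{
    int fds[NPIPES][2];
    struct epoll_event out_ev[NPIPES];
    int made = 0, n = -1;

    assert(nready >= 0 && nready <= NPIPES);

    int epfd = epoll_create(1);
    if (epfd < 0)
        return -1;

    for (; made < NPIPES; made++) {
        if (pipe(fds[made]) < 0)
            goto out;
        struct epoll_event ev = { .events = EPOLLIN,
                                  .data.u32 = (unsigned)made };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[made][0], &ev) < 0) {
            close(fds[made][0]);
            close(fds[made][1]);
            goto out;
        }
    }
    for (int i = 0; i < nready; i++)
        if (write(fds[i][1], "x", 1) != 1)
            goto out;

    /* timeout 0: just look at the ready list and return */
    n = epoll_wait(epfd, out_ev, NPIPES, 0);
out:
    for (int i = 0; i < made; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(epfd);
    return n;
}
```

With two of the 64 pipes written to, the call should report 2 events; with none written to, it should report 0.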

Source code analysis

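The epoll kernel code lives in fs/eventpoll.c. Below we analyze the kernel implementation of epoll_create, epoll_ctl, and epoll_wait in turn. The analysis is based on Linux kernel 4.1.2, although some of the excerpts (for example those that use EPOLLEXCLUSIVE or ep_busy_loop) appear to come from a later 4.x release.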

epoll_create

epoll_create creates an epoll handle. Its system call implementation in the kernel is as follows:

sys_epoll_create:

SYSCALL_DEFINE1(epoll_create, int, size)
{
    if (size <= 0)
        return -EINVAL;

    return sys_epoll_create1(0);
}

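As we can see, the size argument passed to epoll_create is only checked for being less than or equal to 0 and is never used again.
The whole function is just three lines of code; the real work is done in sys_epoll_create1.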
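The size check can be confirmed from user space with a small sketch like the one below. `try_epoll_create` is a hypothetical helper name; the example assumes Linux and a C library that passes the argument through to the kernel (or applies the same check itself).

```c
#include <assert.h>   /* for the usage checks mentioned after the block */
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if epoll_create(size) succeeds,
 * otherwise the errno value it failed with. */
int try_epoll_create(int size)
{
    int epfd = epoll_create(size);

    if (epfd < 0)
        return errno;
    close(epfd);
    return 0;
}
```

With this helper, `assert(try_epoll_create(0) == EINVAL)` and `assert(try_epoll_create(1) == 0)` should both hold, and a large value such as 1000000 behaves exactly like 1.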

sys_epoll_create -> sys_epoll_create1:

/*
 * Open an eventpoll file descriptor.
 */
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    int error, fd;
    struct eventpoll *ep = NULL;
    struct file *file;

    /* Check the EPOLL_* constant for consistency.  */
    BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

    if (flags & ~EPOLL_CLOEXEC)
        return -EINVAL;
    /*
     * Create the internal data structure ("struct eventpoll").
     */
    error = ep_alloc(&ep);
    if (error < 0)
        return error;
    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure and a free file descriptor.
     */
    fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
    if (fd < 0) {
        error = fd;
        goto out_free_ep;
    }
    file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                 O_RDWR | (flags & O_CLOEXEC));
    if (IS_ERR(file)) {
        error = PTR_ERR(file);
        goto out_free_fd;
    }
    ep->file = file;
    fd_install(fd, file);
    return fd;

out_free_fd:
    put_unused_fd(fd);
out_free_ep:
    ep_free(ep);
    return error;
}

The flow of sys_epoll_create1 is as follows:

First it calls ep_alloc to allocate an eventpoll structure and initialize its members. There is not much to say here; the code is:

sys_epoll_create -> sys_epoll_create1 -> ep_alloc:

static int ep_alloc(struct eventpoll **pep)
{
    int error;
    struct user_struct *user;
    struct eventpoll *ep;

    user = get_current_user();
    error = -ENOMEM;
    ep = kzalloc(sizeof(*ep), GFP_KERNEL);
    if (unlikely(!ep))
        goto free_uid;

    spin_lock_init(&ep->lock);
    mutex_init(&ep->mtx);
    init_waitqueue_head(&ep->wq);
    init_waitqueue_head(&ep->poll_wait);
    INIT_LIST_HEAD(&ep->rdllist);
    ep->rbr = RB_ROOT;
    ep->ovflist = EP_UNACTIVE_PTR;
    ep->user = user;

    *pep = ep;

    return 0;

free_uid:
    free_uid(user);
    return error;
}

Next it calls get_unused_fd_flags to allocate an unused file descriptor in the current process.

sys_epoll_create -> sys_epoll_create1 -> get_unused_fd_flags:

int get_unused_fd_flags(unsigned flags)
{
    return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
}

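In the Linux kernel, current is a macro that yields a pointer to the task_struct (the process descriptor) of the current process. The files a process has open are tracked through the files member of its process descriptor, so current->files refers to the open-file table of the current process. rlimit(RLIMIT_NOFILE) returns the file descriptor limit of the current process (one greater than the highest fd number it may use); this value is configurable, and the soft limit commonly defaults to 1024.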
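The role of RLIMIT_NOFILE as the upper bound passed to __alloc_fd can be seen from user space. The sketch below is an illustration only: `epoll_create_errno_at_limit` is a made-up name, and the example temporarily lowers the soft limit of the calling process to 0, so it should not be run while other threads are opening files.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: sets the RLIMIT_NOFILE soft limit to 0, tries to
 * create an epoll instance, restores the limit, and returns the errno of
 * the failed call (0 if it unexpectedly succeeded, -1 on setup failure). */
int epoll_create_errno_at_limit(void)
{
    struct rlimit saved, tight;

    if (getrlimit(RLIMIT_NOFILE, &saved) < 0)
        return -1;
    tight = saved;
    tight.rlim_cur = 0;                 /* no descriptor number is allowed */
    if (setrlimit(RLIMIT_NOFILE, &tight) < 0)
        return -1;

    int epfd = epoll_create1(0);
    int err = (epfd < 0) ? errno : 0;
    if (epfd >= 0)
        close(epfd);

    /* Raising the soft limit back to its old value stays within the
     * unchanged hard limit, so this is expected to succeed. */
    int rc = setrlimit(RLIMIT_NOFILE, &saved);
    assert(rc == 0);
    (void)rc;
    return err;
}
```

Because every candidate fd is then greater than or equal to the limit, the allocation is expected to fail with EMFILE, which matches the `fd >= end` test in the kernel code.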

__alloc_fd allocates an available file descriptor for the process in the range [start, end) (here start is 0 and end is the RLIMIT_NOFILE limit of the process). We will not dig any deeper; the code is:

sys_epoll_create -> sys_epoll_create1 -> get_unused_fd_flags -> __alloc_fd:

/*
 * allocate a file descriptor, mark it busy.
 */
int __alloc_fd(struct files_struct *files,
           unsigned start, unsigned end, unsigned flags)
{
    unsigned int fd;
    int error;
    struct fdtable *fdt;

    spin_lock(&files->file_lock);
repeat:
    fdt = files_fdtable(files);
    fd = start;
    if (fd < files->next_fd)
        fd = files->next_fd;

    if (fd < fdt->max_fds)
        fd = find_next_fd(fdt, fd);

    /*
     * N.B. For clone tasks sharing a files structure, this test
     * will limit the total number of files that can be opened.
     */
    error = -EMFILE;
    if (fd >= end)
        goto out;

    error = expand_files(files, fd);
    if (error < 0)
        goto out;

    /*
     * If we needed to expand the fs array we
     * might have blocked - try again.
     */
    if (error)
        goto repeat;

    if (start <= files->next_fd)
        files->next_fd = fd + 1;

    __set_open_fd(fd, fdt);
    if (flags & O_CLOEXEC)
        __set_close_on_exec(fd, fdt);
    else
        __clear_close_on_exec(fd, fdt);
    error = fd;
#if 1
    /* Sanity check */
    if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
        printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
        rcu_assign_pointer(fdt->fd[fd], NULL);
    }
#endif

out:
    spin_unlock(&files->file_lock);
    return error;
}
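The close-on-exec handling in __alloc_fd, together with the flag check in epoll_create1, is visible from user space through fcntl(). The following is a minimal sketch for Linux; `epfd_has_cloexec` is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: creates an epoll fd with the given flags and
 * returns 1 if FD_CLOEXEC is set on it, 0 if not, -1 if creation fails. */
int epfd_has_cloexec(int flags)
{
    int epfd = epoll_create1(flags);

    if (epfd < 0)
        return -1;
    /* Mirrors the kernel's check: any bit other than EPOLL_CLOEXEC
     * would have been rejected with EINVAL above. */
    assert((flags & ~EPOLL_CLOEXEC) == 0);

    int fdflags = fcntl(epfd, F_GETFD);
    close(epfd);
    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) != 0;
}
```

Passing EPOLL_CLOEXEC should yield 1, passing 0 should yield 0, and any other bit (for example 1) should make creation fail.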

Then epoll_create1 calls anon_inode_getfile to create a file structure:

sys_epoll_create -> sys_epoll_create1 -> anon_inode_getfile:

/**
 * anon_inode_getfile - creates a new file instance by hooking it up to an
 *                      anonymous inode, and a dentry that describe the "class"
 *                      of the file
 *
 * @name:    [in]    name of the "class" of the new file
 * @fops:    [in]    file operations for the new file
 * @priv:    [in]    private data for the new file (will be file's private_data)
 * @flags:   [in]    flags
 *
 * Creates a new file by hooking it on a single inode. This is useful for files
 * that do not need to have a full-fledged inode in order to operate correctly.
 * All the files created with anon_inode_getfile() will share a single inode,
 * hence saving memory and avoiding code duplication for the file/inode/dentry
 * setup.  Returns the newly created file* or an error pointer.
 */
struct file *anon_inode_getfile(const char *name,
                const struct file_operations *fops,
                void *priv, int flags)
{
    struct qstr this;
    struct path path;
    struct file *file;

    if (IS_ERR(anon_inode_inode))
        return ERR_PTR(-ENODEV);

    if (fops->owner && !try_module_get(fops->owner))
        return ERR_PTR(-ENOENT);

    /*
     * Link the inode to a directory entry by creating a unique name
     * using the inode sequence number.
     */
    file = ERR_PTR(-ENOMEM);
    this.name = name;
    this.len = strlen(name);
    this.hash = 0;
    path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
    if (!path.dentry)
        goto err_module;

    path.mnt = mntget(anon_inode_mnt);
    /*
     * We know the anon_inode inode count is always greater than zero,
     * so ihold() is safe.
     */
    ihold(anon_inode_inode);

    d_instantiate(path.dentry, anon_inode_inode);

    file = alloc_file(&path, OPEN_FMODE(flags), fops);
    if (IS_ERR(file))
        goto err_dput;
    file->f_mapping = anon_inode_inode->i_mapping;

    file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
    file->private_data = priv;

    return file;

err_dput:
    path_put(&path);
err_module:
    module_put(fops->owner);
    return file;
}

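anon_inode_getfile allocates a dentry structure and a file structure, then hooks the file up to the anonymous inode anon_inode_inode. Note that the eventpoll pointer ep allocated earlier is passed in when anon_inode_getfile is called, so the private_data of the new file points to ep; after anon_inode_getfile returns, ep->file is set to point to the new file structure.

A brief word on file/dentry/inode. When a process opens a file, the kernel allocates a file structure that represents the open file in the context of that process, and the application accesses it through a file descriptor of type int. In effect the kernel keeps a per-process array of file pointers, and the file descriptor is the index of the corresponding file structure in that array.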

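The dentry structure (the "directory entry") records the logical attributes of a file, such as its name. A process may open the same file several times, and several processes may open the same file; in all these cases the kernel allocates multiple file structures, one per open, each with its own context. For a given file name, however, the kernel uses a single dentry no matter how many times the file is opened. So the relationship between file and dentry structures is many-to-one.

Besides the dentry, each file also has an index node, the inode structure, which records information such as the file's location and layout on the storage medium. The kernel allocates only one inode per file. A dentry and an inode describe different things: a file may have several names (through links, for example), and access to the same file through different names may be subject to different permissions. The dentry represents the file in the logical sense and records its logical attributes, while the inode represents the file in the physical sense and records its physical attributes. The relationship between dentry and inode structures is many-to-one.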

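Finally, epoll_create1 calls fd_install to associate fd with file, after which the kernel can reach the file structure through the fd passed in by the application. This code is fairly simple, so we will not go any deeper.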

sys_epoll_create -> sys_epoll_create1 -> fd_install:

/*
 * Install a file pointer in the fd array.
 *
 * The VFS is full of places where we drop the files lock between
 * setting the open_fds bitmap and installing the file in the file
 * array.  At any such point, we are vulnerable to a dup2() race
 * installing a file in the array before us.  We need to detect this and
 * fput() the struct file we are about to overwrite in this case.
 *
 * It should never happen - if we allow dup2() do it, _really_ bad things
 * will follow.
 *
 * NOTE: __fd_install() variant is really, really low-level; don't
 * use it unless you are forced to by truly lousy API shoved down
 * your throat.  'files' *MUST* be either current->files or obtained
 * by get_files_struct(current) done by whoever had given it to you,
 * or really bad things will happen.  Normally you want to use
 * fd_install() instead.
 */

void __fd_install(struct files_struct *files, unsigned int fd,
        struct file *file)
{
    struct fdtable *fdt;

    might_sleep();
    rcu_read_lock_sched();

    while (unlikely(files->resize_in_progress)) {
        rcu_read_unlock_sched();
        wait_event(files->resize_wait, !files->resize_in_progress);
        rcu_read_lock_sched();
    }
    /* coupled with smp_wmb() in expand_fdtable() */
    smp_rmb();
    fdt = rcu_dereference_sched(files->fdt);
    BUG_ON(fdt->fd[fd] != NULL);
    rcu_assign_pointer(fdt->fd[fd], file);
    rcu_read_unlock_sched();
}

void fd_install(unsigned int fd, struct file *file)
{
    __fd_install(current->files, fd, file);
}

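To summarize what epoll_create does: the call allocates an eventpoll structure and a file structure representing the epoll file in the kernel, links the two together, and returns an epoll file descriptor fd that is bound to that file. When an application operates on epoll it passes in the epoll fd; the kernel uses the fd to find the epoll file structure and then, through file->private_data, the eventpoll structure allocated by epoll_create, where all the important epoll state is stored. All subsequent epoll operations work on this eventpoll structure.

In short, epoll_create builds a path in the kernel from the epoll file descriptor to the eventpoll structure.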
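The fd to file to eventpoll path can be modelled in a few lines of plain C. This is a toy model, not kernel code: every `toy_` name is invented for illustration, the fd table is a fixed array, and there is no locking or reference counting.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FDS 16

struct toy_file_ops { const char *name; };

struct toy_file {
    const struct toy_file_ops *f_op;
    void *private_data;            /* points at the toy_eventpoll */
};

struct toy_eventpoll {
    struct toy_file *file;         /* back pointer, like ep->file */
};

static const struct toy_file_ops toy_eventpoll_fops = { "eventpoll" };
static struct toy_file *toy_fdtable[TOY_MAX_FDS];   /* index == fd */

/* Like is_file_epoll(): an epoll file is recognised by its f_op. */
static int toy_is_file_epoll(const struct toy_file *f)
{
    return f->f_op == &toy_eventpoll_fops;
}

/* Like the tail of epoll_create1(): link ep and file in both directions,
 * then install the file at the lowest free descriptor number. */
int toy_epoll_install(struct toy_eventpoll *ep, struct toy_file *file)
{
    assert(ep != NULL && file != NULL);
    file->f_op = &toy_eventpoll_fops;
    file->private_data = ep;
    ep->file = file;
    for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
        if (toy_fdtable[fd] == NULL) {
            toy_fdtable[fd] = file;
            return fd;
        }
    }
    return -1;                     /* table full */
}

/* The path used by later calls: fd -> file -> eventpoll. */
struct toy_eventpoll *toy_ep_from_fd(int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || toy_fdtable[fd] == NULL)
        return NULL;
    struct toy_file *f = toy_fdtable[fd];
    if (!toy_is_file_epoll(f))
        return NULL;
    return f->private_data;
}
```

The lookup in `toy_ep_from_fd` is the same pattern that epoll_ctl and epoll_wait follow below: fetch the file for the fd, verify that it is an epoll file, then read private_data.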

epoll_ctl

epoll_ctl adds, modifies, or removes the events monitored for a file. The kernel code is as follows:

sys_epoll_ctl:

/*
 * The following function implements the controller interface for
 * the eventpoll file that enables the insertion/removal/change of
 * file descriptors inside the interest set.
 */
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
        struct epoll_event __user *, event)
{
    int error;
    int full_check = 0;
    struct fd f, tf;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;
    struct eventpoll *tep = NULL;

    error = -EFAULT;
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;

    error = -EBADF;
    f = fdget(epfd);
    if (!f.file)
        goto error_return;

    /* Get the "struct file *" for the target file */
    tf = fdget(fd);
    if (!tf.file)
        goto error_fput;

    /* The target file descriptor must support poll */
    error = -EPERM;
    if (!tf.file->f_op->poll)
        goto error_tgt_fput;

    /* Check if EPOLLWAKEUP is allowed */
    if (ep_op_has_event(op))
        ep_take_care_of_epollwakeup(&epds);

    /*
     * We have to check that the file structure underneath the file descriptor
     * the user passed to us _is_ an eventpoll file. And also we do not permit
     * adding an epoll file descriptor inside itself.
     */
    error = -EINVAL;
    if (f.file == tf.file || !is_file_epoll(f.file))
        goto error_tgt_fput;

    /*
     * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
     * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
     * Also, we do not currently supported nested exclusive wakeups.
     */
    if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
        if (op == EPOLL_CTL_MOD)
            goto error_tgt_fput;
        if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
                (epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
            goto error_tgt_fput;
    }

    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    ep = f.file->private_data;

    /*
     * When we insert an epoll file descriptor, inside another epoll file
     * descriptor, there is the change of creating closed loops, which are
     * better be handled here, than in more critical paths. While we are
     * checking for loops we also determine the list of files reachable
     * and hang them on the tfile_check_list, so we can check that we
     * haven't created too many possible wakeup paths.
     *
     * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
     * the epoll file descriptor is attaching directly to a wakeup source,
     * unless the epoll file descriptor is nested. The purpose of taking the
     * 'epmutex' on add is to prevent complex toplogies such as loops and
     * deep wakeup paths from forming in parallel through multiple
     * EPOLL_CTL_ADD operations.
     */
    mutex_lock_nested(&ep->mtx, 0);
    if (op == EPOLL_CTL_ADD) {
        if (!list_empty(&f.file->f_ep_links) ||
                        is_file_epoll(tf.file)) {
            full_check = 1;
            mutex_unlock(&ep->mtx);
            mutex_lock(&epmutex);
            if (is_file_epoll(tf.file)) {
                error = -ELOOP;
                if (ep_loop_check(ep, tf.file) != 0) {
                    clear_tfile_check_list();
                    goto error_tgt_fput;
                }
            } else
                list_add(&tf.file->f_tfile_llink,
                            &tfile_check_list);
            mutex_lock_nested(&ep->mtx, 0);
            if (is_file_epoll(tf.file)) {
                tep = tf.file->private_data;
                mutex_lock_nested(&tep->mtx, 1);
            }
        }
    }

    /*
     * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
     * above, we can be sure to be able to use the item looked up by
     * ep_find() till we release the mutex.
     */
    epi = ep_find(ep, tf.file, fd);

    error = -EINVAL;
    switch (op) {
    case EPOLL_CTL_ADD:
        if (!epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_insert(ep, &epds, tf.file, fd, full_check);
        } else
            error = -EEXIST;
        if (full_check)
            clear_tfile_check_list();
        break;
    case EPOLL_CTL_DEL:
        if (epi)
            error = ep_remove(ep, epi);
        else
            error = -ENOENT;
        break;
    case EPOLL_CTL_MOD:
        if (epi) {
            if (!(epi->event.events & EPOLLEXCLUSIVE)) {
                epds.events |= POLLERR | POLLHUP;
                error = ep_modify(ep, epi, &epds);
            }
        } else
            error = -ENOENT;
        break;
    }
    if (tep != NULL)
        mutex_unlock(&tep->mtx);
    mutex_unlock(&ep->mtx);

error_tgt_fput:
    if (full_check)
        mutex_unlock(&epmutex);

    fdput(tf);
error_fput:
    fdput(f);
error_return:

    return error;
}

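As described earlier, op is the operation to perform on the epoll instance (add, modify, or delete an event). ep_op_has_event(op) tests whether the operation is anything other than a delete: when op != EPOLL_CTL_DEL, copy_from_user is called to copy the event passed from user space into the kernel variable epds. Only the delete operation does not need the event supplied by the process.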
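A user-space consequence of ep_op_has_event is that EPOLL_CTL_DEL accepts a NULL event pointer while EPOLL_CTL_ADD does not. The sketch below illustrates this on Linux; both function names are hypothetical.

```c
#include <assert.h>   /* for the usage checks mentioned after the block */
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers a pipe, then deletes it passing NULL as
 * the event. Returns the result of the delete (0 on success), or -1. */
int del_with_null_event(void)
{
    int pfd[2], rc = -1;

    if (pipe(pfd) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0)
            rc = epoll_ctl(epfd, EPOLL_CTL_DEL, pfd[0], NULL);
        close(epfd);
    }
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}

/* Hypothetical helper: tries to ADD with a NULL event and returns the
 * resulting errno (0 if the call unexpectedly succeeded, -1 on setup
 * failure). The copy_from_user() step is what rejects it. */
int add_with_null_event_errno(void)
{
    int pfd[2], err = -1;

    if (pipe(pfd) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        err = epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], NULL) == 0 ? 0 : errno;
        close(epfd);
    }
    close(pfd[0]);
    close(pfd[1]);
    return err;
}
```

With these helpers, `assert(del_with_null_event() == 0)` and `assert(add_with_null_event_errno() == EFAULT)` should hold on the kernels discussed here.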

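Next, two consecutive fdget calls obtain the file structures of the epoll file and of the monitored file (hereafter called the target file). (Note: fdget returns a struct fd, which wraps the file pointer.)

Then come a number of argument checks. In any of the following cases the arguments are considered invalid and an error is returned immediately:

the target file does not support the poll operation (!tf.file->f_op->poll), which returns -EPERM;
the target file is the epoll file itself (f.file == tf.file), which returns -EINVAL;
the file the user passed as epfd is not actually an epoll file (!is_file_epoll(f.file)), which returns -EINVAL;
the operation is a modify and the event mask contains EPOLLEXCLUSIVE, which returns -EINVAL; and so on.

There are further checks that apply when the operation is an add. They are fairly simple, so they are not explained here and are left to the reader.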
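The first three checks can be triggered from user space. The sketch below assumes Linux and assumes that the root directory is a file without a poll operation (true for common filesystems); the function names are hypothetical.

```c
#include <assert.h>   /* for the usage checks mentioned after the block */
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 0 if EPOLL_CTL_ADD succeeds, otherwise errno. */
static int add_errno(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0 ? 0 : errno;
}

/* Target without poll support: a directory. */
int errno_adding_directory(void)
{
    int epfd = epoll_create1(0);
    int dirfd = open("/", O_RDONLY);
    int err = (epfd < 0 || dirfd < 0) ? -1 : add_errno(epfd, dirfd);

    if (dirfd >= 0)
        close(dirfd);
    if (epfd >= 0)
        close(epfd);
    return err;
}

/* Target is the epoll file itself. */
int errno_adding_self(void)
{
    int epfd = epoll_create1(0);
    int err = (epfd < 0) ? -1 : add_errno(epfd, epfd);

    if (epfd >= 0)
        close(epfd);
    return err;
}

/* "epfd" is a pipe, not an epoll file. */
int errno_non_epoll_epfd(void)
{
    int pfd[2];

    if (pipe(pfd) < 0)
        return -1;
    int err = add_errno(pfd[0], pfd[1]);
    close(pfd[0]);
    close(pfd[1]);
    return err;
}
```

The expected results are EPERM for the directory and EINVAL for the other two cases, matching the error values set before each check in sys_epoll_ctl.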

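ep maintains a red-black tree. Each time an event is registered, an epitem structure is allocated to represent the monitored item and is inserted into ep's red-black tree. epoll_ctl calls ep_find to look up the item for the target file in this tree; the result may be NULL.

The switch block that follows is the core of epoll_ctl. It dispatches on op into three cases: add (EPOLL_CTL_ADD), delete (EPOLL_CTL_DEL), and modify (EPOLL_CTL_MOD). I will walk through add as the example; the other two are similar, and once you understand how a monitored event is added, delete and modify follow by analogy.

When adding monitoring for a target file, ep must not already be monitoring that file; if it is (epi is not NULL), -EEXIST is returned. Otherwise the arguments are fine: POLLERR and POLLHUP are added to the event mask by default, and ep_insert is called to insert the item for the target file into the red-black tree maintained by ep: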

sys_epoll_ctl -> ep_insert:

/*
 * Must be called with "mtx" held.
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
             struct file *tfile, int fd, int full_check)
{
    int error, revents, pwake = 0;
    unsigned long flags;
    long user_watches;
    struct epitem *epi;
    struct ep_pqueue epq;

    user_watches = atomic_long_read(&ep->user->epoll_watches);
    if (unlikely(user_watches >= max_user_watches))
        return -ENOSPC;
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;

    /* Item initialization follow here ... */
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    ep_set_ffd(&epi->ffd, tfile, fd);
    epi->event = *event;
    epi->nwait = 0;
    epi->next = EP_UNACTIVE_PTR;
    if (epi->event.events & EPOLLWAKEUP) {
        error = ep_create_wakeup_source(epi);
        if (error)
            goto error_create_wakeup_source;
    } else {
        RCU_INIT_POINTER(epi->ws, NULL);
    }

    /* Initialize the poll table using the queue callback */
    epq.epi = epi;
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

    /*
     * Attach the item to the poll hooks and get current event bits.
     * We can safely use the file* here because its usage count has
     * been increased by the caller of this function. Note that after
     * this operation completes, the poll callback can start hitting
     * the new item.
     */
    revents = ep_item_poll(epi, &epq.pt);

    /*
     * We have to check if something went wrong during the poll wait queue
     * install process. Namely an allocation for a wait queue failed due
     * high memory pressure.
     */
    error = -ENOMEM;
    if (epi->nwait < 0)
        goto error_unregister;

    /* Add the current item to the list of active epoll hook for this file */
    spin_lock(&tfile->f_lock);
    list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
    spin_unlock(&tfile->f_lock);

    /*
     * Add the current item to the RB tree. All RB tree operations are
     * protected by "mtx", and ep_insert() is called with "mtx" held.
     */
    ep_rbtree_insert(ep, epi);

    /* now check if we've created too many backpaths */
    error = -EINVAL;
    if (full_check && reverse_path_check())
        goto error_remove_epi;

    /* We have to drop the new item inside our item list to keep track of it */
    spin_lock_irqsave(&ep->lock, flags);

    /* record NAPI ID of new item if present */
    ep_set_busy_poll_napi_id(epi);

    /* If the file is already "ready" we drop it inside the ready list */
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);

        /* Notify waiting tasks that events are available */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }

    spin_unlock_irqrestore(&ep->lock, flags);

    atomic_long_inc(&ep->user->epoll_watches);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);

    return 0;

error_remove_epi:
    spin_lock(&tfile->f_lock);
    list_del_rcu(&epi->fllink);
    spin_unlock(&tfile->f_lock);

    rb_erase(&epi->rbn, &ep->rbr);

error_unregister:
    ep_unregister_pollwait(ep, epi);

    /*
     * We need to do this because an event could have been arrived on some
     * allocated wait queue. Note that we don't care about the ep->ovflist
     * list, since that is used/cleaned only inside a section bound by "mtx".
     * And ep_insert() is called with "mtx" held.
     */
    spin_lock_irqsave(&ep->lock, flags);
    if (ep_is_linked(&epi->rdllink))
        list_del_init(&epi->rdllink);
    spin_unlock_irqrestore(&ep->lock, flags);

    wakeup_source_unregister(ep_wakeup_source(epi));

error_create_wakeup_source:
    kmem_cache_free(epi_cache, epi);

    return error;
}

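As mentioned, monitoring of a target file is tracked by an epitem structure. So ep_insert first calls kmem_cache_alloc to allocate an epitem from the slab allocator and then initializes it; there is nothing special here. Next, look at the call to ep_item_poll: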

sys_epoll_ctl -> ep_insert -> ep_item_poll:

static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
{
    pt->_key = epi->event.events;

    return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
}

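ep_item_poll calls the poll function of the target file, which points to different functions for different kinds of target files. For a socket, poll points to sock_poll, which for a TCP socket ends up in tcp_poll. Whatever function it points to, its job is the same: return the event bits currently pending on the target file, and hook the item into the target file's poll wait queue. The hooking is done through ep_ptable_queue_proc, the queue callback installed by init_poll_funcptr: the file's poll function invokes it, and it adds a wait queue entry whose wakeup function is ep_poll_callback. Once this step completes, ep_poll_callback is invoked whenever the target file produces an event.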
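The two-stage registration (a queue callback that runs during poll, and a wakeup callback that runs when an event occurs) can be sketched as a small user-space model. This is a simplification under stated assumptions: all `toy_` names are invented, wait queues are plain singly linked lists, and "ready" is a flag instead of a list.

```c
#include <assert.h>
#include <stddef.h>

struct toy_wait_entry;
typedef void (*toy_wake_fn)(struct toy_wait_entry *we);

struct toy_wait_entry {
    toy_wake_fn func;              /* run when the target wakes its waiters */
    struct toy_wait_entry *next;
    int *ready_flag;               /* stand-in for linking onto rdllist */
};

struct toy_target {                /* stand-in for a socket or pipe */
    struct toy_wait_entry *waiters;
    unsigned pending;              /* event bits currently pending */
};

struct toy_poll_table {
    void (*qproc)(struct toy_target *, struct toy_wait_entry *);
    struct toy_wait_entry *entry;
};

/* Plays the role of ep_poll_callback(): mark the item ready. */
static void toy_poll_callback(struct toy_wait_entry *we)
{
    *we->ready_flag = 1;
}

/* Plays the role of ep_ptable_queue_proc(): hook onto the target. */
static void toy_queue_proc(struct toy_target *t, struct toy_wait_entry *we)
{
    we->func = toy_poll_callback;
    we->next = t->waiters;
    t->waiters = we;
}

/* Plays the role of f_op->poll(): register if asked, report pending bits. */
unsigned toy_poll(struct toy_target *t, struct toy_poll_table *pt)
{
    if (pt != NULL && pt->qproc != NULL)
        pt->qproc(t, pt->entry);
    return t->pending;
}

/* The target produces an event and wakes everything on its wait queue. */
void toy_target_event(struct toy_target *t, unsigned bits)
{
    t->pending |= bits;
    for (struct toy_wait_entry *we = t->waiters; we != NULL; we = we->next)
        we->func(we);
}

/* Plays the role of the ep_item_poll() step inside ep_insert(). */
unsigned toy_item_register(struct toy_target *t, struct toy_wait_entry *we,
                           int *ready_flag)
{
    struct toy_poll_table pt = { toy_queue_proc, we };

    assert(t != NULL && we != NULL && ready_flag != NULL);
    we->ready_flag = ready_flag;
    return toy_poll(t, &pt);
}
```

Registering returns the bits that are already pending, and a later `toy_target_event` sets the ready flag without anyone polling the target again, which is the property that lets epoll avoid scanning idle descriptors.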

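Next, list_add_tail_rcu adds the current item to the target file's f_ep_links list. This is the target file's list of epoll hooks; every item that monitors the target file is linked into it.

Then ep_rbtree_insert adds the item epi to the red-black tree maintained by ep. No further explanation is needed; the code is: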

sys_epoll_ctl -> ep_insert -> ep_rbtree_insert:

static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
{
    int kcmp;
    struct rb_node **p = &ep->rbr.rb_node, *parent = NULL;
    struct epitem *epic;

    while (*p) {
        parent = *p;
        epic = rb_entry(parent, struct epitem, rbn);
        kcmp = ep_cmp_ffd(&epi->ffd, &epic->ffd);
        if (kcmp > 0)
            p = &parent->rb_right;
        else
            p = &parent->rb_left;
    }
    rb_link_node(&epi->rbn, parent, p);
    rb_insert_color(&epi->rbn, &ep->rbr);
}
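The insertion walk above orders items by their (file, fd) pair. The sketch below reproduces that walk on a plain, unbalanced binary search tree; it is not a red-black tree and omits rebalancing, and all `toy_` names as well as `toy_find` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_ffd {
    const void *file;              /* stand-in for struct file * */
    int fd;
};

struct toy_item {
    struct toy_ffd ffd;
    struct toy_item *left, *right;
};

/* Same ordering idea as ep_cmp_ffd(): file pointer first, then fd. */
static int toy_cmp_ffd(const struct toy_ffd *a, const struct toy_ffd *b)
{
    uintptr_t fa = (uintptr_t)a->file, fb = (uintptr_t)b->file;

    if (fa != fb)
        return fa > fb ? 1 : -1;
    return (a->fd > b->fd) - (a->fd < b->fd);
}

/* Walk down like ep_rbtree_insert(): right if greater, left otherwise. */
void toy_insert(struct toy_item **root, struct toy_item *item)
{
    struct toy_item **p = root;

    assert(root != NULL && item != NULL);
    item->left = item->right = NULL;
    while (*p != NULL)
        p = toy_cmp_ffd(&item->ffd, &(*p)->ffd) > 0 ? &(*p)->right
                                                    : &(*p)->left;
    *p = item;
}

/* Lookup by (file, fd), the job ep_find() does in epoll_ctl. */
struct toy_item *toy_find(struct toy_item *root, const void *file, int fd)
{
    struct toy_ffd key = { file, fd };

    while (root != NULL) {
        int c = toy_cmp_ffd(&key, &root->ffd);
        if (c == 0)
            return root;
        root = c > 0 ? root->right : root->left;
    }
    return NULL;
}
```

Keying on the pair rather than on the fd alone means that the same open file registered under two different descriptor numbers produces two distinct items.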

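As noted earlier, ep_insert calls ep_item_poll to obtain the event bits pending on the target file. Events that the process wants to monitor may already have occurred before epoll_ctl was called. If so (revents & event->events is nonzero) and the item for the target file is not yet linked into ep's ready list rdllist, the item is added to rdllist, which links the items of all target files monitored by this epoll descriptor that are currently ready. In addition, if a task is waiting for events, wake_up_locked is called to wake it so that it can handle them. A process that calls epoll_wait sits on ep's wq wait queue. The epoll_wait function is covered next.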
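The "already ready at add time" path can be observed directly: if data is written before the descriptor is registered, a non-blocking epoll_wait right after epoll_ctl still reports it. A minimal sketch for Linux follows; `ready_at_add_time` is a hypothetical name.

```c
#include <assert.h>   /* for the usage check mentioned after the block */
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: writes to a pipe BEFORE registering its read end,
 * then returns what a non-blocking epoll_wait() reports (-1 on failure). */
int ready_at_add_time(void)
{
    int pfd[2], n = -1;

    if (pipe(pfd) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN };
        struct epoll_event out;

        if (write(pfd[1], "x", 1) == 1 &&
            epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0)
            n = epoll_wait(epfd, &out, 1, 0);   /* timeout 0: no blocking */
        close(epfd);
    }
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```

No new event arrives after registration, so `assert(ready_at_add_time() == 1)` holds only because the add itself put the item on the ready list.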

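To summarize epoll_ctl: based on the requested events, it allocates an item for the target file and links that item into the red-black tree of the eventpoll structure.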
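The tree lookup also determines the error codes an application sees: adding a descriptor twice gives EEXIST, and deleting or modifying one that was never added gives ENOENT. The following sketch shows this on Linux; `ctl_errno` is a hypothetical helper.

```c
#include <assert.h>   /* for the usage checks mentioned after the block */
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: runs `op` on a pipe's read end, optionally after
 * registering it first. Returns 0 on success, errno on failure, and -1
 * if the setup itself failed. */
int ctl_errno(int op, int register_first)
{
    int pfd[2], err = -1;
    struct epoll_event ev = { .events = EPOLLIN };

    if (pipe(pfd) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        if (!register_first ||
            epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0)
            err = epoll_ctl(epfd, op, pfd[0], &ev) == 0 ? 0 : errno;
        close(epfd);
    }
    close(pfd[0]);
    close(pfd[1]);
    return err;
}
```

For example, `assert(ctl_errno(EPOLL_CTL_ADD, 1) == EEXIST)` and `assert(ctl_errno(EPOLL_CTL_DEL, 0) == ENOENT)` should hold, while a modify after a successful add returns 0.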

epoll_wait

epoll_wait waits for events. The kernel code is as follows:

sys_epoll_wait:

/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
        int, maxevents, int, timeout)
{
    int error;
    struct fd f;
    struct eventpoll *ep;

    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;

    /* Verify that the area passed by the user is writeable */
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
        return -EFAULT;

    /* Get the "struct file *" for the eventpoll file */
    f = fdget(epfd);
    if (!f.file)
        return -EBADF;

    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;
    if (!is_file_epoll(f.file))
        goto error_fput;

    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    ep = f.file->private_data;

    /* Time to fish for events ... */
    error = ep_poll(ep, events, maxevents, timeout);

error_fput:
    fdput(f);
    return error;
}

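First the arguments passed in by the process are checked:

maxevents must be greater than 0 and no greater than EP_MAX_EVENTS, otherwise -EINVAL is returned;
the user-space events buffer must be writable, otherwise -EFAULT is returned;
epfd must refer to an open file, otherwise -EBADF is returned, and that file must be a real epoll file, otherwise -EINVAL is returned.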
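These checks map onto errno values that can be reproduced from user space. The sketch below assumes Linux; the helper names are hypothetical.

```c
#include <assert.h>   /* for the usage checks mentioned after the block */
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 0 if epoll_wait() succeeds, otherwise errno. The buffer holds
 * one event, so callers here only pass maxevents of 0 or 1. */
static int wait_errno(int fd, int maxevents)
{
    struct epoll_event out[1];

    return epoll_wait(fd, out, maxevents, 0) < 0 ? errno : 0;
}

/* maxevents == 0 on a valid epoll fd. */
int wait_errno_bad_maxevents(void)
{
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    int err = wait_errno(epfd, 0);
    close(epfd);
    return err;
}

/* A descriptor number that is not open at all. */
int wait_errno_bad_fd(void)
{
    return wait_errno(-1, 1);
}

/* An open descriptor that is not an epoll file. */
int wait_errno_not_epoll(void)
{
    int pfd[2];

    if (pipe(pfd) < 0)
        return -1;
    int err = wait_errno(pfd[0], 1);
    close(pfd[0]);
    close(pfd[1]);
    return err;
}
```

The expected values are EINVAL, EBADF, and EINVAL respectively, which distinguishes "not an open descriptor" from "open, but not an epoll instance".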

Once all arguments pass the checks, ep_poll is called to do the real work:

sys_epoll_wait -> ep_poll:

/**
 * ep_poll - Retrieves ready events, and delivers them to the caller supplied
 *           event buffer.
 *
 * @ep: Pointer to the eventpoll context.
 * @events: Pointer to the userspace buffer where the ready events should be
 *          stored.
 * @maxevents: Size (in terms of number of events) of the caller event buffer.
 * @timeout: Maximum timeout for the ready events fetch operation, in
 *           milliseconds. If the @timeout is zero, the function will not block,
 *           while if the @timeout is less than zero, the function will block
 *           until at least one event has been retrieved (or an error
 *           occurred).
 *
 * Returns: Returns the number of ready events which have been fetched, or an
 *          error code, in case of error.
 */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
           int maxevents, long timeout)
{
    int res = 0, eavail, timed_out = 0;
    unsigned long flags;
    u64 slack = 0;
    wait_queue_t wait;
    ktime_t expires, *to = NULL;

    if (timeout > 0) {
        struct timespec64 end_time = ep_set_mstimeout(timeout);

        slack = select_estimate_accuracy(&end_time);
        to = &expires;
        *to = timespec64_to_ktime(end_time);
    } else if (timeout == 0) {
        /*
         * Avoid the unnecessary trip to the wait queue loop, if the
         * caller specified a non blocking operation.
         */
        timed_out = 1;
        spin_lock_irqsave(&ep->lock, flags);
        goto check_events;
    }

fetch_events:

    if (!ep_events_available(ep))
        ep_busy_loop(ep, timed_out);

    spin_lock_irqsave(&ep->lock, flags);

    if (!ep_events_available(ep)) {
        /*
         * Busy poll timed out.  Drop NAPI ID for now, we can add
         * it back in when we have moved a socket with a valid NAPI
         * ID onto the ready list.
         */
        ep_reset_busy_poll_napi_id(ep);

        /*
         * We don't have any available event to return to the caller.
         * We need to sleep here, and we will be wake up by
         * ep_poll_callback() when events will become available.
         */
        init_waitqueue_entry(&wait, current);
        __add_wait_queue_exclusive(&ep->wq, &wait);

        for (;;) {
            /*
             * We don't want to sleep if the ep_poll_callback() sends us
             * a wakeup in between. That's why we set the task state
             * to TASK_INTERRUPTIBLE before doing the checks.
             */
            set_current_state(TASK_INTERRUPTIBLE);
            if (ep_events_available(ep) || timed_out)
                break;
            if (signal_pending(current)) {
                res = -EINTR;
                break;
            }

            spin_unlock_irqrestore(&ep->lock, flags);
            if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
                timed_out = 1;

            spin_lock_irqsave(&ep->lock, flags);
        }

        __remove_wait_queue(&ep->wq, &wait);
        __set_current_state(TASK_RUNNING);
    }
check_events:
    /* Is it worth to try to dig for events ? */
    eavail = ep_events_available(ep);

    spin_unlock_irqrestore(&ep->lock, flags);

    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */
    if (!res && eavail &&
        !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
        goto fetch_events;

    return res;
}

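ep_poll first handles the wait time. timeout is in milliseconds. If timeout is greater than 0, the call times out after waiting that long; if timeout is 0, the function does not block and returns immediately; if timeout is negative, the call blocks indefinitely and returns only once an event occurs.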

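When no events are available (!ep_events_available(ep) is true), __add_wait_queue_exclusive adds the current process to the ep->wq wait queue. Then, inside an infinite for loop, set_current_state(TASK_INTERRUPTIBLE) first marks the current process as sleeping interruptibly. If there are still no events, no timeout, and no pending signal, the process gives up the CPU through schedule_hrtimeout_range and sleeps until another context calls wake_up, the timer expires, or a signal arrives; only then does it continue with the code that follows.

After waking up, the process checks whether events are available, whether the timeout has expired, or whether it was woken by a signal. In any of these cases it breaks out of the loop, removes itself from the ep->wq wait queue, and sets its state back to TASK_RUNNING.

If events really are available, ep_send_events is called to transfer them to user space.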

sys_epoll_wait -> ep_poll -> ep_send_events:

static int ep_send_events(struct eventpoll *ep,
              struct epoll_event __user *events, int maxevents)
{
    struct ep_send_events_data esed;

    esed.maxevents = maxevents;
    esed.events = events;

    return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
}

ep_send_events itself does little; the real work happens in ep_scan_ready_list:

sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list:

/**
 * ep_scan_ready_list - Scans the ready list in a way that makes possible for
 *                      the scan code, to call f_op->poll(). Also allows for
 *                      O(NumReady) performance.
 *
 * @ep: Pointer to the epoll private data structure.
 * @sproc: Pointer to the scan callback.
 * @priv: Private opaque data passed to the @sproc callback.
 * @depth: The current depth of recursive f_op->poll calls.
 * @ep_locked: caller already holds ep->mtx
 *
 * Returns: The same integer error code returned by the @sproc callback.
 */
static int ep_scan_ready_list(struct eventpoll *ep,
                  int (*sproc)(struct eventpoll *,
                       struct list_head *, void *),
                  void *priv, int depth, bool ep_locked)
{
    int error, pwake = 0;
    unsigned long flags;
    struct epitem *epi, *nepi;
    LIST_HEAD(txlist);

    /*
     * We need to lock this because we could be hit by
     * eventpoll_release_file() and epoll_ctl().
     */

    if (!ep_locked)
        mutex_lock_nested(&ep->mtx, depth);

    /*
     * Steal the ready list, and re-init the original one to the
     * empty list. Also, set ep->ovflist to NULL so that events
     * happening while looping w/out locks, are not lost. We cannot
     * have the poll callback to queue directly on ep->rdllist,
     * because we want the "sproc" callback to be able to do it
     * in a lockless way.
     */
    spin_lock_irqsave(&ep->lock, flags);
    list_splice_init(&ep->rdllist, &txlist);
    ep->ovflist = NULL;
    spin_unlock_irqrestore(&ep->lock, flags);

    /*
     * Now call the callback function.
     */
    error = (*sproc)(ep, &txlist, priv);

    spin_lock_irqsave(&ep->lock, flags);
    /*
     * During the time we spent inside the "sproc" callback, some
     * other events might have been queued by the poll callback.
     * We re-insert them inside the main ready-list here.
     */
    for (nepi = ep->ovflist; (epi = nepi) != NULL;
         nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
        /*
         * We need to check if the item is already in the list.
         * During the "sproc" callback execution time, items are
         * queued into ->ovflist but the "txlist" might already
         * contain them, and the list_splice() below takes care of them.
         */
        if (!ep_is_linked(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist);
            ep_pm_stay_awake(epi);
        }
    }
    /*
     * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
     * releasing the lock, events will be queued in the normal way inside
     * ep->rdllist.
     */
    ep->ovflist = EP_UNACTIVE_PTR;

    /*
     * Quickly re-inject items left on "txlist".
     */
    list_splice(&txlist, &ep->rdllist);
    __pm_relax(ep->ws);

    if (!list_empty(&ep->rdllist)) {
        /*
         * Wake up (if active) both the eventpoll wait list and
         * the ->poll() wait list (delayed after we release the lock).
         */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    spin_unlock_irqrestore(&ep->lock, flags);

    if (!ep_locked)
        mutex_unlock(&ep->mtx);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);

    return error;
}

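ep_scan_ready_list first splices the contents of ep's ready list (rdllist) onto txlist, a list head local to the function (not a global one), and re-initializes the ready list to empty. It also sets ep->ovflist to NULL. ovflist is a singly linked list that serves as a backup for collecting ready events: while the kernel is copying events to user space, the target files may raise new events, and during that window the new events are linked onto ovflist instead of rdllist.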

Next, it calls the sproc callback (here, ep_send_events_proc), which copies the event data from the kernel to user space.

sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list -> ep_send_events_proc:

static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
                   void *priv)
{
    struct ep_send_events_data *esed = priv;
    int eventcnt;
    unsigned int revents;
    struct epitem *epi;
    struct epoll_event __user *uevent;
    struct wakeup_source *ws;
    poll_table pt;

    init_poll_funcptr(&pt, NULL);

    /*
     * We can loop without lock because we are passed a task private list.
     * Items cannot vanish during the loop because ep_scan_ready_list() is
     * holding "mtx" during this call.
     */
    for (eventcnt = 0, uevent = esed->events;
         !list_empty(head) && eventcnt < esed->maxevents;) {
        epi = list_first_entry(head, struct epitem, rdllink);

        /*
         * Activate ep->ws before deactivating epi->ws to prevent
         * triggering auto-suspend here (in case we reactive epi->ws
         * below).
         *
         * This could be rearranged to delay the deactivation of epi->ws
         * instead, but then epi->ws would temporarily be out of sync
         * with ep_is_linked().
         */
        ws = ep_wakeup_source(epi);
        if (ws) {
            if (ws->active)
                __pm_stay_awake(ep->ws);
            __pm_relax(ws);
        }

        list_del_init(&epi->rdllink);

        revents = ep_item_poll(epi, &pt);

        /*
         * If the event mask intersect the caller-requested one,
         * deliver the event to userspace. Again, ep_scan_ready_list()
         * is holding "mtx", so no operations coming from userspace
         * can change the item.
         */
        if (revents) {
            if (__put_user(revents, &uevent->events) ||
                __put_user(epi->event.data, &uevent->data)) {
                list_add(&epi->rdllink, head);
                ep_pm_stay_awake(epi);
                return eventcnt ? eventcnt : -EFAULT;
            }
            eventcnt++;
            uevent++;
            if (epi->event.events & EPOLLONESHOT)
                epi->event.events &= EP_PRIVATE_BITS;
            else if (!(epi->event.events & EPOLLET)) {
                /*
                 * If this file has been added with Level
                 * Trigger mode, we need to insert back inside
                 * the ready list, so that the next call to
                 * epoll_wait() will check again the events
                 * availability. At this point, no one can insert
                 * into ep->rdllist besides us. The epoll_ctl()
                 * callers are locked out by
                 * ep_scan_ready_list() holding "mtx" and the
                 * poll callback will queue them in ep->ovflist.
                 */
                list_add_tail(&epi->rdllink, &ep->rdllist);
                ep_pm_stay_awake(epi);
            }
        }
    }

    return eventcnt;
}

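The ep_send_events_proc callback loops over the items on the list it is given. For each item it calls ep_item_poll to fetch the events currently pending on the target file; if any are pending, it copies them to user space with __put_user. Items registered in level-triggered mode (neither EPOLLET nor EPOLLONESHOT set) are then put back on ep->rdllist, so that the next epoll_wait call checks them again.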
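The re-queueing branch at the end of ep_send_events_proc is what user space sees as the difference between level-triggered and edge-triggered mode. Below is a minimal sketch of how one might observe it, assuming a Linux system; count_reports is a hypothetical helper name, not part of any API. It writes one byte into a pipe, never reads it, and counts how many consecutive non-blocking epoll_wait calls report the read end.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Write one byte into a pipe, never read it, and count how many of
 * `rounds` non-blocking epoll_wait() calls report the read end.
 * `extra_flags` is 0 (level-triggered), EPOLLET or EPOLLONESHOT.
 * Returns the count, or -1 if the setup fails.
 */
static int count_reports(unsigned int extra_flags, int rounds)
{
    int fds[2], epfd, i, hits = -1;
    struct epoll_event ev = {0}, out;

    if (pipe(fds) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0)
        goto out_ep;
    if (write(fds[1], "x", 1) != 1)
        goto out_ep;

    hits = 0;
    for (i = 0; i < rounds; i++)
        if (epoll_wait(epfd, &out, 1, 0) == 1)  /* timeout 0: do not sleep */
            hits++;
out_ep:
    close(epfd);
out_pipe:
    close(fds[0]);
    close(fds[1]);
    return hits;
}
```

With the byte left unread, count_reports(0, 3) should return 3, because the level-triggered item is re-queued after every delivery, while count_reports(EPOLLET, 3) and count_reports(EPOLLONESHOT, 3) should return 1, because neither branch puts the item back on the ready list.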

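Back in ep_scan_ready_list: as noted above, while the sproc callback runs, target files may raise new events, which are linked onto ovflist. Once the callback returns, those events therefore have to be moved from ovflist onto the rdllist ready list.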
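The hand-off between rdllist, txlist and ovflist can be sketched in user space as follows. This is a simplified single-threaded model, not kernel code: it has no locking, and all names (struct item, struct poller, event_arrives, scan_begin, tx_pop, scan_end) are made up for illustration. Only the order of operations is meant to mirror ep_scan_ready_list and the poll callback.

```c
#include <stddef.h>

/* Sentinel for "not on the overflow list", in the spirit of EP_UNACTIVE_PTR. */
#define UNACTIVE ((struct item *)-1L)

struct item {
    int id;
    int linked;             /* 1 while on the ready list or a private list */
    struct item *next;      /* ready/private list link */
    struct item *ovf_next;  /* overflow list link */
};

struct poller {
    struct item *head, *tail;  /* ready list (rdllist analogue) */
    struct item *ovflist;      /* UNACTIVE unless a scan is in progress */
};

static void poller_init(struct poller *p)
{
    p->head = p->tail = NULL;
    p->ovflist = UNACTIVE;
}

static void item_init(struct item *it, int id)
{
    it->id = id;
    it->linked = 0;
    it->next = NULL;
    it->ovf_next = UNACTIVE;
}

static void ready_add_tail(struct poller *p, struct item *it)
{
    it->next = NULL;
    it->linked = 1;
    if (p->tail)
        p->tail->next = it;
    else
        p->head = it;
    p->tail = it;
}

/* What the poll callback does when a file reports an event. */
static void event_arrives(struct poller *p, struct item *it)
{
    if (p->ovflist != UNACTIVE) {        /* a scan is running */
        if (it->ovf_next == UNACTIVE) {  /* not queued on ovflist yet */
            it->ovf_next = p->ovflist;
            p->ovflist = it;
        }
        return;
    }
    if (!it->linked)
        ready_add_tail(p, it);
}

/* Steal the ready list into a private list and open the overflow list. */
static struct item *scan_begin(struct poller *p)
{
    struct item *tx = p->head;
    p->head = p->tail = NULL;
    p->ovflist = NULL;
    return tx;
}

/* Take the first item off the private list (list_del_init analogue). */
static struct item *tx_pop(struct item **tx)
{
    struct item *it = *tx;
    if (it) {
        *tx = it->next;
        it->next = NULL;
        it->linked = 0;
    }
    return it;
}

/* Merge overflow items back, close ovflist, re-inject leftovers at the head. */
static void scan_end(struct poller *p, struct item *tx_left)
{
    struct item *it, *n;

    for (n = p->ovflist; (it = n) != NULL; ) {
        n = it->ovf_next;
        it->ovf_next = UNACTIVE;
        if (!it->linked)           /* skip items still on the private list */
            ready_add_tail(p, it);
    }
    p->ovflist = UNACTIVE;

    if (tx_left) {
        struct item *last = tx_left;
        while (last->next)
            last = last->next;
        last->next = p->head;
        if (!p->head)
            p->tail = last;
        p->head = tx_left;
    }
}
```

An event that arrives between scan_begin and scan_end never touches the ready list directly, which is why the consumer can walk its private list without holding a lock. Items that are still on the private list when they fire again are skipped during the merge, since re-injecting the leftovers already covers them.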

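Finally, if rdllist is not empty (that is, ready events remain) and a process is waiting on ep->wq, wake_up_locked is called to wake it up again so that it handles the pending events (the flow is the same as before: the events are copied to user space). If anything is waiting on ep->poll_wait, it is woken through ep_poll_safewake after the lock has been released.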

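That completes the epoll_wait flow, but one question remains: as mentioned earlier, a process that calls epoll_wait goes to sleep, so when is it woken up? When epoll_ctl registers an item for a target file, it installs the ep_ptable_queue_proc callback for that item. ep_ptable_queue_proc adds a wait-queue entry for the item to the target file's wait queue and registers ep_poll_callback as that entry's wakeup function. When the target file raises an event, ep_poll_callback runs and wakes the processes sleeping on the epoll wait queue.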
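The registration and wakeup path can be modelled in a few lines of plain C. This is only a toy model under loud assumptions: struct file_model, struct watcher, watcher_register, watcher_callback and file_signal are invented names, the "wait queue" is a bare singly linked list, and "waking the sleeper" is reduced to incrementing a counter. It is meant to show the shape of the callback chain, not the kernel's data structures.

```c
#include <stddef.h>

struct wait_entry;
typedef void (*wake_fn)(struct wait_entry *we);

struct wait_entry {
    wake_fn func;             /* called when the file signals an event */
    void *owner;              /* back-pointer to the watching item */
    struct wait_entry *next;
};

/* Stand-in for a target file: all it has is a wait queue. */
struct file_model {
    struct wait_entry *waiters;
};

/* Stand-in for one registered item. */
struct watcher {
    struct wait_entry wait;
    int ready;     /* stands in for "linked on the ready list" */
    int *wakeups;  /* stands in for waking the process in epoll_wait */
};

/* Role of ep_poll_callback: mark the item ready, wake the sleeper. */
static void watcher_callback(struct wait_entry *we)
{
    struct watcher *w = we->owner;

    if (!w->ready)
        w->ready = 1;
    (*w->wakeups)++;
}

/* Role of ep_ptable_queue_proc: hook the item into the file's wait queue. */
static void watcher_register(struct file_model *f, struct watcher *w,
                             int *wakeups)
{
    w->ready = 0;
    w->wakeups = wakeups;
    w->wait.func = watcher_callback;
    w->wait.owner = w;
    w->wait.next = f->waiters;
    f->waiters = &w->wait;
}

/* What a driver does when data arrives: run every queued wakeup function. */
static void file_signal(struct file_model *f)
{
    for (struct wait_entry *we = f->waiters; we; we = we->next)
        we->func(we);
}
```

The point of the design is that the file never needs to know about epoll: it only walks its wait queue and calls whatever function each entry carries, and epoll supplies a function that queues the item and wakes the waiter.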

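To summarize epoll_wait: if no event is ready, it puts the calling process to sleep (unless timeout is 0). When a monitored event occurs, the process is woken up, and the events are copied from the kernel to user space and returned to it.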

References

[1] http://blog.csdn.net/chen19870707/article/details/42525887
[2] http://www.cnblogs.com/apprentice89/p/3234677.html
