epoll概述
epoll是linux中IO多路複用的一種機制,I/O多路複用就是通過一種機制,一個進程可以監視多個描述符,一旦某個描述符就緒(一般是讀就緒或者寫就緒),能夠通知程序進行相應的讀寫操作。當然linux中IO多路複用不僅僅是epoll,其他多路複用機制還有select、poll,但是接下來介紹epoll的內核實現。
網上關於epoll接口的介紹非常多,這個不是我關注的重點,但是還是有必要了解。該接口非常簡單,一共就三個函數,這裏我摘抄了網上關於該接口的介紹:
-
int epoll_create(int size);
創建一個epoll的句柄,size用來告訴內核這個監聽的數目一共有多大。這個參數不同於select()中的第一個參數,給出最大監聽的fd+1的值。需要注意的是,當創建好epoll句柄後,它就是會佔用一個fd值,在linux下如果查看/proc/進程id/fd/,是能夠看到這個fd的,所以在使用完epoll後,必須調用close()關閉,否則可能導致fd被耗盡。 -
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
epoll的事件註冊函數,它不同與select()是在監聽事件時告訴內核要監聽什麼類型的事件,而是在這裏先註冊要監聽的事件類型。第一個參數是epoll_create()的返回值,第二個參數表示動作,用三個宏來表示:
EPOLL_CTL_ADD:註冊新的fd到epfd中;
EPOLL_CTL_MOD:修改已經註冊的fd的監聽事件;
EPOLL_CTL_DEL:從epfd中刪除一個fd;
第三個參數是需要監聽的fd,第四個參數是告訴內核需要監聽什麼事,struct epoll_event結構如下:
struct epoll_event {
__uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
};
events可以是以下幾個宏的集合:
EPOLLIN :表示對應的文件描述符可以讀(包括對端SOCKET正常關閉);
EPOLLOUT:表示對應的文件描述符可以寫;
EPOLLPRI:表示對應的文件描述符有緊急的數據可讀(這裏應該表示有帶外數據到來);
EPOLLERR:表示對應的文件描述符發生錯誤;
EPOLLHUP:表示對應的文件描述符被掛斷;
EPOLLET: 將EPOLL設爲邊緣觸發(Edge Triggered)模式,這是相對於水平觸發(Level Triggered)來說的。
EPOLLONESHOT:只監聽一次事件,當監聽完這次事件之後,如果還需要繼續監聽這個socket的話,需要再次把這個socket加入到EPOLL隊列裏
- int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);
等待事件的產生,類似於select()調用。參數events用來從內核得到事件的集合,maxevents告之內核這個events有多大,這個maxevents的值不能大於創建epoll_create()時的size(備註:在4.1.2內核裏面,epoll_create的size沒有什麼用),參數timeout是超時時間(毫秒,0會立即返回,小於0時將是永久阻塞)。該函數返回需要處理的事件數目,如返回0表示已超時
epoll相比select/poll的優勢:
-
select/poll每次調用都要傳遞所要監控的所有fd給select/poll系統調用(這意味着每次調用都要將fd列表從用戶態拷貝到內核態,當fd數目很多時,這會造成低效)。而每次調用epoll_wait時(作用相當於調用select/poll),不需要再傳遞fd列表給內核,因爲已經在epoll_ctl中將需要監控的fd告訴了內核(epoll_ctl不需要每次都拷貝所有的fd,只需要進行增量式操作)。所以,在調用epoll_create之後,內核已經在內核態開始準備數據結構存放要監控的fd了。每次epoll_ctl只是對這個數據結構進行簡單的維護。
-
select/poll一個致命弱點就是當你擁有一個很大的socket集合,不過由於網絡延時,任一時間只有部分的socket是"活躍"的,但是select/poll每次調用都會線性掃描全部的集合,導致效率呈現線性下降。但是epoll不存在這個問題,它只會對"活躍"的socket進行操作---這是因爲在內核實現中epoll是根據每個fd上面的callback函數實現的。
-
當我們調用epoll_ctl往裏塞入百萬個fd時,epoll_wait仍然可以飛快的返回,並有效的將發生事件的fd給我們用戶。這是由於我們在調用epoll_create時,內核除了幫我們在epoll文件系統裏建了個file結點,在內核cache裏建了個紅黑樹用於存儲以後epoll_ctl傳來的fd外,還會再建立一個list鏈表,用於存儲準備就緒的事件,當epoll_wait調用時,僅僅觀察這個list鏈表裏有沒有數據即可。有數據就返回,沒有數據就sleep,等到timeout時間到後即使鏈表沒數據也返回。所以,epoll_wait非常高效。而且,通常情況下即使我們要監控百萬計的fd,大多一次也只返回很少量的準備就緒fd而已,所以,epoll_wait僅需要從內核態copy少量的fd到用戶態而已。那麼,這個準備就緒list鏈表是怎麼維護的呢?當我們執行epoll_ctl時,除了把fd放到epoll文件系統裏file對象對應的紅黑樹上之外,還會給內核中斷處理程序註冊一個回調函數,告訴內核,如果這個fd的中斷到了,就把它放到準備就緒list鏈表裏。所以,當一個fd(例如socket)上有數據到了,內核在把設備(例如網卡)上的數據copy到內核中後就來把fd(socket)插入到準備就緒list鏈表裏了。
源碼分析
epoll相關的內核代碼在fs/eventpoll.c文件中,下面分別分析epoll_create、epoll_ctl和epoll_wait三個函數在內核中的實現,分析所用linux內核源碼爲4.1.2版本。
epoll_create
epoll_create用於創建一個epoll的句柄,其在內核的系統實現如下:
sys_epoll_create:
SYSCALL_DEFINE1(epoll_create, int, size)
{
if (size <= 0)
return -EINVAL;
return sys_epoll_create1(0);
}
可見,我們在調用epoll_create時,傳入的size參數,僅僅是用來判斷是否小於等於0,之後再也沒有其他用處。
整個函數就3行代碼,真正的工作還是放在sys_epoll_create1函數中。
sys_epoll_create -> sys_epoll_create1:
/*
* Open an eventpoll file descriptor.
*/
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
int error, fd;
struct eventpoll *ep = NULL;
struct file *file;
/* Check the EPOLL_* constant for consistency. */
BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
if (flags & ~EPOLL_CLOEXEC)
return -EINVAL;
/*
* Create the internal data structure ("struct eventpoll").
*/
error = ep_alloc(&ep);
if (error < 0)
return error;
/*
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure and a free file descriptor.
*/
fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
error = fd;
goto out_free_ep;
}
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
O_RDWR | (flags & O_CLOEXEC));
if (IS_ERR(file)) {
error = PTR_ERR(file);
goto out_free_fd;
}
ep->file = file;
fd_install(fd, file);
return fd;
out_free_fd:
put_unused_fd(fd);
out_free_ep:
ep_free(ep);
return error;
}
sys_epoll_create1 函數流程如下:
- 首先調用ep_alloc函數申請一個eventpoll結構,並且初始化該結構的成員,這裏沒什麼好說的,代碼如下:
sys_epoll_create -> sys_epoll_create1 -> ep_alloc:
static int ep_alloc(struct eventpoll **pep)
{
int error;
struct user_struct *user;
struct eventpoll *ep;
user = get_current_user();
error = -ENOMEM;
ep = kzalloc(sizeof(*ep), GFP_KERNEL);
if (unlikely(!ep))
goto free_uid;
spin_lock_init(&ep->lock);
mutex_init(&ep->mtx);
init_waitqueue_head(&ep->wq);
init_waitqueue_head(&ep->poll_wait);
INIT_LIST_HEAD(&ep->rdllist);
ep->rbr = RB_ROOT;
ep->ovflist = EP_UNACTIVE_PTR;
ep->user = user;
*pep = ep;
return 0;
free_uid:
free_uid(user);
return error;
}
- 接下來調用get_unused_fd_flags函數,在本進程中申請一個未使用的fd文件描述符。
sys_epoll_create -> sys_epoll_create1 -> ep_alloc -> get_unused_fd_flags:
int get_unused_fd_flags(unsigned flags)
{
return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
}
linux內核中,current是個宏,返回的是一個task_struct結構(我們稱之爲進程描述符)的變量,表示的是當前進程,進程打開的文件資源保存在進程描述符的files成員裏面,所以current->files返回的當前進程打開的文件資源。rlimit(RLIMIT_NOFILE) 函數獲取的是當前進程可以打開的最大文件描述符數,這個值可以設置,默認是1024。
__alloc_fd的工作是爲進程在[start,end)之間(備註:這裏start爲0, end爲進程可以打開的最大文件描述符數)分配一個可用的文件描述符,這裏就不繼續深入下去了,代碼如下:
sys_epoll_create -> sys_epoll_create1 -> ep_alloc -> get_unused_fd_flags -> __alloc_fd:
/*
* allocate a file descriptor, mark it busy.
*/
int __alloc_fd(struct files_struct *files,
unsigned start, unsigned end, unsigned flags)
{
unsigned int fd;
int error;
struct fdtable *fdt;
spin_lock(&files->file_lock);
repeat:
fdt = files_fdtable(files);
fd = start;
if (fd < files->next_fd)
fd = files->next_fd;
if (fd < fdt->max_fds)
fd = find_next_fd(fdt, fd);
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
if (fd >= end)
goto out;
error = expand_files(files, fd);
if (error < 0)
goto out;
/*
* If we needed to expand the fs array we
* might have blocked - try again.
*/
if (error)
goto repeat;
if (start <= files->next_fd)
files->next_fd = fd + 1;
__set_open_fd(fd, fdt);
if (flags & O_CLOEXEC)
__set_close_on_exec(fd, fdt);
else
__clear_close_on_exec(fd, fdt);
error = fd;
#if 1
/* Sanity check */
if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
rcu_assign_pointer(fdt->fd[fd], NULL);
}
#endif
out:
spin_unlock(&files->file_lock);****
return error;
}
- 然後,epoll_create1會調用anon_inode_getfile,創建一個file結構,如下:
sys_epoll_create -> sys_epoll_create1 -> anon_inode_getfile:
/**
* anon_inode_getfile - creates a new file instance by hooking it up to an
* anonymous inode, and a dentry that describe the "class"
* of the file
*
* @name: [in] name of the "class" of the new file
* @fops: [in] file operations for the new file
* @priv: [in] private data for the new file (will be file's private_data)
* @flags: [in] flags
*
* Creates a new file by hooking it on a single inode. This is useful for files
* that do not need to have a full-fledged inode in order to operate correctly.
* All the files created with anon_inode_getfile() will share a single inode,
* hence saving memory and avoiding code duplication for the file/inode/dentry
* setup. Returns the newly created file* or an error pointer.
*/
struct file *anon_inode_getfile(const char *name,
const struct file_operations *fops,
void *priv, int flags)
{
struct qstr this;
struct path path;
struct file *file;
if (IS_ERR(anon_inode_inode))
return ERR_PTR(-ENODEV);
if (fops->owner && !try_module_get(fops->owner))
return ERR_PTR(-ENOENT);
/*
* Link the inode to a directory entry by creating a unique name
* using the inode sequence number.
*/
file = ERR_PTR(-ENOMEM);
this.name = name;
this.len = strlen(name);
this.hash = 0;
path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
if (!path.dentry)
goto err_module;
path.mnt = mntget(anon_inode_mnt);
/*
* We know the anon_inode inode count is always greater than zero,
* so ihold() is safe.
*/
ihold(anon_inode_inode);
d_instantiate(path.dentry, anon_inode_inode);
file = alloc_file(&path, OPEN_FMODE(flags), fops);
if (IS_ERR(file))
goto err_dput;
file->f_mapping = anon_inode_inode->i_mapping;
file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
file->private_data = priv;
return file;
err_dput:
path_put(&path);
err_module:
module_put(fops->owner);
return file;
}
anon_inode_getfile函數中首先會alloc一個file結構和一個dentry結構,然後將該file結構與一個匿名inode節點anon_inode_inode掛鉤在一起,這裏要注意的是,在調用anon_inode_getfile函數申請file結構時,傳入了前面申請的eventpoll結構的ep變量,申請的file->private_data會指向這個ep變量,同時,在anon_inode_getfile函數返回來後,ep->file會指向該函數申請的file結構變量。
簡要說一下file/dentry/inode,當進程打開一個文件時,內核就會爲該進程分配一個file結構,表示打開的文件在進程的上下文,然後應用程序會通過一個int類型的文件描述符來訪問這個結構,實際上內核的進程裏面維護一個file結構的數組,而文件描述符就是相應的file結構在數組中的下標。
dentry結構(稱之爲“目錄項”)記錄着文件的各種屬性,比如文件名、訪問權限等,每個文件都只有一個dentry結構,然後一個進程可以多次打開一個文件,多個進程也可以打開同一個文件,這些情況,內核都會申請多個file結構,建立多個文件上下文。但是,對同一個文件來說,無論打開多少次,內核只會爲該文件分配一個dentry。所以,file結構與dentry結構的關係是多對一的。
同時,每個文件除了有一個dentry目錄項結構外,還有一個索引節點inode結構,裏面記錄文件在存儲介質上的位置和分佈等信息,每個文件在內核中只分配一個inode。 dentry與inode描述的目標是不同的,一個文件可能會有好幾個文件名(比如鏈接文件),通過不同文件名訪問同一個文件的權限也可能不同。dentry文件所代表的是邏輯意義上的文件,記錄的是其邏輯上的屬性,而inode結構所代表的是其物理意義上的文件,記錄的是其物理上的屬性。dentry與inode結構的關係是多對一的關係。
- 最後,epoll_create1調用fd_install函數,將fd與file交給關聯在一起,之後,內核可以通過應用傳入的fd參數訪問file結構,本段代碼比較簡單,不繼續深入下去了。
sys_epoll_create -> sys_epoll_create1 -> fd_install:
/*
* Install a file pointer in the fd array.
*
* The VFS is full of places where we drop the files lock between
* setting the open_fds bitmap and installing the file in the file
* array. At any such point, we are vulnerable to a dup2() race
* installing a file in the array before us. We need to detect this and
* fput() the struct file we are about to overwrite in this case.
*
* It should never happen - if we allow dup2() do it, _really_ bad things
* will follow.
*
* NOTE: __fd_install() variant is really, really low-level; don't
* use it unless you are forced to by truly lousy API shoved down
* your throat. 'files' *MUST* be either current->files or obtained
* by get_files_struct(current) done by whoever had given it to you,
* or really bad things will happen. Normally you want to use
* fd_install() instead.
*/
void __fd_install(struct files_struct *files, unsigned int fd,
struct file *file)
{
struct fdtable *fdt;
might_sleep();
rcu_read_lock_sched();
while (unlikely(files->resize_in_progress)) {
rcu_read_unlock_sched();
wait_event(files->resize_wait, !files->resize_in_progress);
rcu_read_lock_sched();
}
/* coupled with smp_wmb() in expand_fdtable() */
smp_rmb();
fdt = rcu_dereference_sched(files->fdt);
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
rcu_read_unlock_sched();
}
void fd_install(unsigned int fd, struct file *file)
{
__fd_install(current->files, fd, file);
}
總結epoll_create函數所做的事:調用epoll_create後,在內核中分配一個eventpoll結構和代表epoll文件的file結構,並且將這兩個結構關聯在一塊,同時,返回一個也與file結構相關聯的epoll文件描述符fd。當應用程序操作epoll時,需要傳入一個epoll文件描述符fd,內核根據這個fd,找到epoll的file結構,然後通過file,獲取之前epoll_create申請eventpoll結構變量,epoll相關的重要信息都存儲在這個結構裏面。接下來,所有epoll接口函數的操作,都是在eventpoll結構變量上進行的。
所以,epoll_create的作用就是爲進程在內核中建立一個從epoll文件描述符到eventpoll結構變量的通道。
epoll_ctl
epoll_ctl接口的作用是添加/修改/刪除文件的監聽事件,內核代碼如下:
sys_epoll_ctl:
/*
* The following function implements the controller interface for
* the eventpoll file that enables the insertion/removal/change of
* file descriptors inside the interest set.
*/
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
struct epoll_event __user *, event)
{
int error;
int full_check = 0;
struct fd f, tf;
struct eventpoll *ep;
struct epitem *epi;
struct epoll_event epds;
struct eventpoll *tep = NULL;
error = -EFAULT;
if (ep_op_has_event(op) &&
copy_from_user(&epds, event, sizeof(struct epoll_event)))
goto error_return;
error = -EBADF;
f = fdget(epfd);
if (!f.file)
goto error_return;
/* Get the "struct file *" for the target file */
tf = fdget(fd);
if (!tf.file)
goto error_fput;
/* The target file descriptor must support poll */
error = -EPERM;
if (!tf.file->f_op->poll)
goto error_tgt_fput;
/* Check if EPOLLWAKEUP is allowed */
if (ep_op_has_event(op))
ep_take_care_of_epollwakeup(&epds);
/*
* We have to check that the file structure underneath the file descriptor
* the user passed to us _is_ an eventpoll file. And also we do not permit
* adding an epoll file descriptor inside itself.
*/
error = -EINVAL;
if (f.file == tf.file || !is_file_epoll(f.file))
goto error_tgt_fput;
/*
* epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
* so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
* Also, we do not currently supported nested exclusive wakeups.
*/
if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
if (op == EPOLL_CTL_MOD)
goto error_tgt_fput;
if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
goto error_tgt_fput;
}
/*
* At this point it is safe to assume that the "private_data" contains
* our own data structure.
*/
ep = f.file->private_data;
/*
* When we insert an epoll file descriptor, inside another epoll file
* descriptor, there is the change of creating closed loops, which are
* better be handled here, than in more critical paths. While we are
* checking for loops we also determine the list of files reachable
* and hang them on the tfile_check_list, so we can check that we
* haven't created too many possible wakeup paths.
*
* We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
* the epoll file descriptor is attaching directly to a wakeup source,
* unless the epoll file descriptor is nested. The purpose of taking the
* 'epmutex' on add is to prevent complex toplogies such as loops and
* deep wakeup paths from forming in parallel through multiple
* EPOLL_CTL_ADD operations.
*/
mutex_lock_nested(&ep->mtx, 0);
if (op == EPOLL_CTL_ADD) {
if (!list_empty(&f.file->f_ep_links) ||
is_file_epoll(tf.file)) {
full_check = 1;
mutex_unlock(&ep->mtx);
mutex_lock(&epmutex);
if (is_file_epoll(tf.file)) {
error = -ELOOP;
if (ep_loop_check(ep, tf.file) != 0) {
clear_tfile_check_list();
goto error_tgt_fput;
}
} else
list_add(&tf.file->f_tfile_llink,
&tfile_check_list);
mutex_lock_nested(&ep->mtx, 0);
if (is_file_epoll(tf.file)) {
tep = tf.file->private_data;
mutex_lock_nested(&tep->mtx, 1);
}
}
}
/*
* Try to lookup the file inside our RB tree, Since we grabbed "mtx"
* above, we can be sure to be able to use the item looked up by
* ep_find() till we release the mutex.
*/
epi = ep_find(ep, tf.file, fd);
error = -EINVAL;
switch (op) {
case EPOLL_CTL_ADD:
if (!epi) {
epds.events |= POLLERR | POLLHUP;
error = ep_insert(ep, &epds, tf.file, fd, full_check);
} else
error = -EEXIST;
if (full_check)
clear_tfile_check_list();
break;
case EPOLL_CTL_DEL:
if (epi)
error = ep_remove(ep, epi);
else
error = -ENOENT;
break;
case EPOLL_CTL_MOD:
if (epi) {
if (!(epi->event.events & EPOLLEXCLUSIVE)) {
epds.events |= POLLERR | POLLHUP;
error = ep_modify(ep, epi, &epds);
}
} else
error = -ENOENT;
break;
}
if (tep != NULL)
mutex_unlock(&tep->mtx);
mutex_unlock(&ep->mtx);
error_tgt_fput:
if (full_check)
mutex_unlock(&epmutex);
fdput(tf);
error_fput:
fdput(f);
error_return:
return error;
}
根據前面對epoll_ctl接口的介紹,op是對epoll操作的動作(添加/修改/刪除事件),ep_op_has_event(op)判斷是否不是刪除操作,如果op != EPOLL_CTL_DEL爲true,則需要調用copy_from_user函數將用戶空間傳過來的event事件拷貝到內核的epds變量中。因爲,只有刪除操作,內核不需要使用進程傳入的event事件。
接着連續調用兩次fdget分別獲取epoll文件和被監聽文件(以下稱爲目標文件)的file結構變量(備註:該函數返回fd結構變量,fd結構包含file結構)。
接下來就是對參數的一些檢查,出現如下情況,就可以認爲傳入的參數有問題,直接返回出錯:
- 目標文件不支持poll操作(!tf.file->f_op->poll);
- 監聽的目標文件就是epoll文件本身(f.file == tf.file);
- 用戶傳入的epoll文件(epfd代表的文件)並不是一個真正的epoll的文件(!is_file_epoll(f.file));
- 如果操作動作是修改操作,並且事件類型爲EPOLLEXCLUSIVE,返回出錯等等。
當然下面還有一些關於操作動作如果是添加操作的判斷,這裏不做解釋,比較簡單,自行閱讀。
在ep裏面,維護着一個紅黑樹,每次添加註冊事件時,都會申請一個epitem結構的變量表示事件的監聽項,然後插入ep的紅黑樹裏面。在epoll_ctl裏面,會調用ep_find函數從ep的紅黑樹裏面查找目標文件表示的監聽項,返回的監聽項可能爲空。
接下來switch這塊區域的代碼就是整個epoll_ctl函數的核心,對op進行switch出來的有添加(EPOLL_CTL_ADD)、刪除(EPOLL_CTL_DEL)和修改(EPOLL_CTL_MOD)三種情況,這裏我以添加爲例講解,其他兩種情況類似,知道了如何添加監聽事件,其他刪除和修改監聽事件都可以舉一反三。
爲目標文件添加監控事件時,首先要保證當前ep裏面還沒有對該目標文件進行監聽,如果存在(epi不爲空),就返回-EEXIST錯誤。否則說明參數正常,然後先默認設置對目標文件的POLLERR和POLLHUP監聽事件,然後調用ep_insert函數,將對目標文件的監聽事件插入到ep維護的紅黑樹裏面:
sys_epoll_ctl -> ep_insert:
/*
* Must be called with "mtx" held.
*/
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
struct file *tfile, int fd, int full_check)
{
int error, revents, pwake = 0;
unsigned long flags;
long user_watches;
struct epitem *epi;
struct ep_pqueue epq;
user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
return -ENOSPC;
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
return -ENOMEM;
/* Item initialization follow here ... */
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
if (epi->event.events & EPOLLWAKEUP) {
error = ep_create_wakeup_source(epi);
if (error)
goto error_create_wakeup_source;
} else {
RCU_INIT_POINTER(epi->ws, NULL);
}
/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
/*
* Attach the item to the poll hooks and get current event bits.
* We can safely use the file* here because its usage count has
* been increased by the caller of this function. Note that after
* this operation completes, the poll callback can start hitting
* the new item.
*/
revents = ep_item_poll(epi, &epq.pt);
/*
* We have to check if something went wrong during the poll wait queue
* install process. Namely an allocation for a wait queue failed due
* high memory pressure.
*/
error = -ENOMEM;
if (epi->nwait < 0)
goto error_unregister;
/* Add the current item to the list of active epoll hook for this file */
spin_lock(&tfile->f_lock);
list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
spin_unlock(&tfile->f_lock);
/*
* Add the current item to the RB tree. All RB tree operations are
* protected by "mtx", and ep_insert() is called with "mtx" held.
*/
ep_rbtree_insert(ep, epi);
/* now check if we've created too many backpaths */
error = -EINVAL;
if (full_check && reverse_path_check())
goto error_remove_epi;
/* We have to drop the new item inside our item list to keep track of it */
spin_lock_irqsave(&ep->lock, flags);
/* record NAPI ID of new item if present */
ep_set_busy_poll_napi_id(epi);
/* If the file is already "ready" we drop it inside the ready list */
if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
ep_pm_stay_awake(epi);
/* Notify waiting tasks that events are available */
if (waitqueue_active(&ep->wq))
wake_up_locked(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
spin_unlock_irqrestore(&ep->lock, flags);
atomic_long_inc(&ep->user->epoll_watches);
/* We have to call this outside the lock */
if (pwake)
ep_poll_safewake(&ep->poll_wait);
return 0;
error_remove_epi:
spin_lock(&tfile->f_lock);
list_del_rcu(&epi->fllink);
spin_unlock(&tfile->f_lock);
rb_erase(&epi->rbn, &ep->rbr);
error_unregister:
ep_unregister_pollwait(ep, epi);
/*
* We need to do this because an event could have been arrived on some
* allocated wait queue. Note that we don't care about the ep->ovflist
* list, since that is used/cleaned only inside a section bound by "mtx".
* And ep_insert() is called with "mtx" held.
*/
spin_lock_irqsave(&ep->lock, flags);
if (ep_is_linked(&epi->rdllink))
list_del_init(&epi->rdllink);
spin_unlock_irqrestore(&ep->lock, flags);
wakeup_source_unregister(ep_wakeup_source(epi));
error_create_wakeup_source:
kmem_cache_free(epi_cache, epi);
return error;
}
前面說過,對目標文件的監聽是由一個epitem結構的監聽項變量維護的,所以在ep_insert函數裏面,首先調用kmem_cache_alloc函數,從slab分配器裏面分配一個epitem結構監聽項,然後對該結構進行初始化,這裏也沒有什麼好說的。我們接下來看ep_item_poll這個函數調用:
sys_epoll_ctl -> ep_insert -> ep_item_poll:
static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
{
pt->_key = epi->event.events;
return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
}
ep_item_poll函數裏面,調用目標文件的poll函數,這個函數針對不同的目標文件而指向不同的函數,如果目標文件爲套接字的話,這個poll就指向sock_poll,而如果目標文件爲tcp套接字來說,這個poll就是tcp_poll函數。雖然poll指向的函數可能會不同,但是其作用都是一樣的,就是獲取目標文件當前產生的事件位,並且將監聽項綁定到目標文件的poll鉤子裏面(最重要的是註冊ep_ptable_queue_proc這個poll callback回調函數),這步操作完成後,以後目標文件產生事件就會調用ep_ptable_queue_proc回調函數。
接下來,調用list_add_tail_rcu將當前監聽項添加到目標文件的f_ep_links鏈表裏面,該鏈表是目標文件的epoll鉤子鏈表,所有對該目標文件進行監聽的監聽項都會加入到該鏈表裏面。
然後就是調用ep_rbtree_insert,將epi監聽項添加到ep維護的紅黑樹裏面,這裏不做解釋,代碼如下:
sys_epoll_ctl -> ep_insert -> ep_rbtree_insert:
static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
{
int kcmp;
struct rb_node **p = &ep->rbr.rb_node, *parent = NULL;
struct epitem *epic;
while (*p) {
parent = *p;
epic = rb_entry(parent, struct epitem, rbn);
kcmp = ep_cmp_ffd(&epi->ffd, &epic->ffd);
if (kcmp > 0)
p = &parent->rb_right;
else
p = &parent->rb_left;
}
rb_link_node(&epi->rbn, parent, p);
rb_insert_color(&epi->rbn, &ep->rbr);
}
前面提到,ep_insert有調用ep_item_poll去獲取目標文件產生的事件位,在調用epoll_ctl前這段時間,可能會產生相關進程需要監聽的事件,如果有監聽的事件產生,(revents & event->events 爲 true),並且目標文件相關的監聽項沒有鏈接到ep的準備鏈表rdlist裏面的話,就將該監聽項添加到ep的rdlist準備鏈表裏面,rdlist鏈接的是該epoll描述符監聽的所有已經就緒的目標文件的監聽項。並且,如果有任務在等待產生事件時,就調用wake_up_locked函數喚醒所有正在等待的任務,處理相應的事件。當進程調用epoll_wait時,該進程就出現在ep的wq等待隊列裏面。接下來講解epoll_wait函數。
總結epoll_ctl函數:該函數根據監聽的事件,爲目標文件申請一個監聽項,並將該監聽項掛人到eventpoll結構的紅黑樹裏面。
epoll_wait
epoll_wait等待事件的產生,內核代碼如下:
sys_epoll_wait:
/*
* Implement the event wait interface for the eventpoll file. It is the kernel
* part of the user space epoll_wait(2).
*/
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout)
{
int error;
struct fd f;
struct eventpoll *ep;
/* The maximum number of event must be greater than zero */
if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
return -EINVAL;
/* Verify that the area passed by the user is writeable */
if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
return -EFAULT;
/* Get the "struct file *" for the eventpoll file */
f = fdget(epfd);
if (!f.file)
return -EBADF;
/*
* We have to check that the file structure underneath the fd
* the user passed to us _is_ an eventpoll file.
*/
error = -EINVAL;
if (!is_file_epoll(f.file))
goto error_fput;
/*
* At this point it is safe to assume that the "private_data" contains
* our own data structure.
*/
ep = f.file->private_data;
/* Time to fish for events ... */
error = ep_poll(ep, events, maxevents, timeout);
error_fput:
fdput(f);
return error;
}
首先是對進程傳進來的一些參數的檢查:
- maxevents必須大於0並且小於EP_MAX_EVENTS,否則就返回-EINVAL;
- 內核必須有對events變量寫文件的權限,否則返回-EFAULT;
- epfd代表的文件必須是個真正的epoll文件,否則返回-EBADF。
參數全部檢查合格後,接下來就調用ep_poll函數進行真正的處理:
sys_epoll_wait -> ep_poll:
/**
* ep_poll - Retrieves ready events, and delivers them to the caller supplied
* event buffer.
*
* @ep: Pointer to the eventpoll context.
* @events: Pointer to the userspace buffer where the ready events should be
* stored.
* @maxevents: Size (in terms of number of events) of the caller event buffer.
* @timeout: Maximum timeout for the ready events fetch operation, in
* milliseconds. If the @timeout is zero, the function will not block,
* while if the @timeout is less than zero, the function will block
* until at least one event has been retrieved (or an error
* occurred).
*
* Returns: Returns the number of ready events which have been fetched, or an
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
int maxevents, long timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
u64 slack = 0;
wait_queue_t wait;
ktime_t expires, *to = NULL;
if (timeout > 0) {
struct timespec64 end_time = ep_set_mstimeout(timeout);
slack = select_estimate_accuracy(&end_time);
to = &expires;
*to = timespec64_to_ktime(end_time);
} else if (timeout == 0) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
* caller specified a non blocking operation.
*/
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
goto check_events;
}
fetch_events:
if (!ep_events_available(ep))
ep_busy_loop(ep, timed_out);
spin_lock_irqsave(&ep->lock, flags);
if (!ep_events_available(ep)) {
/*
* Busy poll timed out. Drop NAPI ID for now, we can add
* it back in when we have moved a socket with a valid NAPI
* ID onto the ready list.
*/
ep_reset_busy_poll_napi_id(ep);
/*
* We don't have any available event to return to the caller.
* We need to sleep here, and we will be wake up by
* ep_poll_callback() when events will become available.
*/
init_waitqueue_entry(&wait, current);
__add_wait_queue_exclusive(&ep->wq, &wait);
for (;;) {
/*
* We don't want to sleep if the ep_poll_callback() sends us
* a wakeup in between. That's why we set the task state
* to TASK_INTERRUPTIBLE before doing the checks.
*/
set_current_state(TASK_INTERRUPTIBLE);
if (ep_events_available(ep) || timed_out)
break;
if (signal_pending(current)) {
res = -EINTR;
break;
}
spin_unlock_irqrestore(&ep->lock, flags);
if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
}
__remove_wait_queue(&ep->wq, &wait);
__set_current_state(TASK_RUNNING);
}
check_events:
/* Is it worth to try to dig for events ? */
eavail = ep_events_available(ep);
spin_unlock_irqrestore(&ep->lock, flags);
/*
* Try to transfer events to user space. In case we get 0 events and
* there's still timeout left over, we go trying again in search of
* more luck.
*/
if (!res && eavail &&
!(res = ep_send_events(ep, events, maxevents)) && !timed_out)
goto fetch_events;
return res;
}
ep_poll中首先是對等待時間的處理,timeout超時時間以ms爲單位,timeout大於0,說明等待timeout時間後超時,如果timeout等於0,函數不阻塞,直接返回,小於0的情況,是永久阻塞,直到有事件產生才返回。
當沒有事件產生時((!ep_events_available(ep))爲true),調用__add_wait_queue_exclusive函數將當前進程加入到ep->wq等待隊列裏面,然後在一個無限for循環裏面,首先調用set_current_state(TASK_INTERRUPTIBLE),將當前進程設置爲可中斷的睡眠狀態,然後當前進程就讓出cpu,進入睡眠,直到有其他進程調用wake_up或者有中斷信號進來喚醒本進程,它纔會去執行接下來的代碼。
如果進程被喚醒後,首先檢查是否有事件產生,或者是否出現超時還是被其他信號喚醒的。如果出現這些情況,就跳出循環,將當前進程從ep->wp的等待隊列裏面移除,並且將當前進程設置爲TASK_RUNNING就緒狀態。
如果真的有事件產生,就調用ep_send_events函數,將events事件轉移到用戶空間裏面。
sys_epoll_wait -> ep_poll -> ep_send_events:
static int ep_send_events(struct eventpoll *ep,
struct epoll_event __user *events, int maxevents)
{
struct ep_send_events_data esed;
esed.maxevents = maxevents;
esed.events = events;
return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
}
ep_send_events沒有什麼工作,真正的工作是在ep_scan_ready_list函數裏面:
sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list:
/**
* ep_scan_ready_list - Scans the ready list in a way that makes possible for
* the scan code, to call f_op->poll(). Also allows for
* O(NumReady) performance.
*
* @ep: Pointer to the epoll private data structure.
* @sproc: Pointer to the scan callback.
* @priv: Private opaque data passed to the @sproc callback.
* @depth: The current depth of recursive f_op->poll calls.
* @ep_locked: caller already holds ep->mtx
*
* Returns: The same integer error code returned by the @sproc callback.
*/
static int ep_scan_ready_list(struct eventpoll *ep,
int (*sproc)(struct eventpoll *,
struct list_head *, void *),
void *priv, int depth, bool ep_locked)
{
int error, pwake = 0;
unsigned long flags;
struct epitem *epi, *nepi;
LIST_HEAD(txlist);
/*
* We need to lock this because we could be hit by
* eventpoll_release_file() and epoll_ctl().
*/
if (!ep_locked)
mutex_lock_nested(&ep->mtx, depth);
/*
* Steal the ready list, and re-init the original one to the
* empty list. Also, set ep->ovflist to NULL so that events
* happening while looping w/out locks, are not lost. We cannot
* have the poll callback to queue directly on ep->rdllist,
* because we want the "sproc" callback to be able to do it
* in a lockless way.
*/
spin_lock_irqsave(&ep->lock, flags);
list_splice_init(&ep->rdllist, &txlist);
ep->ovflist = NULL;
spin_unlock_irqrestore(&ep->lock, flags);
/*
* Now call the callback function.
*/
error = (*sproc)(ep, &txlist, priv);
spin_lock_irqsave(&ep->lock, flags);
/*
* During the time we spent inside the "sproc" callback, some
* other events might have been queued by the poll callback.
* We re-insert them inside the main ready-list here.
*/
for (nepi = ep->ovflist; (epi = nepi) != NULL;
nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
/*
* We need to check if the item is already in the list.
* During the "sproc" callback execution time, items are
* queued into ->ovflist but the "txlist" might already
* contain them, and the list_splice() below takes care of them.
*/
if (!ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
ep_pm_stay_awake(epi);
}
}
/*
* We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
* releasing the lock, events will be queued in the normal way inside
* ep->rdllist.
*/
ep->ovflist = EP_UNACTIVE_PTR;
/*
* Quickly re-inject items left on "txlist".
*/
list_splice(&txlist, &ep->rdllist);
__pm_relax(ep->ws);
if (!list_empty(&ep->rdllist)) {
/*
* Wake up (if active) both the eventpoll wait list and
* the ->poll() wait list (delayed after we release the lock).
*/
if (waitqueue_active(&ep->wq))
wake_up_locked(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
spin_unlock_irqrestore(&ep->lock, flags);
if (!ep_locked)
mutex_unlock(&ep->mtx);
/* We have to call this outside the lock */
if (pwake)
ep_poll_safewake(&ep->poll_wait);
return error;
}
ep_scan_ready_list首先將ep就緒鏈表裏面的數據鏈接到一個全局的txlist裏面,然後清空ep的就緒鏈表,同時還將ep的ovflist鏈表設置爲NULL,ovflist是用單鏈表,是一個接受就緒事件的備份鏈表,當內核進程將事件從內核拷貝到用戶空間時,這段時間目標文件可能會產生新的事件,這個時候,就需要將新的時間鏈入到ovlist裏面。
僅接着,調用sproc回調函數(這裏將調用ep_send_events_proc函數)將事件數據從內核拷貝到用戶空間。
sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list -> ep_send_events_proc:
static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
void *priv)
{
struct ep_send_events_data *esed = priv;
int eventcnt;
unsigned int revents;
struct epitem *epi;
struct epoll_event __user *uevent;
struct wakeup_source *ws;
poll_table pt;
init_poll_funcptr(&pt, NULL);
/*
* We can loop without lock because we are passed a task private list.
* Items cannot vanish during the loop because ep_scan_ready_list() is
* holding "mtx" during this call.
*/
for (eventcnt = 0, uevent = esed->events;
!list_empty(head) && eventcnt < esed->maxevents;) {
epi = list_first_entry(head, struct epitem, rdllink);
/*
* Activate ep->ws before deactivating epi->ws to prevent
* triggering auto-suspend here (in case we reactive epi->ws
* below).
*
* This could be rearranged to delay the deactivation of epi->ws
* instead, but then epi->ws would temporarily be out of sync
* with ep_is_linked().
*/
ws = ep_wakeup_source(epi);
if (ws) {
if (ws->active)
__pm_stay_awake(ep->ws);
__pm_relax(ws);
}
list_del_init(&epi->rdllink);
revents = ep_item_poll(epi, &pt);
/*
* If the event mask intersect the caller-requested one,
* deliver the event to userspace. Again, ep_scan_ready_list()
* is holding "mtx", so no operations coming from userspace
* can change the item.
*/
if (revents) {
if (__put_user(revents, &uevent->events) ||
__put_user(epi->event.data, &uevent->data)) {
list_add(&epi->rdllink, head);
ep_pm_stay_awake(epi);
return eventcnt ? eventcnt : -EFAULT;
}
eventcnt++;
uevent++;
if (epi->event.events & EPOLLONESHOT)
epi->event.events &= EP_PRIVATE_BITS;
else if (!(epi->event.events & EPOLLET)) {
/*
* If this file has been added with Level
* Trigger mode, we need to insert back inside
* the ready list, so that the next call to
* epoll_wait() will check again the events
* availability. At this point, no one can insert
* into ep->rdllist besides us. The epoll_ctl()
* callers are locked out by
* ep_scan_ready_list() holding "mtx" and the
* poll callback will queue them in ep->ovflist.
*/
list_add_tail(&epi->rdllink, &ep->rdllist);
ep_pm_stay_awake(epi);
}
}
}
return eventcnt;
}
ep_send_events_proc回調函數循環獲取監聽項的事件數據,對每個監聽項,調用ep_item_poll獲取監聽到的目標文件的事件,如果獲取到事件,就調用__put_user函數將數據拷貝到用戶空間。
回到ep_scan_ready_list函數,上面說到,在sproc回調函數執行期間,目標文件可能會產生新的事件鏈入ovlist鏈表裏面,所以,在回調結束後,需要重新將ovlist鏈表裏面的事件添加到rdllist就緒事件鏈表裏面。
同時在最後,如果rdlist不爲空(表示是否有就緒事件),並且由進程等待該事件,就調用wake_up_locked再一次喚醒內核進程處理事件的到達(流程跟前面一樣,也就是將事件拷貝到用戶空間)。
到這,epoll_wait的流程是結束了,但是有一個問題,就是前面提到的進程調用epoll_wait後會睡眠,但是這個進程什麼時候被喚醒呢?在調用epoll_ctl爲目標文件註冊監聽項時,對目標文件的監聽項註冊一個ep_ptable_queue_proc回調函數,ep_ptable_queue_proc回調函數將進程添加到目標文件的wakeup鏈表裏面,並且註冊ep_poll_callbak回調,當目標文件產生事件時,ep_poll_callbak回調就去喚醒等待隊列裏面的進程。
總結一下epoll該函數: epoll_wait函數會使調用它的進程進入睡眠(timeout爲0時除外),如果有監聽的事件產生,該進程就被喚醒,同時將事件從內核裏面拷貝到用戶空間返回給該進程。
參考
[1] http://blog.csdn.net/chen19870707/article/details/42525887
[2] http://www.cnblogs.com/apprentice89/p/3234677.html