I/O Multiplexing: poll Source Code Analysis (Based on Linux 2.6.13.1)

Original

2020-06-19 21:07

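The poll system call is one of the earlier I/O multiplexing interfaces, similar to select and epoll. This post walks through how it works. First, the declaration of poll.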

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

Next, the related data structure.

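struct pollfd {
    int   fd;         /* file descriptor to watch */
    short events;     /* events the caller is interested in */
    short revents;    /* events that actually occurred, filled in by the kernel */
};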
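As a user-space illustration of how these fields are used, here is a minimal sketch. The helper name wait_readable is hypothetical and not part of any API; it only wraps a single poll call on one descriptor.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until fd becomes readable or timeout_ms elapses.
 * Returns poll()'s result: 1 if the descriptor reported an event,
 * 0 on timeout, -1 on failure. The reported events go to *revents_out.
 */
static int wait_readable(int fd, int timeout_ms, short *revents_out)
{
	struct pollfd pfd;
	int n;

	assert(revents_out != NULL);
	pfd.fd = fd;
	pfd.events = POLLIN;	/* what the caller is interested in */
	pfd.revents = 0;	/* filled in by the kernel */

	n = poll(&pfd, 1, timeout_ms);
	*revents_out = pfd.revents;
	return n;
}
```

Calling wait_readable on the read end of an empty pipe with a timeout of 0 should return 0 at once; after one byte is written to the pipe, it should return 1 with POLLIN set.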

Now let's start analyzing sys_poll, the system call behind the poll function.

struct poll_wqueues table;
table.pt.qproc = __pollwait;
table.error = 0;
table.table = NULL;

First, a poll_wqueues structure is initialized.
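The structure definition itself is not reproduced in this text. The following is a simplified user-space mock inferred only from the three fields the initialization above touches (pt.qproc, error, table). Every name ending in _mock is hypothetical, and the field types are assumptions rather than a copy of the kernel header.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque stand-ins for kernel types; all *_mock names are hypothetical. */
struct file_mock;
struct wait_head_mock;
struct poll_table_page_mock;
struct poll_table_mock;

/* Callback used to put the caller on a wait queue (assumed signature). */
typedef void (*queue_proc_mock)(struct file_mock *, struct wait_head_mock *,
				struct poll_table_mock *);

struct poll_table_mock {
	queue_proc_mock qproc;
};

struct poll_wqueues_mock {
	struct poll_table_mock pt;		/* carries the queueing callback */
	struct poll_table_page_mock *table;	/* storage for wait queue entries */
	int error;				/* error recorded while queueing */
};

/* Stand-in for __pollwait; it does nothing in this mock. */
static void pollwait_mock(struct file_mock *f, struct wait_head_mock *q,
			  struct poll_table_mock *p)
{
	(void)f;
	(void)q;
	(void)p;
}

/* Mirrors the three assignments shown in the text. */
static void initwait_mock(struct poll_wqueues_mock *pwq)
{
	assert(pwq != NULL);
	pwq->pt.qproc = pollwait_mock;
	pwq->error = 0;
	pwq->table = NULL;
}
```

The point of the layout is that each file's poll function only sees the embedded table with its callback, while sys_poll keeps the surrounding bookkeeping.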

Next, memory is allocated and the data passed in by the user is copied into the kernel.

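	head = NULL;
	walk = NULL;
	i = nfds;
	err = -ENOMEM;
	while(i!=0) {
		struct poll_list *pp;
		// allocate up to about one page: one poll_list header plus as many pollfd entries as fit; loop again if more remain
		pp = kmalloc(sizeof(struct poll_list)+
				sizeof(struct pollfd)*
				(i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i),
					GFP_KERNEL);
		// allocation failed: bail out with err == -ENOMEM (out_fds, not shown here, frees the list)
		if (pp == NULL)
			goto out_fds;
		pp->next=NULL;
		// number of pollfd entries copied in this round
		pp->len = (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i);
		// link the node into the list
		if (head == NULL)
			head = pp;
		else
			walk->next = pp;
		// walk points to the current (last) poll_list node
		walk = pp;
		// copy the user's data into the kernel; a non-zero return means a bad user address
		if (copy_from_user(pp->entries, ufds + nfds-i, sizeof(struct pollfd)*pp->len)) {
			err = -EFAULT;
			goto out_fds;
		}
		// number of entries still to copy
		i -= pp->len;
	}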

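After the copy completes, the kernel holds a singly linked list of poll_list nodes, each carrying up to POLLFD_PER_PAGE pollfd entries.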
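The chunked copy can be imitated in user space. This is only a sketch: malloc and memcpy stand in for kmalloc and copy_from_user, CHUNK_MAX is a made-up small value playing the role of POLLFD_PER_PAGE, and the names fd_chunk, build_chunks and free_chunks are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_MAX 4	/* hypothetical stand-in for POLLFD_PER_PAGE */

struct fd_chunk {
	struct fd_chunk *next;
	size_t len;			/* entries stored in this node */
	struct pollfd entries[];	/* C99 flexible array member */
};

static void free_chunks(struct fd_chunk *head)
{
	while (head != NULL) {
		struct fd_chunk *next = head->next;
		free(head);
		head = next;
	}
}

/*
 * Copy nfds entries from src into a chain of chunks.
 * Returns the list head, or NULL if nfds is 0 or an allocation fails.
 */
static struct fd_chunk *build_chunks(const struct pollfd *src, size_t nfds)
{
	struct fd_chunk *head = NULL, *walk = NULL;
	size_t left = nfds;

	assert(src != NULL || nfds == 0);
	while (left != 0) {
		size_t len = left > CHUNK_MAX ? CHUNK_MAX : left;
		struct fd_chunk *c = malloc(sizeof(*c) + len * sizeof(struct pollfd));

		if (c == NULL) {
			free_chunks(head);
			return NULL;
		}
		c->next = NULL;
		c->len = len;
		if (head == NULL)
			head = c;
		else
			walk->next = c;
		walk = c;
		/* same offset arithmetic as ufds + nfds - i in the kernel loop */
		memcpy(c->entries, src + (nfds - left), len * sizeof(struct pollfd));
		left -= len;
	}
	return head;
}
```

With CHUNK_MAX set to 4, ten input entries end up in three nodes of lengths 4, 4 and 2.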

Next, the file descriptors are polled to see whether any of them is ready.

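// nfds: number of file descriptors; head: head of the list holding the descriptors and events; table: used to put the process on wait queues; timeout: maximum time to poll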
fdcount = do_poll(nfds, head, &table, timeout);

Here is the implementation of do_poll (some code omitted).

static int do_poll(unsigned int nfds,  struct poll_list *list,
			struct poll_wqueues *wait, long timeout)
{
	int count = 0;
	poll_table* pt = &wait->pt;
	// timeout is 0: do not block even if nothing is ready, so there is no need to register on wait queues
	if (!timeout)
		pt = NULL;

	for (;;) {
		struct poll_list *walk;
		walk = list;
		while(walk != NULL) {
			do_pollfd( walk->len, walk->entries, &pt, &count);
			walk = walk->next;
		}
		// count != 0: some events are ready; !timeout: non-blocking call or the timeout has expired; signal_pending: a signal needs handling
		if (count || !timeout || signal_pending(current))
			break;
		// sleep; woken by a ready event, a signal, or timer expiry; the return value is the time remaining
		timeout = schedule_timeout(timeout);
	}
	return count;
}

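do_poll walks the linked list built earlier. If nothing is ready, a timeout was set, and no signal is pending, it puts the process to sleep until it is woken (either a ready event or the timeout wakes it). Let's see what happens to each pollfd structure during the walk.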
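The timeout branch can be observed from user space. The sketch below polls with no descriptors at all, so only the timeout path is exercised. The helper name timed_poll is hypothetical, and the use of CLOCK_MONOTONIC for measuring elapsed time is a choice made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <time.h>

/*
 * Hypothetical helper: call poll() once and return how long it blocked,
 * in milliseconds. poll()'s own return value is stored in *result.
 */
static long timed_poll(struct pollfd *fds, nfds_t nfds, int timeout_ms,
		       int *result)
{
	struct timespec t0, t1;

	assert(result != NULL);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	*result = poll(fds, nfds, timeout_ms);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) * 1000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000000L;
}
```

With a timeout of 0 the call should return 0 without sleeping, matching the !timeout check; with a timeout of 100 it should also return 0, but only after roughly 100 ms.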

static void do_pollfd(unsigned int num, struct pollfd * fdpage,
	poll_table ** pwait, int *count)
{
	int i;

	for (i = 0; i < num; i++) {
		int fd;
		unsigned int mask;
		struct pollfd *fdp;

		mask = 0;
		// the pollfd currently being processed
		fdp = fdpage+i;
		// the file descriptor to check
		fd = fdp->fd;
		if (fd >= 0) {
			// look up the file structure for fd
			struct file * file = fget(fd);
			mask = POLLNVAL;
			if (file != NULL) {
				mask = DEFAULT_POLLMASK;
				if (file->f_op && file->f_op->poll)
					// mask holds the events that are ready
					mask = file->f_op->poll(file, *pwait);
				// drop events the caller did not ask for; POLLERR and POLLHUP are always reported
				mask &= fdp->events | POLLERR | POLLHUP;
				fput(file);
			}
			// an event is ready: count it and stop registering on further wait queues
			if (mask) {
				*pwait = NULL;
				(*count)++;
			}
		}
		// store the ready events for the caller
		fdp->revents = mask;
	}
}
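Three branches of do_pollfd are visible from user space: a negative fd is skipped, an fd with no open file yields POLLNVAL, and POLLHUP is reported even when it was not requested. The sketch below assumes a Linux system; probe_fd and make_closed_fd are hypothetical helper names.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll one descriptor without blocking.
 * Returns its revents; poll()'s return value goes to *ready_count.
 */
static short probe_fd(int fd, short events, int *ready_count)
{
	struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };

	assert(ready_count != NULL);
	*ready_count = poll(&pfd, 1, 0);
	return pfd.revents;
}

/*
 * Hypothetical helper: return a descriptor number that was just closed,
 * so that the kernel finds no open file for it. Returns -1 on failure.
 */
static int make_closed_fd(void)
{
	int p[2];

	if (pipe(p) != 0)
		return -1;
	close(p[0]);
	close(p[1]);
	return p[1];
}
```

Probing -1 should give revents 0 and a count of 0. Probing the closed descriptor should give POLLNVAL with a count of 1, because any non-zero mask increments count.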

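This simply calls the poll function implemented by each kind of file (through file->f_op->poll) to determine whether any event is ready. Taking a pipe as the example, here is the rough implementation of its poll function.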

static unsigned int
pipe_poll(struct file *filp, poll_table *wait)
{
	...
	/* Reading only -- no need for acquiring the semaphore.  */
	nrbufs = info->nrbufs;
	mask = 0;
	if (filp->f_mode & FMODE_READ) {
		mask = (nrbufs > 0) ? POLLIN | POLLRDNORM : 0;
		if (!PIPE_WRITERS(*inode) && filp->f_version != PIPE_WCOUNTER(*inode))
			mask |= POLLHUP;
	}

	if (filp->f_mode & FMODE_WRITE) {
		mask |= (nrbufs < PIPE_BUFFERS) ? POLLOUT | POLLWRNORM : 0;
		if (!PIPE_READERS(*inode))
			mask |= POLLERR;
	}

	return mask;
}
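The masks computed by pipe_poll can be checked from user space on a Linux system. This is a sketch rather than a test of the 2.6.13 code itself: it runs against whatever kernel is installed, and the helper name probe is hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: non-blocking poll of one descriptor, returns revents. */
static short probe(int fd, short events)
{
	struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
	int n = poll(&pfd, 1, 0);

	assert(n >= 0);
	return pfd.revents;
}
```

On a fresh pipe, the read end should report nothing and the write end POLLOUT. After a write, the read end should report POLLIN. Once the last writer is closed, the read end should report POLLHUP, and a write end whose reader is gone should report POLLERR.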

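pipe_poll checks whether any event is ready. If nothing is ready, the kernel does two things.
1 It adds the process to the inode's wait queue (for a pipe this happens inside the poll function, through poll_wait and the __pollwait callback registered at the start).
2 It puts the process to sleep with a timer, so that it is woken when the timeout expires.
If a ready event fires before the timeout, the process is woken early. If no event fires, the process is woken when the timeout expires, and sys_poll then returns. That is the rough logic of sys_poll; the full flow is more complicated, especially the part that adds the process to the wait queues.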
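Both outcomes, an early wake-up by an event and a wake-up by the timer, can be reproduced with two processes. In this sketch a child process writes to a pipe after a delay while the parent blocks in poll. The function name poll_until_child_writes is hypothetical, and using poll with no descriptors as a millisecond sleep in the child is a convenience chosen for this example.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: the child writes one byte after delay_ms while the
 * parent waits in poll() for at most timeout_ms.
 * Returns poll()'s result in the parent: 1 if woken by the write,
 * 0 if the timeout expired first, -1 on failure.
 */
static int poll_until_child_writes(int delay_ms, int timeout_ms)
{
	int p[2];
	pid_t pid;
	struct pollfd pfd;
	int n;

	assert(delay_ms >= 0 && timeout_ms >= 0);
	if (pipe(p) != 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	if (pid == 0) {
		/* child: sleep, then make the read end ready */
		close(p[0]);
		poll(NULL, 0, delay_ms);
		if (write(p[1], "x", 1) != 1)
			_exit(1);
		_exit(0);
	}
	/* parent: block until the pipe is readable or the timer fires */
	pfd.fd = p[0];
	pfd.events = POLLIN;
	pfd.revents = 0;
	n = poll(&pfd, 1, timeout_ms);
	waitpid(pid, NULL, 0);	/* reap the child before cleaning up */
	close(p[0]);
	close(p[1]);
	return n;
}
```

With a 50 ms delay and a 2000 ms timeout the parent should be woken by the write and get 1. With a 300 ms delay and a 50 ms timeout the timer should fire first and poll should return 0.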
