
Analysis of the Linux Kernel epoll Implementation

Differences between epoll and select/poll

select, poll, and epoll are all I/O multiplexing mechanisms. I/O multiplexing lets a program monitor multiple descriptors through a single mechanism: as soon as any descriptor becomes ready, the program is notified so it can perform the corresponding operation.

select essentially represents descriptors with a bitmask, by default 32 32-bit integers, i.e. 32 * 32 = 1024 bits, so fd values must fall in the range 0 to 1023. When an fd exceeds this limit, FD_SETSIZE must be enlarged (and the program recompiled), which extends the representable range accordingly.

poll differs from select in that it passes the events of interest to the kernel through an array of struct pollfd, so there is no limit on the number of descriptors. The events and revents fields of struct pollfd carry, respectively, the events being watched and the events that occurred, so the pollfd array only needs to be initialized once, whereas select overwrites its fd_set arguments on every call.

epoll is a further optimization of poll: after it returns there is no need to iterate over all the fds, because the list of watched fds is maintained inside the kernel. select and poll keep that list in user space and pass it into the kernel on every call. Unlike poll/select, epoll is not a single system call but a trio of them, epoll_create/epoll_ctl/epoll_wait; the benefit of this split will become clear below. epoll is only supported by 2.6 and later kernels.
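
To make the contrast concrete, here is a minimal user-space sketch of how the two interfaces are driven (STDIN_FILENO serves purely as an illustrative descriptor): the fd_set must be rebuilt before every select call, while the pollfd array is filled in once and reused.

#include <poll.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
	/* select: a bitmask limited to FD_SETSIZE (1024 by default).
	 * The set is modified in place, so it must be rebuilt each call. */
	fd_set rfds;
	FD_ZERO(&rfds);
	FD_SET(STDIN_FILENO, &rfds);
	select(STDIN_FILENO + 1, &rfds, NULL, NULL, NULL);

	/* poll: events and revents are separate fields, so the array is
	 * initialized once and reused across calls. */
	struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };
	poll(&pfd, 1, -1);

	return 0;
}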


The major drawbacks of select/poll:
1. Every call to select/poll copies the fd set from user space into the kernel, an overhead that becomes large when there are many fds.
2. Every call also makes the kernel traverse all the fds passed in, which is likewise expensive with many fds.
3. select in particular supports too few file descriptors: 1024 by default.


Why epoll is more efficient than select/poll

A traditional poll call effectively starts from scratch every time: it reads the entire ufds array in from user space and copies it back out in full when done, and every call must also perform at least one add and one remove operation on each device's wait queue. These are the reasons for its inefficiency.

epoll's solution: each time a new event is registered with the epoll handle (EPOLL_CTL_ADD passed to epoll_ctl), that fd is copied into the kernel then, instead of being copied again on every epoll_wait; epoll guarantees each fd is copied only once over the whole process. select, poll, and epoll all use wait-queue callbacks to wake up the asynchronously waiting thread (arming an hrtimer if a timeout was set). The callbacks used by select and poll do essentially nothing, but epoll's wait-queue callback adds the fd that just became ready to a ready list and then wakes the waiting process. What epoll_wait returns is exactly this ready list, containing all the fds with pending events, so the kernel does not have to traverse every fd, and neither does the user-space program; each only walks the returned list of ready fds.
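
The three-syscall split described above looks like this in practice. A minimal sketch of an epoll event loop (listen_fd stands for a descriptor opened elsewhere and handle_event is a hypothetical application handler; error handling omitted):

#include <stdint.h>
#include <sys/epoll.h>

void handle_event(int fd, uint32_t events);	/* hypothetical handler */

void event_loop(int listen_fd)
{
	int epfd = epoll_create(1);	/* size is only a hint, ignored since kernel 2.6.8 */

	/* The fd is copied into the kernel once, here, not on every wait. */
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

	struct epoll_event ready[64];
	for (;;) {
		/* epoll_wait returns only the descriptors with pending events. */
		int n = epoll_wait(epfd, ready, 64, -1);
		for (int i = 0; i < n; i++)
			handle_event(ready[i].data.fd, ready[i].events);
	}
}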


Why implement eventpollfs

1. State can be kept inside the kernel and preserved across successive epoll_wait calls, for example the set of all monitored file descriptors.
2. An epoll instance is itself a file, so it can in turn be polled/epolled (see the nested-epoll example at the end of this article).


What conditions must an FD satisfy to be added to epoll?

In principle, anything that is a file can be added to epoll; "everything is a file" is one of Linux's core design ideas. The file structure itself reflects the support for epoll:

struct file {
	/*
	 * fu_list becomes invalid after file_free is called and queued via
	 * fu_rcuhead for RCU freeing
	 */
	union {
		struct list_head	fu_list;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
#define f_dentry	f_path.dentry /* dentry object associated with the file */
#define f_vfsmnt	f_path.mnt    /* mounted filesystem containing the file */
	const struct file_operations	*f_op; /* file operations table pointer */
	spinlock_t		f_lock;  /* f_ep_links, f_flags, no IRQ */
	atomic_long_t		f_count; /* reference counter of the file object */
	unsigned int 		f_flags; /* flags specified when the file was opened */
	fmode_t			f_mode;      /* process access mode */
	loff_t			f_pos;       /* current file offset */
	struct fown_struct	f_owner; /* data for I/O event notification via signals */
	const struct cred	*f_cred;
	struct file_ra_state	f_ra; /* file read-ahead state */

	u64			f_version; /* version number, incremented after each use */
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data; /* pointer to data needed by a filesystem or device driver */

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links; /* head of this file's list of event-poll hooks */
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping; /* pointer to the file's address-space object */
#ifdef CONFIG_DEBUG_WRITECOUNT
	unsigned long f_mnt_write_state;
#endif
};

Whether a file added to epoll can actually have events polled out of it, however, depends on two requirements (a driver-side sketch follows the list):

1. The file's file_operations must implement the poll operation.

2. A callback used to wake up the waiting processes must be registered on the file's wait queue.
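
For a device driver, both requirements are satisfied inside its poll method: calling poll_wait() gives the caller (epoll, via the poll_table) the chance to hook its callback onto the device's wait queue, and the returned mask reports current readiness. A minimal sketch for a hypothetical character device (mydev, readq, and data_available are all illustrative names, not real kernel APIs):

#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct mydev {
	wait_queue_head_t readq;	/* the driver wakes this queue when data arrives */
	/* ... device state ... */
};

static int data_available(struct mydev *dev);	/* hypothetical helper */

static unsigned int mydev_poll(struct file *file, poll_table *wait)
{
	struct mydev *dev = file->private_data;
	unsigned int mask = 0;

	/* Requirement 2: lets epoll register its wait-queue hook
	 * (via ep_ptable_queue_proc, shown below) on our queue. */
	poll_wait(file, &dev->readq, wait);

	/* Requirement 1: report the events that are ready right now. */
	if (data_available(dev))
		mask |= POLLIN | POLLRDNORM;
	return mask;
}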

The relationships among epoll's key data structures are as follows (original figure not reproduced here):
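In place of the figure, here is a simplified sketch of those structures as they appear in 2.6-era fs/eventpoll.c (fields trimmed and comments added for readability; these are not the complete definitions):

/* One epoll instance: what an epoll fd created by epoll_create refers to. */
struct eventpoll {
	spinlock_t lock;
	struct mutex mtx;		/* the "mtx" referenced by ep_insert below */
	wait_queue_head_t wq;		/* processes blocked in epoll_wait */
	wait_queue_head_t poll_wait;	/* used when this epoll fd is itself polled */
	struct list_head rdllist;	/* ready list: epitems with pending events */
	struct rb_root rbr;		/* RB tree of all monitored epitems */
	struct epitem *ovflist;		/* overflow chain used while events are being
					 * transferred to user space */
	struct user_struct *user;
};

/* One monitored fd within one epoll instance. */
struct epitem {
	struct rb_node rbn;		/* link into eventpoll.rbr */
	struct list_head rdllink;	/* link into eventpoll.rdllist */
	struct list_head fllink;	/* link into file.f_ep_links */
	struct list_head pwqlist;	/* the eppoll_entry hooks of this item */
	struct epoll_filefd ffd;	/* the watched file* and fd */
	struct eventpoll *ep;
	int nwait;			/* number of active wait-queue hooks */
	struct epitem *next;		/* link into eventpoll.ovflist */
	struct epoll_event event;	/* the events requested by the user */
};

/* One hook hanging on one wait queue of the watched file. */
struct eppoll_entry {
	struct list_head llink;		/* link into epitem.pwqlist */
	struct epitem *base;
	wait_queue_t wait;		/* its wakeup func is ep_poll_callback */
	wait_queue_head_t *whead;	/* the wait queue this hook hangs on */
};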


How epoll polls out events

Let's take a TCP socket as an example and analyze how events get polled out.

When an FD is added to epoll through the epoll_ctl system call, the ep_insert function is executed:

/*
 * Must be called with "mtx" held.
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
		     struct file *tfile, int fd)
{
	int error, revents, pwake = 0;
	unsigned long flags;
	struct epitem *epi;
	struct ep_pqueue epq;

	if (unlikely(atomic_read(&ep->user->epoll_watches) >=
		     max_user_watches))
		return -ENOSPC;
	if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
		return -ENOMEM;

	/* Item initialization follow here ... */
	INIT_LIST_HEAD(&epi->rdllink);
	INIT_LIST_HEAD(&epi->fllink);
	INIT_LIST_HEAD(&epi->pwqlist);
	epi->ep = ep;
	ep_set_ffd(&epi->ffd, tfile, fd);
	epi->event = *event;
	epi->nwait = 0;
	epi->next = EP_UNACTIVE_PTR;

	/* Initialize the poll table using the queue callback */
	epq.epi = epi;
	<span style="color:#ff0000;">init_poll_funcptr(&epq.pt, ep_ptable_queue_proc)</span>;

	/*
	 * Attach the item to the poll hooks and get current event bits.
	 * We can safely use the file* here because its usage count has
	 * been increased by the caller of this function. Note that after
	 * this operation completes, the poll callback can start hitting
	 * the new item.
	 */
	<span style="color:#ff0000;">revents = tfile->f_op->poll(tfile, &epq.pt)</span>;

	/*
	 * We have to check if something went wrong during the poll wait queue
	 * install process. Namely an allocation for a wait queue failed due
	 * high memory pressure.
	 */
	error = -ENOMEM;
	if (epi->nwait < 0)
		goto error_unregister;

	/* Add the current item to the list of active epoll hook for this file */
	spin_lock(&tfile->f_lock);
	list_add_tail(&epi->fllink, &tfile->f_ep_links);
	spin_unlock(&tfile->f_lock);

	/*
	 * Add the current item to the RB tree. All RB tree operations are
	 * protected by "mtx", and ep_insert() is called with "mtx" held.
	 */
	ep_rbtree_insert(ep, epi);

	/* We have to drop the new item inside our item list to keep track of it */
	spin_lock_irqsave(&ep->lock, flags);

	/* If the file is already "ready" we drop it inside the ready list */
	if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
		list_add_tail(&epi->rdllink, &ep->rdllist);

		/* Notify waiting tasks that events are available */
		if (waitqueue_active(&ep->wq))
			wake_up_locked(&ep->wq);
		if (waitqueue_active(&ep->poll_wait))
			pwake++;
	}
<span style="white-space:pre">	</span>...
	return error;
}
First an ep_pqueue structure is initialized to register the wait-queue setup callback; the subsequent call to the file's poll method then invokes that registered callback, which links a wait-queue entry onto the file's corresponding wait queue.
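
The glue between these two steps is poll_wait(): a file's poll method calls it, and it simply forwards to whatever callback was registered in the poll_table, which here is ep_ptable_queue_proc. Its 2.6-era definition in include/linux/poll.h is just:

static inline void poll_wait(struct file *filp,
			     wait_queue_head_t *wait_address, poll_table *p)
{
	if (p && wait_address)
		p->qproc(filp, wait_address, p);
}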

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
				 poll_table *pt)
{
	struct epitem *epi = ep_item_from_epqueue(pt);
	struct eppoll_entry *pwq;

	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
		pwq->whead = whead;
		pwq->base = epi;
		add_wait_queue(whead, &pwq->wait);
		list_add_tail(&pwq->llink, &epi->pwqlist);
		epi->nwait++;
	} else {
		/* We have to signal that an error occurred */
		epi->nwait = -1;
	}
}
ep_poll_callback is the callback executed when the descriptor's state changes or a matching event occurs. Taking TCP as the example, a state change triggers the following call chain:

sock_def_wakeup (installed on the sock by sock_init_data) ---> wake_up_interruptible_all ---> __wake_up ---> curr->func (which, for a descriptor added to epoll, is ep_poll_callback)
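
For reference, sock_def_wakeup is only a few lines; shown here roughly as it appears in 2.6-era net/core/sock.c (details vary across kernel versions), it wakes everything sleeping on the socket's wait queue, which is how ep_poll_callback ends up running:

static void sock_def_wakeup(struct sock *sk)
{
	read_lock(&sk->sk_callback_lock);
	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
		wake_up_interruptible_all(sk->sk_sleep);
	read_unlock(&sk->sk_callback_lock);
}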

/*
 * This is the callback that is passed to the wait queue wakeup
 * mechanism. It is called by the stored file descriptors when they
 * have events to report.
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	int pwake = 0;
	unsigned long flags;
	struct epitem *epi = ep_item_from_wait(wait);
	struct eventpoll *ep = epi->ep;

	spin_lock_irqsave(&ep->lock, flags);

	/*
	 * If the event mask does not contain any poll(2) event, we consider the
	 * descriptor to be disabled. This condition is likely the effect of the
	 * EPOLLONESHOT bit that disables the descriptor when an event is received,
	 * until the next EPOLL_CTL_MOD will be issued.
	 */
	if (!(epi->event.events & ~EP_PRIVATE_BITS))
		goto out_unlock;

	/*
	 * Check the events coming with the callback. At this stage, not
	 * every device reports the events in the "key" parameter of the
	 * callback. We need to be able to handle both cases here, hence the
	 * test for "key" != NULL before the event match test.
	 */
	if (key && !((unsigned long) key & epi->event.events))
		goto out_unlock;

	/*
	 * If we are transferring events to userspace, we can hold no locks
	 * (because we're accessing user memory, and because of linux f_op->poll()
	 * semantics). All the events that happen during that period of time are
	 * chained in ep->ovflist and requeued later on.
	 */
	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
		if (epi->next == EP_UNACTIVE_PTR) {
			epi->next = ep->ovflist;
			ep->ovflist = epi;
		}
		goto out_unlock;
	}

	/* If this file is already in the ready list we exit soon */
	if (!ep_is_linked(&epi->rdllink))
		list_add_tail(&epi->rdllink, &ep->rdllist);

	/*
	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
	 * wait list.
	 */
	if (waitqueue_active(&ep->wq))
		<span style="color:#ff0000;">wake_up_locked(&ep->wq)</span>;
	if (waitqueue_active(&ep->poll_wait))
		pwake++;

out_unlock:
	spin_unlock_irqrestore(&ep->lock, flags);

	/* We have to call this outside the lock */
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);

	return 1;
}
What ep_poll_callback does is add the now-ready epitem to ep->rdllist and then wake up the processes waiting on the corresponding descriptors. The waiting side is the epoll_wait system call: its wait loop checks whether rdllist is empty, breaks out of the loop once it is not, then scans rdllist and delivers the events that have occurred to user space.
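
An abbreviated sketch of that wait loop, ep_poll in 2.6-era fs/eventpoll.c (the event-delivery step and several details are elided; this is not the complete function):

static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
		   int maxevents, long timeout)
{
	int res = 0;
	unsigned long flags;
	wait_queue_t wait;
	long jtimeout = (timeout < 0) ? MAX_SCHEDULE_TIMEOUT
				      : (timeout * HZ + 999) / 1000;

	spin_lock_irqsave(&ep->lock, flags);
	if (list_empty(&ep->rdllist)) {
		/* Nothing ready yet: sleep on ep->wq until ep_poll_callback
		 * wakes us, or the timeout expires, or a signal arrives. */
		init_waitqueue_entry(&wait, current);
		__add_wait_queue(&ep->wq, &wait);
		for (;;) {
			set_current_state(TASK_INTERRUPTIBLE);
			if (!list_empty(&ep->rdllist) || !jtimeout)
				break;
			if (signal_pending(current)) {
				res = -EINTR;
				break;
			}
			spin_unlock_irqrestore(&ep->lock, flags);
			jtimeout = schedule_timeout(jtimeout);
			spin_lock_irqsave(&ep->lock, flags);
		}
		__remove_wait_queue(&ep->wq, &wait);
		set_current_state(TASK_RUNNING);
	}
	spin_unlock_irqrestore(&ep->lock, flags);

	/* ... scan ep->rdllist and copy the ready events to user space ... */
	return res;
}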

In current Linux, pipe fds, timerfd, signalfd, eventfd, and the like can all be added to epoll; in addition, an epoll instance can itself be added as an FD to another epoll.
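
A short user-space illustration of both points: an eventfd is watched by an inner epoll instance, and the inner epoll fd is in turn watched by an outer one (error handling omitted):

#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	int efd = eventfd(0, 0);
	int inner = epoll_create(1);
	int outer = epoll_create(1);

	struct epoll_event ev = { .events = EPOLLIN };

	ev.data.fd = efd;
	epoll_ctl(inner, EPOLL_CTL_ADD, efd, &ev);	/* eventfd -> inner epoll */

	ev.data.fd = inner;
	epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);	/* epoll fd -> outer epoll */

	uint64_t one = 1;
	write(efd, &one, sizeof(one));	/* make the eventfd readable */

	/* Readiness propagates: inner becomes readable, so outer reports it. */
	struct epoll_event out;
	int n = epoll_wait(outer, &out, 1, 0);
	printf("outer epoll reported %d ready fd(s)\n", n);
	return 0;
}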

libevent

libevent is an event-driven network library that runs on Windows, Linux, BSD, and other platforms; internally it wraps the per-platform event-management mechanisms such as select, epoll, and kqueue. See http://libevent.org/ for details.
