poll && epoll: how epoll is implemented

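The two bottlenecks that limit poll's efficiency have been identified; the question now is how to remove them. First, suppose we monitor 1000 fds: every call to poll copies all 1000 fds into the kernel, which is wasteful. Why doesn't the kernel simply keep the fds it has already been given? That is exactly what epoll does, and its API makes the point clear: fds are not passed in at epoll_wait time. They are registered with the kernel through epoll_ctl and then waited on together, which eliminates the repeated copying. Second, epoll_wait does not add current to the wait queue of each fd's device in turn. Instead, when a device's wait queue is woken, a callback function is invoked (this requires a "wakeup callback" mechanism) that puts the fd with the event onto a list, and epoll_wait returns the fds on that list.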
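Seen from user space, the "register once, wait many times" split described above looks roughly like the sketch below. It is only an illustration: the pipe and the function name wait_for_pipe_readable are assumptions made for the example, and it uses epoll_create1(), the newer variant of the epoll_create call discussed in this article.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe once, makes it readable, then waits.
 * Returns the number of ready events, or -1 on error.
 * (Hypothetical helper written for this example.) */
int wait_for_pipe_readable(void)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    int epfd = epoll_create1(0);   /* the kernel-side set of watched fds */
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    int n = -1;

    /* Step 1: hand the fd to the kernel once, through epoll_ctl. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "x", 1) == 1) {
        /* Step 2: wait. No list of fds is passed in here. */
        struct epoll_event ready[4];
        n = epoll_wait(epfd, ready, 4, 1000);
        if (n == 1 && ready[0].data.fd != pipefd[0])
            n = -1;                /* unexpected fd reported */
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Calling wait_for_pipe_readable() should report exactly one ready event, the pipe's read end. In a real program the epoll_ctl call happens once per fd, while epoll_wait sits in the event loop.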

In addition, the epoll mechanism implements a special filesystem of its own, the eventpoll filesystem.

1. Kernel data structures

(1) struct eventpoll {
    spinlock_t lock;
    struct mutex mtx;
    wait_queue_head_t wq;        /* Wait queue used by sys_epoll_wait(); a caller of epoll_wait() sleeps on this queue */
    wait_queue_head_t poll_wait; /* Wait queue used by file->poll(); used when the epoll fd itself is being polled */
    struct list_head rdllist;    /* List of ready file descriptors; every epitem that is ready is on this list */
    struct rb_root rbr;          /* RB tree root used to store monitored fd structs; every monitored epitem is in this tree */
    struct epitem *ovflist;      /* epitems whose events arrived while events were being transferred to user space */
    struct user_struct *user;    /* per-user data, such as the limit on the number of watched fds */
};

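Sockets added to this epoll descriptor for monitoring through the epoll_ctl interface belong to the socket filesystem, not to the eventpoll filesystem; keep this in mind. Each item added for monitoring ("monitoring" here has nothing to do with the listen call) corresponds to one epitem structure. The epitems are organized as a red-black tree whose root is stored in eventpoll (the rbr member). In addition, the epitems of sockets on which a monitored event has arrived are linked into a doubly linked list whose head is stored in eventpoll (the rdllist member).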

/* Each file descriptor added to the eventpoll interface will have an entry of this type linked to the "rbr" RB tree. */

(2) struct epitem {
    struct rb_node rbn;        /* RB tree node used to link this structure to the eventpoll RB tree */
    struct list_head rdllink;  /* List node; every ready epitem is linked into eventpoll's rdllist */
    struct epitem *next;
    struct epoll_filefd ffd;   /* The file descriptor information this item refers to */
    int nwait;                 /* Number of active wait queues attached to poll operations */
    struct list_head pwqlist;  /* List containing poll wait queues */
    struct eventpoll *ep;      /* The "container" of this item */
    struct list_head fllink;   /* List header used to link this item to the "struct file" items list */
    struct epoll_event event;  /* The events this epitem is interested in; passed in from user space by epoll_ctl */
};

(3) struct epoll_filefd {
    struct file *file;
    int fd;
};

(4) struct eppoll_entry {      /* Wait structure used by the poll hooks */
    struct list_head llink;    /* List header used to link this structure to the "struct epitem" */
    struct epitem *base;       /* The "base" pointer is set to the container "struct epitem" */
    wait_queue_t wait;         /* Wait queue item that will be linked to the target file wait queue head */
    wait_queue_head_t *whead;  /* The wait queue head that linked the "wait" wait queue item */
}; // Note: the last two members together describe this entry's place on a wait queue

(5) struct ep_pqueue {         /* Wrapper struct used by poll queueing */
    poll_table pt;             // poll_table is a wrapper around a function pointer
    struct epitem *epi;
};

(6) struct ep_send_events_data {
    /* Used by the ep_send_events() function as callback private data */
    int maxevents;
    struct epoll_event __user *events;
};

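The relationships among these data structures are shown in the figure below:

2. Function call analysis

Overall call graph of the epoll functions:

3. Function implementation analysis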

3.1 eventpoll_init

epoll is a module, so we start with the module's entry point, eventpoll_init.
[fs/eventpoll.c->eventpoll_init()] (simplified)
static int __init eventpoll_init(void)
{
    int error;

    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
            0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);

    pwq_cache = kmem_cache_create("eventpoll_pwq",
            sizeof(struct eppoll_entry), 0, EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);

    // register a new filesystem called "eventpollfs"
    error = register_filesystem(&eventpoll_fs_type);
    eventpoll_mnt = kern_mount(&eventpoll_fs_type);
    return 0;
}

Interestingly, when the module initializes it registers a new filesystem called "eventpollfs" (described by the eventpoll_fs_type structure) and then mounts it. It also creates two kernel caches, one for struct epitem and one for struct eppoll_entry. (In kernel programming, code that frequently allocates small blocks of memory should create a kmem_cache to act as a "memory pool".)
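To make the "memory pool" idea concrete, here is a minimal user-space sketch of a fixed-size object pool built on a free list. It is not how kmem_cache is implemented inside the kernel; the names struct pool, pool_alloc, pool_free and the capacity of 4 are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAPACITY 4

/* Hypothetical fixed-size object; stands in for something like struct epitem. */
struct item {
    int fd;
};

/* A free slot stores the link to the next free slot; a live slot stores an item. */
union slot {
    union slot *next_free;
    struct item obj;
};

struct pool {
    union slot slots[POOL_CAPACITY];
    union slot *free_list;
};

/* Thread every slot onto the free list. */
void pool_init(struct pool *p)
{
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_CAPACITY; i++) {
        p->slots[i].next_free = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Pop a slot off the free list; returns NULL when the pool is exhausted. */
struct item *pool_alloc(struct pool *p)
{
    union slot *s = p->free_list;
    if (s == NULL)
        return NULL;
    p->free_list = s->next_free;
    return &s->obj;
}

/* Push the slot back; the most recently freed slot is reused first. */
void pool_free(struct pool *p, struct item *it)
{
    union slot *s = (union slot *)it;
    s->next_free = p->free_list;
    p->free_list = s;
}
```

Allocation and release are both a couple of pointer operations, and all objects have the same size, which is the property that makes a dedicated cache attractive for objects such as struct epitem that are created and destroyed often.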

Now think about why epoll_create returns a new fd: because it creates a new file in this "eventpollfs" filesystem. Here is the code:

3.2 sys_epoll_create

[fs/eventpoll.c->sys_epoll_create()]
asmlinkage long sys_epoll_create(int size)
{
    int error, fd;
    struct inode *inode;
    struct file *file;
    error = ep_getfd(&fd, &inode, &file);
    /* Setup the file internal data structure ( "struct eventpoll" ) */
    error = ep_file_init(file);

}

The function is simple. ep_getfd looks like a "get", but when epoll_create is called it actually creates a new inode, a new file, and a new fd. ep_file_init creates a struct eventpoll and stores it in file->private_data. Remember this private_data; it is used again later.
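Because the descriptor returned by epoll_create refers to a real file, it can itself be handed to poll(), which is what the poll_wait queue in struct eventpoll is for. The sketch below checks this from user space; the function name poll_epoll_fd and the pipe are assumptions made for the example, and epoll_create1() is used in place of epoll_create.

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Watches a pipe with an epoll fd, then calls poll() on the epoll fd itself.
 * Returns the revents reported for the epoll fd, or -1 on error.
 * If want_data is nonzero, one byte is written to the pipe first.
 * (Hypothetical helper written for this example.) */
int poll_epoll_fd(int want_data)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    int result = -1;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        (!want_data || write(pipefd[1], "x", 1) == 1)) {
        /* The epoll fd is just another file, so poll() accepts it. */
        struct pollfd pfd = { .fd = epfd, .events = POLLIN };
        if (poll(&pfd, 1, 0) >= 0)
            result = pfd.revents;
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

With nothing ready, poll() reports no events for the epoll fd; once the watched pipe has data, the epoll fd itself becomes readable.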

3.3 epoll_ctl

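With epoll_create done, it is epoll_ctl's turn. We omit the checking code:
[fs/eventpoll.c->sys_epoll_ctl()]
asmlinkage long
sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event)
{
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;
    ....
    epi = ep_find(ep, tfile, fd); // look up the epitem in the rb-tree for the fd to be monitored (tfile is that fd's struct file)
    switch (op) { // NULL checks omitted
    case EPOLL_CTL_ADD:
        epds.events |= POLLERR | POLLHUP;
        error = ep_insert(ep, &epds, tfile, fd);
        break;
    case EPOLL_CTL_DEL:
        error = ep_remove(ep, epi);
        break;
    case EPOLL_CTL_MOD:
        epds.events |= POLLERR | POLLHUP;
        error = ep_modify(ep, epi, &epds);
        break;
    }
    ....
}

So the function first calls ep_find on a "big structure" (struct eventpoll) to look for a struct epitem, and then, depending on whether the user asked for ADD, DEL, or MOD, calls the corresponding function, which adds, removes, or modifies the matching node in the red-black tree of epitems (one node per monitored fd). Quite straightforward. What is this "big structure"? From the way ep_find is called, the ep parameter must be a pointer to it, and the assignment ep = file->private_data shows that it is the struct eventpoll created in epoll_create. Looking at ep_find's implementation, it searches the rbr member of struct eventpoll (a struct rb_root), which is the root of a red-black tree whose nodes are all struct epitem.

Now it is clear: a newly created epoll file carries a struct eventpoll, that structure holds a red-black tree, and the tree is where the fds passed to epoll_ctl are stored.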
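The lookup-then-dispatch logic is visible from user space through the error codes epoll_ctl documents: adding an fd that is already in the tree fails with EEXIST, and modifying or deleting one that is not there fails with ENOENT. The sketch below exercises this; the helper names ctl and ctl_sequence are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Runs one epoll_ctl operation; returns 0 on success or the errno value. */
static int ctl(int epfd, int op, int fd, uint32_t events)
{
    struct epoll_event ev = { .events = events, .data.fd = fd };
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/* Applies MOD, ADD, ADD, MOD, DEL, DEL to one pipe fd and stores each
 * result in out[0..5]. Returns 0, or -1 if the setup failed.
 * (Hypothetical helper written for this example.) */
int ctl_sequence(int out[6])
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    int fd = pipefd[0];

    out[0] = ctl(epfd, EPOLL_CTL_MOD, fd, EPOLLIN);           /* not registered yet */
    out[1] = ctl(epfd, EPOLL_CTL_ADD, fd, EPOLLIN);           /* inserted */
    out[2] = ctl(epfd, EPOLL_CTL_ADD, fd, EPOLLIN);           /* already registered */
    out[3] = ctl(epfd, EPOLL_CTL_MOD, fd, EPOLLIN | EPOLLET); /* found, modified */
    out[4] = ctl(epfd, EPOLL_CTL_DEL, fd, 0);                 /* found, removed */
    out[5] = ctl(epfd, EPOLL_CTL_DEL, fd, 0);                 /* gone again */

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}
```

Each result depends only on whether the fd currently has an entry in the epoll instance, which matches the "find first, then act" structure of sys_epoll_ctl.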

3.4 sys_epoll_wait

Now that the data structures are clear, let's look at the core:
[fs/eventpoll.c->sys_epoll_wait()]
asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events, int maxevents,
                               int timeout)
{
    struct file *file;
    struct eventpoll *ep;
    /* Get the "struct file *" for the eventpoll file */
    file = fget(epfd);
    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     * (So passing the fd of an ordinary file here is an error.)
     */
    if (!IS_FILE_EPOLL(file))
        goto eexit_2;

    ep = file->private_data;
    error = ep_poll(ep, events, maxevents, timeout);

    ......
}

The same trick again: take the struct eventpoll from file->private_data, then call ep_poll.
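The "is this really an eventpoll file" check can be observed from user space: calling epoll_wait on a descriptor that is not an epoll fd fails with EINVAL. A small sketch, in which the function name epoll_wait_on_plain_fd is an assumption made for the example:

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Calls epoll_wait() on a pipe fd, which is not an epoll fd.
 * Returns the resulting errno, 0 if the call unexpectedly succeeded,
 * or -1 if the pipe could not be created.
 * (Hypothetical helper written for this example.) */
int epoll_wait_on_plain_fd(void)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    struct epoll_event ready[1];
    int rc = epoll_wait(pipefd[0], ready, 1, 0);
    int err = (rc == -1) ? errno : 0;

    close(pipefd[0]);
    close(pipefd[1]);
    return err;
}
```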

3.5 ep_poll()

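[fs/eventpoll.c->sys_epoll_wait()->ep_poll()]
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents,
                   long timeout)
{
    int res;
    wait_queue_t wait; // wait queue entry
    if (list_empty(&ep->rdllist)) {
        // ep->rdllist holds the ready fds; if it is empty, no fd is ready yet, so the current process must sleep on ep->wq
        init_waitqueue_entry(&wait, current); // create a wait queue entry and initialize it with the current process (current)
        add_wait_queue(&ep->wq, &wait); // add the entry to ep's wait queue, i.e. queue the current process
        for (;;) {
            /* Set the task state to TASK_INTERRUPTIBLE before the checks below, so that we do not go to sleep if ep_poll_callback() sends a wakeup in between */
            set_current_state(TASK_INTERRUPTIBLE);
            if (!list_empty(&ep->rdllist) || !jtimeout) // leave the loop if ep->rdllist is non-empty (some fd is ready) or the timeout has expired
                break;
            if (signal_pending(current)) {
                res = -EINTR;
                break;
            }
            jtimeout = schedule_timeout(jtimeout); // sleep until woken up or until the timeout expires
        }
        remove_wait_queue(&ep->wq, &wait); // take the entry (the current process) off the wait queue
        set_current_state(TASK_RUNNING);
    }
    ....

Another big loop, but this one is better than poll's: on a closer look, it does nothing except sleep and check whether ep->rdllist is empty. Doing nothing is of course efficient, but who makes ep->rdllist non-empty? The answer is the callback installed by ep_insert.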
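The two exit conditions of that loop, "the ready list is non-empty" and "the timeout has run out", can be observed from user space. The following is a rough sketch under stated assumptions: the function name ready_count and the pipe are invented for the example, and epoll_create1() is used in place of epoll_create.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Watches a pipe and calls epoll_wait() once with the given timeout.
 * If make_ready is nonzero, the pipe is made readable first.
 * Returns what epoll_wait() returned, or -1 on setup error.
 * (Hypothetical helper written for this example.) */
int ready_count(int make_ready, int timeout_ms)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    int n = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        (!make_ready || write(pipefd[1], "x", 1) == 1)) {
        struct epoll_event ready[1];
        /* Returns at once if something is ready; otherwise sleeps
         * until an event arrives or timeout_ms has elapsed. */
        n = epoll_wait(epfd, ready, 1, timeout_ms);
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

With nothing ready, a timeout of 0 returns immediately with 0 events, and a short positive timeout returns 0 after sleeping. With the pipe readable, even an infinite timeout (-1) returns right away because the ready list is not empty.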

3.6 ep_insert()

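[fs/eventpoll.c->sys_epoll_ctl()->ep_insert()]
static int ep_insert(struct eventpoll *ep, struct epoll_event *event, struct file *tfile, int fd)
{
    int revents;
    struct epitem *epi;
    struct ep_pqueue epq; // create an ep_pqueue object
    epi = EPI_MEM_ALLOC(); // allocate an epitem
    /* initialize this epitem ... */
    epi->ep = ep; // point the new epitem back at the struct eventpoll that was passed in

    /* the next few lines fill in the epitem's fields */
    EP_SET_FFD(&epi->ffd, tfile, fd); // store the file and fd to be monitored in the new epitem
    epi->event = *event;
    epi->nwait = 0;

    /* Initialize the poll table using the queue callback */
    epq.epi = epi; // associate the epq with the newly inserted epitem (epi)

    // the next line is equivalent to epq.pt.qproc = ep_ptable_queue_proc;
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

    revents = tfile->f_op->poll(tfile, &epq.pt); // tfile is the target file, i.e. the monitored file; poll() returns the mask of ready events, stored in revents

    list_add_tail(&epi->fllink, &tfile->f_ep_links); // each file keeps a list of all the epitems that monitor it

    ep_rbtree_insert(ep, epi); // with everything set up, insert the epitem into its eventpoll

    ......
}

The call tfile->f_op->poll(tfile, &epq.pt) simply invokes the poll method of the monitored file (the "target file" in epoll terms). That poll method calls poll_wait (remember poll_wait? every device driver that supports poll must call it), which in turn ends up calling ep_ptable_queue_proc. (Note: f_op->poll() is generally just a wrapper that calls the real poll implementation. Taking a UDP socket as an example, the call flow is: f_op->poll(), sock_poll(), udp_poll(), datagram_poll(), sock_poll_wait().) This call relationship is harder to follow than most because it is not a direct call at the language level. ep_insert also puts the struct epitem on the f_ep_links list in struct file so that it is easy to find; the fllink member of struct epitem exists for this purpose.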

3.7 ep_ptable_queue_proc

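[fs/eventpoll.c->ep_ptable_queue_proc()]
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt)
{
    struct epitem *epi = EP_ITEM_FROM_EPQUEUE(pt);
    struct eppoll_entry *pwq;
    if (epi->nwait >= 0 && (pwq = PWQ_MEM_ALLOC())) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}

The code above does the most important job in ep_insert: it creates a struct eppoll_entry, sets its wakeup callback to ep_poll_callback, and adds it to the device's wait queue (the whead here is the wait queue that every device driver carries, as described in the previous chapter). Only then will ep_poll_callback be called when the device becomes ready and wakes the waiters on its wait queue. With the poll system call, the operating system must hang current (the current process) on the wait queues of all the devices behind the fds on every single call; with thousands of fds, that is expensive. epoll_wait has no such overhead: epoll hooks an entry onto each device's wait queue only once, at epoll_ctl time (this first pass cannot be avoided), and gives each fd the instruction "call the callback when you are ready". When a device has an event, the callback puts the fd on rdllist, and each call to epoll_wait only has to collect the fds on rdllist. epoll makes clever use of callbacks to build a more efficient event-driven model.

By now we can guess what ep_poll_callback does: it must take the epitem (one per fd) on the red-black tree (ep->rbr) that received an event and insert it into ep->rdllist, so that when epoll_wait returns, rdllist holds exactly the ready fds.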

3.8 ep_poll_callback

[fs/eventpoll.c->ep_poll_callback()]
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    struct epitem *epi = EP_ITEM_FROM_WAIT(wait);
    struct eventpoll *ep = epi->ep;
    /* If this file is already in the ready list we exit soon */
    if (EP_IS_LINKED(&epi->rdllink))
        goto is_linked;
    list_add_tail(&epi->rdllink, &ep->rdllist);
is_linked:
    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    if (waitqueue_active(&ep->wq))
        wake_up(&ep->wq);
    if (waitqueue_active(&ep->poll_wait))
        pwake++;
    ....
    return 1;
}
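The interplay of ep_ptable_queue_proc and ep_poll_callback can be modelled in plain user-space C. The sketch below is a toy, not kernel code: every toy_ name is invented for the example, singly linked lists replace the kernel's list_head and wait queue types, and items are pushed at the head of the lists rather than appended at the tail.

```c
#include <assert.h>
#include <stddef.h>

struct toy_item;
struct toy_ep;
struct toy_wait_entry;

typedef void (*toy_wake_fn)(struct toy_wait_entry *entry);

/* One entry hooked on a device's wait queue (compare struct eppoll_entry). */
struct toy_wait_entry {
    toy_wake_fn func;               /* callback run when the device wakes its waiters */
    struct toy_item *item;          /* the per-fd record this entry belongs to */
    struct toy_wait_entry *next;
};

/* Per-fd record (compare struct epitem). */
struct toy_item {
    int fd;
    int on_ready_list;
    struct toy_item *ready_next;
    struct toy_ep *ep;
};

/* The container (compare struct eventpoll, ready list only). */
struct toy_ep {
    struct toy_item *ready_head;
    int ready_count;
};

/* A device with its own wait queue. */
struct toy_device {
    struct toy_wait_entry *waiters;
};

/* Wake callback: link the item onto the ready list unless it is already there. */
static void toy_poll_callback(struct toy_wait_entry *entry)
{
    struct toy_item *it = entry->item;
    if (it->on_ready_list)
        return;
    it->on_ready_list = 1;
    it->ready_next = it->ep->ready_head;
    it->ep->ready_head = it;
    it->ep->ready_count++;
}

/* Done once per fd, at "epoll_ctl(ADD)" time: hook the callback on the device. */
void toy_insert(struct toy_ep *ep, struct toy_item *it,
                struct toy_wait_entry *we, struct toy_device *dev, int fd)
{
    it->fd = fd;
    it->on_ready_list = 0;
    it->ready_next = NULL;
    it->ep = ep;
    we->func = toy_poll_callback;
    we->item = it;
    we->next = dev->waiters;
    dev->waiters = we;
}

/* What a driver does when data arrives: run every waiter's callback. */
void toy_wake_up(struct toy_device *dev)
{
    for (struct toy_wait_entry *we = dev->waiters; we != NULL; we = we->next)
        we->func(we);
}
```

After two devices have been registered, waking one of them twice leaves exactly one item on the ready list, the one for the device that fired. The "waiter" never scans the devices; it only looks at the ready list, which is the point of the design.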

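4. EPOLLET, a flag unique to epoll

EPOLLET is a flag that only the epoll system calls have; ET stands for Edge Triggered. Its exact meaning and usage are easy to look up online. With EPOLLET, repeated events no longer keep interrupting the program's logic, which is why it is used so often. So how does EPOLLET work?
Earlier we saw that epoll hangs a callback on every fd; when the device behind an fd has news, the callback puts the fd on the rdllist list, so epoll_wait only has to check rdllist to know which fds have events. Let's look at the last few lines of ep_poll: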

4.1 ep_poll() (continued from 3.5)

[fs/eventpoll.c->ep_poll()]

    /* Try to transfer events to user space. */
    ep_events_transfer(ep, events, maxevents);
    ......

Copying the fds on rdllist to user space is the job of ep_events_transfer.

4.2 ep_events_transfer

[fs/eventpoll.c->ep_events_transfer()]
static int ep_events_transfer(struct eventpoll *ep, struct epoll_event __user *events,
                              int maxevents)
{
    int eventcnt = 0;
    struct list_head txlist;
    INIT_LIST_HEAD(&txlist);
    /* Taken here; released by the up_read() below */
    down_read(&ep->sem);
    /* Collect/extract ready items */
    if (ep_collect_ready_items(ep, &txlist, maxevents) > 0) {
        /* Build result set in userspace */
        eventcnt = ep_send_events(ep, &txlist, events);
        /* Reinject ready items into the ready list */
        ep_reinject_items(ep, &txlist);
    }
    up_read(&ep->sem);
    return eventcnt;
}

There is not much code. ep_collect_ready_items moves the fds on rdllist over to txlist (leaving rdllist empty); ep_send_events then copies the fds on txlist to user space; finally ep_reinject_items "returns" some of the fds from txlist to rdllist so that they are found on rdllist again next time.
Here is the implementation of ep_send_events:

4.3 ep_send_events()

[fs/eventpoll.c->ep_send_events()]
static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,
                          struct epoll_event __user *events)
{
    int eventcnt = 0;
    unsigned int revents;
    struct list_head *lnk;
    struct epitem *epi;
    list_for_each(lnk, txlist) {
        epi = list_entry(lnk, struct epitem, txlink);
        // call the poll method of each monitored file to fetch its mask of ready events into revents
        revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);

        epi->revents = revents & epi->event.events;
        if (epi->revents) {
            // copy the event from kernel space to user space
            if (__put_user(epi->revents, &events[eventcnt].events) ||
                __put_user(epi->event.data, &events[eventcnt].data))
                return -EFAULT;
            if (epi->event.events & EPOLLONESHOT)
                epi->event.events &= EP_PRIVATE_BITS;
            eventcnt++;
        }
    }
    return eventcnt;
}

The copying itself is unremarkable, but note the line that calls f_op->poll(epi->ffd.file, NULL). This poll is sneaky: it is called with the second argument set to NULL. First let's see how a device driver usually implements poll:
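static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;
    poll_wait(filp, &dev->inq, wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM; /* readable */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM; /* writable */
    return mask;
}

The code above is taken from Linux Device Drivers, 3rd Edition, an absolute classic. The device first hangs current (the current process) on the inq and outq queues (the "hanging" is done by the callback function pointer inside wait) and then waits for the device to wake it up. After the wakeup, the event mask is available through mask (note the mask variable: it is what carries the event mask). So what does poll_wait do when wait is NULL?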

4.4 poll_wait

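[include/linux/poll.h->poll_wait]
static inline void poll_wait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p)
{
    if (p && wait_address)
        p->qproc(filp, wait_address, p);
}

There it is: if the poll_table is NULL, nothing is done. Going back to ep_send_events, the poll call with the NULL argument really means "I do not want to sleep, I only want the event mask". The mask obtained this way is then copied to user space. Once ep_send_events is done, it is ep_reinject_items' turn.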
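The effect of passing NULL can be reproduced with a toy model in user-space C. All toy_ names below are invented for the example, and the "wait queue" is reduced to a counter of registrations; only the guard in toy_poll_wait mirrors the poll_wait code above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POLLIN 0x1u

struct toy_poll_table;
typedef void (*toy_queue_proc)(int *wait_queue, struct toy_poll_table *pt);

/* Wrapper around a function pointer, like poll_table. */
struct toy_poll_table {
    toy_queue_proc qproc;
    int registrations;      /* how many times qproc has been called */
};

/* A device: its "wait queue" is just a count of hooked entries. */
struct toy_dev {
    int wait_queue;
    int has_data;
};

/* Same guard as poll_wait: with a NULL table, nothing is registered. */
static void toy_poll_wait(int *wait_address, struct toy_poll_table *p)
{
    if (p && wait_address)
        p->qproc(wait_address, p);
}

/* Driver-style poll method: optionally register, always report the mask. */
unsigned toy_dev_poll(struct toy_dev *dev, struct toy_poll_table *pt)
{
    unsigned mask = 0;
    toy_poll_wait(&dev->wait_queue, pt);
    if (dev->has_data)
        mask |= TOY_POLLIN;
    return mask;
}

/* Queueing callback used for the first, registering call. */
void toy_count_registration(int *wait_queue, struct toy_poll_table *pt)
{
    (*wait_queue)++;
    pt->registrations++;
}
```

Calling toy_dev_poll with a real table registers on the wait queue and returns the mask; calling it again with NULL returns the same mask and leaves the wait queue untouched. This is the difference between the poll call in ep_insert and the one in ep_send_events.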

4.5 ep_reinject_items

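[fs/eventpoll.c->ep_reinject_items]
static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
{
    int ricnt = 0, pwake = 0;
    unsigned long flags;
    struct epitem *epi;
    while (!list_empty(txlist)) { // walk txlist, which at this point holds the epitems that were ready
        epi = list_entry(txlist->next, struct epitem, txlink);
        EP_LIST_DEL(&epi->txlink); // remove the current epitem from txlist
        if (EP_RB_LINKED(&epi->rbn) && !(epi->event.events & EPOLLET) &&
            (epi->revents & epi->event.events) && !EP_IS_LINKED(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist); // put the current epitem back on ep->rdllist
            ricnt++; // number of epitems put back on the ready list
        }
    }
    if (ricnt) { // if anything was put back on ep->rdllist, wake the processes waiting on the wait queues
        if (waitqueue_active(&ep->wq))
            wake_up(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    ......
}

ep_reinject_items puts some of the fds on txlist back onto rdllist. Which ones? Look at the condition above: the fds that are not marked EPOLLET (that is, the default LT mode; this is the !(epi->event.events & EPOLLET) test) and whose reported events are still of interest (the (epi->revents & epi->event.events) test) go back onto rdllist. The next epoll_wait will of course pick up the fds on rdllist and copy them to the user again.

Here is an example. Take a socket that has only been connected and has not sent or received any data. Its poll event mask always contains POLLOUT (see the driver example above), so every call to epoll_wait returns a POLLOUT event (rather annoying), because its fd keeps being put back on rdllist. If someone then writes a large amount of data into the socket until it is clogged (no longer writable), the revents test fails (there is no POLLOUT any more), the fd is not put back on rdllist, and epoll_wait stops returning POLLOUT events to the user. Now add EPOLLET to this socket and connect, again without sending or receiving data. This time the EPOLLET test fails, so epoll_wait returns the POLLOUT notification only once (because the fd never goes back onto rdllist), and later epoll_wait calls report no events at all.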
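The difference between LT and ET can be checked from user space. The sketch below uses a pipe and EPOLLIN rather than the article's socket and POLLOUT, purely because a pipe is easier to set up; the function name count_reports is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Puts one byte into a pipe, never reads it, and calls epoll_wait()
 * `rounds` times without blocking. Returns how many of those calls
 * reported the fd, or -1 on error. Pass EPOLLET in extra_flags for
 * edge-triggered mode, or 0 for the default level-triggered mode.
 * (Hypothetical helper written for this example.) */
int count_reports(uint32_t extra_flags, int rounds)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | extra_flags,
                              .data.fd = pipefd[0] };
    int reports = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "x", 1) == 1) {
        reports = 0;
        for (int i = 0; i < rounds; i++) {
            struct epoll_event ready[1];
            /* The byte is left unread, so the fd stays readable. */
            if (epoll_wait(epfd, ready, 1, 0) == 1)
                reports++;
        }
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return reports;
}
```

In level-triggered mode the still-readable fd is reported by every call; with EPOLLET it is reported once and then stays silent until a new event arrives on the pipe.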

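Summary:

Overall call graph of the epoll functions:

Note: the call graph above raises a question. When ep_reinject_items() puts the LT epitems that were ready last time back on the ready list, the next ep_poll() returns immediately; doesn't that form an endless loop? When do these LT epitems stop being added to the ready list? The answer lies in 4.3, ep_send_events(). Note the poll call in that function: as analyzed earlier, when NULL is passed in, poll only fetches the event mask, so if the user's handling of the previous event changed the file's state (its revents), the change is picked up here. For example, suppose the user monitors a file for readability. Once all the data has been read, the file is no longer readable, the revents obtained in ep_send_events() no longer contain the readable event, the (epi->revents & epi->event.events) test in ep_reinject_items() fails, and the epitem is not added to the ready list (ep->rdllist) again. If only part of the data is read, however, the file's state does not change (it is still readable), so the epitem goes back on the ready list and user space is notified again. This is why, in LT mode, the user keeps being notified of the read event until some operation makes the file descriptor no longer ready (for example, a send, receive, or accept call, or sending or receiving less than the requested amount of data, results in an EWOULDBLOCK error).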
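The point at which an LT fd stops being reported can be demonstrated with a pipe. In the sketch below, the function name lt_report_after_reads and the 4-byte payload are assumptions made for the example.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Writes 4 bytes to a pipe watched in level-triggered mode, reads back
 * bytes_to_read of them (0..4), then asks epoll_wait() once without
 * blocking. Returns the number of events reported, or -1 on error.
 * (Hypothetical helper written for this example.) */
int lt_report_after_reads(int bytes_to_read)
{
    if (bytes_to_read < 0 || bytes_to_read > 4)
        return -1;

    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    char buf[4];
    int n = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "abcd", 4) == 4 &&
        read(pipefd[0], buf, (size_t)bytes_to_read) == bytes_to_read) {
        struct epoll_event ready[1];
        /* Reported only if unread data is still left in the pipe. */
        n = epoll_wait(epfd, ready, 1, 0);
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

After a partial read the fd is still reported, because the file is still readable; after the pipe has been drained completely, the readable bit is gone from the mask and the fd is no longer reported.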

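With this call added to the function call graph, it looks as follows (the added edge is drawn in blue):

Overall relationship diagram of the data structures in the epoll implementation: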
