Network programming: epoll


As mentioned earlier, the weakness of the I/O multiplexing APIs select and poll is that they scale poorly: the more client connections there are, the more noticeable the performance degradation. epoll was introduced to solve this problem. The Linux Programming Interface gives the following comparison:

Number of fds   poll CPU time (s)   select CPU time (s)   epoll CPU time (s)
10              0.61                0.73                  0.41
100             2.9                 3.0                   0.42
1000            35                  35                    0.53
10000           990                 930                   0.66

The table shows that select and poll are already several times slower than epoll at 100 fds and get rapidly worse from there, while epoll stays almost flat even at 10000 fds. The reasons are:

  • On every call to select/poll, the kernel must check all the descriptors passed in. With epoll, each call to epoll_ctl makes the kernel associate the relevant information with the underlying open file description. When an I/O event becomes ready, the kernel adds it to the ready list of the epoll instance. A later call to epoll_wait only has to fetch the entries on the ready list and return them.
  • On every call to select/poll, all the file descriptors to be monitored must be passed to the kernel, and when the call returns the kernel passes the descriptors back, marked to show which ones are ready. The caller then has to test every descriptor one by one to find out which events occurred. epoll maintains the monitoring list through epoll_ctl, so epoll_wait does not need to be given the list again, and its result contains only the ready descriptors, so there is no need to test all of them.
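To make the difference concrete, here is a minimal sketch (not taken from the book) that watches the same set of pipes with both APIs. The function name compare_poll_epoll and the constant NPIPES are made up for this illustration. Exactly one pipe is made readable; the function then reports how many pollfd entries had to be inspected to find it, and how many events epoll_wait returned.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 16   /* hypothetical number of descriptors to watch */

/* Watch NPIPES pipe read ends with both poll and epoll, make exactly one of
 * them readable, and report how much work each API leaves to the caller.
 *   *poll_scanned   - pollfd entries inspected before the ready one was found
 *   *epoll_returned - events handed back by epoll_wait
 * Returns 0 on success, -1 on any system call failure. */
int compare_poll_epoll(int ready_idx, int *poll_scanned, int *epoll_returned) {
    int pipes[NPIPES][2];
    struct pollfd pfds[NPIPES];
    struct epoll_event ev, out[NPIPES];
    int n = 0, i, rc = -1;
    int efd = epoll_create1(0);
    if (efd < 0) return -1;
    if (ready_idx < 0 || ready_idx >= NPIPES) goto done;

    for (n = 0; n < NPIPES; ++n) {
        if (pipe(pipes[n]) < 0) goto done;
        pfds[n].fd = pipes[n][0];
        pfds[n].events = POLLIN;
        pfds[n].revents = 0;
        ev.events = EPOLLIN;
        ev.data.fd = pipes[n][0];
        /* epoll: interest is registered once, up front */
        if (epoll_ctl(efd, EPOLL_CTL_ADD, pipes[n][0], &ev) < 0) {
            close(pipes[n][0]);
            close(pipes[n][1]);
            goto done;
        }
    }
    if (write(pipes[ready_idx][1], "x", 1) != 1) goto done;

    /* poll: the full array is passed in and must be scanned afterwards */
    if (poll(pfds, NPIPES, 0) < 0) goto done;
    *poll_scanned = 0;
    for (i = 0; i < NPIPES; ++i) {
        ++*poll_scanned;
        if (pfds[i].revents & POLLIN) break;
    }

    /* epoll: only the ready descriptors come back */
    *epoll_returned = epoll_wait(efd, out, NPIPES, 0);
    if (*epoll_returned < 0) goto done;
    rc = 0;

done:
    for (i = 0; i < n; ++i) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    close(efd);
    return rc;
}
```

If the ready pipe is the last one, the poll loop walks the whole array while epoll_wait hands back a single event. With thousands of mostly idle connections, that scan is where the extra CPU time goes.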

Conceptually, using epoll means registering the I/O events of the fds to be monitored with epoll (by calling epoll_ctl) and then waiting for events to arrive (by calling epoll_wait). You can think of the kernel as maintaining a read buffer and a write buffer for each fd:

  • If I monitor read events and there is data in the read buffer, epoll_wait returns, and I can call read to read the data.
  • If I monitor write events and the write buffer is not full, epoll_wait also returns, and I can call write to write data.
  • If an error occurs on fd, epoll_wait also returns, and I can tell from the returned flag bits.
  • If I monitor read events on a listening fd and a client connects, epoll_wait returns, and I can call accept to accept the client.

Introduction to epoll API

  • int epoll_create(int size);
    Creates an epoll instance and returns the file descriptor (fd) that represents it. The size argument has been ignored since Linux 2.6.8, but must be greater than 0.
  • int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
    The epoll control interface. epfd is the file descriptor of the epoll instance, and fd is the file descriptor to operate on. op is one of the following:
    • EPOLL_CTL_ADD registers events for fd. The event types are specified in event.
    • EPOLL_CTL_MOD modifies the events already registered for fd.
    • EPOLL_CTL_DEL removes fd and its events.

epoll_event has an events member, which specifies the event types to register. The most important ones are:

    • EPOLLIN: fd is readable.
    • EPOLLOUT: fd is writable.
    • EPOLLERR: an error occurred on fd. This event is always monitored and does not need to be added manually.
    • EPOLLHUP: fd was hung up. This event is always monitored and does not need to be added manually. It usually occurs when the socket is closed abnormally. In that case read returns 0, after which the socket's resources are cleaned up as usual.

epoll_event also has a data member of type epoll_data_t. The caller stores custom data in it, and epoll hands it back with the event to make later processing easier.

  • int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
    Waits for events to occur. If no event has occurred, the thread is suspended. maxevents specifies the maximum number of events to return, and the events array passed in by the caller must have room for at least maxevents entries. When events occur, epoll fills in the event information there. timeout specifies the maximum waiting time in milliseconds: 0 means return immediately and -1 means wait indefinitely.
    epoll_wait returns the number of ready events. When it returns, traverse events to process each fd. When the epoll instance is no longer in use, call close on the epoll fd.
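The three calls fit together roughly as in the sketch below. It is only an illustration under stated assumptions: struct conn, watch_conn, wait_ready and MAX_BATCH are names invented here, not part of the epoll API. It shows two details from the description above: the pointer stored in data.ptr comes back unchanged with the event, and epoll_wait never returns more than maxevents entries per call.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_BATCH 64   /* hypothetical cap on events fetched per call */

/* Hypothetical per-connection state; any application struct would do. */
struct conn {
    int fd;
    const char *name;
};

/* Register c->fd for read events and stash the struct pointer in data.ptr.
 * epoll_data_t is a union, so use either ptr or fd, not both. */
int watch_conn(int efd, struct conn *c) {
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.ptr = c;
    return epoll_ctl(efd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Fetch up to maxevents ready connections into ready[].
 * Returns how many were filled in (0 on timeout) or -1 on error.
 * A call interrupted by a signal is retried with the full timeout again. */
int wait_ready(int efd, struct conn **ready, int maxevents, int timeout_ms) {
    struct epoll_event events[MAX_BATCH];
    int n, i;
    if (maxevents < 1 || maxevents > MAX_BATCH) {
        errno = EINVAL;
        return -1;
    }
    do {
        n = epoll_wait(efd, events, maxevents, timeout_ms);
    } while (n < 0 && errno == EINTR);
    for (i = 0; i < n; ++i) {
        ready[i] = events[i].data.ptr;   /* the pointer given to watch_conn */
    }
    return n;
}
```

Storing a pointer to the connection structure, rather than the bare fd, saves a lookup when the event comes back. The echo server below uses the same trick, with NULL standing for the listening socket.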

Level trigger and edge trigger

epoll has two modes for triggering events. The default is called level-triggered (LT), and the other is called edge-triggered (ET):

  • LT mode: as long as the read buffer of fd is not empty, or the write buffer is not full, epoll_wait keeps triggering the event (that is, returning).
  • ET mode: the event is triggered once when the state of the monitored fd changes (from not ready to ready). After that, the kernel does not notify again unless a new event arrives.

Thanks to @Huang Wei: the original description of ET mode was wrong. It has been corrected after carefully reading the man page.

LT is much simpler to handle than ET. When a read event is triggered you only need to read once; if the data has not all been read, the next epoll_wait returns again, and writing works the same way. ET mode requires that once an event is triggered you keep reading or writing until you know for certain that it is finished (the call fails with the error code EAGAIN or EWOULDBLOCK).
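The difference is easy to observe with a pipe. The following sketch (count_wakeups is a name made up for this example) writes to a pipe once and then calls epoll_wait several times without ever reading the data. It counts how many of those calls report the read end as ready.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Write once to a pipe, then call epoll_wait `calls` times WITHOUT reading.
 * Returns how many of those calls reported the read end as ready,
 * or -1 on a system call failure. */
int count_wakeups(int edge_triggered, int calls) {
    int p[2], efd, i, n, wakeups = 0;
    struct epoll_event ev, out;

    if (pipe(p) < 0) return -1;
    efd = epoll_create1(0);
    if (efd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    ev.events = EPOLLIN | (edge_triggered ? EPOLLET : 0);
    ev.data.fd = p[0];
    if (epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev) < 0 ||
        write(p[1], "data", 4) != 4) {
        wakeups = -1;
    } else {
        for (i = 0; i < calls; ++i) {
            n = epoll_wait(efd, &out, 1, 0);   /* timeout 0: do not block */
            if (n < 0) {
                wakeups = -1;
                break;
            }
            wakeups += n;
        }
    }
    close(efd);
    close(p[0]);
    close(p[1]);
    return wakeups;
}
```

In LT mode every call reports the fd, because unread data is still in the buffer. In ET mode only the first call does, which is why an ET handler has to drain the fd until EAGAIN before waiting again.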

The flow of a level-triggered server program is as follows:

  • Accept a new connection, add the fd of the new connection to epoll, and listen for the EPOLLIN event.
  • When the EPOLLIN event arrives, read the data from the fd.
  • If you want to write data to this fd, add the EPOLLOUT event to epoll.
  • When the EPOLLOUT event arrives, write the data to fd. If there is too much data to write out in one go, keep the EPOLLOUT event and continue writing the next time it arrives. Once everything has been written, remove the EPOLLOUT event from epoll.

A practical echo program:

This time we use epoll and non-blocking sockets to write a genuinely practical echo server. Calling fcntl to set the O_NONBLOCK flag puts the socket's file descriptor into non-blocking mode. Non-blocking mode is more complex to handle than blocking mode:

  • The read, write and accept functions do not block. They either succeed or fail by returning -1, with errno recording the reason for the failure. There are several error codes to pay attention to:
    • EAGAIN or EWOULDBLOCK occurs only when fd is non-blocking. It means that there is no data to read, no space to write, or no client to accept, so come back next time. These two values may be the same or different, so it is best to check for both.
    • EINTR means the call was interrupted by a signal and can simply be called again.
    • Any other error indicates a real failure.

  • Writing data to an fd is troublesome. We cannot guarantee that all the data is written in one go, so we first save it in a buffer, then add a write event to epoll, and write the data to fd when the event is triggered. When all the data has been written, we remove the write event from epoll. This program keeps the pending data in a linked list.
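As a rough illustration of the error handling described above, the sketch below reads everything currently available from a non-blocking fd. The helper names make_nonblocking and read_available are invented for this example, and the echo server below is organized differently (it reads once per event). The point here is only the three-way split on errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on failure. */
int make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Read from a non-blocking fd until the buffer is full, the kernel has no
 * more data (EAGAIN/EWOULDBLOCK), or the peer closes.
 * Returns bytes read, or -1 on a real error; *eof is set to 1 on end of file. */
ssize_t read_available(int fd, char *buf, size_t cap, int *eof) {
    size_t total = 0;
    *eof = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n > 0) {
            total += (size_t)n;
        } else if (n == 0) {
            *eof = 1;               /* peer closed the connection */
            break;
        } else if (errno == EINTR) {
            continue;               /* interrupted by a signal: call again */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                  /* nothing more to read right now */
        } else {
            return -1;              /* a real error */
        }
    }
    return (ssize_t)total;
}
```

A return value of 0 with *eof still 0 simply means "no data yet", which is the case a blocking read would have slept through.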

We leave the listening fd in blocking mode because, when epoll_wait returns, a client has almost certainly connected, so accept can generally succeed without blocking. Client connections use non-blocking mode so that reads and writes never block when they cannot complete.

The following is the code of this program. Some comments have been added in key places. It is more useful to look at the code carefully than to look at the text description:)

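#include "socket_lib.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/epoll.h>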

#define MAX_CLIENT 10000
#define MIN_RSIZE 124
#define BACKLOG 128
#define EVENT_NUM 64

// Write buffer node
struct twbuffer {
    struct twbuffer *next;      // Next buffer
    void *buffer;        // Start of the allocated buffer
    char *ptr;           // Start of the unsent data; buffer != ptr means only part of the message was sent
    int size;            // Number of bytes not yet sent
};

// Write buffer list
struct twblist {
    struct twbuffer *head;
    struct twbuffer *tail;
};

// Client connection information
struct tclient {
    int fd;             // Client fd
    int rsize;          // Current read buffer size
    int wbsize;         // Number of bytes not yet written
    struct twblist wblist;  // Write buffer linked list
};

// Server information
struct tserver {
    int listenfd;       // Listening fd
    int epollfd;        // epoll fd
    struct tclient clients[MAX_CLIENT];     // Client structure array
};

// epoll: register fd for read events
void epoll_add(int efd, int fd, void *ud) {
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.ptr = ud;
    epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev);
}

// epoll: enable or disable the write event for fd
void epoll_write(int efd, int fd, void *ud, int enabled) {
    struct epoll_event ev;
    ev.events = EPOLLIN | (enabled ? EPOLLOUT : 0);
    ev.data.ptr = ud;
    epoll_ctl(efd, EPOLL_CTL_MOD, fd, &ev);
}

// epoll: remove fd
void epoll_del(int efd, int fd) {
    epoll_ctl(efd, EPOLL_CTL_DEL, fd, NULL);
}

// Set socket to non-blocking mode
void set_nonblocking(int fd) {
    int flag = fcntl(fd, F_GETFL, 0);
    if (flag >= 0) {
        fcntl(fd, F_SETFL, flag | O_NONBLOCK);
    }
}

// Append a write buffer to the list (the list takes ownership of buffer)
void add_wbuffer(struct twblist *list, void *buffer, int sz) {
    struct twbuffer *wb = malloc(sizeof(*wb));
    wb->buffer = buffer;
    wb->ptr = buffer;
    wb->size = sz;
    wb->next = NULL;
    if (!list->head) {
        list->head = list->tail = wb;
    } else {
        list->tail->next = wb;
        list->tail = wb;
    }
}

// Free all write buffers in the list
void free_wblist(struct twblist *list) {
    struct twbuffer *wb = list->head;
    while (wb) {
        struct twbuffer *tmp = wb;
        wb = wb->next;
        free(tmp->buffer);
        free(tmp);
    }
    list->head = NULL;
    list->tail = NULL;
}

// Create client information
struct tclient* create_client(struct tserver *server, int fd) {
    int i;
    struct tclient *client = NULL;
    // Find a free slot
    for (i = 0; i < MAX_CLIENT; ++i) {
        if (server->clients[i].fd < 0) {
            client = &server->clients[i];
            break;
        }
    }
    if (client) {
        client->fd = fd;
        client->rsize = MIN_RSIZE;
        set_nonblocking(fd);        // Set to non-blocking mode
        epoll_add(server->epollfd, fd, client);     // Add read event
        return client;
    } else {
        fprintf(stderr, "too many client: %d\n", fd);
        close(fd);      // No free slot: drop the connection
        return NULL;
    }
}

// Close client
void close_client(struct tserver *server, struct tclient *client) {
    assert(client->fd >= 0);
    epoll_del(server->epollfd, client->fd);
    if (close(client->fd) < 0) perror("close: ");
    client->fd = -1;
    client->wbsize = 0;
    free_wblist(&client->wblist);   // Discard data that was never sent
}

// Initialize server information
struct tserver* create_server(const char *host, const char *port) {
    struct tserver *server = malloc(sizeof(*server));
    memset(server, 0, sizeof(*server));
    for (int i = 0; i < MAX_CLIENT; ++i) {
        server->clients[i].fd = -1;     // -1 marks a free slot
    }
    server->epollfd = epoll_create(MAX_CLIENT);
    server->listenfd = tcpListen(host, port, BACKLOG);
    // The listening fd carries NULL user data so it can be told apart from clients
    epoll_add(server->epollfd, server->listenfd, NULL);
    return server;
}

// Release server
void release_server(struct tserver *server) {
    for (int i = 0; i < MAX_CLIENT; ++i) {
        struct tclient *client = &server->clients[i];
        if (client->fd >= 0) {
            close_client(server, client);
        }
    }
    epoll_del(server->epollfd, server->listenfd);
    close(server->listenfd);
    close(server->epollfd);
    free(server);
}

// Handle accept
void handle_accept(struct tserver *server) {
    struct sockaddr_storage claddr;
    socklen_t addrlen = sizeof(struct sockaddr_storage);
    for (;;) {
        int cfd = accept(server->listenfd, (struct sockaddr*)&claddr, &addrlen);
        if (cfd < 0) {
            int no = errno;
            if (no == EINTR)        // Interrupted by a signal, retry
                continue;
            perror("accept: ");
            exit(1);        // error
        }
        char host[NI_MAXHOST];
        char service[NI_MAXSERV];
        if (getnameinfo((struct sockaddr *)&claddr, addrlen, host, NI_MAXHOST, service, NI_MAXSERV, 0) == 0)
            printf("client connect: fd=%d, (%s:%s)\n", cfd, host, service);
        else
            printf("client connect: fd=%d, (?UNKNOWN?)\n", cfd);

        create_client(server, cfd);
        break;
    }
}

// Handle read
void handle_read(struct tserver *server, struct tclient *client) {
    int sz = client->rsize;
    char *buf = malloc(sz);
    ssize_t n = read(client->fd, buf, sz);
    if (n < 0) {        // error
        int no = errno;
        if (no != EINTR && no != EAGAIN && no != EWOULDBLOCK) {
            perror("read: ");
            close_client(server, client);
        }
        free(buf);
        return;
    }
    if (n == 0) {       // client close
        printf("client close: %d\n", client->fd);
        close_client(server, client);
        free(buf);
        return;
    }
    // Determine the size of the next read:
    // grow if the buffer was filled, shrink if less than half was used
    if (n == sz)
        client->rsize <<= 1;
    else if (sz > MIN_RSIZE && n * 2 < sz)
        client->rsize >>= 1;
    // Queue the data to be echoed back
    add_wbuffer(&client->wblist, buf, n);
    client->wbsize += n;
    // Add write event
    epoll_write(server->epollfd, client->fd, client, 1);
}

// Handle write
void handle_write(struct tserver *server, struct tclient *client) {
    struct twblist *list = &client->wblist;
    while (list->head) {
        struct twbuffer *wb = list->head;
        for (;;) {
            ssize_t sz = write(client->fd, wb->ptr, wb->size);
            if (sz < 0) {
                int no = errno;
                if (no == EINTR)        // Interrupted by a signal, retry
                    continue;
                else if (no == EAGAIN || no == EWOULDBLOCK)   // The kernel buffer is full. Come back next time
                    return;
                else {      // Other errors
                    perror("write: ");
                    close_client(server, client);
                    return;
                }
            }
            client->wbsize -= sz;
            if (sz != wb->size) {       // Not completely sent out. Come back next time
                wb->ptr += sz;
                wb->size -= sz;
                return;
            }
            break;
        }
        // This buffer has been fully sent: unlink and free it
        list->head = wb->next;
        free(wb->buffer);
        free(wb);
    }
    list->tail = NULL;
    // Everything has been written: disable the write event
    epoll_write(server->epollfd, client->fd, client, 0);
}

// Handle errors
void handle_error(struct tserver *server, struct tclient *client) {
    perror("client error: ");
    close_client(server, client);
}

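int main() {
    signal(SIGPIPE, SIG_IGN);

    struct tserver *server = create_server("", "3459");

    struct epoll_event events[EVENT_NUM];
    for (;;) {
        int nevent = epoll_wait(server->epollfd, events, EVENT_NUM, -1);
        if (nevent <= 0) {
            if (nevent < 0 && errno != EINTR) {
                perror("epoll_wait: ");
                return 1;
            }
            continue;
        }
        int i = 0;
        for (i = 0; i < nevent; ++i) {
            struct epoll_event ev = events[i];
            if (ev.data.ptr == NULL) {  // accept
                handle_accept(server);
            } else {
                struct tclient *client = ev.data.ptr;
                if (ev.events & (EPOLLIN | EPOLLHUP)) {  // read
                    handle_read(server, client);
                }
                // An earlier handler may have closed the client, so check fd again
                if (client->fd >= 0 && (ev.events & EPOLLOUT)) {     // write
                    handle_write(server, client);
                }
                if (client->fd >= 0 && (ev.events & EPOLLERR)) {     // error
                    handle_error(server, client);
                }
            }
        }
    }

    release_server(server);
    return 0;
}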

Reposted from: Network programming: epoll - Zhihu


Posted on Fri, 05 Nov 2021 21:19:10 -0400 by leeharvey09