epoll: The API That Powers the Modern Internet

The problem: 10,000 connections in 1999

In 1999, Dan Kegel published a paper titled “It’s time for web servers to handle ten thousand clients simultaneously, don’t you think?” The C10K problem wasn’t just about connection counts. It exposed a fundamental flaw in how Unix programs monitored file descriptors for I/O.

The dominant approach at the time was select() — a POSIX syscall that blocks until one or more file descriptors are ready for I/O. The API looked reasonable:

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);

The problem was what happened inside the kernel on every call:

1. The entire fd_set bitmask was copied from user space to kernel space
2. The kernel scanned every fd from 0 to nfds - 1 to check readiness
3. The modified bitmask was copied back to user space
4. The application had to scan the bitmask again to find which fds fired
5. Repeat from step 1

For a server with 10,000 mostly-idle connections, this meant copying and scanning 10,000 entries on every event — even if only 3 had activity. O(n) work for O(k) useful results, where k ≪ n.

poll() addressed select()’s 1024-fd limit (FD_SETSIZE) by using an array of struct pollfd instead of bitmasks. But it kept the same O(n) scanning model. The copy happened every call. 10,000 connections still meant 10,000 entries traversed per wakeup.

The C10K problem needed a different approach entirely.

epoll: O(1) wait via persistent state

Davide Libenzi submitted the epoll patch in October 2002. It was merged into Linux 2.5.44 and reached production with Linux 2.6.0 in December 2003.

The core insight: move the interest list into the kernel and keep it there. Instead of rebuilding it on every call, register fds once via epoll_ctl. When you call epoll_wait, the kernel doesn’t scan anything — it just hands you the fds that already signaled readiness.
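
A minimal sketch of that idea, using Python's select.epoll wrapper on Linux. The helper name poll_ready_subset and the use of socketpair() to stand in for idle client connections are assumptions made for the demo, not part of any real server.

```python
import select
import socket


def poll_ready_subset(total=100, active=(3, 41, 77)):
    """Register `total` idle sockets once, make a few readable, and
    return (expected_fds, reported_fds)."""
    # socketpair() stands in for client connections (demo assumption).
    pairs = [socket.socketpair() for _ in range(total)]
    ep = select.epoll()
    try:
        # Registration happens once; the interest list then lives in the kernel.
        for reader, _writer in pairs:
            ep.register(reader.fileno(), select.EPOLLIN)

        # Only a handful of connections see traffic.
        for i in active:
            pairs[i][1].send(b"x")

        expected = {pairs[i][0].fileno() for i in active}
        # The wait call passes no fd list in; it gets back only the ready fds.
        reported = {fd for fd, _mask in ep.poll(timeout=1)}
        return expected, reported
    finally:
        ep.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()


expected, reported = poll_ready_subset()
print(reported == expected)
```

The 97 idle sockets never show up in the result, and nothing about them is passed to the wait call.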

The difference in behavior at scale is stark. With 10,000 connections and 3 active:

poll: copies 80KB array, scans 10,000 entries, returns count 3
epoll: copies nothing in, walks a 3-element ready list, copies 3 events out
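
The 80KB figure is simple arithmetic, sketched below. The function names are illustrative; the only outside fact used is the layout of struct pollfd (an int and two shorts).

```python
import struct

# struct pollfd is { int fd; short events; short revents; }
POLLFD_SIZE = struct.calcsize("ihh")


def poll_bytes_copied_in(n_connections):
    """Bytes of pollfd array handed to the kernel on every poll() call."""
    return n_connections * POLLFD_SIZE


def entries_examined(n_connections, n_ready, api):
    """Entries the kernel walks per wait call under each model."""
    return n_connections if api in ("select", "poll") else n_ready


print(poll_bytes_copied_in(10_000))          # bytes per poll() call
print(entries_examined(10_000, 3, "poll"))   # entries scanned by poll
print(entries_examined(10_000, 3, "epoll"))  # entries walked by epoll
```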

+-----------------+----------------------------------+----------------------------------+---------------------------+
| Feature         | select                           | poll                             | epoll                     |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Max fds         | 1024 (FD_SETSIZE)                | Unlimited                        | Unlimited                 |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Wait complexity | O(n)                             | O(n)                             | O(1) / O(k ready)         |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Add/remove cost | Rebuild fd_set: O(n)             | Rebuild array: O(n)              | O(log n) per epoll_ctl    |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Kernel state    | Stateless — full copy every call | Stateless — full copy every call | Persistent red-black tree |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Copy on wait    | O(n) bitmask copied in + out     | O(n) pollfd array copied         | k ready events copied out |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Returns         | Count; app scans bitmasks        | Count; app scans revents         | Ready events directly     |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Portability     | POSIX everywhere                 | POSIX everywhere                 | Linux only                |
+-----------------+----------------------------------+----------------------------------+---------------------------+

The three syscalls

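epoll’s core API surface is three syscalls (epoll_pwait and epoll_pwait2 are variants of the third that add a signal mask and a timespec timeout).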

epoll_create1(flags)

int epfd = epoll_create1(EPOLL_CLOEXEC);

Creates an epoll instance and returns a file descriptor representing it. That fd is a real, closeable fd — it can even be watched by another epoll instance, enabling hierarchical event trees.

EPOLL_CLOEXEC sets FD_CLOEXEC on the returned fd, so the kernel closes it automatically when the process calls exec(). Use this flag by default: without it, the epoll fd leaks into any program you exec(), which is almost never intentional. It has no effect on fork() alone; a forked child always inherits the fd.

The older epoll_create(int size) still exists. Since Linux 2.6.8, the size argument is completely ignored (the kernel dynamically sizes its internal structures), but it must be positive for historical reasons. Prefer epoll_create1.
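
As a quick check from Python, whose select.epoll() creates the instance with close-on-exec already set (Python 3.4 and later), the flag can be read back with fcntl. The helper name epoll_fd_flags is made up for this sketch.

```python
import fcntl
import os
import select


def epoll_fd_flags():
    """Return (FD_CLOEXEC set?, inheritable across exec?) for a fresh epoll fd."""
    ep = select.epoll()
    try:
        fd_flags = fcntl.fcntl(ep.fileno(), fcntl.F_GETFD)
        cloexec = bool(fd_flags & fcntl.FD_CLOEXEC)
        return cloexec, os.get_inheritable(ep.fileno())
    finally:
        ep.close()


print(epoll_fd_flags())
```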

epoll_ctl(epfd, op, fd, event)

struct epoll_event ev;
ev.events  = EPOLLIN | EPOLLET;
ev.data.fd = client_fd;

epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);   // register
epoll_ctl(epfd, EPOLL_CTL_MOD, client_fd, &ev);   // change mask
epoll_ctl(epfd, EPOLL_CTL_DEL, client_fd, NULL);  // remove

epoll_ctl is O(log n) — it operates on the kernel’s red-black tree (more on this below). This is the rare operation: you register once when a connection arrives, deregister when it closes. epoll_wait is the hot path.

The data union in struct epoll_event is opaque to the kernel — whatever you store there is returned verbatim on the next epoll_wait:

typedef union epoll_data {
    void     *ptr;   // point to your own connection struct
    int       fd;    // simplest: just store the fd number
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

Using data.ptr to point to a connection struct (instead of data.fd) is a common pattern — it avoids a lookup table and gives your handler direct access to per-connection state.
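
Python's wrapper only exposes the fd, so the closest equivalent of the data.ptr pattern is a dict from fd to a state object. The sketch below walks one connection through ADD, MOD, and DEL; the Connection class and lifecycle_demo are hypothetical names, and a socketpair stands in for a real client.

```python
import select
import socket
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Hypothetical per-connection state (what data.ptr would point at in C)."""
    sock: socket.socket
    inbuf: bytearray = field(default_factory=bytearray)


def lifecycle_demo():
    a, b = socket.socketpair()
    ep = select.epoll()
    conns = {}
    try:
        fd = a.fileno()
        conns[fd] = Connection(a)
        ep.register(fd, select.EPOLLIN)        # EPOLL_CTL_ADD
        b.send(b"ping")
        for ready_fd, _mask in ep.poll(timeout=1):
            conn = conns[ready_fd]             # fd -> state lookup
            conn.inbuf += conn.sock.recv(4096)
        ep.modify(fd, select.EPOLLOUT)         # EPOLL_CTL_MOD: swap the mask
        writable = ep.poll(timeout=1)
        ep.unregister(fd)                      # EPOLL_CTL_DEL
        return bytes(conns[fd].inbuf), writable
    finally:
        ep.close()
        a.close()
        b.close()
```

In C, storing the Connection pointer in data.ptr removes the dict lookup entirely.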

epoll_wait(epfd, events, maxevents, timeout)

struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  // -1 = block forever
for (int i = 0; i < n; i++) {
    handle(events[i].data.fd, events[i].events);
}

Returns up to maxevents ready events. If more events are ready than maxevents allows, the remainder stay in the ready list for the next call. timeout=0 returns immediately (non-blocking); timeout>0 is milliseconds.
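
The maxevents behavior can be observed with a small experiment. This is a sketch under the assumption that five socketpairs, each with one unread byte, stand in for five ready connections; harvest_in_batches is an illustrative name.

```python
import select
import socket


def harvest_in_batches(n_ready=5, batch=2):
    """Collect ready events `batch` at a time; return the size of each batch."""
    pairs = [socket.socketpair() for _ in range(n_ready)]
    ep = select.epoll()
    try:
        for reader, writer in pairs:
            ep.register(reader.fileno(), select.EPOLLIN)
            writer.send(b"x")
        by_fd = {reader.fileno(): reader for reader, _writer in pairs}

        batch_sizes = []
        while True:
            # timeout=0 never blocks; maxevents caps how many events come back.
            events = ep.poll(timeout=0, maxevents=batch)
            if not events:
                break
            for fd, _mask in events:
                by_fd[fd].recv(1)   # consume the byte so the fd stops being ready
            batch_sizes.append(len(events))
        return batch_sizes
    finally:
        ep.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

No event is lost when the cap is smaller than the ready list: the batches add up to all five fds.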

Event flags

+----------------+-----------+----------------------------------------------+
| Flag           | Direction | Meaning                                      |
+----------------+-----------+----------------------------------------------+
| EPOLLIN        | watch     | fd has data to read                          |
+----------------+-----------+----------------------------------------------+
| EPOLLOUT       | watch     | fd can accept a write without blocking       |
+----------------+-----------+----------------------------------------------+
| EPOLLRDHUP     | watch     | Peer closed or shut down write half (Linux   |
|                |           | 2.6.17+)                                     |
+----------------+-----------+----------------------------------------------+
| EPOLLPRI       | watch     | Out-of-band / urgent data                    |
+----------------+-----------+----------------------------------------------+
| EPOLLERR       | auto      | Error — always reported, never needs to be   |
|                |           | set                                          |
+----------------+-----------+----------------------------------------------+
| EPOLLHUP       | auto      | Hang-up — always reported                    |
+----------------+-----------+----------------------------------------------+
| EPOLLET        | modifier  | Edge-triggered mode (default is              |
|                |           | level-triggered)                             |
+----------------+-----------+----------------------------------------------+
| EPOLLONESHOT   | modifier  | Single-fire; must re-arm with EPOLL_CTL_MOD  |
+----------------+-----------+----------------------------------------------+
| EPOLLEXCLUSIVE | modifier  | Wake at least one of several epoll instances |
|                |           | watching the same fd (Linux 4.5+)            |
+----------------+-----------+----------------------------------------------+

EPOLLERR and EPOLLHUP are always monitored by the kernel and always reported — you do not need to add them to your event mask, but you must handle them in your dispatch loop.
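
A small sketch of the "always reported" rule, using a pipe: the read end is registered for EPOLLIN only, yet closing the write end still produces EPOLLHUP. The function name is illustrative.

```python
import os
import select


def hup_is_reported_unasked():
    """Register a pipe's read end for EPOLLIN only, then close the writer."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)   # EPOLLHUP is not in the mask
        os.close(w)                      # the last writer goes away
        return ep.poll(timeout=1)
    finally:
        ep.close()
        os.close(r)


for fd, mask in hup_is_reported_unasked():
    # Check error and hang-up bits before the normal read/write bits.
    if mask & (select.EPOLLERR | select.EPOLLHUP):
        print("fd", fd, "hung up or errored")
```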

Inside the kernel: what actually happens

Understanding the internals is what separates using epoll from understanding epoll.

The interest list: a red-black tree

When you call epoll_ctl(EPOLL_CTL_ADD), the kernel inserts an epitem structure into a red-black tree (struct rb_root_cached). The key is (file description, fd).

A red-black tree because:

epoll_ctl(ADD) must reject duplicate registrations (EEXIST) — needs lookup
MOD and DEL need O(log n) find-by-key
Self-balancing — no pathological worst case

The tree persists between epoll_wait calls. This is the core of why epoll doesn’t copy anything on wait — the state is already in the kernel.

The ready list: a doubly-linked list

The kernel also maintains a ready list (rdllist) — a doubly-linked list of epitem structs that have pending events. When epoll_wait runs, it harvests this list: O(k) where k is the number of ready fds. It never touches the fds that have no pending events.

ep_poll_callback: the notification path

When you call epoll_ctl(ADD), the kernel registers a callback function (ep_poll_callback) on the target fd’s VFS wait queue — the same wait queue that poll() uses. This is how epoll hooks into the kernel’s existing notification infrastructure without any polling.

When a socket becomes readable (data arrives from the network):

NIC hardware interrupt fires
Kernel network stack processes the incoming packet
Socket’s wait queue is woken
ep_poll_callback() fires (from softirq context)
It adds the epitem to rdllist (the ready list)
It wakes any thread sleeping in epoll_wait
epoll_wait copies the ready events to user space and returns

The callback runs at interrupt time, so it uses spinlocks and is non-blocking. The entire path from NIC interrupt to your application code is a handful of function calls — no O(n) scanning anywhere.
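
The interplay of interest list, ready list, and callback can be modeled in a few lines of user-space Python. This is a toy, not the kernel's code: ToyEpoll and its methods are invented names, a dict replaces the red-black tree, and level-triggered re-queuing is left out.

```python
from collections import deque


class ToyEpoll:
    """User-space toy model of epoll's two data structures."""

    def __init__(self):
        self.interest = {}    # fd -> requested mask (kernel: red-black tree)
        self.ready = deque()  # fds with pending events (kernel: rdllist)
        self.pending = {}     # fd -> event bits that have fired

    def ctl_add(self, fd, mask):
        if fd in self.interest:
            raise FileExistsError(fd)   # mirrors EEXIST on duplicate ADD
        self.interest[fd] = mask

    def notify(self, fd, events):
        """Plays the role of ep_poll_callback: invoked by the event source."""
        mask = self.interest.get(fd)
        if mask is None or not (events & mask):
            return
        if fd not in self.pending:
            self.ready.append(fd)       # queue the fd once
        self.pending[fd] = self.pending.get(fd, 0) | (events & mask)

    def wait(self, maxevents):
        """Harvest up to maxevents entries; cost depends only on the ready list."""
        out = []
        while self.ready and len(out) < maxevents:
            fd = self.ready.popleft()
            out.append((fd, self.pending.pop(fd)))
        return out
```

With 10,000 registered fds and two notifications, wait() touches two entries; the other 9,998 are never visited.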

The dup() trap

One subtlety: the interest list key is (file description, fd), not just fd. If you dup() a file descriptor, both the original and the duplicate refer to the same underlying file description. You can register both in epoll with different event masks, but the underlying file description is shared. Closing a registered fd does not remove its entry while a duplicate is still open: the entry persists, and epoll_wait keeps reporting events for the fd number you already closed, until every fd pointing to that file description is closed. The reliable fix is to call EPOLL_CTL_DEL before close(), because once the fd is closed, EPOLL_CTL_DEL on it fails with EBADF.
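
The trap is easy to reproduce. The sketch below registers one end of a socketpair, duplicates it, closes the original, and then watches what epoll does; dup_trap is an illustrative name.

```python
import errno
import os
import select
import socket


def dup_trap():
    a, b = socket.socketpair()
    ep = select.epoll()
    registered_fd = a.fileno()
    ep.register(registered_fd, select.EPOLLIN)
    dup_fd = os.dup(registered_fd)   # second fd, same open file description
    a.close()                        # closes registered_fd only
    b.send(b"x")
    events = ep.poll(timeout=1)      # the closed fd number is still reported
    try:
        ep.unregister(registered_fd)  # too late: the fd no longer exists
        del_errno = None
    except OSError as exc:
        del_errno = exc.errno
    os.close(dup_fd)                 # last reference gone; the entry is dropped
    after = ep.poll(timeout=0)
    ep.close()
    b.close()
    return registered_fd, events, del_errno, after
```

Calling ep.unregister() before a.close() avoids the stale entry.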

Level-triggered vs edge-triggered

+---------------------------+---------------------------------+--------------------------------------+
| Dimension                 | Level-triggered (default)       | Edge-triggered (EPOLLET)             |
+---------------------------+---------------------------------+--------------------------------------+
| When it fires             | While data is available         | Only when state changes (new data    |
|                           |                                 | arrives)                             |
+---------------------------+---------------------------------+--------------------------------------+
| Partial reads OK?         | Yes — will fire again next call | No — you must drain to EAGAIN        |
+---------------------------+---------------------------------+--------------------------------------+
| Non-blocking fd required? | No                              | Yes — mandatory                      |
+---------------------------+---------------------------------+--------------------------------------+
| Who uses it               | Redis                           | Nginx, most high-perf servers        |
+---------------------------+---------------------------------+--------------------------------------+
| Thundering herd risk      | Higher (all waiters wake)       | Lower (fire-once semantics)          |
+---------------------------+---------------------------------+--------------------------------------+

Level-triggered (default)

The kernel delivers an event every time epoll_wait is called while the condition remains true. If 100 bytes are in the receive buffer and you only read 50, the next epoll_wait immediately returns EPOLLIN again.

Behavior is identical to poll(). Safe, easy. You can read partial data and resume on the next event.

Edge-triggered (EPOLLET)

The kernel delivers an event only when state changes — when new data arrives, not while old data sits unread. If 100 bytes are in the buffer and you only read 50, epoll_wait will not fire for that fd again unless new data arrives.

This requires discipline:

// Set the fd non-blocking (mandatory for EPOLLET)
int flags = fcntl(fd, F_GETFL, 0);
if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
    /* handle error */
}

// Register with edge-triggered
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

// When EPOLLIN fires, drain completely
while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0) {
        if (errno == EINTR) continue;                        // interrupted: retry
        if (errno == EAGAIN || errno == EWOULDBLOCK) break;  // buffer empty: done
        /* real error: close the connection */ break;
    }
    if (n == 0) break;               // EOF: peer closed
    process(buf, n);
}
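
The two modes can be compared side by side from Python. This sketch assumes a socketpair as the connection; events_after_partial_read and drain are illustrative helpers, with drain being the Python counterpart of the read-until-EAGAIN loop.

```python
import select
import socket


def events_after_partial_read(edge_triggered):
    """Send 100 bytes, read 50, and count events before and after the read."""
    a, b = socket.socketpair()
    a.setblocking(False)
    ep = select.epoll()
    try:
        mask = select.EPOLLIN | (select.EPOLLET if edge_triggered else 0)
        ep.register(a.fileno(), mask)
        b.send(b"x" * 100)
        first = ep.poll(timeout=1)
        a.recv(50)                    # partial read: 50 bytes remain buffered
        second = ep.poll(timeout=0)
        return len(first), len(second)
    finally:
        ep.close()
        a.close()
        b.close()


def drain(sock, bufsize=4096):
    """Read a non-blocking socket until it would block or reaches EOF."""
    chunks = []
    while True:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:       # EAGAIN / EWOULDBLOCK: buffer is empty
            break
        if not data:                  # EOF: peer closed
            break
        chunks.append(data)
    return b"".join(chunks)
```

Level-triggered reports the leftover 50 bytes on the second poll; edge-triggered stays silent, which is why the handler has to drain.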

Thundering herd and EPOLLEXCLUSIVE

Classic problem with level-triggered: N workers all wait for readiness on the same listening socket, either through one shared epoll fd or through one epoll instance per worker. A new connection arrives. All N workers wake up. Only one accept() succeeds; the rest return EAGAIN and go back to sleep. Wasted context switches at scale.

EPOLLEXCLUSIVE (Linux 4.5+) addresses the per-worker-instance form of this. When several epoll instances watch the same fd and each registered it with EPOLLEXCLUSIVE, the kernel delivers the event to one or more of them instead of all of them. Internally, the epoll waiter is queued on the target's wait queue with WQ_FLAG_EXCLUSIVE. No mutex, no userspace coordination.

EPOLLONESHOT is the single-connection variant: the fd is automatically disabled after firing once. You re-arm it with EPOLL_CTL_MOD. Guarantees at most one thread processes a given fd at a time.
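
A sketch of the one-shot cycle, with a socketpair standing in for a client connection and oneshot_demo as an illustrative name: the fd fires once, stays silent while disabled even though data is still unread, and fires again after being re-armed.

```python
import select
import socket


def oneshot_demo():
    a, b = socket.socketpair()
    ep = select.epoll()
    try:
        mask = select.EPOLLIN | select.EPOLLONESHOT
        ep.register(a.fileno(), mask)
        b.send(b"x")
        first = ep.poll(timeout=1)     # fires once
        second = ep.poll(timeout=0)    # disabled, although the byte is unread
        ep.modify(a.fileno(), mask)    # re-arm with EPOLL_CTL_MOD
        third = ep.poll(timeout=1)     # fires again
        return len(first), len(second), len(third)
    finally:
        ep.close()
        a.close()
        b.close()
```

In a multi-threaded server, the worker that received the event re-arms the fd only after it has finished with the connection.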

What epoll cannot watch

+------------------------------+------------------+----------------------------------+
| Type                         | epoll_ctl result | Reason                           |
+------------------------------+------------------+----------------------------------+
| TCP/UDP socket               | Works            | Primary use case                 |
+------------------------------+------------------+----------------------------------+
| Unix domain socket           | Works            |                                  |
+------------------------------+------------------+----------------------------------+
| Pipe                         | Works            | VFS poll() implemented           |
+------------------------------+------------------+----------------------------------+
| stdin / stdout / tty         | Works            |                                  |
+------------------------------+------------------+----------------------------------+
| signalfd / timerfd / eventfd | Works            | Designed for epoll               |
+------------------------------+------------------+----------------------------------+
| Regular file                 | EPERM            | Always "ready" — no true async   |
|                              |                  | readiness                        |
+------------------------------+------------------+----------------------------------+
| Block device                 | EPERM            | Same reason as regular files     |
+------------------------------+------------------+----------------------------------+
| /proc, /sys                  | Usually EPERM    |                                  |
+------------------------------+------------------+----------------------------------+

Regular files are the important case. epoll_ctl(EPOLL_CTL_ADD) on a regular file returns EPERM. This is not a bug or limitation to be worked around — it reflects something true about disk I/O.

Regular files don’t have the concept of readiness. Data is either in the page cache (available instantly) or on disk (the kernel blocks waiting for the read to complete — there is no intermediate “not yet ready, notify me later” state at the VFS level). There is no notification path to hook ep_poll_callback into.
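
The EPERM result is easy to confirm. A minimal sketch, assuming a temporary file on an ordinary filesystem; register_regular_file is an illustrative name.

```python
import errno
import select
import tempfile


def register_regular_file():
    """Try to add a regular file to epoll; return the errno, or None on success."""
    ep = select.epoll()
    try:
        with tempfile.TemporaryFile() as f:
            try:
                ep.register(f.fileno(), select.EPOLLIN)
                return None
            except OSError as exc:
                return exc.errno
    finally:
        ep.close()


print(errno.errorcode.get(register_regular_file()))
```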

This is why Node.js and libuv maintain a thread pool for file system operations. File reads run on worker threads that block on disk; when complete, they signal the main event loop via an eventfd. Network I/O goes through epoll; file I/O goes through the thread pool.

io_uring (Linux 5.1+), a completion-based I/O interface, does work with regular files. Cloudflare has written about io_uring as the epoll alternative that handles the cases epoll cannot. For new Linux servers, it is worth evaluating.

How Redis uses epoll

Redis implements its event loop in src/ae.c (async events). The ae layer abstracts over four platform I/O backends: evport (Solaris), epoll (Linux), kqueue (BSD/macOS), select (fallback). On Linux, it uses epoll.

The state is minimal:

typedef struct aeApiState {
    int epfd;
    struct epoll_event *events;   // pre-allocated result array
} aeApiState;

The hot path is aeApiPoll, called on every event loop iteration (lightly simplified here):

static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    aeApiState *state = eventLoop->apidata;
    int timeout = tvp ? (tvp->tv_sec * 1000 + tvp->tv_usec / 1000) : -1;

    int n = epoll_wait(state->epfd, state->events, eventLoop->setsize, timeout);

    for (int j = 0; j < n; j++) {
        int mask = 0;
        struct epoll_event *e = state->events + j;
        if (e->events & EPOLLIN)  mask |= AE_READABLE;
        if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
        if (e->events & EPOLLERR) mask |= AE_WRITABLE | AE_READABLE;
        if (e->events & EPOLLHUP) mask |= AE_WRITABLE | AE_READABLE;
        eventLoop->fired[j].fd   = e->data.fd;
        eventLoop->fired[j].mask = mask;
    }
    return n;
}

Redis uses level-triggered epoll (no EPOLLET). It registers a new client fd with EPOLL_CTL_ADD when the connection is accepted, and either EPOLL_CTL_ADD or EPOLL_CTL_MOD (depending on whether the fd is already in the tree) when switching between read and write interest.
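
The ADD-or-MOD decision can be sketched in Python as follows. This is written in the spirit of that logic, not copied from Redis; InterestTable and its method names are hypothetical.

```python
import select
import socket


class InterestTable:
    """Tracks the current mask per fd so it can choose ADD, MOD, or DEL."""

    def __init__(self):
        self.ep = select.epoll()
        self.masks = {}                   # fd -> mask currently registered

    def add_interest(self, fd, mask):
        old = self.masks.get(fd, 0)
        new = old | mask
        if old == 0:
            self.ep.register(fd, new)     # first interest: EPOLL_CTL_ADD
            op = "ADD"
        else:
            self.ep.modify(fd, new)       # already in the tree: EPOLL_CTL_MOD
            op = "MOD"
        self.masks[fd] = new
        return op

    def remove_interest(self, fd, mask):
        new = self.masks.get(fd, 0) & ~mask
        if new:
            self.ep.modify(fd, new)       # some interest remains
            self.masks[fd] = new
            return "MOD"
        self.ep.unregister(fd)            # nothing left: EPOLL_CTL_DEL
        self.masks.pop(fd, None)
        return "DEL"

    def close(self):
        self.ep.close()
```

Keeping the mask in user space means the caller never has to ask the kernel whether an fd is already registered.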

The single-threaded model

From Redis 1.0 through 5.x: one thread for all client I/O and command execution. Every client connection, every command read, and every response write runs in one epoll loop. No locks, no contention. Redis commands are typically O(1) or O(n) over small data; the bottleneck is network I/O, not CPU. A single-threaded event loop processes hundreds of thousands of commands per second.

Redis 6.0 (2020) added I/O threading for reading from sockets and writing responses. Command execution remains single-threaded — the main thread still calls epoll_wait and processes all commands in order. This delivered 37–112% throughput improvement on high-core-count systems, confirming that network I/O, not command execution, was the bottleneck.

How Nginx uses epoll

Nginx’s epoll backend lives in src/event/modules/ngx_epoll_module.c.

Nginx pre-forks N worker processes (typically equal to CPU core count). Each worker has its own epoll instance, with no shared epoll fd across processes. Workers pre-allocate a fixed struct epoll_event array (sized by the epoll_events directive, 512 by default) to avoid per-call allocation.

Nginx registers connections with edge-triggered mode:

// From ngx_epoll_add_connection()
ee.events = EPOLLIN | EPOLLOUT | EPOLLET | EPOLLRDHUP;

Both read and write interest are registered upfront, not toggled per direction. This works because Nginx’s handlers always drain to EAGAIN, as required by EPOLLET semantics.

The thundering herd story in Nginx

Old problem: all workers add the listening socket to their epoll instance. A new connection arrives → all workers wake up → only one accept() succeeds → the rest burn a context switch.

accept_mutex (Nginx’s original solution): a cross-process mutex. Only the mutex holder adds the listening socket to its epoll instance. Serializes accepts completely. Safe, but adds latency under high connection rates.

SO_REUSEPORT (Linux 3.9+): each worker creates its own listening socket on the same port. The kernel distributes connections across sockets using a 4-tuple hash. No mutex needed. Nginx added support in 1.9.1. This is the recommended modern configuration.

EPOLLEXCLUSIVE (Linux 4.5+): Nginx 1.11.3 added support. Each worker adds the listening socket to its own epoll instance with EPOLLEXCLUSIVE, and the kernel wakes one or more workers (usually just one) per new connection instead of all of them, without any mutex. Cleaner than accept_mutex and doesn’t require per-worker listen sockets.
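
The SO_REUSEPORT arrangement can be sketched in Python: every worker gets its own listening socket on the same port. The function name and the choice of 127.0.0.1 with a kernel-assigned port are assumptions for the demo.

```python
import socket


def reuseport_listeners(n_workers=2, host="127.0.0.1"):
    """Create one listening socket per worker, all bound to the same port."""
    listeners = []
    port = 0                              # let the kernel pick the first port
    for _ in range(n_workers):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Must be set on every socket before bind().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        port = s.getsockname()[1]         # later sockets reuse this port
        s.listen(128)
        s.setblocking(False)
        listeners.append(s)
    return listeners
```

Each worker would then register only its own listener with its own epoll instance, so no wakeup is shared between workers.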

How Node.js uses epoll

Node.js uses libuv as its cross-platform async I/O library. On Linux, libuv’s uv__io_poll() function calls epoll_wait. On macOS/BSD, it calls kqueue. On Windows, IOCP.

The Node.js event loop runs in phases:

   timers          → setTimeout, setInterval
   pending         → I/O callbacks deferred from previous iteration
   idle / prepare  → internal libuv housekeeping
   poll (I/O)      → epoll_wait() blocks here; dispatches I/O callbacks
   check           → setImmediate()
   close           → socket.on('close', ...) callbacks

The poll phase is where epoll_wait runs. libuv calculates the timeout: 0 if setImmediate() callbacks are queued (don’t block), otherwise the time until the next setTimeout fires.
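
That timeout rule, reduced to a sketch. libuv checks more conditions than the two named here, and poll_timeout_ms with its parameters is a hypothetical helper, not a libuv function.

```python
def poll_timeout_ms(now_ms, timer_deadlines_ms, immediates_queued):
    """Pick the epoll_wait timeout for one loop iteration.

    Returns 0 (do not block), -1 (block until I/O), or the number of
    milliseconds until the nearest timer is due.
    """
    if immediates_queued:
        return 0                          # setImmediate work is waiting
    if not timer_deadlines_ms:
        return -1                         # no timers: block until I/O arrives
    return max(0, min(timer_deadlines_ms) - now_ms)
```

An overdue timer yields 0, so the loop polls without blocking and gets to the timer phase right away.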

Every net.Socket, net.Server, and dgram.Socket in Node.js is backed by a uv_tcp_t or uv_udp_t handle. When you do server.listen(3000), libuv:

Creates a TCP socket, sets O_NONBLOCK
Calls epoll_ctl(EPOLL_CTL_ADD) for EPOLLIN on the listening fd
On each poll phase, epoll_wait returns when a connection arrives
libuv calls accept(), wraps the client fd in a new uv_tcp_t, registers it with epoll
Your connection callback fires

File system operations (fs.readFile, fs.writeFile, etc.) do not go through epoll — they run on a thread pool (default: 4 threads, configurable via UV_THREADPOOL_SIZE). When a worker thread completes a file operation, it writes to a uv_async_t handle (which is an eventfd under epoll) to wake the main loop.

This is a practical demonstration of epoll’s limitation with regular files: libuv simply routes the two types of I/O through two different mechanisms.
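
A rough Python sketch of the same split: a worker thread performs the blocking file read, then wakes an epoll loop through a file descriptor. A pipe is used here as a stand-in for libuv's eventfd, and the function names are illustrative.

```python
import os
import select
import tempfile
import threading


def read_file_via_worker(path):
    """Read `path` on a worker thread; wake an epoll loop when it is done."""
    wake_r, wake_w = os.pipe()          # stand-in for libuv's eventfd
    results = []

    def worker():
        with open(path, "rb") as f:
            results.append(f.read())    # may block on disk; the loop is unaffected
        os.write(wake_w, b"\x01")       # signal completion to the loop

    ep = select.epoll()
    ep.register(wake_r, select.EPOLLIN)
    thread = threading.Thread(target=worker)
    thread.start()
    try:
        woke = False
        for fd, _mask in ep.poll(timeout=5):
            if fd == wake_r:
                os.read(wake_r, 1)      # clear the wakeup
                woke = True
        thread.join()
        return woke, results[0]
    finally:
        ep.close()
        os.close(wake_r)
        os.close(wake_w)


def demo():
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello from disk")
        tmp.flush()
        return read_file_via_worker(tmp.name)
```

The event loop only ever waits on the wakeup fd, which epoll can watch; the regular file is touched by the worker alone.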

Building a crude event server in Python

Python exposes epoll via the select module on Linux. No third-party packages needed.

import socket
import select

def run_server(host='', port=8080):
    # Create and configure the listening socket
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(128)
    server.setblocking(False)

    # Create epoll instance — equivalent to epoll_create1(EPOLL_CLOEXEC)
    ep = select.epoll()
    # Register listening socket for incoming connections (level-triggered)
    ep.register(server.fileno(), select.EPOLLIN)

    connections = {}   # fd -> socket object
    requests    = {}   # fd -> bytes accumulated so far
    responses   = {}   # fd -> bytes remaining to send

    print(f'Listening on :{port}')
    try:
        while True:
            # epoll_wait — blocks up to 1 second
            events = ep.poll(timeout=1)

            for fd, event in events:

                if fd == server.fileno():
                    # EPOLLIN on the listening socket = new connection
                    conn, addr = server.accept()
                    conn.setblocking(False)
                    # Register new client fd for reading
                    ep.register(conn.fileno(), select.EPOLLIN)
                    connections[conn.fileno()] = conn
                    requests[conn.fileno()]    = b''

                elif event & select.EPOLLIN:
                    # Data ready to read on a client fd
                    data = connections[fd].recv(4096)
                    if data:
                        requests[fd] += data
                        # Got a complete HTTP request?
                        if b'\r\n\r\n' in requests[fd]:
                            body = b'Hello from epoll!\r\n'
                            responses[fd] = (
                                b'HTTP/1.1 200 OK\r\n'
                                b'Content-Type: text/plain\r\n'
                                b'Content-Length: '
                                + str(len(body)).encode()
                                + b'\r\nConnection: close\r\n\r\n'
                                + body
                            )
                            # Switch fd to write mode
                            ep.modify(fd, select.EPOLLOUT)
                    else:
                        # Empty read = peer closed connection
                        ep.unregister(fd)
                        connections[fd].close()
                        del connections[fd], requests[fd]

                elif event & select.EPOLLOUT:
                    # Socket ready to write — send remaining response bytes
                    if fd in responses and responses[fd]:
                        sent = connections[fd].send(responses[fd])
                        responses[fd] = responses[fd][sent:]
                    if not responses.get(fd):
                        # All sent — close and clean up
                        ep.unregister(fd)
                        try:
                            connections[fd].shutdown(socket.SHUT_RDWR)
                        except OSError:
                            pass  # peer already disconnected
                        connections[fd].close()
                        del connections[fd], requests[fd]
                        responses.pop(fd, None)

                elif event & (select.EPOLLHUP | select.EPOLLERR):
                    # Remote end hung up, or the socket reported an error
                    ep.unregister(fd)
                    connections[fd].close()
                    del connections[fd]
                    requests.pop(fd, None)
                    responses.pop(fd, None)

    finally:
        ep.close()
        server.close()

if __name__ == '__main__':
    run_server()

What this demonstrates:

State machine per connection. Each fd moves through states: reading → writing → closed. ep.modify(fd, select.EPOLLOUT) is the transition — it calls epoll_ctl(EPOLL_CTL_MOD) under the hood, switching interest from EPOLLIN to EPOLLOUT. The kernel updates the interest list in O(log n).

No threads. One process, one loop, thousands of simultaneous connections. The event loop is non-blocking throughout: recv() and send() on non-blocking sockets return immediately with partial data, or fail with EAGAIN, which Python raises as BlockingIOError. Partial writes are handled by tracking remaining bytes in responses[fd] and re-entering the EPOLLOUT handler on the next iteration.

The fd → socket mapping. epoll returns file descriptor numbers. We maintain connections[fd] to map back to the socket object. Using ev.data.ptr in C to point directly to a connection struct eliminates this lookup.

Try it:

python3 server.py &
curl http://localhost:8080/
# Hello from epoll!

Load test it with wrk or ab — a single Python process will handle thousands of concurrent connections without threading.

The numbers

What “C10K” means today has shifted considerably. The original 1999 bar of 10,000 connections was solved with epoll. The modern baseline:

Nginx handles 100,000–1,000,000 concurrent connections per server
HAProxy 2.x reaches 2 million concurrent connections on commodity hardware
MigratoryData demonstrated 10–12 million concurrent connections on a single Linux server (the C10M problem)

The theoretical ceiling on Linux is bounded by per-socket kernel memory (~1–4 KB), fs.file-max, ulimit -n, and network throughput — not epoll’s algorithmic complexity.

A single-threaded event loop using epoll processes roughly 100,000–500,000 small requests per second on modern hardware. The bottleneck is memory bandwidth and network I/O. epoll’s overhead is negligible.