epoll: The API That Powers the Modern Internet
How Linux solved the C10K problem — the red-black tree, the ready list, ep_poll_callback, and why Redis, Nginx, and Node.js all converge on the same three syscalls.
- DATE:
- APR.29.2026
- READ:
- 24 MIN
The problem: 10,000 connections in 1999
In 1999, Dan Kegel published a paper titled “It’s time for web servers to handle ten thousand clients simultaneously, don’t you think?” The C10K problem wasn’t just about connection counts. It exposed a fundamental flaw in how Unix programs monitored file descriptors for I/O.
The dominant approach at the time was select() — a POSIX syscall that blocks until one or more file descriptors are ready for I/O. The API looked reasonable:
int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
The problem was what happened inside the kernel on every call:
- The entire fd_set bitmask was copied from user space to kernel space
- The kernel scanned every fd from 0 to nfds to check readiness
- The modified bitmask was copied back to user space
- The application had to scan the bitmask again to find which fds fired
- Repeat from step 1
For a server with 10,000 mostly-idle connections, this meant copying and scanning 10,000 entries on every event — even if only 3 had activity. O(n) work for O(k) useful results, where k ≪ n.
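The cost model above can be seen from user space with a small sketch. It uses Python's select.select as a stand-in for the C call and models idle connections with local socket pairs; the counts (50 watched, 3 active) are illustrative assumptions, not figures from a benchmark.

```python
import select
import socket

# 50 mostly idle "connections", modelled with local socket pairs
pairs = [socket.socketpair() for _ in range(50)]
watched = [ours for ours, _ in pairs]

# Only 3 peers send anything
for _, peer in pairs[:3]:
    peer.send(b"x")

# Every call hands the complete list to the kernel again;
# the kernel checks all 50 and reports the ready subset.
readable, _, _ = select.select(watched, [], [], 0)
print(f"{len(watched)} fds passed in, {len(readable)} ready")

for ours, peer in pairs:
    ours.close()
    peer.close()
```

The full list is rebuilt and passed on each call no matter how few fds are active, which is the O(n) term the paragraph describes.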
poll() addressed select()’s 1024-fd limit (FD_SETSIZE) by using an array of struct pollfd instead of bitmasks. But it kept the same O(n) scanning model. The copy happened every call. 10,000 connections still meant 10,000 entries traversed per wakeup.
The C10K problem needed a different approach entirely.
epoll: O(1) wait via persistent state
Davide Libenzi submitted the epoll patch in October 2002. It was merged into Linux 2.5.44 and reached production with Linux 2.6.0 in December 2003.
The core insight: move the interest list into the kernel and keep it there. Instead of rebuilding it on every call, register fds once via epoll_ctl. When you call epoll_wait, the kernel doesn’t scan anything — it just hands you the fds that already signaled readiness.
The difference in behavior at scale is stark. With 10,000 connections and 3 active:
- poll: copies 80KB array, scans 10,000 entries, returns count 3
- epoll: copies nothing in, walks a 3-element ready list, copies 3 events out
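The same scenario can be sketched with Python's select.epoll wrapper. The socket pairs and the counts are illustrative assumptions; the point is that registration happens once and the wait call takes no fd list.

```python
import select
import socket

pairs = [socket.socketpair() for _ in range(50)]
ep = select.epoll()

# One-time registration: each call is an epoll_ctl(EPOLL_CTL_ADD)
for ours, _ in pairs:
    ep.register(ours.fileno(), select.EPOLLIN)

# Only 3 peers send anything
for _, peer in pairs[:3]:
    peer.send(b"x")

# epoll_wait: no fd list goes in; only the ready fds come back
events = ep.poll(0)
ready_fds = sorted(fd for fd, _ in events)
expected_fds = sorted(ours.fileno() for ours, _ in pairs[:3])
print(len(events), "events, matching the active fds:", ready_fds == expected_fds)

ep.close()
for ours, peer in pairs:
    ours.close()
    peer.close()
```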
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Feature         | select                           | poll                             | epoll                     |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Max fds         | 1024 (FD_SETSIZE)                | Unlimited                        | Unlimited                 |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Wait complexity | O(n)                             | O(n)                             | O(1) / O(k ready)         |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Add/remove cost | Rebuild fd_set: O(n)             | Rebuild array: O(n)              | O(log n) per epoll_ctl    |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Kernel state    | Stateless - full copy every call | Stateless - full copy every call | Persistent red-black tree |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Copy on wait    | O(n) bitmask copied in + out     | O(n) pollfd array copied         | Zero copies on wait       |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Returns         | Count; app scans bitmasks        | Count; app scans revents         | Ready events directly     |
+-----------------+----------------------------------+----------------------------------+---------------------------+
| Portability     | POSIX everywhere                 | POSIX everywhere                 | Linux only                |
+-----------------+----------------------------------+----------------------------------+---------------------------+
The three syscalls
epoll’s entire API surface is three syscalls.
epoll_create1(flags)
int epfd = epoll_create1(EPOLL_CLOEXEC);
Creates an epoll instance and returns a file descriptor representing it. That fd is a real, closeable fd; it can even be watched by another epoll instance, enabling hierarchical event trees.
EPOLL_CLOEXEC sets FD_CLOEXEC on the returned fd, so the kernel closes it automatically when the process calls exec(). Always use this flag: without it, any program you exec() inherits the epoll fd, which is almost never intentional. A forked child still gets its own copy of the fd either way; the flag only affects what survives exec().
The older epoll_create(int size) still exists. Since Linux 2.6.8, the size argument is completely ignored (the kernel dynamically sizes its internal structures), but it must be positive for historical reasons. Prefer epoll_create1.
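A short sketch of two properties mentioned in this section, using Python's select.epoll (which creates the fd with close-on-exec set since Python 3.4). The names inner and outer are illustrative; the nesting shown is a toy, not a recommended architecture.

```python
import os
import select
import socket

inner = select.epoll()
outer = select.epoll()
inner_fd = inner.fileno()
ours, peer = socket.socketpair()

inner.register(ours.fileno(), select.EPOLLIN)
# An epoll fd is itself pollable, so another instance can watch it
outer.register(inner_fd, select.EPOLLIN)

before = outer.poll(0)   # nothing is ready yet
peer.send(b"x")
after = outer.poll(0)    # inner now holds a ready event

# Close-on-exec shows up in Python as "not inheritable"
inheritable = os.get_inheritable(inner_fd)
print(before, after, inheritable)

outer.close()
inner.close()
ours.close()
peer.close()
```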
epoll_ctl(epfd, op, fd, event)
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET;
ev.data.fd = client_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev); // register
epoll_ctl(epfd, EPOLL_CTL_MOD, client_fd, &ev); // change mask
epoll_ctl(epfd, EPOLL_CTL_DEL, client_fd, NULL); // remove
epoll_ctl is O(log n): it operates on the kernel’s red-black tree (more on this below). This is the rare operation: you register once when a connection arrives, deregister when it closes. epoll_wait is the hot path.
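The three operations map directly onto Python's select.epoll methods, which makes them easy to try interactively. This is a minimal sketch with a local socket pair standing in for a client connection.

```python
import select
import socket

ours, peer = socket.socketpair()
ep = select.epoll()
fd = ours.fileno()

ep.register(fd, select.EPOLLIN)    # EPOLL_CTL_ADD
peer.send(b"x")
as_reader = ep.poll(0)             # readable: data is waiting

ep.modify(fd, select.EPOLLOUT)     # EPOLL_CTL_MOD: swap the mask
as_writer = ep.poll(0)             # writable: the send buffer has room

ep.unregister(fd)                  # EPOLL_CTL_DEL
after_del = ep.poll(0)             # no longer watched
print(as_reader, as_writer, after_del)

ep.close()
ours.close()
peer.close()
```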
The data union in struct epoll_event is opaque to the kernel — whatever you store there is returned verbatim on the next epoll_wait:
typedef union epoll_data {
    void    *ptr;   // point to your own connection struct
    int      fd;    // simplest: just store the fd number
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;
Using data.ptr to point to a connection struct (instead of data.fd) is a common pattern: it avoids a lookup table and gives your handler direct access to per-connection state.
epoll_wait(epfd, events, maxevents, timeout)
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, -1); // -1 = block forever
for (int i = 0; i < n; i++) {
    handle(events[i].data.fd, events[i].events);
}
Returns up to maxevents ready events. If more events are ready than maxevents allows, the remainder stay in the ready list for the next call. timeout=0 returns immediately (non-blocking); timeout>0 is milliseconds.
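The maxevents behavior is easy to observe. In this sketch five fds are ready but the first call asks for at most two; note that Python's poll() takes its timeout in seconds, unlike the millisecond argument of the C call.

```python
import select
import socket

pairs = [socket.socketpair() for _ in range(5)]
ep = select.epoll()
for ours, peer in pairs:
    ep.register(ours.fileno(), select.EPOLLIN)
    peer.send(b"x")              # make all five readable

batch = ep.poll(0, 2)            # timeout=0, maxevents=2
everything = ep.poll(0)          # level-triggered: all five are still ready
print(len(batch), len(everything))

ep.close()
for ours, peer in pairs:
    ours.close()
    peer.close()
```

Nothing was read between the two calls, so in level-triggered mode the second call reports every fd again, including the two already returned.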
Event flags
+----------------+-----------+----------------------------------------------+
| Flag           | Direction | Meaning                                      |
+----------------+-----------+----------------------------------------------+
| EPOLLIN        | watch     | fd has data to read                          |
+----------------+-----------+----------------------------------------------+
| EPOLLOUT       | watch     | fd can accept a write without blocking       |
+----------------+-----------+----------------------------------------------+
| EPOLLRDHUP     | watch     | Peer closed or shut down write half (Linux   |
|                |           | 2.6.17+)                                     |
+----------------+-----------+----------------------------------------------+
| EPOLLPRI       | watch     | Out-of-band / urgent data                    |
+----------------+-----------+----------------------------------------------+
| EPOLLERR       | auto      | Error - always reported, never needs to be   |
|                |           | set                                          |
+----------------+-----------+----------------------------------------------+
| EPOLLHUP       | auto      | Hang-up - always reported                    |
+----------------+-----------+----------------------------------------------+
| EPOLLET        | modifier  | Edge-triggered mode (default is              |
|                |           | level-triggered)                             |
+----------------+-----------+----------------------------------------------+
| EPOLLONESHOT   | modifier  | Single-fire; must re-arm with EPOLL_CTL_MOD  |
+----------------+-----------+----------------------------------------------+
| EPOLLEXCLUSIVE | modifier  | One waiter woken per ready event (Linux      |
|                |           | 4.5+)                                        |
+----------------+-----------+----------------------------------------------+
EPOLLERR and EPOLLHUP are always monitored by the kernel and always reported — you do not need to add them to your event mask, but you must handle them in your dispatch loop.
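A small sketch of the "always reported" rule: the pipe's read end is registered for EPOLLIN only, yet closing the write end still produces EPOLLHUP. A pipe is used here purely because it makes the hang-up easy to trigger.

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)   # EPOLLHUP is not in the mask

os.close(w)                      # the last writer goes away
events = ep.poll(0)

# A dispatch loop has to check for these bits even though it never asked
hup_seen = any(ev & select.EPOLLHUP for _, ev in events)
print(events, hup_seen)

ep.close()
os.close(r)
```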
Inside the kernel: what actually happens
Understanding the internals is what separates using epoll from understanding epoll.
The interest list: a red-black tree
When you call epoll_ctl(EPOLL_CTL_ADD), the kernel inserts an epitem structure into a red-black tree (struct rb_root_cached). The key is (file description, fd).
A red-black tree because:
- epoll_ctl(ADD) must reject duplicate registrations (EEXIST), which needs a lookup
- MOD and DEL need O(log n) find-by-key
- Self-balancing, so there is no pathological worst case
The tree persists between epoll_wait calls. This is the core of why epoll doesn’t copy anything on wait — the state is already in the kernel.
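The keyed lookups are visible from user space as error codes. In this sketch a duplicate ADD fails with EEXIST, and a MOD on a key that is no longer in the tree fails with ENOENT.

```python
import errno
import select
import socket

ours, peer = socket.socketpair()
ep = select.epoll()
fd = ours.fileno()

ep.register(fd, select.EPOLLIN)
try:
    ep.register(fd, select.EPOLLIN)      # same key already in the tree
    duplicate = None
except OSError as exc:
    duplicate = errno.errorcode[exc.errno]

ep.unregister(fd)
try:
    ep.modify(fd, select.EPOLLOUT)       # key was removed
    missing = None
except OSError as exc:
    missing = errno.errorcode[exc.errno]

print(duplicate, missing)

ep.close()
ours.close()
peer.close()
```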
The ready list: a doubly-linked list
The kernel also maintains a ready list (rdllist) — a doubly-linked list of epitem structs that have pending events. When epoll_wait runs, it harvests this list: O(k) where k is the number of ready fds. It never touches the fds that have no pending events.
ep_poll_callback: the notification path
When you call epoll_ctl(ADD), the kernel registers a callback function (ep_poll_callback) on the target fd’s VFS wait queue — the same wait queue that poll() uses. This is how epoll hooks into the kernel’s existing notification infrastructure without any polling.
When a socket becomes readable (data arrives from the network):
- NIC hardware interrupt fires
- Kernel network stack processes the incoming packet
- Socket’s wait queue is woken
- ep_poll_callback() fires (from softirq context)
- It adds the epitem to rdllist (the ready list)
- It wakes any thread sleeping in epoll_wait
- epoll_wait copies the ready events to user space and returns
The callback runs at interrupt time, so it uses spinlocks and is non-blocking. The entire path from NIC interrupt to your application code is a handful of function calls — no O(n) scanning anywhere.
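The wake-up path can be observed from user space with a rough sketch: the main thread sleeps in epoll_wait and is woken when another thread writes to the peer socket. The 50 ms delay is an arbitrary stand-in for a packet arriving, and a local socket pair replaces the NIC.

```python
import select
import socket
import threading
import time

ours, peer = socket.socketpair()
ep = select.epoll()
ep.register(ours.fileno(), select.EPOLLIN)

def deliver_later():
    time.sleep(0.05)             # stand-in for a packet arriving
    peer.send(b"ping")

sender = threading.Thread(target=deliver_later)
sender.start()

start = time.monotonic()
events = ep.poll(5)              # sleeps until the callback wakes it
waited = time.monotonic() - start
sender.join()
print(len(events), "event after", round(waited, 3), "s")

ep.close()
ours.close()
peer.close()
```

The call returns as soon as the data lands rather than at the 5 second timeout, without the waiting thread doing any work in between.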
The dup() trap
One subtlety: the interest list key is (file description, fd), not just fd. If you dup() a file descriptor, the original and the duplicate are different fds that refer to the same underlying file description, so you can register both with different event masks. The trap is on close: closing a registered fd does not remove its entry while another fd still refers to that file description. The entry persists, and can keep reporting events for the closed fd number, until every fd pointing to the file description is closed. Call EPOLL_CTL_DEL before close() if the fd may have been duplicated.
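A minimal reproduction of the trap, using a pipe so that raw fd numbers are visible. The name alias is illustrative; the behavior shown follows the semantics described in epoll(7).

```python
import os
import select

r, w = os.pipe()
alias = os.dup(r)                # second fd, same file description
ep = select.epoll()
ep.register(r, select.EPOLLIN)

os.close(r)                      # description stays open through alias
os.write(w, b"x")
stale = ep.poll(0)               # still reports the closed fd number

os.close(alias)                  # last reference gone: entry disappears
after = ep.poll(0)
print(stale, after)

ep.close()
os.close(w)
```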
Level-triggered vs edge-triggered
+---------------------------+---------------------------------+--------------------------------------+
| Dimension                 | Level-triggered (default)       | Edge-triggered (EPOLLET)             |
+---------------------------+---------------------------------+--------------------------------------+
| When it fires             | While data is available         | Only when state changes (new data    |
|                           |                                 | arrives)                             |
+---------------------------+---------------------------------+--------------------------------------+
| Partial reads OK?         | Yes - will fire again next call | No - you must drain to EAGAIN        |
+---------------------------+---------------------------------+--------------------------------------+
| Non-blocking fd required? | No                              | Yes - mandatory                      |
+---------------------------+---------------------------------+--------------------------------------+
| Who uses it               | Redis                           | Nginx, most high-perf servers        |
+---------------------------+---------------------------------+--------------------------------------+
| Thundering herd risk      | Higher (all waiters wake)       | Lower (fire-once semantics)          |
+---------------------------+---------------------------------+--------------------------------------+
Level-triggered (default)
The kernel delivers an event every time epoll_wait is called while the condition remains true. If 100 bytes are in the receive buffer and you only read 50, the next epoll_wait immediately returns EPOLLIN again.
Behavior is identical to poll(). Safe, easy. You can read partial data and resume on the next event.
Edge-triggered (EPOLLET)
The kernel delivers an event only when state changes — when new data arrives, not while old data sits unread. If 100 bytes are in the buffer and you only read 50, epoll_wait will not fire for that fd again unless new data arrives.
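The 100-byte example can be run directly for both modes. The helper name events_after_partial_read is hypothetical; it returns how many events were reported before and after a 50-byte partial read.

```python
import select
import socket

def events_after_partial_read(extra_flags):
    """Send 100 bytes, read 50, and count events before and after."""
    ours, peer = socket.socketpair()
    ours.setblocking(False)
    ep = select.epoll()
    ep.register(ours.fileno(), select.EPOLLIN | extra_flags)

    peer.send(b"x" * 100)
    first = ep.poll(0)
    ours.recv(50)                # leave 50 bytes unread
    second = ep.poll(0)

    ep.close()
    ours.close()
    peer.close()
    return len(first), len(second)

level = events_after_partial_read(0)
edge = events_after_partial_read(select.EPOLLET)
print("level-triggered:", level)   # (1, 1)
print("edge-triggered: ", edge)    # (1, 0)
```

In the edge-triggered run the remaining 50 bytes sit unread with no further notification, which is why the handler has to drain the socket.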
This requires discipline:
// Set the fd non-blocking: mandatory for EPOLLET
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
// Register with edge-triggered
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
// When EPOLLIN fires, drain completely
while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0) {
        if (errno == EINTR) continue;                        // interrupted: retry
        if (errno == EAGAIN || errno == EWOULDBLOCK) break;  // buffer empty: done
        /* real error */ break;
    }
    if (n == 0) break; // EOF
    process(buf, n);
}
Thundering herd and EPOLLEXCLUSIVE
Classic problem with level-triggered: N worker threads all epoll_wait() on the same epoll fd. A new connection arrives. All N threads wake up. Only one accept() succeeds; the rest return EAGAIN and go back to sleep. Wasted context switches at scale.
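EPOLLEXCLUSIVE (Linux 4.5+) addresses the closely related case where each worker has its own epoll instance and all of them watch the same fd, typically a shared listening socket. Without the flag, one event wakes every instance; with it, the kernel uses WQ_FLAG_EXCLUSIVE on the target’s wait queue and wakes one or more waiters instead of all of them. No mutex, no user-space coordination. The flag is only accepted with EPOLL_CTL_ADD, not EPOLL_CTL_MOD.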
EPOLLONESHOT is the single-connection variant: the fd is automatically disabled after firing once. You re-arm it with EPOLL_CTL_MOD. Guarantees at most one thread processes a given fd at a time.
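The disarm and re-arm cycle of EPOLLONESHOT can be sketched in a few lines. A single thread is enough to show the semantics; in a real server the re-arm would happen after the worker finishes with the connection.

```python
import select
import socket

ours, peer = socket.socketpair()
ep = select.epoll()
fd = ours.fileno()
mask = select.EPOLLIN | select.EPOLLONESHOT

ep.register(fd, mask)
peer.send(b"a")
first = ep.poll(0)       # fires once, then the fd is disarmed

peer.send(b"b")
second = ep.poll(0)      # nothing, even though more data arrived

ep.modify(fd, mask)      # re-arm with EPOLL_CTL_MOD
third = ep.poll(0)       # the pending data is reported again
print(len(first), len(second), len(third))

ep.close()
ours.close()
peer.close()
```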
What epoll cannot watch
+------------------------------+------------------+----------------------------------+
| Type                         | epoll_ctl result | Reason                           |
+------------------------------+------------------+----------------------------------+
| TCP/UDP socket               | Works            | Primary use case                 |
+------------------------------+------------------+----------------------------------+
| Unix domain socket           | Works            |                                  |
+------------------------------+------------------+----------------------------------+
| Pipe                         | Works            | VFS poll() implemented           |
+------------------------------+------------------+----------------------------------+
| stdin / stdout / tty         | Works            |                                  |
+------------------------------+------------------+----------------------------------+
| signalfd / timerfd / eventfd | Works            | Designed for epoll               |
+------------------------------+------------------+----------------------------------+
| Regular file                 | EPERM            | Always "ready" - no true async   |
|                              |                  | readiness                        |
+------------------------------+------------------+----------------------------------+
| Block device                 | EPERM            | Same reason as regular files     |
+------------------------------+------------------+----------------------------------+
| /proc, /sys                  | Usually EPERM    |                                  |
+------------------------------+------------------+----------------------------------+
Regular files are the important case. epoll_ctl(EPOLL_CTL_ADD) on a regular file returns EPERM. This is not a bug or limitation to be worked around — it reflects something true about disk I/O.
Regular files don’t have the concept of readiness. Data is either in the page cache (available instantly) or on disk (the kernel blocks waiting for the read to complete — there is no intermediate “not yet ready, notify me later” state at the VFS level). There is no notification path to hook ep_poll_callback into.
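The EPERM result is simple to confirm. This sketch registers a temporary regular file and, for contrast, the read end of a pipe.

```python
import errno
import os
import select
import tempfile

ep = select.epoll()

with tempfile.TemporaryFile() as regular:
    try:
        ep.register(regular.fileno(), select.EPOLLIN)
        file_result = "registered"
    except OSError as exc:
        file_result = errno.errorcode[exc.errno]

r, w = os.pipe()
ep.register(r, select.EPOLLIN)   # a pipe has a wait queue to hook into
pipe_result = "registered"
print("regular file:", file_result, "| pipe:", pipe_result)

ep.close()
os.close(r)
os.close(w)
```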
This is why Node.js and libuv maintain a thread pool for file system operations. File reads run on worker threads that block on disk; when complete, they signal the main event loop via an eventfd. Network I/O goes through epoll; file I/O goes through the thread pool.
io_uring (Linux 5.1+), a completion-based I/O interface, does work with regular files. Cloudflare has written about io_uring as the epoll alternative that handles the cases epoll cannot. For new Linux servers, it is worth evaluating.
How Redis uses epoll
Redis implements its event loop in src/ae.c (async events). The ae layer abstracts over four platform I/O backends: evport (Solaris), epoll (Linux), kqueue (BSD/macOS), select (fallback). On Linux, it uses epoll.
The state is minimal:
typedef struct aeApiState {
    int epfd;
    struct epoll_event *events; // pre-allocated result array
} aeApiState;
The hot path is aeApiPoll, called on every event loop iteration:
static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    aeApiState *state = eventLoop->apidata;
    int timeout = tvp ? (tvp->tv_sec * 1000 + tvp->tv_usec / 1000) : -1;
    int n = epoll_wait(state->epfd, state->events, eventLoop->setsize, timeout);
    for (int j = 0; j < n; j++) {
        int mask = 0;
        struct epoll_event *e = state->events + j;
        if (e->events & EPOLLIN)  mask |= AE_READABLE;
        if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
        if (e->events & EPOLLERR) mask |= AE_WRITABLE | AE_READABLE;
        if (e->events & EPOLLHUP) mask |= AE_WRITABLE | AE_READABLE;
        eventLoop->fired[j].fd = e->data.fd;
        eventLoop->fired[j].mask = mask;
    }
    return n;
}
Redis uses level-triggered epoll (no EPOLLET). It registers a new client fd with EPOLL_CTL_ADD when the connection is accepted, and either EPOLL_CTL_ADD or EPOLL_CTL_MOD (depending on whether the fd is already in the tree) when switching between read and write interest.
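The two translations inside aeApiPoll, timeval to milliseconds and epoll bits to AE bits, can be restated as a pair of small functions. This is a Python paraphrase of the C snippet above, not Redis code; the numeric values chosen for AE_READABLE and AE_WRITABLE are assumptions for illustration.

```python
import select

# Illustrative constants mirroring the AE_* names in the C snippet
AE_READABLE = 1
AE_WRITABLE = 2

def to_timeout_ms(tv_sec, tv_usec):
    """timeval -> the millisecond timeout epoll_wait expects."""
    return tv_sec * 1000 + tv_usec // 1000

def to_ae_mask(events):
    """epoll event bits -> abstract readable/writable mask."""
    mask = 0
    if events & select.EPOLLIN:
        mask |= AE_READABLE
    if events & select.EPOLLOUT:
        mask |= AE_WRITABLE
    # Errors and hang-ups fire both handlers so one of them sees the failure
    if events & (select.EPOLLERR | select.EPOLLHUP):
        mask |= AE_READABLE | AE_WRITABLE
    return mask

print(to_timeout_ms(1, 500000), to_ae_mask(select.EPOLLIN), to_ae_mask(select.EPOLLHUP))
```

Mapping EPOLLERR and EPOLLHUP onto both bits means the backend needs no separate error callback: whichever handler runs will hit the error on its next read or write.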
The single-threaded model
From Redis 1.0 through 5.x: one thread. Every client connection, every command read, every response write — all in one epoll loop. No locks, no contention. Redis commands are typically O(1) or O(n) over small data; the bottleneck is network I/O, not CPU. A single-threaded event loop processes hundreds of thousands of commands per second.
Redis 6.0 (2020) added I/O threading for reading from sockets and writing responses. Command execution remains single-threaded — the main thread still calls epoll_wait and processes all commands in order. This delivered 37–112% throughput improvement on high-core-count systems, confirming that network I/O, not command execution, was the bottleneck.
How Nginx uses epoll
Nginx’s epoll backend lives in src/event/modules/ngx_epoll_module.c.
Nginx pre-forks N worker processes (typically equal to CPU core count). Each worker has its own epoll instance — no shared epoll fd across processes. Workers pre-allocate a fixed struct epoll_event[worker_connections] array to avoid per-call allocation.
Nginx registers connections with edge-triggered mode:
// From ngx_epoll_add_connection()
ee.events = EPOLLIN | EPOLLOUT | EPOLLET | EPOLLRDHUP;
Both read and write interest are registered upfront, not toggled per direction. This works because Nginx’s handlers always drain to EAGAIN, as required by EPOLLET semantics.
The thundering herd story in Nginx
Old problem: all workers add the listening socket to their epoll instance. A new connection arrives → all workers wake up → only one accept() succeeds → the rest burn a context switch.
accept_mutex (Nginx’s original solution): a cross-process mutex. Only the mutex holder adds the listening socket to its epoll instance. Serializes accepts completely. Safe, but adds latency under high connection rates.
SO_REUSEPORT (Linux 3.9+): each worker creates its own listening socket on the same port. The kernel distributes connections across sockets using a 4-tuple hash. No mutex needed. Nginx added support in 1.9.1. This is the recommended modern configuration.
EPOLLEXCLUSIVE (Linux 4.5+): Nginx 1.11.3 added support. Add the listening socket to all workers with EPOLLEXCLUSIVE — only one worker wakes per new connection, without any mutex. Cleaner than accept_mutex and doesn’t require per-worker listen sockets.
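The SO_REUSEPORT arrangement can be sketched in one process: two listeners on the same port, each with its own epoll instance, standing in for two workers. The helper name reuseport_listener and the loopback address are assumptions for the demo; this is not how Nginx itself is structured internally.

```python
import select
import socket

def reuseport_listener(port):
    """Listening socket that shares its port with other SO_REUSEPORT sockets."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    s.setblocking(False)
    return s

first = reuseport_listener(0)            # let the kernel pick a free port
port = first.getsockname()[1]
second = reuseport_listener(port)        # same port, separate accept queue

# One epoll instance per "worker", each watching only its own listener
workers = []
for listener in (first, second):
    ep = select.epoll()
    ep.register(listener.fileno(), select.EPOLLIN)
    workers.append((ep, listener))

client = socket.create_connection(("127.0.0.1", port), timeout=2)
woken = sum(len(ep.poll(0.2)) for ep, _ in workers)
print("workers woken by one connection:", woken)

client.close()
for ep, listener in workers:
    ep.close()
    listener.close()
```

Because the kernel places the connection in exactly one accept queue, only that worker's epoll instance reports an event.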
How Node.js uses epoll
Node.js uses libuv as its cross-platform async I/O library. On Linux, libuv’s uv__io_poll() function calls epoll_wait. On macOS/BSD, it calls kqueue. On Windows, IOCP.
The Node.js event loop runs in phases:
timers → setTimeout, setInterval
pending → I/O callbacks deferred from previous iteration
idle / prepare → internal libuv housekeeping
poll (I/O) → epoll_wait() blocks here; dispatches I/O callbacks
check → setImmediate()
close → socket.on('close', ...) callbacks
The poll phase is where epoll_wait runs. libuv calculates the timeout: 0 if setImmediate() callbacks are queued (don’t block), otherwise the time until the next setTimeout fires.
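The timeout rule can be written as a small function. This is a simplified sketch, not libuv's actual code: poll_timeout_ms and its parameters are hypothetical names, and the "no timers means block indefinitely" branch is an assumption added to make the function total.

```python
def poll_timeout_ms(immediates_queued, next_timer_at_ms, now_ms):
    """Timeout to hand to epoll_wait for one loop iteration (sketch)."""
    if immediates_queued:
        return 0                 # work is waiting: do not block
    if next_timer_at_ms is None:
        return -1                # assumption: no timers, block until I/O
    # Never negative: an overdue timer means "return immediately"
    return max(0, next_timer_at_ms - now_ms)

print(poll_timeout_ms(True, 5000, 1000))    # 0
print(poll_timeout_ms(False, 5000, 1000))   # 4000
print(poll_timeout_ms(False, 900, 1000))    # 0
print(poll_timeout_ms(False, None, 1000))   # -1
```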
Every net.Socket, net.Server, and dgram.Socket in Node.js is backed by a uv_tcp_t or uv_udp_t handle. When you do server.listen(3000), libuv:
- Creates a TCP socket, sets O_NONBLOCK
- Calls epoll_ctl(EPOLL_CTL_ADD) for EPOLLIN on the listening fd
- On each poll phase, epoll_wait returns when a connection arrives
- libuv calls accept(), wraps the client fd in a new uv_tcp_t, registers it with epoll
- Your connection callback fires
File system operations (fs.readFile, fs.writeFile, etc.) do not go through epoll — they run on a thread pool (default: 4 threads, configurable via UV_THREADPOOL_SIZE). When a worker thread completes a file operation, it writes to a uv_async_t handle (which is an eventfd under epoll) to wake the main loop.
This is a practical demonstration of epoll’s limitation with regular files: libuv simply routes the two types of I/O through two different mechanisms.
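The split between the two mechanisms can be sketched in Python: a blocking file read runs on a worker thread, which then wakes an epoll-based loop through a pollable fd. A pipe stands in for the eventfd that the text describes, and the names read_file and completed are hypothetical.

```python
import os
import queue
import select
import tempfile
from concurrent.futures import ThreadPoolExecutor

wake_r, wake_w = os.pipe()       # stand-in for the eventfd wake-up channel
ep = select.epoll()
ep.register(wake_r, select.EPOLLIN)
completed = queue.SimpleQueue()  # results handed back to the loop thread

def read_file(path):
    with open(path, "rb") as f:  # blocking read, off the loop thread
        data = f.read()
    completed.put((path, data))
    os.write(wake_w, b"\0")      # wake the event loop

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    path = tmp.name

pool = ThreadPoolExecutor(max_workers=4)
pool.submit(read_file, path)

events = ep.poll(5)              # the loop sleeps until the worker signals
os.read(wake_r, 1)               # drain the wake-up byte
result = completed.get()
print(events[0][0] == wake_r, result[1])

pool.shutdown()
os.unlink(path)
ep.close()
os.close(wake_r)
os.close(wake_w)
```

The loop thread never touches the file itself; it only learns about completion through an fd that epoll is able to watch.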
Building a crude event server in Python
Python exposes epoll via the select module on Linux. No third-party packages needed.
import socket
import select

def run_server(host='', port=8080):
    # Create and configure the listening socket
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(128)
    server.setblocking(False)
    # Create epoll instance, equivalent to epoll_create1(EPOLL_CLOEXEC)
    ep = select.epoll()
    # Register listening socket for incoming connections (level-triggered)
    ep.register(server.fileno(), select.EPOLLIN)
    connections = {}  # fd -> socket object
    requests = {}     # fd -> bytes accumulated so far
    responses = {}    # fd -> bytes remaining to send
    print(f'Listening on :{port}')
    try:
        while True:
            # epoll_wait: blocks up to 1 second
            events = ep.poll(timeout=1)
            for fd, event in events:
                if fd == server.fileno():
                    # EPOLLIN on the listening socket = new connection
                    conn, addr = server.accept()
                    conn.setblocking(False)
                    # Register new client fd for reading
                    ep.register(conn.fileno(), select.EPOLLIN)
                    connections[conn.fileno()] = conn
                    requests[conn.fileno()] = b''
                elif event & select.EPOLLIN:
                    # Data ready to read on a client fd
                    data = connections[fd].recv(4096)
                    if data:
                        requests[fd] += data
                        # Got a complete HTTP request?
                        if b'\r\n\r\n' in requests[fd]:
                            body = b'Hello from epoll!\r\n'
                            responses[fd] = (
                                b'HTTP/1.1 200 OK\r\n'
                                b'Content-Type: text/plain\r\n'
                                b'Content-Length: '
                                + str(len(body)).encode()
                                + b'\r\nConnection: close\r\n\r\n'
                                + body
                            )
                            # Switch fd to write mode
                            ep.modify(fd, select.EPOLLOUT)
                    else:
                        # Empty read = peer closed connection
                        ep.unregister(fd)
                        connections[fd].close()
                        del connections[fd], requests[fd]
                elif event & select.EPOLLOUT:
                    # Socket ready to write: send remaining response bytes
                    if fd in responses and responses[fd]:
                        sent = connections[fd].send(responses[fd])
                        responses[fd] = responses[fd][sent:]
                    if not responses.get(fd):
                        # All sent: close and clean up
                        ep.unregister(fd)
                        connections[fd].shutdown(socket.SHUT_RDWR)
                        connections[fd].close()
                        del connections[fd], requests[fd]
                        responses.pop(fd, None)
                elif event & select.EPOLLHUP:
                    # Remote end hung up
                    ep.unregister(fd)
                    connections[fd].close()
                    del connections[fd]
                    requests.pop(fd, None)
                    responses.pop(fd, None)
    finally:
        ep.close()
        server.close()

if __name__ == '__main__':
    run_server()
What this demonstrates:
State machine per connection. Each fd moves through states: reading → writing → closed. ep.modify(fd, select.EPOLLOUT) is the transition — it calls epoll_ctl(EPOLL_CTL_MOD) under the hood, switching interest from EPOLLIN to EPOLLOUT. The kernel updates the interest list in O(log n).
No threads. One process, one loop, thousands of simultaneous connections. The event loop is non-blocking throughout — recv() and send() on non-blocking sockets return immediately with partial data or EAGAIN. Partial writes are handled by tracking remaining bytes in responses[fd] and re-entering the EPOLLOUT handler on the next iteration.
The fd → socket mapping. epoll returns file descriptor numbers. We maintain connections[fd] to map back to the socket object. Using ev.data.ptr in C to point directly to a connection struct eliminates this lookup.
Try it:
python3 server.py &
curl http://localhost:8080/
# Hello from epoll!
Load test it with wrk or ab: a single Python process will handle thousands of concurrent connections without threading.
The numbers
What “C10K” means today has shifted considerably. The original 1999 bar of 10,000 connections was solved with epoll. The modern baseline:
- Nginx handles 100,000–1,000,000 concurrent connections per server
- HAProxy 2.x reaches 2 million concurrent connections on commodity hardware
- MigratoryData demonstrated 10–12 million concurrent connections on a single Linux server (the C10M problem)
The theoretical ceiling on Linux is bounded by per-socket kernel memory (~1–4 KB), fs.file-max, ulimit -n, and network throughput — not epoll’s algorithmic complexity.
A single-threaded event loop using epoll processes roughly 100,000–500,000 small requests per second on modern hardware. The bottleneck is memory bandwidth and network I/O. epoll’s overhead is negligible.
Further reading
The foundational papers:
- Dan Kegel — The C10K Problem (1999, updated 2014)
- Davide Libenzi — epoll more scalable than poll (LWN, 2002)
- sys_epoll — making poll fast (LWN, 2002)
Kernel source:
- fs/eventpoll.c: the epoll implementation (search for ep_poll_callback, rbr, rdllist)
Man pages (the authoritative spec):
- epoll(7) — overview and semantics
- epoll_create(2)
- epoll_ctl(2)
- epoll_wait(2)
Real-world implementations:
- Redis ae_epoll.c — 100 lines that power millions of deployments
- Nginx ngx_epoll_module.c
- libuv design overview
Deep dives:
- Graham King — epoll: The API that powers the modern internet
- Graham King — Linux: What can you epoll?
- Cloudflare — The sad state of Linux socket balancing
- Cloudflare — io_submit: the epoll alternative you’ve never heard about
- Marek Majkowski — Epoll is fundamentally broken
- Python select module
- Beej’s Guide to Network Programming
select and poll made you ask the kernel “which fds are ready?” on every call. epoll inverts the relationship: you register interest once, and the kernel notifies you when things change. That inversion, from polling to notification and from O(n) to O(1), is why Redis, Nginx, and Node.js all converge on the same three syscalls, and why a single server can hold a million simultaneous connections today.