Throughput & Benchmarking
Throughput here specifically refers to the number of requests a web server processes per unit of time.
A meaningful benchmark requires three inputs:
- Concurrent user count
- Total request count
- Description of the requested resource (e.g., file size, static vs. dynamic content)
Two key metrics to distinguish:
- User average wait time — measures service quality for a single user under a given concurrency level
- Server average request processing time — measures overall server quality; it is the reciprocal of throughput
Throughput is a composite metric — it's not simply a measure of concurrency capacity. For example, with a fixed total request count, more concurrent users can sometimes increase throughput. And serving a 2KB file vs a 2MB file produces very different completion times.
For HTTP requests with Connection: Keep-Alive, enabling persistent connection support reduces the number of accept() syscalls, directly lowering connection establishment overhead.
CPU Concurrency & Context Switching
The Cost of Context Switching
The essence of suspending a process is moving its CPU register data into kernel-space stack storage. Resuming it means loading that data back. This saved/restored data is called the hardware context.
Process suspend: [CPU registers] ──copy──▶ [kernel stack]
Process resume: [kernel stack] ──load──▶ [CPU registers]
To maximize concurrency, minimize context switches — reduce process count, prefer threads, and combine with appropriate I/O models.
Web servers using a multi-threaded model generally outperform multi-process models due to lower context switch overhead.
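To get a feel for how often a process is actually switched out, Linux exposes per-process counters. A minimal sketch using getrusage() (Linux/glibc assumed):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rusage ru;

    /* RUSAGE_SELF: statistics for the calling process */
    if (getrusage(RUSAGE_SELF, &ru) == -1) {
        perror("getrusage");
        return 1;
    }

    /* ru_nvcsw  - voluntary switches (e.g. blocking on I/O)
       ru_nivcsw - involuntary switches (preempted by the scheduler) */
    printf("voluntary context switches:   %ld\n", ru.ru_nvcsw);
    printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
    return 0;
}
```

A server that spends most of its time in involuntary switches is paying the hardware-context cost described above over and over.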
System Calls
A system call switches execution from user mode to kernel mode (e.g., opening a file, reading from disk, sending network data).
User mode ──syscall──▶ Kernel mode
(disk / network access)
◀──return───
This mode switch involves saving and swapping part of the process's execution context in memory, making syscalls relatively expensive. Reducing syscall frequency is a key performance optimization.
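One practical way to cut syscall frequency is batching: instead of issuing several small write() calls, gather the buffers and submit them in a single writev(). A minimal sketch (POSIX assumed; partial writes not handled):

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <string.h>

/* Send a response header and body with one syscall instead of two. */
ssize_t send_response(int fd, const char *header, const char *body) {
    struct iovec iov[2];

    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    /* One user-to-kernel transition covers both buffers. */
    return writev(fd, iov, 2);
}
```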
Memory Management
Nginx uses a stage-based memory allocation strategy: allocate on demand, release promptly, keeping total memory usage minimal.
10,000 inactive HTTP persistent connections require only 2.5 MB of memory in Nginx.
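Nginx does this with per-request memory pools. The code below is not Nginx's implementation, just a simplified sketch of the idea: allocation is a cheap pointer bump, and everything is released in one call when the request ends.

```c
#include <stdlib.h>
#include <stddef.h>

/* Simplified bump-pointer pool: one malloc up front, one free at the end. */
typedef struct {
    char   *buf;
    size_t  used;
    size_t  size;
} pool_t;

pool_t *pool_create(size_t size) {
    pool_t *p = malloc(sizeof(*p));
    if (!p) return NULL;
    p->buf = malloc(size);
    if (!p->buf) { free(p); return NULL; }
    p->used = 0;
    p->size = size;
    return p;
}

/* Allocation is just a pointer bump; real pools also align
   allocations and grow a new block when the current one is full. */
void *pool_alloc(pool_t *p, size_t n) {
    if (p->used + n > p->size) return NULL;
    void *ptr = p->buf + p->used;
    p->used += n;
    return ptr;
}

/* Releasing the request releases everything allocated from its pool. */
void pool_destroy(pool_t *p) {
    free(p->buf);
    free(p);
}
```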
I/O Models
Since CPU speed far exceeds I/O speed, I/O latency is often the performance bottleneck. Choosing the right I/O model is critical.
┌──────────────────────────────────────────────────────────────┐
│ Synchronous Blocking I/O │
│ process ──read()──▶ [suspended, waiting for data] ──▶ done │
├──────────────────────────────────────────────────────────────┤
│ Synchronous Non-Blocking I/O │
│ process ──read()──▶ [not ready, returns immediately] │
│ ──poll──▶ [not ready, returns immediately] │
│ ──poll──▶ [ready!] ──▶ process data │
│ Problem: wastes CPU polling fds that have no data │
├──────────────────────────────────────────────────────────────┤
│ I/O Multiplexing (select / poll / epoll) │
│ process ──epoll_wait()──▶ monitors N fds simultaneously │
│ kernel callback on ready ──▶ act │
│ Only processes fds that actually have data ✅ │
└──────────────────────────────────────────────────────────────┘
For web servers handling thousands of concurrent connections, polling every socket with non-blocking I/O wastes enormous CPU time. I/O multiplexing solves this by letting the kernel notify the process only when fds are actually ready.
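Both the non-blocking and the multiplexing rows above assume the socket has been taken out of blocking mode, which on Linux is typically done with fcntl(). A minimal helper:

```c
#include <fcntl.h>

/* Put an fd into non-blocking mode: read()/accept() now return -1 with
   errno == EAGAIN/EWOULDBLOCK instead of suspending the process. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```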
epoll: LT and ET Modes
epoll supports two triggering modes:
| Mode | Behavior |
|---|---|
| LT (Level Triggered, default) | Returns the fd on every epoll_wait call until the event is fully handled. Safe and easy. |
| ET (Edge Triggered, EPOLLET) | Returns the fd only once, when the event first occurs. Higher performance, but data left unread is never reported again, so an incompletely handled event silently stalls the connection (see the sketch below). |
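In ET mode the handler must drain the socket until read() reports EAGAIN, because leftover data will not be reported again. A sketch of the usual pattern (the fd is assumed to be non-blocking and registered with EPOLLIN | EPOLLET):

```c
#include <errno.h>
#include <unistd.h>

/* Edge-triggered handler: keep reading until the kernel buffer is empty. */
void handle_et_readable(int fd) {
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* ... process n bytes of data ... */
        } else if (n == 0) {
            close(fd);                 /* peer closed the connection */
            return;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return;                    /* drained; wait for the next edge */
        } else if (errno != EINTR) {
            close(fd);                 /* real error */
            return;
        }
    }
}
```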
Memory Mapping (mmap)
Linux can map a disk file directly to a memory address space. Accessing that memory region is equivalent to accessing the file — no read()/write() syscalls needed.
Without mmap: process ──read()──▶ kernel buffer ──copy──▶ user buffer
With mmap: process accesses memory directly ↔ kernel maps to disk
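A minimal sketch of reading a file through a mapping (the file name is just a placeholder):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.txt", O_RDONLY);        /* placeholder file name */
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file: page faults pull pages in from disk on demand,
       no read() syscall per access. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(p, 1, st.st_size, stdout);           /* the mapping is just memory */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```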
Note that epoll's efficiency does not come from mmap (a common misconception): fds are registered once with epoll_ctl, and epoll_wait copies back only the small set of ready events, as described in the epoll internals section below.
Direct I/O
Normally, writes go into the kernel buffer first and are flushed to disk lazily (which improves write efficiency). For applications like databases that manage their own cache, Linux's open() supports the O_DIRECT flag to bypass the kernel buffer entirely.
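A sketch of an O_DIRECT write. O_DIRECT is Linux-specific and requires the buffer, length, and file offset to be suitably aligned; the 4 KB used here is a common choice, not a universal rule, and the file name is a placeholder:

```c
#define _GNU_SOURCE            /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("dbfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);  /* placeholder */
    if (fd == -1) return 1;

    /* O_DIRECT needs an aligned buffer and an aligned length. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 'x', 4096);

    /* Data goes to the device without being staged in the kernel buffer;
       the application is now responsible for its own caching. */
    ssize_t n = write(fd, buf, 4096);
    (void)n;

    free(buf);
    close(fd);
    return 0;
}
```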
sendfile — Zero-Copy File Serving
Normal file-send flow:
disk ──▶ kernel buffer ──copy──▶ user process ──copy──▶ NIC kernel buffer
↑ unnecessary round-trip through user space
With sendfile:
disk ──▶ kernel buffer ──────────────────────────▶ NIC kernel buffer
(zero user-space copy) ✅
sendfile eliminates the redundant copy through user space, significantly improving static file serving performance.
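A sketch of serving a file to an already-connected socket with Linux's sendfile():

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy a whole file to a connected socket entirely inside the kernel. */
int serve_file(int client_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd == -1) return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    ssize_t left = st.st_size;
    while (left > 0) {
        ssize_t sent = sendfile(client_fd, file_fd, &offset, left);
        if (sent <= 0) break;          /* error or peer closed */
        left -= sent;                  /* sendfile advances offset for us */
    }

    close(file_fd);
    return left == 0 ? 0 : -1;
}
```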
epoll Internals
epoll is Linux's high-performance I/O multiplexing mechanism, capable of efficiently handling millions of socket file descriptors.
Three Core Syscalls
int epoll_create(int size);                                                       // create an epoll instance, returns epfd
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);               // add / modify / remove a watched fd
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); // wait for ready events
Why epoll Beats select/poll
select/poll:
Every call copies the entire fd list: user space → kernel space
10,000 fds → ~100KB copied per call ❌
epoll:
epoll_ctl registers fds once into kernel
epoll_wait only returns ready fds → typically very few ✅
Internal Architecture
┌──────────────────────────────────────────────────────┐
│ epoll instance │
│ │
│ ┌─────────────────────┐ ┌──────────────────────┐ │
│ │ Red-Black Tree │ │ Ready List │ │
│ │ (all watched fds) │ │ (fds with events) │ │
│ │ │ │ │ │
│ │ O(log N) insert/ │ │ ← kernel interrupt │ │
│ │ delete/search │ │ callback writes │ │
│ └─────────────────────┘ │ here automatically │ │
│ └──────────────────────┘ │
└──────────────────────────────────────────────────────┘
When a socket receives data, the kernel interrupt handler fires the registered callback, inserting the fd into the ready list. epoll_wait simply checks this list: if it is non-empty it returns immediately; otherwise the caller sleeps until an event arrives or the timeout expires.
The kernel pre-allocates a slab cache for fast epoll object allocation:
static int __init eventpoll_init(void) {
    /* slab cache for the per-watched-fd epitem objects (red-black tree nodes) */
    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), ...);
    /* slab cache for the wait-queue hooks that connect each fd to the epoll instance */
    pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry), ...);
}
Full epoll Lifecycle
epoll_create → creates red-black tree + ready list in kernel
epoll_ctl → inserts fd into red-black tree
registers interrupt callback for that fd
[data arrives] → kernel interrupt fires callback
fd inserted into ready list
epoll_wait → reads ready list, copies only ready fds to user space
returns immediately if list non-empty, else sleeps
epoll is a classic space-for-time trade-off: kernel memory (red-black tree + ready list) eliminates repeated fd list copying, enabling O(1) event notification regardless of total fd count.
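The lifecycle maps directly onto a small event loop. Below is a minimal sketch of an epoll-based echo loop (error handling trimmed; listen_fd is assumed to be a listening socket created elsewhere):

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd) {
    int epfd = epoll_create(1);                       /* kernel builds rb-tree + ready list */

    struct epoll_event ev, events[MAX_EVENTS];
    ev.events  = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* register once */

    for (;;) {
        /* Blocks until the ready list is non-empty (or 1 s passes). */
        int n = epoll_wait(epfd, events, MAX_EVENTS, 1000);

        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;

            if (fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                if (conn == -1) continue;
                ev.events  = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) {                         /* error or connection closed */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    write(fd, buf, r);                /* echo back */
                }
            }
        }
    }
}
```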
Server Concurrency Strategies
| Strategy | I/O Model | Servers | Pros | Cons |
|---|---|---|---|---|
| 1 process per connection | Non-blocking | Apache (prefork) | Stable, isolated | Context switch grows linearly with connections |
| 1 thread per connection | Non-blocking | Apache (worker) | More connections | Context switch still a bottleneck |
| 1 process handles N connections | Non-blocking + epoll/kqueue | Nginx, lighttpd | Massive concurrency | More complex implementation |
Worker Process Tuning (Nginx)
For PHP/FastCGI or reverse proxy scenarios, worker processes mostly forward requests rather than doing heavy computation — so increasing worker count can improve concurrency.
However, adding too many workers brings diminishing returns:
- More context switches
- More memory overhead
- Longer average response time across all connections
There is no one-size-fits-all answer. The optimal concurrency strategy depends on actual concurrent load. In general: fewer processes + more threads + epoll = best scalability.