
James Lee


Linux Server Performance Optimization: I/O Models, epoll Internals & Concurrency Strategies

Throughput & Benchmarking

Throughput here specifically refers to the number of requests a web server processes per unit of time.

A meaningful benchmark requires three inputs:

  1. Concurrent user count
  2. Total request count
  3. Request resource description

Two key metrics to distinguish:

  • User average wait time — measures service quality for a single user under a given concurrency level
  • Server average request processing time — measures overall server quality; it is the reciprocal of throughput

Throughput is a composite metric — it's not simply a measure of concurrency capacity. For example, with a fixed total request count, more concurrent users can sometimes increase throughput. And serving a 2KB file vs a 2MB file produces very different completion times.

For HTTP requests that use Connection: Keep-Alive, enabling persistent-connection support lets a single TCP connection carry many requests, reducing the number of connection establishments (and therefore accept() syscalls) and directly lowering connection overhead.
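As a rough illustration (not from the original article), a keep-alive accept loop pays the accept() cost once and then serves multiple requests on the same socket; handle_one_request() below is a hypothetical helper and error handling is omitted:

#include <unistd.h>
#include <sys/socket.h>

int handle_one_request(int fd);                  /* assumed: reads one request, writes one
                                                    response, returns non-zero when done */

void serve_keepalive(int listen_fd) {
    int client = accept(listen_fd, NULL, NULL);  /* one handshake, one accept() */
    if (client < 0)
        return;
    while (handle_one_request(client) == 0)      /* many requests, same connection */
        ;
    close(client);
}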


CPU Concurrency & Context Switching

The Cost of Context Switching

The essence of suspending a process is moving its CPU register data into kernel-space stack storage. Resuming it means loading that data back. This saved/restored data is called the hardware context.

Process suspend:  [CPU registers] ──copy──▶ [kernel stack]
Process resume:   [kernel stack]  ──load──▶ [CPU registers]

To maximize concurrency, minimize context switches — reduce process count, prefer threads, and combine with appropriate I/O models.

Web servers using a multi-threaded model generally outperform multi-process models: threads of the same process share an address space, so switching between them skips the address-space switch and is cheaper than switching between processes.
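If you want to see how much switching a given design actually incurs, getrusage() reports the switch counts for the calling process. A small sketch (standard POSIX call, not from the article):

#include <stdio.h>
#include <sys/resource.h>

/* Print how many voluntary/involuntary context switches this process has made.
 * Useful for comparing process-per-connection vs. threaded or event-driven designs. */
int main(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("voluntary context switches:   %ld\n", ru.ru_nvcsw);
        printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
    }
    return 0;
}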


System Calls

A system call switches execution from user mode to kernel mode (e.g., opening a file, reading from disk, sending network data).

User mode  ──syscall──▶  Kernel mode
                          (disk / network access)
           ◀──return───

This mode switch saves and restores execution state and crosses the user/kernel protection boundary, which makes syscalls relatively expensive. Reducing syscall frequency is a key performance optimization.
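One common way to cut syscall count, shown here as a sketch rather than anything from the article, is to batch several buffers into a single writev() call instead of issuing one write() per buffer:

#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send an HTTP-style header and body with one syscall instead of two. */
ssize_t send_response(int fd, const char *header, const char *body) {
    struct iovec iov[2];
    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);
    return writev(fd, iov, 2);       /* both buffers cross into the kernel at once */
}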


Memory Management

Nginx uses a stage-based memory allocation strategy: allocate on demand, release promptly, keeping total memory usage minimal.

10,000 inactive HTTP persistent connections require only 2.5 MB of memory in Nginx.
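To make the strategy concrete, here is a minimal sketch of a request-scoped pool allocator. It is not Nginx's ngx_pool_t, only the general pattern it follows: grab one block per request, hand out pieces of it as needed, and free everything at once when the request ends.

#include <stdlib.h>

typedef struct {
    char  *buf;     /* one backing block for the whole request */
    size_t used;    /* bump-pointer offset */
    size_t size;    /* total capacity */
} pool_t;

pool_t *pool_create(size_t size) {
    pool_t *p = malloc(sizeof(*p));
    if (!p) return NULL;
    p->buf = malloc(size);
    if (!p->buf) { free(p); return NULL; }
    p->used = 0;
    p->size = size;
    return p;
}

void *pool_alloc(pool_t *p, size_t n) {
    if (p->used + n > p->size) return NULL;   /* a real pool would grow here */
    void *ptr = p->buf + p->used;
    p->used += n;
    return ptr;
}

void pool_destroy(pool_t *p) {                /* one free per request, not per object */
    free(p->buf);
    free(p);
}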


I/O Models

Since CPU speed far exceeds I/O speed, I/O latency is often the performance bottleneck. Choosing the right I/O model is critical.

┌──────────────────────────────────────────────────────────────┐
│  Synchronous Blocking I/O                                    │
│  process ──read()──▶ [suspended, waiting for data] ──▶ done  │
├──────────────────────────────────────────────────────────────┤
│  Synchronous Non-Blocking I/O                                │
│  process ──read()──▶ [not ready, returns immediately]        │
│           ──poll──▶  [not ready, returns immediately]        │
│           ──poll──▶  [ready!] ──▶ process data               │
│  Problem: wastes CPU polling fds that have no data           │
├──────────────────────────────────────────────────────────────┤
│  I/O Multiplexing (select / poll / epoll)                    │
│  process ──epoll_wait()──▶ monitors N fds simultaneously     │
│                            kernel callback on ready ──▶ act  │
│  Only processes fds that actually have data ✅               │
└──────────────────────────────────────────────────────────────┘

For web servers handling thousands of concurrent connections, polling every socket with non-blocking I/O wastes enormous CPU time. I/O multiplexing solves this by letting the kernel notify the process only when fds are actually ready.
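Whichever mechanism is used, the monitored sockets are normally switched to non-blocking mode first, so a read on a not-ready fd returns immediately instead of suspending the process. A small sketch using the standard fcntl() call:

#include <fcntl.h>

/* Put a file descriptor into non-blocking mode: read()/accept() on it will
 * return -1 with errno == EAGAIN instead of blocking. */
int make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}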

epoll: LT and ET Modes

epoll supports two triggering modes:

  • LT (Level Triggered, the default): epoll_wait reports the fd on every call for as long as it remains ready, e.g. while unread data is still in the socket buffer. Safe and easy to use.
  • ET (Edge Triggered, EPOLLET): epoll_wait reports the fd only when it transitions from not-ready to ready. Fewer wakeups and higher performance, but the application must drain the fd completely (reading until EAGAIN, as in the sketch below) or it may never be notified about the remaining data.
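With EPOLLET, the standard pattern is to keep reading until the kernel reports EAGAIN, so nothing is left in the buffer between notifications. A sketch (error paths trimmed, data processing left as a comment):

#include <errno.h>
#include <unistd.h>

/* Edge-triggered read handler: drain the fd completely on each notification. */
void on_readable_et(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process n bytes here */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;                 /* buffer drained; wait for the next edge */
        } else {
            close(fd);             /* EOF or a real error */
            break;
        }
    }
}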

Memory Mapping (mmap)

Linux can map a disk file directly to a memory address space. Accessing that memory region is equivalent to accessing the file — no read()/write() syscalls needed.

Without mmap:  process ──read()──▶ kernel buffer ──copy──▶ user buffer
With mmap:     process accesses memory directly ↔ kernel maps to disk
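A minimal sketch of reading a file through mmap using standard POSIX calls; the function name is just a placeholder and error handling is abbreviated:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file and touch it as ordinary memory: no read() per access. */
int print_first_byte(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return -1; }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                 /* the mapping outlives the fd */
    if (data == MAP_FAILED) return -1;

    printf("first byte: 0x%02x\n", (unsigned char)data[0]);   /* plain memory access hits the file */
    munmap(data, st.st_size);
    return 0;
}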

epoll is often said to rely on mmap to avoid copying fds between kernel and user space; in practice it does not need to, because fds are registered in the kernel once via epoll_ctl and epoll_wait only copies back the small set of ready events.

Direct I/O

Normally, writes go through the kernel buffer first, then flush to disk lazily (improving write efficiency). For applications like databases that manage their own cache, Linux's open() supports the O_DIRECT flag to bypass the kernel buffer entirely.
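A sketch of a direct write; note that O_DIRECT requires the buffer address, transfer length, and file offset to be aligned (typically to 512 bytes or the filesystem block size), which is why posix_memalign is used and a full block is written:

#define _GNU_SOURCE              /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096               /* assumed alignment; must match the filesystem */

/* Write up to one block of data, bypassing the kernel page cache entirely. */
int write_direct(const char *path, const char *src, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return -1;

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) { close(fd); return -1; }

    memset(buf, 0, BLOCK);
    memcpy(buf, src, len < BLOCK ? len : BLOCK);

    ssize_t w = write(fd, buf, BLOCK);   /* the length must also be block-aligned */

    free(buf);
    close(fd);
    return w == BLOCK ? 0 : -1;
}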

sendfile — Zero-Copy File Serving

Normal file-send flow:

disk ──▶ kernel buffer ──copy──▶ user process ──copy──▶ NIC kernel buffer
                         ↑ unnecessary round-trip through user space

With sendfile:

disk ──▶ kernel buffer ──────────────────────────▶ NIC kernel buffer
                         (zero user-space copy) ✅

sendfile eliminates the redundant copy through user space, significantly improving static file serving performance.
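A sketch of serving a static file with sendfile(2) on Linux; the function name is a placeholder and error handling is abbreviated:

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stream a file to a connected socket without copying it into user space. */
int send_file_to_socket(int sock_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0) return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0) { close(file_fd); return -1; }

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0)
            break;                       /* error or connection closed */
    }

    close(file_fd);
    return offset == st.st_size ? 0 : -1;
}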


epoll Internals

epoll is Linux's high-performance I/O multiplexing mechanism, capable of efficiently handling millions of socket file descriptors.

Three Core Syscalls

int epoll_create(int size);   /* create an epoll instance; 'size' is ignored since Linux 2.6.8 */
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);   /* add, modify, or remove a watched fd */
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);   /* wait for ready fds */

Why epoll Beats select/poll

select/poll:
  Every call copies the entire fd list: user space → kernel space
  10,000 fds → ~100KB copied per call ❌

epoll:
  epoll_ctl registers fds once into kernel
  epoll_wait only returns ready fds → typically very few ✅

Internal Architecture

┌───────────────────────────────────────────────────────┐
│  epoll instance                                       │
│                                                       │
│  ┌──────────────────────┐   ┌──────────────────────┐  │
│  │    Red-Black Tree    │   │      Ready List      │  │
│  │  (all watched fds)   │   │  (fds with events)   │  │
│  │                      │   │                      │  │
│  │  O(log N) insert/    │   │  kernel interrupt    │  │
│  │  delete/search       │   │  callback appends    │  │
│  └──────────────────────┘   │  ready fds here      │  │
│                             └──────────────────────┘  │
└───────────────────────────────────────────────────────┘

When a socket receives data, the kernel interrupt handler fires the registered callback, inserting the fd into the ready list. epoll_wait simply checks this list: it returns immediately if the list is non-empty, otherwise it sleeps until an event arrives or the timeout expires.

The kernel pre-allocates a slab cache for fast epoll object allocation:

static int __init eventpoll_init(void) {
    /* slab cache for struct epitem: one is allocated per watched fd */
    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), ...);
    /* slab cache for the wait-queue entries used by the wakeup callback */
    pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry), ...);
    return 0;
}

Full epoll Lifecycle

epoll_create  →  creates red-black tree + ready list in kernel
epoll_ctl     →  inserts fd into red-black tree
                 registers interrupt callback for that fd
[data arrives] → kernel interrupt fires callback
                 fd inserted into ready list
epoll_wait    →  reads ready list, copies only ready fds to user space
                 returns immediately if list non-empty, else sleeps

epoll is a classic space-for-time trade-off: kernel memory (red-black tree + ready list) eliminates repeated fd list copying, enabling O(1) event notification regardless of total fd count.
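Putting the lifecycle together, a minimal level-triggered event loop might look like the sketch below; listen_fd is assumed to be an already bound, listening, non-blocking socket, and handle_client() is a hypothetical request handler:

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

void handle_client(int fd);               /* assumed: read a request, write a response */

void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);          /* modern form of epoll_create(size) */

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);    /* into the red-black tree */

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Sleeps until the ready list is non-empty (timeout -1 = wait forever),
         * then copies only the ready fds back to user space. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int client = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                handle_client(fd);        /* only fds that actually have data */
            }
        }
    }
}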


Server Concurrency Strategies

  • One process per connection, non-blocking I/O (Apache prefork): stable and well isolated, but context-switch cost grows linearly with the number of connections.
  • One thread per connection, non-blocking I/O (Apache worker): handles more connections, but context switching is still a bottleneck.
  • One process handling many connections, non-blocking I/O + epoll/kqueue (Nginx, lighttpd): massive concurrency, at the cost of a more complex implementation.

Worker Process Tuning (Nginx)

For PHP/FastCGI or reverse proxy scenarios, worker processes mostly forward requests rather than doing heavy computation — so increasing worker count can improve concurrency.

However, too many workers brings diminishing returns:

  • More context switches
  • More memory overhead
  • Longer average response time across all connections

There is no one-size-fits-all answer. The optimal concurrency strategy depends on actual concurrent load. In general: fewer processes + more threads + epoll = best scalability.
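One common rule of thumb (an assumption here, not a claim from the article) is to start with roughly one worker per CPU core and raise the count only if workers spend most of their time blocked on upstream I/O. The core count can be queried like this:

#include <stdio.h>
#include <unistd.h>

/* Query the number of online CPU cores as a starting point for worker count. */
int main(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online cores: %ld\n", cores);
    return 0;
}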
