Throughput & Benchmarking
Throughput here specifically refers to the number of requests a web server processes per unit of time.
A meaningful benchmark requires three inputs:
- Concurrent user count
- Total request count
- Description of the requested resource (e.g., file size, static vs. dynamic content)
Two key metrics to distinguish:
- User average wait time — measures service quality for a single user under a given concurrency level
- Server average request processing time — measures overall server quality; it is the reciprocal of throughput
Throughput is a composite metric — it's not simply a measure of concurrency capacity. For example, with a fixed total request count, more concurrent users can sometimes increase throughput. And serving a 2KB file vs a 2MB file produces very different completion times.
For HTTP requests with Connection: Keep-Alive, enabling persistent connection support reduces the number of accept() syscalls, directly lowering connection establishment overhead.
CPU Concurrency & Context Switching
The Cost of Context Switching
The essence of suspending a process is moving its CPU register data into kernel-space stack storage. Resuming it means loading that data back. This saved/restored data is called the hardware context.
Process suspend: [CPU registers] ──copy──▶ [kernel stack]
Process resume: [kernel stack] ──load──▶ [CPU registers]
To maximize concurrency, minimize context switches — reduce process count, prefer threads, and combine with appropriate I/O models.
Web servers using a multi-threaded model generally outperform multi-process models due to lower context switch overhead.
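To get a feel for how often a process is actually switched out, Linux exposes per-process counters. A minimal sketch using getrusage() (Linux/glibc assumed):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rusage ru;

    /* RUSAGE_SELF: statistics for the calling process */
    if (getrusage(RUSAGE_SELF, &ru) == -1) {
        perror("getrusage");
        return 1;
    }

    /* ru_nvcsw  - voluntary switches (e.g. blocking on I/O)
       ru_nivcsw - involuntary switches (preempted by the scheduler) */
    printf("voluntary context switches:   %ld\n", ru.ru_nvcsw);
    printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
    return 0;
}
```

A server that spends most of its time in involuntary switches is paying the hardware-context cost described above over and over.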
System Calls
A system call switches execution from user mode to kernel mode (e.g., opening a file, reading from disk, sending network data).
User mode ──syscall──▶ Kernel mode
(disk / network access)
◀──return───
This mode switch involves saving and swapping part of the process's execution context in memory, making syscalls relatively expensive. Reducing syscall frequency is a key performance optimization.
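One practical way to cut syscall frequency is batching: instead of issuing several small write() calls, gather the buffers and submit them in a single writev(). A minimal sketch (POSIX assumed; partial writes not handled):

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <string.h>

/* Send a response header and body with one syscall instead of two. */
ssize_t send_response(int fd, const char *header, const char *body) {
    struct iovec iov[2];

    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    /* One user-to-kernel transition covers both buffers. */
    return writev(fd, iov, 2);
}
```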
Memory Management
Nginx uses a stage-based memory allocation strategy: allocate on demand, release promptly, keeping total memory usage minimal.
10,000 inactive HTTP persistent connections require only 2.5 MB of memory in Nginx.
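Nginx does this with per-request memory pools. The code below is not Nginx's implementation, just a simplified sketch of the idea: allocation is a cheap pointer bump, and everything is released in one call when the request ends.

```c
#include <stdlib.h>
#include <stddef.h>

/* Simplified bump-pointer pool: one malloc up front, one free at the end. */
typedef struct {
    char   *buf;
    size_t  used;
    size_t  size;
} pool_t;

pool_t *pool_create(size_t size) {
    pool_t *p = malloc(sizeof(*p));
    if (!p) return NULL;
    p->buf = malloc(size);
    if (!p->buf) { free(p); return NULL; }
    p->used = 0;
    p->size = size;
    return p;
}

/* Allocation is just a pointer bump; real pools also align
   allocations and grow a new block when the current one is full. */
void *pool_alloc(pool_t *p, size_t n) {
    if (p->used + n > p->size) return NULL;
    void *ptr = p->buf + p->used;
    p->used += n;
    return ptr;
}

/* Releasing the request releases everything allocated from its pool. */
void pool_destroy(pool_t *p) {
    free(p->buf);
    free(p);
}
```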
I/O Models
Since CPU speed far exceeds I/O speed, I/O latency is often the performance bottleneck. Choosing the right I/O model is critical.
┌──────────────────────────────────────────────────────────────┐
│ Synchronous Blocking I/O │
│ process ──read()──▶ [suspended, waiting for data] ──▶ done │
├──────────────────────────────────────────────────────────────┤
│ Synchronous Non-Blocking I/O │
│ process ──read()──▶ [not ready, returns immediately] │
│ ──poll──▶ [not ready, returns immediately] │
│ ──poll──▶ [ready!] ──▶ process data │
│ Problem: wastes CPU polling fds that have no data │
├──────────────────────────────────────────────────────────────┤
│ I/O Multiplexing (select / poll / epoll) │
│ process ──epoll_wait()──▶ monitors N fds simultaneously │
│ kernel callback on ready ──▶ act │
│ Only processes fds that actually have data ✅ │
└──────────────────────────────────────────────────────────────┘
For web servers handling thousands of concurrent connections, polling every socket with non-blocking I/O wastes enormous CPU time. I/O multiplexing solves this by letting the kernel notify the process only when fds are actually ready.
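Both the non-blocking and the multiplexing rows above assume the socket has been taken out of blocking mode, which on Linux is typically done with fcntl(). A minimal helper:

```c
#include <fcntl.h>

/* Put an fd into non-blocking mode: read()/accept() now return -1 with
   errno == EAGAIN/EWOULDBLOCK instead of suspending the process. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```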
epoll: LT and ET Modes
epoll supports two triggering modes:
| Mode | Behavior |
|---|---|
| LT (Level Triggered, default) | Returns the fd on every epoll_wait call until the event is fully handled. Safe and easy. |
| ET (Edge Triggered, EPOLLET) | Returns the fd only once, when the event first occurs. Higher performance, but data left unread is never reported again, so an incompletely handled event silently stalls the connection (see the sketch below). |
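In ET mode the handler must drain the socket until read() reports EAGAIN, because leftover data will not be reported again. A sketch of the usual pattern (the fd is assumed to be non-blocking and registered with EPOLLIN | EPOLLET):

```c
#include <errno.h>
#include <unistd.h>

/* Edge-triggered handler: keep reading until the kernel buffer is empty. */
void handle_et_readable(int fd) {
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* ... process n bytes of data ... */
        } else if (n == 0) {
            close(fd);                 /* peer closed the connection */
            return;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return;                    /* drained; wait for the next edge */
        } else if (errno != EINTR) {
            close(fd);                 /* real error */
            return;
        }
    }
}
```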
Memory Mapping (mmap)
Linux can map a disk file directly to a memory address space. Accessing that memory region is equivalent to accessing the file — no read()/write() syscalls needed.
Without mmap: process ──read()──▶ kernel buffer ──copy──▶ user buffer
With mmap: process accesses memory directly ↔ kernel maps to disk
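A minimal sketch of reading a file through a mapping (the file name is just a placeholder):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.txt", O_RDONLY);        /* placeholder file name */
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file: page faults pull pages in from disk on demand,
       no read() syscall per access. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(p, 1, st.st_size, stdout);           /* the mapping is just memory */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```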
Note that epoll's efficiency does not come from mmap (a common misconception): fds are registered once with epoll_ctl, and epoll_wait copies back only the small set of ready events, as described in the epoll internals section below.
Direct I/O
Normally, writes go into the kernel buffer first and are flushed to disk lazily (which improves write efficiency). For applications like databases that manage their own cache, Linux's open() supports the O_DIRECT flag to bypass the kernel buffer entirely.
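A sketch of an O_DIRECT write. O_DIRECT is Linux-specific and requires the buffer, length, and file offset to be suitably aligned; the 4 KB used here is a common choice, not a universal rule, and the file name is a placeholder:

```c
#define _GNU_SOURCE            /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("dbfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);  /* placeholder */
    if (fd == -1) return 1;

    /* O_DIRECT needs an aligned buffer and an aligned length. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 'x', 4096);

    /* Data goes to the device without being staged in the kernel buffer;
       the application is now responsible for its own caching. */
    ssize_t n = write(fd, buf, 4096);
    (void)n;

    free(buf);
    close(fd);
    return 0;
}
```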
sendfile — Zero-Copy File Serving
Normal file-send flow:
disk ──▶ kernel buffer ──copy──▶ user process ──copy──▶ NIC kernel buffer
↑ unnecessary round-trip through user space
With sendfile:
disk ──▶ kernel buffer ──────────────────────────▶ NIC kernel buffer
(zero user-space copy) ✅
sendfile eliminates the redundant copy through user space, significantly improving static file serving performance.
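A sketch of serving a file to an already-connected socket with Linux's sendfile():

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy a whole file to a connected socket entirely inside the kernel. */
int serve_file(int client_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd == -1) return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    ssize_t left = st.st_size;
    while (left > 0) {
        ssize_t sent = sendfile(client_fd, file_fd, &offset, left);
        if (sent <= 0) break;          /* error or peer closed */
        left -= sent;                  /* sendfile advances offset for us */
    }

    close(file_fd);
    return left == 0 ? 0 : -1;
}
```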
epoll Internals
epoll is Linux's high-performance I/O multiplexing mechanism, capable of efficiently handling millions of socket file descriptors.
Three Core Syscalls
int epoll_create(int size);                                                       // create an epoll instance, returns epfd
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);               // add / modify / remove a watched fd
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); // wait for ready events
Why epoll Beats select/poll
select/poll:
Every call copies the entire fd list: user space → kernel space
10,000 fds → ~100KB copied per call ❌
epoll:
epoll_ctl registers fds once into kernel
epoll_wait only returns ready fds → typically very few ✅
Internal Architecture
┌──────────────────────────────────────────────────────┐
│ epoll instance │
│ │
│ ┌─────────────────────┐ ┌──────────────────────┐ │
│ │ Red-Black Tree │ │ Ready List │ │
│ │ (all watched fds) │ │ (fds with events) │ │
│ │ │ │ │ │
│ │ O(log N) insert/ │ │ ← kernel interrupt │ │
│ │ delete/search │ │ callback writes │ │
│ └─────────────────────┘ │ here automatically │ │
│ └──────────────────────┘ │
└──────────────────────────────────────────────────────┘
When a socket receives data, the kernel interrupt handler fires the registered callback, inserting the fd into the ready list. epoll_wait simply checks this list: if it is non-empty it returns immediately; otherwise the caller sleeps until an event arrives or the timeout expires.
The kernel pre-allocates a slab cache for fast epoll object allocation:
static int __init eventpoll_init(void) {
    /* slab cache for the per-watched-fd epitem objects (red-black tree nodes) */
    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), ...);
    /* slab cache for the wait-queue hooks that connect each fd to the epoll instance */
    pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry), ...);
}
Full epoll Lifecycle
epoll_create → creates red-black tree + ready list in kernel
epoll_ctl → inserts fd into red-black tree
registers interrupt callback for that fd
[data arrives] → kernel interrupt fires callback
fd inserted into ready list
epoll_wait → reads ready list, copies only ready fds to user space
returns immediately if list non-empty, else sleeps
epoll is a classic space-for-time trade-off: kernel memory (red-black tree + ready list) eliminates repeated fd list copying, enabling O(1) event notification regardless of total fd count.
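The lifecycle maps directly onto a small event loop. Below is a minimal sketch of an epoll-based echo loop (error handling trimmed; listen_fd is assumed to be a listening socket created elsewhere):

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd) {
    int epfd = epoll_create(1);                       /* kernel builds rb-tree + ready list */

    struct epoll_event ev, events[MAX_EVENTS];
    ev.events  = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* register once */

    for (;;) {
        /* Blocks until the ready list is non-empty (or 1 s passes). */
        int n = epoll_wait(epfd, events, MAX_EVENTS, 1000);

        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;

            if (fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                if (conn == -1) continue;
                ev.events  = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) {                         /* error or connection closed */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    write(fd, buf, r);                /* echo back */
                }
            }
        }
    }
}
```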
Server Concurrency Strategies
| Strategy | I/O Model | Servers | Pros | Cons |
|---|---|---|---|---|
| 1 process per connection | Non-blocking | Apache (prefork) | Stable, isolated | Context switch grows linearly with connections |
| 1 thread per connection | Non-blocking | Apache (worker) | More connections | Context switch still a bottleneck |
| 1 process handles N connections | Non-blocking + epoll/kqueue | Nginx, lighttpd | Massive concurrency | More complex implementation |
Worker Process Tuning (Nginx)
For PHP/FastCGI or reverse proxy scenarios, worker processes mostly forward requests rather than doing heavy computation — so increasing worker count can improve concurrency.
However, adding too many workers brings diminishing returns:
- More context switches
- More memory overhead
- Longer average response time across all connections
There is no one-size-fits-all answer. The optimal concurrency strategy depends on actual concurrent load. In general: fewer processes + more threads + epoll = best scalability.