nginx Event Loop — Complete Lifecycle Reference
A precise, bottom-up reference covering every buffer, syscall, interrupt, and data movement from the moment a TCP packet hits the NIC to the moment a response is sent back. Two concurrent users are used throughout as a concrete example.
Table of Contents
- Foundations — fd and Socket
- Hardware Layer — NIC, DMA, Interrupts
- Kernel Structures and All Buffers
- epoll — How the Worker Waits Efficiently
- nginx Startup Sequence
- Complete Request Lifecycle — Two Concurrent Users
- What Happens While Worker is Busy
- All Buffers — Master Reference
- All Syscalls — Master Reference
- Failure Modes
1. Foundations
1.1 Everything is a File
Linux's core philosophy: every I/O resource — files on disk, network connections, pipes, terminals, devices — is represented as a file. This means one unified API (read, write, close) works on all of them. The kernel manages the actual resource. Your process holds a token.
1.2 File Descriptor (fd)
A file descriptor is just an integer. It is a per-process token that refers to a kernel-managed resource. The kernel maintains a table per process called the fd table — a simple array where the index is the fd and the value is a pointer into the kernel.
Process fd table:
┌─────┬───────────────────────────────┐
│ fd │ points to │
├─────┼───────────────────────────────┤
│ 0 │ stdin │
│ 1 │ stdout │
│ 2 │ stderr │
│ 3 │ listen socket (nginx) │
│ 5 │ User A client connection │
│ 6 │ User B client connection │
│ 12 │ backend connection for User A │
│ 13 │ backend connection for User B │
└─────┴───────────────────────────────┘
By convention, fds 0, 1, and 2 are pre-assigned to stdin, stdout, and stderr, so application fds normally start at 3. An fd is meaningless on its own. It only means something when passed to a syscall, where the kernel uses it to look up the real resource.
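As a small illustration of "an fd is just an integer", the sketch below opens a pipe and a TCP socket and prints the integers the kernel hands back. The variable names are only for this example, and the exact numbers depend on what the process already has open.

```python
import os
import socket

# Two very different resources, both handed to the process as small integers.
read_end, write_end = os.pipe()                           # a pipe: two fds
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # a TCP socket: one fd

fds = [read_end, write_end, sock.fileno()]
print(fds)  # small integers; exact values depend on what is already open

# The same generic calls (write, read, close) work on any fd.
os.write(write_end, b"hello")
data = os.read(read_end, 5)

os.close(read_end)
os.close(write_end)
sock.close()
```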
1.3 Socket
A socket is the kernel's internal data structure representing one end of a network connection. Created when your process calls socket(). Lives entirely in kernel RAM. Your process never touches it directly.
Socket struct (kernel RAM):
┌────────────────────────────────────┐
│ local_ip: 192.168.1.10 │
│ local_port: 80 │
│ remote_ip: 203.0.113.5 │
│ remote_port: 54231 │
│ state: ESTABLISHED │
│ recv_buffer: [ ...incoming bytes ] │
│ send_buffer: [ ...outgoing bytes ] │
│ tcp_state_machine: ... │
│ timers, seq numbers, window size │
└────────────────────────────────────┘
1.4 How fd and Socket Relate
PROCESS (user space) KERNEL (kernel space)
fd table Open file table Socket struct
┌──────┐ ┌─────────────┐ ┌──────────────┐
│ fd=5 │ ─────────────────────► │ file entry │ ──────► │ socket { │
└──────┘ │ (flags, │ │ recv buffer │
│ offset) │ │ send buffer │
└─────────────┘ │ state ... │
└──────────────┘
This two-level indirection exists so that several fds can refer to the same socket: a parent and child share it through inherited fds after fork(), and dup() creates a second fd number for it within one process.
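The indirection can be observed from user space without forking. This is a minimal sketch using os.dup() inside one process (an assumption made to keep the example short; the fork case works the same way): two different fd numbers end up pointing at the same kernel socket.

```python
import os
import socket

a, b = socket.socketpair()        # two connected sockets
orig_fd = a.fileno()
dup_fd = os.dup(orig_fd)          # a second fd number for the same open file entry

os.write(dup_fd, b"via dup")      # write through the duplicate...
received = b.recv(16)             # ...and it arrives at the peer of the original

# Both fds resolve to the same kernel object (same inode on the socket filesystem).
same_socket = os.fstat(dup_fd).st_ino == os.fstat(orig_fd).st_ino

os.close(dup_fd)
a.close()
b.close()
```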
2. Hardware Layer
2.1 NIC (Network Interface Card)
A physical chip on the motherboard (or PCIe slot). It receives electrical/optical signals from the network cable, decodes them into bytes, and writes them into RAM. The CPU has no involvement in receiving the bytes themselves — the NIC does it autonomously.
2.2 DMA (Direct Memory Access)
DMA is the mechanism that lets the NIC write bytes directly into RAM without involving the CPU. When the NIC driver initializes, the kernel sets up a region of RAM called the NIC Ring Buffer and tells the NIC its address. The NIC writes incoming packets straight there. Zero CPU cycles are spent moving bytes.
2.3 NIC Ring Buffer
A circular array of fixed-size slots in kernel RAM, allocated at NIC driver initialization.
NIC Ring Buffer (in kernel RAM):
┌──────────┬──────────┬──────────┬──────────┐
│ slot 0 │ slot 1 │ slot 2 │ slot 3 │ ← NIC writes here via DMA
│ pkt data │ pkt data │ (empty) │ (empty) │
└──────────┴──────────┴──────────┴──────────┘
▲ ▲
NIC writes kernel reads (via interrupt)
- Size: 256 to 4096 slots typically. Each slot holds one packet (up to 1500 bytes for standard Ethernet MTU).
- Who writes: NIC hardware via DMA.
- Who reads: Kernel interrupt handler (ISR).
- What happens when full: Packets are dropped. Visible via ethtool -S eth0 | grep drop or ip -s link.
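The drop-when-full behaviour is easy to see in a toy model. The class below is a hypothetical simulation written for this article, not driver code: the names RingBuffer, nic_write, and kernel_read are invented, and a real ring holds DMA descriptors rather than Python objects.

```python
class RingBuffer:
    """Toy model of a fixed-slot receive ring (illustrative only)."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0       # next slot the producer (NIC) writes
        self.tail = 0       # next slot the consumer (kernel) reads
        self.count = 0      # slots currently occupied
        self.dropped = 0    # packets lost because the ring was full

    def nic_write(self, packet):
        if self.count == len(self.slots):
            self.dropped += 1           # ring full: the packet is lost
            return False
        self.slots[self.head] = packet
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def kernel_read(self):
        if self.count == 0:
            return None
        packet = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return packet


ring = RingBuffer(slots=4)
for i in range(6):                      # 6 packets arrive, nobody reads
    ring.nic_write(f"pkt{i}")
first = ring.kernel_read()              # oldest packet comes out first
```

With four slots and six arrivals, the last two packets are counted in dropped, which is the counter that the ethtool statistics expose on real hardware.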
2.4 Hardware Interrupt (IRQ — Interrupt Request)
When the NIC finishes writing one or more packets into the Ring Buffer via DMA, it sends an electrical signal on a dedicated wire to the CPU. This signal is an IRQ (Interrupt Request).
The CPU responds by:
- Finishing the current instruction (not mid-instruction).
- Saving its register state (so it can resume later).
- Looking up the interrupt number in the IDT (Interrupt Descriptor Table) — a kernel table mapping interrupt numbers to handler functions.
- Jumping to the handler function.
- Restoring state and resuming what it was doing before.
This entire sequence takes microseconds. The user process (nginx worker) does not know it happened.
2.5 ISR (Interrupt Service Routine)
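The kernel function registered to handle NIC interrupts. On modern Linux the hard interrupt handler itself does very little and defers most of this work to softirq context (NAPI); this reference calls the combined path the ISR. When called: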
- Reads packet(s) from the NIC Ring Buffer.
- Parses the Ethernet frame, IP header, TCP header.
- Identifies which socket this packet belongs to (by matching src_ip:src_port:dst_ip:dst_port).
- Copies payload bytes into that socket's recv buffer in kernel RAM.
- Updates TCP sequence numbers, ACKs, window size.
- Marks that socket's fd as ready (EPOLLIN) in the epoll ready list.
- If the process was sleeping in epoll_wait, wakes it up.
- Returns. CPU resumes what it was doing.
3. Kernel Structures and All Buffers
3.1 SYN Queue (Incomplete Connections Queue)
When a client sends a SYN packet to initiate a TCP connection:
- Kernel puts this half-open connection into the SYN Queue.
- Sends back SYN-ACK.
- Waits for the client's final ACK to complete the handshake.
Tunable: /proc/sys/net/ipv4/tcp_max_syn_backlog
Default: 128 (older kernels) / 1024 (newer)
Who writes: Kernel TCP stack on receiving SYN.
Who reads: Kernel TCP stack on receiving final ACK — moves entry to Accept Queue.
What happens when full: Kernel drops incoming SYNs silently. Client retries. Appears as connection timeout.
SYN flood attacks fill this queue deliberately. tcp_syncookies is the mitigation.
3.2 Accept Queue (Complete Connections Queue)
After the 3-way handshake completes, the connection moves from SYN Queue to the Accept Queue. These connections are fully established and waiting for your process to call accept().
Effective size = min(listen() backlog argument, net.core.somaxconn)
net.core.somaxconn default: 128 (kernel < 5.4) / 4096 (kernel >= 5.4)
nginx default backlog: 511
So on modern Linux: min(511, 4096) = 511 slots
Who writes: Kernel TCP stack when handshake completes.
Who reads: Your process via accept() syscall.
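What happens when full: By default the kernel ignores the final ACK of handshakes that would overflow the queue and drops new incoming SYNs. The client retransmits and eventually sees a connection timeout.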
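The effective-size rule above is a single min(). The sketch below restates it and, as a convenience, tries to read the live somaxconn value; the helper names are hypothetical, and the /proc read simply returns None on systems where the file is missing.

```python
def effective_accept_queue(backlog, somaxconn):
    """Accept queue limit: the listen() backlog, capped by net.core.somaxconn."""
    return min(backlog, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Return the current somaxconn, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


print(effective_accept_queue(511, 4096))   # nginx default on kernel >= 5.4
print(effective_accept_queue(511, 128))    # nginx default on older kernels
current = read_somaxconn()
print(current)
```

The second call shows why raising only the nginx backlog does nothing on an older kernel: the sysctl is the lower of the two limits.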
3.3 Socket Receive Buffer
Per-connection buffer in kernel RAM. Incoming payload bytes (from ISR) land here after the TCP header is stripped.
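Default size: ~208 KB for generic sockets (/proc/sys/net/core/rmem_default = 212992); TCP sockets start from the middle value of /proc/sys/net/ipv4/tcp_rmem (typically around 128 KB on recent kernels)
Maximum size: for TCP, the third value of tcp_rmem (typically a few MB); /proc/sys/net/core/rmem_max caps explicit SO_RCVBUF requests
Auto-tuning: for TCP, Linux adjusts the size dynamically within the tcp_rmem min/default/max range. Exact values vary by kernel and distribution.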
Who writes: Kernel ISR (on hardware interrupt).
Who reads: Your process via read() or recv() syscall.
What happens when full: The kernel advertises a shrinking receive window in its ACKs, down to zero when the buffer is full. The sender's TCP stack sees the zero window and stops sending. This is TCP flow control. No data loss: the sender just waits.
3.4 Socket Send Buffer
Per-connection buffer in kernel RAM. When your process calls write(), bytes go here. Kernel TCP stack drains this buffer by constructing TCP packets and sending them out via the NIC.
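Default size: ~208 KB for generic sockets (/proc/sys/net/core/wmem_default = 212992); TCP sockets start from the middle value of /proc/sys/net/ipv4/tcp_wmem (typically 16 KB) and are auto-tuned upward
Maximum size: for TCP, the third value of tcp_wmem (typically a few MB); /proc/sys/net/core/wmem_max caps explicit SO_SNDBUF requests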
Who writes: Your process via write() or send() syscall.
Who reads: Kernel TCP stack (constructs packets and hands to NIC).
What happens when full: write() blocks (in blocking mode) or returns EAGAIN (in non-blocking mode). nginx uses non-blocking — it registers EPOLLOUT on the fd and retries when kernel signals the buffer has drained.
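The full-buffer behaviour can be reproduced locally. This is a sketch under one stated assumption: it uses an AF_UNIX socketpair as a stand-in for a TCP connection with a slow reader, because that fills deterministically. In Python, EAGAIN surfaces as BlockingIOError.

```python
import socket

writer, reader = socket.socketpair()   # stand-in for a connection to a slow client
writer.setblocking(False)
reader.setblocking(False)

chunk = b"x" * 4096
total_sent = 0
hit_eagain = False
while not hit_eagain:
    try:
        total_sent += writer.send(chunk)   # may accept fewer bytes than offered
    except BlockingIOError:                # EAGAIN: send buffer is full
        hit_eagain = True

# The reader finally drains its side, which frees buffer space.
drained = 0
try:
    while True:
        drained += len(reader.recv(65536))
except BlockingIOError:                    # nothing left to read
    pass

sent_after_drain = writer.send(chunk)      # writing works again
writer.close()
reader.close()
```

A real event loop would not spin like this. It would register EPOLLOUT after the first EAGAIN and resume writing when epoll reports the fd writable, as described above.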
3.5 nginx Worker Memory (User Space)
When the worker calls read(fd, buf, size), the kernel copies bytes from the socket recv buffer (kernel RAM) into buf (process RAM). This is the boundary crossing — from kernel space to user space.
The worker parses HTTP in this memory and builds request_context objects:
request_context {
client_fd: 5 ← incoming connection fd
backend_fd: 12 ← outgoing connection fd
state: WAITING_BACKEND
method: GET
path: /api/users
headers: { host, auth, content-type ... }
body: (if POST)
response: (filled later)
}
This is the only place that links a client fd to its backend fd. The kernel has no concept of this pairing.
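One way to picture this pairing is a per-request object plus a lookup table keyed by fd. The sketch below is hypothetical (RequestContext, contexts_by_fd, and register are names invented here, not nginx internals); it only shows how an event on a backend fd can be traced back to the client fd.

```python
from dataclasses import dataclass


@dataclass
class RequestContext:
    client_fd: int
    backend_fd: int = -1      # -1 until the backend connection exists
    state: str = "PARSED"
    request: bytes = b""
    response: bytes = b""


contexts_by_fd = {}           # fd -> RequestContext, owned by the worker


def register(ctx):
    """Index the context under every fd it currently owns."""
    contexts_by_fd[ctx.client_fd] = ctx
    if ctx.backend_fd != -1:
        contexts_by_fd[ctx.backend_fd] = ctx


ctx_a = RequestContext(client_fd=5, request=b"GET /api/users HTTP/1.1\r\n\r\n")
register(ctx_a)

ctx_a.backend_fd = 12         # backend connection opened later
ctx_a.state = "WAITING_BACKEND"
register(ctx_a)

# An event on the backend fd leads back to the client fd.
peer_fd = contexts_by_fd[12].client_fd
```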
4. epoll
4.1 What is epoll
epoll is a Linux kernel subsystem that lets a single process monitor thousands of fds simultaneously without scanning all of them. It is the engine behind nginx's ability to handle 10,000+ concurrent connections in a single thread.
Three kernel structures make up epoll:
4.2 epoll Instance
Created via epoll_create1(). Returns an fd that represents the epoll instance itself. The instance contains:
- Interest List — which fds to watch and what events.
- Ready List — which fds currently have events pending.
4.3 Interest List
A red-black tree (O(log n) insertion/deletion) inside the kernel. Each node represents one watched fd and what event to watch for.
Interest list (red-black tree in kernel RAM):
┌─────────────────────────────────┐
│ fd=3, events=EPOLLIN │ ← new client connections
│ fd=5, events=EPOLLIN │ ← User A client data
│ fd=6, events=EPOLLIN │ ← User B client data
│ fd=12, events=EPOLLIN|EPOLLOUT │ ← backend for User A
│ fd=13, events=EPOLLIN|EPOLLOUT │ ← backend for User B
└─────────────────────────────────┘
Modified by: epoll_ctl(epfd, EPOLL_CTL_ADD/MOD/DEL, fd, event) — always called by worker.
4.4 Ready List
A doubly-linked list inside the kernel. When the ISR marks an fd as having data, it adds that fd to this list. epoll_wait() reads and clears this list.
Written by: Kernel ISR (on hardware interrupt) — completely independent of worker state.
Read by: Worker via epoll_wait().
This separation is critical: the ready list is updated even when the worker is busy. Nothing is lost.
4.5 Edge Triggered vs Level Triggered
nginx uses Edge Triggered (EPOLLET).
| Mode | epoll_wait returns fd when | Risk if you don't drain fully |
|---|---|---|
| Level Triggered (default) | fd has data (keeps returning until empty) | None — will re-notify |
| Edge Triggered (EPOLLET) | fd transitions from no-data → has-data | Will NOT re-notify until new data arrives |
Edge triggered means nginx must loop read() until it gets EAGAIN on every wakeup. If it reads only once and goes back to epoll_wait, remaining data in the buffer will never trigger a new event and that connection stalls.
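In the model used in this reference, the same applies to accept(): the worker loops accept() until EAGAIN to drain the entire accept queue every time fd=3 fires. (Stock nginx registers its listen sockets level-triggered and, unless multi_accept is on, accepts one connection per wakeup; level triggering re-notifies it for the rest.)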
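The difference between the two modes shows up in a few lines with Python's select.epoll (Linux only). In this sketch, poll_twice is a helper invented for the demonstration: it sends data once, then polls twice without reading in between.

```python
import select
import socket


def poll_twice(flags):
    """Return how many events each of two consecutive polls reports."""
    writer, reader = socket.socketpair()
    ep = select.epoll()
    ep.register(reader.fileno(), flags)
    writer.send(b"hello")          # one no-data -> has-data transition
    first = ep.poll(0.1)
    second = ep.poll(0.1)          # nothing was read in between
    ep.close()
    writer.close()
    reader.close()
    return len(first), len(second)


level = poll_twice(select.EPOLLIN)                    # level triggered
edge = poll_twice(select.EPOLLIN | select.EPOLLET)    # edge triggered
print(level, edge)
```

Level triggered reports the fd on both polls because data is still sitting in the buffer. Edge triggered reports it once and then stays silent, which is exactly the stall described above if the reader does not drain to EAGAIN.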
4.6 epoll Syscalls
| Syscall | What it does |
|---|---|
| epoll_create1(flags) | Creates epoll instance, returns epfd |
| epoll_ctl(epfd, op, fd, event) | ADD/MOD/DEL an fd in the interest list |
| epoll_wait(epfd, events[], maxevents, timeout) | Sleep until events ready; returns array of ready fds |
timeout = -1 means sleep indefinitely. The process is removed from the CPU scheduler run queue. Zero CPU consumed while sleeping.
5. nginx Startup Sequence
Master Process — Before Workers Start
nginx has two process roles: one master process and one or more worker processes.
Master process:
- Runs as root.
- Reads nginx.conf. Determines worker_processes count: if set to auto, nginx uses the number of CPU cores on the machine (what nproc reports) and spawns that many workers.
- Calls socket(), bind(), listen() on port 80 once, before forking. This is deliberate: binding to a port below 1024 requires root privileges (or CAP_NET_BIND_SERVICE). Workers never need to bind themselves.
- Calls fork() once per worker. Each forked child inherits fd=3 (the listen socket) automatically. This is how all workers share the same port without any of them needing root.
- After forking, master drops into a supervision loop: watches workers, restarts any that crash, handles signals (nginx -s reload, nginx -s stop).
- Master itself never handles a single HTTP request.
After fork, each worker independently:
- Calls epoll_create1() and gets its own epoll instance.
- Calls epoll_ctl(ADD, fd=3, EPOLLIN) to register the inherited listen socket.
- Calls epoll_wait() and sleeps, waiting for connections.
Master Process (root)
socket() + bind() + listen() on fd=3
│
├── fork() → Worker 1 (inherits fd=3) → own epoll → epoll_wait
├── fork() → Worker 2 (inherits fd=3) → own epoll → epoll_wait
├── fork() → Worker 3 (inherits fd=3) → own epoll → epoll_wait
└── fork() → Worker 4 (inherits fd=3) → own epoll → epoll_wait
All workers call accept() on the same fd=3.
Kernel distributes incoming connections across workers.
Master just watches. Restarts crashed workers.
When multiple workers call accept() on the same fd=3 simultaneously, the kernel ensures only one worker gets each connection — no duplication. nginx also sets accept_mutex (older versions) or relies on EPOLLEXCLUSIVE flag (modern Linux) to avoid all workers waking up for every single connection — known as the thundering herd problem.
The syscall sequence below runs once when nginx starts: steps 1-4 in the master before forking, steps 5-7 in each worker. A worker does not handle any requests until this is complete.
1. socket(AF_INET, SOCK_STREAM, 0)
→ kernel creates socket struct
→ returns fd=3 (listen socket)
2. setsockopt(fd=3, SOL_SOCKET, SO_REUSEADDR, 1)
→ allows reuse of port 80 after restart
→ avoids "Address already in use" error
3. bind(fd=3, {ip=0.0.0.0, port=80})
→ registers this socket as owner of port 80 with kernel
→ kernel now routes port 80 packets to fd=3
4. listen(fd=3, backlog=511)
→ kernel creates SYN Queue and Accept Queue for this socket
→ accept queue limit = min(511, somaxconn)
→ kernel begins accepting TCP handshakes even before accept() is called
5. epoll_create1(EPOLL_CLOEXEC)
→ creates epoll instance
→ returns epfd=4
6. epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=3, {EPOLLIN|EPOLLET})
→ registers listen socket in interest list
→ "wake me up when a new connection is ready to accept"
7. epoll_wait(epfd=4, events[], 512, -1)
→ worker sleeps. 0 CPU consumed. Waiting for first knock.
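The seven steps above map almost one-to-one onto Python's socket and select modules. This is a single-process sketch, not nginx's code, and it makes three assumptions for the sake of a safe demo: it binds to 127.0.0.1, uses port 0 (any free port) instead of 80, and polls with a zero timeout instead of sleeping forever.

```python
import select
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)      # step 1
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)    # step 2
listener.bind(("127.0.0.1", 0))                                   # step 3 (port 0: demo only)
listener.listen(511)                                              # step 4
listener.setblocking(False)       # accept() must never block the loop

ep = select.epoll()                                               # step 5
ep.register(listener.fileno(), select.EPOLLIN | select.EPOLLET)   # step 6
events = ep.poll(0)               # step 7, with timeout 0 instead of -1

port = listener.getsockname()[1]  # the port the kernel actually assigned
print(port, events)

ep.close()
listener.close()
```

No client has connected, so the poll comes back empty. With a timeout of -1 the call would instead park the process until the first handshake completes.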
6. Complete Request Lifecycle
Cast
| Symbol | What |
|---|---|
| fd=3 | nginx listen socket |
| fd=5 | User A ↔ nginx TCP connection |
| fd=6 | User B ↔ nginx TCP connection |
| fd=12 | nginx ↔ backend TCP connection for User A |
| fd=13 | nginx ↔ backend TCP connection for User B |
| epfd=4 | epoll instance |
Phase 1 — TCP Handshake (Kernel Only, Worker Sleeping)
Who acts: Kernel TCP stack + NIC. Worker is untouched.
User A browser Kernel (port 80)
│── SYN ──────────────────► NIC receives packet
│ ISR runs: puts in SYN Queue
│◄── SYN-ACK ───────────── Kernel sends SYN-ACK via NIC
│── ACK ──────────────────► ISR runs: moves to Accept Queue
│ Kernel creates the connection's socket (it becomes fd=5 later, at accept())
│ Marks fd=3 as EPOLLIN in ready list
User B browser (simultaneously)
│── SYN ──────────────────► Same process
│◄── SYN-ACK ─────────────
│── ACK ──────────────────► second socket created (becomes fd=6 at accept())
fd=3 still marked EPOLLIN
Buffers touched:
- SYN Queue: written (User A SYN), read (on ACK received), written again (User B SYN)
- Accept Queue: written twice (one entry per completed handshake)
- NIC Ring Buffer: written by NIC DMA, read by ISR
Interrupts: Hardware IRQ fires per packet. ISR runs. Worker untouched.
Phase 2 — Worker Wakes, Accepts Both Connections
Syscalls: epoll_wait, accept, epoll_ctl, fcntl
epoll_wait(epfd=4, events[], 512, -1)
→ returns: [{fd=3, EPOLLIN}]
→ worker wakes up
Worker enters accept loop (must drain fully because EPOLLET):
Iteration 1:
accept(fd=3)
→ kernel dequeues User A's connection from Accept Queue
→ creates fd=5 for this connection
→ returns fd=5
fcntl(fd=5, F_SETFL, O_NONBLOCK)
→ sets fd=5 to non-blocking mode
epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=5, {EPOLLIN|EPOLLET})
→ fd=5 added to interest list
Iteration 2:
accept(fd=3)
→ dequeues User B's connection
→ returns fd=6
fcntl(fd=6, F_SETFL, O_NONBLOCK)
epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=6, {EPOLLIN|EPOLLET})
Iteration 3:
accept(fd=3)
→ returns -1, errno=EAGAIN
→ Accept Queue is empty. Stop looping.
Buffers touched:
- Accept Queue: read (two entries dequeued)
Interest list now:
fd=3 EPOLLIN ← always watching for new clients
fd=5 EPOLLIN ← User A
fd=6 EPOLLIN ← User B
Worker calls epoll_wait again. Sleeps.
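The accept loop above can be sketched as follows. This is an illustration over loopback TCP with two local clients standing in for User A and User B; accept_all is a helper name invented here, and BlockingIOError is how Python reports EAGAIN.

```python
import select
import socket


def accept_all(listener):
    """Accept until the queue is empty, returning the new connections."""
    accepted = []
    while True:
        try:
            conn, _addr = listener.accept()
        except BlockingIOError:        # EAGAIN: accept queue drained
            return accepted
        conn.setblocking(False)
        accepted.append(conn)


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(511)
listener.setblocking(False)
listen_fd = listener.fileno()

ep = select.epoll()
ep.register(listen_fd, select.EPOLLIN | select.EPOLLET)

# Two clients finish their handshakes before the worker looks.
clients = [socket.create_connection(listener.getsockname()) for _ in range(2)]

events = ep.poll(1.0)                  # one edge-triggered event for the listen fd
ready_fds = [fd for fd, _mask in events]
conns = accept_all(listener)           # drains both queued connections
for conn in conns:
    ep.register(conn.fileno(), select.EPOLLIN | select.EPOLLET)
n_accepted = len(conns)

for s in conns + clients + [listener]:
    s.close()
ep.close()
```

One event, two connections: this is why an edge-triggered listener has to loop. Accepting only once per event would leave the second client waiting in the queue until some later connection arrives.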
Phase 3 — HTTP Request Bytes Arrive
Who acts: NIC + Kernel ISR. Worker sleeping.
User A's browser sends GET /api/users HTTP/1.1\r\nHost: ...\r\n\r\n
NIC receives TCP segment
→ DMA writes to NIC Ring Buffer
→ NIC fires hardware IRQ
→ CPU saves state, jumps to ISR
ISR:
reads packet from NIC Ring Buffer
parses IP + TCP headers
matches src_ip:src_port:dst_ip:dst_port → identifies fd=5's socket
copies HTTP payload into fd=5's recv buffer (kernel RAM)
updates TCP: ACK sent back to User A's browser
marks fd=5 as EPOLLIN in epoll ready list
checks: is worker sleeping in epoll_wait? YES → wakes worker
returns
User B's GET /api/orders HTTP/1.1 arrives simultaneously or milliseconds later. Same ISR sequence for fd=6.
Buffers touched:
- NIC Ring Buffer: written by DMA, read by ISR
- fd=5 recv buffer: written by ISR (~hundreds of bytes, the HTTP request)
- fd=6 recv buffer: written by ISR
Interrupts: One IRQ per packet per user. ISR runs in microseconds each time.
Phase 4 — Worker Reads and Parses Both Requests
Syscalls: epoll_wait, read
epoll_wait(epfd=4, events[], 512, -1)
→ returns: [{fd=5, EPOLLIN}, {fd=6, EPOLLIN}]
→ both events in one batch
Processing fd=5 (User A):
Worker enters read loop (must drain, EPOLLET):
Iteration 1:
read(fd=5, buf, 4096)
→ kernel copies bytes from fd=5 recv buffer (kernel RAM)
into buf (worker process RAM)
→ returns n=312 (bytes read, size of HTTP request)
Iteration 2:
read(fd=5, buf, 4096)
→ returns -1, errno=EAGAIN
→ recv buffer empty. Stop.
Worker parses buf:
method = GET
path = /api/users
headers = { Host: ..., Authorization: Bearer xyz }
Worker creates:
request_context_A {
client_fd = 5
backend_fd = -1 ← not yet
state = PARSED
request = "GET /api/users..."
}
Processing fd=6 (User B): Same sequence. request_context_B created with client_fd=6.
Buffers touched:
- fd=5 recv buffer: read by worker (now empty)
- fd=6 recv buffer: read by worker (now empty)
- Worker heap memory: written (two request_context objects)
Kernel boundary crossing: Every read() call copies bytes from kernel RAM → process RAM.
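The read loop from this phase, written out as a small function. This is a minimal sketch: drain is a hypothetical helper, the socketpair stands in for a client connection, and the padded header exists only to force more than one read() iteration.

```python
import socket


def drain(sock, chunk_size=4096):
    """Read until EAGAIN. Returns (data, peer_closed)."""
    parts = []
    while True:
        try:
            chunk = sock.recv(chunk_size)   # kernel RAM -> process RAM copy
        except BlockingIOError:             # EAGAIN: recv buffer is empty
            return b"".join(parts), False
        if not chunk:                       # 0 bytes: peer closed the connection
            return b"".join(parts), True
        parts.append(chunk)


client, server_side = socket.socketpair()
server_side.setblocking(False)

request = (b"GET /api/users HTTP/1.1\r\n"
           b"Host: example.test\r\n"
           b"X-Padding: " + b"x" * 10000 + b"\r\n\r\n")
client.sendall(request)

data, peer_closed = drain(server_side)
request_line = data.split(b"\r\n", 1)[0]
print(request_line)

client.close()
server_side.close()
```

Note that EAGAIN only means "nothing more right now". A real parser still has to check whether the request is complete and keep partial data until the next EPOLLIN event.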
Phase 5 — Worker Opens Backend Connections
Syscalls: socket, fcntl, connect, epoll_ctl
For User A:
socket(AF_INET, SOCK_STREAM, 0)
→ kernel creates new socket struct
→ returns fd=12
fcntl(fd=12, F_SETFL, O_NONBLOCK)
→ makes fd=12 non-blocking
connect(fd=12, {backend_ip, port=8080})
→ kernel initiates TCP handshake with backend (sends SYN)
→ returns immediately with errno=EINPROGRESS
(handshake happening asynchronously in kernel)
epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=12, {EPOLLOUT|EPOLLET})
→ "tell me when fd=12 is writable" = when handshake completes
request_context_A.backend_fd = 12
request_context_A.state = CONNECTING_BACKEND
For User B: Same. Creates fd=13, connects, registers EPOLLOUT.
Interest list now:
fd=3 EPOLLIN ← new clients
fd=5 EPOLLIN ← User A client
fd=6 EPOLLIN ← User B client
fd=12 EPOLLOUT ← backend for User A (connecting)
fd=13 EPOLLOUT ← backend for User B (connecting)
Worker calls epoll_wait. Sleeps. 5 fds watched. 0 CPU consumed.
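A non-blocking connect can be tried against a throwaway local listener. In this sketch the "backend" is just a listening socket on 127.0.0.1 with a kernel-assigned port (an assumption for the demo), and the SO_ERROR check is an extra step not spelled out above: writability alone does not prove the handshake succeeded.

```python
import errno
import select
import socket

backend = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # stand-in backend
backend.bind(("127.0.0.1", 0))
backend.listen(16)

upstream = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
upstream.setblocking(False)
rc = upstream.connect_ex(backend.getsockname())   # usually EINPROGRESS, never blocks

ep = select.epoll()
ep.register(upstream.fileno(), select.EPOLLOUT | select.EPOLLET)
events = ep.poll(1.0)                             # writable: handshake has finished

# SO_ERROR holds the outcome of the asynchronous connect (0 means success).
so_error = upstream.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
connected = bool(events) and so_error == 0
print(rc == errno.EINPROGRESS, connected)

ep.close()
upstream.close()
backend.close()
```

If the backend were down, the fd would still become ready, but SO_ERROR would carry ECONNREFUSED. Checking it is what separates "connected" from "connect failed".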
Phase 6 — Backend TCP Handshakes Complete
Who acts: Kernel. Worker sleeping.
Backend sends SYN-ACK for User A's connect (fd=12)
→ NIC receives, IRQ fires, ISR runs
→ Kernel completes handshake, fd=12 state = ESTABLISHED
→ fd=12 is now writable
→ ISR marks fd=12 EPOLLOUT in ready list
→ Worker is sleeping → kernel wakes it
Same for User B (fd=13)
Phase 7 — Worker Forwards Requests to Backend
Syscalls: epoll_wait, write, epoll_ctl
epoll_wait returns: [{fd=12, EPOLLOUT}, {fd=13, EPOLLOUT}]
Handling fd=12 (User A):
Look up context: fd=12 → request_context_A → request = "GET /api/users..."
write(fd=12, "GET /api/users HTTP/1.1\r\n...", len)
→ kernel copies bytes from worker RAM → fd=12 send buffer (kernel RAM)
→ kernel TCP stack will send these bytes to backend as TCP packets
→ returns immediately (non-blocking write)
epoll_ctl(epfd=4, EPOLL_CTL_MOD, fd=12, {EPOLLIN|EPOLLET})
→ switch from watching EPOLLOUT to EPOLLIN
→ "now tell me when backend sends a response back"
request_context_A.state = WAITING_BACKEND
Handling fd=13 (User B): Same. Forwards GET /api/orders.
Buffers touched:
- fd=12 send buffer: written by worker (the forwarded HTTP request)
- fd=13 send buffer: written by worker
- Kernel TCP stack reads send buffers, constructs packets, hands to NIC
Interrupt on outgoing side: When the NIC finishes transmitting packets, it fires an IRQ so the kernel can reclaim the transmit ring slots. Space in the socket send buffer is freed only once the peer ACKs the data. No worker involvement.
Worker calls epoll_wait. Sleeps.
Phase 8 — Backend Responds (Possibly Out of Order)
Who acts: Kernel ISR. Worker sleeping.
Backend processes /api/orders faster. Sends response for User B first.
Backend TCP segment for User B arrives at NIC
→ DMA → NIC Ring Buffer
→ IRQ fires → ISR runs
→ Payload copied into fd=13 recv buffer
→ fd=13 marked EPOLLIN in ready list
Later:
Backend TCP segment for User A arrives
→ Same path → fd=12 recv buffer filled
→ fd=12 marked EPOLLIN in ready list
Worker was sleeping → kernel wakes it
Buffers touched:
- NIC Ring Buffer: written by DMA, read by ISR
- fd=13 recv buffer: written by ISR (User B response body)
- fd=12 recv buffer: written by ISR (User A response body)
Phase 9 — Worker Reads Backend Response, Writes to Client
Syscalls: epoll_wait, read, write
epoll_wait returns: [{fd=13, EPOLLIN}, {fd=12, EPOLLIN}]
Note: User B's response came first even though User A connected first. epoll returns what is ready, not what came first.
Handling fd=13 (User B's backend response):
read loop on fd=13:
read(fd=13, buf, 65536)
→ kernel copies backend response from fd=13 recv buffer → worker RAM
→ returns n bytes
read(fd=13, buf, 65536)
→ returns -1, errno=EAGAIN (buffer empty, done)
Look up context: fd=13 → request_context_B → client_fd = 6
write(fd=6, buf, n)
→ kernel copies bytes from worker RAM → fd=6 send buffer
→ kernel TCP stack sends packets to User B's browser
epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=13, NULL)
→ remove backend fd from interest list
close(fd=13)
→ kernel tears down backend connection, frees socket struct and buffers
request_context_B.state = DONE
Handling fd=12 (User A): Same. Reads fd=12 → writes to fd=5 → closes fd=12.
Buffers touched:
- fd=13 recv buffer: read by worker (emptied)
- fd=12 recv buffer: read by worker (emptied)
- fd=6 send buffer: written by worker (User B response going to browser)
- fd=5 send buffer: written by worker (User A response going to browser)
Kernel does the rest: TCP stack drains send buffers → NIC → wire → browsers.
Phase 10 — Connection Teardown
Depends on: HTTP version and Connection header.
HTTP/1.0 or Connection: close:
epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=5, NULL) → remove fd=5 from the interest list while it is still open
close(fd=5) → kernel sends FIN to User A, tears down connection, frees socket
epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=6, NULL)
close(fd=6) → same for User B
HTTP/1.1 keep-alive (default):
fd=5 and fd=6 remain open and registered in epoll interest list
Worker goes back to epoll_wait
If same client sends another request → fd=5 fires EPOLLIN → handled immediately
nginx has idle timers: if no new request arrives within keepalive_timeout → nginx closes the connection
Worker returns to epoll_wait. Sleeping again. Ready for next event.
7. What Happens While Worker is Busy
Scenario: Worker is processing User A's request. At the same moment, Users C, D, E connect and send requests.
Worker processing [User A]
│
│ ← IRQ fires (User C SYN arrives)
│ ISR runs in microseconds
│ Kernel completes C's handshake
│ Adds C to Accept Queue
│ Marks fd=3 EPOLLIN in ready list
│ Worker is NOT sleeping → no wakeup sent. That's fine.
│
│ ← IRQ fires (User D SYN arrives)
│ Same. D added to Accept Queue. fd=3 already marked ready.
│
│ ← IRQ fires (User E data arrives on existing connection)
│ Bytes go into fd=E's recv buffer. fd=E marked EPOLLIN.
│
Worker finishes [User A]
│
epoll_wait()
│
→ returns: [{fd=3, EPOLLIN}, {fd=E, EPOLLIN}]
(fd=3 fires once for all queued connections — edge triggered fires on transition)
Worker drains Accept Queue:
accept() → User C → fd=7 → register with epoll
accept() → User D → fd=8 → register with epoll
accept() → EAGAIN → done
Worker reads fd=E
Worker calls epoll_wait. Sleeps.
Key principle: The ready list is written by ISR independently of worker state. The worker sees everything accumulated while it was busy, on its very next epoll_wait call.
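The key principle can be checked directly: events that arrive while nobody is polling are all reported by the next poll. In this sketch three socketpairs stand in for three client connections, and "the worker is busy" simply means the code does not call poll() until all the data has been sent.

```python
import select
import socket

ep = select.epoll()
pairs = [socket.socketpair() for _ in range(3)]
for _writer, reader in pairs:
    reader.setblocking(False)
    ep.register(reader.fileno(), select.EPOLLIN | select.EPOLLET)

# Worker is "busy": data lands on all three connections, nobody is polling.
for writer, _reader in pairs:
    writer.send(b"ping")

# The very next poll returns everything that accumulated in the ready list.
events = ep.poll(0.1)
ready_fds = sorted(fd for fd, _mask in events)
expected_fds = sorted(reader.fileno() for _writer, reader in pairs)
print(ready_fds == expected_fds)

for writer, reader in pairs:
    writer.close()
    reader.close()
ep.close()
```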
8. All Buffers — Master Reference
| # | Buffer | Location | Default Size | Written By | Read By | Full Behaviour |
|---|---|---|---|---|---|---|
| 1 | NIC Ring Buffer | NIC / kernel RAM (DMA region) | 256–4096 packet slots | NIC hardware via DMA | Kernel ISR | Packets dropped, visible via ethtool |
| 2 | SYN Queue | Kernel RAM | ~128–1024 slots | Kernel TCP stack (on SYN) | Kernel TCP stack (on ACK) | SYN drops, connection timeout |
| 3 | Accept Queue | Kernel RAM | min(backlog, somaxconn) ≈ 511 | Kernel TCP stack (post-handshake) | Process via accept() | New connections dropped, connection timeout |
| 4 | Socket Recv Buffer (client fd) | Kernel RAM | ~128–208 KB default, auto-tuned (see 3.3) | Kernel ISR (per interrupt) | Process via read() | TCP flow control, sender throttled |
| 5 | Socket Send Buffer (client fd) | Kernel RAM | ~16–208 KB default, auto-tuned (see 3.4) | Process via write() | Kernel TCP stack → NIC | write() returns EAGAIN (non-blocking) |
| 6 | Worker Heap (request_context) | Process RAM | Unbounded (heap) | Process (after read() + parse) | Process (during response handling) | OOM if too many large requests |
| 7 | Socket Recv Buffer (backend fd) | Kernel RAM | ~128–208 KB default, auto-tuned (see 3.3) | Kernel ISR (backend response) | Process via read() | TCP flow control |
| 8 | Socket Send Buffer (backend fd) | Kernel RAM | ~16–208 KB default, auto-tuned (see 3.4) | Process via write() | Kernel TCP stack → NIC | write() returns EAGAIN |
Data movement order for one request:
NIC Ring Buffer
→ (ISR) → Socket Recv Buffer [client fd]
→ (read syscall) → Worker Heap
→ (write syscall) → Socket Send Buffer [backend fd]
→ (kernel TCP) → NIC → wire → backend
Backend response:
NIC Ring Buffer
→ (ISR) → Socket Recv Buffer [backend fd]
→ (read syscall) → Worker Heap
→ (write syscall) → Socket Send Buffer [client fd]
→ (kernel TCP) → NIC → wire → client browser
9. All Syscalls — Master Reference
| Syscall | Called By | When | What Kernel Does |
|---|---|---|---|
| socket() | Master (listen socket) / Worker (backend conn) | Startup + per backend conn | Allocates socket struct, returns fd |
| setsockopt() | Master | Startup | Sets socket options (SO_REUSEADDR etc.) |
| bind() | Master | Startup | Claims port 80 for the listen socket |
| listen() | Master | Startup | Creates SYN + Accept queues |
| accept() | Worker | On fd=3 EPOLLIN | Dequeues one entry from Accept Queue, returns new fd |
| fcntl(F_SETFL, O_NONBLOCK) | Worker | After accept/socket | Sets fd to non-blocking mode |
| epoll_create1() | Worker | Startup | Creates epoll instance |
| epoll_ctl(ADD) | Worker | After accept/socket | Adds fd to interest list |
| epoll_ctl(MOD) | Worker | After connect/read | Changes watched events (EPOLLIN ↔ EPOLLOUT) |
| epoll_ctl(DEL) | Worker | Before close | Removes fd from interest list |
| epoll_wait() | Worker | After each batch | Sleeps until events; returns ready fd list |
| connect() | Worker | Per upstream conn | Initiates TCP handshake (non-blocking, returns EINPROGRESS) |
| read() | Worker | On EPOLLIN | Copies bytes from socket recv buffer → process RAM |
| write() | Worker | After parsing/response | Copies bytes from process RAM → socket send buffer |
| close() | Worker | After response sent | Sends FIN, frees socket struct and all buffers |
10. Failure Modes
10.1 Accept Queue Full
Cause: Connections arriving faster than worker calls accept().
Symptom: Client sees Connection timed out (kernel drops SYN silently).
Fix: Increase net.core.somaxconn, increase nginx backlog, add more workers.
10.2 Event Loop Starvation
Cause: Worker doing heavy CPU computation or blocking syscall (sync file I/O, blocking DB call) for one connection.
Symptom: All other connections queue up. Latency spikes for everyone.
Fix: Never block the event loop. Offload blocking file I/O to a thread pool (aio threads in nginx) and move CPU-heavy work out of the worker. Use async database drivers.
10.3 Send Buffer Full (Slow Client)
Cause: Client reading response slowly. fd=5 send buffer fills up.
Symptom: write(fd=5) returns EAGAIN. Nginx registers EPOLLOUT on fd=5 and comes back to write more when buffer drains.
Fix: nginx handles this internally. Tune send_timeout to close connections from very slow clients.
10.4 Recv Buffer Full (Slow Worker)
Cause: Data arriving faster than worker reads it.
Symptom: TCP flow control kicks in. Sender throttled. No data loss.
Fix: Usually not a problem. If persistent, check for event loop blocking.
10.5 fd Exhaustion
Cause: Too many open connections. Each costs one fd.
Symptom: accept() returns EMFILE (too many open files).
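Fix: Raise the fd limit (ulimit -n 65536 in the OS, or worker_rlimit_nofile in nginx config) and set worker_connections 10240 in nginx config. Remember that each proxied request holds two fds: one for the client and one for the backend.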
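To see where a process stands, the limit can be read from inside it. The sketch below uses Python's resource module; max_proxied_requests and its reserved parameter are hypothetical, a rough rule of thumb rather than anything nginx computes.

```python
import resource

# Soft limit is what the process is held to; hard limit is the ceiling it may raise to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)


def max_proxied_requests(fd_limit, reserved=16):
    """Rough upper bound on concurrent proxied requests for a given fd limit.

    Each proxied request holds a client fd and a backend fd. `reserved` is a
    guess at fds used for logs, listen sockets, the epoll instance, and so on.
    """
    return max(0, (fd_limit - reserved) // 2)


print(soft, hard)
if soft != resource.RLIM_INFINITY:
    print(max_proxied_requests(soft))
```

With the suggested limit of 65536 this gives roughly 32,000 concurrent proxied requests per worker, about half of what the raw fd count suggests.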
Quick Mental Model
INTERNET
│ raw bytes
▼
NIC Ring Buffer ← DMA (no CPU)
│ IRQ fires
▼
Kernel ISR ← runs in microseconds, interrupts everything
│ copies bytes
▼
Socket Recv Buffer ← kernel RAM, per fd
│ waits here until worker is ready
▼
epoll Ready List ← kernel marks fd ready
│ wakes worker if sleeping
▼
Worker (epoll_wait returns)
│ read() → copies to process RAM
│ parse HTTP
│ write() to backend fd → send buffer
│
▼
Socket Send Buffer ← kernel RAM, per fd
│ kernel drains via TCP
▼
NIC → wire → backend
Backend response takes same path in reverse.
Worker links client fd ↔ backend fd via request_context in its own memory.
Kernel has no concept of this pairing.
Everything the worker does is non-blocking.
When waiting, worker is off the CPU run queue. Zero cycles consumed.