kamal namdeo

nginx Event Loop — Complete Lifecycle Reference

A precise, bottom-up reference covering every buffer, syscall, interrupt, and data movement from the moment a TCP packet hits the NIC to the moment a response is sent back. Two concurrent users are used throughout as a concrete example.


Table of Contents

  1. Foundations — fd and Socket
  2. Hardware Layer — NIC, DMA, Interrupts
  3. Kernel Structures and All Buffers
  4. epoll — How the Worker Waits Efficiently
  5. nginx Startup Sequence
  6. Complete Request Lifecycle — Two Concurrent Users
  7. What Happens While Worker is Busy
  8. All Buffers — Master Reference
  9. All Syscalls — Master Reference
  10. Failure Modes

1. Foundations

1.1 Everything is a File

Linux's core philosophy: every I/O resource — files on disk, network connections, pipes, terminals, devices — is represented as a file. This means one unified API (read, write, close) works on all of them. The kernel manages the actual resource. Your process holds a token.

1.2 File Descriptor (fd)

A file descriptor is just an integer. It is a per-process token that refers to a kernel-managed resource. The kernel maintains a table per process called the fd table — a simple array where the index is the fd and the value is a pointer into the kernel.

Process fd table:
┌─────┬───────────────────────────────┐
│ fd  │ points to                     │
├─────┼───────────────────────────────┤
│  0  │ stdin                         │
│  1  │ stdout                        │
│  2  │ stderr                        │
│  3  │ listen socket (nginx)         │
│  5  │ User A client connection      │
│  6  │ User B client connection      │
│ 12  │ backend connection for User A │
│ 13  │ backend connection for User B │
└─────┴───────────────────────────────┘

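By convention, 0, 1, and 2 are already open when a process starts, so application fds usually start from 3 upward; the kernel always hands out the lowest free number. The fd is meaningless on its own. It only means something when passed to a syscall, where the kernel uses it to look up the real resource.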
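A minimal sketch of the "fd is just an integer" idea, in Python on Linux. The function name `show_fd_reuse` is made up for illustration; it relies only on the POSIX rule that new descriptors take the lowest free number.

```python
import os

def show_fd_reuse():
    """Open a pipe, close one end, and watch the kernel hand the same number back."""
    r, w = os.pipe()      # kernel picks the two lowest free fd numbers
    os.close(r)           # slot r in the fd table is now empty
    again = os.dup(w)     # dup() also takes the lowest free number
    reused = (again == r)
    os.close(again)
    os.close(w)
    return r, w, reused

if __name__ == "__main__":
    r, w, reused = show_fd_reuse()
    print(f"pipe fds: {r}, {w}; number reused after close: {reused}")
```

The numbers printed depend on what the process already has open; only the reuse behaviour is guaranteed.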

1.3 Socket

A socket is the kernel's internal data structure representing one end of a network connection. Created when your process calls socket(). Lives entirely in kernel RAM. Your process never touches it directly.

Socket struct (kernel RAM):
┌────────────────────────────────────┐
│ local_ip:    192.168.1.10          │
│ local_port:  80                    │
│ remote_ip:   203.0.113.5           │
│ remote_port: 54231                 │
│ state:       ESTABLISHED           │
│ recv_buffer: [ ...incoming bytes ] │
│ send_buffer: [ ...outgoing bytes ] │
│ tcp_state_machine: ...             │
│ timers, seq numbers, window size   │
└────────────────────────────────────┘

1.4 How fd and Socket Relate

PROCESS (user space)            KERNEL (kernel space)

fd table                        Open file table         Socket struct
┌──────┐                        ┌─────────────┐         ┌──────────────┐
│ fd=5 │ ─────────────────────► │ file entry  │ ──────► │ socket {     │
└──────┘                        │ (flags,     │         │  recv buffer │
                                │  offset)    │         │  send buffer │
                                └─────────────┘         │  state ...   │
                                                        └──────────────┘

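This two-level indirection exists so that several fds can refer to the same open socket: a parent and child share it after fork() (the child inherits the same fd numbers), and dup() or fd passing over a Unix socket can give it a different fd number.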
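The sharing described above can be sketched with `os.dup()`. This is an illustration only: `shared_socket_demo` is a hypothetical name, and a local `socketpair()` stands in for a real TCP connection.

```python
import os
import socket

def shared_socket_demo():
    """Two fd numbers, one kernel socket: bytes sent via the duplicate reach the peer."""
    a, b = socket.socketpair()               # two connected sockets
    dup_fd = os.dup(a.fileno())              # second fd number, same open file entry
    alias = socket.socket(fileno=dup_fd)     # wrap the duplicate; no new kernel socket
    different_numbers = dup_fd != a.fileno()
    alias.sendall(b"via-dup")
    data = b.recv(64)
    alias.close()
    a.close()
    b.close()
    return different_numbers, data

if __name__ == "__main__":
    print(shared_socket_demo())
```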


2. Hardware Layer

2.1 NIC (Network Interface Card)

A physical chip on the motherboard (or PCIe slot). It receives electrical/optical signals from the network cable, decodes them into bytes, and writes them into RAM. The CPU has no involvement in receiving the bytes themselves — the NIC does it autonomously.

2.2 DMA (Direct Memory Access)

DMA is the mechanism that lets the NIC write bytes directly into RAM without involving the CPU. When the NIC driver initializes, the kernel sets up a region of RAM called the NIC Ring Buffer and tells the NIC its address. The NIC writes incoming packets straight there. Zero CPU cycles spent moving bytes.

2.3 NIC Ring Buffer

A circular array of fixed-size slots in kernel RAM, allocated at NIC driver initialization.

NIC Ring Buffer (in kernel RAM):
┌──────────┬──────────┬──────────┬──────────┐
│ slot 0   │ slot 1   │ slot 2   │ slot 3   │  ← NIC writes here via DMA
│ pkt data │ pkt data │ (empty)  │ (empty)  │
└──────────┴──────────┴──────────┴──────────┘
      ▲                     ▲
   NIC writes            kernel reads (via interrupt)
  • Size: 256 to 4096 slots typically. Each slot holds one packet (up to 1500 bytes for standard Ethernet MTU).
  • Who writes: NIC hardware via DMA.
  • Who reads: Kernel interrupt handler (ISR).
  • What happens when full: Packets are dropped. Visible via ethtool -S eth0 | grep drop or ip -s link.
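The drop-when-full behaviour can be modelled in a few lines. This is a toy model, not driver code: the class `RingBuffer` and its method names are invented here, and a real ring holds DMA descriptors rather than Python objects.

```python
class RingBuffer:
    """Toy model of a NIC RX ring: fixed slots, one producer index, one consumer index."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0       # next slot the "NIC" writes
        self.tail = 0       # next slot the "kernel" reads
        self.count = 0      # slots currently occupied
        self.dropped = 0    # packets lost because the ring was full

    def nic_write(self, packet):
        if self.count == len(self.slots):
            self.dropped += 1          # ring full: the packet is lost
            return False
        self.slots[self.head] = packet
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def kernel_read(self):
        if self.count == 0:
            return None
        packet = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return packet

if __name__ == "__main__":
    ring = RingBuffer(4)
    for i in range(6):
        ring.nic_write(i)
    print("dropped:", ring.dropped)
```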

2.4 Hardware Interrupt (IRQ — Interrupt Request)

When the NIC finishes writing one or more packets into the Ring Buffer via DMA, it sends an electrical signal on a dedicated wire to the CPU. This signal is an IRQ (Interrupt Request).

The CPU responds by:

  1. Finishing the current instruction (not mid-instruction).
  2. Saving its register state (so it can resume later).
  3. Looking up the interrupt number in the IDT (Interrupt Descriptor Table) — a kernel table mapping interrupt numbers to handler functions.
  4. Jumping to the handler function.
  5. Restoring state and resuming what it was doing before.

This entire sequence takes microseconds. The user process (nginx worker) does not know it happened.

2.5 ISR (Interrupt Service Routine)

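The kernel code that runs in response to a NIC interrupt. On modern Linux the hard interrupt handler itself only acknowledges the NIC and schedules a softirq (NAPI), and the softirq does most of the steps below; this reference calls the whole path "the ISR". When it runs: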

  1. Reads packet(s) from the NIC Ring Buffer.
  2. Parses the Ethernet frame, IP header, TCP header.
  3. Identifies which socket this packet belongs to (by matching src_ip:src_port:dst_ip:dst_port).
  4. Copies payload bytes into that socket's recv buffer in kernel RAM.
  5. Updates TCP sequence numbers, ACKs, window size.
  6. Marks that socket's fd as ready (EPOLLIN) in the epoll ready list.
  7. If the process was sleeping in epoll_wait, wakes it up.
  8. Returns. CPU resumes what it was doing.
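Steps 3, 4, and 6 above amount to a lookup keyed by the connection 4-tuple. The sketch below is a user-space model under that assumption; `ToySocket`, `deliver`, and the packet dictionary layout are hypothetical and do not mirror kernel data structures.

```python
from collections import deque

class ToySocket:
    """Stand-in for the kernel socket struct: only the receive buffer is modelled."""
    def __init__(self):
        self.recv_buffer = bytearray()

def deliver(connections, ready_list, packet):
    """Find the socket for a packet, append its payload, and mark the socket ready."""
    key = (packet["src_ip"], packet["src_port"], packet["dst_ip"], packet["dst_port"])
    sock = connections.get(key)
    if sock is None:
        return False                      # no matching connection for this 4-tuple
    sock.recv_buffer += packet["payload"]
    if sock not in ready_list:            # a socket appears on the ready list once
        ready_list.append(sock)
    return True

if __name__ == "__main__":
    key = ("203.0.113.5", 54231, "192.168.1.10", 80)
    conns = {key: ToySocket()}
    ready = deque()
    pkt = dict(zip(("src_ip", "src_port", "dst_ip", "dst_port"), key), payload=b"GET /")
    print(deliver(conns, ready, pkt), len(ready))
```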

3. Kernel Structures and All Buffers

3.1 SYN Queue (Incomplete Connections Queue)

When a client sends a SYN packet to initiate a TCP connection:

  • Kernel puts this half-open connection into the SYN Queue.
  • Sends back SYN-ACK.
  • Waits for the client's final ACK to complete the handshake.
Tunable:  /proc/sys/net/ipv4/tcp_max_syn_backlog
Default:  128 (older kernels) / 1024 (newer)

Who writes: Kernel TCP stack on receiving SYN.
Who reads: Kernel TCP stack on receiving final ACK — moves entry to Accept Queue.
What happens when full: Kernel drops incoming SYNs silently. Client retries. Appears as connection timeout.

SYN flood attacks fill this queue deliberately. tcp_syncookies is the mitigation.

3.2 Accept Queue (Complete Connections Queue)

After the 3-way handshake completes, the connection moves from SYN Queue to the Accept Queue. These connections are fully established and waiting for your process to call accept().

Effective size = min(listen() backlog argument, net.core.somaxconn)

net.core.somaxconn default: 128 (kernel < 5.4) / 4096 (kernel >= 5.4)
nginx default backlog:      511

So on modern Linux: min(511, 4096) = 511 slots
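The effective-size rule is plain arithmetic. The helper names below are assumptions for illustration; `read_somaxconn` returns `None` when the proc file is unavailable (for example outside Linux).

```python
def effective_accept_queue(backlog, somaxconn):
    """Accept queue limit as described above: the smaller of the two values wins."""
    return min(backlog, somaxconn)

def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read the live kernel setting, or return None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    live = read_somaxconn()
    print("somaxconn:", live)
    if live is not None:
        print("nginx default backlog 511 gives:", effective_accept_queue(511, live))
```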

Who writes: Kernel TCP stack when handshake completes.
Who reads: Your process via accept() syscall.
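What happens when full: Kernel drops new SYNs and the final ACK of handshakes it cannot queue (or sends RST if net.ipv4.tcp_abort_on_overflow is 1). Client retransmits and eventually sees a connection timeout.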

3.3 Socket Receive Buffer

Per-connection buffer in kernel RAM. Incoming payload bytes (from ISR) land here after the TCP header is stripped.

TCP sockets:   net.ipv4.tcp_rmem = "min default max" (typically 4096 131072 6291456: 128 KB default, up to ~6 MB)
Other sockets: net.core.rmem_default = 212992 (~208 KB)
SO_RCVBUF cap: net.core.rmem_max (212992 by default; often raised, e.g. to 134217728 = 128 MB)
Auto-tuning:   Linux resizes TCP receive buffers dynamically between the tcp_rmem min and max

Who writes: Kernel ISR (on hardware interrupt).
Who reads: Your process via read() or recv() syscall.
What happens when full: The kernel keeps ACKing, but each ACK advertises a smaller receive window. When the buffer is full the advertised window reaches zero and the sender's TCP stack stops sending. This is TCP flow control. No data loss: the sender just waits until the window reopens.

3.4 Socket Send Buffer

Per-connection buffer in kernel RAM. When your process calls write(), bytes go here. Kernel TCP stack drains this buffer by constructing TCP packets and sending them out via the NIC.

TCP sockets:   net.ipv4.tcp_wmem = "min default max" (typically 4096 16384 4194304: 16 KB default, up to 4 MB)
Other sockets: net.core.wmem_default = 212992 (~208 KB)
SO_SNDBUF cap: net.core.wmem_max (212992 by default; often raised, e.g. to 134217728 = 128 MB)

Who writes: Your process via write() or send() syscall.
Who reads: Kernel TCP stack (constructs packets and hands to NIC).
What happens when full: write() blocks (in blocking mode) or returns EAGAIN (in non-blocking mode). nginx uses non-blocking — it registers EPOLLOUT on the fd and retries when kernel signals the buffer has drained.
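The EAGAIN behaviour can be reproduced without a network. This sketch assumes a local `socketpair()` (a Unix-domain stream, not TCP) whose reader never reads, playing the role of a slow client; `fill_send_buffer` is a name invented here. Python surfaces EAGAIN as `BlockingIOError`.

```python
import socket

def fill_send_buffer():
    """Write to a non-blocking socket until the kernel refuses more bytes."""
    writer, reader = socket.socketpair()      # reader never reads: a "slow client"
    writer.setblocking(False)
    sndbuf = writer.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    chunk = b"x" * 4096
    sent = 0
    hit_eagain = False
    try:
        while sent < 64 * 1024 * 1024:        # safety cap so the loop always ends
            sent += writer.send(chunk)        # may accept fewer bytes than offered
    except BlockingIOError:                   # errno EAGAIN / EWOULDBLOCK
        hit_eagain = True
    writer.close()
    reader.close()
    return sndbuf, sent, hit_eagain

if __name__ == "__main__":
    sndbuf, sent, hit = fill_send_buffer()
    print(f"SO_SNDBUF={sndbuf} bytes accepted={sent} EAGAIN={hit}")
```

At the point where `BlockingIOError` is raised, an event loop would register EPOLLOUT and retry later instead of spinning.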

3.5 nginx Worker Memory (User Space)

When the worker calls read(fd, buf, size), the kernel copies bytes from the socket recv buffer (kernel RAM) into buf (process RAM). This is the boundary crossing — from kernel space to user space.

The worker parses HTTP in this memory and builds request_context objects:

request_context {
  client_fd:   5            incoming connection fd
  backend_fd:  12           outgoing connection fd
  state:       WAITING_BACKEND
  method:      GET
  path:        /api/users
  headers:     { host, auth, content-type ... }
  body:        (if POST)
  response:    (filled later)
}

This is the only place that links a client fd to its backend fd. The kernel has no concept of this pairing.
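One way to picture that pairing is a dictionary keyed by fd, with both fds pointing at the same context object. The sketch below is illustrative only: `RequestContext` mirrors the fields listed above, while `ContextTable` and its methods are hypothetical and are not nginx's actual structures.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class RequestContext:
    client_fd: int
    backend_fd: int = -1            # -1 until a backend connection exists
    state: str = "PARSED"
    method: str = ""
    path: str = ""
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""
    response: bytes = b""

class ContextTable:
    """Maps either fd of a request (client or backend) to the same context."""

    def __init__(self):
        self._by_fd: Dict[int, RequestContext] = {}

    def add_client(self, ctx: RequestContext) -> None:
        self._by_fd[ctx.client_fd] = ctx

    def attach_backend(self, ctx: RequestContext, backend_fd: int) -> None:
        ctx.backend_fd = backend_fd
        ctx.state = "CONNECTING_BACKEND"
        self._by_fd[backend_fd] = ctx

    def lookup(self, fd: int) -> Optional[RequestContext]:
        return self._by_fd.get(fd)

if __name__ == "__main__":
    table = ContextTable()
    ctx = RequestContext(client_fd=5, method="GET", path="/api/users")
    table.add_client(ctx)
    table.attach_backend(ctx, 12)
    print(table.lookup(12).client_fd)
```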


4. epoll

4.1 What is epoll

epoll is a Linux kernel subsystem that lets a single process monitor thousands of fds simultaneously without scanning all of them. It is the engine behind nginx's ability to handle 10,000+ concurrent connections in a single thread.

Three kernel structures make up epoll:

4.2 epoll Instance

Created via epoll_create1(). Returns an fd that represents the epoll instance itself. The instance contains:

  • Interest List — which fds to watch and what events.
  • Ready List — which fds currently have events pending.

4.3 Interest List

A red-black tree (O(log n) insertion/deletion) inside the kernel. Each node represents one watched fd and what event to watch for.

Interest list (red-black tree in kernel RAM):
┌─────────────────────────────────┐
│ fd=3,  events=EPOLLIN           │  ← new client connections
│ fd=5,  events=EPOLLIN           │  ← User A client data
│ fd=6,  events=EPOLLIN           │  ← User B client data
│ fd=12, events=EPOLLIN|EPOLLOUT  │  ← backend for User A
│ fd=13, events=EPOLLIN|EPOLLOUT  │  ← backend for User B
└─────────────────────────────────┘

Modified by: epoll_ctl(epfd, EPOLL_CTL_ADD/MOD/DEL, fd, event) — always called by worker.

4.4 Ready List

A doubly-linked list inside the kernel. When the ISR marks an fd as having data, it adds that fd to this list. epoll_wait() reads and clears this list.

Written by: Kernel ISR (on hardware interrupt) — completely independent of worker state.
Read by: Worker via epoll_wait().

This separation is critical: the ready list is updated even when the worker is busy. Nothing is lost.

4.5 Edge Triggered vs Level Triggered

nginx uses Edge Triggered (EPOLLET).

| Mode | epoll_wait returns fd when | Risk if you don't drain fully |
|---|---|---|
| Level Triggered (default) | fd has data (keeps returning until empty) | None: will re-notify |
| Edge Triggered (EPOLLET) | fd transitions from no-data → has-data | Will NOT re-notify until new data arrives |

Edge triggered means nginx must loop read() until it gets EAGAIN on every wakeup. If it reads only once and goes back to epoll_wait, remaining data in the buffer will never trigger a new event and that connection stalls.

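The same applies to accept() on an edge-triggered listen socket: the worker must loop accept() until EAGAIN to drain the entire accept queue every time fd=3 fires. (nginx itself registers listen sockets level-triggered and accepts one connection per wakeup unless multi_accept is on; this reference uses the edge-triggered drain loop throughout for simplicity.)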
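The difference between the two modes can be observed directly. A minimal sketch, assuming Linux (`select.epoll`) and a local `socketpair()`; the function name `count_notifications` is made up for this example.

```python
import select
import socket

def count_notifications(edge_triggered):
    """Send 5 bytes, never read them, and poll twice."""
    tx, rx = socket.socketpair()
    rx.setblocking(False)
    ep = select.epoll()
    flags = select.EPOLLIN | (select.EPOLLET if edge_triggered else 0)
    ep.register(rx.fileno(), flags)
    tx.send(b"hello")
    first = len(ep.poll(0.5))     # both modes report the fd once
    second = len(ep.poll(0))      # only level-triggered reports it again
    ep.close()
    tx.close()
    rx.close()
    return first, second

if __name__ == "__main__":
    print("level-triggered:", count_notifications(False))
    print("edge-triggered: ", count_notifications(True))
```

In the edge-triggered case the unread bytes are still in the buffer after the second poll; nothing will report them until new data arrives, which is why the read loop must run to EAGAIN.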

4.6 epoll Syscalls

| Syscall | What it does |
|---|---|
| epoll_create1(flags) | Creates epoll instance, returns epfd |
| epoll_ctl(epfd, op, fd, event) | ADD/MOD/DEL an fd in the interest list |
| epoll_wait(epfd, events[], maxevents, timeout) | Sleep until events ready; returns array of ready fds |

timeout = -1 means sleep indefinitely. The process is removed from the CPU scheduler run queue. Zero CPU consumed while sleeping.
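For readers who want to try the three calls, Python's `select.epoll` wraps them closely. This is a sketch assuming Linux; `epoll_roundtrip` is a hypothetical name, and the comments give the approximate syscall behind each line.

```python
import select
import socket

def epoll_roundtrip():
    """Create, register, wait, and unregister, using a local socket pair."""
    tx, rx = socket.socketpair()
    rx_fd = rx.fileno()
    ep = select.epoll()                    # epoll_create1()
    ep.register(rx_fd, select.EPOLLIN)     # epoll_ctl(EPOLL_CTL_ADD)
    idle = ep.poll(0.05)                   # epoll_wait() with a 50 ms timeout: nothing ready
    tx.send(b"ping")
    ready = ep.poll(1.0)                   # returns as soon as rx becomes readable
    ep.unregister(rx_fd)                   # epoll_ctl(EPOLL_CTL_DEL)
    ep.close()
    tx.close()
    rx.close()
    fd, events = ready[0]
    return len(idle), fd == rx_fd, bool(events & select.EPOLLIN)

if __name__ == "__main__":
    print(epoll_roundtrip())
```

Passing `-1` (or no argument) to `poll()` corresponds to the indefinite sleep described above.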


5. nginx Startup Sequence

Master Process — Before Workers Start

nginx has two process roles: one master process and one or more worker processes.

Master process:

  • Runs as root.
  • Reads nginx.conf. Determines worker_processes count — if set to auto, nginx reads the number of CPU cores on the machine (nproc) and spawns that many workers.
  • Calls socket(), bind(), listen() on port 80 once, before forking. This is deliberate: binding to a port below 1024 requires root privileges (or CAP_NET_BIND_SERVICE). Workers never need to bind themselves.
  • Calls fork() once per worker. Each forked child inherits fd=3 (the listen socket) automatically. This is how all workers share the same port without any of them needing root.
  • After forking, master drops into a supervision loop — watches workers, restarts any that crash, handles signals (nginx -s reload, nginx -s stop).
  • Master itself never handles a single HTTP request.

After fork, each worker independently:

  • Calls epoll_create1() — gets its own epoll instance.
  • Calls epoll_ctl(ADD, fd=3, EPOLLIN) — registers the inherited listen socket.
  • Calls epoll_wait() — sleeps, waiting for connections.
Master Process (root)
  socket() + bind() + listen() on fd=3
        │
        ├── fork() → Worker 1 (inherits fd=3) → own epoll → epoll_wait
        ├── fork() → Worker 2 (inherits fd=3) → own epoll → epoll_wait
        ├── fork() → Worker 3 (inherits fd=3) → own epoll → epoll_wait
        └── fork() → Worker 4 (inherits fd=3) → own epoll → epoll_wait

All workers call accept() on the same fd=3.
Kernel distributes incoming connections across workers.
Master just watches. Restarts crashed workers.

When multiple workers call accept() on the same fd=3 simultaneously, the kernel ensures only one worker gets each connection — no duplication. nginx also sets accept_mutex (older versions) or relies on EPOLLEXCLUSIVE flag (modern Linux) to avoid all workers waking up for every single connection — known as the thundering herd problem.
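The inherit-by-fork mechanism can be sketched with one master and one worker. Assumptions: Linux or another Unix with `os.fork()`, an ephemeral loopback port instead of port 80, and a made-up function name `fork_shared_listener`. The worker replies with its own pid so the caller can see who accepted.

```python
import os
import socket

def fork_shared_listener():
    """Master binds and listens, forks one worker, and the worker does the accept()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))         # port 0: let the kernel pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]
    master_pid = os.getpid()

    pid = os.fork()
    if pid == 0:                            # worker: inherited the listen socket
        try:
            conn, _addr = listener.accept()
            conn.sendall(str(os.getpid()).encode())
            conn.close()
        finally:
            os._exit(0)                     # never fall back into the caller's code

    # master: acts as the client here, and never calls accept() itself
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        served_by = int(client.recv(32).decode())
    os.waitpid(pid, 0)
    listener.close()
    return master_pid, pid, served_by

if __name__ == "__main__":
    master, worker, served_by = fork_shared_listener()
    print(f"master={master} worker={worker} served_by={served_by}")
```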

The syscall sequence below runs once when nginx starts: steps 1 to 4 in the master before forking, steps 5 to 7 in each worker. No requests are handled until it is complete.

1.  socket(AF_INET, SOCK_STREAM, 0)
     kernel creates socket struct
     returns fd=3 (listen socket)

2.  setsockopt(fd=3, SO_REUSEADDR, 1)
     allows reuse of port 80 after restart
     avoids "Address already in use" error

3.  bind(fd=3, {ip=0.0.0.0, port=80})
     registers this socket as owner of port 80 with kernel
     kernel now routes port 80 packets to fd=3

4.  listen(fd=3, backlog=511)
     kernel creates SYN Queue and Accept Queue for this socket
     accept queue limit = min(511, somaxconn)
     kernel begins accepting TCP handshakes even before accept() is called

5.  epoll_create1(EPOLL_CLOEXEC)
     creates epoll instance
     returns epfd=4

6.  epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=3, {EPOLLIN|EPOLLET})
     registers listen socket in interest list
     "wake me up when a new connection is ready to accept"

7.  epoll_wait(epfd=4, events[], 512, -1)
     worker sleeps. 0 CPU consumed. Waiting for first knock.
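Steps 1 to 6 map almost one-to-one onto Python's `socket` and `select` modules. A minimal sketch, assuming Linux and an unprivileged ephemeral port rather than port 80; `start_listener` and its defaults are illustrative choices, not nginx code.

```python
import select
import socket

def start_listener(host="127.0.0.1", port=0, backlog=511):
    """Create a non-blocking listen socket and an epoll instance watching it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)       # step 1: socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)     # step 2: setsockopt()
    sock.bind((host, port))                                        # step 3: bind()
    sock.listen(backlog)                                           # step 4: listen()
    sock.setblocking(False)            # so accept() returns EAGAIN instead of blocking
    ep = select.epoll()                                            # step 5: epoll_create1()
    ep.register(sock.fileno(), select.EPOLLIN | select.EPOLLET)    # step 6: epoll_ctl(ADD)
    return sock, ep

if __name__ == "__main__":
    sock, ep = start_listener()
    print("listening on", sock.getsockname())
    print("events with no clients:", ep.poll(0))    # step 7 would be ep.poll() with no timeout
    ep.close()
    sock.close()
```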

6. Complete Request Lifecycle

Cast

| Symbol | What |
|---|---|
| fd=3 | nginx listen socket |
| fd=5 | User A ↔ nginx TCP connection |
| fd=6 | User B ↔ nginx TCP connection |
| fd=12 | nginx ↔ backend TCP connection for User A |
| fd=13 | nginx ↔ backend TCP connection for User B |
| epfd=4 | epoll instance |

Phase 1 — TCP Handshake (Kernel Only, Worker Sleeping)

Who acts: Kernel TCP stack + NIC. Worker is untouched.

User A browser                    Kernel (port 80)
      │── SYN ──────────────────► NIC receives packet
      │                           ISR runs: puts in SYN Queue
      │◄── SYN-ACK ─────────────  Kernel sends SYN-ACK via NIC
      │── ACK ──────────────────► ISR runs: moves to Accept Queue
      │                           Kernel creates the new socket (fd=5 is assigned later, at accept())
      │                           Marks fd=3 as EPOLLIN in ready list

User B browser (simultaneously)
      │── SYN ──────────────────► Same process
      │◄── SYN-ACK ─────────────
      │── ACK ──────────────────► second socket created (fd=6 assigned later, at accept())
                                  fd=3 still marked EPOLLIN

Buffers touched:

  • SYN Queue: written (User A SYN), read (on ACK received), written again (User B SYN)
  • Accept Queue: written twice (one entry per completed handshake)
  • NIC Ring Buffer: written by NIC DMA, read by ISR

Interrupts: Hardware IRQ fires per packet. ISR runs. Worker untouched.


Phase 2 — Worker Wakes, Accepts Both Connections

Syscalls: epoll_wait, accept, epoll_ctl, fcntl

epoll_wait(epfd=4, events[], 512, -1)
 returns: [{fd=3, EPOLLIN}]
 worker wakes up

Worker enters accept loop (must drain fully because EPOLLET):

Iteration 1:
  accept(fd=3)
   kernel dequeues User A's connection from Accept Queue
   creates fd=5 for this connection
   returns fd=5

  fcntl(fd=5, F_SETFL, O_NONBLOCK)
   sets fd=5 to non-blocking mode

  epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=5, {EPOLLIN|EPOLLET})
   fd=5 added to interest list

Iteration 2:
  accept(fd=3)
   dequeues User B's connection
   returns fd=6

  fcntl(fd=6, F_SETFL, O_NONBLOCK)
  epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=6, {EPOLLIN|EPOLLET})

Iteration 3:
  accept(fd=3)
   returns -1, errno=EAGAIN
   Accept Queue is empty. Stop looping.

Buffers touched:

  • Accept Queue: read (two entries dequeued)

Interest list now:

fd=3  EPOLLIN  ← always watching for new clients
fd=5  EPOLLIN  ← User A
fd=6  EPOLLIN  ← User B

Worker calls epoll_wait again. Sleeps.
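The accept loop of this phase, sketched in Python under the same edge-triggered assumption as the text. `drain_accept_queue` and `demo_two_clients` are hypothetical names; the demo uses a loopback listener and two local clients in place of User A and User B.

```python
import select
import socket
import time

def drain_accept_queue(listener, ep):
    """accept() until EAGAIN; make each new fd non-blocking and register it."""
    accepted = []
    while True:
        try:
            conn, _addr = listener.accept()
        except BlockingIOError:          # EAGAIN: accept queue is empty
            break
        conn.setblocking(False)          # same effect as fcntl(F_SETFL, O_NONBLOCK)
        ep.register(conn.fileno(), select.EPOLLIN | select.EPOLLET)
        accepted.append(conn)
    return accepted

def demo_two_clients():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)
    listener.setblocking(False)
    ep = select.epoll()
    ep.register(listener.fileno(), select.EPOLLIN | select.EPOLLET)
    clients = [socket.create_connection(listener.getsockname()) for _ in range(2)]
    accepted = []
    deadline = time.monotonic() + 2.0
    while len(accepted) < 2 and time.monotonic() < deadline:
        ep.poll(0.1)                     # wait for the listen socket to fire
        accepted += drain_accept_queue(listener, ep)
    count = len(accepted)
    for s in clients + accepted + [listener]:
        s.close()
    ep.close()
    return count

if __name__ == "__main__":
    print("connections accepted:", demo_two_clients())
```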


Phase 3 — HTTP Request Bytes Arrive

Who acts: NIC + Kernel ISR. Worker sleeping.

User A's browser sends GET /api/users HTTP/1.1\r\nHost: ...\r\n\r\n

NIC receives TCP segment
 DMA writes to NIC Ring Buffer
 NIC fires hardware IRQ
 CPU saves state, jumps to ISR

ISR:
  reads packet from NIC Ring Buffer
  parses IP + TCP headers
  matches src_ip:src_port:dst_ip:dst_port → identifies fd=5's socket
  copies HTTP payload into fd=5's recv buffer (kernel RAM)
  updates TCP: ACK sent back to User A's browser
  marks fd=5 as EPOLLIN in epoll ready list
  checks: is worker sleeping in epoll_wait? YES → wakes worker
  returns

User B's GET /api/orders HTTP/1.1 arrives simultaneously or milliseconds later. Same ISR sequence for fd=6.

Buffers touched:

  • NIC Ring Buffer: written by DMA, read by ISR
  • fd=5 recv buffer: written by ISR (~hundreds of bytes, the HTTP request)
  • fd=6 recv buffer: written by ISR

Interrupts: One IRQ per packet per user. ISR runs in microseconds each time.


Phase 4 — Worker Reads and Parses Both Requests

Syscalls: epoll_wait, read

epoll_wait(epfd=4, events[], 512, -1)
 returns: [{fd=5, EPOLLIN}, {fd=6, EPOLLIN}]
 both events in one batch

Processing fd=5 (User A):

Worker enters read loop (must drain, EPOLLET):

  Iteration 1:
    read(fd=5, buf, 4096)
     kernel copies bytes from fd=5 recv buffer (kernel RAM)
      into buf (worker process RAM)
     returns n=312 (bytes read, size of HTTP request)

  Iteration 2:
    read(fd=5, buf, 4096)
     returns -1, errno=EAGAIN
     recv buffer empty. Stop.

Worker parses buf:
  method  = GET
  path    = /api/users
  headers = { Host: ..., Authorization: Bearer xyz }

Worker creates:
  request_context_A {
    client_fd  = 5
    backend_fd = -1        not yet
    state      = PARSED
    request    = "GET /api/users..."
  }

Processing fd=6 (User B): Same sequence. request_context_B created with client_fd=6.

Buffers touched:

  • fd=5 recv buffer: read by worker (now empty)
  • fd=6 recv buffer: read by worker (now empty)
  • Worker heap memory: written (two request_context objects)

Kernel boundary crossing: Every read() call copies bytes from kernel RAM → process RAM.
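The read-until-EAGAIN loop and the first parsing step can be sketched as follows. Both helper names are assumptions, and `parse_request_line` handles only the request line, not headers or bodies.

```python
import socket

def read_until_eagain(sock, chunk_size=4096):
    """Drain a non-blocking socket; return (bytes read, whether the peer closed)."""
    data = bytearray()
    peer_closed = False
    while True:
        try:
            chunk = sock.recv(chunk_size)
        except BlockingIOError:      # EAGAIN: recv buffer is empty
            break
        if not chunk:                # recv() returned 0 bytes: peer closed its side
            peer_closed = True
            break
        data += chunk
    return bytes(data), peer_closed

def parse_request_line(raw):
    """Split 'GET /path HTTP/1.1' out of the first line of a request."""
    line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, version = line.split(" ", 2)
    return method, path, version

if __name__ == "__main__":
    tx, rx = socket.socketpair()
    rx.setblocking(False)
    tx.sendall(b"GET /api/users HTTP/1.1\r\nHost: example\r\n\r\n")
    raw, closed = read_until_eagain(rx)
    print(parse_request_line(raw), closed)
    tx.close()
    rx.close()
```

Unlike the trace above, the loop accumulates chunks, since a request larger than the chunk size arrives over several `recv()` calls.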


Phase 5 — Worker Opens Backend Connections

Syscalls: socket, fcntl, connect, epoll_ctl

For User A:

socket(AF_INET, SOCK_STREAM, 0)
 kernel creates new socket struct
 returns fd=12

fcntl(fd=12, F_SETFL, O_NONBLOCK)
 makes fd=12 non-blocking

connect(fd=12, {backend_ip, port=8080})
 kernel initiates TCP handshake with backend (sends SYN)
 returns immediately with errno=EINPROGRESS
   (handshake happening asynchronously in kernel)

epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=12, {EPOLLOUT|EPOLLET})
 "tell me when fd=12 is writable" = when handshake completes

request_context_A.backend_fd = 12
request_context_A.state      = CONNECTING_BACKEND

For User B: Same. Creates fd=13, connects, registers EPOLLOUT.

Interest list now:

fd=3   EPOLLIN          ← new clients
fd=5   EPOLLIN          ← User A client
fd=6   EPOLLIN          ← User B client
fd=12  EPOLLOUT         ← backend for User A (connecting)
fd=13  EPOLLOUT         ← backend for User B (connecting)

Worker calls epoll_wait. Sleeps. 5 fds watched. 0 CPU consumed.
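A non-blocking connect can be sketched like this. Assumptions: Linux, a loopback listener standing in for the backend, and invented helper names. One detail the trace above leaves out is included here: after EPOLLOUT fires, SO_ERROR should be checked, because writability is also reported when the connect failed.

```python
import errno
import os
import select
import socket

def start_connect(address, ep):
    """Begin a TCP handshake without blocking and ask epoll to report completion."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    rc = sock.connect_ex(address)            # usually EINPROGRESS, occasionally 0
    if rc not in (0, errno.EINPROGRESS):
        sock.close()
        raise OSError(rc, os.strerror(rc))
    ep.register(sock.fileno(), select.EPOLLOUT | select.EPOLLET)
    return sock, rc

def finish_connect(sock):
    """Return 0 if the handshake succeeded, else the errno of the failure."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)

def demo_connect():
    backend = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    backend.bind(("127.0.0.1", 0))
    backend.listen(8)
    ep = select.epoll()
    sock, rc = start_connect(backend.getsockname(), ep)
    events = ep.poll(2.0)                    # wakes when the handshake completes
    writable = any(fd == sock.fileno() and bool(ev & select.EPOLLOUT) for fd, ev in events)
    err = finish_connect(sock)
    sock.close()
    backend.close()
    ep.close()
    return rc in (0, errno.EINPROGRESS), writable, err

if __name__ == "__main__":
    print(demo_connect())
```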


Phase 6 — Backend TCP Handshakes Complete

Who acts: Kernel. Worker sleeping.

Backend sends SYN-ACK for User A's connect (fd=12)
 NIC receives, IRQ fires, ISR runs
 Kernel completes handshake, fd=12 state = ESTABLISHED
 fd=12 is now writable
 ISR marks fd=12 EPOLLOUT in ready list
 Worker is sleeping → kernel wakes it

Same for User B (fd=13)

Phase 7 — Worker Forwards Requests to Backend

Syscalls: epoll_wait, write, epoll_ctl

epoll_wait returns: [{fd=12, EPOLLOUT}, {fd=13, EPOLLOUT}]

Handling fd=12 (User A):

Look up context: fd=12 → request_context_A → request = "GET /api/users..."

write(fd=12, "GET /api/users HTTP/1.1\r\n...", len)
 kernel copies bytes from worker RAM → fd=12 send buffer (kernel RAM)
 kernel TCP stack will send these bytes to backend as TCP packets
 returns immediately (non-blocking write)

epoll_ctl(epfd=4, EPOLL_CTL_MOD, fd=12, {EPOLLIN|EPOLLET})
 switch from watching EPOLLOUT to EPOLLIN
 "now tell me when backend sends a response back"

request_context_A.state = WAITING_BACKEND

Handling fd=13 (User B): Same. Forwards GET /api/orders.

Buffers touched:

  • fd=12 send buffer: written by worker (the forwarded HTTP request)
  • fd=13 send buffer: written by worker
  • Kernel TCP stack reads send buffers, constructs packets, hands to NIC

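Interrupt on outgoing side: When the NIC finishes transmitting, it fires an IRQ so the kernel can reclaim the transmit ring slots. Space in the socket send buffer itself is released only once the peer ACKs the data, because TCP must keep unacknowledged bytes for retransmission. No worker involvement.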

Worker calls epoll_wait. Sleeps.
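The write-then-switch step can be sketched as below. It adds one case the trace does not show: a non-blocking write may accept only part of the request, in which case the fd stays on EPOLLOUT. The names `forward_request` and `demo_forward` are hypothetical, and a local `socketpair()` stands in for the backend connection.

```python
import select
import socket

def forward_request(ep, backend, pending):
    """Write what the kernel will take; return the bytes that are still unsent."""
    try:
        sent = backend.send(pending)
    except BlockingIOError:              # send buffer full: nothing was written
        sent = 0
    remaining = pending[sent:]
    if remaining:
        # keep watching for writability and retry with the remainder later
        ep.modify(backend.fileno(), select.EPOLLOUT | select.EPOLLET)
    else:
        # request fully handed to the kernel: now wait for the response
        ep.modify(backend.fileno(), select.EPOLLIN | select.EPOLLET)
    return remaining

def demo_forward():
    backend, peer = socket.socketpair()
    backend.setblocking(False)
    ep = select.epoll()
    ep.register(backend.fileno(), select.EPOLLOUT | select.EPOLLET)
    request = b"GET /api/users HTTP/1.1\r\n\r\n"
    remaining = forward_request(ep, backend, request)
    received = peer.recv(1024)
    peer.send(b"HTTP/1.1 200 OK\r\n\r\n")            # the "backend" answers
    events = ep.poll(1.0)
    readable = any(bool(ev & select.EPOLLIN) for _fd, ev in events)
    ep.close()
    backend.close()
    peer.close()
    return remaining, received == request, readable

if __name__ == "__main__":
    print(demo_forward())
```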


Phase 8 — Backend Responds (Possibly Out of Order)

Who acts: Kernel ISR. Worker sleeping.

Backend processes /api/orders faster. Sends response for User B first.

Backend TCP segment for User B arrives at NIC
→ DMA → NIC Ring Buffer
→ IRQ fires → ISR runs
→ Payload copied into fd=13 recv buffer
→ fd=13 marked EPOLLIN in ready list

Later:
Backend TCP segment for User A arrives
→ Same path → fd=12 recv buffer filled
→ fd=12 marked EPOLLIN in ready list

Worker was sleeping → kernel wakes it

Buffers touched:

  • NIC Ring Buffer: written by DMA, read by ISR
  • fd=13 recv buffer: written by ISR (User B response body)
  • fd=12 recv buffer: written by ISR (User A response body)

Phase 9 — Worker Reads Backend Response, Writes to Client

Syscalls: epoll_wait, read, write

epoll_wait returns: [{fd=13, EPOLLIN}, {fd=12, EPOLLIN}]

Note: User B's response came first even though User A connected first. epoll returns what is ready, not what came first.

Handling fd=13 (User B's backend response):

read loop on fd=13:
  read(fd=13, buf, 65536)
   kernel copies backend response from fd=13 recv buffer → worker RAM
   returns n bytes

  read(fd=13, buf, 65536)
   returns -1, errno=EAGAIN (buffer empty, done)

Look up context: fd=13 → request_context_B → client_fd = 6

write(fd=6, buf, n)
 kernel copies bytes from worker RAM → fd=6 send buffer
 kernel TCP stack sends packets to User B's browser

epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=13, NULL)
 remove backend fd from interest list

close(fd=13)
 kernel tears down backend connection, frees socket struct and buffers

request_context_B.state = DONE

Handling fd=12 (User A): Same. Reads fd=12 → writes to fd=5 → closes fd=12.

Buffers touched:

  • fd=13 recv buffer: read by worker (emptied)
  • fd=12 recv buffer: read by worker (emptied)
  • fd=6 send buffer: written by worker (User B response going to browser)
  • fd=5 send buffer: written by worker (User A response going to browser)

Kernel does the rest: TCP stack drains send buffers → NIC → wire → browsers.


Phase 10 — Connection Teardown

Depends on: HTTP version and Connection header.

HTTP/1.0 or Connection: close:

epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=5, NULL)   remove from interest list before closing
close(fd=5)    kernel sends FIN to User A, tears down connection, frees socket
epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=6, NULL)
close(fd=6)    same for User B

HTTP/1.1 keep-alive (default):

fd=5 and fd=6 remain open and registered in epoll interest list
Worker goes back to epoll_wait
If same client sends another request → fd=5 fires EPOLLIN → handled immediately
nginx (not the kernel) runs the idle timer: if no new request arrives within keepalive_timeout (75 s by default) → nginx closes the connection

Worker returns to epoll_wait. Sleeping again. Ready for next event.


7. What Happens While Worker is Busy

Scenario: Worker is processing User A's request. At the same moment, Users C, D, E connect and send requests.

Worker processing [User A]
        │
        │  ← IRQ fires (User C SYN arrives)
        │     ISR runs in microseconds
        │     Kernel completes C's handshake
        │     Adds C to Accept Queue
        │     Marks fd=3 EPOLLIN in ready list
        │     Worker is NOT sleeping → no wakeup sent. That's fine.
        │
        │  ← IRQ fires (User D SYN arrives)
        │     Same. D added to Accept Queue. fd=3 already marked ready.
        │
        │  ← IRQ fires (User E data arrives on existing connection)
        │     Bytes go into fd=E's recv buffer. fd=E marked EPOLLIN.
        │
Worker finishes [User A]
        │
epoll_wait()
        │
→ returns: [{fd=3, EPOLLIN}, {fd=E, EPOLLIN}]
  (fd=3 fires once for all queued connections — edge triggered fires on transition)

Worker drains Accept Queue:
  accept() → User C → fd=7 → register with epoll
  accept() → User D → fd=8 → register with epoll
  accept() → EAGAIN → done

Worker reads fd=E
Worker calls epoll_wait. Sleeps.

Key principle: The ready list is written by ISR independently of worker state. The worker sees everything accumulated while it was busy, on its very next epoll_wait call.
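This accumulation is easy to observe from user space. A small sketch, assuming Linux; three local socket pairs stand in for clients, a short `time.sleep()` stands in for the worker being busy, and the function name is made up.

```python
import select
import socket
import time

def events_accumulate_while_busy():
    """Data arrives on three sockets while nobody is polling; one poll reports all three."""
    ep = select.epoll()
    pairs = [socket.socketpair() for _ in range(3)]
    for _tx, rx in pairs:
        rx.setblocking(False)
        ep.register(rx.fileno(), select.EPOLLIN | select.EPOLLET)
    for i, (tx, _rx) in enumerate(pairs):
        tx.send(b"req-%d" % i)        # arrives while the "worker" is not in epoll_wait
    time.sleep(0.05)                  # the worker is busy with something else
    ready = {fd for fd, _events in ep.poll(0)}
    expected = {rx.fileno() for _tx, rx in pairs}
    ep.close()
    for tx, rx in pairs:
        tx.close()
        rx.close()
    return ready == expected, len(ready)

if __name__ == "__main__":
    print(events_accumulate_while_busy())
```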


8. All Buffers — Master Reference

| # | Buffer | Location | Default Size | Written By | Read By | Full Behaviour |
|---|---|---|---|---|---|---|
| 1 | NIC Ring Buffer | NIC / kernel RAM (DMA region) | 256–4096 packet slots | NIC hardware via DMA | Kernel ISR | Packets dropped, visible via ethtool |
| 2 | SYN Queue | Kernel RAM | ~128–1024 slots | Kernel TCP stack (on SYN) | Kernel TCP stack (on ACK) | SYN drops, connection timeout |
| 3 | Accept Queue | Kernel RAM | min(backlog, somaxconn) ≈ 511 | Kernel TCP stack (post-handshake) | Process via accept() | SYN drops, connection timeout |
| 4 | Socket Recv Buffer (client fd) | Kernel RAM | tcp_rmem: typically 128 KB default, ~6 MB max (autotuned) | Kernel ISR (per interrupt) | Process via read() | TCP flow control, sender throttled |
| 5 | Socket Send Buffer (client fd) | Kernel RAM | tcp_wmem: typically 16 KB default, 4 MB max (autotuned) | Process via write() | Kernel TCP stack → NIC | write() returns EAGAIN (non-blocking) |
| 6 | Worker Heap (request_context) | Process RAM | Unbounded (heap) | Process (after read() + parse) | Process (during response handling) | OOM if too many large requests |
| 7 | Socket Recv Buffer (backend fd) | Kernel RAM | tcp_rmem: typically 128 KB default, ~6 MB max (autotuned) | Kernel ISR (backend response) | Process via read() | TCP flow control |
| 8 | Socket Send Buffer (backend fd) | Kernel RAM | tcp_wmem: typically 16 KB default, 4 MB max (autotuned) | Process via write() | Kernel TCP stack → NIC | write() returns EAGAIN |

Data movement order for one request:

NIC Ring Buffer
  → (ISR) → Socket Recv Buffer [client fd]
  → (read syscall) → Worker Heap
  → (write syscall) → Socket Send Buffer [backend fd]
  → (kernel TCP) → NIC → wire → backend

Backend response:
NIC Ring Buffer
  → (ISR) → Socket Recv Buffer [backend fd]
  → (read syscall) → Worker Heap
  → (write syscall) → Socket Send Buffer [client fd]
  → (kernel TCP) → NIC → wire → client browser

9. All Syscalls — Master Reference

| Syscall | Called By | When | What Kernel Does |
|---|---|---|---|
| socket() | Master (listen socket), Worker (backend conns) | Startup + per backend conn | Allocates socket struct, returns fd |
| setsockopt() | Master | Startup | Sets socket options (SO_REUSEADDR etc.) |
| bind() | Master | Startup | Claims port 80 for this socket |
| listen() | Master | Startup | Creates SYN + Accept queues |
| accept() | Worker | On fd=3 EPOLLIN | Dequeues one entry from Accept Queue, returns new fd |
| fcntl(F_SETFL, O_NONBLOCK) | Worker | After accept/socket | Sets fd to non-blocking mode |
| epoll_create1() | Worker | Startup | Creates epoll instance |
| epoll_ctl(ADD) | Worker | After accept/socket | Adds fd to interest list |
| epoll_ctl(MOD) | Worker | After connect/read | Changes watched events (EPOLLIN ↔ EPOLLOUT) |
| epoll_ctl(DEL) | Worker | Before close | Removes fd from interest list |
| epoll_wait() | Worker | After each batch | Sleeps until events; returns ready fd list |
| connect() | Worker | Per upstream conn | Initiates TCP handshake (non-blocking, returns EINPROGRESS) |
| read() | Worker | On EPOLLIN | Copies bytes from socket recv buffer → process RAM |
| write() | Worker | After parsing/response | Copies bytes from process RAM → socket send buffer |
| close() | Worker | After response sent | Sends FIN, frees socket struct and all buffers |

10. Failure Modes

10.1 Accept Queue Full

Cause: Connections arriving faster than worker calls accept().
Symptom: Client sees Connection timed out (kernel drops SYN silently).
Fix: Increase net.core.somaxconn, increase nginx backlog, add more workers.

10.2 Event Loop Starvation

Cause: Worker doing heavy CPU computation or blocking syscall (sync file I/O, blocking DB call) for one connection.
Symptom: All other connections queue up. Latency spikes for everyone.
Fix: Never block the event loop. Offload blocking file I/O to a thread pool (aio threads in nginx). Move CPU-heavy work out of the worker, and use async database drivers in any code that runs inside it.

10.3 Send Buffer Full (Slow Client)

Cause: Client reading response slowly. fd=5 send buffer fills up.
Symptom: write(fd=5) returns EAGAIN. nginx registers EPOLLOUT on fd=5 and comes back to write more when the buffer drains.
Fix: nginx handles this internally. Tune send_timeout to close connections from very slow clients.

10.4 Recv Buffer Full (Slow Worker)

Cause: Data arriving faster than worker reads it.
Symptom: TCP flow control kicks in. Sender throttled. No data loss.
Fix: Usually not a problem. If persistent, check for event loop blocking.

10.5 fd Exhaustion

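Cause: Too many open connections. Each client connection costs one fd, and each proxied request costs a second fd for the backend connection.
Symptom: accept() returns EMFILE (too many open files).
Fix: Raise the fd limit (ulimit -n 65536 in the OS, or worker_rlimit_nofile 65536 in nginx config) and set worker_connections (e.g. 10240) to match.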
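A rough way to check the fd budget of a process. This is a back-of-the-envelope sketch: the two-fds-per-proxied-request figure follows the client-plus-backend model used in this reference, and the `reserved` allowance of 16 fds for logs, the listen socket, and epoll is an arbitrary assumption.

```python
import resource

def max_proxied_requests(soft_limit, fds_per_request=2, reserved=16):
    """How many concurrent proxied requests fit under an fd limit."""
    usable = max(soft_limit - reserved, 0)
    return usable // fds_per_request

def current_fd_budget():
    """Read this process's RLIMIT_NOFILE and apply the estimate to the soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard, None
    return soft, hard, max_proxied_requests(soft)

if __name__ == "__main__":
    soft, hard, budget = current_fd_budget()
    print(f"soft={soft} hard={hard} estimated concurrent proxied requests={budget}")
```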


Quick Mental Model

INTERNET
   │ raw bytes
   ▼
NIC Ring Buffer  ← DMA (no CPU)
   │ IRQ fires
   ▼
Kernel ISR  ← runs in microseconds, interrupts everything
   │ copies bytes
   ▼
Socket Recv Buffer  ← kernel RAM, per fd
   │ waits here until worker is ready
   ▼
epoll Ready List  ← kernel marks fd ready
   │ wakes worker if sleeping
   ▼
Worker (epoll_wait returns)
   │ read() → copies to process RAM
   │ parse HTTP
   │ write() to backend fd → send buffer
   │
   ▼
Socket Send Buffer  ← kernel RAM, per fd
   │ kernel drains via TCP
   ▼
NIC → wire → backend

Backend response takes same path in reverse.
Worker links client fd ↔ backend fd via request_context in its own memory.
Kernel has no concept of this pairing.
Everything the worker does is non-blocking.
When waiting, worker is off the CPU run queue. Zero cycles consumed.
