kamal namdeo

Posted on Jun 27

nginx Event Loop — Complete Lifecycle Reference

#nginx #eventloop #web #linux

A precise, bottom-up reference covering every buffer, syscall, interrupt, and data movement from the moment a TCP packet hits the NIC to the moment a response is sent back. Two concurrent users are used throughout as a concrete example.

Foundations — fd and Socket
Hardware Layer — NIC, DMA, Interrupts
Kernel Structures and All Buffers
epoll — How the Worker Waits Efficiently
nginx Startup Sequence
Complete Request Lifecycle — Two Concurrent Users
What Happens While Worker is Busy
All Buffers — Master Reference
All Syscalls — Master Reference
Failure Modes

1. Foundations

1.1 Everything is a File

Linux's core philosophy: every I/O resource — files on disk, network connections, pipes, terminals, devices — is represented as a file. This means one unified API (read, write, close) works on all of them. The kernel manages the actual resource. Your process holds a token.

1.2 File Descriptor (fd)

A file descriptor is just an integer. It is a per-process token that refers to a kernel-managed resource. The kernel maintains a table per process called the fd table — a simple array where the index is the fd and the value is a pointer into the kernel.

Process fd table:
┌─────┬───────────────────────────────┐
│ fd  │ points to                     │
├─────┼───────────────────────────────┤
│  0  │ stdin                         │
│  1  │ stdout                        │
│  2  │ stderr                        │
│  3  │ listen socket (nginx)         │
│  5  │ User A client connection      │
│  6  │ User B client connection      │
│ 12  │ backend connection for User A │
│ 13  │ backend connection for User B │
└─────┴───────────────────────────────┘

0, 1, 2 are pre-assigned by convention (stdin, stdout, stderr). Application fds start from 3 upward, and the kernel always hands out the lowest unused number. The fd is meaningless on its own. It only means something when passed to a syscall, where the kernel uses it to look up the real resource.
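
To see that an fd really is just a small integer index, it can help to open a pipe and look at what comes back. The sketch below is illustrative only: a pipe stands in for any kernel-managed resource, and the variable names are made up for this example. It also shows the kernel reusing the lowest free slot after a close.

```python
import os

# pipe() asks the kernel for two new resources and returns two fds.
read_fd, write_fd = os.pipe()
print(type(read_fd), read_fd, write_fd)   # plain ints; exact values vary

# Closing an fd frees its slot in the per-process table...
os.close(read_fd)

# ...and the next allocation takes the lowest free slot, which is that one.
reused_fd = os.dup(write_fd)
print("slot reused:", reused_fd == read_fd)

os.close(reused_fd)
os.close(write_fd)
```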

1.3 Socket

A socket is the kernel's internal data structure representing one end of a network connection. Created when your process calls socket(). Lives entirely in kernel RAM. Your process never touches it directly.

Socket struct (kernel RAM):
┌────────────────────────────────────┐
│ local_ip:    192.168.1.10          │
│ local_port:  80                    │
│ remote_ip:   203.0.113.5           │
│ remote_port: 54231                 │
│ state:       ESTABLISHED           │
│ recv_buffer: [ ...incoming bytes ] │
│ send_buffer: [ ...outgoing bytes ] │
│ tcp_state_machine: ...             │
│ timers, seq numbers, window size   │
└────────────────────────────────────┘

1.4 How fd and Socket Relate

PROCESS (user space)            KERNEL (kernel space)

fd table                        Open file table         Socket struct
┌──────┐                        ┌─────────────┐         ┌──────────────┐
│ fd=5 │ ─────────────────────► │ file entry  │ ──────► │ socket {     │
└──────┘                        │ (flags,     │         │  recv buffer │
                                │  offset)    │         │  send buffer │
                                └─────────────┘         │  state ...   │
                                                        └──────────────┘

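This two-level indirection exists so that several fds can refer to the same socket. A child inherits its parent's fds across fork() (same fd numbers, same underlying file entry), and dup() creates a second fd number for the same entry inside one process.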
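
The indirection can be observed from user space. The following sketch, which uses a UNIX socketpair as a stand-in for a TCP connection (an assumption made so it runs without a network), duplicates an fd and shows that both numbers reach the same file entry and the same socket.

```python
import fcntl
import os
import socket

# socketpair() gives two connected sockets without touching the network.
left, right = socket.socketpair()

# dup() creates a second fd that points at the SAME open file entry.
alias_fd = os.dup(left.fileno())

# Set O_NONBLOCK through the original fd...
flags = fcntl.fcntl(left.fileno(), fcntl.F_GETFL)
fcntl.fcntl(left.fileno(), fcntl.F_SETFL, flags | os.O_NONBLOCK)

# ...and the flag is visible through the alias, because status flags live
# in the shared file entry, not in the per-process fd table.
alias_is_nonblocking = bool(fcntl.fcntl(alias_fd, fcntl.F_GETFL) & os.O_NONBLOCK)

# Bytes written through the alias arrive at the peer of the original socket.
os.write(alias_fd, b"ping")
received = right.recv(4)

os.close(alias_fd)
left.close()
right.close()
```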

2. Hardware Layer

2.1 NIC (Network Interface Card)

A physical chip on the motherboard (or PCIe slot). It receives electrical/optical signals from the network cable, decodes them into bytes, and writes them into RAM. The CPU has no involvement in receiving the bytes themselves — the NIC does it autonomously.

2.2 DMA (Direct Memory Access)

DMA is the mechanism that lets the NIC write bytes directly into RAM without involving the CPU. The kernel sets up a region of RAM called the NIC Ring Buffer when the NIC driver initializes and tells the NIC its address. The NIC writes incoming packets straight there. Zero CPU cycles spent moving bytes.

2.3 NIC Ring Buffer

A circular array of fixed-size slots in kernel RAM, allocated at NIC driver initialization.

NIC Ring Buffer (in kernel RAM):
┌──────────┬──────────┬──────────┬──────────┐
│ slot 0   │ slot 1   │ slot 2   │ slot 3   │  ← NIC writes here via DMA
│ pkt data │ pkt data │ (empty)  │ (empty)  │
└──────────┴──────────┴──────────┴──────────┘
      ▲                     ▲
   NIC writes            kernel reads (via interrupt)

Size: 256 to 4096 slots typically. Each slot holds one packet (up to 1500 bytes for standard Ethernet MTU).
Who writes: NIC hardware via DMA.
Who reads: Kernel interrupt handler (ISR).
What happens when full: Packets are dropped. Visible via ethtool -S eth0 | grep drop or ip -s link.

2.4 Hardware Interrupt (IRQ — Interrupt Request)

When the NIC finishes writing one or more packets into the Ring Buffer via DMA, it sends an electrical signal on a dedicated wire to the CPU. This signal is an IRQ (Interrupt Request).

The CPU responds by:

Finishing the current instruction (not mid-instruction).
Saving its register state (so it can resume later).
Looking up the interrupt number in the IDT (Interrupt Descriptor Table) — a kernel table mapping interrupt numbers to handler functions.
Jumping to the handler function.
Restoring state and resuming what it was doing before.

This entire sequence takes microseconds. The user process (nginx worker) does not know it happened.

2.5 ISR (Interrupt Service Routine)

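The kernel function registered to handle NIC interrupts. On modern Linux the hard-IRQ handler itself does very little: it schedules a softirq (NAPI poll) that performs the steps below shortly afterwards. This reference calls the whole path "the ISR" for brevity. When called: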

Reads packet(s) from the NIC Ring Buffer.
Parses the Ethernet frame, IP header, TCP header.
Identifies which socket this packet belongs to (by matching src_ip:src_port:dst_ip:dst_port).
Copies payload bytes into that socket's recv buffer in kernel RAM.
Updates TCP sequence numbers, ACKs, window size.
Marks that socket's fd as ready (EPOLLIN) in the epoll ready list.
If the process was sleeping in epoll_wait, wakes it up.
Returns. CPU resumes what it was doing.

3. Kernel Structures and All Buffers

3.1 SYN Queue (Incomplete Connections Queue)

When a client sends a SYN packet to initiate a TCP connection:

Kernel puts this half-open connection into the SYN Queue.
Sends back SYN-ACK.
Waits for the client's final ACK to complete the handshake.

Tunable:  /proc/sys/net/ipv4/tcp_max_syn_backlog
Default:  128 (older kernels) / 1024 (newer)

Who writes: Kernel TCP stack on receiving SYN.
Who reads: Kernel TCP stack on receiving final ACK — moves entry to Accept Queue.
What happens when full: Kernel drops incoming SYNs silently. Client retries. Appears as connection timeout.

SYN flood attacks fill this queue deliberately. tcp_syncookies is the mitigation.

3.2 Accept Queue (Complete Connections Queue)

After the 3-way handshake completes, the connection moves from SYN Queue to the Accept Queue. These connections are fully established and waiting for your process to call accept().

Effective size = min(listen() backlog argument, net.core.somaxconn)

net.core.somaxconn default: 128 (kernel < 5.4) / 4096 (kernel >= 5.4)
nginx default backlog:      511

So on modern Linux: min(511, 4096) = 511 slots

Who writes: Kernel TCP stack when handshake completes.
Who reads: Your process via accept() syscall.
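What happens when full: By default the kernel drops the client's final ACK, and new SYNs for that listener, so the handshake never completes on the server side. The client retransmits and eventually sees a connection timeout.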
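
The effective-size rule is easy to check on a given machine. This is a minimal sketch: the helper names are hypothetical, and the fallback of 4096 is only an assumption for systems where /proc is not readable.

```python
from pathlib import Path

def effective_accept_queue(backlog: int, somaxconn: int) -> int:
    """Accept queue limit as described above: min(backlog, somaxconn)."""
    return min(backlog, somaxconn)

def read_somaxconn(default: int = 4096) -> int:
    """Read net.core.somaxconn; fall back to `default` if /proc is unavailable."""
    try:
        return int(Path("/proc/sys/net/core/somaxconn").read_text().strip())
    except (OSError, ValueError):
        return default

# 511 is the nginx default backlog quoted above.
print("effective accept queue:", effective_accept_queue(511, read_somaxconn()))
```

On a host where somaxconn is still 128, the same call reports 128, which is why raising the nginx backlog alone changes nothing there.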

3.3 Socket Receive Buffer

Per-connection buffer in kernel RAM. Incoming payload bytes (from ISR) land here after the TCP header is stripped.

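Default size: ~208 KB  (/proc/sys/net/core/rmem_default = 212992)
Maximum size: set by /proc/sys/net/core/rmem_max (stock default is commonly 212992; 134217728 = 128 MB is a tuned value)
Auto-tuning:  for TCP sockets, Linux sizes the buffer dynamically within the min/default/max triple in /proc/sys/net/ipv4/tcp_rmem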

Who writes: Kernel ISR (on hardware interrupt).
Who reads: Your process via read() or recv() syscall.
What happens when full: The kernel keeps ACKing but advertises a shrinking receive window. When the buffer is full the advertised window reaches zero and the sender stops sending until the window reopens. This is TCP flow control. No data loss: the sender just waits.
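
The sizes actually in effect for one socket can be read back with getsockopt. A small sketch follows; the numbers printed depend on the kernel version and sysctl settings, so no particular output is claimed.

```python
import socket

# A fresh TCP socket; it is never connected, we only inspect its buffers.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(f"recv buffer: {rcvbuf} bytes, send buffer: {sndbuf} bytes")

sock.close()
```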

3.4 Socket Send Buffer

Per-connection buffer in kernel RAM. When your process calls write(), bytes go here. Kernel TCP stack drains this buffer by constructing TCP packets and sending them out via the NIC.

Default size: ~208 KB  (/proc/sys/net/core/wmem_default = 212992)
Maximum size: set by /proc/sys/net/core/wmem_max (134217728 = 128 MB is a tuned value); TCP sockets auto-tune within /proc/sys/net/ipv4/tcp_wmem

Who writes: Your process via write() or send() syscall.
Who reads: Kernel TCP stack (constructs packets and hands to NIC).
What happens when full: write() blocks (in blocking mode) or returns EAGAIN (in non-blocking mode). nginx uses non-blocking — it registers EPOLLOUT on the fd and retries when kernel signals the buffer has drained.
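
The full-buffer behaviour can be reproduced without a slow client. The sketch below is an illustration under stated assumptions: it uses a UNIX socketpair rather than TCP, and simply never reads from the peer until the writer has hit EAGAIN (surfaced in Python as BlockingIOError).

```python
import socket

writer, reader = socket.socketpair()
writer.setblocking(False)

chunk = b"x" * 4096
queued = 0
hit_eagain = False
while True:
    try:
        queued += writer.send(chunk)   # copies bytes into the send buffer
    except BlockingIOError:            # EAGAIN: the buffer is full
        hit_eagain = True
        break
print(f"kernel accepted {queued} bytes before reporting EAGAIN")

# Once the peer reads, space opens up and writes succeed again.
reader.setblocking(False)
drained = 0
while drained < queued:
    try:
        drained += len(reader.recv(65536))
    except BlockingIOError:
        break
sent_after_drain = writer.send(b"more")

writer.close()
reader.close()
```

An event loop would not spin like this: after EAGAIN it registers EPOLLOUT and resumes writing when the kernel reports the fd writable.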

3.5 nginx Worker Memory (User Space)

When the worker calls read(fd, buf, size), the kernel copies bytes from the socket recv buffer (kernel RAM) into buf (process RAM). This is the boundary crossing — from kernel space to user space.

The worker parses HTTP in this memory and builds request_context objects:

request_context {
  client_fd:   5           ← incoming connection fd
  backend_fd:  12          ← outgoing connection fd
  state:       WAITING_BACKEND
  method:      GET
  path:        /api/users
  headers:     { host, auth, content-type ... }
  body:        (if POST)
  response:    (filled later)
}

This is the only place that links a client fd to its backend fd. The kernel has no concept of this pairing.
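
One way to picture this pairing is a small record plus a lookup table keyed by fd. The sketch below is hypothetical (it is not nginx's actual data structure, and all names are invented here); it only shows how an event on either fd can lead back to the same context.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class RequestContext:
    client_fd: int
    backend_fd: int = -1            # -1 until a backend connection exists
    state: str = "PARSED"
    method: str = ""
    path: str = ""
    headers: Dict[str, str] = field(default_factory=dict)
    response: bytes = b""

# Both the client fd and the backend fd map to the same context object.
contexts_by_fd: Dict[int, RequestContext] = {}

def attach_backend(ctx: RequestContext, backend_fd: int) -> None:
    """Record the backend fd and make the context reachable from it."""
    ctx.backend_fd = backend_fd
    ctx.state = "WAITING_BACKEND"
    contexts_by_fd[backend_fd] = ctx

ctx_a = RequestContext(client_fd=5, method="GET", path="/api/users")
contexts_by_fd[ctx_a.client_fd] = ctx_a
attach_backend(ctx_a, 12)

# An event on fd=12 leads straight back to the client fd that must be answered.
print(contexts_by_fd[12].client_fd)
```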

4. epoll

4.1 What is epoll

epoll is a Linux kernel subsystem that lets a single process monitor thousands of fds simultaneously without scanning all of them. It is the engine behind nginx's ability to handle 10,000+ concurrent connections in a single thread.

Three kernel structures make up epoll:

4.2 epoll Instance

Created via epoll_create1(). Returns an fd that represents the epoll instance itself. The instance contains:

Interest List — which fds to watch and what events.
Ready List — which fds currently have events pending.

4.3 Interest List

A red-black tree (O(log n) insertion/deletion) inside the kernel. Each node represents one watched fd and what event to watch for.

Interest list (red-black tree in kernel RAM):
┌─────────────────────────────────┐
│ fd=3,  events=EPOLLIN           │  ← new client connections
│ fd=5,  events=EPOLLIN           │  ← User A client data
│ fd=6,  events=EPOLLIN           │  ← User B client data
│ fd=12, events=EPOLLIN|EPOLLOUT  │  ← backend for User A
│ fd=13, events=EPOLLIN|EPOLLOUT  │  ← backend for User B
└─────────────────────────────────┘

Modified by: epoll_ctl(epfd, EPOLL_CTL_ADD/MOD/DEL, fd, event) — always called by worker.

4.4 Ready List

A doubly-linked list inside the kernel. When the ISR marks an fd as having data, it adds that fd to this list. epoll_wait() reads and clears this list.

Written by: Kernel ISR (on hardware interrupt) — completely independent of worker state.
Read by: Worker via epoll_wait().

This separation is critical: the ready list is updated even when the worker is busy. Nothing is lost.
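
This can be demonstrated from user space. In the sketch below (socketpairs stand in for client connections, an assumption made for portability), data is sent while nobody is inside epoll_wait, and the next call still reports both fds.

```python
import select
import socket

ep = select.epoll()
peer_a, conn_a = socket.socketpair()
peer_b, conn_b = socket.socketpair()
fd_a, fd_b = conn_a.fileno(), conn_b.fileno()
ep.register(fd_a, select.EPOLLIN)
ep.register(fd_b, select.EPOLLIN)

# The "worker" is busy: no thread is sleeping in epoll_wait right now.
peer_a.send(b"request from A")
peer_b.send(b"request from B")

# The next epoll_wait returns everything that accumulated, in one batch.
events = ep.poll(1.0)
ready_fds = {fd for fd, _mask in events}
print("ready:", sorted(ready_fds))

ep.close()
for s in (peer_a, conn_a, peer_b, conn_b):
    s.close()
```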

4.5 Edge Triggered vs Level Triggered

nginx uses Edge Triggered (EPOLLET).

Mode	epoll_wait returns fd when	Risk if you don't drain fully
Level Triggered (default)	fd has data (keeps returning until empty)	None — will re-notify
Edge Triggered (EPOLLET)	fd transitions from no-data → has-data	Will NOT re-notify until new data arrives

Edge triggered means nginx must loop read() until it gets EAGAIN on every wakeup. If it reads only once and goes back to epoll_wait, remaining data in the buffer will never trigger a new event and that connection stalls.

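The same rule applies to accept() on an edge-triggered listen socket, which is the model this walkthrough uses: loop accept() until EAGAIN to drain the entire accept queue every time fd=3 fires. (Stock nginx is more conservative here: it registers listen sockets level-triggered and accepts one connection per wakeup unless multi_accept is on.)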
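
The difference between the two modes shows up as soon as a read is left incomplete. The following sketch (a socketpair stands in for a client connection, and the helper name is made up) sends 10 bytes, reads only 4, and then asks epoll again.

```python
import select
import socket

def events_after_partial_read(extra_flags: int) -> int:
    """Send 10 bytes, read only 4, then count events on the next poll."""
    sender, receiver = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(receiver.fileno(), select.EPOLLIN | extra_flags)
        sender.send(b"0123456789")
        assert ep.poll(1.0)           # the first notification arrives in both modes
        receiver.recv(4)              # deliberately leave 6 bytes unread
        return len(ep.poll(0.1))      # is the fd reported again?
    finally:
        ep.close()
        sender.close()
        receiver.close()

level = events_after_partial_read(0)               # level triggered (default)
edge = events_after_partial_read(select.EPOLLET)   # edge triggered
print("level-triggered re-notifies:", level == 1)
print("edge-triggered re-notifies:", edge == 1)
```

In the edge-triggered case the 6 leftover bytes sit in the recv buffer with no further event, which is exactly the stall described above.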

4.6 epoll Syscalls

Syscall	What it does
`epoll_create1(flags)`	Creates epoll instance, returns epfd
`epoll_ctl(epfd, op, fd, event)`	ADD/MOD/DEL an fd in the interest list
`epoll_wait(epfd, events[], maxevents, timeout)`	Sleep until events ready; returns array of ready fds

timeout = -1 means sleep indefinitely. The process is removed from the CPU scheduler run queue. Zero CPU consumed while sleeping.

5. nginx Startup Sequence

Master Process — Before Workers Start

nginx has two process roles: one master process and one or more worker processes.

Master process:

Runs as root.
Reads nginx.conf. Determines worker_processes count — if set to auto, nginx reads the number of CPU cores on the machine (nproc) and spawns that many workers.
Calls socket(), bind(), listen() on port 80 once, before forking. This is deliberate: binding to a port below 1024 requires root privileges (or CAP_NET_BIND_SERVICE). Workers never need to bind themselves.
Calls fork() once per worker. Each forked child inherits fd=3 (the listen socket) automatically. This is how all workers share the same port without any of them needing root.
After forking, master drops into a supervision loop — watches workers, restarts any that crash, handles signals (nginx -s reload, nginx -s stop).
Master itself never handles a single HTTP request.

After fork, each worker independently:

Calls epoll_create1() — gets its own epoll instance.
Calls epoll_ctl(ADD, fd=3, EPOLLIN) — registers the inherited listen socket.
Calls epoll_wait() — sleeps, waiting for connections.

Master Process (root)
  socket() + bind() + listen() on fd=3
        │
        ├── fork() → Worker 1 (inherits fd=3) → own epoll → epoll_wait
        ├── fork() → Worker 2 (inherits fd=3) → own epoll → epoll_wait
        ├── fork() → Worker 3 (inherits fd=3) → own epoll → epoll_wait
        └── fork() → Worker 4 (inherits fd=3) → own epoll → epoll_wait

All workers call accept() on the same fd=3.
Kernel distributes incoming connections across workers.
Master just watches. Restarts crashed workers.

When multiple workers call accept() on the same fd=3 simultaneously, the kernel ensures only one worker gets each connection — no duplication. nginx also sets accept_mutex (older versions) or relies on EPOLLEXCLUSIVE flag (modern Linux) to avoid all workers waking up for every single connection — known as the thundering herd problem.

The full syscall sequence below runs once when nginx starts: steps 1-4 in the master, steps 5-7 in each worker. A worker does not handle any requests until this is complete.

1.  socket(AF_INET, SOCK_STREAM, 0)
    → kernel creates socket struct
    → returns fd=3 (listen socket)

2.  setsockopt(fd=3, SO_REUSEADDR, 1)
    → allows reuse of port 80 after restart
    → avoids "Address already in use" error

3.  bind(fd=3, {ip=0.0.0.0, port=80})
    → registers this socket as owner of port 80 with kernel
    → kernel now routes port 80 packets to fd=3

4.  listen(fd=3, backlog=511)
    → kernel creates SYN Queue and Accept Queue for this socket
    → accept queue limit = min(511, somaxconn)
    → kernel begins accepting TCP handshakes even before accept() is called

5.  epoll_create1(EPOLL_CLOEXEC)
    → creates epoll instance
    → returns epfd=4

6.  epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=3, {EPOLLIN|EPOLLET})
    → registers listen socket in interest list
    → "wake me up when a new connection is ready to accept"

7.  epoll_wait(epfd=4, events[], 512, -1)
    → worker sleeps. 0 CPU consumed. Waiting for first knock.
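
The same seven steps can be sketched in a few lines of Python, whose socket and select modules wrap these syscalls directly. Assumptions that differ from the walkthrough: the socket binds to 127.0.0.1 on port 0 (kernel-chosen) so the script needs no root, and the final wait uses a timeout of 0 instead of -1 so the script returns.

```python
import select
import socket

# Steps 1-2: create the listen socket and allow quick rebinding after restart.
listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Step 3: bind (loopback and port 0 are assumptions for this sketch).
listen_sock.bind(("127.0.0.1", 0))

# Step 4: the kernel now completes handshakes on our behalf.
listen_sock.listen(511)
listen_sock.setblocking(False)

# Steps 5-6: create the epoll instance and register the listen socket.
ep = select.epoll()
ep.register(listen_sock.fileno(), select.EPOLLIN | select.EPOLLET)

# Step 7: wait for events. Nobody has connected, so the batch is empty.
events = ep.poll(0)
port = listen_sock.getsockname()[1]
print(f"listening on port {port}, pending events: {events}")

ep.close()
listen_sock.close()
```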

6. Complete Request Lifecycle

Cast

Symbol	What
`fd=3`	nginx listen socket
`fd=5`	User A ↔ nginx TCP connection
`fd=6`	User B ↔ nginx TCP connection
`fd=12`	nginx ↔ backend TCP connection for User A
`fd=13`	nginx ↔ backend TCP connection for User B
`epfd=4`	epoll instance

Phase 1 — TCP Handshake (Kernel Only, Worker Sleeping)

Who acts: Kernel TCP stack + NIC. Worker is untouched.

User A browser                    Kernel (port 80)
      │── SYN ──────────────────► NIC receives packet
      │                           ISR runs: puts in SYN Queue
      │◄── SYN-ACK ─────────────  Kernel sends SYN-ACK via NIC
      │── ACK ──────────────────► ISR runs: moves to Accept Queue
      │                           Connection's socket exists, but no fd yet (accept() assigns fd=5)
      │                           Marks fd=3 as EPOLLIN in ready list

User B browser (simultaneously)
      │── SYN ──────────────────► Same process
      │◄── SYN-ACK ─────────────
      │── ACK ──────────────────► Second entry in Accept Queue (accept() will assign fd=6)
                                  fd=3 still marked EPOLLIN

Buffers touched:

SYN Queue: written (User A SYN), read (on ACK received), written again (User B SYN)
Accept Queue: written twice (one entry per completed handshake)
NIC Ring Buffer: written by NIC DMA, read by ISR

Interrupts: Hardware IRQ fires per packet. ISR runs. Worker untouched.

Phase 2 — Worker Wakes, Accepts Both Connections

Syscalls: epoll_wait, accept, epoll_ctl, fcntl

epoll_wait(epfd=4, events[], 512, -1)
→ returns: [{fd=3, EPOLLIN}]
→ worker wakes up

Worker enters accept loop (must drain fully because EPOLLET):

Iteration 1:
  accept(fd=3)
  → kernel dequeues User A's connection from Accept Queue
  → creates fd=5 for this connection
  → returns fd=5

  fcntl(fd=5, F_SETFL, O_NONBLOCK)
  → sets fd=5 to non-blocking mode

  epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=5, {EPOLLIN|EPOLLET})
  → fd=5 added to interest list

Iteration 2:
  accept(fd=3)
  → dequeues User B's connection
  → returns fd=6

  fcntl(fd=6, F_SETFL, O_NONBLOCK)
  epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=6, {EPOLLIN|EPOLLET})

Iteration 3:
  accept(fd=3)
  → returns -1, errno=EAGAIN
  → Accept Queue is empty. Stop looping.

Buffers touched:

Accept Queue: read (two entries dequeued)

Interest list now:

fd=3  EPOLLIN  ← always watching for new clients
fd=5  EPOLLIN  ← User A
fd=6  EPOLLIN  ← User B

Worker calls epoll_wait again. Sleeps.
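
The accept loop above can be sketched as follows. This is an illustration, not nginx's code: the function name is hypothetical, the listener lives on loopback with a kernel-chosen port, and two local client sockets play the roles of User A and User B.

```python
import select
import socket

def drain_accept_queue(listen_sock, ep):
    """accept() until EAGAIN; register every new connection for reads."""
    accepted = []
    while True:
        try:
            conn, _addr = listen_sock.accept()
        except BlockingIOError:       # EAGAIN: the accept queue is empty
            return accepted
        conn.setblocking(False)
        ep.register(conn.fileno(), select.EPOLLIN | select.EPOLLET)
        accepted.append(conn)

listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_sock.bind(("127.0.0.1", 0))
listen_sock.listen(511)
listen_sock.setblocking(False)
ep = select.epoll()
ep.register(listen_sock.fileno(), select.EPOLLIN | select.EPOLLET)

# Two clients complete their handshakes before the worker looks.
user_a = socket.create_connection(listen_sock.getsockname())
user_b = socket.create_connection(listen_sock.getsockname())

ep.poll(1.0)                          # wakes because the listen fd is readable
connections = drain_accept_queue(listen_sock, ep)
print("accepted:", len(connections))

for s in connections + [user_a, user_b, listen_sock]:
    s.close()
ep.close()
```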

Phase 3 — HTTP Request Bytes Arrive

Who acts: NIC + Kernel ISR. Worker sleeping.

User A's browser sends GET /api/users HTTP/1.1\r\nHost: ...\r\n\r\n

NIC receives TCP segment
→ DMA writes to NIC Ring Buffer
→ NIC fires hardware IRQ
→ CPU saves state, jumps to ISR

ISR:
  reads packet from NIC Ring Buffer
  parses IP + TCP headers
  matches src_ip:src_port:dst_ip:dst_port → identifies fd=5's socket
  copies HTTP payload into fd=5's recv buffer (kernel RAM)
  updates TCP: ACK sent back to User A's browser
  marks fd=5 as EPOLLIN in epoll ready list
  checks: is worker sleeping in epoll_wait? YES → wakes worker
  returns

User B's GET /api/orders HTTP/1.1 arrives simultaneously or milliseconds later. Same ISR sequence for fd=6.

Buffers touched:

NIC Ring Buffer: written by DMA, read by ISR
fd=5 recv buffer: written by ISR (~hundreds of bytes, the HTTP request)
fd=6 recv buffer: written by ISR

Interrupts: One IRQ per packet per user. ISR runs in microseconds each time.

Phase 4 — Worker Reads and Parses Both Requests

Syscalls: epoll_wait, read

epoll_wait(epfd=4, events[], 512, -1)
→ returns: [{fd=5, EPOLLIN}, {fd=6, EPOLLIN}]
→ both events in one batch

Processing fd=5 (User A):

Worker enters read loop (must drain, EPOLLET):

  Iteration 1:
    read(fd=5, buf, 4096)
    → kernel copies bytes from fd=5 recv buffer (kernel RAM)
      into buf (worker process RAM)
    → returns n=312 (bytes read, size of HTTP request)

  Iteration 2:
    read(fd=5, buf, 4096)
    → returns -1, errno=EAGAIN
    → recv buffer empty. Stop.

Worker parses buf:
  method  = GET
  path    = /api/users
  headers = { Host: ..., Authorization: Bearer xyz }

Worker creates:
  request_context_A {
    client_fd  = 5
    backend_fd = -1       ← not yet
    state      = PARSED
    request    = "GET /api/users..."
  }

Processing fd=6 (User B): Same sequence. request_context_B created with client_fd=6.

Buffers touched:

fd=5 recv buffer: read by worker (now empty)
fd=6 recv buffer: read by worker (now empty)
Worker heap memory: written (two request_context objects)

Kernel boundary crossing: Every read() call copies bytes from kernel RAM → process RAM.
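
A read loop of this shape, followed by a very small parse of the request line, might look like the sketch below. The helper names are invented for illustration, the parser handles only the first line, and a socketpair stands in for User A's connection.

```python
import socket

def drain_socket(sock) -> bytes:
    """recv() until EAGAIN (or EOF) and return everything that was buffered."""
    chunks = []
    while True:
        try:
            data = sock.recv(4096)
        except BlockingIOError:       # EAGAIN: recv buffer is empty
            break
        if not data:                  # peer closed the connection
            break
        chunks.append(data)
    return b"".join(chunks)

def parse_request_line(raw: bytes):
    """Split 'GET /path HTTP/1.1' into its three parts."""
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, version = request_line.split(" ")
    return method, path, version

browser, client_conn = socket.socketpair()
client_conn.setblocking(False)
browser.send(b"GET /api/users HTTP/1.1\r\nHost: example.test\r\n\r\n")

raw = drain_socket(client_conn)
method, path, version = parse_request_line(raw)
print(method, path, version)

browser.close()
client_conn.close()
```

Unlike the pseudocode, the sketch appends each chunk instead of reusing one buffer, so a request that spans several reads is not overwritten.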

Phase 5 — Worker Opens Backend Connections

Syscalls: socket, fcntl, connect, epoll_ctl

For User A:

socket(AF_INET, SOCK_STREAM, 0)
→ kernel creates new socket struct
→ returns fd=12

fcntl(fd=12, F_SETFL, O_NONBLOCK)
→ makes fd=12 non-blocking

connect(fd=12, {backend_ip, port=8080})
→ kernel initiates TCP handshake with backend (sends SYN)
→ returns -1 immediately with errno=EINPROGRESS
   (handshake happening asynchronously in kernel)

epoll_ctl(epfd=4, EPOLL_CTL_ADD, fd=12, {EPOLLOUT|EPOLLET})
→ "tell me when fd=12 is writable" = when handshake completes

request_context_A.backend_fd = 12
request_context_A.state      = CONNECTING_BACKEND

For User B: Same. Creates fd=13, connects, registers EPOLLOUT.

Interest list now:

fd=3   EPOLLIN          ← new clients
fd=5   EPOLLIN          ← User A client
fd=6   EPOLLIN          ← User B client
fd=12  EPOLLOUT         ← backend for User A (connecting)
fd=13  EPOLLOUT         ← backend for User B (connecting)

Worker calls epoll_wait. Sleeps. 5 fds watched. 0 CPU consumed.

Phase 6 — Backend TCP Handshakes Complete

Who acts: Kernel. Worker sleeping.

Backend sends SYN-ACK for User A's connect (fd=12)
→ NIC receives, IRQ fires, ISR runs
→ Kernel completes handshake, fd=12 state = ESTABLISHED
→ fd=12 is now writable
→ ISR marks fd=12 EPOLLOUT in ready list
→ Worker is sleeping → kernel wakes it

Same for User B (fd=13)
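
Phases 5 and 6 together can be sketched like this. A local listener plays the backend (an assumption for the example), and one extra step is included that the walkthrough leaves implicit: after EPOLLOUT fires, SO_ERROR is checked, because a failed connect also makes the fd report ready.

```python
import errno
import select
import socket

# Stand-in backend on loopback (hypothetical; any reachable server would do).
backend_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
backend_listener.bind(("127.0.0.1", 0))
backend_listener.listen(16)

ep = select.epoll()
backend_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
backend_sock.setblocking(False)

# Non-blocking connect: the handshake continues in the kernel.
rc = backend_sock.connect_ex(backend_listener.getsockname())
assert rc in (0, errno.EINPROGRESS)

# "Tell me when it is writable", i.e. when the handshake has finished.
ep.register(backend_sock.fileno(), select.EPOLLOUT | select.EPOLLET)
events = ep.poll(1.0)

# 0 means the connection is established; anything else is the connect error.
connect_error = backend_sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
print("events:", events, "SO_ERROR:", connect_error)

ep.close()
backend_sock.close()
backend_listener.close()
```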

Phase 7 — Worker Forwards Requests to Backend

Syscalls: epoll_wait, write, epoll_ctl

epoll_wait returns: [{fd=12, EPOLLOUT}, {fd=13, EPOLLOUT}]

Handling fd=12 (User A):

Look up context: fd=12 → request_context_A → request = "GET /api/users..."

write(fd=12, "GET /api/users HTTP/1.1\r\n...", len)
→ kernel copies bytes from worker RAM → fd=12 send buffer (kernel RAM)
→ kernel TCP stack will send these bytes to backend as TCP packets
→ returns immediately (non-blocking write)

epoll_ctl(epfd=4, EPOLL_CTL_MOD, fd=12, {EPOLLIN|EPOLLET})
→ switch from watching EPOLLOUT to EPOLLIN
→ "now tell me when backend sends a response back"

request_context_A.state = WAITING_BACKEND

Handling fd=13 (User B): Same. Forwards GET /api/orders.

Buffers touched:

fd=12 send buffer: written by worker (the forwarded HTTP request)
fd=13 send buffer: written by worker
Kernel TCP stack reads send buffers, constructs packets, hands to NIC

Interrupt on outgoing side: When the NIC finishes transmitting, it fires an IRQ so the kernel can recycle the transmit ring slots. Space in the socket send buffer is released only when the backend's ACKs for that data arrive. No worker involvement in either step.

Worker calls epoll_wait. Sleeps.

Phase 8 — Backend Responds (Possibly Out of Order)

Who acts: Kernel ISR. Worker sleeping.

Backend processes /api/orders faster. Sends response for User B first.

Backend TCP segment for User B arrives at NIC
→ DMA → NIC Ring Buffer
→ IRQ fires → ISR runs
→ Payload copied into fd=13 recv buffer
→ fd=13 marked EPOLLIN in ready list

Later:
Backend TCP segment for User A arrives
→ Same path → fd=12 recv buffer filled
→ fd=12 marked EPOLLIN in ready list

Worker was sleeping → kernel wakes it

Buffers touched:

NIC Ring Buffer: written by DMA, read by ISR
fd=13 recv buffer: written by ISR (User B response body)
fd=12 recv buffer: written by ISR (User A response body)

Phase 9 — Worker Reads Backend Response, Writes to Client

Syscalls: epoll_wait, read, write

epoll_wait returns: [{fd=13, EPOLLIN}, {fd=12, EPOLLIN}]

Note: User B's response came first even though User A connected first. epoll returns what is ready, not what came first.

Handling fd=13 (User B's backend response):

read loop on fd=13:
  read(fd=13, buf, 65536)
  → kernel copies backend response from fd=13 recv buffer → worker RAM
  → returns n bytes

  read(fd=13, buf, 65536)
  → returns -1, errno=EAGAIN (buffer empty, done)

Look up context: fd=13 → request_context_B → client_fd = 6

write(fd=6, buf, n)
→ kernel copies bytes from worker RAM → fd=6 send buffer
→ kernel TCP stack sends packets to User B's browser

epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=13, NULL)
→ remove backend fd from interest list

close(fd=13)
→ kernel tears down backend connection, frees socket struct and buffers

request_context_B.state = DONE

Handling fd=12 (User A): Same. Reads fd=12 → writes to fd=5 → closes fd=12.

Buffers touched:

fd=13 recv buffer: read by worker (emptied)
fd=12 recv buffer: read by worker (emptied)
fd=6 send buffer: written by worker (User B response going to browser)
fd=5 send buffer: written by worker (User A response going to browser)

Kernel does the rest: TCP stack drains send buffers → NIC → wire → browsers.
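
The relay step for one backend fd can be sketched as below. Everything here is illustrative: socketpairs stand in for the browser and the backend, the two dictionaries are a simplified form of the request_context lookup, and the response is assumed small enough to fit in the client's send buffer in one write.

```python
import select
import socket

def relay_backend_response(backend_fd, socks, client_fd_for, ep):
    """Drain the backend socket, forward the bytes to the paired client,
    then unregister and close the backend connection."""
    backend = socks[backend_fd]
    chunks = []
    while True:
        try:
            data = backend.recv(65536)
        except BlockingIOError:       # EAGAIN: recv buffer is empty
            break
        if not data:                  # backend closed its side
            break
        chunks.append(data)
    response = b"".join(chunks)

    client = socks[client_fd_for[backend_fd]]
    sent = client.send(response)      # assumption: fits in the send buffer

    ep.unregister(backend_fd)         # EPOLL_CTL_DEL before close()
    backend.close()
    del socks[backend_fd]
    return sent

browser_b, client_b = socket.socketpair()
backend_peer, backend_b = socket.socketpair()
backend_b.setblocking(False)
client_b.setblocking(False)

ep = select.epoll()
ep.register(backend_b.fileno(), select.EPOLLIN | select.EPOLLET)
socks = {client_b.fileno(): client_b, backend_b.fileno(): backend_b}
client_fd_for = {backend_b.fileno(): client_b.fileno()}

payload = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\n[]"
backend_peer.send(payload)

for fd, _mask in ep.poll(1.0):
    relay_backend_response(fd, socks, client_fd_for, ep)

delivered = browser_b.recv(65536)
print(delivered)

ep.close()
for s in (browser_b, client_b, backend_peer):
    s.close()
```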

Phase 10 — Connection Teardown

Depends on: HTTP version and Connection header.

HTTP/1.0 or Connection: close:

epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=5, NULL)   → deregister first; after close() the fd number is no longer valid
close(fd=5)   → kernel sends FIN to User A, tears down connection, frees socket
epoll_ctl(epfd=4, EPOLL_CTL_DEL, fd=6, NULL)
close(fd=6)   → same for User B

HTTP/1.1 keep-alive (default):

fd=5 and fd=6 remain open and registered in epoll interest list
Worker goes back to epoll_wait
If same client sends another request → fd=5 fires EPOLLIN → handled immediately
nginx arms an idle timer per connection (keepalive_timeout, 75 seconds by default): if no new request arrives in time → nginx closes the fd

Worker returns to epoll_wait. Sleeping again. Ready for next event.

7. What Happens While Worker is Busy

Scenario: Worker is processing User A's request. At the same moment, Users C, D, E connect and send requests.

Worker processing [User A]
        │
        │  ← IRQ fires (User C SYN arrives)
        │     ISR runs in microseconds
        │     Kernel completes C's handshake
        │     Adds C to Accept Queue
        │     Marks fd=3 EPOLLIN in ready list
        │     Worker is NOT sleeping → no wakeup sent. That's fine.
        │
        │  ← IRQ fires (User D SYN arrives)
        │     Same. D added to Accept Queue. fd=3 already marked ready.
        │
        │  ← IRQ fires (User E data arrives on existing connection)
        │     Bytes go into fd=E's recv buffer. fd=E marked EPOLLIN.
        │
Worker finishes [User A]
        │
epoll_wait()
        │
→ returns: [{fd=3, EPOLLIN}, {fd=E, EPOLLIN}]
  (fd=3 fires once for all queued connections — edge triggered fires on transition)

Worker drains Accept Queue:
  accept() → User C → fd=7 → register with epoll
  accept() → User D → fd=8 → register with epoll
  accept() → EAGAIN → done

Worker reads fd=E
Worker calls epoll_wait. Sleeps.

Key principle: The ready list is written by ISR independently of worker state. The worker sees everything accumulated while it was busy, on its very next epoll_wait call.

8. All Buffers — Master Reference

#	Buffer	Location	Default Size	Written By	Read By	Full Behaviour
1	NIC Ring Buffer	NIC / kernel RAM (DMA region)	256–4096 packet slots	NIC hardware via DMA	Kernel ISR	Packets dropped, visible via ethtool
2	SYN Queue	Kernel RAM	~128–1024 slots	Kernel TCP stack (on SYN)	Kernel TCP stack (on ACK)	SYN drops, connection timeout
3	Accept Queue	Kernel RAM	min(backlog, somaxconn) ≈ 511	Kernel TCP stack (post-handshake)	Process via `accept()`	SYN drops, connection timeout
4	Socket Recv Buffer (client fd)	Kernel RAM	~208 KB rmem_default; TCP auto-tunes within tcp_rmem	Kernel ISR (per interrupt)	Process via `read()`	TCP flow control, sender throttled
5	Socket Send Buffer (client fd)	Kernel RAM	~208 KB wmem_default; TCP auto-tunes within tcp_wmem	Process via `write()`	Kernel TCP stack → NIC	`write()` returns EAGAIN (non-blocking)
6	Worker Heap (request_context)	Process RAM	Unbounded (heap)	Process (after `read()` + parse)	Process (during response handling)	OOM if too many large requests
7	Socket Recv Buffer (backend fd)	Kernel RAM	~208 KB rmem_default; TCP auto-tunes within tcp_rmem	Kernel ISR (backend response)	Process via `read()`	TCP flow control
8	Socket Send Buffer (backend fd)	Kernel RAM	~208 KB wmem_default; TCP auto-tunes within tcp_wmem	Process via `write()`	Kernel TCP stack → NIC	`write()` returns EAGAIN

Data movement order for one request:

NIC Ring Buffer
  → (ISR) → Socket Recv Buffer [client fd]
  → (read syscall) → Worker Heap
  → (write syscall) → Socket Send Buffer [backend fd]
  → (kernel TCP) → NIC → wire → backend

Backend response:
NIC Ring Buffer
  → (ISR) → Socket Recv Buffer [backend fd]
  → (read syscall) → Worker Heap
  → (write syscall) → Socket Send Buffer [client fd]
  → (kernel TCP) → NIC → wire → client browser

9. All Syscalls — Master Reference

Syscall	Called By	When	What Kernel Does
`socket()`	Master (listen socket), Worker (backend conns)	Startup + per backend conn	Allocates socket struct, returns fd
`setsockopt()`	Master	Startup	Sets socket options (SO_REUSEADDR etc.)
`bind()`	Master	Startup	Claims port 80 for the listen socket
`listen()`	Master	Startup	Creates SYN + Accept queues
`accept()`	Worker	On fd=3 EPOLLIN	Dequeues one entry from Accept Queue, returns new fd
`fcntl(F_SETFL, O_NONBLOCK)`	Worker	After accept/socket	Sets fd to non-blocking mode
`epoll_create1()`	Worker	Startup	Creates epoll instance
`epoll_ctl(ADD)`	Worker	After accept/socket	Adds fd to interest list
`epoll_ctl(MOD)`	Worker	After connect/read	Changes watched events (EPOLLIN ↔ EPOLLOUT)
`epoll_ctl(DEL)`	Worker	Before close	Removes fd from interest list
`epoll_wait()`	Worker	After each batch	Sleeps until events; returns ready fd list
`connect()`	Worker	Per upstream conn	Initiates TCP handshake (non-blocking, returns EINPROGRESS)
`read()`	Worker	On EPOLLIN	Copies bytes from socket recv buffer → process RAM
`write()`	Worker	After parsing/response	Copies bytes from process RAM → socket send buffer
`close()`	Worker	After response sent	Sends FIN, frees socket struct and all buffers

10. Failure Modes

10.1 Accept Queue Full

Cause: Connections arriving faster than worker calls accept().
Symptom: Client sees Connection timed out (kernel drops SYN silently).
Fix: Increase net.core.somaxconn, increase nginx backlog, add more workers.
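
Accept queue overflows are counted by the kernel in the TcpExt section of /proc/net/netstat (the ListenOverflows and ListenDrops counters). The sketch below parses that header-line/value-line format; the sample text is made up and heavily abbreviated, since the real file has many more columns.

```python
def parse_tcpext(netstat_text: str) -> dict:
    """Pair the 'TcpExt:' header line with the 'TcpExt:' value line."""
    lines = netstat_text.splitlines()
    for header, values in zip(lines, lines[1:]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            names = header.split()[1:]
            numbers = [int(v) for v in values.split()[1:]]
            return dict(zip(names, numbers))
    return {}

# Abbreviated, invented sample in the same layout as /proc/net/netstat.
sample = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 17 17\n"
)
counters = parse_tcpext(sample)
print("ListenOverflows:", counters["ListenOverflows"])
```

On a live system the same function can be fed the contents of /proc/net/netstat; a ListenOverflows value that keeps growing points at this failure mode.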

10.2 Event Loop Starvation

Cause: Worker doing heavy CPU computation or blocking syscall (sync file I/O, blocking DB call) for one connection.
Symptom: All other connections queue up. Latency spikes for everyone.
Fix: Never block the event loop. Offload blocking file I/O to nginx thread pools (aio threads). Keep CPU-heavy work and synchronous database calls out of the worker, for example in the backend service behind it.

10.3 Send Buffer Full (Slow Client)

Cause: Client reading response slowly. fd=5 send buffer fills up.
Symptom: write(fd=5) returns EAGAIN. nginx registers EPOLLOUT on fd=5 and comes back to write more when the buffer drains.
Fix: nginx handles this internally. Tune send_timeout to close connections from very slow clients.

10.4 Recv Buffer Full (Slow Worker)

Cause: Data arriving faster than worker reads it.
Symptom: TCP flow control kicks in. Sender throttled. No data loss.
Fix: Usually not a problem. If persistent, check for event loop blocking.

10.5 fd Exhaustion

Cause: Too many open connections. Each costs one fd.
Symptom: accept() returns EMFILE (too many open files).
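Fix: Raise the fd limit (ulimit -n 65536 in the OS, or worker_rlimit_nofile in nginx config) and size worker_connections to match (for example 10240). Remember that a proxied request holds two fds: one client, one backend.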
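
A rough capacity estimate follows directly from the fd limit. The sketch below reads the limit of the current process; the helper name and the allowance of 16 fds for logs, the listen socket and the epoll instance are assumptions chosen for illustration.

```python
import resource

def max_proxied_requests(fd_limit: int, reserved: int = 16) -> int:
    """Each proxied request holds two fds: one client, one backend."""
    return max(0, (fd_limit - reserved) // 2)

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft fd limit: {soft}, hard fd limit: {hard}")
print("proxied requests at ulimit -n 65536:", max_proxied_requests(65536))
```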

Quick Mental Model

INTERNET
   │ raw bytes
   ▼
NIC Ring Buffer  ← DMA (no CPU)
   │ IRQ fires
   ▼
Kernel ISR  ← runs in microseconds, interrupts everything
   │ copies bytes
   ▼
Socket Recv Buffer  ← kernel RAM, per fd
   │ waits here until worker is ready
   ▼
epoll Ready List  ← kernel marks fd ready
   │ wakes worker if sleeping
   ▼
Worker (epoll_wait returns)
   │ read() → copies to process RAM
   │ parse HTTP
   │ write() to backend fd → send buffer
   │
   ▼
Socket Send Buffer  ← kernel RAM, per fd
   │ kernel drains via TCP
   ▼
NIC → wire → backend

Backend response takes same path in reverse.
Worker links client fd ↔ backend fd via request_context in its own memory.
Kernel has no concept of this pairing.
Everything the worker does is non-blocking.
When waiting, worker is off the CPU run queue. Zero cycles consumed.
