Rafa Calderon
How Cloudflare Replaced NGINX with Rust, Tokio, and Pingora — and Saved 434 Years of TLS Handshakes Every Single Day

1 trillion requests per day, years of workarounds, and an architectural problem with no patch. This is the story of what broke, what Cloudflare built to replace it, and why every technical decision in Pingora is a direct answer to a specific NGINX limitation.


First, the numbers

Cloudflare published the migration performance data in their blog in September 2022. Not projections, not synthetic benchmarks — production metrics on real traffic:

  • CPU and memory consumed: −66% vs NGINX on identical hardware
  • New TCP connections opened per second: 3× fewer globally
  • Connection reuse rate (major customer): from 87.1% → 99.92%
  • Reduction in new connections (same customer): 160× fewer
  • Median TTFB: −5 ms
  • 95th-percentile TTFB: −80 ms
  • TLS handshake time saved: 434 years... per day

The 434-year number is the direct mathematical consequence of 99.92% reuse at their scale. To understand why that number was unreachable with NGINX, you need to understand how NGINX works internally.
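The arithmetic is easy to sanity-check. The sketch below uses illustrative inputs — roughly 1 trillion requests/day (from the intro) and an assumed ~100 ms of wall-clock time per avoided handshake; neither is a figure Cloudflare published for this calculation — and lands in the right ballpark:

```rust
// Back-of-the-envelope check of the "years of handshakes saved" figure.
// Inputs are ASSUMPTIONS for illustration, not Cloudflare's published data.
fn saved_years(req_per_day: f64, old_reuse: f64, new_reuse: f64, handshake_secs: f64) -> f64 {
    // Requests that needed a fresh connection before vs. after the migration.
    let avoided = req_per_day * ((1.0 - old_reuse) - (1.0 - new_reuse));
    // Total handshake seconds avoided per day, expressed in years.
    avoided * handshake_secs / (365.25 * 86_400.0)
}

fn main() {
    // ~1e12 requests/day, reuse 87.1% -> 99.92%, ~100 ms per handshake (assumed).
    let years = saved_years(1.0e12, 0.871, 0.9992, 0.1);
    println!("≈{years:.0} years of handshake time saved per day");
}
```

With these assumed inputs the result is around 400 years/day — the same order of magnitude as Cloudflare's 434-year figure, which suggests the published number follows directly from the reuse-rate jump.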


How NGINX works internally — and where it breaks

NGINX uses a master-worker process model: one master process that coordinates, and N worker processes — typically one per CPU core. The OS assigns each incoming connection to a worker, and that connection lives there until it finishes. The worker has complete ownership.

This design was brilliant at the time. It avoids inter-process synchronization complexity and makes good use of per-core cache locality. But it has three structural problems that cannot be solved from within the model:

Problem 1: connection pools are not shared

Each worker maintains its own connection pool toward origins. If worker A has an open, idle TLS connection to api.client.com, and a new request toward that same origin arrives and the OS assigns it to worker B — worker B opens a new connection from scratch. It has no access to worker A's pool.

On a server with 16 workers and an origin that all of them need to connect to, in the worst case you have 16 connections doing the work that one could do. Multiply this by the number of origins, the number of servers in Cloudflare's datacenters, and the traffic volume — the waste is enormous.

Every new connection means a TCP handshake and, over HTTPS, a TLS handshake. TLS 1.3 has reduced the latency of this process, but it carries non-trivial CPU cost and round-trips. Cloudflare estimated that, at their scale, this accumulated waste added up to 434 years of handshake time per day.

Cloudflare spent years trying to mitigate this problem. They wrote multiple posts about their NGINX workarounds. Eventually they reached the conclusion that every company reaches when they hit this limit: you cannot share a connection pool across NGINX worker processes because the isolated process model makes it architecturally impossible without rebuilding from scratch.

Problem 2: load imbalance from request pinning

Because each request is pinned to a worker for its entire lifetime, the OS has to distribute incoming requests across workers before knowing how long they'll take. A request that takes 2ms and one that takes 2 seconds waiting on a slow origin cost the OS exactly the same at assignment time — but the worker that gets the slow one is blocked for those 2 seconds.

In practice, this translates to highly uneven CPU loads: some cores saturated, others underutilized. Adding more workers improves the situation but aggravates the connection pool problem.

Problem 3: extensibility in C

Cloudflare needed to implement complex business logic in the proxy: routing by client characteristics, real-time header modification, custom cache logic, security rules. NGINX allows extension via C modules (requiring recompilation) or via Lua with OpenResty (introducing VM overhead and a separate runtime).

Both options have limits. C modules are prone to the same memory issues as NGINX itself. The team wanted to write business logic with modern systems language tooling — type safety, normal unit tests, the Rust crate ecosystem.

The underlying problem: memory safety in C

A network proxy processes untrusted data from the Internet on every request. Malformed headers, oversized bodies, payloads designed to exploit parsers. In C, a parsing bug can escalate to buffer overflow, heap corruption, or remote code execution.

This is not theoretical. CVE-2013-2028: stack buffer overflow in NGINX, remote code execution. CVE-2017-7529: integer overflow in the range request module, out-of-bounds memory read. CVE-2021-23017: off-by-one in NGINX's DNS resolver. In 2022, the NSA and CISA jointly published a guide explicitly recommending migrating critical infrastructure to memory-safe languages.

Cloudflare was already a Rust shop. The decision was practical: an entire class of vulnerabilities becomes impossible in safe Rust because the compiler rejects the code that could cause them.


The response: what each Pingora decision solves

With the problems clear, Pingora's architecture reads like a solution map. Every technical piece answers a specific problem.

Answer to Problem 1: a single multi-threaded process with a shared pool

The fix to the connection pool problem is changing the unit of isolation: from processes to threads within a single process.

Pingora runs as a single process with N threads — by default, one per core. All threads share the same memory space, and therefore the same connection pool. When thread A has an open TLS connection to api.client.com and thread B needs the same origin, it can reuse it directly. No IPC, no serialization, no coordination protocol — it's a pointer to a shared structure in memory.

This is what produces the jump from 87.1% to 99.92% connection reuse. It's not an optimization trick — it's the direct consequence of eliminating the process isolation that made sharing impossible.
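A minimal sketch of the idea — a single pool behind an `Arc`, visible to every thread in the process. The `Conn` type and the pool's API here are invented for illustration; Pingora's real pool adds idle timeouts, per-peer limits, and finer-grained locking:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for a pooled TLS connection to an origin (illustrative only).
struct Conn { origin: String }

// One pool per PROCESS, shared by all threads. In NGINX's model, each
// worker process would own a private copy of this map instead.
#[derive(Clone, Default)]
struct SharedPool {
    idle: Arc<Mutex<HashMap<String, Vec<Conn>>>>,
}

impl SharedPool {
    // Reuse an idle connection to `origin` if any thread has returned one.
    fn checkout(&self, origin: &str) -> Option<Conn> {
        self.idle.lock().unwrap().get_mut(origin)?.pop()
    }
    // Return a finished connection so other threads can pick it up.
    fn checkin(&self, conn: Conn) {
        self.idle.lock().unwrap().entry(conn.origin.clone()).or_default().push(conn);
    }
}

fn main() {
    let pool = SharedPool::default();
    // "Thread A" finishes a request and parks its connection...
    pool.checkin(Conn { origin: "api.client.com".into() });
    // ..."thread B" — same process, same memory — reuses it directly.
    let reused = pool.checkout("api.client.com");
    println!("reused: {}", reused.is_some());
}
```

The `Clone` on the pool is cheap — it clones the `Arc`, not the map — which is how every worker thread ends up holding a handle to the same structure.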

Answer to Problem 2: Tokio's work-stealing scheduler

Multi-threading solves the connection pool but introduces a new risk: if you distribute work across threads statically, imbalance persists.

Pingora uses Tokio's multi-thread work-stealing scheduler. According to Tokio's official documentation and the technical post on the scheduler rewrite, the mechanism is more sophisticated than its name suggests.

Each worker thread maintains three levels of task queues:

  • LIFO slot: the most recently spawned task. Executed first to maximize CPU cache locality.
  • Local queue: bounded (max 256 tasks), lock-free. Only the owning thread pushes; other threads can steal from the opposite end.
  • Global queue: consulted periodically, not every tick, to minimize synchronization overhead.

When a thread exhausts its queues, it picks a random victim thread and steals half its local queue in a single batch operation — multiple tasks at once to reduce per-steal synchronization cost. The queues are implemented with atomic operations (compare-and-swap), no mutexes on the hot path.

The practical upshot, compared with NGINX's request pinning: in Pingora, a connection waiting on a slow origin is simply a suspended Future in the scheduler. The thread that initiated it keeps processing other requests — and can steal work from busier threads. Nothing blocks. Cores are utilized uniformly and automatically.
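The batch-steal step can be sketched in a few lines. This is a deliberately simplified toy — Tokio's real local queues are lock-free, bounded, and stolen from the opposite end; a `Mutex<VecDeque>` keeps the illustration short:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Toy model of a worker's local task queue (tasks are just ids here).
type Queue = Arc<Mutex<VecDeque<u32>>>;

// An idle worker picks a victim and takes HALF its queue in one batch,
// paying the synchronization cost once for many tasks.
fn steal_half(victim: &Queue, thief: &Queue) -> usize {
    let mut v = victim.lock().unwrap();
    let n = v.len() / 2;
    let stolen: Vec<u32> = v.drain(..n).collect();
    let count = stolen.len();
    thief.lock().unwrap().extend(stolen);
    count
}

fn main() {
    let victim: Queue = Arc::new(Mutex::new((0..8).collect()));
    let thief: Queue = Arc::new(Mutex::new(VecDeque::new()));
    let moved = steal_half(&victim, &thief);
    println!("stole {moved} tasks; victim keeps {}", victim.lock().unwrap().len());
}
```

Stealing half the queue, rather than one task at a time, is the detail that keeps per-steal overhead low: the thief amortizes one synchronization round over many tasks.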

Answer to Problem 3: the ProxyHttp trait

Instead of C modules or Lua scripts, Pingora exposes a Rust trait called ProxyHttp that defines phases of each request's lifecycle. It's the same mental model as NGINX/OpenResty's configurable phases — but implemented in compiled, type-safe Rust.

  • request_filter — when the client request arrives
  • upstream_peer — decides which backend gets the request (the only required phase)
  • upstream_request_filter — before forwarding the request to the backend
  • upstream_response_filter — when the backend response arrives
  • response_filter — before sending the response to the client
  • logging — always runs, even on error; for metrics and tracing

You implement only the phases you need. The Rust compiler guarantees your implementation is type-correct, race-free, and memory-safe. A type error in your request_filter is a compilation failure and a logic bug is a failing unit test — not a segfault in production at 3am.
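The shape of the model can be imitated in self-contained form. To be clear about what is assumed: this is NOT the real pingora API — the actual `ProxyHttp` trait is async and its request/peer types differ. What carries over is the pattern: default no-op phases, override only what you need, and one required method that picks the backend:

```rust
// Simplified, synchronous imitation of the phase model (illustrative only).
struct Request { path: String, headers: Vec<(String, String)> }

trait PhasedProxy {
    // Required: pick the backend for this request.
    fn upstream_peer(&self, req: &Request) -> String;
    // Optional phases with no-op defaults — implement only what you need.
    fn request_filter(&self, _req: &mut Request) {}
    fn logging(&self, _req: &Request) {}
}

struct MyProxy;

impl PhasedProxy for MyProxy {
    fn upstream_peer(&self, req: &Request) -> String {
        // Route by path prefix — the kind of business logic the article
        // describes living in the proxy. Pool names are made up.
        if req.path.starts_with("/api/") { "api-pool:443".into() } else { "web-pool:443".into() }
    }
    fn request_filter(&self, req: &mut Request) {
        req.headers.push(("x-proxied".into(), "1".into()));
    }
}

fn main() {
    let proxy = MyProxy;
    let mut req = Request { path: "/api/users".into(), headers: vec![] };
    proxy.request_filter(&mut req);
    println!("peer = {}", proxy.upstream_peer(&req));
}
```

Because the phases are trait methods, the compiler checks every override's signature, and each phase is an ordinary function you can unit-test in isolation.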

Answer to the memory safety problem: Rust by construction

In safe Rust, a use-after-free is a compilation error. A data race between threads is a compilation error. An out-of-bounds access that the compiler cannot rule out becomes a deterministic panic instead of silent memory corruption. These aren't warnings, linters, or static analysis — the compiler rejects, or the runtime safely aborts, code that could cause those conditions.

Cloudflare reported a significant reduction in memory safety errors after the migration, and that engineers could focus on product logic instead of chasing segfaults. This category of improvement is hard to quantify in production numbers, but its consequences show up in what Cloudflare has built since: FL2, their system managing security and performance rules for every customer — 15 years of C, rewritten in 2024-2025 — is also built on Pingora.


The two mechanisms NGINX never had

Beyond the structural problems that motivated the migration, Cloudflare had to build two engineering pieces that didn't exist in NGINX and that they needed to operate at their scale.

TinyUFO: the cache algorithm they published as an independent crate

Caching in a high-traffic proxy has a problem that LRU doesn't solve well: web access patterns follow a Zipf distribution, where a few items are extremely hot and most are accessed very rarely. LRU treats all items equally in terms of admission — if something arrives and the cache is full, it evicts the least recently used, regardless of how frequently that item was accessed.

The result is that scans — sequential reads of items that won't be accessed again — can contaminate the cache and evict hot items. At 40M req/s, the consequences of a poor admission policy are measured in millions of additional cache misses.
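The scan-pollution failure mode is easy to reproduce with a toy LRU. Here a key that was touched 100 times is evicted by three one-off scan items, because LRU considers only recency, never frequency:

```rust
use std::collections::VecDeque;

// Deliberately tiny LRU (front = most recent) to demonstrate the problem.
struct Lru { cap: usize, items: VecDeque<String> }

impl Lru {
    fn touch(&mut self, key: &str) {
        self.items.retain(|k| k.as_str() != key);   // move to front if present
        self.items.push_front(key.to_string());
        if self.items.len() > self.cap {
            self.items.pop_back();                  // evict least recently used
        }
    }
    fn contains(&self, key: &str) -> bool {
        self.items.iter().any(|k| k.as_str() == key)
    }
}

fn main() {
    let mut cache = Lru { cap: 3, items: VecDeque::new() };
    for _ in 0..100 { cache.touch("hot"); }              // extremely hot item
    for i in 0..3 { cache.touch(&format!("scan-{i}")); } // one-off sequential scan
    println!("hot survived the scan: {}", cache.contains("hot"));
}
```

The hot item is gone after the scan — exactly the behavior a frequency-aware admission policy is meant to prevent.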

Cloudflare built TinyUFO by combining two recent research algorithms:

S3-FIFO (published at SOSP 2023): instead of a doubly-linked list like LRU, uses three FIFO queues. FIFO queues have better CPU cache behavior — insertions and evictions are sequential memory accesses, not random pointer traversals. A ghost queue tracks recently evicted items; if they return quickly, they're promoted to the main queue instead of starting over.

TinyLFU: maintains approximate frequency counts using a Count-Min Sketch — a fixed-size probabilistic structure independent of the number of tracked items. Before admitting a new item, it checks whether its access frequency beats the item that would be evicted. Scans don't pass the filter because their items appear with frequency 1.
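A minimal Count-Min Sketch admission check looks like this. It's a sketch of the principle, not TinyUFO's implementation — TinyUFO is lock-free and its hashing is more careful; the per-row salting here is a simplistic stand-in for independent hash functions:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Fixed-size frequency counters: memory is depth × width, independent of
// how many distinct items are ever seen.
struct CountMin { rows: Vec<Vec<u32>>, width: usize }

impl CountMin {
    fn new(depth: usize, width: usize) -> Self {
        Self { rows: vec![vec![0; width]; depth], width }
    }
    fn index(&self, item: &str, row: usize) -> usize {
        let mut h = DefaultHasher::new();
        (row as u64).hash(&mut h); // per-row salt approximates independent hashes
        item.hash(&mut h);
        (h.finish() as usize) % self.width
    }
    fn record(&mut self, item: &str) {
        for r in 0..self.rows.len() {
            let i = self.index(item, r);
            self.rows[r][i] += 1;
        }
    }
    // Min over rows: collisions can only inflate counts, never deflate them.
    fn estimate(&self, item: &str) -> u32 {
        (0..self.rows.len()).map(|r| self.rows[r][self.index(item, r)]).min().unwrap()
    }
    // Admit the newcomer only if it is hotter than the would-be evictee.
    fn admit(&self, candidate: &str, evictee: &str) -> bool {
        self.estimate(candidate) > self.estimate(evictee)
    }
}

fn main() {
    let mut cms = CountMin::new(4, 1024);
    for _ in 0..50 { cms.record("hot-item"); }
    cms.record("scan-item"); // seen once
    // A one-off scan item cannot displace a frequently-hit one:
    println!("admit scan over hot: {}", cms.admit("scan-item", "hot-item"));
}
```

This is why scans bounce off the filter: their items estimate to frequency ~1, which never beats a genuinely hot eviction candidate.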

The design is completely lock-free: metadata operations use atomic compare-and-swap. In their benchmarks with 8 threads on x64 Linux, TinyUFO outperforms moka (another widely-used TinyLFU implementation) in throughput, precisely because it eliminates mutex contention.

They published it as an independent crate on crates.io, separate from the rest of Pingora. Usable in any Rust project that needs a high-performance in-memory cache.

Graceful Restart: transferring sockets between processes

A proxy in production needs to update without dropping traffic. The problem: the new process needs to bind() on the same port the old process is already listening on.

The kernel mechanism that solves this is SCM_RIGHTS — a Linux feature that Cloudflare documented in detail in their own blog: Know your SCM_RIGHTS.

The mechanics: file descriptors are process-local indices in a file descriptor table. They're not global kernel handles. SCM_RIGHTS allows sending an open file descriptor from one process to another using sendmsg() over a Unix domain socket — the kernel duplicates the underlying resource (the active network socket with all its established connections) into the receiving process's file descriptor table.

The Pingora upgrade protocol, from official docs:

  1. The new binary starts with --upgrade. It does not call bind(). It connects to a coordination socket and waits.
  2. SIGQUIT is sent to the old process. The old process transfers its listening socket FDs to the new one via SCM_RIGHTS.
  3. The new process starts accepting new connections on the received sockets.
  4. The old process drains: it finishes its in-flight requests within the grace period and exits.

The guarantees: every request is handled by exactly one of the two processes — old or new, never dropped between them. The listening socket is never closed. No client sees Connection Refused. No request that can finish within the grace period is cut off.

HAProxy and Envoy use the same mechanism. What makes Pingora different is that it's integrated transparently into the server lifecycle — two terminal commands and the upgrade happens with no additional intervention.


What Pingora is today

Pingora has been open source since March 2024 (Apache 2.0). What has happened since then signals that the bet worked:

FL2 — the system Cloudflare calls the "brain" of their network, 15 years of C managing security and configuration rules for every customer — was rewritten in 2024-2025 on top of Pingora. This is not a satellite project: it's Cloudflare's central infrastructure.

ecdysis — the library that encapsulates the zero-downtime upgrade mechanism (SCM_RIGHTS) — was published in 2025 as an independent Rust crate. Usable in any Rust network service without depending on Pingora.

Pingora 0.8.0 patched request smuggling vulnerabilities in ingress proxy configurations, responsibly disclosed through their bug bounty.

The current MSRV is 1.84, with a rolling 6-month policy. The API is pre-1.0, so expect breaking changes — but the ProxyHttp trait has proven stable enough for production for years.


Pingora isn't for everyone. If you need a reverse proxy you can configure in 10 minutes, Caddy or Traefik are better options — they're binaries, not frameworks. Pingora is for when you have a real infrastructure problem that a configurable proxy can't solve: connection efficiency at scale, routing logic that exceeds what Lua can express maintainably, or the requirement that a memory bug in your proxy not become a CVE.

Cloudflare spent years concluding they needed to build their own proxy. The deployment numbers say they got the technical decisions right.

