My Load Balancer Handles 5M RPS: Architecture and Lessons Learned

From 50K RPS to 5M RPS: The Hard-Won Insights That Only Come From Scale

Scaling load balancing to 5 million requests per second requires rethinking every assumption about network processing, memory management, and system architecture.

So I remember this moment, right? We’re in a perf review meeting, staring at Black Friday: 50K RPS and I’m… calm. CPU ~15%, memory flat, p95 sane. Headroom for days — or so I said, smugly, into a room that would later haunt me. We celebrated that dashboard like it meant we were safe for a year. It didn’t. Two years and a few gray hairs later, we’re at 5M RPS, and the thing I believed most — “just add more boxes” — turned out to be the first lie scale tells you. At 50K you adjust knobs; at 5M you renegotiate physics. Every nanosecond becomes a character in the story. You stop thinking in “requests” and start thinking in cache lines, queue depths, and PCIe lanes. And weirdly, the kernel becomes a politely smiling antagonist, nodding helpfully while skimming your cycles like tips from a jar.

There’s this psychological trap, too. The illusion that good graphs mean good architecture. They don’t. At modest loads, almost anything looks clean if you haven’t stressed the failure modes. The hidden costs — the copies, the context switches, the cold cache lines — are all there, just not loud enough to get your attention. Until they are. Then they scream.

The Performance Cliff: Where Traditional Load Balancers Break

Benchmarks adore happy paths. Production at millions of RPS is a bag of misaligned MTUs, bursty clients, and surprise TLS renegotiations. Someone somewhere will have a firmware quirk and your perfect assumptions will crumble. The painful, measured truths we learned the hard way:

  • Memory bandwidth becomes the primary bottleneck past ~3M RPS. Not CPU. Memory. That realization landed like a plot twist I didn’t want. We had cores to spare and still stalled; the memory controllers were pegged while the ALUs twiddled their thumbs.
  • L3 cache misses: 2% → 23% beyond 2M RPS. That’s catastrophic enough to feel personal. When perf counters show long-latency loads tripling, you start treating locality like a religion.
  • Context switching taxes you 50K–75K RPS for every extra percent of CPU the scheduler spends switching. The scheduler is incredible technology, but at this scale every involuntary switch is a pothole you hit with all four tires.
  • Interrupts quietly chew ~40% of CPU at 5M RPS. The CPU flinches; you pay. We learned to batch, coalesce, and pin, but the lesson stuck: your NIC’s interrupt strategy is not a footnote — it’s architecture.

And SSL/TLS? The ~30% throughput tax you can’t charm your way out of. At 5M RPS, that’s 1.5M req/s gone to math that doesn’t bargain. You can’t “optimize” exponentiation with vibes. You either offload, resume, or get ruthless with where and how you do crypto. Also: handshake storms will find you on the worst possible day.
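
To make “resume” concrete, here’s a minimal sketch of server-side session caching, assuming OpenSSL is the TLS stack (the article never names one, so the calls and sizes below are illustrative, not what we actually run):

// Hypothetical sketch: server-side session caching so repeat clients resume
// instead of paying a full handshake. Adjust to whatever terminates TLS for you.
#include <openssl/ssl.h>

SSL_CTX *make_tls_ctx(void) {
    SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());
    if (ctx == NULL)
        return NULL;
    // Cache sessions server-side so returning clients can resume cheaply
    SSL_CTX_set_session_cache_mode(ctx, SSL_SESS_CACHE_SERVER);
    SSL_CTX_sess_set_cache_size(ctx, 1 << 20);   // sized for handshake storms (example value)
    SSL_CTX_set_timeout(ctx, 300);               // seconds a session stays resumable
    return ctx;
}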

There’s another cliff most folks don’t mention: tail latency under partial failure. When one backend goes wobbly, naive algorithms push retries into a thundering herd. At 100K, you notice. At 5M, you ignite. We learned to isolate, dampen, and treat “retry” like a loaded gun.
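
For “dampen,” the shape we mean is a retry budget. This is a hedged sketch rather than the production code: retries spend from a per-backend token bucket that only refills on success, so a browning-out backend starves its own retry storm instead of feeding it.

// Illustrative sketch (not the production code): a per-backend retry budget.
#include <stdbool.h>
#include <stdint.h>

struct retry_budget {
    uint64_t tokens;        // available retry tokens (scaled by 1000)
    uint64_t max_tokens;    // cap, e.g. 10% of recent request volume
    uint32_t refill_permil; // tokens added per successful request, in 1/1000ths
};

static void budget_on_success(struct retry_budget *b) {
    b->tokens += b->refill_permil;     // e.g. 100 => at most a 10% retry ratio
    if (b->tokens > b->max_tokens)
        b->tokens = b->max_tokens;
}

static bool budget_allow_retry(struct retry_budget *b) {
    if (b->tokens < 1000)
        return false;                  // budget exhausted: fail fast, don't pile on
    b->tokens -= 1000;                 // one retry costs one whole "request"
    return true;
}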

Architecture Evolution: From Standard to High-Performance

We rebuilt it three times. Each rewrite felt final — until the next wall. Each taught us a different thing about where time goes.

Architecture 1: Standard HAProxy (50K–200K RPS)

Started sensible. Kernel networking, classic config:

global  
    daemon  
    maxconn 50000  
    nbproc 4  
defaults  
    mode http  
    timeout connect 5000ms  
    timeout client 50000ms    
    timeout server 50000ms  
frontend web_frontend  
    bind *:80  
    default_backend web_servers  
backend web_servers  
    balance roundrobin  
    server web1 10.0.1.10:8080 check  
    server web2 10.0.1.11:8080 check

Performance Characteristics:

  • Peak RPS: ~180K (single node)
  • CPU Utilization: ~15% at peak
  • Memory Usage: ~2GB
  • P95 Latency: ~45ms

This was fine — truly fine — until context switches and kernel overhead started acting like landlords charging rent on every packet. The classic kernel path is feature-rich and battle-tested, but at high RPS you drown in “bookkeeping”: sk_buff orchestration, socket queues, copies between rings and buffers you didn’t ask for. It’s like paying tolls for a road you don’t need to be on.

We tuned IRQ affinity, bumped socket buffers, played with RPS/RFS, TSO/GRO, even flirted with io_uring. Each tweak helped a little and then… plateau. The lesson: at some point, the overhead of crossing user/kernel space dominates, and the cleanest fix is to stop crossing so often.
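
For context, this is the flavor of per-socket tuning in question; a rough Linux sketch with example values, not the exact knobs from our boxes (the system-wide pieces like RPS/RFS and IRQ affinity live in /proc and /sys, outside any one socket):

// Rough sketch of per-socket knobs (Linux). Values are examples only.
#include <sys/socket.h>

static int tune_socket(int fd) {
    int rcvbuf = 4 * 1024 * 1024;   // larger socket buffer to absorb bursts
    int busy_poll_us = 50;          // SO_BUSY_POLL: spin briefly before sleeping

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        return -1;
    // SO_BUSY_POLL may need elevated privileges on older kernels; ignore failure
    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_poll_us, sizeof(busy_poll_us));
    return 0;
}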

Architecture 2: DPDK-Enabled Load Balancing (200K–2M RPS)

We went kernel-bypass with DPDK. Pre-alloc pools, run-to-completion, offloads. You become the OS for your packets, which is as terrifying as it sounds for the first week and then oddly empowering.

// DPDK-based packet processing (assumes untagged IPv4/TCP, no IP options)
struct rte_mbuf *process_packet(struct rte_mbuf *pkt) {
    struct rte_ether_hdr *eth_hdr = rte_pktmbuf_mtod(pkt, struct rte_ether_hdr *);
    struct rte_ipv4_hdr *ip_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
    struct rte_tcp_hdr *tcp_hdr = (struct rte_tcp_hdr *)((char *)ip_hdr + sizeof(*ip_hdr));
    ip_hdr->dst_addr = select_backend_ip(ip_hdr->src_addr);      // direct in-place rewrite
    pkt->ol_flags |= PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;         // HW checksum offload
    pkt->l2_len = sizeof(*eth_hdr);                               // offload needs header lengths
    pkt->l3_len = sizeof(*ip_hdr);
    ip_hdr->hdr_checksum = 0;                                     // hardware recomputes it
    tcp_hdr->cksum = rte_ipv4_phdr_cksum(ip_hdr, pkt->ol_flags);  // seed pseudo-header csum
    return pkt;
}

Performance Improvements:

  • Peak RPS: ~1.8M (single node)
  • CPU Utilization: ~85% (useful cycles)
  • Memory Usage: ~8GB (pre-alloc pools)
  • P95 Latency: ~12ms
  • Context Switches: −94%

That 94% reduction read like a typo the first time. We re-ran it on three different boxes with three different NICs because it felt indecent that a configuration change could be worth that much. But it’s not a trick; it’s just cutting out the expensive middleman. The hard parts: you own scheduling, safety nets shrink, and you must design at the level of rings, bursts, and backpressure. You discover your appetite for polls versus interrupts and learn that “busy-polling” is not a dirty phrase if it saves your tail.
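
To show what run-to-completion means in practice, here is a stripped-down worker loop; port and queue setup are omitted, and process_packet() is the rewrite routine from the earlier snippet. Treat it as a sketch of the shape, not a drop-in:

// Simplified run-to-completion worker: poll a burst, process each packet in
// place, transmit, free anything the TX ring didn't take.
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void lcore_main_loop(uint16_t port_id, uint16_t queue_id) {
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        // Busy-poll the RX queue: no interrupts, no context switches
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        for (uint16_t i = 0; i < nb_rx; i++)
            bufs[i] = process_packet(bufs[i]);   // in-place rewrite, no copies

        uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, bufs, nb_rx);

        // Backpressure: drop (and free) whatever the TX ring couldn't absorb
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}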

We also learned to treat mbuf lifecycle like gold. Free late, allocate smart, avoid fragmentation like you avoid scope creep. Packet pools per socket, per-core caches sized by measurement, not vibes.

Architecture 3: Distributed Multi-Core Architecture (2M–5M RPS)

DPDK gave us speed; NUMA + per-core gave us scale. Pin everything, allocate local, fear cross-socket hops. Treat your machine like a cluster where socket boundaries are network links in disguise.

struct lb_core {
    unsigned int core_id;
    struct rte_ring *rx_ring, *tx_ring;
    struct backend_pool *backends;
    struct connection_table *conn_table;
} __rte_cache_aligned;
static int init_lb_core(struct lb_core *core, unsigned int core_id) {
    int socket_id = rte_lcore_to_socket_id(core_id);
    core->core_id = core_id;
    // Allocate the connection table on this core's own NUMA node
    core->conn_table = rte_zmalloc_socket("conn_table",
        sizeof(struct connection_table), RTE_CACHE_LINE_SIZE, socket_id);
    if (core->conn_table == NULL)
        return -1;
    // Pin the worker thread to its core so its state stays socket-local
    cpu_set_t set; CPU_ZERO(&set); CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    return 0;
}

Final Performance:

  • Peak RPS: ~5.2M (8-core system)
  • CPU Utilization: ~92% (balanced)
  • Memory Usage: ~32GB (NUMA-optimized)
  • P95 Latency: ~8ms
  • L3 Miss Ratio: ~3.2%

We also separated concerns aggressively: RX parsing, classification, routing decision, TX; all per-core, with hot data structures sized to fit L2 where possible. The big “aha” was admitting that a global shared state — even read-mostly — was a tax we couldn’t afford. We pushed as much as possible into per-core shards and reconciled slowly in the background. Work-stealing? We tried it. At this RPS, the steal overhead often costs more than the imbalance, so we engineered the sources to be balanced instead.
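
A hedged sketch of that per-core-shard pattern, with names invented for illustration: each lcore touches only its own cache-line-sized slot on the hot path, and the control plane folds the shards together whenever it needs an approximate global view.

// Illustrative "per-core shard, reconcile later" pattern, not the production layout.
#include <rte_common.h>
#include <rte_lcore.h>
#include <stdint.h>

struct backend_shard {
    uint64_t in_flight;      // requests this core currently has on the backend
    uint64_t total_sent;
} __rte_cache_aligned;        // one cache line per core, no false sharing

static struct backend_shard shards[RTE_MAX_LCORE];

static inline void shard_note_dispatch(void) {
    struct backend_shard *s = &shards[rte_lcore_id()];  // core-local, no locks
    s->in_flight++;
    s->total_sent++;
}

// Called rarely, from the control plane: an approximate global view is fine
static uint64_t approximate_in_flight(void) {
    uint64_t sum = 0;
    for (unsigned int i = 0; i < RTE_MAX_LCORE; i++)
        sum += shards[i].in_flight;
    return sum;
}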

Critical Optimizations That Made the Difference

There were dozens (NIC RSS tuning, batching handshakes, smarter retry budgets), but these three rearranged the ceiling height.

Optimization 1: Zero-Copy Packet Processing

Copies are tiny betrayals at this scale. Kernel→userspace, parse buffers, header fiddling — each hop quietly detonates your bandwidth and dirties caches. Every “just memcpy this” is a small stone in the backpack that becomes a boulder at 5M RPS.

// Bad: kernel socket path, a copy (or full-buffer pass) at every hop
recv(fd, buffer, sz, 0);      // copy: kernel -> userspace
parse_http_request(buffer);   // another pass (often another copy) over the data
modify_headers(buffer);       // rewrite, dirtying the same cache lines again
send(fd2, buffer, sz, 0);     // copy: userspace -> kernel

// Good: DPDK zero-copy, modify the frame where the NIC wrote it
struct rte_mbuf *pkt = rte_pktmbuf_alloc(mbuf_pool);   // or taken straight from rte_eth_rx_burst()
char *data = rte_pktmbuf_mtod(pkt, char *);
modify_packet_in_place(data);
rte_eth_tx_burst(port_id, qid, &pkt, 1);

Impact:

  • Memory Bandwidth: 45GB/s → 12GB/s
  • L3 Hit Ratio: 77% → 91%
  • CPU Cycles/packet: −35%

This also forced cleaner architecture: “in-place or bust.” We stopped treating packets like blobs to decode and re-encode, and instead surgically modified what mattered. Alignment became a first-class citizen. Even the way we touched headers — read order, write order — was tuned to avoid false sharing and straddling cache lines.

Optimization 2: Connection Table Optimization

Lock-free, cache-aligned, power-of-two, linear probing, ABA-safe. It fought us; we won. The goal wasn’t cleverness — it was predictability. Constant-time-ish lookups with minimal memory traffic.

struct connection_entry {  
    uint32_t client_ip; uint16_t client_port;  
    uint32_t backend_ip; uint16_t backend_port;  
    uint64_t last_seen;  uint32_t flags;  
} __attribute__((packed, aligned(32)));
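
The lookup side looks roughly like this: an illustrative sketch using the entry struct above, with a power-of-two table, linear probing, and a bounded probe count. The real table also handles insert, delete, and ABA hazards; this only shows the hot read path.

// Sketch of the read path: mask instead of modulo, probes bounded by MAX_PROBE.
#include <stddef.h>
#include <stdint.h>

#define CONN_TABLE_SIZE (1u << 22)      // power of two: index with a mask
#define MAX_PROBE 16                    // bound chosen by measurement, not hope

struct connection_table {
    struct connection_entry entries[CONN_TABLE_SIZE];
};

static inline uint64_t conn_key(uint32_t ip, uint16_t port) {
    uint64_t k = ((uint64_t)ip << 16) | port;
    k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;   // cheap mixer
    return k;
}

static struct connection_entry *conn_lookup(struct connection_table *t,
                                            uint32_t ip, uint16_t port) {
    uint64_t h = conn_key(ip, port);
    for (uint32_t i = 0; i < MAX_PROBE; i++) {
        struct connection_entry *e = &t->entries[(h + i) & (CONN_TABLE_SIZE - 1)];
        if (e->client_ip == ip && e->client_port == port)
            return e;
        if (e->client_ip == 0 && e->client_port == 0)
            return NULL;                 // empty slot: key not present
    }
    return NULL;                         // probe bound exceeded
}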

Performance Characteristics:

  • Lookup Time: ~12ns avg (vs ~180ns unordered_map)
  • Memory: −40% via tight packing
  • Scalability: Linear to ~10M conns

We also learned to expire aggressively — but deterministically. Periodic scans by socket-local sweepers with bounded work per tick, not “GC storms” that freeze the world. And because scale makes rarely-colliding keys collide, our probing strategy picked a MAX_PROBE tuned by measurement, not hope.
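
Continuing that sketch, a bounded-work sweeper might look like this; the batch size and the cycle-based timestamps are illustrative, not our production values.

// Each tick scans at most SWEEP_BATCH slots from where it left off, so expiry
// cost is constant per tick instead of a periodic stop-the-world scan.
#include <rte_cycles.h>

#define SWEEP_BATCH 1024

struct sweeper {
    uint32_t cursor;                     // next slot to examine
    uint64_t ttl_cycles;                 // idle time after which an entry dies
};

static void sweep_tick(struct connection_table *t, struct sweeper *s) {
    uint64_t now = rte_rdtsc();
    for (uint32_t n = 0; n < SWEEP_BATCH; n++) {
        struct connection_entry *e = &t->entries[s->cursor];
        s->cursor = (s->cursor + 1) & (CONN_TABLE_SIZE - 1);
        if (e->client_ip != 0 && now - e->last_seen > s->ttl_cycles) {
            e->client_ip = 0;            // mark the slot free
            e->client_port = 0;
        }
    }
}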

Optimization 3: NUMA-Aware Memory Management

Allocate by socket. Treat cross-socket like a toll road. It’s incredible how much “mystery latency” is just a cache miss that took a bus ride.

packet_pools[socket_id] = rte_pktmbuf_pool_create(  
  "packet_pool", PACKET_POOL_SIZE, PACKET_CACHE_SIZE, 0,  
  RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);

NUMA Impact:

  • Memory Latency: −45%
  • Controller Efficiency: +60%
  • Scaling: linear across sockets

We mapped NIC queues to cores on the same socket, kept rings local, and audited every allocation path until “local or bust” was the default. When we had to cross sockets (rare), we batched it and paid once. The perf counters told the story: cross-socket traffic dipped, and P99 tails tightened without us changing a line of “business logic.”

Load Balancing Algorithms That Scale

Round-robin/least-conns choke on coordination. Consistent hashing + virtual nodes kept the hot path lock-free and failure churn sane. More importantly, it gave us stickiness without shared locks: the same client hash maps to the same backend, so connection reuse improves cache warmth on the backend, too.

#define VIRTUAL_NODES_PER_BACKEND 150
struct hash_node { uint32_t hash; uint32_t backend_id; };   // ring entries, kept sorted by hash
struct consistent_hash_ring {
    struct hash_node nodes[MAX_BACKENDS * VIRTUAL_NODES_PER_BACKEND];
    uint32_t node_count;
} __rte_cache_aligned;

// Lock-free backend selection (binary search on the sorted ring)
static inline uint32_t select_backend(struct consistent_hash_ring *ring, uint32_t client_hash) {
    uint32_t left = 0, right = ring->node_count - 1;
    if (client_hash > ring->nodes[right].hash)
        return ring->nodes[0].backend_id;          // wrap around the ring
    while (left < right) {
        uint32_t mid = (left + right) / 2;
        if (ring->nodes[mid].hash < client_hash) left = mid + 1;
        else right = mid;
    }
    return ring->nodes[left].backend_id;
}

Performance Benefits:

  • Synchronization: none on hot path
  • Distribution: ~99.2% uniform
  • Failover Churn: < 1% redistribution

That <1% redistribution kept backends warm and minimized cascading cache penalties. We versioned rings, updated them off the hot path, and swapped pointers atomically. Even failure promotion became predictable: less drama, more math.
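
The pointer-swap part is simple enough to sketch with C11 atomics (illustrative; safe reclamation of the old ring, RCU-style, is left out):

// Control plane builds a complete new ring, then publishes it with a release
// store; workers pick it up with an acquire load on the next request.
#include <stdatomic.h>

static _Atomic(struct consistent_hash_ring *) active_ring;

// Control plane: called on backend add/remove, never on the packet path
static void publish_ring(struct consistent_hash_ring *new_ring) {
    atomic_store_explicit(&active_ring, new_ring, memory_order_release);
}

// Data plane: one acquire load per request, no locks
static inline uint32_t route_request(uint32_t client_hash) {
    struct consistent_hash_ring *ring =
        atomic_load_explicit(&active_ring, memory_order_acquire);
    return select_backend(ring, client_hash);
}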

Monitoring and Observability at Scale

At 5M RPS, monitoring can become the bottleneck. We made it per-core, lock-free, and aggregated cautiously. You cannot afford global locks around counters; your “insight” will be the slowest path in the system.

High-Performance Metrics Collection

struct lb_stats {
    uint64_t requests_processed, bytes_forwarded;
    uint64_t connection_errors, backend_timeouts;
} __rte_cache_aligned;

// One slot per lcore; cache-line alignment keeps counters from false sharing
static struct lb_stats core_stats[RTE_MAX_LCORE];

static inline struct lb_stats *get_core_stats(unsigned int lcore_id) {
    return &core_stats[lcore_id];
}

static void aggregate_stats(void) {
    struct lb_stats total = (struct lb_stats){0};
    unsigned int lid;
    RTE_LCORE_FOREACH_WORKER(lid) {
        struct lb_stats *c = get_core_stats(lid);
        total.requests_processed += c->requests_processed;
        total.bytes_forwarded    += c->bytes_forwarded;
        total.connection_errors  += c->connection_errors;
        total.backend_timeouts   += c->backend_timeouts;
    }
    export_metrics(&total);
}

We exported deltas instead of absolute counters to cut payload size. Sampling intervals were tuned so we never exceeded ~1–2% overhead for observability. Anything heavier got downgraded or moved to on-demand profiling. Flame graphs and perf sampling live behind a feature flag — flip it, learn, flip it back. And yes, we rate-limited logs, tagged by core and socket, and wrote them to memory-mapped files to avoid I/O stalls on the dataplane.
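
The delta-export idea, roughly (export_delta() stands in for whatever transport your metrics pipeline uses):

// Keep the last snapshot, ship only the difference each interval.
static struct lb_stats last_exported;

static void export_stats_delta(const struct lb_stats *total) {
    struct lb_stats delta = {
        .requests_processed = total->requests_processed - last_exported.requests_processed,
        .bytes_forwarded    = total->bytes_forwarded    - last_exported.bytes_forwarded,
        .connection_errors  = total->connection_errors  - last_exported.connection_errors,
        .backend_timeouts   = total->backend_timeouts   - last_exported.backend_timeouts,
    };
    last_exported = *total;
    export_delta(&delta);    // hypothetical sink: push to the metrics pipeline
}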

Critical Metrics for 5M RPS Systems:

  • Per-core RPS (first sign of imbalance; drift means your queues or RSS are off)
  • Memory bandwidth per socket (when this pegs, everything else lies)
  • L1/L2/L3 cache hit ratios (the truth about your data structures)
  • Cross-socket traffic (your NUMA tax statement)
  • Packet drop rates (ingress/egress separately; drops are truths, not insults)
  • TLS handshake rate and resumption hit rate (crypto pain index)
  • Retries per backend and per cause (throttle the self-inflicted wounds)

We also tracked a quiet hero: queue depth histograms on RX/TX per core. It’s how we spotted microbursts and tuned coalescing. Bursts don’t show in averages; they show in the short, sharp spikes that blow your tail.
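
A cheap per-core histogram of RX burst sizes is enough to surface those microbursts; a sketch, with bucket edges picked arbitrarily here:

// Cheap enough to update on every poll; the shape, not the average, exposes bursts.
#include <stdint.h>

#define BURST_BUCKETS 8   // buckets: 0-1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, >64

struct burst_histogram {
    uint64_t buckets[BURST_BUCKETS];
};

static inline void burst_histogram_add(struct burst_histogram *h, uint16_t nb_rx) {
    unsigned int b = 0;
    uint16_t edge = 1;
    while (b < BURST_BUCKETS - 1 && nb_rx > edge) {   // log2-ish bucket edges
        edge <<= 1;
        b++;
    }
    h->buckets[b]++;
}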

Lessons Learned: What I Wish I Knew Earlier

Lesson 1: The 80/20 Rule Doesn’t Apply

At the edge, “edge cases” are the workload. The rare path is the one that melts your caches. Optimize the weird paths too. It’s exhausting — but cheaper than firefighting.

Lesson 2: Memory is the New CPU

We spent ~60% of effort on bandwidth and locality. It paid for everything else. Profile memory controllers, not just cores. Learn to love perf stat output you used to skip.

Lesson 3: Premature Optimization vs Premature Dismissal

Some “too low-level” work becomes table stakes at scale. Smell the inflection point: rising L3 misses, NIC queues nudging limits, TLS churn becoming nontrivial. That’s when “later” becomes “now.”

Lesson 4: Testing at Scale is Different

100K RPS tests don’t predict 5M RPS behavior. Emulate burstiness, TLS renegotiation storms, cache dirt, backend brownouts, and NIC queue saturation. If your generator can’t produce pathologies, it’s not a load test — it’s a vibe check.

Lesson 5: Hardware Matters More Than Software

DDR4-2400 → DDR4-3200: ~15% throughput. CPU uarch swings 40%+. PCIe lane layout, NUMA topology, NIC RSS capabilities — these are first-class design constraints, not purchase order trivia. You cannot software your way out of a memory wall you bought.

Decision Framework: When to Optimize for Extreme Scale

Implement High-Performance Architecture When:

  • Traffic projection > 1M RPS in ≤12 months (honest projection, not marketing)
  • P99 < 10ms under real load (with TLS, retries, bursts)
  • Perf work beats the cost of throwing hardware (TCO, not list price)
  • Team has low-level chops or appetite (someone must love perf counters)
  • Reliability demands no degradation during spikes (your SLOs mean it)

Standard Load Balancers Suffice When:

  • Steady state < 500K RPS and growth sane
  • P99 ≥ 50ms acceptable and consistent
  • Feature velocity > raw perf (ship features, not cache lines)
  • Operational simplicity wins (managed LB services are fine technology)
  • You can buy time with autoscaling without drowning in tail latency

The point isn’t ideology. It’s fit. Extreme scale pays off only when it actually shows up — and stays.

The 5M RPS Reality

This wasn’t “do the same faster.” It was “do different things entirely.” Bypass layers you once revered, treat memory like a network, count nanoseconds like currency. The paradox: what feels like over-engineering at 50K becomes survival at 5M. The trick is leaving yourself an escape hatch — design optionality so future-you doesn’t have to chainsaw the foundations. A few we were grateful for: the ability to flip to kernel bypass behind a flag, the ability to shard per core without changing semantics, and the freedom to swap load balancing strategies without rewiring the world.

At this scale every best practice gets cross-examined. Some hold; many don’t. “One big lock” becomes a horror story. “Global counters” turn into slow-motion denial-of-service. Even your dashboards, if naive, become a self-inflicted wound. The teams that win measure honestly, change their minds quickly, and optimize where the physics actually is — on the wire, in the cache, across the socket boundary.

And yes, I sleep better now. Not because it’s perfect (ha), but because the scary parts are visible and contained. We have names for them. We have graphs for them. We have knobs we trust. That might be the quiet superpower of 5M RPS: the humility it teaches. You stop arguing with the hardware and start listening to it.


Enjoyed the read? Let’s stay connected!

  • 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
  • 💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
  • ⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.

Your support means the world and helps me create more content you’ll love. ❤️
