The Day the Rust Runtime Saved Us From a 24-Hour Tail Latency Regression

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We were running a 12-node Kubernetes cluster in GKE, each node an n2-standard-8 with 30 GB RAM. The engine processed player move events, built a real-time visibility graph, and broadcast updates to 120,000 concurrent sessions. The service was written in Go 1.21, using Gin for HTTP and a custom event bus powered by Kafka. We had tuned the GC to pause for less than 1 ms every 5 seconds, but at 140,000 moves per second the allocator started sweating.

The profiler screenshot from 03:17 showed:

    flat   flat%    sum%      cum    cum%
    4.2s   18.3%   18.3%    12.4s   54.1%  syscall.Syscall
    2.8s   12.2%   30.5%     8.9s   38.8%  runtime.mallocgc
    1.5s    6.5%   37.0%     4.2s   18.3%  runtime.chanrecv

The channel receiver was blocking on a 32 KB batch read from Kafka while holding a global lock. There were 8.7 M objects in flight, totaling 1.4 GB. We had seen this movie before: a GC cycle, a channel wake-up storm, then 24 seconds of stutter.

What We Tried First (And Why It Failed)

We tried three things in the next 90 minutes.

First, we doubled the number of Kafka partitions from 120 to 240 and increased consumers to 240. The head-of-line blocking dropped, but now the Go runtime was doing 120,000 context switches per second and TCP retransmits spiked from 12 to 217 in 30 seconds.

Second, we tuned the GC with GOMEMLIMIT=24GiB and GOGC=25. The max GC pause dropped to 3.1 ms, but the 99.9th percentile stayed at 150 ms because the scheduler was fighting over run queues. We lost 700 MB of RSS to fragmentation in 3 minutes.

Third, we replaced Gin with Fiber, hoping the zero-allocation router would help. Latency fell to 42 ms for three minutes. Then the Go allocator paused for 7.4 ms during an sbrk syscall, and everything locked up again.

By 05:00 AM we had burned 78 CPU-minutes, one on-call engineer, and half a terabyte of network egress. The system was still broken, and we were out of knobs in Go.

The Architecture Decision

At 05:12 AM we made the call: rewrite the event bus and visibility engine in Rust. Not a partial port, but a clean slate with tokio 1.28, flume for channels, and a custom sharded visibility graph using hashbrown with the Fx hasher (FxHashMap).

We chose Rust because we needed:

no stop-the-world GC pauses
bounded worst-case latency for channel wake-ups (< 100 µs)
deterministic memory layout for cache locality
async tasks that don't leak wake-ups into the scheduler

The tradeoff was three engineer-weeks of Rust training and losing Go's cross-language safety net. We accepted the risk because the Go runtime had become the constraint, not our domain logic.

We started with the visibility graph: an immutable arena of 120,000 players, updated via atomic swaps. The first benchmark on an n2-standard-4 showed:

latency 50% 32 µs
latency 99% 147 µs
latency 99.9% 210 µs
allocs 14,280 per second

Better, but still too high. Profiling with perf showed 32% of time in lock cmpxchg on the arena's atomic pointer. We switched to a sharded graph: 24 shards, each with its own atomic arena. The lock contention dropped to 1.4%, and the 99.9th percentile fell to 112 µs.
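A minimal sketch of the sharding idea, not the production code. It uses only the standard library, so a per-shard `RwLock<Arc<...>>` stands in for the atomic arena pointer, and `DefaultHasher` stands in for the Fx hasher. The names `ShardedGraph`, `PlayerId`, `Visible`, and `Snapshot` are hypothetical; only the shard count of 24 comes from the text above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, RwLock};

// Hypothetical types: the real data layout is not shown in this post.
type PlayerId = u64;
type Visible = Vec<PlayerId>;
type Snapshot = HashMap<PlayerId, Visible>;

const SHARDS: usize = 24; // shard count taken from the text

/// Each shard publishes an immutable snapshot behind its own lock, so a
/// writer on one shard never contends with readers of another shard.
struct ShardedGraph {
    shards: Vec<RwLock<Arc<Snapshot>>>,
}

impl ShardedGraph {
    fn new() -> Self {
        let shards = (0..SHARDS)
            .map(|_| RwLock::new(Arc::new(Snapshot::new())))
            .collect();
        ShardedGraph { shards }
    }

    /// Maps a player to a shard by hashing the id.
    fn shard_of(id: PlayerId) -> usize {
        let mut h = DefaultHasher::new();
        id.hash(&mut h);
        (h.finish() % SHARDS as u64) as usize
    }

    /// Readers clone the Arc (a refcount bump) and release the lock at once;
    /// the snapshot they hold never changes underneath them.
    fn snapshot(&self, id: PlayerId) -> Arc<Snapshot> {
        Arc::clone(&self.shards[Self::shard_of(id)].read().unwrap())
    }

    /// Writers copy the shard's current snapshot, modify the copy, and swap
    /// it in. The write lock is held for the whole copy to keep this simple.
    fn update(&self, id: PlayerId, visible: Visible) {
        let mut guard = self.shards[Self::shard_of(id)].write().unwrap();
        let mut next: Snapshot = (**guard).clone();
        next.insert(id, visible);
        *guard = Arc::new(next);
    }
}

fn main() {
    let graph = ShardedGraph::new();
    let before = graph.snapshot(7);
    graph.update(7, vec![1, 2, 3]);
    let after = graph.snapshot(7);
    // The old snapshot is untouched; the new one carries the update.
    println!("before: {:?}, after: {:?}", before.get(&7), after.get(&7));
}
```

Copying a whole shard on every update is only reasonable when shards are small or updates are batched; a lock-free pointer swap would shorten the write-side critical section further.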

Next, we replaced flume's unbounded channels with bounded ring buffers (1024 slots). We ran bcc-tools biolatency on the Kafka device and saw:

     usec                : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 1        |                                        |
         4 -> 7          : 58       |*                                       |
         8 -> 15         : 112      |***                                     |
        16 -> 31         : 298      |*********                               |
        32 -> 63         : 712      |***********************                 |
        64 -> 127        : 1,234    |****************************************|
       128 -> 255        : 876      |****************************            |

Latency stabilized at 89 µs for moves, with no runtime pauses above 50 µs.
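The point of moving from unbounded to bounded buffers is backpressure: once all 1024 slots are taken, the producer finds out immediately instead of letting the queue grow. A rough sketch of that behavior, assuming the standard library's `sync_channel` as a stand-in for the ring buffer we used; `MoveEvent`, `offer`, and `fill` are hypothetical names for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical event type; the real move schema is not shown in this post.
#[derive(Debug, Clone, PartialEq)]
struct MoveEvent {
    player: u64,
    x: i32,
    y: i32,
}

const SLOTS: usize = 1024; // capacity taken from the text

/// Tries to enqueue without blocking. Returns false when the buffer is full
/// (or the receiver is gone) so the caller can shed load or retry later.
fn offer(tx: &SyncSender<MoveEvent>, ev: MoveEvent) -> bool {
    match tx.try_send(ev) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => false,
        Err(TrySendError::Disconnected(_)) => false,
    }
}

/// Offers `n` events and reports how many the buffer accepted.
fn fill(tx: &SyncSender<MoveEvent>, n: usize) -> usize {
    (0..n)
        .filter(|&i| offer(tx, MoveEvent { player: i as u64, x: 0, y: 0 }))
        .count()
}

fn main() {
    let (tx, rx): (SyncSender<MoveEvent>, Receiver<MoveEvent>) = sync_channel(SLOTS);

    // With nobody draining, only SLOTS events fit; the rest are rejected.
    let accepted = fill(&tx, SLOTS + 10);
    println!("accepted {} of {}", accepted, SLOTS + 10);

    // Draining one event frees exactly one slot.
    let first = rx.recv().unwrap();
    println!("drained {:?}", first);
    println!("room for one more: {}", offer(&tx, MoveEvent { player: 1, x: 1, y: 1 }));
}
```

Whether a rejected event is dropped, retried, or coalesced with a newer move for the same player is a policy choice that depends on the workload.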

What The Numbers Said After

We deployed the Rust event bus to staging at 11:00 AM and ran a 1-hour load test at 200,000 moves per second. The results:



Metric       Go 1.21    Rust (tokio 1.28)
p50          3.4 ms     42 µs
p95          18 ms      68 µs
p99          184 ms     91 µs
p99.9        210 ms     118 µs
RSS growth   +1.2 GB    +340 MB
Allocs/sec 14
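
For readers who want to reproduce percentile rows like the ones above from raw samples, here is a small sketch using the nearest-rank method. It is an assumption that this matches how a given load-testing tool computes its percentiles; many tools use histograms or interpolation instead. The function name `percentile` and the per-mille parameter are illustrative choices, and the sample data is synthetic.

```rust
/// Nearest-rank percentile over latencies (in microseconds) that are already
/// sorted ascending. `per_mille` is the percentile times ten, so p99.9 is 999;
/// integer arithmetic avoids floating-point rounding at the rank boundary.
fn percentile(sorted_us: &[u64], per_mille: usize) -> Option<u64> {
    if sorted_us.is_empty() || per_mille > 1000 {
        return None;
    }
    // rank = ceil(per_mille / 1000 * n), clamped to at least 1
    let rank = (per_mille * sorted_us.len() + 999) / 1000;
    Some(sorted_us[rank.max(1) - 1])
}

fn main() {
    // Synthetic samples for illustration only: 1 µs through 1000 µs.
    let mut samples: Vec<u64> = (1..=1000).collect();
    samples.sort_unstable();

    for (label, pm) in [("p50", 500), ("p95", 950), ("p99", 990), ("p99.9", 999)] {
        println!("{}: {:?} µs", label, percentile(&samples, pm));
    }
}
```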
