Go Was the Constraint: How We Discovered the Language Was Our Scaling Bottleneck

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our treasure-hunt engine, running on Go 1.21 and a three-layer microservice stack, was supposed to scale to 50,000 concurrent connections with sub-50 ms p99 latency. We had engineered around every other obvious constraint: connection pooling, sharded Redis clusters with write-behind caching, and a bespoke lock-free ring buffer for the move stream. Yet every Friday night, when North American players came online, the GC would kick in and jitter would spike above 80 ms. pprof showed 38 % of wall time inside the sweep phase and 12 % inside mark termination. We measured the heap at 7.6 GB per instance, even though live objects accounted for only 1.4 GB. The rest was fragmented or pinned.

Worse, the Go runtime exposes only coarse knobs for GC pacing. We tried GOGC=25, GOGC=10, even GOMEMLIMIT=4GiB, but we still saw stop-the-world bursts around every cycle. The allocator underneath Go's page heap (the runtime's own tcmalloc-derived allocator, not jemalloc) was coalescing small arenas so aggressively that the allocation latency histogram developed a fat tail beyond 2 ms per allocation.

What We Tried First (And Why It Failed)

First we attacked the symptom: we tuned the GOGC knob downward and increased GOMEMLIMIT in 500 MiB increments. At GOMEMLIMIT=3.2GiB the GC frequency doubled, but the pause times dropped to 22 ms. Unfortunately, the heap fragmentation increased the RSS by 22 %, which forced us to shrink the shard count from 32 to 24 per AZ. That meant fewer players per cluster and higher cross-AZ traffic during the daily spike.

Next we tried replacing the in-memory ring buffer with a C++ shim using Boost.Lockfree. We pinned the worker threads to dedicated cores, used HugeTLB pages, and used mimalloc for the shim's allocations (the Go runtime's own allocator cannot be swapped out). The lock-free queue reduced move-processing latency from 1.4 ms to 0.9 ms at p95. Yet GC pauses still dominated the tail. The Go runtime was still accounting for 27 % of wall time, and mimalloc showed 8 % of allocations above 300 µs. We were optimizing the wrong layer.

The Architecture Decision

In a meeting that lasted 78 minutes, we forced ourselves to ask the real question: Was the runtime the constraint? We built a single-threaded prototype in Rust that processed the same move stream with a lock-free MPSC channel backed by a custom arena allocator. We ran it under perf with flamegraph, and the picture was unambiguous: no GC, no stop-the-world, and the allocator latency histogram was flat at 45 µs p99. The binary grew by 280 KB but the RSS shrank from 3.4 GB to 2.1 GB because we could control arena growth explicitly.
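A prototype of this shape can be sketched with the standard library alone. The sketch below is an illustration under stated assumptions, not the IronHoard code: it swaps in std::sync::mpsc for the custom lock-free channel, replaces the arena with a single preallocated Vec, and the Move struct and run_move_stream function are hypothetical names invented for the example.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical move record; the real message layout is not shown in this post.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Move {
    pub player_id: u32,
    pub x: i16,
    pub y: i16,
}

/// Spawns `producers` threads that each send `per_producer` moves, then
/// drains the channel on the calling thread (the single consumer).
/// Returns the number of moves processed.
pub fn run_move_stream(producers: u32, per_producer: u32) -> usize {
    let (tx, rx) = mpsc::channel::<Move>();
    let mut handles = Vec::new();

    for p in 0..producers {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for i in 0..per_producer {
                let m = Move { player_id: p, x: (i % 1000) as i16, y: 0 };
                tx.send(m).expect("receiver is still alive");
            }
        }));
    }
    // Drop the original sender so the receive loop ends once all producers finish.
    drop(tx);

    // Allocate the buffer once, up front, so the hot loop never reallocates.
    let total = producers as usize * per_producer as usize;
    let mut batch: Vec<Move> = Vec::with_capacity(total);
    for m in rx {
        batch.push(m);
    }

    for h in handles {
        h.join().expect("producer thread panicked");
    }
    batch.len()
}
```

Calling run_move_stream(4, 1000) pushes 4,000 moves through the channel with a single up-front allocation on the consumer side, which is the property that makes allocator latency predictable in a design like this.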

We chose Rust for the new treasure-hunt service layer. We kept the Go layer for matchmaking and lobby management because their concurrency patterns were trivially parallel and benefited from Go's scheduler for short-lived goroutines. The Rust service, codenamed IronHoard, became the move-dispatcher: a 24-core tokio runtime with a sharded work-stealing scheduler, custom arenas of 64 MiB each, and a lock-free channel to the Go layer via gRPC over shared memory (using boringtun for zero-copy framing).

The switch was not free. We burned three weeks porting the move-validation logic and another two weeks tuning the allocator. Our arena implementation was bitten by Vec::reserve's amortized growth, which is documented behavior rather than a bug: reserve may over-allocate to keep repeated pushes cheap, and one such oversized allocation fragmented the heap and doubled the p99 latency for one shard. We fixed it by switching to a bump-pointer region tied to the MPSC queue, built on the std::alloc::Allocator trait (still unstable, behind the nightly allocator_api feature).
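To make the bump-pointer idea concrete, here is a minimal sketch in safe, stable Rust. It is not the IronHoard allocator: it does not implement the std::alloc::Allocator trait (which needs nightly), it hands out byte ranges instead of raw pointers, and the BumpArena type and its methods are hypothetical names. What it does show is the key property: allocation is an offset increment, and the region never grows behind your back.

```rust
use std::ops::Range;

/// Fixed-capacity bump arena. Allocation only advances an offset, and the
/// whole region is released at once with `reset`.
pub struct BumpArena {
    buf: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    /// Allocates the backing buffer once; it is never resized afterwards.
    pub fn with_capacity(capacity: usize) -> Self {
        BumpArena { buf: vec![0u8; capacity], offset: 0 }
    }

    /// Reserves `len` bytes at an offset aligned to `align` (a power of two,
    /// measured from the start of the buffer). Returns None when the arena
    /// is full instead of growing.
    pub fn alloc(&mut self, len: usize, align: usize) -> Option<Range<usize>> {
        assert!(align.is_power_of_two(), "align must be a power of two");
        // Round the current offset up to the next multiple of `align`.
        let start = self.offset.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        self.offset = end;
        Some(start..end)
    }

    /// Mutable view of a previously allocated range.
    pub fn bytes_mut(&mut self, range: Range<usize>) -> &mut [u8] {
        &mut self.buf[range]
    }

    /// Bytes consumed so far, including alignment padding.
    pub fn used(&self) -> usize {
        self.offset
    }

    /// Frees every allocation at once by rewinding the offset.
    pub fn reset(&mut self) {
        self.offset = 0;
    }
}
```

A consumer would typically call reset after each batch drained from the queue, so memory use is bounded by the arena capacity rather than by a growth policy such as the one behind Vec::reserve.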

What The Numbers Said After

Under the same load of 15,000 concurrent connections that we ran before the meltdown:

Go service CPU usage dropped from 68 % to 42 % because IronHoard handled the move stream directly.
IronHoard allocator latency (measured via jemalloc-rs and malloc_stats_print): p99 48 µs, p99.9 92 µs.
GC jitter in the Go layer vanished; Go p99 GC pause time fell to 1.2 ms, and the tail dropped below 3 ms.
RSS per IronHoard pod stabilized at 1.9 GiB, a 24 % reduction from the mimalloc shim.
The cross-AZ traffic decreased 18 % because IronHoard shrank the queue depth.
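For readers who want to reproduce figures like p99 and p99.9 from raw samples, one common approach is the nearest-rank percentile. The sketch below is an assumption about method, not a description of how the numbers above were produced; the percentile function is a hypothetical helper, and it takes the percentile in per-mille (990 for p99, 999 for p99.9) so the rank is computed with integer arithmetic only.

```rust
/// Nearest-rank percentile over latency samples (for example, microseconds).
/// `permille` is the percentile times ten: 990 means p99, 999 means p99.9.
/// Sorts the slice in place. Returns None for empty input or permille > 1000.
pub fn percentile(samples: &mut [u64], permille: u64) -> Option<u64> {
    if samples.is_empty() || permille > 1000 {
        return None;
    }
    samples.sort_unstable();
    let n = samples.len() as u64;
    // rank = ceil(n * permille / 1000), clamped to at least 1.
    let rank = ((n * permille + 999) / 1000).max(1);
    Some(samples[(rank - 1) as usize])
}
```

With 1,000 samples, p99 is the 990th smallest value and p99.9 is the 999th, so a single slow allocation in a thousand is enough to move p99.9 while leaving p99 untouched.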

In production under peak load of 48,000 concurrent players:

IronHoard p99 latency: 14 ms, p99.9: 38 ms.
Go matchmaking layer p99 latency: 22 ms, p99.9: 62 ms.
Total cost per 1,000 player-minutes dropped 14 % because we halved the number of Go pods per AZ.

The flamegraph told the story: 62 % of IronHoard's wall time was spent in the lock-free MPSC channel, 23 % in the custom arena, and 11 % in the SHA-256 move validation. There was no GC flame.

What I Would Do Differently

I would not have trusted the runtime to protect me from myself. We overestimated the Go scheduler's ability to manage 110,000 concurrent connections per pod while keeping GC pauses below 30 ms. The allocator hidden inside the Go runtime became our invisible bottleneck; we only saw it when we measured allocation latency with jemalloc-rs