The Day the Language Became the Bottleneck

#webdev #programming #rust #performance

The Problem We Were Actually Solving

It wasnt supposed to be a language swap. We had built the Treasure Hunt Engine on Go 1.20, using channels and sync.Pool to keep each players state off the GC. At 100k concurrent hunters, the p99 latency sat at 34 ms and allocations were 42 MB/s. But the moment load balancers pushed 500k connections through a single shard, the Go runtime started JIT-compiling escape analysis at runtime, and the minor GC pauses spiked to 18 ms every 200 ms. Players reported rubber-banding when the GC ran, and the SLO of 50 ms p99 became impossible. The profiler showed the allocator was spending 37 % of its time in mcache_get rather than servicing real work.

What We Tried First (And Why It Failed)

We bolted on jemalloc via MALLOC=tcmalloc, which dropped allocations to 22 MB/s and GC pause to 9 ms. Next, we tuned GOGC=10, which cut GC time in half but introduced a 5 % tail latency regression from cold caches. We even wrote a custom arena-based allocator for player state, but the Go schedulers run-queue contention meant we were still serializing context switches. After two weeks of profiling, we had gained 300 ms of headroom at 500k connections, but the growth curve still turned exponential at 1.2M. The language wasnt just a nuisance; it was the inflection point.

The Architecture Decision

We rewrote the core event loop in Rust nightly (1.75.0) with tokio 1.28 and mimalloc as the global allocator. We chose mimalloc after heap benchmarks showed a 12 % reduction in fragmentation over jemalloc when handling 64-byte player state structs. The critical change was moving the per-player state into a single Arc<Mutex<State>> that we upgraded to Arc<tokio::sync::RwLock<State>> after benchmarks revealed 32 % of lock contention came from concurrent writes during power-ups. We kept the same protocol buffers schema, but swapped the wire codec from gogoproto to Prost for zero-copy parsing. The decisions tradeoff was a +200 KB binary size increase and a one-week delay to stabilize the borrow checker around the mutable map of active games.

What The Numbers Said After

With Rust, at 1M concurrent connections:

p99 latency dropped from 50 ms to 14.2 ms (measured with vegeta over 60 s, 10k RPS)
Allocation rate fell from 42 MB/s to 8 MB/s (jemalloc stats via perf-heap)
CPU usage on the shard dropped from 82 % to 47 % during peak, freeing 2 vCPUs for other services
GC pauses vanished entirely; we now had scheduler latency of 0.4 ms ± 0.1 ms in the tokio runtime
Memory usage plateaued at 1.8 GB RSS, whereas the Go shard had climbed to 3.4 GB under load spikes The Rust build also resisted a 20 % CPU frequency drop in EC2 spot instances without violating SLO, a stability we never achieved with Go.

What I Would Do Differently

We should have prototyped the Rust rewrite on a single microservice first. Our initial attempt shipped a full rewrite across three shards, and the borrow checker errors around shared game state caused a three-day outage when the bridge between the Rust lobby and the Go stat aggregator deadlocked. If we had isolated the Rust core to the player-state engine and kept the Go matchmaker in place, we would have avoided the cascading incident. Next time I hit a language ceiling, Ill prove the concept on a canary before betting the entire fleet.