The Day the Go Runtime Became the Bottleneck in Our Treasure Hunt Engine

#webdev #programming #rust #performance

The Problem We Were Actually Solving

It was 2024 and Veltrix's flagship treasure hunt engine was choking under 2,000 concurrent players. The CTO had set a hard latency budget: p99 < 200 ms from event publish to each player's mobile client. We were running on Go 1.21, using the standard net/http server with a fixed number of OS threads and a 128 MB heap ceiling. At 1,800 players the median latency was 85 ms, but p99 was 380 ms and climbing. Profiling with pprof showed 42 % of CPU time spent inside the scheduler, not in our business logic. The GC, while low-latency, was still costing us 12 ms out of every 100 ms cycle under load. Our in-house metrics dashboard lit up red every 90 seconds with a single message: "Bottleneck: Goroutine preemption".
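
For readers who want to check a budget like this against their own samples, here is a minimal sketch of a nearest-rank percentile. The function names are hypothetical and the sample values in the usage note are illustrative, not our production data. Percentiles are passed in per-mille (990 for p99, 999 for p99.9) so the rank stays in integer arithmetic.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `permille` is the percentile times ten: 500 = median, 990 = p99, 999 = p99.9.
/// Returns None for an empty sample set or an out-of-range percentile.
fn percentile_ms(samples_ms: &[u64], permille: u64) -> Option<u64> {
    if samples_ms.is_empty() || permille == 0 || permille > 1000 {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as u64;
    // 1-based rank = ceil(permille * n / 1000), computed without floats.
    let rank = (permille * n + 999) / 1000;
    Some(sorted[(rank - 1) as usize])
}

/// True when the p99 of the samples is strictly under the budget.
fn within_budget(samples_ms: &[u64], budget_ms: u64) -> bool {
    percentile_ms(samples_ms, 990).map_or(false, |p99| p99 < budget_ms)
}
```

With 98 samples at 85 ms and 2 samples at 380 ms, the median is 85 ms but the p99 is 380 ms, so `within_budget(&samples, 200)` returns false. A healthy median can hide a broken tail in exactly this way.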

We had already tuned buffer pools, adjusted write timeouts, and even recompiled the Go runtime to disable the write barrier for our use case. None of it moved the needle. The Go scheduler simply couldn't keep 2,000 goroutines (each doing a lightweight pub/sub fan-out) fairly scheduled on 8 physical cores. We had hit the ceiling of what the runtime's scheduler could give us.

What We Tried First (And Why It Failed)

First we tried sharding the players across 8 separate Go processes, each bound to a CPU core in a shared-nothing architecture. The p99 latency dropped to 240 ms, but the spread around it widened to ±70 ms because the shards couldn't rebalance load in real time. The GC pauses were now per-process, so total GC time actually increased from 12 ms to 19 ms per cycle across the fleet.
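
A minimal sketch of why pinned shards cannot rebalance, assuming players are assigned by `player_id % shards`. That rule is an assumption for illustration; the function names are hypothetical too.

```rust
/// Static shard assignment: a player is pinned to one process for its lifetime.
/// Assumption: assignment is `player_id % shards`.
fn shard_for(player_id: u64, shards: u64) -> u64 {
    assert!(shards > 0, "need at least one shard");
    player_id % shards
}

/// Count how many currently active players land on each shard.
fn shard_loads(active_players: &[u64], shards: usize) -> Vec<usize> {
    let mut loads = vec![0usize; shards];
    for &id in active_players {
        loads[shard_for(id, shards as u64) as usize] += 1;
    }
    loads
}
```

If the players who happen to be active are 0, 8, 16 and 24, all four land on shard 0 and the other seven processes sit idle. Nothing in a shared-nothing layout can move them while the hunt is running.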

Next we experimented with an async, event-driven style, switching from goroutines to an event-loop library we wrote on top of Go's epoll wrapper. The p99 came in at 280 ms, better than the 380 ms baseline but worse than sharding, and we introduced a new problem: the event loop ran single-threaded, so pushing enough traffic to fill a 10 Gbps NIC saturated that one core. The other 7 cores sat idle, and horizontal scaling still required nginx round-robin, which didn't respect per-player state locality.

Finally we tried a hybrid: Go for the lightweight admin API and Rust for the heavy fan-out path. The idea was to keep Go's ergonomics where they mattered and Rust's predictability for CPU-bound work. Scheduler CPU usage in the Go processes immediately dropped to 12 %, but we hit a new failure mode: the unbuffered channel between Go and Rust introduced 6-12 ms of latency jitter on every cross-language call. The CTO's budget was now 180-220 ms, so that jitter alone blew it.

The Architecture Decision

We ran a 48-hour spike on Rust-only, replacing the entire fan-out engine with Tokio 1.36, a custom slab allocator, and a lock-free ring buffer for cross-thread message passing. The release build set lto = true and codegen-units = 1 so the monomorphised hot paths could be optimised as a single unit, which helped keep them hot in L1 cache. We turned off jemalloc's background thread and pinned arena allocation to a single NUMA node to cut cross-node traffic on our bare-metal cluster.
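
To make "lock-free ring buffer" concrete, here is a minimal single-producer, single-consumer sketch using only the standard library. It is not the engine's actual buffer: the names `ring`, `Producer` and `Consumer` are hypothetical, and restricting it to one producer and one consumer is a simplifying assumption that sidesteps the multi-writer contention described in the next paragraph.

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct Ring<T> {
    slots: Box<[UnsafeCell<MaybeUninit<T>>]>,
    mask: usize,       // capacity - 1; capacity is a power of two
    head: AtomicUsize, // next slot to read; written only by the consumer
    tail: AtomicUsize, // next slot to write; written only by the producer
}

// SAFETY: the Producer/Consumer split guarantees that a slot is touched by
// one thread at a time, with hand-off ordered by the Acquire/Release pairs.
unsafe impl<T: Send> Sync for Ring<T> {}

pub struct Producer<T>(Arc<Ring<T>>);
pub struct Consumer<T>(Arc<Ring<T>>);

/// Create a bounded SPSC ring; `capacity` must be a power of two.
pub fn ring<T>(capacity: usize) -> (Producer<T>, Consumer<T>) {
    assert!(capacity.is_power_of_two(), "capacity must be a power of two");
    let slots = (0..capacity)
        .map(|_| UnsafeCell::new(MaybeUninit::uninit()))
        .collect::<Vec<_>>()
        .into_boxed_slice();
    let shared = Arc::new(Ring {
        slots,
        mask: capacity - 1,
        head: AtomicUsize::new(0),
        tail: AtomicUsize::new(0),
    });
    (Producer(Arc::clone(&shared)), Consumer(shared))
}

impl<T> Producer<T> {
    /// Returns the value back in `Err` when the ring is full, so the caller can retry.
    pub fn try_push(&mut self, value: T) -> Result<(), T> {
        let r = &*self.0;
        let tail = r.tail.load(Ordering::Relaxed);
        let head = r.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == r.slots.len() {
            return Err(value);
        }
        // SAFETY: the slot at `tail` is empty and only this producer writes it.
        unsafe { (*r.slots[tail & r.mask].get()).write(value) };
        r.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }
}

impl<T> Consumer<T> {
    pub fn try_pop(&mut self) -> Option<T> {
        let r = &*self.0;
        let head = r.head.load(Ordering::Relaxed);
        let tail = r.tail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        // SAFETY: the slot at `head` was published by the producer's Release store.
        let value = unsafe { (*r.slots[head & r.mask].get()).assume_init_read() };
        r.head.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}

impl<T> Drop for Ring<T> {
    fn drop(&mut self) {
        // Drop any items that were pushed but never popped.
        let (mut head, tail) = (*self.head.get_mut(), *self.tail.get_mut());
        while head != tail {
            unsafe { self.slots[head & self.mask].get_mut().assume_init_drop() };
            head = head.wrapping_add(1);
        }
    }
}
```

Typical use is `let (mut tx, mut rx) = ring::<Event>(1024);`, where `Event` stands in for your message type. Move `tx` into the publishing thread and `rx` into the delivery thread. A full ring hands the value back rather than blocking, so the retry policy stays with the caller.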

The config change that bit us the hardest was the Tokio worker thread count. We started with one worker per core, but contention on the lock-free ring buffer caused 15 % of sends to retry. Profiling with perf revealed 8 % of CPU time in _raw_spin_lock. We dropped to 6 workers on the 8 physical cores, which reduced contention, and the retry rate fell to 2 %. The scheduler latency histogram flattened.
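
In Tokio the worker count is set with `worker_threads` on `tokio::runtime::Builder`. The sketch below does not use Tokio. It uses plain std threads to show one way to measure a retry rate as the number of contending workers changes. The function name is hypothetical, and a bare compare-and-swap on a shared index is only a stand-in for our ring buffer's send path.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn `workers` threads that each claim `per_worker` slots from a shared
/// index via compare-and-swap. Returns (total sends, total retries).
fn measure_retries(workers: usize, per_worker: u64) -> (u64, u64) {
    let tail = Arc::new(AtomicU64::new(0));
    let retries = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let tail = Arc::clone(&tail);
            let retries = Arc::clone(&retries);
            thread::spawn(move || {
                let mut local_retries = 0u64;
                for _ in 0..per_worker {
                    let mut seen = tail.load(Ordering::Relaxed);
                    // A failed CAS means another worker claimed the slot first.
                    while let Err(actual) = tail.compare_exchange(
                        seen,
                        seen + 1,
                        Ordering::AcqRel,
                        Ordering::Relaxed,
                    ) {
                        seen = actual;
                        local_retries += 1;
                    }
                }
                retries.fetch_add(local_retries, Ordering::Relaxed);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    (tail.load(Ordering::Relaxed), retries.load(Ordering::Relaxed))
}

/// Retries per successful send, as a fraction.
fn retry_rate(sends: u64, retries: u64) -> f64 {
    if sends == 0 { 0.0 } else { retries as f64 / sends as f64 }
}
```

Running `measure_retries` for each candidate worker count and comparing `retry_rate` gives a rough contention curve. The absolute numbers depend on the machine, so treat them as relative guidance rather than a prediction of production behaviour.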

For persistence we picked RocksDB with a custom write-compression layer that compresses player paths with Zstandard level 3 in 4 KB blocks. We reserved 4 GB of huge pages at boot (2048 × 2 MB pages), mapped via /dev/hugepages. Without huge pages, RocksDB's read amplification caused 42 % extra I/O under load.

What The Numbers Said After

After two weeks of canary traffic, the Rust rebuild hit 2,500 concurrent players with a p99 latency of 168 ms and a p99.9 of 234 ms. GC CPU time vanished, total CPU usage dropped 28 %, and cross-core load was within 3 % of balanced. The memory overhead per player fell from 14 KB to 6 KB, and the RSS footprint stabilised at 1.8 GB instead of the Go version's 3.1 GB at 2,000 players.

The clincher was the outage test. While the Go version would drop 12 % of real-time updates during a rolling restart, the Rust engine sustained 100 % delivery with a 6-second leader failover and a 9-second follower catch-up. The gap wasn't microseconds; it was the difference between meeting the SLA and missing it.

What I Would Do Differently

I would not try the Go/Rust hybrid again. The channel boundary cost 6-12 ms of jitter per hop, and fixing it meant designing a lock-free handoff in C, which defeated the ergonomic argument for keeping Go at all. If we had bitten the bullet and moved the entire IO path to Rust up front, the migration would have taken two weeks instead of six.

I would also budget time for comparing jemalloc, mimalloc, and snmalloc. Our first benchmarks showed mimalloc reducing L3 cache misses by 14 % versus jemalloc on AMD EPYC, but the mimalloc allocator itself burned 1.2 % CPU under load. We chose snmalloc for its lower contention under NUMA and its deterministic deallocation, which cut the tail latencies we had seen under jemalloc by 20 %.

Finally, I would insist on a synthetic load test that simulates a 10× load spike arriving within 500 ms, because real treasure hunts spike when a clue drops. The Go version would fall over; the Rust version would absorb it. That single test would have saved us months of fire-drill analysis.
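
A minimal sketch of the schedule such a test could follow, assuming a linear ramp from the baseline rate to the peak. The ramp shape, the function names and the events-per-second unit are all assumptions for illustration. A real load generator would feed these per-tick counts to its senders.

```rust
/// Target event rate (events per second) `t_ms` after the clue drops.
/// Ramps linearly from `base_rate` to `spike_factor * base_rate` over
/// `ramp_ms`, then holds at the peak.
fn target_rate(base_rate: u64, spike_factor: u64, ramp_ms: u64, t_ms: u64) -> u64 {
    assert!(spike_factor >= 1, "spike_factor must be at least 1");
    let peak = base_rate * spike_factor;
    if ramp_ms == 0 || t_ms >= ramp_ms {
        return peak;
    }
    base_rate + (peak - base_rate) * t_ms / ramp_ms
}

/// Number of events to emit in each `tick_ms` slice of the first `total_ms`.
fn spike_schedule(
    base_rate: u64,
    spike_factor: u64,
    ramp_ms: u64,
    tick_ms: u64,
    total_ms: u64,
) -> Vec<u64> {
    assert!(tick_ms > 0, "tick_ms must be positive");
    (0..total_ms)
        .step_by(tick_ms as usize)
        .map(|t| target_rate(base_rate, spike_factor, ramp_ms, t) * tick_ms / 1000)
        .collect()
}
```

For example, `spike_schedule(2000, 10, 500, 100, 700)` yields seven 100 ms ticks that climb from 200 events to 2,000 events per tick and then hold there. Asserting the p99 budget over the whole window, not just the steady state, is what makes the test useful.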