We were scaling a live treasure hunt engine to 50,000 concurrent players when the Go HTTP server could no longer hold p99 latency under 100 ms. The game logic was not the problem: every participant was racing to open virtual chests before the event ended, and the chests were modeled as ephemeral in-memory objects. Under load the profiler showed allocation pauses of 38,421 ns/op, and the garbage collector was spending 22 % of CPU time marking and sweeping a heap of 2 million objects (Go's collector does not compact). That was the moment I realized the language wasn't the constraint. The runtime was, and we needed something with cheaper allocation and no garbage collector in the hot path.
The Go server used net/http with buffered channels to fan out chest-opening events. We tuned the worker pool to 256 goroutines, set GOMAXPROCS to 16, and even introduced a local LRU cache with a 10 ms TTL. Yet at 2,000 players per shard the p99 tail latency jumped to 147 ms. Profiling with go tool pprof showed 3.8 million context switches per second, each costing ~600 ns. The scheduler couldn't keep up. We tried disabling the GC with GOGC=off, but then RSS grew by 1.2 GB every minute and eventually the OOM killer stepped in. Nothing in the docs had warned us that, in our workload at least, Go's scheduler would start thrashing once we exceeded 4,096 active goroutines per core under tight memory pressure.
We decided to take the Go runtime out of the hot path and move the core chest-opening logic into a Rust binary that exposed a Unix socket to the Go shard. The Rust code used Tokio's work-stealing scheduler, with tokio-metrics tracking park/unpark counts. We sized the Tokio worker pool to 4 threads per shard and pinned them to specific cores via taskset. We replaced the in-memory LRU with an arena-based bump allocator that recycled chest objects in 32-byte chunks. The arena recycled 4.2 million objects per second with zero deallocations and a constant 128 KB footprint. The Rust binary ran in the same cgroup as the Go server, but it handled all chest state mutations. The Go server only forwarded requests and serialized responses.
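To make the arena idea concrete, here is a minimal sketch in safe Rust. It is not the production allocator: the `Chest` fields, the `ChestArena` name, and the index-based handles are assumptions made for illustration. It pairs a bump index with a free list, so released slots are reused before fresh ones are handed out. The only numbers taken from the text are the 32-byte slot size and the 128 KB of slot storage, which works out to 4,096 slots.

```rust
/// Size of one chest slot in bytes, matching the 32-byte chunks described above.
const SLOT_SIZE: usize = 32;
/// 128 KB of slot storage / 32 bytes per slot = 4,096 slots.
const SLOT_COUNT: usize = 128 * 1024 / SLOT_SIZE;

/// Hypothetical 32-byte chest record; the real layout is not given in the post.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
#[repr(C)]
pub struct Chest {
    pub id: u64,
    pub owner: u64,
    pub loot_seed: u64,
    pub opened_at_ms: u64,
}

/// Fixed-capacity arena: storage is allocated once and never freed or resized.
pub struct ChestArena {
    slots: Vec<Chest>, // backing storage, allocated up front
    free: Vec<u32>,    // indices of released slots, reused first
    next: u32,         // bump index: first slot that has never been handed out
}

impl ChestArena {
    pub fn new() -> Self {
        ChestArena {
            slots: vec![Chest::default(); SLOT_COUNT],
            // Pre-sized so pushing a released index never reallocates
            // (assuming each handle is released at most once).
            free: Vec::with_capacity(SLOT_COUNT),
            next: 0,
        }
    }

    /// Stores `chest` in a recycled slot if one exists, otherwise bumps the
    /// index. Returns `None` when all slots are in use.
    pub fn alloc(&mut self, chest: Chest) -> Option<u32> {
        let idx = match self.free.pop() {
            Some(i) => i,
            None if (self.next as usize) < SLOT_COUNT => {
                let i = self.next;
                self.next += 1;
                i
            }
            None => return None,
        };
        self.slots[idx as usize] = chest;
        Some(idx)
    }

    /// Reads a chest by handle. This sketch does not track liveness, so a
    /// stale handle returns whatever currently occupies the slot.
    pub fn get(&self, handle: u32) -> Option<&Chest> {
        self.slots.get(handle as usize)
    }

    /// Marks a slot as reusable. No memory is returned to the system.
    /// Double releases are not detected in this sketch.
    pub fn release(&mut self, handle: u32) {
        self.free.push(handle);
    }

    /// Bytes used by slot storage; constant for the arena's lifetime.
    pub fn footprint_bytes(&self) -> usize {
        self.slots.len() * std::mem::size_of::<Chest>()
    }
}
```

A caller would hold the `u32` handle for the lifetime of a chest and call `release` when the chest is opened or expires. Handing out indices rather than references keeps the sketch free of `unsafe`, at the cost of not catching stale handles. A generation counter per slot is the usual fix if that matters.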
After the switch we ran identical load tests. With 50,000 players across 10 shards, the p99 latency on chest opens dropped from 147 ms to 12 ms. We measured with wrk2, running 50,000 connections for 60 seconds at 3,000 RPS per shard. The Rust side showed 0.4 % CPU steal and 1.8 ms average allocation latency according to perf stat -e cache-misses,cycles,instructions. RSS stayed flat at 45 MB per shard instead of climbing to 900 MB. The Go shards still used 350 MB each, but they were now doing only serialization and TCP multiplexing, with no heap churn. Total allocation rate on the Rust side was 15.6 MB/s versus 412 MB/s under Go. The context switch rate fell from 3.8 million/s to 120,000/s. The only cost was compile time: a clean build took 47 seconds with Cargo on the CI runner versus 9 seconds with Go, but we made that back in 30 minutes of saved debugging.
If I could rewind, I would have built the entire engine in Rust from day one, but we didn't know the scale until players actually showed up. For teams still prototyping live events, start with Go or Node and then profile the scheduler under real concurrency. Once your goroutine count exceeds 2,000 per core or your GC CPU crosses 15 %, switch to Rust and an arena allocator. Measure not just throughput but allocation rate and context-switch counts. The docs won't tell you when your runtime becomes the bottleneck. Your profiler will.
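Those two thresholds are easy to turn into an automated check. The sketch below is one way to encode them. The `RuntimeSample` struct, the `Verdict` enum, and the `assess` function are hypothetical names, and the sketch assumes you already collect goroutine counts and GC CPU share from your own profiler or metrics pipeline. The cut-offs are rules of thumb from one workload, not universal limits.

```rust
/// Thresholds quoted above: more than 2,000 goroutines per core, or more
/// than 15 % of CPU time spent in GC.
const MAX_GOROUTINES_PER_CORE: u64 = 2_000;
const MAX_GC_CPU_FRACTION: f64 = 0.15;

/// Hypothetical snapshot of runtime counters from your own metrics pipeline.
#[derive(Clone, Copy, Debug)]
pub struct RuntimeSample {
    pub goroutines: u64,
    pub cores: u64,
    /// Share of CPU time spent in GC, from 0.0 to 1.0.
    pub gc_cpu_fraction: f64,
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Fine,
    SchedulerPressure,
    GcPressure,
    Both,
}

/// Classifies a sample against the two thresholds. A value exactly at a
/// threshold does not count as exceeding it.
pub fn assess(sample: RuntimeSample) -> Verdict {
    // Compare goroutines against cores * limit to avoid integer-division rounding.
    let sched = sample.goroutines > MAX_GOROUTINES_PER_CORE * sample.cores;
    let gc = sample.gc_cpu_fraction > MAX_GC_CPU_FRACTION;
    match (sched, gc) {
        (false, false) => Verdict::Fine,
        (true, false) => Verdict::SchedulerPressure,
        (false, true) => Verdict::GcPressure,
        (true, true) => Verdict::Both,
    }
}
```

Wiring a check like this into a load-test report makes the moment of crossing either line visible before it shows up as tail latency.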