The Day Our Runtime Became the Garbage Collector

#webdev #programming #rust #performance

The Problem We Were Actually Solving

I remember the moment it happened. Not the panic of a crash, not the alert firing, just the silent realization that our Go runtime was no longer a performance enabler but a performance limiter. We were running a distributed treasure hunt engine for a mobile gaming event with 1.2 million concurrent users. The engine ingested 4.5 million location updates per second and needed to validate each against a 2TB in-memory state graph. Go's GC was doing its job, but with each collection cycle, latency spiked to 80ms while the CPU spent 18% of its cycles in sweep termination. Our SLA required ≤50ms p99 latency. That runtime ceiling was invisible until the moment it became the wall.

What We Tried First (And Why It Failed)

We started with Go 1.21 and its concurrent tri-color GC. At 500k updates/sec, p99 was 22ms and CPU was 65%, well within spec. When load doubled to 1M updates/sec, GC pause times doubled. We profiled with go tool trace and saw the stopTheWorld phase lasting 3–5ms per collection. That seems small, but every pause stalled all 16 worker goroutines at once, so in practice it behaved like a distributed deadlock.

We tried tuning GOGC from 100 to 50, hoping to trade memory for latency. All we got was a 30% memory increase and 12% GC CPU overhead. The bigger heap meant more dirty pages, which increased page faults and NUMA migrations. We tried the soft memory limit (GOMEMLIMIT, available since Go 1.19) on Go 1.22, but the GC still couldn't keep up with mutation rates above 1.5M updates/sec. The language had become the bottleneck.

The Architecture Decision

We stopped blaming the GC and started benchmarking alternatives. We spun up a Rust version using tokio with jemalloc and a custom sharded state store. The first surprise: Rust's Arc and Mutex were still 4x slower than Go's channels for coarse-grained locks, so we switched to lock-free queues using crossbeam-channel and made mimalloc the global allocator.

We benchmarked with criterion.rs on a synthetic load of 5M updates/sec. The Rust binary showed p99 latency of 18ms, CPU time spent on memory management dropped to 2%, and RSS stabilized at 11GB versus Go's 14GB. The tradeoff was compile time: a clean build took 7m30s and hot reloading was impossible, which meant we had to adopt a blue-green deployment with docker build --squash. For a live gaming event, that was an acceptable cost.
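For readers who want to check a p99 figure like the ones above against the 50ms SLA, here is a minimal sketch of a nearest-rank percentile. The names percentile_ms and meets_sla are hypothetical, the samples are made up for illustration, and none of this is taken from the production benchmark harness.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `pct`
/// percent of all samples are less than or equal to it.
/// Returns None for an empty slice or a `pct` outside 0..=100.
pub fn percentile_ms(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Rank is 1-based; multiply before dividing to avoid rounding drift.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// True when the p99 of `samples` is within `budget_ms`.
pub fn meets_sla(samples: &[f64], budget_ms: f64) -> bool {
    matches!(percentile_ms(samples, 99.0), Some(p99) if p99 <= budget_ms)
}

fn main() {
    // Illustrative samples only: 99 fast requests and one 80ms spike.
    let mut samples = vec![20.0; 99];
    samples.push(80.0);
    let p99 = percentile_ms(&samples, 99.0).unwrap();
    println!("p99 = {p99}ms, within 50ms budget: {}", meets_sla(&samples, 50.0));
}
```

With 100 samples, a single 80ms spike is the maximum and does not move p99, but a second spike does. That is why a pause that hits every worker on each collection cycle shows up in p99 so quickly.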

The real architectural shift wasn't the language. It was the allocator and the concurrency model. We split the state graph into 1024 shards, each with its own lock-free update queue. Each shard ran on a dedicated tokio worker, and we wrapped hot-path futures in tokio::task::unconstrained to opt them out of Tokio's cooperative scheduling budget. The Go version couldn't adopt this granularity because channel allocations overwhelmed the GC.
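The shard-per-queue layout described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the production code: std::sync::mpsc stands in for crossbeam-channel, plain OS threads stand in for dedicated tokio workers, and ShardedStore, LocationUpdate, and the hash constant are hypothetical names and choices.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical update type: a player reported a new position.
#[derive(Debug, Clone, Copy)]
pub struct LocationUpdate {
    pub player_id: u64,
    pub lat: f64,
    pub lon: f64,
}

type ShardState = HashMap<u64, (f64, f64)>;

/// Each shard owns its state outright, so no lock guards the map.
/// The only shared structure is the queue feeding each worker.
pub struct ShardedStore {
    senders: Vec<Sender<LocationUpdate>>,
    workers: Vec<JoinHandle<ShardState>>,
}

impl ShardedStore {
    /// `shard_count` must be a power of two (the post uses 1024).
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count.is_power_of_two());
        let mut senders = Vec::with_capacity(shard_count);
        let mut workers = Vec::with_capacity(shard_count);
        for _ in 0..shard_count {
            let (tx, rx) = channel::<LocationUpdate>();
            senders.push(tx);
            workers.push(thread::spawn(move || {
                let mut state: ShardState = HashMap::new();
                // Runs until every sender for this shard is dropped.
                for update in rx {
                    state.insert(update.player_id, (update.lat, update.lon));
                }
                state
            }));
        }
        Self { senders, workers }
    }

    /// Multiplicative hash so sequential ids spread across shards.
    pub fn shard_for(&self, player_id: u64) -> usize {
        let mixed = player_id.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        ((mixed >> 32) as usize) & (self.senders.len() - 1)
    }

    /// Route an update to the queue of the shard that owns the player.
    pub fn submit(&self, update: LocationUpdate) {
        let shard = self.shard_for(update.player_id);
        self.senders[shard]
            .send(update)
            .expect("shard worker exited early");
    }

    /// Close all queues, wait for the workers, and return each shard's state.
    pub fn shutdown(self) -> Vec<ShardState> {
        let ShardedStore { senders, workers } = self;
        drop(senders);
        workers
            .into_iter()
            .map(|w| w.join().expect("shard worker panicked"))
            .collect()
    }
}

fn main() {
    let store = ShardedStore::new(8); // 8 keeps the demo light
    for player_id in 0..1_000u64 {
        store.submit(LocationUpdate { player_id, lat: 0.0, lon: 0.0 });
    }
    let shards = store.shutdown();
    let total: usize = shards.iter().map(|s| s.len()).sum();
    println!("{} players across {} shards", total, shards.len());
}
```

The design point is ownership rather than the specific channel: because one worker is the sole writer for its shard, updates for the same player are applied in submission order without an Arc or Mutex around the map.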

What The Numbers Said After

After migrating the top-10 shards to Rust, p99 latency at 1.2M updates/sec dropped to 19ms and stabilized. GC CPU in Go dropped from 18% to 3%, but we still had 12% CPU in runtime scheduler preemption. Rust used 15% less memory and showed zero GC pressure. We kept Go for the non-critical path (the event feed for a Node.js leaderboard, delivered via NATS) because its GC pauses were negligible at low throughput.

Here's a snapshot from perf record on the Rust shard:

 10.23% treasure mimalloc:malloc
 8.45% treasure tokio::task::waker
 6.12% treasure lock::spin_lock
 4.31% treasure <unknown> [kernel.kallsyms]

The lock contention was 6%, which is expected for a lock-free queue, and the waker overhead was 8%, which we mitigated by pinning worker threads to CPU cores with taskset and building with RUSTFLAGS="-C target-cpu=native".

We also ran a 24-hour soak test with 2.4M updates/sec. The Go version's RSS grew to 18GB due to heap fragmentation; the Rust version stayed at 12GB and p99 latency remained under 22ms. The Rust binary's memory profile from jemalloc-prof showed exactly 4,287 allocations leaked. All were intentional: they belong to long-lived game sessions that we explicitly never free until the event ends.
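An intentional leak of this kind can be expressed directly with Box::leak from the standard library. The sketch below is an assumption about how such sessions could be modeled, not the production code: GameSession, leak_session, and the counter are hypothetical, and the counter exists only so that the number of deliberate leaks can be compared against what a heap profiler reports.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Count of allocations leaked on purpose, for reconciling with a profiler.
static INTENTIONAL_LEAKS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical long-lived session record.
#[derive(Debug)]
pub struct GameSession {
    pub player_id: u64,
    pub treasures_found: u32,
}

/// Allocate a session that is never freed and lives until process exit.
/// Box::leak gives up ownership and returns a 'static reference.
pub fn leak_session(player_id: u64) -> &'static mut GameSession {
    INTENTIONAL_LEAKS.fetch_add(1, Ordering::Relaxed);
    Box::leak(Box::new(GameSession {
        player_id,
        treasures_found: 0,
    }))
}

/// Number of sessions leaked so far through `leak_session`.
pub fn intentional_leaks() -> usize {
    INTENTIONAL_LEAKS.load(Ordering::Relaxed)
}

fn main() {
    let session = leak_session(42);
    session.treasures_found += 1;
    println!("{:?}, intentional leaks: {}", session, intentional_leaks());
}
```

If the profiler's leak count ever exceeds the counter, the difference is a real leak rather than a session held by design.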

What I Would Do Differently

I would not have assumed Rust's safety guarantees would automatically translate to performance. We spent two weeks chasing Arc memory bloat before switching to lock-free queues. That was a hard lesson: Rust's zero-cost abstractions are zero-cost only if you don't pay for the abstractions you don't need.

I would also have split the Rust shards earlier. We initially kept 90% of the state in Go and only ported hot path shards. The GC pressure from the remaining Go heap still caused 5ms latency spikes during major GC cycles. Porting the entire state graph to Rust eliminated that unpredictability.

Finally, I would have adopted a formal memory model earlier. We had a race condition between a background leaderboard update and a Rust shard eviction that caused a use-after-free. After adding loom and running the concurrent tests under loom::model in CI, we caught the bug in 4 hours instead of 4 days during load testing. Formal verification isn't just for aerospace. It's for high-frequency gaming at scale.