The Day We Tried to Outsmart the GC and Lost 40% of Our Latency

#webdev #programming #rust #performance

The Problem We Were Actually Solving

It was 3 AM in the Veltrix engine room when the pager screamed about P99 latency doubling on the treasure-hunt cluster. We had just upgraded to Go 1.22 and had been promised shorter GC pauses. Instead, our synthetic benchmark under go test -bench showed 80ms p99 latency spikes that correlated with every GC cycle. That is not acceptable for a game where players expect sub-100ms reactions when summoning loot chests.

The treasure hunt engine processes 120k concurrent players across 40 shards, each shard a 16-core Kubernetes pod with 32GB heap. The allocation profile was clean: roughly 4MB per request, 95% short-lived objects in the 8-32KB range. We had tuned GOGC=25, used sync.Pool aggressively for position vectors, and even ran jemalloc via M_ARENA_EXTRA_SYS to offload arenas. Yet the GC was still walking 2.1GB of live heap every 2.3 seconds, and those peaks were bleeding into user telemetry.

What We Tried First (And Why It Failed)

Our first fix was reactive: we bumped GOGC to 50 to spread pauses. Pauses dropped to 45ms, but the extra heap pressure triggered background GC cycles every 1.8s instead of 2.3s, and the mutator was now paused twice as often. The p99 still spiked during every GC cycle, and we saw tail latency on matchmaking requests jump from 92ms to 118ms.
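As a rough sanity check on these GOGC settings, the Go runtime's pacing rule puts the next heap goal at roughly the live heap plus GOGC percent of it. The sketch below applies that simplified rule to the 2.1GB live heap mentioned above. It ignores stacks and globals, which newer Go versions also count, and the function names are illustrative assumptions rather than anything from our tooling.

```rust
/// Approximate Go heap goal: live heap plus GOGC percent of headroom.
/// Simplified model: ignores GC roots (stacks, globals) and any memory limit.
pub fn heap_goal_bytes(live_heap_bytes: u64, gogc_percent: u64) -> u64 {
    live_heap_bytes * (100 + gogc_percent) / 100
}

/// Estimated seconds between GC cycles at a steady allocation rate.
/// Returns None when the allocation rate is zero (no cycle would be triggered).
pub fn seconds_between_cycles(
    live_heap_bytes: u64,
    gogc_percent: u64,
    alloc_bytes_per_sec: u64,
) -> Option<f64> {
    if alloc_bytes_per_sec == 0 {
        return None;
    }
    // Headroom is the space the mutator can fill before the next cycle starts.
    let headroom = heap_goal_bytes(live_heap_bytes, gogc_percent) - live_heap_bytes;
    Some(headroom as f64 / alloc_bytes_per_sec as f64)
}
```

With a 2.1GB live heap, this gives a goal of about 2.6GB at GOGC=25 and about 3.15GB at GOGC=50. Under this model a higher GOGC lengthens the gap between cycles at a fixed allocation rate, so a shorter interval like the one we measured suggests the live heap or the allocation rate had shifted as well.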

Next, we tried jemalloc's --enable-stats and malloc_trim(0) in a goroutine. The jemalloc stats showed 1.3GB retained in arenas, but the Go runtime's own profiler (GODEBUG=gctrace=1) revealed that the runtime was still scanning the entire heap because the GC assumed Go's allocator didn't release memory back to the OS aggressively enough. The Go scheduler's GC assist ratio was 0.85, meaning the mutator was spending 85% of its time helping GC instead of processing player moves. That's when it hit me: the language runtime was the constraint.

The Architecture Decision

We rewrote the treasure-matchmaking micro-service in Rust, choosing a custom allocator backed by mimalloc via mimalloc-rust. The change wasn't just about GC vs no GC; it was about explicit memory ownership in a hot path where every nanosecond of latency counted.

We kept the outer LuaJIT scripting layer because operators can hot-patch battle formulas without recompiling, but the actual matchmaking state machine went from ~1,200 lines of Go with sync.Map and mutexes to ~900 lines of Rust using dashmap and tokio::sync::mpsc. We benchmarked on c5.4xlarge (16 vCPU, 32GB) and ran the Rust service at 256MB RSS with no GC pauses. The mimalloc profiler (mimalloc-ctl --stats) showed 1.1MB of retained memory per shard and 28 microsecond max pause times even under 100k concurrent players.
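For readers who have not swapped allocators in Rust before, the hook is the #[global_allocator] attribute. The sketch below is not our production setup: it uses only the standard library and wraps the system allocator with a byte counter so the mechanism is visible without extra crates. The names CountingAlloc and total_allocated_bytes are made up for illustration. With the mimalloc crate, the same attribute would presumably be applied to that crate's allocator type instead.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative allocator: forwards to the system allocator and
/// counts every byte successfully requested since process start.
pub struct CountingAlloc;

static TOTAL_ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Only count allocations that actually succeeded.
            TOTAL_ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Every heap allocation in the program now goes through CountingAlloc.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Monotonic count of bytes handed out so far.
pub fn total_allocated_bytes() -> usize {
    TOTAL_ALLOCATED.load(Ordering::Relaxed)
}
```

A counter like this is also a cheap way to watch the steady-state allocation rate of a shard without attaching a profiler.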

We deployed a canary ring of three Rust shards alongside the Go fleet for 72 hours. The Go shards continued to show GC pauses every 2.1s, while the Rust shards had zero GC pauses—just a 6MB/s steady-state RSS allocation rate. The latency delta was stark: Go p99 112ms, Rust p99 84ms, with no outliers above 120ms.

What The Numbers Said After

After cutover to 100% Rust across all 40 shards, we ran a 48-hour load test simulating Black Friday traffic (4x normal load). The results from Prometheus over 15-minute windows:

Go fleet:
heap_objects{type=heap} 2.1e6
alloc_objects 3.4e6
gc_duration_seconds_sum 1420
p99_latency_seconds 0.13

Rust fleet:
heap_objects{type=heap} 0.4e6
alloc_objects 1.2e6
gc_duration_seconds_sum 0.8
p99_latency_seconds 0.084

The Rust service used 15% less CPU on average and 22% less RSS per shard. The Go fleet's heap grew by 600MB during the test; the Rust fleet's heap stayed flat within 50MB after the initial bump. The mimalloc flamegraph showed 42% of time in matchmaking::update vs 28% in the Go version, but without any GC assist tax. We also retired the jemalloc Go experiment: the Go runtime refused to release memory back to the OS even when the Go allocator's spans were empty, causing neighbor pods to get OOMKilled during traffic spikes.

What I Would Do Differently

I would have resisted the siren call of rewriting in Rust for smaller services. The matchmaking state machine was the perfect candidate because it's CPU-bound, latency-sensitive, and has predictable object lifetimes. But our LuaJIT hotpatch system is still in Go, and we've seen three incidents where a Lua hotpatch triggered a 1.2ms GC pause in the Go runtime, so now we gate hotpatch deployments with a synthetic latency SLO.
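A gate like that can be quite small. The following is a minimal sketch, assuming latency samples are collected in microseconds from a synthetic probe; the function names and the nearest-rank percentile method are illustrative choices, not a description of our actual gate.

```rust
/// Nearest-rank percentile over latency samples in microseconds.
/// Returns None for an empty sample set.
pub fn percentile_us(samples: &[u64], pct: u64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let pct = pct.min(100);
    let n = sorted.len() as u64;
    // Nearest-rank: rank = ceil(pct / 100 * n), clamped to at least 1.
    let rank = ((pct * n + 99) / 100).max(1);
    Some(sorted[(rank - 1) as usize])
}

/// Hypothetical gate: allow the hotpatch only when the probe's p99
/// is within budget. An empty sample set is treated as a failure.
pub fn hotpatch_allowed(samples: &[u64], p99_budget_us: u64) -> bool {
    matches!(percentile_us(samples, 99), Some(p99) if p99 <= p99_budget_us)
}
```

Treating "no samples" as a failed gate is a deliberate choice here: a probe that silently stopped reporting should block a deployment rather than wave it through.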

We also underestimated the cost of onboarding. The Rust service's binary size is 4.2MB vs 1.8MB for Go, and our CI pipeline had to add cargo-audit and cargo-deny gates plus a custom fuzz harness for the matchmaking state transitions. The first week of production saw three segfaults under heavy load because we'd exposed a public API endpoint that dereferenced an unchecked Option in a hot path. That cost us 45 minutes of downtime.
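The general shape of the fix for that class of bug is to turn the missing value into an error the endpoint can report instead of assuming it is present. The sketch below is a simplified stand-in, not the real endpoint: Lobby, MatchError, and shard_for are hypothetical names, and the lookup is reduced to a plain HashMap.

```rust
use std::collections::HashMap;

/// Hypothetical error type for a failed lookup.
#[derive(Debug, PartialEq)]
pub enum MatchError {
    UnknownPlayer(u64),
}

/// Hypothetical lookup table from player id to shard number.
pub struct Lobby {
    shard_by_player: HashMap<u64, u32>,
}

impl Lobby {
    pub fn new() -> Self {
        Lobby { shard_by_player: HashMap::new() }
    }

    pub fn register(&mut self, player_id: u64, shard: u32) {
        self.shard_by_player.insert(player_id, shard);
    }

    /// Returns the player's shard, or an error for ids we have never seen.
    /// The None case becomes a value the caller must handle, not a crash.
    pub fn shard_for(&self, player_id: u64) -> Result<u32, MatchError> {
        self.shard_by_player
            .get(&player_id)
            .copied()
            .ok_or(MatchError::UnknownPlayer(player_id))
    }
}
```

An endpoint built on shard_for can map the error to a client-facing response, so a request for an unknown player degrades to a rejected request rather than taking the shard down.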

Next time, I'd introduce Rust via a separate shard first, run it against the Go fleet in shadow mode for two weeks, and only cut over once we'd proved memory behavior under production traffic. And I'd budget for an extra engineer on call for the first month: Rust panics are still too polite when they happen at 3 AM.
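Shadow mode here means the existing service keeps answering players while the new one sees the same requests and only its disagreements are recorded. A minimal synchronous sketch of that idea follows, with hypothetical names (ShadowStats, serve_with_shadow) and the two services reduced to plain closures. A real deployment would more likely mirror traffic asynchronously so the shadow path cannot add latency.

```rust
/// Counters for a shadow run.
#[derive(Debug, Default, PartialEq)]
pub struct ShadowStats {
    pub total: u64,
    pub mismatches: u64,
}

/// Serves the request from `primary`, runs `shadow` on the same input,
/// and records whether the two responses disagree.
/// The caller always receives the primary's response.
pub fn serve_with_shadow<Req, Resp, P, S>(
    req: &Req,
    primary: P,
    shadow: S,
    stats: &mut ShadowStats,
) -> Resp
where
    Resp: PartialEq,
    P: Fn(&Req) -> Resp,
    S: Fn(&Req) -> Resp,
{
    let served = primary(req);
    let candidate = shadow(req);
    stats.total += 1;
    if candidate != served {
        stats.mismatches += 1;
    }
    served
}
```

The mismatch rate over the shadow period, alongside memory behavior, is the evidence a cutover decision would rest on.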

