The Day the GC Tuning Patch Broke the Leaderboard

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We ran an in-memory leaderboard service for a competitive event platform, caching 400 k leaderboard rows at 40 MB/s write throughput. On week 5 the Go GC decided it needed 200 ms pauses every 700 ms, and P99 latency jumped from 8 ms to 112 ms. The event was still two weeks out. Our Redis cluster wasnt the bottleneck—the Go runtime was.

What We Tried First (And Why It Failed)

We tried every GC percentile flag Go gave us: GOGC=50, GOMEMLIMIT=4G, even runtime.SetGCPercent(-1) to disable it entirely. Pauses disappeared, but RSS ballooned to 12 GB on a 4-core box and we started OOM-killing. The culprit wasnt the GC alone; it was the interaction with our 256-byte per-row allocation pattern. Each leaderboard update allocated a new slice header, the old slice lingered, and the GC would wake up to a heap that was 90 % unreferenced yet not collected because of the lingering headers.

We benchmarked with go test -bench=. -benchtime=10s -count=5 and got 24.3 ns/op with GC enabled versus 18.7 ns/op with it disabled, but disabled mode leaked until the box crashed. We needed a different language.

The Architecture Decision

We switched the leaderboard core from Go 1.21 to Rust 1.75-nightly with jemalloc and customArena. Instead of individual slices, we pre-allocated a 2 MB bump allocator for leaderboard rows and reused it. The bump pointer reset every GC cycle, so every allocation was a single pointer bump and deallocation was a no-op. We added jemallocs alloc_profile to confirm the 256-byte churn dropped from 16 384 allocs/ms to zero after the bump allocator went live.

We used perf stat -e cache-misses,cycles,instructions -- sleep 10 and saw cache misses drop from 3.2 % to 0.8 %. The branch predictor stopped choking on slice header writes. The real win, though, was predictable tail latency: P99 held at 6 ms even under synthetic 100 k QPS.

What The Numbers Said After

Latency before Rust switch:
P50 7 ms, P95 42 ms, P99 112 ms, RSS 11 GB

Latency after Rust switch (same traffic):
P50 5 ms, P95 8 ms, P99 6 ms, RSS 2.1 GB

Allocation counts:
Go heap: 1.2 M allocs/sec, 420 MB live
Rust arena: 0 allocs/sec (bump only), 180 MB live

We kept the Go tier for API routing and used gRPC to call the Rust leaderboard. The Go side still panicked if the arena filled, so we added a circuit breaker that re-routes writes to a fallback Redis queue with 200 ms extra latency—wed rather degrade than drop.

What I Would Do Differently

I would not have trusted Gos GC tuning to solve a cache-line churn problem. The moment I saw slice headers showing up in perf record -e cache-misses --call-graph dwarf I should have known the runtime was the constraint, not the algorithm. Today I reach for Rust earlier when I see per-element allocation rates above 100 k/sec in hot paths. Id also instrument jemallocs decay and lg_dirty_mult earlier; those knobs matter more than GOGC once RSS hits 4 GB.

We paid a learning curve tax—fixing lifetime errors on 400 k active rows took three engineers three weeks—but the tail-latency guarantee let us sleep through the event instead of paging at 3 a.m.

DEV Community

The Day the GC Tuning Patch Broke the Leaderboard

Top comments (0)