When We Burned 30% CPU on a One-Line Config Mistake

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Last summer the treasure-hunt engine at Veltrix had to run two thousand individual games on a single Kubernetes cluster without any player noticing lag above 35 ms p99. The games were built from hand-authored JSON quests served by a Go microservice that streamed events through Redis Streams. Each quest contained variable-length arrays of 3D waypoints, which the engine unpacked, validated, and broadcast to players every 150 ms. Our SLA dashboard said we were green, but the on-call rotation was still being woken up because the Go runtime was spending 30 % of CPU inside mallocgc (yes, the Go runtime's allocator and GC, not our code). Profiling output from go tool pprof -alloc_space showed 2.1 GB of transient allocations per game instance per minute. We had capped concurrency at 16 workers, but GC pauses were still occurring at 0.8 s⁻¹, enough to push latency jitter above the SLA.

What We Tried First (And Why It Failed)

We started by rewriting the JSON parsing in Go using encoding/json with DisallowUnknownFields, hoping to catch malformed quests earlier. The error rate dropped from 0.3 % to 0.1 %, but the GC profile stayed identical. Next we swapped the Redis Streams consumer to a buffered channel with 16 goroutines, then moved to a worker pool with explicit backpressure. GC time actually increased to 34 % of CPU; we had only shifted the allocation burst. We tried sync.Pool for the waypoint slices, but Go's escape analysis kept allocating on the heap anyway because the slices were returned from a higher-order function. At this point we knew the language runtime was the constraint, not our code.

The Architecture Decision

We ran a last-ditch spike with Rust. The key change was replacing the Go service with a Tokio actor system that streamed quest events through a broadcast channel (tokio::sync::broadcast). Instead of allocating a new Vec<Waypoint> for every broadcast, we pre-allocated a 4 KB buffer on the stack using MaybeUninit and re-used it via an Arc<OnceCell<Vec<Waypoint>>> shared across all actors. The Rust compiler forced us to spell out every lifetime, so the decision about where each buffer lives, which Go's escape analysis makes implicitly, was explicit in the code. We compared the two versions using the same quest payloads (average 4 200 waypoints per game) under cargo flamegraph and jemalloc profiling. The Go version allocated 1.8 GB/minute; the Rust version allocated 14 MB/minute with zero GC pauses. Latency p99 dropped from 37 ms to 12 ms on identical Kubernetes nodes, even though the Rust binary was 1.2 MB larger than the Go binary.
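To make the buffer-reuse idea concrete, here is a minimal sketch using only the standard library. It is not our production code: it swaps the MaybeUninit and Arc<OnceCell<...>> machinery for a plain Vec that is cleared and refilled on each tick, and the Waypoint layout and the WaypointBuffer type are hypothetical names invented for illustration.

```rust
/// Hypothetical waypoint layout (assumption: three f32 coordinates).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Waypoint {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

/// Owns one long-lived buffer that is refilled on every broadcast tick.
pub struct WaypointBuffer {
    points: Vec<Waypoint>,
}

impl WaypointBuffer {
    /// Allocate once, up front, for the largest quest we expect to see.
    pub fn with_capacity(max_waypoints: usize) -> Self {
        Self {
            points: Vec::with_capacity(max_waypoints),
        }
    }

    /// Refill the buffer in place. `clear` drops the old elements but keeps
    /// the allocation, so no new heap memory is requested as long as
    /// `source` fits within the existing capacity.
    pub fn refill(&mut self, source: &[Waypoint]) -> &[Waypoint] {
        self.points.clear();
        self.points.extend_from_slice(source);
        &self.points
    }

    /// Current capacity, exposed so callers can check that no regrowth happened.
    pub fn capacity(&self) -> usize {
        self.points.capacity()
    }

    /// Address of the backing storage, useful for verifying reuse.
    pub fn as_ptr(&self) -> *const Waypoint {
        self.points.as_ptr()
    }
}
```

Calling refill every 150 ms with a quest that fits the initial capacity should leave both the capacity and the backing pointer unchanged, which is the property that keeps steady-state allocations close to zero. A quest larger than the initial capacity would still work, but it would trigger one reallocation.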

What The Numbers Said After

Concrete measurements taken with vegeta at 2 000 RPS:
Go service: p99 37 ms, p50 11 ms, GC pauses 0.8 s⁻¹, RSS 180 MB
Rust service: p99 12 ms, p50 5 ms, GC pauses 0 s⁻¹, RSS 82 MB

We rolled the change to staging using an ArgoCD canary policy that cut traffic in 5 % increments. The first increment exposed a subtle ordering bug in our quest scheduler that only happened when waypoints were reused instead of freshly allocated. Rust's borrow checker caught it at compile time, whereas the Go version had run for three days before the bug surfaced in production as a race on player scores. After fixing the borrow issue, the Rust version stayed within SLA even when we doubled the quest payload size to 8 500 waypoints.

What I Would Do Differently

I would never again use Go for a stateful streaming service that must guarantee sub-40 ms p99 latency. The runtime cost of garbage collection is not linear; it compounds with allocation size and hot code paths. On the other hand, Rust introduced its own complexity curve: learning MaybeUninit and Arc<OnceCell<T>> cost us two engineer-weeks, and the first week was mostly fighting the borrow checker on shared game state. If the quests had been read-only after publishing, a Rust Arc<Vec<Waypoint>> alone would have sufficed; we over-engineered the reuse buffer. The real breakthrough was admitting the language runtime was the bottleneck—once we did, the problem became tractable.
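For the read-only case, the simpler Arc<Vec<Waypoint>> design could look roughly like the sketch below. It uses plain std threads in place of our Tokio actors, and Waypoint, publish, and fan_out are hypothetical names chosen for the example, not functions from our codebase.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical waypoint layout (assumption: three f32 coordinates).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Waypoint {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

/// Publish a quest once. After this call the waypoints are immutable,
/// because `Arc` only hands out shared references.
pub fn publish(points: Vec<Waypoint>) -> Arc<Vec<Waypoint>> {
    Arc::new(points)
}

/// Fan the same quest out to `subscribers` threads. Each thread receives a
/// clone of the `Arc` (a reference-count increment), not a copy of the
/// waypoint data. Returns how many waypoints each subscriber saw.
pub fn fan_out(quest: &Arc<Vec<Waypoint>>, subscribers: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..subscribers)
        .map(|_| {
            let shared = Arc::clone(quest);
            thread::spawn(move || shared.len())
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("subscriber thread panicked"))
        .collect()
}
```

Because every subscriber reads through the same allocation, the per-broadcast cost is one atomic increment per subscriber regardless of how many waypoints the quest holds. The trade-off is that any mutation after publishing means building a new Vec and publishing a new Arc.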
