The Day Our Operator Became The Constraint

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The engine was a classic microservice written in Go 1.21, running on Kubernetes 1.28 with cgroups v2. The treasure-hunt rules required sub-second path validation across 2 M active users and 600 K concurrently open sessions. Our first production run on a c6g.2xlarge node (8 vCPU, 16 GB RAM) showed:

Go alloc: 1.8 GB/s
RSS resident: 9.2 GB
P99 latency: 220 ms

The symptoms were brutal: every time the GC started (every 500 ms on average), latency spiked for 40-60 ms while the mutator was paused. The Go GC is good at throughput, but at this allocation rate it became the bottleneck. We tried lowering GOGC to 25, but allocation rate barely dropped and p99 became unbounded because the collector couldnt keep up with 32 % CPU parked.

What We Tried First (And Why It Faliure)

Next we reached for pprof. The graph showed the GC scanning 1.6 GB heap in 27 ms, but the final STW phase took 14 ms even after GOGC=10. That STW pause was causing 10 % of our requests to breach 500 ms. The real kicker was the card table resize: each time the heap grew past 4 GB, the card table doubled, causing an extra 14 ms pause. Our heap size was still climbing because our hot path kept allocating 64-byte slices for partial path updates.

We tried sync.Pool to reuse the slice objects. Pool hit rate hit 92 %, but we introduced a 140 ns per-object lock latency spike under contention. Our profiling showed the spin loop inside pool.Get() burning 8 % CPU in kernel space when 1000+ goroutines contested the same bucket. The yield() syscall that Gos runtime does to mitigate this adds up to 2.3 µs per contested access; multiply by 1 M ops/sec and you get 2.3 ms/s wasted, which at p99 looked like 19 ms added latency.

The Architecture Decision

At this point the language runtime itself was the constraint. We benchmarked Rust 1.75 using mimalloc-rs with jemalloc backend. The same workload in Rust with an arena allocator yielded:

Alloc rate: 380 MB/s
RSS resident: 4.8 GB
P99 latency: 68 ms

The key difference was the Rust allocator never resized the arena; we set it to 80 MB up front and reused it. The Go heap still had to grow to 9 GB because temporary slices escaped to the heap even with pooling. Rusts borrow checker let us prove the slices never outlived the request, so we could stack-allocate them in a 128 KB scratch arena.

We also replaced the Go scheduler with a work-stealing thread pool tuned to 4 OS threads pinned to the same NUMA node. That killed the parked goroutine cost from 32 % to 2 %. The arena allocator in Rust removed the GC pause entirely: mutator utilization stayed at 98 % with zero STW.

We ran a canary with 5 % traffic on the new binary for 7 days. The Rust runtime used 30 % less memory than Go with GOGC=10 and delivered consistent 65 ms p99, beating the requirement by 35 %.

What The Numbers Said After

After full cutover:

Before: Go 1.21, 8 vCPU, 16 GB RAM, p99 220 ms
After: Rust 1.75, 4 vCPU, 8 GB RAM, p99 65 ms

Latency distribution flattened: p50 stayed at 42 ms, p95 at 78 ms. The only regression appeared in startup time—Rust took 420 ms to initialize the arena, while Go did it in 30 ms. For a 24/7 operator service that mattered less than the consistent tail latency.

Our memory safety story improved too. During the Rust rewrite we caught three use-after-free bugs in the Go port by enabling MIRI on nightly builds. The first was a pointer to a temporary slice that escaped into a shared cache; the second was a double-free in the reward-distribution path; the third was a data race on a counter protected by a single CAS loop. All three would have been silent data corruptions in production.

What I Would Do Differently

I would not have waited three weeks to measure allocator pressure. If I had run go tool trace with the alloc tracer enabled on day one, the 1.8 GB/s allocation rate would have screamed at us. We also should have profiled the Go GC interactively: running GODEBUG=gctrace=1 at runtime shows GC summary every cycle, but the real pause breakdown requires the trace package with the heap goal graph.

Next time I see a language runtime eating scheduler CPU, Ill reach for perf first, then the allocator metrics, and only then consider a rewrite. A rewrite in Rust is not free: the Rust toolchain added 3 GB to our CI image size, increased our build time from 3 minutes to 8 minutes on GitHub Actions, and forced us to rewrite the observability layer because our Go metrics exporter was too GC-bound to export data during stress.

The lesson is local: measure the runtime before blaming the code. Go is fast, but when youre burning 32 % CPU in parked goroutines because the scheduler is fighting the GC, the language is the bottleneck, not the algorithm.