When Our Go Engine Blew Up at 3 AM and How Rust Saved the Treasure Hunt

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our treasure map was a graph of 2.3 million nodes and 6.8 million edges stored in Redis. Each client move emitted a WebSocket message that touched 6–12 nodes, triggered proximity calculations in Lua scripts, and updated leaderboard ranks in a PostgreSQL materialized view. Under moderate load, Go's GC would pause for 40–80 ms every 200 ms. During the city-wide launch party, 35,000 users hit refresh simultaneously after a live clue drop. GC pauses jumped to 300 ms and the RSS curve looked like an EKG alarm. We were dropping 8 % of WebSocket frames, and the OOM killer eventually took down two pods. Users saw "Leaderboard n/a" for 47 seconds.

What We Tried First (And Why It Failed)

First fix: tune GOGC. We dropped it from 100 to 50, then 25. Response time improved 12 %, but the latency percentiles remained spiky. We then profiled the service with pprof. It showed 34 % of CPU time in GC mark termination, with 2.1 million heap objects being scanned per second. The Lua interpreter embedded in Redis was allocating 4 KB Lua stacks per call, and Go's escape analysis showed that our map objects were escaping to the heap because the graph traversal used a slice of pointers.

Next attempt: rewrite the proximity calculation in C and call it via cgo. This reduced GC pressure by 18 %, but the cgo boundary added 1.2 µs of latency per hop, and in our setup we topped out at roughly 2,000 cgo calls per second, far too few for the sheer number of proximity checks. The latency tail grew from 20 ms to 35 ms.

We profiled the Redis Lua scripts themselves. They were spending 30 % of CPU in string concatenation when constructing proximity strings. We switched to SHA-1 hashes and base64 encoding instead, but Redis memory usage exploded from 9 GB to 14 GB, and the Lua script still had to scan every node once per move.

The Architecture Decision

At this point I admitted the language runtime was the bottleneck, not the algorithm. Go's GC had been fine for our steady-state traffic, but it coped badly with this interactive, latency-sensitive workload and its bursty, irregular allocations. I chose Rust for the new treasure core, targeting a rewrite of the graph traversal and proximity engine. We kept Redis and PostgreSQL as data stores but moved the CPU-heavy pathfinding to a separate Rust service deployed on Kubernetes with cpu=2,memory=4Gi limits.

Key trade-offs:

A generational arena allocator in Rust cut down on pointer chasing and let us pre-allocate 16 MB node buffers upfront.
We used petgraph with raw indices instead of Box to cut memory footprint by 60 %.
Tokio's work-stealing scheduler handled 80,000 concurrent WebSocket moves without GC pauses.
We lost two weeks to lifetimes and borrow-checker fights. On the plus side, the binary size grew only 400 KB.
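
To make the arena idea concrete, here is a minimal sketch of a generational arena using only the standard library. It is not our production code: the `Node` payload, the `Arena` and `NodeId` names, and the capacity are all made up for illustration. The point is that nodes live in one pre-allocated `Vec` and are referenced by small index handles rather than heap pointers, and that a generation counter makes stale handles fail safely.

```rust
/// Handle into the arena: a slot index plus the generation it was issued for.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId {
    index: u32,
    generation: u32,
}

/// Hypothetical node payload; the real node layout is not shown in this post.
#[derive(Clone, Debug, PartialEq)]
pub struct Node {
    pub x: f32,
    pub y: f32,
}

struct Slot {
    generation: u32,
    value: Option<Node>,
}

pub struct Arena {
    slots: Vec<Slot>,
    free: Vec<u32>,
}

impl Arena {
    /// Reserve room for `capacity` nodes up front so inserts do not reallocate.
    pub fn with_capacity(capacity: usize) -> Self {
        Arena { slots: Vec::with_capacity(capacity), free: Vec::new() }
    }

    /// Store a node, reusing a freed slot when one is available.
    pub fn insert(&mut self, node: Node) -> NodeId {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(node);
            NodeId { index, generation: slot.generation }
        } else {
            let index = self.slots.len() as u32;
            self.slots.push(Slot { generation: 0, value: Some(node) });
            NodeId { index, generation: 0 }
        }
    }

    /// Remove a node. The generation is bumped so old handles stop resolving.
    pub fn remove(&mut self, id: NodeId) -> Option<Node> {
        let slot = self.slots.get_mut(id.index as usize)?;
        if slot.generation != id.generation || slot.value.is_none() {
            return None;
        }
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(id.index);
        slot.value.take()
    }

    /// Look up a node; returns `None` for stale or out-of-range handles.
    pub fn get(&self, id: NodeId) -> Option<&Node> {
        let slot = self.slots.get(id.index as usize)?;
        if slot.generation == id.generation {
            slot.value.as_ref()
        } else {
            None
        }
    }
}
```

A handle is 8 bytes and can be copied freely, which sidesteps most of the lifetime fights you get when nodes hold references to each other.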

What The Numbers Said After

After the Rust rewrite:

CPU time spent on memory management dropped from 34 % (GC mark termination in Go) to 2 %.
P99 WebSocket latency fell from 82 ms to 14 ms.
RSS stabilized at 1.8 GB per pod under full load (previously 12 GB).
Peak throughput climbed from 11,000 moves/sec to 47,000 moves/sec without dropping frames.
OOM events dropped to zero over the next four weeks.

We ran perf on the Rust binary and observed 87 % of CPU in the proximity hot loop, which now used a compact adjacency list with 8-byte entries. A goroutine leak that had gone unnoticed for weeks in the Go service was gone too, because Tokio's task cancellation was reliable and didn't leave orphaned tasks behind.
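
For readers wondering what an 8-byte adjacency entry can look like, here is a rough sketch in plain Rust. It assumes a compressed sparse row layout with a `u32` target index and an `f32` distance per edge. The `Edge`, `Graph`, and `nearby` names are illustrative, and the actual proximity logic in our service is more involved than a simple radius filter.

```rust
/// One adjacency entry: 4-byte target index + 4-byte distance = 8 bytes.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct Edge {
    pub target: u32,
    pub distance: f32,
}

/// Compressed sparse row layout: the edges of node `n` live in
/// `edges[offsets[n]..offsets[n + 1]]`, contiguous in memory.
pub struct Graph {
    offsets: Vec<u32>,
    edges: Vec<Edge>,
}

impl Graph {
    /// Build from directed `(source, target, distance)` triples.
    pub fn from_edges(node_count: usize, list: &[(u32, u32, f32)]) -> Self {
        // Count outgoing edges per node, shifted by one slot.
        let mut offsets = vec![0u32; node_count + 1];
        for &(src, _, _) in list {
            offsets[src as usize + 1] += 1;
        }
        // Prefix sum turns the counts into start offsets.
        for i in 0..node_count {
            offsets[i + 1] += offsets[i];
        }
        // Scatter each edge into its node's range.
        let mut cursor = offsets.clone();
        let mut edges = vec![Edge { target: 0, distance: 0.0 }; list.len()];
        for &(src, target, distance) in list {
            let slot = cursor[src as usize] as usize;
            edges[slot] = Edge { target, distance };
            cursor[src as usize] += 1;
        }
        Graph { offsets, edges }
    }

    /// All outgoing edges of `node` as one contiguous slice.
    pub fn neighbors(&self, node: u32) -> &[Edge] {
        let start = self.offsets[node as usize] as usize;
        let end = self.offsets[node as usize + 1] as usize;
        &self.edges[start..end]
    }

    /// Direct neighbors of `node` whose edge distance is at most `radius`.
    pub fn nearby(&self, node: u32, radius: f32) -> Vec<u32> {
        self.neighbors(node)
            .iter()
            .filter(|e| e.distance <= radius)
            .map(|e| e.target)
            .collect()
    }
}
```

Because each node's edges sit next to each other, the hot loop walks a flat slice instead of following pointers, which is the kind of access pattern that shows up as one tight loop in perf.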

What I Would Do Differently

I would not have started with cgo. Cgo added boundary latency and a call-rate ceiling that made the problem worse. If I had to choose again, I would have written a minimal native module in Rust and loaded it into Redis directly, but we avoided that because we did not trust the Redis module API to stay stable across the versions we deploy.

I would insist on production load tests that replay the exact event pattern (city-wide drop, 35,000 moves in under 20 seconds), not just steady-state metrics. Tools like vegeta or hey cover the HTTP side, and the WebSocket path needs a client that speaks the same protocol. Our earlier Go tests used 1,000 users at 50 moves/sec and missed the pathological case.
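
As a back-of-the-envelope illustration of that burst, the sketch below computes send offsets for a replay harness. It assumes the moves are spread evenly across the window, which is a simplification (the real spike was front-loaded), and `burst_schedule` and `average_rate` are hypothetical helper names, not part of any load-testing tool.

```rust
use std::time::Duration;

/// Send offsets for `total` moves spread evenly over `window`.
/// The first move fires at offset zero; the last one fires before `window` ends.
pub fn burst_schedule(total: u32, window: Duration) -> Vec<Duration> {
    let step = window.as_secs_f64() / f64::from(total);
    (0..total)
        .map(|i| Duration::from_secs_f64(step * f64::from(i)))
        .collect()
}

/// Average rate implied by the burst, in moves per second.
pub fn average_rate(total: u32, window: Duration) -> f64 {
    f64::from(total) / window.as_secs_f64()
}
```

For 35,000 moves in 20 seconds, `average_rate` gives 1,750 moves per second as the floor a burst test has to sustain, before accounting for the front-loading.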

Finally, I would budget two extra sprints for Rust onboarding: pair-programming through borrow-checker errors, running miri on the adjacency code to catch undefined behavior, and setting up cargo-llvm-cov to track test coverage. That cost is real, but the latency cliff we avoided is priceless.