The Problem We Were Actually Solving
Last April the Veltrix game servers began to stutter during the weekly global hunt. Players in Mumbai, São Paulo and Seattle all reported the same second-long freeze at minute 47, exactly when the treasure table exploded from 500 k rows to 2 million. Prometheus showed P99 latency climbing from 12 ms to 2.1 s and allocator stalls in jemalloc at 4 GB. The interesting detail was that the freeze happened only when the treasure table was larger than main memory; once it spilled to SSD the GC pauses were gone. That meant the problem was not I/O but the runtime's idea of what memory safety looked like. We had tuned PostgreSQL, our CDN, even the kernel's dirty ratio, but the garbage collector was the invisible bottleneck.
What We Tried First (And Why It Failed)
We first blamed the SQL. Running pg_stat_statements showed the treasure lookup was a single CTE with an ORDER BY and a LIMIT. A 20-line Ruby script cached the winner in Redis and served the event, but the freeze persisted. We added read replicas; P99 stayed the same. We put the whole table in TimescaleDB's in-memory cache; the freeze moved to minute 56, when the cache finally evicted something. The graph was still a hockey stick.
Then we tried JRuby with the new incremental GC. The GC logs showed 200 ms safepoints every 800 ms. Latency still hit 2.4 s. We switched to TruffleRuby, hoping Graal's native image would help. The startup time alone was 4.3 s and the allocation rate tripled because of the polyglot sandbox. The ops team said that you cannot hot-patch a GraalVM node at 3 a.m. So the runtime was the constraint, but nothing we tried had removed the GC wall.
The Architecture Decision
I spent a sleepless Saturday running flamegraphs inside a flamegraph. The top entry was always malloc_hook in jemalloc and, deeper, objc_msgSend on the Ruby side. That told me the root cause was the language runtime interpreting every object dispatch. We could keep the Ruby logic—it was 1200 lines of battle-tested treasure geometry—but we needed a runtime that did not interpret.
We rewrote the treasure picker in Rust. Not idiomatic Rust with a Vec of boxed structs, but a flat arena of u32 indices and precomputed bounding boxes so the whole table lived as two slices: one for coordinates, one for rewards. We used BTreeMap only for the pruning phase; the hot path was linear in SIMD registers via the packed_simd crate. The allocator was mimalloc with large-page support so jemalloc never touched the treasure arena. The change was a 12-hour rewrite of the reward-selection loop; we left the REST API, Redis cache and PostgreSQL untouched.
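To make the two-slice layout concrete, here is a minimal sketch of what such an arena could look like. It is not the production code: the `TreasureArena` type, its method names, the `[min_x, min_y, max_x, max_y]` box layout and the "highest reward whose box contains the point" rule are all assumptions made for illustration, and the scan is plain scalar Rust rather than packed_simd.

```rust
/// Hypothetical structure-of-arrays treasure table: two parallel flat
/// buffers, addressed by a u32 index instead of a pointer per treasure.
pub struct TreasureArena {
    /// Precomputed bounding boxes, assumed layout [min_x, min_y, max_x, max_y].
    boxes: Vec<[f32; 4]>,
    /// Reward value for the treasure at the same index.
    rewards: Vec<u32>,
}

impl TreasureArena {
    /// Reserve both buffers up front so filling the table does not reallocate.
    pub fn with_capacity(n: usize) -> Self {
        TreasureArena {
            boxes: Vec::with_capacity(n),
            rewards: Vec::with_capacity(n),
        }
    }

    /// Append one treasure and return its u32 index into both buffers.
    pub fn push(&mut self, bbox: [f32; 4], reward: u32) -> u32 {
        let index = u32::try_from(self.rewards.len()).expect("more than u32::MAX treasures");
        self.boxes.push(bbox);
        self.rewards.push(reward);
        index
    }

    /// Linear scan over both slices. Returns (index, reward) of the
    /// highest-reward treasure whose box contains (x, y); on a tie the
    /// earliest index wins. The selection rule itself is an assumption.
    pub fn best_reward_at(&self, x: f32, y: f32) -> Option<(u32, u32)> {
        let mut best: Option<(u32, u32)> = None;
        for (i, (b, &reward)) in self.boxes.iter().zip(self.rewards.iter()).enumerate() {
            let inside = x >= b[0] && y >= b[1] && x <= b[2] && y <= b[3];
            if inside && best.map_or(true, |(_, r)| reward > r) {
                best = Some((i as u32, reward));
            }
        }
        best
    }
}
```

The point of the layout is that the hot loop touches two contiguous buffers and allocates nothing per lookup, which is what lets the scan stay in cache and keeps the allocator out of the picture.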
The latency test looked like this:
Baseline (MRI 2.7): P99 2.1 s, alloc 1.8 GB, GC 24 %
Rust (mimalloc): P99 48 ms, alloc 120 MB, GC 0 %
After shipping to 10 % of players the error budget stayed flat. We rolled it out globally.
What The Numbers Said After
We left Prometheus scraping jemalloc and mimalloc for three weeks. The jemalloc stall events dropped from 47 per minute to zero. The treasure event's RSS stayed at 140 MB even when the table grew to 5 million rows. The p95 tail was now dominated by PostgreSQL's seq scan, not by our code. The interesting detail was that the Rust binary used 30 % less CPU overall because the CPU spent zero cycles in a write barrier.
What I Would Do Differently
I would not have waited for the GC flamegraph. If I had instrumented malloc immediately I would have seen that jemalloc was the bottleneck two days earlier. We also over-optimised the arena too early; the first Rust version still boxed every treasure struct and the GC pauses moved to the arena allocator. Only when we switched to raw slices did the GC truly vanish. Lastly, I would insist on a single cross-language profiler next time. Sampling with perf every 10 seconds was not enough; we needed eBPF-based heap flamegraphs to see the malloc path without recompiling.
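On the Rust side, instrumenting the allocator early can be as cheap as wrapping the system allocator in a counter. The sketch below is one way to do that with the standard `GlobalAlloc` trait; it is not what we ran in production (that used mimalloc), and `CountingAlloc`, `ALLOC` and `snapshot` are names made up for this example.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical wrapper around the system allocator that counts every
/// allocation and the bytes requested, so a boxed-struct regression
/// shows up as a number instead of a flamegraph.
pub struct CountingAlloc {
    allocs: AtomicU64,
    bytes: AtomicU64,
}

impl CountingAlloc {
    pub const fn new() -> Self {
        CountingAlloc {
            allocs: AtomicU64::new(0),
            bytes: AtomicU64::new(0),
        }
    }

    /// Returns (allocation count, total bytes requested) since start-up.
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.allocs.load(Ordering::Relaxed),
            self.bytes.load(Ordering::Relaxed),
        )
    }
}

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Relaxed is enough: these are statistics, not synchronisation.
        self.allocs.fetch_add(1, Ordering::Relaxed);
        self.bytes.fetch_add(layout.size() as u64, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

/// Route every heap allocation in the binary through the counter.
#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc::new();
```

Taking `ALLOC.snapshot()` before and after the reward-selection loop gives the number of allocations the hot path makes; for a flat-slice arena the expectation is that the difference stays at zero.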