DEV Community

Cover image for The Day We Realized Treasure Hunt Was a Memory Allocator in Disguise
pretty ncube
pretty ncube

Posted on

The Day We Realized Treasure Hunt Was a Memory Allocator in Disguise

The problem we were actually solving

It started with a tail latency spike on the Treasure Hunt engine at 2:17 a.m. on Black Friday week. The p99 latency jumped from 8 ms to 840 ms in two minutes, then settled back down after a leaderboard recompute finished. We thought it was a database lock, so we spent six hours tuning connection pools and adding read replicas. Nothing helped. Then I ran perf top and saw 48 % of CPU time inside jemallocs malloc and free on the HuntUpdate struct. Thats when I understood: the engine wasnt CPU-bound; it was page-fault bound due to 600,000 tiny GC allocations per second. We had turned a real-time ranking system into a garbage collectors playground.

What we tried first (and why it failed)

Our first fix was to switch the Node.js worker pool from 16 to 64 threads, hoping the event loop would become less congested. It didnt. After that, we rewrote the leaderboard heat map in C++ using std::unordered_map, but the spikes moved to a different phase of the recompute—now inside std::allocator. I even tried running the engine on GraalVM with SubstrateVM ahead-of-time compilation, and while startup was faster, the allocation rate remained the same. The profiler output from perf record -e cache-misses -j any,u -o hunts.perf showed 3.2 million cache misses per recompute cycle. We were treating symptoms instead of the allocator.

The architecture decision

One afternoon, after the fifth rollback, a colleague casually mentioned that the Rust fxhash crate produced a 32-bit hash in one-fifth the cycles of V8s Math.random(). That planted the seed. We prototyped the entire Treasure Hunt engine in Rust, specifically targeting no_std for the inner loop and using ahash for 128-bit hashing. We rewrote the immutable leaderboard diff as a single Arc<[Player]> slice and replaced the hot path for ranking with a rayon parallel sort on u64 keys. The first build crashed at runtime with a stack overflow because we forgot to set ulimit -s 65536. After bumping the stack size and enabling jemalloc via tikv-jemallocator, the allocation count dropped from 600 k/s to 12 k/s. The unwrap-heavy code compiled with clippy --deny warnings and our CI pipeline now runs cargo test --release -- --nocapture before each canary.

What the numbers said after

We measured again with the same Black Friday workload. The p99 latency fell from 840 ms to 12 ms. Allocation bytes per update dropped from 48 kB to 960 B. The worst-case pause time measured by tokio-console went from 180 ms to 3 ms. The flame graph from pprof --http=:8080 showed 0 % time in GC. On AWS c6g.xlarge, the engine now handles 210 k updates/sec with 0.2 ms stddev, whereas the Node.js version saturated at 65 k updates/sec before falling over. We also saved $7 k/month by right-sizing the fleet: the Rust binary uses 45 MB RSS versus 320 MB for Node.js, so we cut three m6g.large instances.

What I would do differently

In retrospect, we should have written a 50-line C benchmark first instead of diving straight into Rust. The C version would have shown us the allocation rate immediately without the fear of borrowing rules. We also underestimated the ownership complexity around Arc cycles between the game state and the leaderboard. One night we hit a deadlock because two tokio::spawn blocks were holding RwLocks in opposite orders. We fixed it by flattening the lock hierarchy and using parking_lot::RwLock, but it cost us two days. Finally, I would insist on cargo-audit and cargo-deny from day one; last month a low-severity ring advisory slipped through, and we had to push a zero-downtime patch in 25 minutes during a live event.

Top comments (0)