The Day Our Treasure Hunt Engine Ate 160 GiB of RAM and How We Fought Back

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We built a real-time treasure-hunt server whose job was to dispatch randomized virtual coins as fast as players tapped buttons on their phones. Our SLA demanded p99 latency under 15 ms and zero GC pauses longer than 1 µs. We chose Rust because the team had just shipped a gRPC service in Go that occasionally hiccuped at 200 k users, producing 300 ms latency spikes. Our new server had to scale to 5 million concurrent sessions on a 4-core Kubernetes node pool. By day 18 of load testing, the Go version plateaued at 2.1 million users; the Rust prototype hit 4.8 million but began OOMing under sustained load. Our Prometheus dashboard showed resident memory climbing from 4.2 GiB to 160 GiB in 40 minutes while latency stayed flat. Valgrind reported no leaks and there were no stack overflows, just the OS killing pods for exceeding memory limits.

What We Tried First (And Why It Failed)

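We started with Tokio 1.21, tokio-uring for async file I/O, and jemalloc installed as the global allocator through the jemallocator crate (Rust itself has defaulted to the system allocator since 1.32). The jemalloc profile told a story the Rust docs never printed: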

__je_arena_tcache_evict+0x42
__je_tcache_bin_flush_small+0x1a8
__je_malloc_small+0x2a0
tokio::runtime::basic_scheduler::Inner::run+0xe8

The allocator's tcache flushes were colliding on the arena lock every time we allocated a coin payload: 8 bytes per hit, roughly 300 k allocations per second. We tried bumping MAX_THREADS, switched to malloc_conf=background_thread:true, and even patched jemalloc to use per-thread arenas. None of it mattered; the contention migrated to the spinlock inside __je_malloc_small. We recompiled with mimalloc 2.0.1 and the resident set never climbed past 38 GiB. Problem solved? Not quite: the mimalloc background scanner paused the runtime for 4–6 ms every 10–15 seconds under peak load, breaking our p99 SLA. So we fired jemalloc and mimalloc and reached for snmalloc.
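
A quick way to see whether small allocations slow down as threads are added is a microbenchmark along these lines. This is only a minimal sketch using the standard library: the function name `bench_small_allocs` is hypothetical, and it measures whichever global allocator the binary was built with rather than reproducing our production workload.

```rust
use std::hint::black_box;
use std::thread;
use std::time::{Duration, Instant};

/// Allocates and frees `iters` 8-byte payloads on each of `threads` threads.
/// Returns (total allocations, summed busy time across all threads).
fn bench_small_allocs(threads: usize, iters: usize) -> (usize, Duration) {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let start = Instant::now();
                for i in 0..iters {
                    // black_box keeps the optimizer from eliding the allocation.
                    let coin: Box<u64> = black_box(Box::new(i as u64));
                    drop(coin);
                }
                start.elapsed()
            })
        })
        .collect();

    let mut busy = Duration::ZERO;
    for handle in handles {
        busy += handle.join().expect("worker thread panicked");
    }
    (threads * iters, busy)
}
```

Calling `bench_small_allocs(n, 1_000_000)` for 1, 2, 4 and 8 threads and dividing the returned time by the count gives nanoseconds per allocate/free pair. If that figure rises with the thread count, the allocator is contending somewhere. Note that this loop frees on the same thread that allocated, which is the friendliest case for a thread cache; handing payloads to another thread before dropping them would be closer to what an async server does.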

The Architecture Decision

We ported the entire coin-dispatching path to snmalloc 0.6.0 on a custom nightly Rust toolchain. The decision cost us two weeks: the snmalloc crate had no async-io support, so we rewrote the I/O layer to use io_uring with direct syscalls rather than tokio. The trade-off was explicit: lose the Tokio scheduler's ergonomics for sub-microsecond allocation latency and zero background threads. Our new allocator profile showed an average of 75 ns per 8-byte allocation, with more than 99 % of allocations completing in under 100 ns. We rebuilt the binary with lto=thin and codegen-units=1 to reduce instruction cache misses. Load tests began passing: 5 million users, 14.2 ms p99 latency, 32 GiB resident memory peak. The Kubernetes memory limit dropped from 200 GiB to 64 GiB, freeing 24 cores for the next microservice.

What The Numbers Said After

Here is the delta from the OOM night to the snmalloc night:

Metric	jemalloc/Tokio	snmalloc/io_uring
p99 latency	18 ms	14.2 ms
RSS peak	160 GiB	32 GiB
Alloc/sec	312 k	318 k
Alloc latency avg	240 ns	75 ns
Background allocator pauses >1 ms	47 / minute	0 / minute

The snmalloc build also shrank the binary by 18 % because the allocator stubs replaced jemalloc's 500 KB arena tables. The one regression was compile time: snmalloc rebuilt itself in 47 seconds on a 32-core runner, slowing our CI by 30 %. We mitigated it with sccache and precompiled artifacts.
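
For readers who want a figure like Alloc/sec from their own service, one allocator-agnostic approach is to wrap the allocator in a counting shim. The sketch below is an assumption about how this could be done, not the profiler we used: `CountingAlloc`, `snapshot` and `allocs_per_sec` are hypothetical names, and the shim wraps the standard `System` allocator purely for illustration.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Wraps the system allocator and counts allocation calls and requested bytes.
struct CountingAlloc {
    allocs: AtomicU64,
    bytes: AtomicU64,
}

impl CountingAlloc {
    const fn new() -> Self {
        Self { allocs: AtomicU64::new(0), bytes: AtomicU64::new(0) }
    }

    /// Returns (allocation count, requested bytes) since process start.
    fn snapshot(&self) -> (u64, u64) {
        (self.allocs.load(Ordering::Relaxed), self.bytes.load(Ordering::Relaxed))
    }
}

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.allocs.fetch_add(1, Ordering::Relaxed);
        self.bytes.fetch_add(layout.size() as u64, Ordering::Relaxed);
        // Forward the real work to the wrapped allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc::new();

/// Converts two counter snapshots into an allocation rate.
fn allocs_per_sec(before: u64, after: u64, elapsed_secs: f64) -> f64 {
    (after - before) as f64 / elapsed_secs
}
```

Taking `GLOBAL.snapshot()` once per scrape interval and feeding the two counts to `allocs_per_sec` yields a rate that can be exported as a metric. The shared atomic counters are themselves a point of contention at high thread counts, so per-thread counters would be a better fit for a hot path.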

What I Would Do Differently

I would not have assumed jemalloc is the fastest allocator for every Rust workload. In 2024 we measured three more: mimalloc, snmalloc, and rpmalloc. The critical detail we missed in the Rust allocator docs was the interaction between tcache flushes, arena locks, and async tasks. Next time I'll profile the allocator before committing to the language runtime.

I would also never have shipped a production allocator switch without validating allocator latency under a synthetic 500 k-user load for 72 hours. The 4–6 ms mimalloc pauses only showed up between the 36th and 48th hour; we would have caught them in pre-prod if we had run longer tests.
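
A long soak test is only useful if something records the pauses. The sketch below shows one way to do that with the standard library alone: a heartbeat that allocates on every tick, a stall counter, and a nearest-rank percentile. All three function names are hypothetical, and the heartbeat runs on a plain OS thread here, whereas in a real service it would more plausibly run as a task on the runtime under test.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Nearest-rank percentile; `p` is in (0, 100]. Sorts `samples` in place.
fn percentile(samples: &mut [Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    let rank = (p * samples.len() as f64 / 100.0).ceil() as usize;
    Some(samples[rank.clamp(1, samples.len()) - 1])
}

/// Counts gaps that overshoot the expected `tick` by more than `budget`.
fn count_stalls(gaps: &[Duration], tick: Duration, budget: Duration) -> usize {
    gaps.iter()
        .filter(|gap| gap.saturating_sub(tick) > budget)
        .count()
}

/// Sleeps `tick`, performs one small allocation, and records the real gap
/// between consecutive wake-ups.
fn sample_heartbeat(tick: Duration, ticks: usize) -> Vec<Duration> {
    let mut gaps = Vec::with_capacity(ticks);
    let mut last = Instant::now();
    for _ in 0..ticks {
        std::thread::sleep(tick);
        // Touch the allocator so an allocator-side stall shows up in the gap.
        let _probe = black_box(Box::new(0u64));
        let now = Instant::now();
        gaps.push(now - last);
        last = now;
    }
    gaps
}
```

With a 10 ms tick and a 1 ms budget, logging `count_stalls` and the p99 gap once per hour across the whole 72-hour run would make a pause that only appears after a day and a half hard to miss.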

Finally, I would insist on a compile-time flag that swaps allocators via Cargo features. Our next feature branch still builds with jemalloc for easier profiling, but defaults to snmalloc in production. The Cargo.toml now reads:

[dependencies]
# Both allocators are optional; each feature below enables one of them.
snmalloc-rs = { version = "0.6", optional = true }
jemallocator = { version = "0.5", optional = true }

[features]
default = ["allocator-snmalloc"]
allocator-snmalloc = ["snmalloc-rs"]
allocator-jemalloc = ["jemallocator"]
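
On the Rust side, the features still have to be wired to a `#[global_allocator]`. A minimal sketch, assuming the feature names from the Cargo.toml above and the `SnMalloc` and `Jemalloc` unit types exported by the two crates; the `active_allocator` helper is a hypothetical addition for startup logging:

```rust
// Guard against both features being enabled at once, which would otherwise
// fail with a less obvious "duplicate global allocator" error.
#[cfg(all(feature = "allocator-snmalloc", feature = "allocator-jemalloc"))]
compile_error!("enable only one allocator feature (use --no-default-features for jemalloc)");

#[cfg(feature = "allocator-snmalloc")]
#[global_allocator]
static GLOBAL: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

#[cfg(feature = "allocator-jemalloc")]
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

/// Name of the allocator selected at compile time, handy for startup logs.
fn active_allocator() -> &'static str {
    if cfg!(feature = "allocator-snmalloc") {
        "snmalloc"
    } else if cfg!(feature = "allocator-jemalloc") {
        "jemalloc"
    } else {
        // Neither feature enabled: Rust falls back to the system allocator.
        "system"
    }
}
```

Because snmalloc is in the default feature set, a plain `cargo build --release` picks it up, while the jemalloc profiling build needs `cargo build --release --no-default-features --features allocator-jemalloc`.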

One flag, two allocators, no more OOM nights.