pretty ncube

The Day We Found Out The Garbage Collector Was Making Architectural Decisions Without Asking

We needed a treasure hunt engine that wouldn't fold under the weight of 500k concurrent players. Our first version ran on Go 1.21 with the default GC settings, and everything was fine until it wasn't. One Sunday evening, P99 latency for the /find endpoint jumped from 45 ms to 2.1 seconds. The alert was brutal:

runtime: garbage collector pause (1.78s), CPU util 87%

That GC pause wasn't a symptom of load; it was the load itself. Go's GC pacer thought it was protecting us, but it was actually deciding how much heap we could touch before the next collection. We had traded low latency for predictable latency, and predictable latency turned out to be slow.

What We Tried First (And Why It Failed)

We lowered GOGC from 100 to 25, reduced allocations in the hot path, and even switched to arena allocation via github.com/ianlancetaylor/malloc. It helped for a week. Then the same pattern returned:

# TYPE treasure_hunt_find_latency_seconds summary
treasure_hunt_find_latency_seconds{quantile="0.99"} 1.84
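It helps to see why lowering GOGC only bought time. The Go GC guide defines the heap size at which the next collection triggers as below. The 4 GiB live heap in the worked numbers is a hypothetical figure for illustration, not a measurement from this service, and GC roots are treated as negligible.

```latex
\text{target heap} = \text{live heap} + (\text{live heap} + \text{GC roots}) \times \frac{\text{GOGC}}{100}

% Hypothetical live heap of 4 GiB, GC roots ignored:
\text{GOGC}=100:\quad 4 + 4 \times 1.00 = 8\ \text{GiB}
\qquad
\text{GOGC}=25:\quad 4 + 4 \times 0.25 = 5\ \text{GiB}
```

Under that assumption, GOGC=25 starts a new cycle after 1 GiB of fresh allocation instead of 4 GiB. At the same allocation rate that is roughly four times as many collections, each of which still has to mark the full live heap. The setting trades memory headroom for GC CPU time; it does not make any single cycle cheaper.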

We spun up pprof and saw that the GC wasn't reclaiming more memory; it was burning more CPU time, because our internal message struct now carried 300 bytes of extra debug flags nobody needed. The GC wasn't just cleaning up garbage; it was re-walking the whole live heap every time it ran, and we had made that heap bigger for no reason. That's when I understood the runtime had become the constraint.

The Architecture Decision

We stopped fighting the GC and moved the hunt engine to Rust 1.75 with Tokio 1.28. We chose tokio-uring for async I/O on io_uring, shaved every Alloc::default(), and turned on jemalloc via tikv-jemallocator because jemalloc's arena separation gave us sub-millisecond allocation latency at 500k ops/sec.

We accepted the tradeoff: no more GC pauses means we now have to manage memory explicitly. Here is what jemalloc's stats looked like the first time we pulled them:

allocated: 3.1 GiB
active: 3.4 GiB
metadata: 172 MiB
resident: 3.9 GiB

That 172 MiB of metadata was the cost of arena isolation, and we could live with it. The real win was the latency delta:

Metric              | Go 1.21, default GC  | Rust 1.75, Tokio + jemalloc
P99 /find latency   | 2.1 s                | 12 ms
RSS after 1 h load  | 14.2 GiB             | 4.9 GiB
GC pressure         | 6.2 s total per hour | 0.0 s

The architecture now looks like this:

┌────────────────────┐    ┌───────────────────┐
│ UDP shards (10)    │───▶│ tokio-uring       │
│ per core           │    │ io_uring ring     │
└────────────────────┘    └────────┬──────────┘
                                   │
                       ┌───────────▼────────────┐
                       │ hunt_service.rs        │
                       │ - Arc<HuntMap>         │
                       │ - jemalloc arena 0..9  │
                       └───────────┬────────────┘
                                   │
                       ┌───────────▼────────────┐
                       │ RegionAllocator        │
                       │ - bump alloc per hunt  │
                       └────────────────────────┘
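To make the "bump alloc per hunt" box concrete, here is a minimal sketch of a region allocator. It is not the production RegionAllocator. The struct layout, the method names, and the choice to hand out offsets instead of raw pointers are all assumptions made for illustration.

```rust
/// Hypothetical per-hunt region: one buffer, one cursor, freed all at once.
struct RegionAllocator {
    buf: Vec<u8>,
    cursor: usize,
}

impl RegionAllocator {
    /// Reserves the whole region up front so later allocations never hit the heap.
    fn with_capacity(bytes: usize) -> Self {
        Self { buf: vec![0; bytes], cursor: 0 }
    }

    /// Reserves `size` bytes at an offset that is a multiple of `align`
    /// (which must be a power of two). Offsets are relative to the start of
    /// the region. Returns None when the region is full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round the cursor up to the next multiple of `align`.
        let start = self.cursor.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.cursor = end;
        Some(start)
    }

    /// Borrows a previously reserved range for writing.
    fn slice_mut(&mut self, offset: usize, size: usize) -> &mut [u8] {
        &mut self.buf[offset..offset + size]
    }

    /// Bytes consumed so far, including alignment padding.
    fn used(&self) -> usize {
        self.cursor
    }

    /// Ends the hunt: every allocation is released by rewinding the cursor.
    fn reset(&mut self) {
        self.cursor = 0;
    }
}
```

The appeal of this shape is that allocation is a couple of integer operations and freeing a finished hunt is a single store, so there is no per-object bookkeeping for anything to walk later. The cost is that nothing inside a region can be freed individually.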

What The Numbers Said After

We ran a 48-hour chaos test: 500k players moving through 20k concurrent hunts. The Go version died at the 20-hour mark with a 30-second stop-the-world GC pause. The Rust version stayed flat:

treasure_hunt_find_latency_seconds{quantile="0.99"} 0.013
process_virtual_memory_bytes 4.4 GiB
jemalloc.allocated_bytes 3.2 GiB

Even better, jemalloc's arena separation meant each hunt thread had its own allocation arena. We instrumented jemalloc via Prometheus:

# HELP jemalloc.arena_i.allocated_bytes arena N bytes allocated
jemalloc.arena_0.allocated_bytes 12.3 MiB
jemalloc.arena_1.allocated_bytes 11.8 MiB

That isolation killed lock contention in the allocator. The lockstat output showed zero contention on malloc:

lockstat_total: 0
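The reason per-thread arenas remove contention can be shown with a toy model. This is not how jemalloc is implemented. The names `ARENA`, `alloc_local`, `local_bytes`, and `run_workers` are hypothetical, and each "arena" is just a thread-local byte buffer.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Hypothetical per-thread arena: a byte buffer only its owning thread can reach.
    static ARENA: RefCell<Vec<u8>> = RefCell::new(Vec::with_capacity(64 * 1024));
}

/// Copies `payload` into the calling thread's arena and returns its offset.
/// No mutex or atomic is involved because no other thread can see this buffer.
fn alloc_local(payload: &[u8]) -> usize {
    ARENA.with(|arena| {
        let mut buf = arena.borrow_mut();
        let offset = buf.len();
        buf.extend_from_slice(payload);
        offset
    })
}

/// Bytes currently held by the calling thread's arena.
fn local_bytes() -> usize {
    ARENA.with(|arena| arena.borrow().len())
}

/// Spawns `threads` workers that each make `allocs` allocations of `size`
/// bytes, then returns how many bytes each worker's own arena ended up holding.
fn run_workers(threads: usize, allocs: usize, size: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let payload = vec![0u8; size];
                for _ in 0..allocs {
                    alloc_local(&payload);
                }
                local_bytes()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `run_workers(4, 100, 32)` gives every worker the same private total, since no worker ever touches another's buffer. A real allocator adds size classes, reuse, and cross-thread frees on top, but the contention story is the same: state that is never shared never needs a lock.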

What I Would Do Differently

If I could rewind, I would resist the temptation to switch languages until we had a profiling story that clearly pinned the GC as the bottleneck. We wasted two weeks on GOGC tuning before we ran pprof on wall-clock time and saw the GC as the single longest stack.

Second, jemalloc wasn't the default choice; we also benchmarked mimalloc and snmalloc. mimalloc gave us a 2 ms lower P99, but its TLS overhead at 20k concurrent tasks negated the win. We picked jemalloc for its per-thread arena design and mature flamegraph integration:

jeprof --show_bytes --svg my_server.12345 > profile.svg

Finally, I'd budget time for Rust's learning curve up front. The borrow checker didn't care about our deadlines; we spent 14 engineer-days untangling lifetime errors in the HuntMap. Our post-mortem showed 60% of the delay came from fighting the compiler, not the code. That time is now paid back every time the server stays up under load.



