The Day the Garbage Collector Slowed Down a Real-Time Treasure Hunt

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Last July we rolled out a new tier of Veltrix: real-time treasure hunts where users solve location-based puzzles in under 30 seconds. The backend is a state machine that ingests GPS pings, validates them against event geofences, and emits updated leaderboards every second. Latency had to stay below 50 ms p99; anything higher and the UI stuttered and the fun died.
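
To make the validation step concrete, here is a minimal sketch of a geofence check. It assumes circular fences and haversine distance; the `Ping` and `Geofence` types and every function name are hypothetical illustrations, not the actual Veltrix code.

```rust
/// A GPS fix in decimal degrees (hypothetical type for illustration).
#[derive(Clone, Copy, Debug)]
pub struct Ping {
    pub lat_deg: f64,
    pub lon_deg: f64,
}

/// A circular geofence: centre plus radius in metres (an assumed shape).
#[derive(Clone, Copy, Debug)]
pub struct Geofence {
    pub center: Ping,
    pub radius_m: f64,
}

/// Mean Earth radius in metres.
const EARTH_RADIUS_M: f64 = 6_371_000.0;

/// Great-circle distance between two fixes, using the haversine formula.
pub fn haversine_m(a: Ping, b: Ping) -> f64 {
    let (lat1, lat2) = (a.lat_deg.to_radians(), b.lat_deg.to_radians());
    let dlat = lat2 - lat1;
    let dlon = (b.lon_deg - a.lon_deg).to_radians();
    let h = (dlat / 2.0).sin().powi(2)
        + lat1.cos() * lat2.cos() * (dlon / 2.0).sin().powi(2);
    // Clamp before asin so rounding error cannot push the argument above 1.
    2.0 * EARTH_RADIUS_M * h.sqrt().min(1.0).asin()
}

/// True when the ping lies within the fence radius (boundary inclusive).
pub fn inside(fence: &Geofence, ping: Ping) -> bool {
    haversine_m(fence.center, ping) <= fence.radius_m
}
```

The check itself allocates nothing; in a service like this the allocation pressure would come from the structs built around each ping, not from the distance math.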

We'd built the first version in Go because that's what most of Veltrix used. The service handled 8 k rps on three c6g.large nodes, but the p99 tail was creeping up to 82 ms. Profiling with go tool pprof showed the GC stopping the world for 12 ms every ~200 ms. That 12 ms pause put us 16 ms over budget when combined with a single slow neighbor.

What We Tried First (And Why It Failed)

We tried several Go-level tweaks:

Increasing GOMAXPROCS to 4 – the extra OS threads only widened the tail further, most likely because a c6g.large has only 2 vCPUs and the GC's background workers now competed with request handling for them.
Switching to Go 1.21 – the GC is already concurrent, but its worst-case pause still hit 11 ms.
Moving the geofence validation into C, called through cgo – the tail dropped to 60 ms, but the build was fragile and the C ABI tied us to specific libc versions.

None of the fixes addressed the fundamental pain: Go's GC is non-generational. It has no cheap nursery for young objects, so every short-lived allocation is handled by the same full collection cycle as long-lived state. That's fine for batch jobs, but real-time scoreboards churn through tens of thousands of short-lived structs:

MemStats after 30 s:
Alloc = 140 MiB
TotalAlloc = 1.2 GiB
PauseTotalNs = 342 ms
NumGC = 11

We needed memory that behaved like a ring buffer, not a garbage-collected heap.

The Architecture Decision

After a four-day spike in Rust, we moved the hot path to a custom allocator that bypasses the general-purpose allocator for small allocations. The new segment:

Rust 1.72, no_std, allocator_api
jemalloc as the global allocator, but only for ≥4 KiB blocks
Our own bump-pointer arena for <4 KiB (geofence checks, scoreboard rows)
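
For readers who have not met a bump-pointer arena before, here is a minimal safe-Rust sketch of the idea. It is not our production allocator: `BumpArena` and its methods are hypothetical names, it hands out offsets rather than raw pointers to avoid unsafe, and it aligns offsets relative to the start of the buffer rather than to absolute addresses.

```rust
/// A fixed-size arena that allocates by advancing one offset.
/// Hypothetical illustration; a real arena would hand out aligned pointers.
pub struct BumpArena {
    buf: Box<[u8]>,
    next: usize, // offset of the first free byte
}

impl BumpArena {
    /// Reserve the backing memory once, up front.
    pub fn with_capacity(bytes: usize) -> Self {
        Self { buf: vec![0u8; bytes].into_boxed_slice(), next: 0 }
    }

    /// Reserve `size` bytes at an offset that is a multiple of `align`
    /// (which must be a power of two). Returns the offset, or None when
    /// the request does not fit.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round `next` up to the requested alignment.
        let start = self.next.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // arena exhausted; state is left unchanged
        }
        self.next = end;
        Some(start)
    }

    /// Borrow a previously reserved region.
    pub fn bytes_mut(&mut self, offset: usize, size: usize) -> &mut [u8] {
        &mut self.buf[offset..offset + size]
    }

    /// Bytes consumed so far, including alignment padding.
    pub fn used(&self) -> usize {
        self.next
    }

    /// Free everything at once by rewinding the offset. Nothing is
    /// dropped or scanned, so the cost does not depend on object count.
    pub fn reset(&mut self) {
        self.next = 0;
    }
}
```

Allocation is one add and one compare, and `reset` is a single store. That constant-time reset is what makes a once-per-tick cleanup predictable, at the price of never freeing individual objects.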

Key trade-offs we accepted:

We gave up Go's runtime magic (stack traces, defer, recover) in the hot loop.
We accepted the pain of writing unsafe for zero-copy deserialization of Avro messages.
We lost runtime reflection, so we had to hand-write Serde traits for every event type.

In exchange we gained:

Deterministic deallocation: the arena reset every second, so GC pauses vanished.
A 3× smaller memory footprint: 42 MiB vs 140 MiB.
P99 latency of 27 ms on the same hardware.

What The Numbers Said After

We ran a 2-hour canary with synthetic load at 10 k rps:

Before (Go 1.21, concurrent GC):
p50 8 ms p95 35 ms p99 82 ms RSS 240 MiB

After (Rust bump arena):
p50 5 ms p95 20 ms p99 27 ms RSS 89 MiB

GC pauses measured via perf_event_open: 0 over 7,200 s.
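
If you want to reproduce tail figures like these from raw latency samples, a nearest-rank percentile is one simple option. The sketch below is an assumption about method, not a description of our load-test tooling, and `percentile` is a hypothetical helper name.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p`
/// percent of all samples are less than or equal to it.
/// Returns None for empty input or a `p` outside 0..=100.
pub fn percentile(samples_ms: &[f64], p: f64) -> Option<f64> {
    if samples_ms.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank, rounded up; p = 0 maps to the minimum sample.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

Sorting a full copy is fine for offline analysis of a canary run; a live service would more likely keep a histogram, since sorting every sample on each report is itself an allocation-heavy step.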

The Rust binary was 400 KiB larger (7.4 MiB vs 7.0 MiB), but the RSS drop more than paid for it under our Kubernetes overcommit policy.

One surprise: the flame graph still showed 3 ms spent in poll syscalls. We had forgotten to set SO_REUSEPORT on our UDP socket, so the kernel was serializing the recv path across three listeners. After adding:

socket.set_reuse_port(true)?;

the p99 dropped another 4 ms to 23 ms.

What I Would Do Differently

I would never again ship a real-time path in Go without first proving the GC can be silenced. The initial 8 k rps test looked fine until the 10th user joined a dense city block and the GPS pings became correlated.

I also underestimated the tooling tax. Debugging unwinding in release builds with custom allocators was brutal. Next time I would start with mimalloc's arena mode before rolling a custom allocator.

Finally, I would insist on a compile-time boundary between Rust and Go. Our original plan was a single binary with cgo, but the resulting stack traces mixed Go panics with Rust unwinds and were impossible to read in Sentry. We ended up splitting the hot path into a sidecar and using gRPC for IPC. The extra hop cost 2 ms, which we clawed back by enabling gRPC keep-alive and zero-copy encoding.

The treasure hunt still runs, the GC has stopped moving, and users think the game is simply more responsive. We finally solved the real problem: not the geofences, but the language that let the fence cross the latency line.

Top comments (1)

Vasyl • May 27

“The problem wasn’t the geofences, but the GC” goes hard 😄
This is peak “Rust fixed our latency” content. Dropping p99 from 82 ms to 23 ms is actually insane.