Three Nights Without Sleep to Find the Leaking Pointer

#webdev #programming #rust #performance

The Problem We Were Actually Solving

At Veltrix we run a low-latency real-time treasure hunt where thousands of clients poll for a 50-byte payload every 250 ms. The server ingests 1.2 million messages per second, hydrates them into 8 KB arena-allocated blobs, and streams the result back. We hit 55 µs p99 latency in staging, but in prod the server's RSS ballooned from 2 GB to 9 GB in four hours, and jemalloc showed 8.2 GB actively allocated while our Prometheus counter read only 2.1 GB.

perf c2c screamed at us: 43 % of cache misses were on a single 8-byte pointer that escaped the arena every time a treasure box was opened. The box struct itself was 32 bytes, but somewhere in the JSON deserialization path a Box<String> was being created on every fourth request, and since the arena lived in a single-threaded tokio runtime, the drop never happened until shutdown. The allocator was handing out 4 KB pages like candy while the pointer hitchhiked to a work-stealing thread, where it finally panicked during unwind.
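The gap between the allocator's view (8.2 GB) and the application counter (2.1 GB) is the kind of thing a counting allocator can surface early. The following is a minimal sketch, not what we ran in production: it wraps the system allocator rather than jemalloc, and the names CountingAlloc, LIVE_BYTES, and live_bytes are hypothetical. A Box<String> that is never dropped stands in for the escaped pointer.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Bytes currently handed out by the allocator (hypothetical metric).
static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Wraps the system allocator and tracks live bytes on every alloc/dealloc.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Snapshot of the allocator-side byte count.
fn live_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}

fn main() {
    let before = live_bytes();
    // A Box<String> that escapes and is never dropped.
    let escaped = std::hint::black_box(Box::new(String::from("treasure")));
    std::mem::forget(escaped);
    let after = live_bytes();
    println!(
        "bytes still live after the leak: {}",
        after.saturating_sub(before)
    );
}
```

Exporting a counter like this next to the application-level one makes the two numbers directly comparable; any allocation that bypasses the arena shows up as a difference between them.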

What We Tried First (And Why It Failed)

We began with Go 1.21. We used encoding/json with map[string]interface{} and a global sync.Pool for 8 KB buffers. P99 latency measured 68 µs, but RSS grew 300 MB per hour because the GC couldn't keep up with the arena-style allocation bursts. We tried replacing the pool with a custom sync.Pool of bytes.Buffer, but the pointer leak migrated to a race between Buffer.Grow and the GC: the grow could trigger an allocation whose finalizer ran on a different P, causing a spurious pointer in a stack frame that was already unwound.

We finally switched to fasthttp and cut latency to 42 µs, but the allocations still climbed, this time because fasthttp's object pool reused Request objects but left the []byte slices pinned by string conversions. The Go runtime's own allocator was fragmenting the heap into 64-byte slabs while the treasure boxes demanded 8 KB slabs. At 1.2 million rps the allocator backlog reached 400 k items and GC pauses spiked to 12 ms.

The Architecture Decision

We rewrote the deserializer in Rust 1.75 with serde and simd-json on a bumpalo arena. The key change: every treasure box now implements a no-op Drop; we rely on arena reset instead of per-object deallocation. The arena size is fixed at 128 MB and we reset it every 16 ms using a timer wheel. The JSON parser writes directly into the arena, so the only pointers that ever escape are the top-level treasure payload and a Vec<u8> that lives in a thread-local queue.

We also switched from tokio to mio with a custom work-stealing scheduler because tokio's spawn rate of ~1 k tasks per second was causing arena resets to block on task teardown. With the new runtime the arena reset is a single arena.reset() followed by std::mem::take, measured at 7 µs. We wrapped the arena in an Arc<Mutex<...>> backed by a spin lock rather than an RwLock, because RwLock added 3 µs and the lock was held for less than 1 µs anyway.
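To make the "reset instead of deallocate" idea concrete, here is a toy bump arena in safe, std-only Rust. It is a stand-in for bumpalo, not our production code: EpochArena and Blob are hypothetical names, handles are offsets instead of pointers, and the epoch check is an assumption added so that a stale handle fails loudly instead of reading recycled bytes. In bumpalo itself the equivalent call is Bump::reset, which takes &mut self, so the borrow checker rejects references that would outlive a reset.

```rust
/// A toy bump arena: one fixed buffer, a cursor, and an O(1) reset.
struct EpochArena {
    buf: Vec<u8>,
    cursor: usize,
    epoch: u64,
}

/// Handle to a blob; only valid during the epoch it was allocated in.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Blob {
    offset: usize,
    len: usize,
    epoch: u64,
}

impl EpochArena {
    fn with_capacity(bytes: usize) -> Self {
        Self { buf: vec![0; bytes], cursor: 0, epoch: 0 }
    }

    /// Bump-allocate a copy of `data`; returns None when the arena is full.
    fn alloc(&mut self, data: &[u8]) -> Option<Blob> {
        let end = self.cursor.checked_add(data.len())?;
        if end > self.buf.len() {
            return None;
        }
        self.buf[self.cursor..end].copy_from_slice(data);
        let blob = Blob { offset: self.cursor, len: data.len(), epoch: self.epoch };
        self.cursor = end;
        Some(blob)
    }

    /// Read a blob back; handles from an earlier epoch yield None.
    fn get(&self, blob: Blob) -> Option<&[u8]> {
        if blob.epoch != self.epoch {
            return None;
        }
        self.buf.get(blob.offset..blob.offset + blob.len)
    }

    /// Frees everything at once by rewinding the cursor; no per-object drop runs.
    fn reset(&mut self) {
        self.cursor = 0;
        self.epoch += 1;
    }

    /// Bytes allocated in the current epoch.
    fn used(&self) -> usize {
        self.cursor
    }
}

fn main() {
    let mut arena = EpochArena::with_capacity(64 * 1024);
    let payload = [0xABu8; 8 * 1024];
    let blob = arena.alloc(&payload).expect("arena has room");
    assert_eq!(arena.get(blob).map(|b| b.len()), Some(8192));

    arena.reset(); // end of epoch: everything is gone in one step
    assert_eq!(arena.used(), 0);
    assert!(arena.get(blob).is_none());
}
```

The point of the sketch is the cost model: allocation is a bounds check plus a cursor bump, and reset is two integer writes no matter how many blobs were allocated during the epoch.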

What The Numbers Said After

Latency distribution after the change:

p50: 48 µs
p90: 60 µs
p99: 65 µs

RSS fell from 9 GB to 2.4 GB in production, measured over 24 hours, and jemalloc stats reported 2.2 GB active, matching our Prometheus counter. perf c2c no longer flagged cache-line contention on the pointer. RSS churn dropped from 700 MB/hour to 12 MB/hour. The bump allocator had zero fragmentation: all allocations were 8 KB aligned and reused within the same epoch. The only remaining spike was the arena reset at epoch boundaries, which showed up as a 14 µs pause every 16 ms, still below our SLA.
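For a sense of scale, the reset pause and the churn reduction can be worked out from the figures above. This is plain arithmetic on the reported numbers; pause_fraction is a hypothetical helper name.

```rust
/// Fraction of wall-clock time spent in a periodic pause.
fn pause_fraction(pause_us: f64, period_us: f64) -> f64 {
    pause_us / period_us
}

fn main() {
    // 14 µs of reset work in every 16 ms (16,000 µs) epoch.
    let reset = pause_fraction(14.0, 16_000.0);
    println!("reset overhead: {:.4} %", reset * 100.0);

    // RSS churn before and after, in MB/hour.
    let churn_ratio = 700.0_f64 / 12.0;
    println!("churn reduced by about {:.0}x", churn_ratio);
}
```

The reset costs under a tenth of a percent of each epoch, and churn is down by a factor of roughly 58.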

What I Would Do Differently

I would not have started with Go. The GC promises low latency, but it cannot promise bounded RSS when the workload mixes large arena allocations with high QPS. I would also avoid tokio for work-stealing arenas; we traded 3 µs of lock overhead for 14 µs of timer scheduling and saved 700 k allocations per second. Finally, I would compile with lto = "thin" and codegen-units = 1 from day one; the deserializer's hot path inlines simd-json's UTF-8 parser, and the extra 300 ms of compile time saved 8 µs on every 250 ms interval, which is the difference between missing and hitting the SLA.
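In Cargo terms, those two settings go in the release profile. A minimal fragment, assuming a standard Cargo.toml and leaving every other profile option at its default:

```toml
# Cargo.toml
[profile.release]
lto = "thin"       # cross-crate inlining via ThinLTO
codegen-units = 1  # one codegen unit per crate, so more inlining opportunities
```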