The Problem We Were Actually Solving
In 2024, Veltrixs treasure-hunt engine, Quarry, powered 120k concurrent sessions across 42 shards with Go 1.21 and a custom shard allocator. Latency was flat at 12ms for 95th percentile under synthetic load, but our secondary metrics were lying. Flamegraphs showed 32% of time in the scheduler, and irqbalance taking 8% CPU stealing from our game-loop. The real problem wasnt algorithmic bottlenecks—it was Gos cooperative scheduler refusing to preempt on syscalls inside a hot loop that called epoll_pwait every 50μs.
What We Tried First (And Why It Failed)
The first patch set replaced epoll with io_uring and rewrote the event loop in pure Go. The GC pause dropped from 1.2ms to 0.7ms, but the scheduler still blocked the entire P on each syscall. Next we tried GOMAXPROCS=48, but the shard allocator fought us: each shard pinned 3 Ps for hot-path logic, leaving only 18 Ps for background jobs. This starved the work-stealing path and pushed context-switch overhead above 14% CPU. The final straw was a memory regression: after weeks of tuning, jemalloc reported 1.8× resident set growth versus a C++ baseline on the same hardware.
The Architecture Decision
We migrated Quarry to Rust 1.76 with tokio 1.36 + mio 0.8, keeping the shard allocators API but swapping the runtime. Tokios work-steal scheduler and io-uring driver moved syscall latency into dedicated kernel threads instead of blocking the Go Ps. We added a custom slab allocator sized for 1024-byte shard state to eliminate jemalloc churn. The shards themselves became async tasks instead of goroutines, letting us pin them to CPU cores without starvation. The change meant rewriting 45k lines of high-frequency path logic, but the scheduler finally stopped fighting us.
What The Numbers Said After
flamegraph --frequency 9999 --kernel --title shard-00 showed syscall latency dropping from 12μs median → 4μs, and the scheduler slice fell to 7%. heap profiling revealed the slab allocator held 1.4MB RSS vs jemallocs 2.6MB. Under 400k simulated players, P99 latency shrank from 28ms → 11ms, and GC pressure vanished because Rusts allocator returned memory to the OS instead of hoarding it. CPU usage per shard fell 18%, letting us consolidate two shards onto a single bare-metal host and cut colocation cost by $3k/month.
What I Would Do Differently
I would have started with Rust six months earlier instead of trying to bend Gos scheduler. The Go teams heroic work on io_uring and PGO still wouldnt reach the determinism we needed for frame-pacing. Today I would also skip mio in favor of tokios io-uring crate directly, because the mio abstraction layer added 2μs of indirection on every syscall path—something the 2024 benchmarks naively masked. Finally, I would budget two engineers for a month-long rewrite: the cost is trivial compared to months of Go runtime profiling that never quite removed scheduler jitter.
Top comments (0)