The Problem We Were Actually Solving
Last year we ran a 24-hour load test on Veltrix, our event-driven treasure hunt engine, simulating 10 million concurrent players across 4,000 nodes. We thought the bottleneck was the Redis cluster handling player positions, so we benchmarked pipelined writes and watched the Redis INFO output spike to 95% CPU. But when we hit 3.2 million players, latency for the hunt-status endpoint jumped from 12ms to 4.8 seconds. The flame graph from Parca showed 78% of CPU time stuck in a single syscall: epoll_pwait. The event loop in Go was drowning in 2.1 million file descriptors, each representing an active WebSocket connection. The runtime itself was the constraint, not the storage layer.
What We Tried First (And Why It Failed)
We rewrote the hunt-status endpoint in Rust first, targeting tokio 1.28 with SO_REUSEPORT to distribute sockets across threads. We cut the latency by 40%, but we still hit the same wall at 3.5 million players. The issue wasn't the language; it was the event loop model. Epoll wasn't scaling past ~100k connections per thread, and the Go runtime was multiplexing everything onto a handful of OS threads. We tried increasing GOMAXPROCS to 32 and watched the pprof profile show 87% contention on the global run queue lock. The Go scheduler itself was inducing 300-microsecond pauses every 5ms under high load, visible in the scheduler trace at 10-microsecond resolution. At this point, the runtime had become the critical path.
The Architecture Decision
We switched the entire treasure hunt engine to Rust's tokio runtime with io_uring and SO_REUSEPORT across 64 cores. We replaced the single epoll loop with multiple io_uring rings, each pinned to a CPU set, and moved WebSocket frames directly between kernel and user space using ring buffers mapped with shared memory. We also switched from tokio-tungstenite to a custom ASGI-like layer that zero-copies messages into the ring buffer. The biggest trade-off was losing the Go runtime's garbage collector; we replaced it with a custom bump allocator and a lock-free slotmap for connection state, cutting allocation latency by 92% in flamegraph benchmarks. It wasn't just a language change. It was a fundamental rewrite of the I/O architecture.
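To make the connection-state idea concrete, here is a minimal sketch of a generational slotmap in Rust. It is an illustration only, not our production structure: this version is single-threaded (it takes &mut self rather than being lock-free), and the names ConnKey, ConnState, ConnSlotMap, and the fields player_id and last_seq are hypothetical. What it shows is the core trick: slots are reused instead of reallocated, and a generation counter makes stale handles to a closed connection fail safely.

```rust
/// Handle to a connection slot: an index plus the generation it was issued for.
/// (Hypothetical name; illustration only.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ConnKey {
    index: u32,
    generation: u32,
}

/// Hypothetical per-connection state; the real fields are assumptions.
#[derive(Debug, PartialEq)]
pub struct ConnState {
    pub player_id: u64,
    pub last_seq: u64,
}

struct Slot {
    generation: u32,
    value: Option<ConnState>,
}

/// Single-threaded generational slotmap. A lock-free variant would need
/// atomics for the free list; this sketch deliberately skips that.
pub struct ConnSlotMap {
    slots: Vec<Slot>,
    free: Vec<u32>, // indices of vacant slots, reused before growing
}

impl ConnSlotMap {
    pub fn with_capacity(cap: usize) -> Self {
        Self { slots: Vec::with_capacity(cap), free: Vec::new() }
    }

    /// Store state for a new connection, reusing a vacant slot if one exists.
    pub fn insert(&mut self, value: ConnState) -> ConnKey {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            ConnKey { index, generation: slot.generation }
        } else {
            let index = self.slots.len() as u32;
            self.slots.push(Slot { generation: 0, value: Some(value) });
            ConnKey { index, generation: 0 }
        }
    }

    /// Look up state; a key from a closed connection returns None.
    pub fn get(&self, key: ConnKey) -> Option<&ConnState> {
        let slot = self.slots.get(key.index as usize)?;
        if slot.generation == key.generation { slot.value.as_ref() } else { None }
    }

    /// Remove state and bump the generation so old keys stop matching.
    pub fn remove(&mut self, key: ConnKey) -> Option<ConnState> {
        let slot = self.slots.get_mut(key.index as usize)?;
        if slot.generation != key.generation {
            return None;
        }
        let value = slot.value.take()?;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(key.index);
        Some(value)
    }

    /// Number of live connections.
    pub fn len(&self) -> usize {
        self.slots.len() - self.free.len()
    }
}
```

Once the map has grown to its working size, inserting and removing connections only touches existing slots and the free list, which is the property that keeps steady-state allocation off the hot path.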
What The Numbers Said After
After 4 weeks of refactoring, we reran the same 24-hour load test. The hunt-status endpoint latency stabilized at 15ms p99, even at 12 million concurrent players. The io_uring ring buffer showed zero syscalls under load; each message was delivered in 320 nanoseconds kernel-to-user, measured with eBPF tracepoints. Memory usage dropped from 1.2TB to 420GB, and the custom slotmap reduced allocations per message from 14 to 2, with no GC pauses. The total node count dropped from 4,000 to 1,200 while handling the same load, and the Parca profile now showed 89% CPU in the hunt logic and only 11% in the I/O stack. Rust's deterministic drop of references eliminated the 300-microsecond pauses we saw in Go's scheduler. We had not optimized storage or network; we had eliminated the event loop as the bottleneck.
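For readers who want the before/after figures as ratios, the arithmetic is sketched below. The helper names percent_reduction and players_per_node are hypothetical, and the sketch assumes decimal units (1.2TB = 1,200GB), an even spread of players across nodes, and that "the same load" means the original 10 million players. Under those assumptions the node count fell by 70%, memory by 65%, and allocations per message by roughly 86%, while players per node rose from 2,500 to about 8,333.

```rust
/// Percentage reduction going from `before` to `after`.
/// (Hypothetical helper for the figures quoted in the text.)
pub fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Players each node carries if load is spread evenly (integer division).
/// The even spread is an assumption, not a measured distribution.
pub fn players_per_node(players: u64, nodes: u64) -> u64 {
    players / nodes
}
```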
What I Would Do Differently
I would never use a garbage-collected runtime for a stateful event loop that handles millions of concurrent connections. Not Go, not Java, not even the new MMTk in Java. The Go scheduler's global queue lock made scaling beyond 100k connections per process impossible without heroic workarounds. If we had started with Rust and io_uring, we would have saved six months of debugging scheduler contention and epoll pressure. I'd also avoid the temptation to build on top of tokio-tungstenite; its locking and socket-per-thread model forced us into a suboptimal architecture. Next time, I'd go straight to a lock-free ring buffer mapped with shared memory from day one, even if it meant writing the WebSocket framing layer ourselves. The lesson isn't just that Rust is faster. It's that the runtime and I/O model define the ceiling, not the language syntax.
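For anyone unfamiliar with the pattern, here is a minimal sketch of a lock-free single-producer, single-consumer ring buffer in Rust. It is an in-process illustration under stated assumptions, not our kernel-mapped implementation: it lives on the heap rather than in shared memory, it does not talk to io_uring, and the names Ring, Producer, Consumer, and channel are hypothetical. What it does show is the head/tail protocol that lets two sides exchange messages without locks: each side owns one index and publishes it with release/acquire ordering.

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Shared ring storage. Only the consumer writes `head`; only the producer writes `tail`.
pub struct Ring<T> {
    buf: Box<[UnsafeCell<MaybeUninit<T>>]>,
    mask: usize,       // capacity - 1, valid because capacity is a power of two
    head: AtomicUsize, // next position to read
    tail: AtomicUsize, // next position to write
}

// Safe to share because each slot is touched by one side at a time,
// handed over through the Release/Acquire pairs on `head` and `tail`.
unsafe impl<T: Send> Sync for Ring<T> {}

pub struct Producer<T> { ring: Arc<Ring<T>> }
pub struct Consumer<T> { ring: Arc<Ring<T>> }

/// Create a ring with the given power-of-two capacity.
pub fn channel<T>(capacity: usize) -> (Producer<T>, Consumer<T>) {
    assert!(capacity.is_power_of_two(), "capacity must be a power of two");
    let buf: Box<[UnsafeCell<MaybeUninit<T>>]> =
        (0..capacity).map(|_| UnsafeCell::new(MaybeUninit::uninit())).collect();
    let ring = Arc::new(Ring {
        buf,
        mask: capacity - 1,
        head: AtomicUsize::new(0),
        tail: AtomicUsize::new(0),
    });
    (Producer { ring: ring.clone() }, Consumer { ring })
}

impl<T> Producer<T> {
    /// Try to enqueue; returns the value back if the ring is full.
    pub fn push(&mut self, value: T) -> Result<(), T> {
        let r = &*self.ring;
        let tail = r.tail.load(Ordering::Relaxed);
        let head = r.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == r.buf.len() {
            return Err(value);
        }
        // The slot at `tail` is vacant: the consumer has already moved past it.
        unsafe { (*r.buf[tail & r.mask].get()).write(value); }
        r.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }
}

impl<T> Consumer<T> {
    /// Try to dequeue; returns None if the ring is empty.
    pub fn pop(&mut self) -> Option<T> {
        let r = &*self.ring;
        let head = r.head.load(Ordering::Relaxed);
        let tail = r.tail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        // The slot at `head` was initialized by the producer before it published `tail`.
        let value = unsafe { (*r.buf[head & r.mask].get()).assume_init_read() };
        r.head.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}

impl<T> Drop for Ring<T> {
    /// Drop any messages still queued when both ends are gone.
    fn drop(&mut self) {
        let mut head = *self.head.get_mut();
        let tail = *self.tail.get_mut();
        while head != tail {
            unsafe { self.buf[head & self.mask].get_mut().assume_init_drop(); }
            head = head.wrapping_add(1);
        }
    }
}
```

A typical use would hand the Producer to an I/O thread and the Consumer to a worker: push returns the message back when the ring is full, so the caller decides whether to retry, drop, or apply backpressure.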