pretty ncube

The Moment We Realized the Game Engine Was the Bottleneck Before the First 10k Players Hit

We were three days from launch when our telemetry lit up like a Christmas tree. The Treasure Hunt Engine, which we had tuned for months with careful warm-ups and connection pooling, was spawning 4,200 goroutines per second, and p99 latency was climbing past 800ms. The box had 32 cores and 64GB of RAM, yet the scheduler was spending 47% of its time in runqueue wait according to /proc/sched_debug. We stared at the flame graph from our profiler and realized the Go runtime itself had become the constraint.

We had started with Go 1.21 and a straightforward configuration: worker_pool_size = min(4 * runtime.NumCPU(), 1024), GCPercent = 100, and GOMAXPROCS set to the container limit. The dynamic map loading worked well at 1k players, but when the first in-game event fired (5,000 concurrent teleport requests into a 200m² zone), the scheduler couldn't keep up. The story I keep telling is how we watched a single accept() goroutine block on a mutex inside the listener while 3,987 other goroutines waited to acquire the same lock. The profiler snapshot showed 128,000 context switches in one second. Our SLA was 100ms p99. We were at 842ms.
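The starting configuration described above can be sketched in a few lines of Go. This is a minimal illustration, not the engine's actual startup code: the function name workerPoolSize is hypothetical, and the sketch only reads GOMAXPROCS rather than deriving it from a container limit.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// workerPoolSize applies the sizing rule quoted in the post:
// four workers per CPU, capped at 1024. The name is hypothetical.
func workerPoolSize(numCPU int) int {
	size := 4 * numCPU
	if size > 1024 {
		size = 1024
	}
	return size
}

func main() {
	// GCPercent = 100 is also Go's default; SetGCPercent returns the old value.
	prevGC := debug.SetGCPercent(100)

	// GOMAXPROCS(0) reports the current setting without changing it.
	procs := runtime.GOMAXPROCS(0)

	fmt.Printf("workers=%d gcpercent(prev)=%d gomaxprocs=%d\n",
		workerPoolSize(runtime.NumCPU()), prevGC, procs)
}
```

If runtime.NumCPU() reports 32, as it would on the box described earlier, this rule yields 128 workers, well under the 1024 cap.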

We tried every knob we could find. We set debug.SetMaxThreads(10000) to stop the runtime from crashing when it ran out of OS threads (the setting caps threads, not goroutines). We raised GOMEMLIMIT to 52GB, but GC pauses jumped from 5ms to 147ms once the heap ballooned past 48GB. We rewrote the map loader in C behind cgo to reduce allocations, but the calls across the Go/C boundary added 3–5μs of overhead per tile fetch and the latency spikes remained. We even rewrote part of the spatial index in assembly to reduce branch misses, yet the scheduler was still the choke point. The breakthrough wasn't in the code; it was in admitting the language runtime had become a liability.

The architecture decision came down to two choices: pay the cost of the Go runtime forever and keep patching around the scheduler, or switch to Rust where we could control the future. We chose Rust, but only after fighting over the tradeoffs. We lost one senior engineer who refused to learn borrow checking; we gained two who had shipped game servers in C++ and immediately asked the right question: "How much RAM can we burn to keep latency flat?" We rewrote the hot path (the spatial partition and event scheduler) in Rust, keeping the Go control plane for dynamic reconfiguration but moving the physics and AI simulations into a separate Rust process with tokio runtimes pinned to individual cores. The new process runs with jemalloc, uses crossbeam-epoch for lock-free data structures, and sets tokio's worker_threads to the number of physical cores minus two, leaving two cores for the control plane. We compiled with -C target-cpu=native and -C codegen-units=1 to maximize inlining and vectorized instructions.

After the change, the p99 latency on the same event dropped to 28ms. The scheduler runqueue wait fell to 1.8%. We measured allocations with dhat-rs: the Rust hot path allocates 1.4MB/s versus 48MB/s in Go. The flame graph now shows 63% of CPU time in our own code, 22% in tokio::task::waker, and 15% in jemalloc. We run Valgrind's Massif on the Rust process daily; peak heap is 12MB versus 58MB in the Go version. Memory usage fell from 64GB to 18GB under full load, and the GC is gone, so there are no more 147ms pauses. Our container limits dropped from 4 vCPUs / 8GB to 2 vCPUs / 4GB without regressing throughput.

What I would do differently is simpler than I expected. First, we should have insisted on a Rust prototype during the design phase instead of assuming Go would scale. Second, we should have set up continuous fuzzing on the spatial index with AFLplusplus right after the first rewrite. Third, we should have benchmarked the IPC channel between Go and Rust under load; the initial perf ring buffer was 4KB and caused stalls when event bursts exceeded 2k messages per frame. Finally, we should have documented the per-core thread-pinning policy we settled on: we pin the Rust scheduler threads with pthread_setaffinity_np and keep jemalloc arenas on local NUMA nodes; without that, we saw 8–12% latency variance between cores. The Go scheduler had abstracted that away; now we own it.
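The ring buffer stall is easy to see with a little arithmetic. The sketch below assumes a fixed 64-byte message, which is a hypothetical figure (the post does not give the real message size), and both function names are made up for illustration.

```go
package main

import "fmt"

// capacity reports how many whole messages fit in a ring of ringSize bytes.
func capacity(ringSize, msgSize int) int {
	return ringSize / msgSize
}

// ringBytes returns the smallest power-of-two byte size that can hold
// a burst of `burst` messages of msgSize bytes each.
func ringBytes(burst, msgSize int) int {
	need := burst * msgSize
	size := 1
	for size < need {
		size <<= 1
	}
	return size
}

func main() {
	const msgSize = 64 // hypothetical message size in bytes

	// A 4KB ring holds only a small fraction of a 2k-message burst.
	fmt.Println("messages in 4KB ring:", capacity(4096, msgSize))

	// Size needed to absorb a full 2,000-message burst without stalling.
	fmt.Println("bytes for 2k burst:", ringBytes(2000, msgSize))
}
```

Under that assumption a 4KB ring holds 64 messages, so a 2,000-message burst forces the producer to wait on the consumer dozens of times per frame; absorbing the whole burst would take a 128KiB ring. Even with much smaller messages the 4KB buffer falls short, since 4096 bytes across 2,000 messages leaves about two bytes each.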