The Moment Our Game Server Collapsed Under the Weight of JavaScript Closures

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We needed a real-time hunt locator that could answer "Where is the chest?" in under 50 ms for 15,000 concurrent players while staying under 400 MB RSS with no GC pauses longer than 1 ms. The feature was already live on the RC channel: players tapping their compass could see a pulsar on the minimap within 300 ms. Our previous C++ server could do it easily, but the JavaScript prototype was half the code size and integrated faster with the web frontend. So we rewrote the service in Node.js, used worker_threads to isolate heavy pathfinding, and shoved the treasure data into a Redis cluster. It worked for 200 players and started failing at 4,000. The profiler snapshot from 0x showed 62 % of CPU time was spent in v8::internal::Deoptimizer::Reoptimize because every closure I'd written to capture player state carried a hidden 8-byte pointer. Multiply that by 6,000 concurrent hunts and you get 48 KB of pointers you neither asked for nor wanted.

What We Tried First (And Why It Failed)

Our first hope was to set the V8 flag --max-old-space-size=8192 and let it ride. That bought us 40 minutes before the Node process hit 8 GB RSS and the kernel OOM killer stepped in. Next we tried worker_threads with shared-nothing memory. That reduced RSS but introduced a new latency spike every 250 ms when the main thread paused to compact the heap. The Redis cluster also became a bottleneck; an RTT of 420 µs with 12 % packet loss from the GKE cluster to a single-zone Memorystore instance made the service oscillate between 60 ms and 520 ms. I still remember the Slack alert: "Players reported the chest coordinates disappearing for 400 ms and then snapping back." That told me the event loop wasn't just slow; it was jittery.

The Architecture Decision

We evaluated Rust because the compiler promised zero-cost closures and predictable latency. I wrote a minimal proof-of-concept in one weekend: a Tokio-based HTTP server serving a hand-rolled flat array of treasure coordinates with a binary-search routing table. The first build used tokio::spawn for every hunt request, which led to a deadlock under load because I forgot to limit the number of concurrent tasks. After a 2-hour debug session we switched to a work-stealing scheduler and set tokio::runtime::Builder::max_blocking_threads(16). The key change was storing the treasure map as a Vec<(u64, u64)> sorted by Morton code instead of a HashMap<Uuid, Coord>; the per-entry footprint dropped from 19 bytes to 12, and the L1 miss ratio fell from 18 % to 5 %. We also moved the Redis dependency from the hot path to a cold path: only when a player actually opened the chest did we fetch the loot table, cutting Redis QPS from 2,100 to 400 and tail latency from 15 ms to 1.8 ms.
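The sorted flat array described above can be sketched roughly as follows. This is a minimal illustration, not the production code: it assumes each tuple holds (Morton code, chest id), that coordinates fit in 32 bits, and that lookups are exact-cell matches. The names `morton`, `TreasureMap`, and `chest_at` are hypothetical.

```rust
/// Spread the 32 bits of `v` so they occupy the even bit positions of a u64.
fn spread_bits(v: u32) -> u64 {
    let mut x = v as u64;
    x = (x | (x << 16)) & 0x0000_FFFF_0000_FFFF;
    x = (x | (x << 8)) & 0x00FF_00FF_00FF_00FF;
    x = (x | (x << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    x = (x | (x << 2)) & 0x3333_3333_3333_3333;
    x = (x | (x << 1)) & 0x5555_5555_5555_5555;
    x
}

/// Interleave x (even bits) and y (odd bits) into one Morton (Z-order) code.
pub fn morton(x: u32, y: u32) -> u64 {
    spread_bits(x) | (spread_bits(y) << 1)
}

/// Read-only treasure map: a flat vector sorted by Morton code.
/// Assumption: each entry is (morton_code, chest_id).
pub struct TreasureMap {
    entries: Vec<(u64, u64)>,
}

impl TreasureMap {
    /// Build the map once from (x, y, chest_id) triples.
    pub fn new(chests: &[(u32, u32, u64)]) -> Self {
        let mut entries: Vec<(u64, u64)> = chests
            .iter()
            .map(|&(x, y, id)| (morton(x, y), id))
            .collect();
        entries.sort_unstable_by_key(|&(code, _)| code);
        Self { entries }
    }

    /// Binary-search the sorted vector for the chest in cell (x, y).
    pub fn chest_at(&self, x: u32, y: u32) -> Option<u64> {
        let code = morton(x, y);
        self.entries
            .binary_search_by_key(&code, |&(c, _)| c)
            .ok()
            .map(|i| self.entries[i].1)
    }
}
```

Because the map is built once and never mutated, lookups need no locking, and neighbouring cells tend to sit close together in memory, which is the property the cache-miss numbers above depend on.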

What The Numbers Said After

After the Rust rewrite we ran a 30-minute sustained load test at 18,000 concurrent players. The latency histogram showed P50 19 ms, P95 38 ms, P99 64 ms. RSS never exceeded 310 MB, and because Rust has no garbage collector there were no GC pauses to measure; the longest task poll times reported by tokio-console were all under 200 µs. The 60-second allocation rate was 1.2 MB/s in the hot path, which meant the CPU spent 3 % of cycles on memory operations versus 18 % in the Node version. Flamegraph output from perf showed the bottleneck was no longer GC or closures but the Morton code lookup, a classic case of the data structure being the real constraint. We deployed the binary to Kubernetes with a memory request of 384 Mi and a CPU request of 2000 millicores. During the Black Lotus event we peaked at 22,400 concurrent players for three minutes; the Prometheus graphs show zero 5xx errors and no pod evictions. That's when I knew we had chosen the right language, not because it was safe, but because it removed the language from the critical path.
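For readers who want to reproduce figures like P50/P95/P99 from raw samples, here is a minimal sketch using the nearest-rank method. That method is an assumption: load-testing tools often use bucketed histograms or interpolation instead, so their numbers can differ slightly. The function name `percentile` is hypothetical.

```rust
/// Nearest-rank percentile over latency samples (any unit, e.g. milliseconds).
/// `p` is a whole percentage in 1..=100. Returns None for empty input
/// or an out-of-range `p`.
pub fn percentile(samples: &[u64], p: usize) -> Option<u64> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(p * n / 100), computed in integers to avoid float rounding.
    let rank = (p * sorted.len() + 99) / 100;
    // rank is at least 1 here because p >= 1 and the slice is non-empty.
    Some(sorted[rank - 1])
}
```

Calling `percentile(&latencies_ms, 99)` on the samples collected during a load test would give the value that 99 % of requests stayed at or below.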

What I Would Do Differently

I would not have reached for Rust so early. The Node prototype taught us the treasure-hunt query pattern: read-only, deterministic, latency-critical. Had we first tried Go with a single-threaded HTTP server and a pre-sorted flat slice, we could have delivered a production-grade service in two days instead of two weeks. The mistake was assuming JavaScript closures would scale simply because Node.js felt familiar. Second, I would have instrumented the Node version earlier with clinic.js and flamegraphs, before it hit 4,000 players. We lost two weeks chasing --max-old-space-size and worker_threads when the profiler would have shown the closure bloat in five minutes. Finally, I would have used jemalloc from day one in Rust instead of the default glibc allocator; after swapping allocators, allocation latency dropped another 8 %, proving that even in a low-level language the allocator can still be the bottleneck if you don't measure.
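On the "measure your allocator" point: Rust lets a program replace its allocator through the `#[global_allocator]` attribute, and a jemalloc crate such as tikv-jemallocator plugs in through that same attribute. The sketch below is not the jemalloc swap itself. It uses only the standard library to wrap the system allocator and count allocations, which is one cheap way to see how much a hot path allocates before deciding whether a swap is worth it. `CountingAlloc` and `allocations_during` are hypothetical names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Process-wide counters updated on every allocation.
pub static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);
pub static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Thin wrapper around the system allocator that records call and byte counts.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        // Delegate the real work to the platform allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Install the wrapper for the whole program.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Run `f` and report (result, allocation calls, bytes requested) during it.
/// Counters are process-wide, so allocations made by other threads at the
/// same time are included; treat the numbers as an upper bound for `f`.
pub fn allocations_during<T>(f: impl FnOnce() -> T) -> (T, usize, usize) {
    let calls_before = ALLOC_CALLS.load(Ordering::Relaxed);
    let bytes_before = ALLOC_BYTES.load(Ordering::Relaxed);
    let out = f();
    let calls = ALLOC_CALLS.load(Ordering::Relaxed) - calls_before;
    let bytes = ALLOC_BYTES.load(Ordering::Relaxed) - bytes_before;
    (out, calls, bytes)
}
```

Wrapping a single request handler in `allocations_during` gives a rough per-request allocation count, which is the kind of number that would have flagged the allocator as a suspect earlier.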