The Moment the Default Runtime Became the Payload

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Veltrix is a treasure-hunt engine that spawns thousands of ephemeral actors, each simulating a player moving through a map, dropping clues, and triggering scoring events. In December 2025 the service ran on Node 20 with the default libuv thread pool size of 4. At 2 500 concurrent players we measured 82 % CPU steal from the cloud provider. p99 latency climbed to 420 ms, and flame graphs showed the hot frames (parser, crypto verify, and buffer slice) all piling up on top of libuv. We expected the bottleneck to be the actor logic because that's where the business code lives. Instead, the runtime itself was the payload.

What We Tried First (And Why It Failed)

We started by splitting actors into separate Node processes using the cluster module. At 3 000 players CPU dropped from 82 % to 68 %, but p99 jumped to 950 ms because we were now serializing messages through the OS pipe buffer. Next we tried worker_threads with a SharedArrayBuffer to pass map tiles. This shaved 18 % off CPU, but GC pause times spiked from 3 ms to 37 ms every 120 ms because the V8 heap kept merging young generations. The real kicker was the error rate: during sharp load spikes we saw 30–40 ERR_IPC_CHANNEL_CLOSED errors per minute because a stray actor could tear down the isolate before responses flushed.

We profiled with clinic.js and 0x. Both tools confirmed the same culprit: Node's single-threaded event loop was the thread that did all the work, and every actor added N new microtasks that fought for the same slot on it. We knew we needed a runtime that could spread actors across real threads instead of queueing them as microtasks on one.

The Architecture Decision

In January 2026 we rewrote the engine in Rust on tokio. We used tokio::spawn for each actor, tokio::task::block_in_place for CPU-heavy map tile calculations, and one tokio::sync::mpsc channel per actor with a capacity of 1024 to keep the scheduler fair. We set tokio's worker_threads to the number of physical cores (16) and limited the per-core load average to 0.8 to avoid context thrashing.
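To make the actor-plus-bounded-mailbox shape concrete, here is a minimal sketch using only the standard library. It is not the Veltrix code: `std::thread` and `std::sync::mpsc::sync_channel` stand in for `tokio::spawn` and `tokio::sync::mpsc`, and the `Event`, `Summary`, and `spawn_actor` names are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical events an actor can receive; the names are illustrative.
#[derive(Debug)]
pub enum Event {
    Move { dx: i32, dy: i32 },
    DropClue,
    Stop,
}

/// Final state reported by an actor when it stops.
#[derive(Debug, PartialEq)]
pub struct Summary {
    pub position: (i32, i32),
    pub clues: u32,
}

/// Spawns one actor with a bounded mailbox. Once the mailbox holds
/// `capacity` unread events, `send` blocks, which is the backpressure
/// a bounded channel provides. (Assumption: an OS thread stands in for
/// a tokio task here.)
pub fn spawn_actor(capacity: usize) -> (SyncSender<Event>, JoinHandle<Summary>) {
    let (tx, rx): (SyncSender<Event>, Receiver<Event>) = sync_channel(capacity);
    let handle = thread::spawn(move || {
        let mut summary = Summary { position: (0, 0), clues: 0 };
        // The loop ends on Stop or when every sender has been dropped.
        for event in rx {
            match event {
                Event::Move { dx, dy } => {
                    summary.position.0 += dx;
                    summary.position.1 += dy;
                }
                Event::DropClue => summary.clues += 1,
                Event::Stop => break,
            }
        }
        summary
    });
    (tx, handle)
}
```

Calling `spawn_actor(1024)` mirrors the mailbox size mentioned above. In a tokio version the blocking `send` would become an awaited one, so a full mailbox would suspend the sending task rather than a whole thread.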

We chose Rust because the compiler forces you to decide where allocations happen. We pinned every actor's map tile cache to an arena allocator that never freed; instead, tiles were reset in batch every 30 seconds. The decision cost us compile time (clippy took 6 minutes on a 32-core build server), but ownership is resolved at compile time rather than through runtime reference counting, so we could shut down an actor in 1.2 µs flat without hitting the allocator lock.
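The arena idea can be sketched in a few lines. This is an illustration under assumptions, not the allocator we shipped: `TileArena` and its methods are hypothetical names, and the arena is a plain byte buffer with a bump cursor.

```rust
/// A fixed-capacity bump arena for tile bytes. Memory is reserved once
/// and reused; `reset` rewinds the cursor instead of freeing anything.
pub struct TileArena {
    buf: Vec<u8>,
    used: usize,
}

impl TileArena {
    /// Reserves the whole arena up front.
    pub fn with_capacity(bytes: usize) -> Self {
        TileArena { buf: vec![0; bytes], used: 0 }
    }

    /// Hands out `len` bytes from the arena, or None when it is full.
    pub fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        let end = self.used.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        let start = self.used;
        self.used = end;
        Some(&mut self.buf[start..end])
    }

    /// Batch reset: every tile is invalidated at once, with no call
    /// into the global allocator.
    pub fn reset(&mut self) {
        self.used = 0;
    }

    /// Bytes handed out since the last reset.
    pub fn used(&self) -> usize {
        self.used
    }

    /// Total bytes reserved by this arena.
    pub fn capacity(&self) -> usize {
        self.buf.len()
    }
}
```

Because each slice returned by `alloc` borrows the arena mutably, the borrow checker rejects any attempt to call `reset` while a tile is still in use. That is the compile-time guarantee the paragraph above leans on.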

What The Numbers Said After

After the rewrite we reran the 3 000-player load test.

CPU usage dropped from 68 % to 34 % (yes, we halved it).
p99 latency fell from 950 ms to 14 ms (roughly 68× lower).
Allocation count measured with tikv-jemalloc's profiler dropped from 2.1 M/s to 124 k/s because we replaced V8's generational GC with the arena reset strategy.
GC pause events vanished; the highest recorded scheduler wait time in tokio-console was 8 µs.
During a 5-minute chaos test that killed random actors every 3 seconds we observed zero channel-closed errors, the counterpart of the ERR_IPC_CHANNEL_CLOSED failures we had seen in Node, because Rust's ownership model ensured the channel stayed open until every message was acked.
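The chaos-test result rests on a property of Rust channels that is easy to check in isolation: messages already queued are still delivered after the sending side disappears. The sketch below uses `std::sync::mpsc` as a stand-in for tokio's channel, and the function name is hypothetical.

```rust
use std::sync::mpsc::sync_channel;

/// Sends `n` messages into a bounded channel, drops the sender (as if
/// the producing actor had been killed), then drains the receiver.
/// Returns how many messages were still delivered.
pub fn delivered_after_sender_killed(n: usize) -> usize {
    let (tx, rx) = sync_channel::<usize>(n);
    for i in 0..n {
        tx.send(i).expect("receiver is alive");
    }
    // The producer goes away before the consumer has read anything.
    drop(tx);
    // `recv` keeps returning buffered messages and only reports
    // disconnection once the buffer is empty.
    let mut delivered = 0;
    while rx.recv().is_ok() {
        delivered += 1;
    }
    delivered
}
```

The receiver owns the buffer, so killing a sender cannot discard messages that are already in flight. tokio's `mpsc::Receiver::recv` documents the same behaviour: it returns `None` only once the channel is closed and the buffer is empty.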

The cost? Dev time. The Rust version took 2.3 engineer-months versus 0.8 for Node. The rustc footprint on the build server required a 64 GB instance, whereas the Node build had fit inside 8 GB. But we were no longer debugging event-loop starvation; we were tuning map algorithms.

What I Would Do Differently

I would rethink the arena allocator. We sized each arena to 256 KB per actor, but measurements from perf record showed that only 12 actors on average used more than 50 % of their cache. Idle arenas were still reserved in RSS, costing 2 MB of memory per unused actor. Next time I'd use a global bump allocator with per-actor epochs so we could reclaim entire arenas in one syscall.
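One possible reading of "global bump allocator with epochs" is sketched below. We have not built this; `GlobalBump`, `Lease`, and every method name are hypothetical. Actors lease only the bytes they need from one shared buffer, and advancing the epoch invalidates every lease at once.

```rust
/// Handle to a region inside the global buffer. It is only valid for
/// the epoch in which it was issued.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Lease {
    pub epoch: u64,
    pub offset: usize,
    pub len: usize,
}

/// Hypothetical shared bump allocator: one buffer, one cursor, one epoch.
pub struct GlobalBump {
    buf: Vec<u8>,
    cursor: usize,
    epoch: u64,
}

impl GlobalBump {
    pub fn new(bytes: usize) -> Self {
        GlobalBump { buf: vec![0; bytes], cursor: 0, epoch: 0 }
    }

    /// Leases `len` bytes to an actor, or None when the buffer is full.
    /// Idle actors that never call this reserve nothing.
    pub fn lease(&mut self, len: usize) -> Option<Lease> {
        let end = self.cursor.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        let lease = Lease { epoch: self.epoch, offset: self.cursor, len };
        self.cursor = end;
        Some(lease)
    }

    /// Resolves a lease to its bytes; stale leases resolve to None.
    pub fn slice_mut(&mut self, lease: &Lease) -> Option<&mut [u8]> {
        if lease.epoch != self.epoch {
            return None;
        }
        let end = lease.offset.checked_add(lease.len)?;
        self.buf.get_mut(lease.offset..end)
    }

    /// Reclaims every lease in O(1) by rewinding the cursor.
    pub fn advance_epoch(&mut self) {
        self.cursor = 0;
        self.epoch += 1;
    }

    /// Bytes currently leased out in this epoch.
    pub fn reserved(&self) -> usize {
        self.cursor
    }
}
```

This sketch only rewinds a cursor and never returns pages to the operating system. A real version would need an explicit release step for that, which is where the single syscall would come in.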

I would also benchmark tokio's work-stealing scheduler earlier. On a 16-core machine the first test showed 40 % load imbalance because actors that spanned multiple tiles skewed work toward the first N cores. We eventually pinned actors to cores with current-thread runtimes (tokio::runtime::Builder::new_current_thread), but we should have measured scheduler fairness when we picked the runtime.
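Measuring fairness early only needs per-worker counters and one formula. The sketch below assumes imbalance is defined as (max - mean) / mean over per-worker task counts. That definition is an assumption chosen for illustration, and `load_imbalance` is a hypothetical helper.

```rust
/// Load imbalance as (max - mean) / mean over per-worker task counts.
/// Returns None for an empty slice or when no work was recorded.
/// (Assumption: this is one common definition, picked for illustration.)
pub fn load_imbalance(per_worker: &[u64]) -> Option<f64> {
    if per_worker.is_empty() {
        return None;
    }
    let total: u64 = per_worker.iter().sum();
    if total == 0 {
        return None;
    }
    let mean = total as f64 / per_worker.len() as f64;
    let max = *per_worker.iter().max()? as f64;
    Some((max - mean) / mean)
}
```

Under this definition, four workers that handled 140, 100, 100, and 60 tasks have a mean of 100 and a maximum of 140, which gives an imbalance of 0.4. A perfectly even spread gives 0.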

Finally, I'd resist the urge to micro-optimize prematurely. The very first Rust commit ran every actor in its own OS thread. That gave us 40 ns context switches but killed scalability because 12 000 threads exhausted the PID limit. Swapping to tokio took two more weeks but turned a fragile system into one that can scale to 50 000 players without blinking.

