The Day Veltrix Blew Up Under Peak Load (And How Rust Saved It)

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The Veltrix operator team thought the bottleneck was network saturation or database indexing, but 64-bit jemalloc profiles showed that 89% of allocations came from three locations: the treasure-hunt event bus, the spatial-index worker, and the reward-distribution microservice, each written in Node.js. Our on-call rotation spent the first 45 minutes of the incident toggling CPU quotas and raising heap limits, only to watch GC pause times climb from 400 ms to 720 ms as the live heap grew to 3.4 GB. The server was perfectly healthy; the runtime was the constraint.

What We Tried First (And Why It Failed)

We tried upgrading Node.js from 18.16 to 20.9, which reduced GC pressure by 11% according to --perf-basic-prof, but the incremental GC still paused for 680 ms every 4 MB of live set growth. Next we moved the treasure-event bus to Rust using Tokio's mpsc channel, thinking that would isolate the problem. The GC pauses disappeared, but the matchmaker service, still on Node.js, started spinning at 100% CPU trying to keep up with the Rust bus's backpressure. We then ported the spatial-index worker to Rust, hoping to shed the GC load entirely. After one night of Rust porting, the worker's latency dropped from 89 ms p99 to 12 ms, but the Node.js reward-distribution service suddenly became the last bottleneck, spiking to 700 MB of heap under 50k parallel reward claims.
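To make the backpressure behavior concrete, here is a minimal sketch of a bounded queue rejecting work once it fills up. It uses the standard library's `sync_channel` so it runs without extra crates; Tokio's `mpsc::channel(n)` is bounded in the same way, except that `send().await` suspends the producer instead of blocking a thread. The `TreasureEvent` type and `offer_events` function are hypothetical names for illustration, not the actual Veltrix code.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Hypothetical event type; the real treasure-hunt payload is not shown in this post.
#[derive(Debug, Clone, PartialEq)]
pub struct TreasureEvent {
    pub id: u64,
}

/// Offers `count` events to a bounded queue of size `capacity` while nothing
/// is consuming from it. Returns (accepted, rejected).
pub fn offer_events(capacity: usize, count: u64) -> (u64, u64) {
    let (tx, rx) = sync_channel::<TreasureEvent>(capacity);
    let mut accepted = 0;
    let mut rejected = 0;
    for id in 0..count {
        match tx.try_send(TreasureEvent { id }) {
            Ok(()) => accepted += 1,
            // The queue is full: this is the signal a producer sees when
            // the consumer cannot keep up.
            Err(TrySendError::Full(_)) => rejected += 1,
            // The receiver is gone, so there is nobody left to send to.
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    // Keep the receiver alive until the loop is done, then release it.
    drop(rx);
    (accepted, rejected)
}
```

Calling `offer_events(4, 10)` accepts the first four events and rejects the other six. A producer that retries in a tight loop on `Full` instead of waiting for capacity burns CPU, which is one plausible way a slow downstream service ends up pinned at 100%.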

The Architecture Decision

The operator team made the call: port the reward-distribution service to Rust as well. We chose Rust because the Tokio runtime's task scheduler gives us bounded latency under backpressure without the GC pauses we saw in Node.js. We kept the existing PostgreSQL schema for persistent treasure claims, but rewrote the service in Rust with Actix-web and sqlx to avoid both GC and ORM overhead. The biggest tradeoff was the cost of satisfying Rust's memory-safety rules: we had to rewrite the reward-claim accumulation logic so that it never mutates a collection while iterating over it, the kind of iterator-invalidation bug that would have crashed the Node.js version under high concurrency.
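As an illustration of what "accumulation without iterator invalidation" can look like, here is a minimal sketch using only the standard library. The `RewardClaim` struct, the function names, and the payout threshold are all assumptions made for this example; the production logic is not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical claim record; field names are assumptions for illustration.
#[derive(Debug, Clone, Copy)]
pub struct RewardClaim {
    pub player_id: u64,
    pub amount: u32,
}

/// Sums claim amounts per player.
pub fn accumulate(claims: &[RewardClaim]) -> HashMap<u64, u64> {
    let mut totals: HashMap<u64, u64> = HashMap::new();
    for claim in claims {
        *totals.entry(claim.player_id).or_insert(0) += u64::from(claim.amount);
    }
    totals
}

/// Removes and returns every player whose total reached `threshold`.
/// `retain` visits and removes entries in a single pass, so no iterator
/// is held across a mutation of the map.
pub fn drain_payable(totals: &mut HashMap<u64, u64>, threshold: u64) -> Vec<(u64, u64)> {
    let mut payable = Vec::new();
    totals.retain(|&player_id, &mut total| {
        if total >= threshold {
            payable.push((player_id, total));
            false // drop this entry from the pending map
        } else {
            true // keep accumulating for this player
        }
    });
    // HashMap iteration order is unspecified, so sort for stable output.
    payable.sort_unstable();
    payable
}
```

The borrow checker rejects the naive version (looping over `totals.iter()` and calling `totals.remove(...)` inside the loop) at compile time, which is why the logic has to be restructured along these lines.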

What The Numbers Said After

After the fourth rewrite cycle, the reward-distribution service stabilized with 451 MB live heap under 60k concurrent reward claims, a 76% reduction from the Node.js version. The flame graph shows 99.8% of time in user-space; the remaining 0.2% is syscalls to PostgreSQL. Latency p99 dropped from 410 ms to 32 ms, and the dreaded PlayerLeftGameBeforeComplete errors vanished entirely. The Rust service now handles 68k QPS with a 95th-percentile latency of 22 ms while using 42% less CPU per request than the Node.js baseline. jemalloc's allocated counter sits at 487 MB, down from 1.8 GB in the Node.js heap under the same load.
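For readers who want to reproduce p95 and p99 figures from their own latency samples, here is a minimal sketch of a nearest-rank percentile. This is one common definition, not necessarily the one our metrics stack uses, and the `percentile` function name is an assumption for this example.

```rust
/// Nearest-rank percentile: the smallest sample such that at least
/// `pct` percent of all samples are less than or equal to it.
/// Returns None for an empty input or a `pct` outside 0..=100.
pub fn percentile(samples_ms: &[u64], pct: f64) -> Option<u64> {
    if samples_ms.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_unstable();
    // 1-based rank; multiply before dividing to limit rounding error.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    // Clamp so that pct = 0 still returns the minimum sample.
    Some(sorted[rank.max(1) - 1])
}
```

With 100 samples holding the values 1 through 100, `percentile(&samples, 99.0)` returns `Some(99)`. The gap between a p95 and a p99 computed this way shows how much of the tail sits in the slowest few percent of requests.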

What I Would Do Differently

I would have ported the reward-distribution service first instead of last. The Node.js reward pipeline was the biggest single allocator, and keeping it in Node.js forced us to keep the event bus and spatial-index workers on Node.js to maintain protocol compatibility. If we had started with the reward service, we could have cut the incident recovery time in half. I'd also add a dead-code lint pass earlier; Rust's compiler caught 37 unreachable branches that the JavaScript linter missed, but we only caught them during integration testing when the reward service started rejecting malformed treasure IDs.
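A minimal sketch of what an earlier lint pass could look like: rustc's built-in `unreachable_patterns` and `unreachable_code` lints are warnings by default, and a `#[deny(...)]` attribute turns them into compile errors. The `ClaimState` enum, the `T-<digits>` treasure ID format, and both function names are hypothetical and only serve the example.

```rust
/// Hypothetical claim lifecycle; the real states are not shown in this post.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ClaimState {
    Pending,
    Paid,
    Rejected,
}

/// Unreachable match arms or statements in this function fail the build.
#[deny(unreachable_patterns, unreachable_code)]
pub fn is_terminal(state: ClaimState) -> bool {
    match state {
        ClaimState::Pending => false,
        ClaimState::Paid | ClaimState::Rejected => true,
        // Adding `_ => false` here would not compile under the deny above:
        // every variant is already covered, so the arm could never run.
    }
}

#[derive(Debug, PartialEq)]
pub enum TreasureIdError {
    MissingPrefix,
    NotNumeric,
}

/// Parses an assumed ID format: "T-" followed by decimal digits that fit in a u64.
pub fn parse_treasure_id(raw: &str) -> Result<u64, TreasureIdError> {
    let digits = raw
        .strip_prefix("T-")
        .ok_or(TreasureIdError::MissingPrefix)?;
    // Reject empty input and signs such as "+5", which `parse` alone would accept.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return Err(TreasureIdError::NotNumeric);
    }
    // Overflowing values are also reported as not numeric in this sketch.
    digits.parse::<u64>().map_err(|_| TreasureIdError::NotNumeric)
}
```

Putting the same lints in a crate-level `#![deny(...)]` attribute applies them everywhere, so stale branches surface at compile time rather than in integration tests.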