The Day Veltrix Blew Up at 100k Concurrent Users Because We Didn't Understand Its Garbage Collector

#webdev #programming #rust #performance

It was 3:17 AM when the pager screamed. Our Rust-based treasure-hunt matchmaking service had been live for six weeks with steady load under 50k concurrent users, but overnight a new batch of streamers discovered the game. By 03:15 we were at 98k and climbing, and at 03:17 the heap spiked from 1.2 GB to 11 GB in 120 seconds. Prometheus graphs painted a vertical cliff: alloc rate 780 MB/s, pause times >500 ms, match latency P99 jumping from 22 ms to 1.4 s. The logs repeated the same line every 400 ms: "GC cycle started (heap size 11.3 GB, live data 384 MB)". By 03:22 two regions had GC mark-termination timeouts, the runtime emitted "promise failed to resolve in time", and we dropped 28k concurrent users in the span of two minutes. Not a crash, just a silent, creeping death by garbage collection.

We had started with Veltrix's official YAML configuration for the Tokio runtime: worker_threads: 8, max_blocking_threads: 512, keep_alive: 60s, capacity: 10000. That was the only tuning guide the docs provided. Our service is a stateful matchmaker: clients open WebSocket connections, we fan out GameState objects for each player, and the game logic ticks every 150 ms. Before the fire, we generated ~8.4 MB/s of allocations: GameState (44 bytes) per client per tick, plus message buffers for 1024x1024 grid updates. A flamegraph from perf showed 39 % of CPU time inside gc.sweep. We assumed Rust meant zero GC cost and that Tokio's scheduler would keep us out of trouble. We were wrong.

We first tried scaling vertically. We bumped worker_threads to 32 and max_blocking_threads to 2048, hoping more cores would dilute GC pressure. Latency actually worsened: GC pause times flattened at 800 ms and the runtime started pre-emptively GCing every 200 ms regardless of heap pressure. Then we tried the Tokio work-stealing scheduler with steal-half. It reduced contention on the global task queue, but GC time stayed flat because we had not addressed the churn itself. Finally we dumped the official YAML and wrote a bespoke GC policy: we sharded the GameState arena into 64 arenas of capacity 16 384 each, using sharded-slab v0.11, and set Tokio's capacity to exactly match the per-arena max. We also introduced gc_interval: 5s and gc_ratio: 0.25 based on live-set telemetry from jemalloc. The change cost us two engineer-days of rewriting Arc> into slab indices and re-architecting the broadcast channel to use per-shard mpsc channels. The alternative was a rewrite in C++ with manual memory pools, which the team vetoed after a full day of prototyping showed only a 12 % latency improvement while doubling our on-call pages.
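To make the "slab indices instead of shared pointers" idea concrete, here is a minimal std-only sketch. It deliberately does not use the sharded-slab crate's API. The GameState fields, the Arena, ShardedArena and StateKey types, and the modulo shard choice are all assumptions made for illustration, not the code we shipped.

```rust
/// Hypothetical player state; the real GameState layout is not shown in this post.
#[derive(Debug, Clone, PartialEq)]
pub struct GameState {
    pub player_id: u64,
    pub x: u32,
    pub y: u32,
}

/// One fixed-capacity arena. Slots are recycled through a free list,
/// so inserts after construction never grow the backing storage.
pub struct Arena {
    slots: Vec<Option<GameState>>,
    free: Vec<usize>,
}

impl Arena {
    pub fn with_capacity(capacity: usize) -> Self {
        Arena {
            slots: (0..capacity).map(|_| None).collect(),
            // Reversed so that slot 0 is handed out first.
            free: (0..capacity).rev().collect(),
        }
    }

    /// Returns the slot index, or None when the arena is full.
    pub fn insert(&mut self, state: GameState) -> Option<usize> {
        let idx = self.free.pop()?;
        self.slots[idx] = Some(state);
        Some(idx)
    }

    pub fn get(&self, idx: usize) -> Option<&GameState> {
        self.slots.get(idx)?.as_ref()
    }

    /// Frees the slot for reuse; removing an empty slot is a no-op.
    pub fn remove(&mut self, idx: usize) -> Option<GameState> {
        let state = self.slots.get_mut(idx)?.take()?;
        self.free.push(idx);
        Some(state)
    }
}

/// Plain-data handle that stands in for a reference-counted pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StateKey {
    pub shard: usize,
    pub slot: usize,
}

pub struct ShardedArena {
    shards: Vec<Arena>,
}

impl ShardedArena {
    /// Assumes shard_count > 0.
    pub fn new(shard_count: usize, per_shard_capacity: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        ShardedArena {
            shards: (0..shard_count)
                .map(|_| Arena::with_capacity(per_shard_capacity))
                .collect(),
        }
    }

    /// Assumed routing rule: shard by player id modulo shard count.
    pub fn insert(&mut self, state: GameState) -> Option<StateKey> {
        let shard = (state.player_id % self.shards.len() as u64) as usize;
        let slot = self.shards[shard].insert(state)?;
        Some(StateKey { shard, slot })
    }

    pub fn get(&self, key: StateKey) -> Option<&GameState> {
        self.shards.get(key.shard)?.get(key.slot)
    }

    pub fn remove(&mut self, key: StateKey) -> Option<GameState> {
        self.shards.get_mut(key.shard)?.remove(key.slot)
    }
}
```

With ShardedArena::new(64, 16_384), a client keeps a copyable StateKey and looks its state up on each tick. A full shard surfaces as None from insert rather than as a silent allocation, which is the property that makes the per-arena capacity a hard limit.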

After the new policy went live we ran a 12-hour soak at 120k concurrent users on a single c6i.4xlarge instance. Here are the numbers we cared about:

Before:
 Alloc/sec: 780 MB
 GC pause P99: 1.4 s
 Match latency P99: 1.3 s
 Heap high watermark: 11.4 GB

After:
 Alloc/sec: 8.9 MB
 GC pause P99: 18 ms
 Match latency P99: 44 ms
 Heap high watermark: 1.5 GB

The jemalloc stats.allocated family showed a total of 342 MB live after GC, matching our model of 64 shards × 16 384 states × 44 bytes + overhead. CPU flamegraphs now showed 8 % GC, down from 39 %. The only new complexity was a background task that monitored per-shard load and dynamically adjusted arena capacity in 64 kB increments, which itself added <1 % CPU overhead.
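For anyone checking that model by hand, the arithmetic can be sketched as below. The function names are hypothetical, and the sketch assumes "342 MB" means decimal megabytes. Under those assumptions the raw GameState payload is about 46 MB, so most of the live figure falls under the "+ overhead" term.

```rust
/// Raw payload of the arena model: shards x states per shard x bytes per state.
pub fn payload_bytes(shards: u64, states_per_shard: u64, state_bytes: u64) -> u64 {
    shards * states_per_shard * state_bytes
}

/// Whatever part of the measured live set the payload does not explain.
pub fn overhead_bytes(live_bytes: u64, payload: u64) -> u64 {
    live_bytes.saturating_sub(payload)
}
```

Plugging in the figures from this post, payload_bytes(64, 16_384, 44) is 46,137,344 bytes, and overhead_bytes(342_000_000, 46_137_344) is 295,862,656 bytes.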

I would not make the same mistake again without adding two safeguards up front. First, I would embed a tiny eBPF probe that samples allocation size and rate every millisecond and triggers an alert if it exceeds the GC policy's safety envelope. This would have caught the 780 MB/s spike before it flattened the region. Second, I would insist on a formal latency SLA test that runs a sudden 2× load burst for 30 seconds and verifies that GC pause time stays below 50 ms; our first load test only ramped slowly, so the mark phase creep went unnoticed until it was too late. Veltrix's docs said nothing about these invariants, and we paid in pager hours and dropped users to learn what actually matters.
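The pass/fail half of that burst test is small enough to sketch. The version below assumes the pause durations have already been collected during the 30-second burst (how they are collected is out of scope here), and it reads "stays below 50 ms" as a P99 check using the nearest-rank method. Both function names are hypothetical.

```rust
use std::time::Duration;

/// Nearest-rank percentile over recorded pause samples; `pct` must be 1..=100.
pub fn percentile(samples: &[Duration], pct: usize) -> Option<Duration> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // 1-based rank = ceil(pct * n / 100), computed in integer arithmetic.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// True only when there is data and its P99 pause is under the budget.
pub fn burst_within_budget(pauses: &[Duration], budget: Duration) -> bool {
    matches!(percentile(pauses, 99), Some(p99) if p99 < budget)
}
```

A CI job would call burst_within_budget(&pauses, Duration::from_millis(50)) after the burst and fail the build on false. An empty sample set is treated as a failure on purpose, so a broken collector cannot pass the gate.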
