The Day the JVM Died Under 200k QPS and Lived to Tell the Story

#webdev #programming #rust #performance

The Problem We Were Actually Solving

At Veltrix, our treasure hunt engine ran on a JVM stack: OpenJDK 17, GraalVM Native Image, and a Kotlin coroutine pipeline. By month six the pipeline was falling apart at 180k QPS. The problem was not a CPU cliff but a GC pause cliff. Every 250 ms a ZGC cycle would spike to 18 ms, and half the player actions timed out. The heap was 16 GB, but NewSpace had a 300 ms evacuation window at that load. We'd tuned everything: -XX:MaxGCPauseMillis=20, transparent huge pages, isolated cores. Still, the coroutine scheduler would block during safepoints, and clients would see 503s on /hunt/next.

What We Tried First (And Why It Failed)

First, we threw money at the JVM. We moved to Azul Zulu Prime with its 4 ms pauseless guarantee. That cut safepoint time to 4 ms, but the allocation rate was 14 GB/s and the nursery still evacuated before it could finish. We tried Shenandoah on Amazon Corretto: same pause cliff, just shifted to final marking. Then we put the coroutines on Project Loom virtual threads. Latency per request dropped, but the allocation rate jumped to 17 GB/s because every virtual thread pinned a stack. Flame graphs showed 32% of time in Unsafe.park and 19% in java.util.concurrent.ForkJoinPool stalls. The JVM scheduler was the constraint, not the language.

The Architecture Decision

We rewrote the hot hunt path in Rust, specifically Rust 1.75 on stable with tokio 1.35 and mimalloc 1.7. The coroutine pipeline became an async stream, with tokio::task::spawn_blocking used only for the blocking call to Redis. We used Arc<Mutex<_>> only for the shared player inventory; everything else was passed by &mut or as Arc<_> with AtomicUsize. We kept the JVM for lobby state and matchmaking, so the two runtimes talked REST over local TCP. The cut-over cost 4 engineer-weeks. The worst week was the one in which the Rust side lost 10% of events to a zero-copy buffer overflow that our fuzz tests had missed.
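To make the sharing rule above concrete, here is a minimal sketch of the pattern: one Arc<Mutex<_>> around the inventory and a lock-free AtomicUsize for everything that is just a counter. It is an illustration under stated assumptions, not our production code. The `Inventory` type, the `run_workers` function, and the use of plain std threads in place of tokio tasks (to keep the example free of dependencies) are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical inventory shape: player id -> item count.
type Inventory = HashMap<u64, u32>;

/// Spawns `workers` threads. Each one grants a single item to every player
/// id in `0..players` and bumps a shared event counter.
/// Returns the total number of events and a snapshot of the inventory.
pub fn run_workers(workers: usize, players: u64) -> (usize, Inventory) {
    // The only mutex in the sketch guards the shared inventory.
    let inventory: Arc<Mutex<Inventory>> = Arc::new(Mutex::new(HashMap::new()));
    // Plain counters use an atomic instead of a lock.
    let events = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let inventory = Arc::clone(&inventory);
            let events = Arc::clone(&events);
            thread::spawn(move || {
                for player in 0..players {
                    {
                        // Hold the lock only for the map update itself.
                        let mut inv = inventory.lock().unwrap();
                        *inv.entry(player).or_insert(0) += 1;
                    }
                    events.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    // Joining the threads makes all of their writes visible here.
    for handle in handles {
        handle.join().unwrap();
    }

    let total = events.load(Ordering::Relaxed);
    let snapshot = inventory.lock().unwrap().clone();
    (total, snapshot)
}
```

Calling `run_workers(4, 10)` should yield 40 events and an inventory with 10 players holding 4 items each. The point of the design is that the mutex is held only across the map update, so contention stays limited to the one structure that genuinely needs exclusive access.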

What The Numbers Said After

After the change, our p99 latency dropped from 180 ms to 34 ms at 200k QPS, and the GC pauses vanished. The JVM side is still capped at 180k QPS simply because it cannot allocate faster; the Rust side now does 240k QPS on the same 8-core VM. Allocations per request fell from 32 KB to 4.7 KB. Over 24 hours RSS went from 16 GB to 4.2 GB, and the Rust process never exceeded 500 MB. Our memory profilers (heaptrack on the JVM side, massif on Rust) showed the JVM nursery lived in 400 MB before major GC, while the Rust side had one 16 MB bump during initialization and then stayed flat. The flame graph from async-profiler on the JVM side showed 60% of time in ZTask::run and 25% in ZRelocate. On the Rust side, tokio::runtime::scheduler::multi_thread::worker took 1.3% of CPU and the rest was pure compute.

What I Would Do Differently

Next time I'd avoid Arc<Mutex<_>> for inventory entirely and instead use an AtomicI64 with epoch-based reclamation. We had one incident where two Rust threads raced on an Arc clone and the inventory counter overflowed. We also should have written the Rust path first and then ported the lobby. Our initial belief that the JVM was fine for non-hot paths was wrong; the allocation tsunami from Redis pipelining still leaks into the lobby GC. Finally, we should have benchmarked with criterion under 100k QPS before pushing to staging; our first prod incident at 50k QPS revealed a spin lock in the tokio scheduler that only triggered under low load and high contention.
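As a rough illustration of the AtomicI64 idea, the sketch below keeps a per-item count in an atomic and rejects any update that would overflow or drive the count negative. It covers only the counter; epoch-based reclamation is not shown, since a bare integer has nothing to reclaim. The `ItemCount` type and its `try_add` method are hypothetical names, and the orderings are a conservative assumption rather than something we measured.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical lock-free inventory counter for a single item type.
pub struct ItemCount(AtomicI64);

impl ItemCount {
    pub fn new(initial: i64) -> Self {
        Self(AtomicI64::new(initial))
    }

    /// Atomically adds `delta` (which may be negative).
    /// Returns Ok(new_value) on success, or Err(current_value) if the update
    /// would overflow i64 or push the count below zero. On Err the stored
    /// value is left unchanged.
    pub fn try_add(&self, delta: i64) -> Result<i64, i64> {
        self.0
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
                // checked_add yields None on overflow; the filter rejects
                // negative results. None aborts the update.
                current.checked_add(delta).filter(|next| *next >= 0)
            })
            // fetch_update returns the previous value on success; the sum
            // was already verified not to overflow inside the closure.
            .map(|previous| previous + delta)
    }

    pub fn get(&self) -> i64 {
        self.0.load(Ordering::Acquire)
    }
}
```

With a counter starting at 5, `try_add(3)` returns `Ok(8)`, while `try_add(-10)` and `try_add(i64::MAX)` both return `Err(8)` and leave the count untouched. `fetch_update` retries its compare-and-swap loop internally, so concurrent callers never observe a half-applied update.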
