The Moment the Runtime Became Your Enemy

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our Treasure Hunt Engine indexes 1.2 TB of JSON event logs from Veltrix operators, then answers sub-second queries like "give me all log lines where error_code = E499 between 2026-05-01T00:00 and 2026-05-07T23:59". It's a classic inverted-index workload. The first version was a Spring Boot monolith running on OpenJDK 21 with G1GC and embedded Lucene, on 24 vCPU and 64 GB RAM. We picked it because it was the default stack in 2024.
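For readers who have not built one, here is a minimal sketch of that query shape in Rust. It is not the engine's code: `InvertedIndex`, `Posting`, and the use of Unix-second timestamps are hypothetical names and choices made for illustration. Each (field, value) pair maps to a posting list sorted by timestamp, so a time-range query is two binary searches.

```rust
use std::collections::HashMap;

/// One indexed log line: event time (Unix seconds, an assumption) and a line id.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Posting {
    pub ts: u64,
    pub line_id: u32,
}

/// Hypothetical inverted index: (field, value) -> postings sorted by timestamp.
#[derive(Default)]
pub struct InvertedIndex {
    postings: HashMap<(String, String), Vec<Posting>>,
}

impl InvertedIndex {
    /// Append one posting; order does not matter until `finish` is called.
    pub fn insert(&mut self, field: &str, value: &str, ts: u64, line_id: u32) {
        self.postings
            .entry((field.to_string(), value.to_string()))
            .or_default()
            .push(Posting { ts, line_id });
    }

    /// Sort every posting list once after ingestion so queries can binary-search.
    pub fn finish(&mut self) {
        for list in self.postings.values_mut() {
            list.sort_unstable();
        }
    }

    /// Line ids where `field == value` and `from <= ts <= to`.
    pub fn query(&self, field: &str, value: &str, from: u64, to: u64) -> Vec<u32> {
        if from > to {
            return Vec::new();
        }
        let key = (field.to_string(), value.to_string());
        match self.postings.get(&key) {
            None => Vec::new(),
            Some(list) => {
                // First posting at or after `from`, and first posting after `to`.
                let lo = list.partition_point(|p| p.ts < from);
                let hi = list.partition_point(|p| p.ts <= to);
                list[lo..hi].iter().map(|p| p.line_id).collect()
            }
        }
    }
}
```

A caller would insert postings during ingestion, call `finish()` once, then run `query("error_code", "E499", from, to)` with the two timestamps converted to seconds.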

After three months of steady ingestion, the p99 latency crept past 400 ms. Then 800 ms. Then we hit the 1-second cliff. I ran jstack five times within a minute and collected 48 thread dumps. The histogram from async-profiler showed 76 % of samples pinned inside sun.misc.Unsafe.park. Not in Lucene, not in Spring, but inside the OS scheduler, waiting for a safepoint.

The latency wasn't GC. It wasn't network. It was the JVM runtime itself forcing every safepoint operation to synchronize all 24 worker threads so it could stop the world and count handles. At that scale, stopping the world for 30 ms every 10 seconds made the p99 explode.
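It helps to put numbers on that last sentence. The sketch below is plain arithmetic in Rust, not code from the engine. The 30 ms pause, the 10-second period, and the 1.2 M QPS load come from this post; the service capacity passed to `drain_ms` is an assumed parameter, included only to show how the backlog from one pause stretches the damage.

```rust
/// Fraction of wall-clock time during which every worker is stopped.
pub fn stalled_fraction(pause_ms: f64, period_ms: f64) -> f64 {
    pause_ms / period_ms
}

/// Requests that arrive (and queue) during a single pause.
pub fn arrivals_during_pause(qps: f64, pause_ms: f64) -> f64 {
    qps * pause_ms / 1000.0
}

/// Milliseconds needed to drain the queue built up during one pause.
/// `capacity_qps` is an ASSUMED maximum service rate, not a figure from the post.
/// Returns None when capacity does not exceed load, because the queue never drains.
pub fn drain_ms(qps: f64, capacity_qps: f64, pause_ms: f64) -> Option<f64> {
    if capacity_qps <= qps {
        return None;
    }
    // backlog = qps * pause_ms / 1000 requests, drained at (capacity - qps) per second
    Some(qps * pause_ms / (capacity_qps - qps))
}
```

With the post's figures, the pause alone covers 0.3 % of wall time and 36,000 requests arrive while the world is stopped. If capacity were, say, 1.3 M QPS (a made-up figure), the backlog would take another 360 ms to drain, so about 3.9 % of every 10-second window would see inflated latency. That is how a pause that looks too rare to matter on its own can still reach the p99.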

What We Tried First (And Why It Failed)

We tried bumping heap to 32 GB. That only increased safepoint stall duration.

We tried ZGC. It cut GC pauses to 2 ms, but the safepoint stall remained because ZGC still brings every thread to a global safepoint for its short pause phases, and the JVM does the same for non-GC VM operations.

We tried switching to Azul Zulu Prime. It removed safepoints for some heap operations, but the Treasure Hunt Engine uses live reflection to deserialize nested JSON. Reflection triggers class-unload safepoints anyway, so we gained nothing.

We tried reducing worker threads from 24 to 8. Latency improved, but throughput dropped from 1.2 M QPS to 800 K QPS. The operator console started timing out.

The Architecture Decision

At 05:12 I signed off on a rewrite in Rust with Tokio, using glidesort to sort the inverted index and simd-json for zero-copy parsing. The decision wasn't about speed. It was about control.

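No safepoints: Rust has no garbage collector, and the borrow checker let us share the index as immutable slices across threads, so there is nothing to bring to a safepoint.
No stop-the-world: Tokio has no global pause. Tasks yield cooperatively at .await points, and borrow conflicts are rejected at compile time, so nothing at runtime can halt all workers at once. We also kept explicit tokio::task::yield_now() calls out of the hot path.
No class unloading: Rust is compiled ahead of time, so every type is resolved at compile time and there are no classes to load or unload. We built with cargo rustc -- -C prefer-dynamic=no to keep dependencies statically linked.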
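The "immutable slices" point can be sketched with the standard library alone. This is an illustration under assumptions, not the engine's code: `count_hits` is a hypothetical helper, and it uses `std::thread::scope` rather than Tokio so the example stays dependency-free. Several threads read the same sorted slice through shared references, with no locks, no reference counting, and no collector involved.

```rust
use std::thread;

/// Count how many ids in `queries` occur in the sorted, immutable `postings`
/// slice, splitting the work across up to `workers` scoped threads.
pub fn count_hits(postings: &[u32], queries: &[u32], workers: usize) -> usize {
    let workers = workers.max(1);
    // Ceiling division; at least 1 so `chunks` never receives a zero size.
    let chunk = ((queries.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        // Spawn every worker first, then join, so the chunks run concurrently.
        let handles: Vec<_> = queries
            .chunks(chunk)
            .map(|part| {
                // Each thread only borrows `postings`; the compiler proves
                // nobody can mutate it while the scope is alive.
                s.spawn(move || {
                    part.iter()
                        .filter(|&&id| postings.binary_search(&id).is_ok())
                        .count()
                })
            })
            .collect();

        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

If any thread tried to take a mutable reference to `postings` inside the scope, the program would fail to compile, which is the guarantee the list above relies on.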

The trade-off was three engineering weeks. We lost 500 MB of memory because jemalloc preallocates arenas, but we gained deterministic latency under load.

What The Numbers Said After

We redeployed the Rust engine on the same 24 vCPU, 64 GB node. The p99 latency dropped from 1.02 s to 89 ms. The 99.9th percentile went from 2.8 s to 180 ms.

$ perf stat -e cycles,instructions,cache-misses ./treasure-hunt --index /data/events --query 'error_code=E499'
 Performance counter stats for './treasure-hunt --index /data/events --query error_code=E499':

 4,123,847,201 cycles:u
 12,847,291,023 instructions:u
 1,247,981 cache-misses:u

 Latency histogram: (24 workers, 1.4 M QPS sustained)
 p50: 24 ms
 p95: 58 ms
 p99: 89 ms
 p99.9: 180 ms
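
For anyone reproducing numbers like these, percentiles are easy to get subtly wrong. The sketch below shows one common definition (nearest rank) computed over raw latency samples. It is an assumption about method, not a description of how the histogram above was produced, and `percentile` is a hypothetical helper. The percentile is passed in per-mille (990 for p99, 999 for p99.9) so the rank is computed with integer arithmetic only.

```rust
/// Nearest-rank percentile over raw latency samples in milliseconds.
/// `per_mille` is the percentile times ten: 500 = p50, 990 = p99, 999 = p99.9.
/// Returns None for empty input or a percentile outside 1..=1000.
pub fn percentile(samples_ms: &[u64], per_mille: u64) -> Option<u64> {
    if samples_ms.is_empty() || per_mille == 0 || per_mille > 1000 {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_unstable();

    let n = sorted.len() as u64;
    // 1-based rank = ceil(per_mille * n / 1000), done in integers to avoid
    // floating-point rounding at the boundaries.
    let rank = (per_mille * n + 999) / 1000;
    Some(sorted[(rank - 1) as usize])
}
```

Sorting a copy on every call is fine for a one-off report; a production path would more likely keep a streaming histogram instead.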

The memory allocator told its own story. The JVM produced 2.1 GB of nursery objects per second. The Rust engine produced 142 MB of allocations per second with no nursery, and 90 % of those were freed within the same async task.

We also ran flamegraph on the Rust binary. The hot stack was inside glidesort::quicksort, not in Tokio or the kernel. That meant the runtime was no longer the constraint.

What I Would Do Differently

I would not have trusted the default JVM stack for anything above 500 K sustained QPS. If the runtime forces you to stop the world to count handles, it has already lost.

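I would have measured safepoint stall duration earlier. The async-profiler flag -e cpu,lock shows safepoint time in red. I missed it for weeks because I filtered only on gc,cpu. The JVM's own -Xlog:safepoint logging also reports, per safepoint, how long threads took to reach it and how long it lasted.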

I would have chosen tokio-uring if we had to do file I/O, but the Rust engine already hit the disk fast enough that the file descriptor cache became the new bottleneck. We ended up pre-mapping index shards with mmap and cutting the syscall count by 68 %.

Finally, I would not have let the product manager decide on the default JVM config. The Veltrix operator console assumes infinite heap, but the runtime is the hard cap. The moment the default config's safepoint behavior meets real traffic, the engine becomes a stopwatch, not a search system.