The Moment the JVM Tuning Knob Broke Our Treasure Hunt Engine

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our treasure hunt engine at Veltrix was a real-time geospatial matching service that processed 50 million location events daily. By month six it handled bursts of 2M concurrent users during events like Black Friday flash sales. YourKit profiles showed a 15-second GC pause every 47 minutes, coinciding with the game's daily reward drop. The GC logs screamed OldGen exhaustion. We had tuned G1GC with -Xms8G -Xmx8G -XX:MaxGCPauseMillis=100, but the pause times weren't improving. The team argued over whether we needed Azul Zing or just better partitioning. I suspected the language runtime was the bottleneck, not the GC algorithm.

What We Tried First (And Why It Failed)

We doubled the heap to 16G and raised MaxGCPauseMillis to 200. That made the pauses less frequent but longer: 22-second GC pauses started appearing every 70 minutes. The JVM's safepoint logs revealed safepoint sync times of 32ms. The allocation rate hit 7.2 MB per second during peak, and despite off-heap caching with Chronicle Map, the Eden space kept filling up under object churn from our spatial index rebalancing.

We tried Azul Zing. It cut safepoint time to 8ms, but introduced long JIT warmup pauses during traffic surges. The cost per instance jumped 40% on our Kubernetes nodes, and we still leaked direct buffers at 2.3 MB/s due to improper Netty arena sizing. At this point I pulled flame graphs using async-profiler and saw the real culprit: the JVM's biased locking, and specifically bias revocation, was consuming 18% of CPU during index splits. The spatial index used a red-black tree with fine-grained locks, and each tree rotation triggered revocation storms.

The Architecture Decision

I rewrote the core index in Rust with jemalloc as the allocator and no runtime GC. The spatial index became a lock-free k-d tree using crossbeam's epoch-based reclamation. I benchmarked it against the JVM tree using criterion.rs and saw 3.4x lower median latency and 6.8x lower 99th percentile latency at 2M QPS. The binary size dropped from 47 MB to 7 MB, and RSS stayed flat under load. We deployed it behind a thin Go shim that handled TLS and load balancing.
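For readers who have not met a k-d tree, here is a minimal sketch of the data structure itself. It is not the production index: it is single-threaded, two-dimensional, built once from a slice, and uses only the standard library, so it has none of the lock-free machinery or crossbeam epoch reclamation described above. The names `Point`, `KdNode`, `build`, and `nearest` are made up for illustration.

```rust
// Minimal single-threaded 2-D k-d tree: build once, then answer
// nearest-neighbour queries. All names here are illustrative assumptions.

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Point {
    pub x: f64,
    pub y: f64,
}

impl Point {
    fn coord(&self, axis: usize) -> f64 {
        if axis == 0 { self.x } else { self.y }
    }

    // Squared Euclidean distance; enough for comparing candidates.
    fn dist2(&self, other: &Point) -> f64 {
        let (dx, dy) = (self.x - other.x, self.y - other.y);
        dx * dx + dy * dy
    }
}

#[derive(Debug)]
pub struct KdNode {
    point: Point,
    left: Option<Box<KdNode>>,
    right: Option<Box<KdNode>>,
}

/// Build a balanced tree by splitting on the median, alternating axes per level.
pub fn build(points: &mut [Point], depth: usize) -> Option<Box<KdNode>> {
    if points.is_empty() {
        return None;
    }
    let axis = depth % 2;
    points.sort_by(|a, b| a.coord(axis).total_cmp(&b.coord(axis)));
    let mid = points.len() / 2;
    let (left, rest) = points.split_at_mut(mid);
    let (median, right) = rest.split_first_mut().expect("rest is non-empty");
    Some(Box::new(KdNode {
        point: *median,
        left: build(left, depth + 1),
        right: build(right, depth + 1),
    }))
}

/// Return the stored point closest to `target`, or `None` for an empty tree.
pub fn nearest(root: &Option<Box<KdNode>>, target: &Point) -> Option<Point> {
    let mut best: Option<(f64, Point)> = None;
    search(root, target, 0, &mut best);
    best.map(|(_, p)| p)
}

fn search(
    node: &Option<Box<KdNode>>,
    target: &Point,
    depth: usize,
    best: &mut Option<(f64, Point)>,
) {
    let Some(n) = node else { return };
    let d = n.point.dist2(target);
    if best.map_or(true, |(bd, _)| d < bd) {
        *best = Some((d, n.point));
    }
    let axis = depth % 2;
    let delta = target.coord(axis) - n.point.coord(axis);
    let (near, far) = if delta < 0.0 {
        (&n.left, &n.right)
    } else {
        (&n.right, &n.left)
    };
    search(near, target, depth + 1, best);
    // Only cross the splitting plane if a closer point could lie beyond it.
    if best.map_or(true, |(bd, _)| delta * delta < bd) {
        search(far, target, depth + 1, best);
    }
}
```

Callers build the tree with `build(&mut points, 0)` and query it with `nearest(&tree, &target)`. The pruning check in `search` lets a query skip whole subtrees, which is the property that makes the structure attractive for geospatial matching.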

The tradeoff was time-to-market. It took three engineers six weeks to port the index and validate correctness under property-based tests with quickcheck. We lost feature velocity while iterating on the tree invariants, but gained predictable tail latency. I used perf to record cache misses: the Rust version had 0.4 misses per instruction versus 1.8 for the JVM tree under the same load.
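The idea behind the property-based tests can be sketched without the quickcheck crate: generate random inputs, run the index and a trusted brute-force oracle on the same data, and report the first input where they disagree. The sketch below is a hand-rolled stand-in, not the project's test suite. To stay short it checks a one-dimensional sorted-slice index instead of the k-d tree, and every name in it (`XorShift`, `indexed_min_dist`, `find_counterexample`) is hypothetical.

```rust
// Hand-rolled stand-in for a property-based test: random inputs, a trusted
// brute-force oracle, and the first counterexample returned on failure.

/// Small xorshift64 generator so the sketch needs no external crates.
pub struct XorShift(u64);

impl XorShift {
    pub fn new(seed: u64) -> Self {
        XorShift(seed.max(1)) // the state must never be zero
    }

    pub fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Pseudo-random value in [-1000, 1000].
    pub fn coord(&mut self) -> i64 {
        (self.next() % 2001) as i64 - 1000
    }
}

/// Oracle: linear scan for the smallest distance to `target`.
pub fn brute_force_min_dist(values: &[i64], target: i64) -> Option<i64> {
    values.iter().map(|v| (v - target).abs()).min()
}

/// Implementation under test: binary search over a sorted slice.
pub fn indexed_min_dist(sorted: &[i64], target: i64) -> Option<i64> {
    let pos = sorted.partition_point(|&v| v < target);
    let after = sorted.get(pos).map(|v| (v - target).abs());
    let before = pos.checked_sub(1).map(|i| (sorted[i] - target).abs());
    match (before, after) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}

/// Run `cases` random trials; return the first input where the two disagree.
pub fn find_counterexample(cases: usize, seed: u64) -> Option<(Vec<i64>, i64)> {
    let mut rng = XorShift::new(seed);
    for _ in 0..cases {
        let len = (rng.next() % 20) as usize; // includes the empty case
        let mut values: Vec<i64> = (0..len).map(|_| rng.coord()).collect();
        let target = rng.coord();
        values.sort_unstable();
        if indexed_min_dist(&values, target) != brute_force_min_dist(&values, target) {
            return Some((values, target));
        }
    }
    None
}
```

Comparing distances rather than the returned elements avoids false failures when two elements are equally close. A real property-testing library adds input shrinking on top of this loop, which is most of the reason to use one.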

What The Numbers Said After

After two weeks in production with the Rust index, the P99 latency at 500k QPS dropped from 210ms to 42ms. GC pauses disappeared entirely because the tree owned its memory. The Kubernetes node count dropped from 12 to 8 under the same load, saving $18k/month in compute. Error rate went from 0.032% to 0.0018%.
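To put those figures on a common scale, here is a small back-of-the-envelope sketch. The helper names are made up, and the per-node figure assumes the saving scales linearly with node count, which the numbers above do not confirm.

```rust
// Back-of-the-envelope checks on the production figures quoted above.
// Helper names are illustrative; nothing here comes from the real service.

/// How many times smaller `after` is than `before`.
pub fn improvement_factor(before: f64, after: f64) -> f64 {
    before / after
}

/// Percentage reduction going from `before` to `after`.
pub fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Implied monthly saving per removed node.
/// Assumption: the total saving is spread evenly across the removed nodes.
pub fn saving_per_node(total_saving: f64, nodes_before: u32, nodes_after: u32) -> f64 {
    total_saving / f64::from(nodes_before - nodes_after)
}
```

Plugging in the numbers: `improvement_factor(210.0, 42.0)` gives a 5x lower P99, `percent_reduction(12.0, 8.0)` gives a third fewer nodes, `saving_per_node(18_000.0, 12, 8)` gives $4,500 per node per month, and `improvement_factor(0.032, 0.0018)` gives an error rate roughly 17.8x lower.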

But the Go shim became the new bottleneck. It allocated 1.2 MB per second per connection due to its default connection pool sizing. We switched to a Rust-based proxy using hyper and tokio, cutting allocations to 180 KB/s per connection.

What I Would Do Differently

We should have profiled the JVM's biased locking earlier. The biased lock revocation events were visible in async-profiler's lock contention view, but we dismissed them as noise until we saw the safepoint logs.

Also, we underestimated the cost of logging. The Rust service initially wrote 8 GB/day of debug logs to stdout, which caused Docker to throttle I/O and added 40ms latency spikes. We switched to tracing with OpenTelemetry and reduced log volume by 94%.

Finally, we should have started with a microservice boundary between the index and the rest of the system. The Rust rewrite blurred those boundaries, making future language migrations harder. A clean service boundary would have let us test the index in isolation before swapping it into production.
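Inside a single Rust codebase, one way to express that kind of boundary is a trait that callers depend on instead of the concrete index. The sketch below is hypothetical: `SpatialIndex`, `Location`, `LinearScanIndex`, and `ids_in_box` are invented names, and a real service boundary would also involve a wire protocol that is not shown here.

```rust
// Hypothetical boundary: callers depend on the trait, not the concrete index,
// so implementations can be swapped or tested in isolation.

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Location {
    pub id: u64,
    pub lat: f64,
    pub lon: f64,
}

pub trait SpatialIndex {
    fn insert(&mut self, loc: Location);

    /// All stored locations inside the inclusive box from `min` to `max`,
    /// where each corner is a (lat, lon) pair.
    fn within(&self, min: (f64, f64), max: (f64, f64)) -> Vec<Location>;
}

/// Reference implementation: a linear scan, slow but easy to trust.
#[derive(Default)]
pub struct LinearScanIndex {
    items: Vec<Location>,
}

impl SpatialIndex for LinearScanIndex {
    fn insert(&mut self, loc: Location) {
        self.items.push(loc);
    }

    fn within(&self, min: (f64, f64), max: (f64, f64)) -> Vec<Location> {
        self.items
            .iter()
            .copied()
            .filter(|l| l.lat >= min.0 && l.lat <= max.0 && l.lon >= min.1 && l.lon <= max.1)
            .collect()
    }
}

/// Code on the other side of the boundary only ever sees `dyn SpatialIndex`.
pub fn ids_in_box(index: &dyn SpatialIndex, min: (f64, f64), max: (f64, f64)) -> Vec<u64> {
    index.within(min, max).iter().map(|l| l.id).collect()
}
```

With a boundary like this, the slow reference implementation doubles as a test oracle for a faster index, and the callers do not change when the implementation behind the trait does.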