The Moment Our JVM Tuning Hit the Language Wall

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our treasure-hunt service had two hot paths: the matchmaker, which finds compatible players in <5 ms, and the cache warmer, which pre-fetches game assets 50 ms ahead of a match start. Both were written in Kotlin 1.9 on OpenJDK 21, running inside containers on Kubernetes. Everything scaled linearly until about 8 000 concurrent matches, then the JVM would pause for 300–500 ms every 60 s. The GC log showed a single cycle consuming 18 s CPU time and 4 GB heap.

We ran jstat -gc at 100 ms intervals and saw that the allocation rate was 850 MB/s during peak, but the survivor spaces were always 99 % full, which meant G1 was evacuating 9 GB/s of objects that should have been tenured. Adding -Xmx8G only postponed the inevitable. It was clear the JVM couldn't keep up with the allocation pressure from Kotlin's eager string interning and kotlinx.serialization's object graphs.

What We Tried First (And Why It Failed)

The first reflex is always tuning. We added -XX:MaxGCPauseMillis=50, -XX:G1NewSizePercent=30, -XX:G1MaxNewSizePercent=60, disabled class unloading with -XX:-ClassUnloading, and even switched to ZGC with -XX:+UseZGC. Latency at 10 000 matches improved to 280 ms p99, but the tail was now dominated by ZGC's 2 ms safepoints every 20 ms. The safepoint sync time showed up as 5 % CPU in flame graphs, and the p95 of match latency jumped from 45 ms to 65 ms. At that point the GC was no longer the bottleneck; the safepoints were.

We tried tuning monitor deflation and toggling biased locking with -XX:-UseBiasedLocking (deprecated and off by default since JDK 15, so it was unlikely to matter on OpenJDK 21), but the JVM still spent 3 % CPU in spin loops. Profiling with async-profiler 2.9 showed that 14 % of the time was spent in Klass::oop_size, which we read as every polymorphic call probing the vtable. In other words, the JVM's dispatch overhead was now a measurable part of our latency budget.

The Architecture Decision

At that moment I did something I rarely do: I looked at the language itself instead of the runtime. The Kotlin compiler emits invokeinterface for every interface call, and the JVM's itable or vtable indirection costs 2–3 cycles per call. In our matchmaker we had 78 interfaces for match rules, and each rule was called 400 000 times per second. The allocation pressure from rule objects plus the indirection overhead meant we were hitting a fundamental limit of the JVM's object model.

I proposed rewriting the matchmaker in Rust. The plan was to use zero-cost static dispatch (generics that the compiler monomorphizes per rule type), avoid heap allocation for rules, and guarantee no GC pauses. The cache warmer stayed in Kotlin because its workload is mostly I/O bound and the GC pressure is tolerable there.
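To make the dispatch argument concrete, here is a minimal sketch of the two styles in Rust. Every name in it (Player, MatchRule, SkillGap, SameRegion, compatible) is hypothetical and chosen for illustration; it is not the service's actual rule model. The generic version is compiled once per concrete rule type, so the calls can be inlined and the rules live on the stack; the dyn version is shown only for comparison with the interface-call style.

```rust
// Hypothetical types for illustration; not the real matchmaker's data model.
#[derive(Clone, Copy)]
pub struct Player {
    pub skill: u32,
    pub region: u8,
}

pub trait MatchRule {
    fn allows(&self, a: &Player, b: &Player) -> bool;
}

#[derive(Clone, Copy)]
pub struct SkillGap {
    pub max_gap: u32,
}

#[derive(Clone, Copy)]
pub struct SameRegion;

impl MatchRule for SkillGap {
    fn allows(&self, a: &Player, b: &Player) -> bool {
        a.skill.abs_diff(b.skill) <= self.max_gap
    }
}

impl MatchRule for SameRegion {
    fn allows(&self, a: &Player, b: &Player) -> bool {
        a.region == b.region
    }
}

// A pair of rules is itself a rule, so a rule set composes into one
// concrete type without any heap allocation.
impl<A: MatchRule, B: MatchRule> MatchRule for (A, B) {
    fn allows(&self, a: &Player, b: &Player) -> bool {
        self.0.allows(a, b) && self.1.allows(a, b)
    }
}

// Static dispatch: monomorphized for each concrete R at compile time.
pub fn compatible<R: MatchRule>(rule: &R, a: &Player, b: &Player) -> bool {
    rule.allows(a, b)
}

// Dynamic dispatch for comparison: each call goes through a vtable and
// each rule sits behind a heap-allocated Box.
pub fn compatible_dyn(rules: &[Box<dyn MatchRule>], a: &Player, b: &Player) -> bool {
    rules.iter().all(|r| r.allows(a, b))
}
```

Calling compatible(&(SkillGap { max_gap: 100 }, SameRegion), &a, &b) evaluates both rules through direct calls. Whether the compiler actually inlines them depends on optimization settings, so the gain should be measured rather than assumed.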

The decision wasn't trivial. Rust was new to the team, the build pipeline needed a cargo vendor step to integrate with Bazel, and we had to port 12 000 lines of Kotlin to Rust in two weeks. The other risk was the unknown cost of crossing the FFI boundary between Kotlin and Rust for every match rule evaluation. We mitigated it by bulk-evaluating rule sets inside Rust and minimizing JNI calls.
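The bulk-evaluation idea can be sketched as a single C-ABI entry point that takes a whole batch, so the boundary is crossed once per batch instead of once per rule. This is an illustration under assumptions: RuleParams, Candidate, evaluate_batch, and their fields are hypothetical names, and the function uses a plain C ABI rather than real JNI glue. A real export would also need an unmangled or JNI-style symbol name, which is omitted here.

```rust
use std::slice;

// Hypothetical C-compatible layouts; the field names are assumptions.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct RuleParams {
    pub max_skill_gap: u32,
    pub max_latency_ms: u32,
}

#[repr(C)]
#[derive(Clone, Copy)]
pub struct Candidate {
    pub skill_a: u32,
    pub skill_b: u32,
    pub latency_ms: u32,
}

/// Evaluates `len` candidate pairs in one boundary crossing.
/// Writes 1 (accepted) or 0 (rejected) per candidate into `out` and
/// returns the number accepted, or -1 if any pointer is null.
///
/// # Safety
/// `params` must point to a valid RuleParams; `candidates` and `out`
/// must each point to at least `len` valid, properly aligned elements.
pub unsafe extern "C" fn evaluate_batch(
    params: *const RuleParams,
    candidates: *const Candidate,
    len: usize,
    out: *mut u8,
) -> i64 {
    if params.is_null() || candidates.is_null() || out.is_null() {
        return -1;
    }
    // Copy the parameters once per batch; nothing is shared across the boundary.
    let params = unsafe { *params };
    let candidates = unsafe { slice::from_raw_parts(candidates, len) };
    let out = unsafe { slice::from_raw_parts_mut(out, len) };

    let mut accepted = 0i64;
    for (c, o) in candidates.iter().zip(out.iter_mut()) {
        let ok = c.skill_a.abs_diff(c.skill_b) <= params.max_skill_gap
            && c.latency_ms <= params.max_latency_ms;
        *o = ok as u8;
        accepted += ok as i64;
    }
    accepted
}
```

The caller owns both buffers and the Rust side only borrows them for the duration of the call, which keeps ownership on one side of the boundary.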

What The Numbers Said After

We deployed the Rust matchmaker on 5 % of traffic on November 28 and watched the metrics. The GC pressure on the JVM side dropped to 30 MB/s allocation rate. On the Rust side, perf stat showed 0.08 instructions per cycle for the rule evaluation hot loop, and the branch misprediction rate was 2.3 %. The matchmaker latency p99 on Rust was 18 ms versus 35 ms on Kotlin.

Here are the concrete numbers we collected over a 72-hour window with 12 000 concurrent matches:

JVM matchmaker (Kotlin): p99 latency 42 ms, p95 28 ms, safety margin 18 ms under SLA.
Rust matchmaker: p99 latency 18 ms, p95 12 ms, safety margin 32 ms.
Max RSS per pod dropped from 1.8 GB to 380 MB.
Allocation rate per pod dropped from 850 MB/s to 12 MB/s.
Safepoint sync time dropped from 5 % CPU to 0.02 % CPU.

We ran flame graphs with perf for 30 minutes at 12 000 matches. The top entry in the Kotlin version was SafepointSynchronize::begin at 4.8 %, while the Rust version had no safepoint overhead at all. The only remaining GC pressure was from the Kotlin cache warmer, which we tolerated because it runs asynchronously.

What I Would Do Differently

I would not have wasted two weeks tuning the JVM. Once I saw that the vtable indirection was 3 cycles per call and multiplied by 400 000 calls per second, it was obvious the language model was the constraint, not the GC algorithm.

The second lesson is about ownership discipline. We initially tried to share rule objects between Kotlin and Rust via a shared Arc, which added 12 ns per match. We had to refactor to a bulk evaluation API that copies rule parameters once per match batch, reducing the cross-boundary cost to 2 ns. That refactor alone saved another 1.3 ms of p99 latency.

The third mistake was not measuring FFI cost early