The Problem We Were Actually Solving
We had built a single-node event-processing pipeline that could run 9,000 events per second with sub-10 ms latency on synthetic data, until real traffic arrived. Our Kafka consumers were dropping 20 % of messages, tail latencies jumped to 400 ms, and heap dumps grew to 2.6 GB within minutes. The observability stack pushed every metric to Prometheus, but no one could explain why two otherwise identical JSON documents, one with a nested array of 128 elements and one without, could differ by 600 MB of resident set size. We had tuned every JVM switch we knew: -Xmx8G, G1GC, -XX:MaxGCPauseMillis=100. We had even switched from Log4j to Logback with async appenders. Nothing moved the needle. At 02:33 on a Sunday morning the on-call engineer's pager lit up with 18 pages inside 60 seconds, all for GC overhead above 98 %. That was the moment we accepted the uncomfortable truth: the runtime was the constraint, not the code.
What We Tried First (And Why It Failed)
Our first reflex was to deepen the Java tuning ritual. We generated flame graphs with async-profiler 2.9 and saw 42 % of CPU cycles spent in String.intern(), a relic of a three-year-old decision to deduplicate event IDs globally. We dropped string interning and moved to a pooled ByteBuffer strategy for network I/O. The GC pauses got shorter, but pause jitter spiked because ByteBuffer allocations still triggered young-gen evacuation. We then tried the Azul Zulu Prime JVM with its pauseless C4 collector. The 90th percentile latency fell to 22 ms, but the 99.9th percentile climbed to 2.1 s because C4's concurrent marking phase fought for memory bandwidth with the event engine. At the same time, our Kubernetes operator was spawning new pods every 6 minutes because the Horizontal Pod Autoscaler saw CPU utilisation swing from 65 % to 95 % within 30-second windows. The autoscaler treated each spike as a signal to add capacity, so we ended up with 24 pods for a workload that only needed 8. The infrastructure bill roughly tripled and the event-ordering guarantees started to weaken. Something had to give.
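To make the autoscaler behaviour concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: the desired count is the current count scaled by the ratio of observed to target metric, rounded up, with a default 10 % tolerance band. The function name `desired_replicas` is hypothetical, and the 60 % target is the figure quoted later in this post; whether it was also the target during the incident is an assumption. The real controller additionally applies min/max bounds and stabilisation windows, which are left out here.

```rust
/// Replica count requested for a single metric, following the formula in
/// the Kubernetes HPA documentation:
///   desired = ceil(current_replicas * current_metric / target_metric)
/// Simplified sketch: no min/max bounds, no stabilisation window.
pub fn desired_replicas(current_replicas: u32, current_metric: f64, target_metric: f64) -> u32 {
    // Default HPA tolerance: ratios within 10 % of 1.0 cause no scaling.
    const TOLERANCE: f64 = 0.1;

    let ratio = current_metric / target_metric;
    if (ratio - 1.0).abs() <= TOLERANCE {
        return current_replicas;
    }
    (current_replicas as f64 * ratio).ceil() as u32
}
```

With these assumed inputs, `desired_replicas(8, 95.0, 60.0)` asks for 13 pods. If the spike is a GC artefact that recurs regardless of pod count, the next evaluation starts from 13 and asks for 21, which is one plausible way a workload sized for 8 pods drifts toward 24.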
The Architecture Decision
We evaluated three paths off the JVM: Go, Rust, and a rewrite in Kotlin/Native. Kotlin/Native promised zero-cost abstractions and a smaller memory footprint, but the coroutine scheduler added 30 µs of indirection on every async send and the linker still pulled in a 3 MB libLLVM runtime. The Go runtime was fast and predictable, but the GC trace showed 1.2 million allocations per second on event deserialisation, and we measured 45 % of CPU time in the sweep phase. Rust became the obvious choice once we ran tokio-console on a nightly build and saw 98 % of tasks blocked on a single unconditional Park event, with no unwinding and no costly stack walking. We rewrote the event parser to use serde with zero-copy deserialisation when the payload was plain JSON, and borrowed slices for the nested array case. The switch to Tokio 1.27's work-stealing scheduler let us pin I/O threads to dedicated cores and move the parser to five worker threads on a 16-core box. The parser itself contains no unsafe blocks; the only unsafe code lives in the custom allocator we wrote to recycle byte buffers and avoid touching the global allocator on hot paths.
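The buffer-recycling idea can be sketched without any unsafe code. What follows is not our allocator: it is a simplified, single-threaded pool built on `Vec<u8>`, and every name in it (`BufferPool`, `acquire`, `release`, `recycle_rate`) is hypothetical. It only illustrates the principle that a released buffer keeps its heap allocation and is handed back out on the next request, so the hot path does not reach the global allocator.

```rust
/// A minimal single-threaded pool of reusable byte buffers.
/// Illustrative only: a production pool would need to be thread-safe.
pub struct BufferPool {
    free: Vec<Vec<u8>>,  // buffers waiting to be reused
    buf_capacity: usize, // capacity of each freshly allocated buffer
    max_pooled: usize,   // upper bound on retained buffers
    hits: u64,           // requests served from the pool
    misses: u64,         // requests that had to allocate
}

impl BufferPool {
    pub fn new(buf_capacity: usize, max_pooled: usize) -> Self {
        BufferPool {
            free: Vec::with_capacity(max_pooled),
            buf_capacity,
            max_pooled,
            hits: 0,
            misses: 0,
        }
    }

    /// Hand out an empty buffer, reusing a pooled one when available.
    pub fn acquire(&mut self) -> Vec<u8> {
        match self.free.pop() {
            Some(buf) => {
                self.hits += 1;
                buf
            }
            None => {
                self.misses += 1;
                Vec::with_capacity(self.buf_capacity)
            }
        }
    }

    /// Return a buffer. `clear` resets the length but keeps the allocation.
    pub fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        if self.free.len() < self.max_pooled && buf.capacity() >= self.buf_capacity {
            self.free.push(buf);
        }
        // Otherwise the buffer is dropped and its memory freed.
    }

    /// Fraction of requests served without a new allocation.
    pub fn recycle_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.hits as f64 / total as f64
        }
    }
}
```

A caller would `acquire` a buffer per incoming event, fill it from the socket, parse borrowed slices out of it, and `release` it once the parsed view is no longer needed. Tracking hits and misses in the pool is what makes a recycle-rate figure cheap to report.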
What The Numbers Said After
After the rewrite we ran the same synthetic traffic in a 30-minute steady-state test. The RSS dropped from 2.6 GB to 140 MB at 12,000 events/sec. The p99 latency fell from 400 ms to 8 ms, and the 99.9th percentile never exceeded 23 ms. The GC pressure vanished: the Rust binary with jemalloc reported only 600 KB of heap growth per minute, and the allocator recycled 99.8 % of the 1 KB event buffers without a single syscall. In production, on a burst of 45,000 events in 5 seconds, the engine sustained 37,000 events/sec with a 95th percentile latency of 12 ms and zero message drops. The Kubernetes HPA stabilised because CPU utilisation fluctuated by less than 3 % around the target of 60 %; we reduced the pod count from 24 to 6 and cut the node budget by 40 %. perf top showed the new hotspot was memcpy inside the kernel's TCP stack, not the user-space parser. We had finally shifted the constraint from the runtime to OS networking.
What I Would Do Differently
I would not have assumed that moving to another garbage-collected language would solve the problem; the GC itself became the tail that wagged the dog. The second mistake was underestimating the cost of serialisation layers: JSON parsing in Java with Jackson's tree model allocated 3x more objects than the equivalent Rust code that used simd-json and borrowed slices. I would also instrument earlier: adding a custom allocator to a JVM two weeks into an incident is harder than starting with jemalloc in Rust and reading its allocator statistics from day one. Finally, I would have isolated the event engine from Kubernetes metrics noise from the beginning. The HPA reacted to CPU spikes that were artefacts of GC, not of workload, which led us to over-provision threefold. A custom metrics adapter that reported allocator pressure instead of CPU utilisation would have saved days of debugging.
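As a rough illustration of what "allocator pressure" could mean as a metric, the sketch below wraps the system allocator in a counting `GlobalAlloc` and derives a bytes-per-second rate from two counter samples. It uses only the standard library rather than jemalloc's own statistics, and the names `CountingAlloc`, `allocated_bytes`, and `allocation_rate` are hypothetical. Exporting the value to Kubernetes through a custom metrics adapter is not shown.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Running total of bytes requested from the allocator.
static ALLOCATED_BYTES: AtomicU64 = AtomicU64::new(0);

/// Wrapper that counts requested bytes, then defers to the system allocator.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        // Safety: the caller upholds GlobalAlloc's contract for `layout`.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Safety: `ptr` was returned by `alloc` above with the same layout.
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Current value of the allocation counter.
pub fn allocated_bytes() -> u64 {
    ALLOCATED_BYTES.load(Ordering::Relaxed)
}

/// Bytes allocated per second between two counter samples.
pub fn allocation_rate(prev_bytes: u64, now_bytes: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    now_bytes.saturating_sub(prev_bytes) as f64 / secs
}
```

Sampling `allocated_bytes()` on a fixed interval and feeding consecutive samples to `allocation_rate` yields a gauge that tracks work done by the allocator rather than CPU time, which is the kind of signal an autoscaler could key on instead of raw utilisation.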