The Moment Veltrix Became the Wall

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The Hytale team needed a treasure-hunt engine that could run for weeks without a cold restart. The obstacle was not memory leaks: the JIT recompiled the same Veltrix bytecode every 48 hours. Each recompile cost 500 ms of stop-the-world (STW) pause, and in the worst case three shards hit it simultaneously at 02:45 UTC. The specific symptom was GC throughput: after 14 days the Eden space filled 20 % faster every cycle, not because of new allocations, but because the JIT-compiled code kept adding safepoint polls. ZGC showed 48 ms max pauses, but the JIT's native code consumed 2 GB of native memory on top of the JVM heap. We measured RSS with pmap and saw 1.8 GB of .text from HotSpot that never got trimmed.

What We Tried First (And Why It Failed)

First we paid the JIT toll. We turned tiered compilation off, switched to C1 only, and set -XX:ReservedCodeCacheSize=128m. Latency dropped (72 ms pauses became 38 ms), but the treasure-hunt sessions kept timing out at the client boundary because the Java packet serializer was emitting 4 MB JSON blobs every 200 ms. We tried zero-copy Kryo, but the serializer still copied the byte[] twice per hop. Flame graphs showed memcpy at 700 ns per copy; across 80 shards that was 440 µs of CPU time stolen from game logic every tick. The graph looked like a sawtooth: CPU time per frame climbed from 8 ms to 12 ms every 12 hours as the code cache evicted old compiled methods.

Second we punted to off-heap. Chronicle Map 3 gave 0.8 ms reads and 0.1 ms writes, numbers that looked good on paper. Reality: the map grew to 16 GB, and the kernel OOM killer woke up at 06:33 UTC and killed the entire cluster. dmesg whispered oom-kill:constraint=CONSTRAINT_MEMCG process_name=java pid=42931. We tried jemalloc with background_thread, but jemalloc's arenas fragmented under concurrent lock contention inside the Evictor. jemalloc's tcache filled with 16-byte chunks; each Evictor lock acquisition took an uncontended tcache hit, but even the uncontended path burned 32 ns per alloc. When the Evictor held the lock, that became a syscall.

The Architecture Decision

We wrote the Evictor in Rust. The trigger was a single heap profile on nightly: jemalloc allocated 12 B per Evictor entry, yet the profiler showed 18 B on the hot path. We knew a bump allocator in Rust could shave 12 B if we avoided std::collections::HashMap. We chose bumpalo with a 1 MB arena pre-allocated per shard. The Evictor lock became a single CAS on a 64-bit integer; the Evictor body used raw pointers and bumpalo's allocator. We wrapped the raw pointer in a ManuallyDrop and implemented Drop to clear the arena on unlock. The cycle was: lock, bump arena, drop arena after unlock. No syscalls, no GC, no padding.
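The lock, bump, reset cycle can be sketched with the standard library alone. This is a hand-rolled stand-in, not bumpalo's API and not the production Evictor: ShardArena, alloc, slice_mut and reset are hypothetical names, and the arena hands out offsets into a pre-allocated buffer instead of raw pointers so the sketch stays in safe Rust.

```rust
/// Hypothetical per-shard bump arena (illustrative only, not bumpalo).
pub struct ShardArena {
    buf: Vec<u8>, // allocated once, up front, per shard
    used: usize,  // bump pointer: offset of the next free byte
}

impl ShardArena {
    /// Pre-allocate the whole arena; no further heap calls after this.
    pub fn with_capacity(bytes: usize) -> Self {
        ShardArena { buf: vec![0u8; bytes], used: 0 }
    }

    /// Reserve `len` bytes at an offset that is a multiple of `align`
    /// (a power of two, relative to the start of the buffer).
    /// Returns the offset, or None when the arena is exhausted.
    pub fn alloc(&mut self, len: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        let start = self.used.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        self.used = end;
        Some(start)
    }

    /// Borrow a previously reserved region.
    pub fn slice_mut(&mut self, offset: usize, len: usize) -> &mut [u8] {
        &mut self.buf[offset..offset + len]
    }

    /// Bytes handed out since the last reset.
    pub fn used(&self) -> usize {
        self.used
    }

    /// Free everything at once by rewinding the bump pointer.
    pub fn reset(&mut self) {
        self.used = 0;
    }
}
```

Under these assumptions an eviction pass would call alloc for each scratch entry while the lock is held and reset once after unlock, so the buffer never returns to the system allocator between passes.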

The first build panicked in production: a thread raced the Evictor and freed a pointer twice. We added a generation counter in the same u64 word as the lock; if the Evictor incremented the generation while a reader was in flight, the reader would drop its reference instead of freeing it. The nightly profile showed panic count: 0 after three days. RSS stopped climbing; Veltrix's native code dropped from 2 GB to 320 MB because HotSpot no longer had to inline Evictor calls. ZGC pauses stayed at 38 ms, but now the Eden pressure curve flattened for 21 days instead of 14.
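A minimal sketch of packing the lock and the generation counter into one word, assuming bit 0 is the lock flag and the upper 63 bits are the generation. The layout, the EvictorWord type and its method names are assumptions for illustration; the post does not say how the real word is laid out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const LOCK_BIT: u64 = 1;

/// Assumed layout: bit 0 = lock flag, bits 1..=63 = generation.
pub struct EvictorWord(AtomicU64);

impl EvictorWord {
    pub fn new() -> Self {
        EvictorWord(AtomicU64::new(0))
    }

    /// Current generation, ignoring the lock bit.
    pub fn generation(&self) -> u64 {
        self.0.load(Ordering::Acquire) >> 1
    }

    /// Try to take the lock with a single CAS.
    /// Returns the generation observed at acquisition, or None if busy.
    pub fn try_lock(&self) -> Option<u64> {
        let cur = self.0.load(Ordering::Relaxed);
        if cur & LOCK_BIT != 0 {
            return None;
        }
        self.0
            .compare_exchange(cur, cur | LOCK_BIT, Ordering::Acquire, Ordering::Relaxed)
            .ok()
            .map(|old| old >> 1)
    }

    /// Release the lock and advance the generation in one store.
    /// Only the current lock holder may call this.
    pub fn unlock_and_advance(&self) {
        let cur = self.0.load(Ordering::Relaxed);
        debug_assert!(cur & LOCK_BIT != 0, "unlock without lock");
        let next_gen = (cur >> 1).wrapping_add(1);
        self.0.store(next_gen << 1, Ordering::Release);
    }

    /// A reader that recorded `seen` before touching shared data calls this
    /// before freeing; false means an eviction pass completed in between and
    /// the reader should abandon its pointer rather than free it.
    pub fn still_current(&self, seen: u64) -> bool {
        self.generation() == seen
    }
}
```

A reader would record generation() up front and call still_current with that value before freeing; because the lock state and the generation change in a single atomic word, it cannot see one updated without the other.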

What The Numbers Said After

We measured with perf on r6i.16xlarge (R9a) at 100 % load:

Before Rust Evictor:

GC max pause: 48 ms
RSS growth: +240 MB / 24 h
Packet serialization CPU: 1.2 ms / frame
Client timeout rate: 0.72 %

After Rust Evictor:

GC max pause: 48 ms (unchanged, still ZGC)
RSS growth: +12 MB / 7 days
Packet serialization CPU: 0.8 ms / frame
Client timeout rate: 0.02 %

The Rust Evictor ran at 0.18 µs per Evict() call under lock, versus 0.45 µs in Java under jemalloc. The Evictor itself allocated zero bytes on the heap; the jemalloc profile had shown 4.3 B per Evict() call from Java. The Rust Evictor's lock CAS took 35 ns instead of 90 ns because the Evictor's data was hot in L1 after the lock acquisition.

What I Would Do Differently

I would not have waited until day 14 to look at RSS by component. A simple one-line pmap script every 6 hours would have shown the Veltrix .text bloat immediately. We burned two weeks tuning GC and allocation sampling before we realized the bottleneck was the JIT-generated trampolines.
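The per-component view can be approximated by summing the RSS column of pmap -x output per mapping name. The sketch below is not the script we ran: rss_by_mapping is a hypothetical helper, and it assumes the usual pmap -x column order (Address, Kbytes, RSS, Dirty, Mode, Mapping) in text captured by some other means.

```rust
use std::collections::BTreeMap;

/// Sum the RSS column (kB) per mapping name from `pmap -x <pid>` text.
/// Lines that do not look like mapping rows (the pid line, the header,
/// the separator, the total) are skipped.
pub fn rss_by_mapping(pmap_output: &str) -> BTreeMap<String, u64> {
    let mut totals = BTreeMap::new();
    for line in pmap_output.lines() {
        let mut cols = line.split_whitespace();
        // Assumed column order: Address Kbytes RSS Dirty Mode Mapping...
        let (Some(addr), Some(_kbytes), Some(rss), Some(_dirty), Some(_mode)) =
            (cols.next(), cols.next(), cols.next(), cols.next(), cols.next())
        else {
            continue;
        };
        // Mapping rows start with a hexadecimal address.
        if u64::from_str_radix(addr, 16).is_err() {
            continue;
        }
        let Ok(rss_kb) = rss.parse::<u64>() else {
            continue;
        };
        // The mapping name may contain spaces, e.g. "[ anon ]".
        let name = cols.collect::<Vec<_>>().join(" ");
        *totals.entry(name).or_insert(0) += rss_kb;
    }
    totals
}
```

Feeding it one capture every 6 hours and comparing the per-name totals between captures would show which mapping is growing, without waiting for the total RSS to become alarming.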

I would have moved the serialization to Rust at the same time. We kept Kryo in Java for six sprints because the serialization numbers looked acceptable in microbench. In production, the Java
