Veltrix Was Losing Events in Plain Sight: Here's the Flame Graph That Proved It

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We traced a single match replay request through hy-trace and saw 16 ms stuck in two JVM stack frames labeled zio.stream.internal.ZStream$$anon$1.nextBatch. The async-profiler snapshot showed 389 k allocations per second in the managed heap, all from ZStream pulling from a ConcurrentLinkedQueue. The queue was unbounded because we had copied the Veltrix sample for Hytale without changing the buffer sizes.

The latency spike wasn't a GC pause; it was allocation churn. Every time the ZIO runtime decided to fold a batch, it materialized intermediate streams, boxing every game event into an Either[Nothing, Event] and then immediately unboxing it downstream. The docs never mentioned the cost of structural sharing in ZIO 2.0 streams, and our config overrides had left the default chunk size of 4 k untouched.

What We Tried First (And Why It Failed)

We switched the ZIO runtime to ZIO.withParallelism(32) to match the core count, hoping the scheduler would spread the load. It did, until the nursery filled with 12 k suspended fibers, each holding a 64 k chunk reference. The heap jumped from 2 GB to 8 GB in 40 minutes, and the JVM triggered a Full GC every 12 minutes. The match replay endpoint started timing out, with P99 at 220 ms.

We also tried increasing maxJvmHeapSize to 12 GB, which only delayed the inevitable. The JVM still spent 28 % of CPU in safepoint cleanup because the ZGC cycle couldn't keep up with the allocation rate. We needed to change the language, not the knobs.

The Architecture Decision

We rewrote the trace ingestion layer in Rust, using Tokio with a bounded multi-producer, single-consumer channel of 8 k events. The channel was sized after measuring the match replay fan-out: 720 players per match, each requesting 256 events at 80 Hz. 8 k kept the queue within one memory page of 64 k.
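A bounded queue with many producers and one consumer can be sketched roughly as follows. This is a minimal illustration, not the actual ingestion code: it uses the standard library's std::sync::mpsc::sync_channel as a dependency-free stand-in for a Tokio bounded channel, the Event fields are hypothetical, and "8 k" is assumed to mean 8 * 1024 slots.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

// Hypothetical event type; the real one is not shown in this post.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct Event {
    player: u16,
    tick: u64,
}

// Assumption: "8 k" is read as 8 * 1024 slots.
const CAPACITY: usize = 8 * 1024;

/// Fills a bounded channel to capacity with no consumer draining it, then
/// reports whether one further non-blocking send is rejected as "full".
fn overflow_is_rejected(capacity: usize) -> bool {
    let (tx, _rx) = sync_channel::<Event>(capacity);
    for tick in 0..capacity as u64 {
        tx.try_send(Event { player: 0, tick }).expect("queue has room");
    }
    matches!(
        tx.try_send(Event { player: 0, tick: 0 }),
        Err(TrySendError::Full(_))
    )
}

/// Runs `producers` threads that each send `per_producer` events into one
/// bounded queue and returns how many events the single consumer received.
fn run_pipeline(producers: u16, per_producer: u64) -> u64 {
    let (tx, rx) = sync_channel::<Event>(CAPACITY);
    let handles: Vec<_> = (0..producers)
        .map(|player| {
            let tx = tx.clone();
            thread::spawn(move || {
                for tick in 0..per_producer {
                    // `send` blocks while the queue is full; that is the backpressure.
                    tx.send(Event { player, tick }).expect("consumer alive");
                }
            })
        })
        .collect();
    drop(tx); // the channel closes once every producer clone is dropped
    let received = rx.iter().count() as u64;
    for handle in handles {
        handle.join().unwrap();
    }
    received
}

fn main() {
    println!("overflow rejected: {}", overflow_is_rejected(CAPACITY));
    println!("received: {}", run_pipeline(4, 10_000));
}
```

The point of the bound is that memory use has a ceiling: producers either wait or get an explicit "full" error, instead of growing a queue without limit the way the unbounded ConcurrentLinkedQueue did.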

We replaced Either[Nothing, Event] with a packed enum using #[repr(u8)], which pins the tag to a single byte with no pointer indirection, cutting each event's footprint from 48 bytes to 24 bytes. We used tracing for spans instead of ZIO logging, so the cost of span creation was a single AtomicU64 increment rather than an object allocation.
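To make the layout claim concrete, here is a small sketch of a #[repr(u8)] enum whose size can be checked with size_of. The variants and field widths are invented for illustration and are not the production event schema; they are chosen so the largest variant lands on 24 bytes on typical 64-bit targets.

```rust
use std::mem::size_of;

// Hypothetical variants; names and field widths are illustrative only.
// With #[repr(u8)], each variant is laid out like a C struct whose first
// field is a one-byte tag, and the enum is as large as its largest variant.
#[allow(dead_code)]
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    // tag(1) + pad(1) + player(2) + x, y, z (4 each) + tick(8) = 24 bytes
    Move { player: u16, x: f32, y: f32, z: f32, tick: u64 },
    // tag(1) + pad(1) + player(2) + message_id(4) + tick(8) = 16 bytes
    Chat { player: u16, message_id: u32, tick: u64 },
}

impl Event {
    /// Reads the one-byte tag. This cast is valid only because the enum is
    /// declared #[repr(u8)], which places the discriminant at offset 0.
    fn tag(&self) -> u8 {
        unsafe { *(self as *const Self as *const u8) }
    }
}

fn main() {
    let moved = Event::Move { player: 7, x: 1.0, y: 2.0, z: 3.0, tick: 42 };
    let chat = Event::Chat { player: 7, message_id: 9, tick: 43 };
    println!("size_of::<Event>() = {}", size_of::<Event>());
    println!("tags: {} {}", moved.tag(), chat.tag());
}
```

Because the events are plain Copy values with no heap pointers, a queue of them is one contiguous buffer, which is what keeps allocations out of the hot path.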

The final change was the scheduler. We set tokio::runtime::Builder::new_multi_thread().worker_threads(24).max_blocking_threads(8) because the I/O on match logs was network-bound, not CPU-bound. The flame graph shifted: 62 % of time in tokio::io::poll_read, 14 % in epoll_wait, and zero allocations in the hot path.

What The Numbers Said After

After the change, the 48-core box still ingested 1.8 M events per second, but median latency dropped to 6 ms and P99 to 28 ms. Allocation rate fell from 389 k/s to 11 k/s. The GC pauses vanished, and we disabled ZGC entirely.

Profiling with perf record -F 999 -g -- ./hy-trace showed 0.4 % time in system calls related to memory management. The channel never blocked on push because we set the backpressure threshold at 90 % full, and the Tokio work-stealing scheduler kept the workers saturated.
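One possible reading of a "90 % full" backpressure threshold is a high-water mark that rejects new events before the queue can actually fill, so a push never has to block. The sketch below shows that idea under stated assumptions: gauged_channel, GaugedSender, GaugedReceiver, and Offer are hypothetical names, and the standard library's sync_channel again stands in for the Tokio channel.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
enum Offer {
    Accepted,
    Throttled,
}

struct GaugedSender<T> {
    tx: SyncSender<T>,
    depth: Arc<AtomicUsize>,
    high_water: usize,
}

struct GaugedReceiver<T> {
    rx: Receiver<T>,
    depth: Arc<AtomicUsize>,
}

/// Builds a bounded channel that starts throttling at `percent` of capacity.
fn gauged_channel<T>(capacity: usize, percent: usize) -> (GaugedSender<T>, GaugedReceiver<T>) {
    let (tx, rx) = sync_channel(capacity);
    let depth = Arc::new(AtomicUsize::new(0));
    let high_water = capacity * percent / 100;
    (
        GaugedSender { tx, depth: Arc::clone(&depth), high_water },
        GaugedReceiver { rx, depth },
    )
}

impl<T> GaugedSender<T> {
    /// Non-blocking push: reserve a slot, and back out if the reservation
    /// would cross the high-water mark or the queue refuses the item.
    fn offer(&self, item: T) -> Offer {
        let previous = self.depth.fetch_add(1, Ordering::AcqRel);
        if previous >= self.high_water || self.tx.try_send(item).is_err() {
            self.depth.fetch_sub(1, Ordering::AcqRel);
            return Offer::Throttled;
        }
        Offer::Accepted
    }
}

impl<T> GaugedReceiver<T> {
    /// Non-blocking pop that releases the slot it frees.
    fn recv(&self) -> Option<T> {
        let item = self.rx.try_recv().ok()?;
        self.depth.fetch_sub(1, Ordering::AcqRel);
        Some(item)
    }
}

fn main() {
    let (tx, rx) = gauged_channel::<u32>(10, 90);
    for i in 0..10 {
        println!("offer {i}: {:?}", tx.offer(i));
    }
    println!("recv: {:?}", rx.recv());
}
```

What the caller does with a Throttled result (drop, retry later, or signal upstream) is a policy decision that this sketch deliberately leaves open.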

We measured RSS with /usr/bin/time -v, and it stabilized at 820 MB after startup, with 70 MB RSS growth over seven days. The previous JVM version peaked at 12 GB RSS and grew continuously due to the unbounded ZIO nursery.

What I Would Do Differently

I would not have trusted the Veltrix sample for Hytale. The Hytale replay load is a fan-out pattern, not a fan-in, and the default buffers were designed for telemetry ingestion, not real-time player queries.

I would have profiled earlier. The moment we saw 389 k allocations per second, we should have rewritten that segment instead of tweaking the runtime. The docs do not warn you that ZIO streams can allocate more than the JVM itself.

I would also pre-size the Tokio channel based on the fan-out factor rather than CPU cores. Our worker count was correct, but the buffer size was wrong. Measuring the exact fan-out under load saved us from a second rewrite.

The lesson is simple: when allocation counts leak into the millions per second, the runtime is the constraint, not the language.