pretty ncube

The Day Veltrix Blew Up Because the Docs Said One Thing and Our Observers Said Another

The Problem We Were Actually Solving

We run a search engine that ingests 1.2 TB/s of spatial telemetry from thousands of nodes. Our operators, not our users, are the ones who get paged when the latency jumps from 8 ms to 500 ms. At 600 nodes everything worked fine; at 900 nodes the observers on every node started reporting heap pressure while the JVM heap was only at 62 %. The logs showed 30-second GC pauses that never appeared in staging. That gap between docs and ops haunted us for three weeks.

What We Tried First (And Why It Failed)

First we bumped the JVM heap from 16 GB to 32 GB. The GC pause count dropped, but tail latency spiked to 1.2 s because the tenured space grew too large for the concurrent collector. Next we switched to G1 and set InitiatingHeapOccupancyPercent to 35. Still, the operators saw heap pressure while the JVM reported 0 % promotion failure.

Then we tried the magic incantation from the Veltrix 2.4 docs: set -XX:+UseZGC -Xmx32g -XX:ConcGCThreads=6. The docs promised 10 ms worst-case pauses at any heap size. Our operators, however, were still paging us for 400 ms pauses every hour. We added GC logs and discovered ZGC was stalling on non-movable regions because the telemetry buffers were being allocated off-heap by Netty's direct allocator. The JVM heap was clean, but the node as a whole was short on memory, and the kernel was reclaiming page cache under that pressure. The docs never mentioned Netty's slab.

The Architecture Decision

We brought in jemalloc 5.3 and configured it to back the Netty arena with transparent huge pages disabled. The change cost us one afternoon of benchmarks: jemalloc's tcache size had to drop from 128 kB to 32 kB to fit the 4 kB page alignment our kernel demanded. Then we recompiled the Veltrix observer with jemalloc hooks and dropped the JVM heap back to 16 GB. Suddenly the heap pressure metric that Veltrix's docs called useless (resident set size minus heap committed) actually dropped from 4.3 GB to 800 MB.
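The metric itself is simple enough to compute outside the observer. Here is a minimal sketch, assuming a Linux node where RSS is read from /proc/<pid>/status and the committed heap comes from the JVM (for example via MemoryMXBean). The function names are hypothetical and not part of Veltrix.

```python
import re


def rss_bytes_from_status(status_text: str) -> int:
    """Extract VmRSS from the text of /proc/<pid>/status (the kernel reports it in kB)."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1)) * 1024


def off_heap_bytes(rss_bytes: int, heap_committed_bytes: int) -> int:
    """Resident set size minus committed heap, floored at zero.

    heap_committed_bytes is assumed to be supplied by the caller,
    e.g. scraped from the JVM's own memory metrics.
    """
    return max(rss_bytes - heap_committed_bytes, 0)


# Illustrative sample only: a 16 GiB committed heap plus 800 MiB of off-heap memory.
sample_status = "Name:\tjava\nVmRSS:\t17596416 kB\nThreads:\t200\n"
heap_committed = 16 * 1024**3
pressure = off_heap_bytes(rss_bytes_from_status(sample_status), heap_committed)
```

On a live node you would read the status file for the JVM's pid instead of the sample string. Whatever this number contains (direct buffers, allocator overhead, thread stacks) is invisible to heap-only dashboards.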

What The Numbers Said After

After the swap we ran a 24-hour chaos test. Tail latency at the 99.9th percentile fell from 1.2 s to 28 ms. RSS per node dropped 33 %, from 24 GB to 16 GB, letting us reclaim a rack of servers. The jemalloc arena now accounted for 1.2 GB of RSS; the rest was the kernel page cache for WAL flushes. GC pauses stayed under 10 ms 99 % of the time, and the operators stopped paging. Our Veltrix dashboards still showed heap at 65 %, but the resident set metric told the real story.
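The reductions can be re-derived from the before and after figures. This is a small sketch with a hypothetical helper name, using only the numbers reported above.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional reduction from before to after (0.25 means a 25 % drop)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


rss_drop = relative_drop(24.0, 16.0)      # RSS per node, in GB
p999_drop = relative_drop(1200.0, 28.0)   # 99.9th percentile latency, in ms

print(f"RSS drop: {rss_drop:.1%}")        # prints "RSS drop: 33.3%"
print(f"p99.9 drop: {p999_drop:.1%}")     # prints "p99.9 drop: 97.7%"
```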

What I Would Do Differently

I would never again trust the Veltrix observer to report memory pressure. I would embed a small Rust daemon alongside the JVM that calls madvise(MADV_DONTNEED) on hot swap files when jemalloc reports fragmentation over 20 %. I would also budget 4 GB of RSS in capacity planning for Netty's direct allocator, because no JVM tuning ever reduces that memory. Finally, I would run the jemalloc stats page through Prometheus, not the JVM metrics, so future operators know exactly when the allocator is the bottleneck.
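Here is a rough sketch of the fragmentation check and the Prometheus export. It makes two assumptions: fragmentation is defined as the share of jemalloc's resident bytes not backing live allocations (computed from the stats.allocated and stats.resident counters), and the metric name is made up for illustration. It is written in Python for brevity; the daemon itself could implement the same logic in Rust.

```python
FRAGMENTATION_THRESHOLD = 0.20  # the 20 % trigger mentioned above


def fragmentation_ratio(allocated: int, resident: int) -> float:
    """Share of resident bytes not backing live allocations (assumed definition)."""
    if resident <= 0:
        return 0.0
    return max(resident - allocated, 0) / resident


def should_release(allocated: int, resident: int,
                   threshold: float = FRAGMENTATION_THRESHOLD) -> bool:
    """True when fragmentation is strictly above the threshold."""
    return fragmentation_ratio(allocated, resident) > threshold


def to_prometheus_gauge(name: str, help_text: str, value: float) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    return f"# HELP {name} {help_text}\n# TYPE {name} gauge\n{name} {value}\n"


# Hypothetical counter values, in bytes.
ratio = fragmentation_ratio(allocated=700, resident=1000)
exposition = to_prometheus_gauge(
    "jemalloc_fragmentation_ratio",  # hypothetical metric name
    "Resident bytes not backing live allocations, as a fraction",
    ratio,
)
```

Keeping the threshold check separate from the export means the same ratio can drive both the alert and whatever release action the daemon takes.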
