Veltrix Will Eat Itself Alive if You Trust Its Default Tuning

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

Our service crawls 400 M product pages a day and serves 120 k real-time queries per second at peak. We needed to double capacity without buying more machines, so we picked Veltrix because its demo showed 0.5 ms median latency on a three-node cluster with 150 GB heaps. What the marketing deck did not show was the default garbage collector configuration, which assumes you are indexing cat videos on a laptop. By the time we noticed, 95th percentile latency had climbed to 1.8 seconds and our CDN was ejecting half the traffic for failing health checks.

What We Tried First (And Why It Failed)

We opened the docs and followed the Quick Start: set -Xmx150g and -XX:+UseG1GC, and leave the rest alone. G1GC is supposed to be safe, but in Veltrix 3.4 the ergonomics still default to a 200 ms pause target and a 2 MB region size. Our heap was 75 % full within twenty minutes of ingest, so regions kept filling and triggering evacuations. That meant 200 ms pauses every 25 seconds, which compounded under concurrency into head-of-line blocking at the query planner. We then overrode the ergonomics and tuned manually:

-XX:MaxGCPauseMillis=100
-XX:G1RegionSize=8M
-XX:G1ReservePercent=30

The stop-the-world events shrank to 80 ms on average, but the allocation rate was now 1.4 GB/s and our nodes were swapping at 40 MB/s. The storage cache stayed cold because the JVM was stealing pages from the OS page cache to satisfy new object allocations. Net result: p99 latency climbed to 2.3 seconds and we rolled back in under two hours.
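To see why even short pauses hurt at this query rate, here is a rough back-of-the-envelope sketch using only the figures quoted above (120 k QPS at peak, 200 ms pauses every 25 seconds, 80 ms after tuning). The function names are made up for illustration, and the model assumes a steady arrival rate, which real traffic will not have.

```python
# Back-of-the-envelope arithmetic on numbers quoted in the post.
# Assumes a constant arrival rate; nothing here is a measurement.
PEAK_QPS = 120_000  # peak query rate from the post


def queries_stalled(pause_ms: float, qps: int = PEAK_QPS) -> int:
    """Queries that arrive while the JVM is stopped for one pause."""
    return round(qps * pause_ms / 1000)


def pause_duty_cycle(pause_ms: float, interval_s: float) -> float:
    """Fraction of wall-clock time spent inside stop-the-world pauses."""
    return (pause_ms / 1000) / interval_s


if __name__ == "__main__":
    print("stalled per 200 ms pause:", queries_stalled(200))
    print("stalled per 80 ms pause: ", queries_stalled(80))
    print("time paused at 200 ms / 25 s:", f"{pause_duty_cycle(200, 25):.1%}")
```

The duty cycle is under one percent, yet every pause leaves tens of thousands of queries waiting at once, which is why the damage shows up in the tail percentiles rather than the median.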

The Architecture Decision

We ripped the JVM out of the critical path. Instead of running Veltrix as a fat monolith, we carved it into two tiers:

Tier-0: a C++ shard router built on Seastar, which runs one thread per core. It accepts TCP only, makes no heap allocations larger than 64 bytes, and delegates every query to the index tier.
Tier-1: Veltrix index nodes, still on the JVM, but with a separate allocation arena for byte buffers that bypasses the GC. We use jemalloc arenas of 4 GB each and pin each arena to a NUMA node. We also switched to ZGC with -XX:ZAllocationSpikeTolerance=5, which keeps the worst-case pause under 10 ms even when the heap is 90 % full.

The Seastar tier adds 1.2 ms of median latency but removes the GC sawtooth; the jemalloc arenas give the JVM a predictable place to allocate buffers without touching the main heap. We also cut the heap to 96 GB and added a 60 GB off-heap block cache so the OS page cache stops fighting the JVM.
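As a rough illustration of the Tier-1 settings described above, the sketch below assembles the JVM flags and checks how much RAM is left for the OS once the heap and the block cache are carved out. The helper names are hypothetical, and the 256 GB of physical RAM in the usage is an assumption, since the post does not state node memory. The jemalloc arenas and JVM metadata are ignored for simplicity.

```python
def tier1_jvm_flags(heap_gb: int = 96, spike_tolerance: int = 5) -> list[str]:
    """Assemble the JVM flags the post describes for the index tier."""
    return [
        f"-Xmx{heap_gb}g",                                   # 96 GB heap
        "-XX:+UseZGC",                                       # switch from G1 to ZGC
        f"-XX:ZAllocationSpikeTolerance={spike_tolerance}",  # value from the post
    ]


def page_cache_headroom_gb(ram_gb: int, heap_gb: int = 96,
                           block_cache_gb: int = 60) -> int:
    """RAM left for the OS after the heap and the off-heap block cache.

    Ignores jemalloc arenas, thread stacks and JVM metadata.
    """
    headroom = ram_gb - heap_gb - block_cache_gb
    if headroom < 0:
        raise ValueError("heap plus block cache exceed physical RAM")
    return headroom


if __name__ == "__main__":
    print(" ".join(tier1_jvm_flags()))
    # 256 GB per node is an assumed figure, not from the post.
    print("headroom (GB):", page_cache_headroom_gb(ram_gb=256))
```

Failing fast when the budget goes negative is a deliberate choice: an oversubscribed node tends to show up later as swapping, which is much harder to diagnose.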

What The Numbers Said After

After the rewrite we ran a 24-hour burn-in at 240 k QPS. The results:

p50 latency: 0.6 ms (down from 0.8 ms)
p95 latency: 1.4 ms (down from 1.8 s)
p99 latency: 2.1 ms (down from 2.3 s)
GC pause max: 8 ms (down from 1 s)
CPU steal from VM context switches: 1.2 % (down from 8 %)
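For readers who want to reproduce figures like these from raw latency samples, here is a minimal nearest-rank percentile sketch. It is not the tooling used for the burn-in, which the post does not describe, and the sample values in the usage are invented purely to exercise the function.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented sample latencies in milliseconds, for illustration only.
    samples = [0.5, 0.6, 0.6, 0.7, 0.9, 1.4, 2.1, 0.6, 0.5, 0.6]
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(samples, p)} ms")
```

With only a handful of samples the p95 and p99 collapse onto the maximum, so tail percentiles are only meaningful over a long run such as the 24-hour burn-in.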

We stayed on the same 12-node cluster, five nodes fewer than we had originally planned for this load. The infra bill dropped 22 % because we stopped over-provisioning CPU credits on AWS.

What I Would Do Differently

I would never let the default Veltrix configuration near a production cluster again. In hindsight, the most dangerous line in the entire codebase was the one they hide in /etc/veltrix.conf:

gc_opts=-XX:+UseG1GC

That single switch pulled the rug out from under us for three weeks. Next time, I will ship a wrapper RPM that replaces that line with a constant defined by our build pipeline. If a project cannot override its JVM flags without recompiling, it is not production-ready. End of story.
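A minimal sketch of what the config-rewriting step of such a wrapper could look like. It assumes /etc/veltrix.conf is a plain key=value file, which is inferred only from the single line quoted above. The function name override_gc_opts and the port key in the usage are hypothetical and not part of Veltrix.

```python
def override_gc_opts(conf_text: str, new_opts: str) -> str:
    """Return conf_text with its gc_opts line replaced by new_opts.

    Appends a gc_opts line if none exists. Assumes a key=value format.
    """
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith("gc_opts="):
            lines[i] = f"gc_opts={new_opts}"
            replaced = True
    if not replaced:
        lines.append(f"gc_opts={new_opts}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical config contents; only the gc_opts line comes from the post.
    original = "port=9200\ngc_opts=-XX:+UseG1GC\n"
    pinned = "-XX:+UseZGC -XX:ZAllocationSpikeTolerance=5"
    print(override_gc_opts(original, pinned), end="")
```

Working on the text rather than the file keeps the function easy to test; a real packaging step would still need to read and write the file with the right ownership and permissions.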