The Operator Who Buried Veltrix Alive (And How We Dug Ourselves Out)

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The Treasure Hunt Engine isn't a database; it's a reverse index that ingests 1.8 million events per second, runs a two-stage Bloom filter against 47 million active fingerprints, then routes partial results to 14 downstream services. Our SLA was 200 ms p95 end-to-end, with a 5 % error budget. We hit the budget on day 23, not because we were slow, but because GC pauses on the historical nodes were cascading into the real-time layer. The Veltrix operator console happily showed green, but the actual query path looked like this:

client → nginx → Veltrix gateway → shard coordinator → [n historical nodes]

When a historical node paused for 400 ms, TCP retransmissions on the coordinator side would back off exponentially, turning 150 ms of network jitter into 11 seconds of latency. We opened a ticket with Veltrix support and got back a PDF titled "Understanding Your Shard Distribution". The ticket number was TRE-2024-008712; it is still open.
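To see how a sub-second pause can compound into seconds of delay, here is a minimal sketch of the cumulative wait when a retransmission timer doubles after every consecutive loss, which is the backoff rule RFC 6298 specifies for TCP. The 200 ms starting timeout, the 120 s ceiling, and the function name are assumptions for illustration, not values measured during this incident.

```go
package main

import (
	"fmt"
	"time"
)

// cumulativeBackoff returns the total time spent waiting after `losses`
// consecutive retransmission timeouts, when the timer starts at `initial`
// and doubles after every loss, capped at `ceiling`.
func cumulativeBackoff(initial, ceiling time.Duration, losses int) time.Duration {
	total := time.Duration(0)
	rto := initial
	for i := 0; i < losses; i++ {
		total += rto
		rto *= 2
		if rto > ceiling {
			rto = ceiling
		}
	}
	return total
}

func main() {
	// Assumed values: 200 ms starting timeout, 120 s ceiling.
	for losses := 1; losses <= 6; losses++ {
		fmt.Printf("%d consecutive losses -> %v waiting\n",
			losses, cumulativeBackoff(200*time.Millisecond, 120*time.Second, losses))
	}
}
```

Under these assumed values, five consecutive losses add up to 6.2 s of waiting and six add up to 12.6 s, so an 11-second stall needs only a handful of lost segments in a row.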

What We Tried First (And Why It Failed)

First, we followed the canonical playbook: increase raft_election_timeout, lower gc_percent from 50 % to 25 %, and split the historical tier into two separate clusters, one for hot data (last 7 days) and one for cold (30–90 days). The error count dropped by 12 %, but p99 latency stayed at 8 seconds. Then we tried heap sizing. We set -Xmx32g on the historical nodes, but after three hours the G1 collector would still pause for 340 ms, and the coordinator would start retrying, which pushed the coordinator heap from 8 GB to 22 GB in 40 minutes. The coordinator was written in Go, not Java, but it made synchronous HTTP calls to each historical node, so the backpressure multiplied.

We switched to asynchronous fan-out with a buffered channel of 512, but the coordinator memory ballooned to 36 GB and the kernel OOM killer sent PID 42013 to the great beyond. At that point, the entire cluster was stuck in a GC death spiral where every node was spending 60 % of its CPU just compacting objects and the remaining 40 % handling retransmit storms.
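For readers who have not written this pattern, the fan-out described above has roughly the following shape. This is a sketch under assumptions, not the coordinator's actual code: the result type, the fanOut function, the query callback, and the node names are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// result pairs a node name with its response or error.
type result struct {
	node string
	body string
	err  error
}

// fanOut queries every node concurrently and collects the replies.
// buf is the capacity of the results channel (512 in the setup above).
func fanOut(nodes []string, buf int, query func(node string) (string, error)) []result {
	results := make(chan result, buf)
	var wg sync.WaitGroup
	for _, n := range nodes {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			body, err := query(node)
			results <- result{node: node, body: body, err: err}
		}(n)
	}
	// Close the channel once every goroutine has reported.
	go func() {
		wg.Wait()
		close(results)
	}()
	out := make([]result, 0, len(nodes))
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	nodes := []string{"hist-01", "hist-02", "hist-03"} // hypothetical node names
	replies := fanOut(nodes, 512, func(node string) (string, error) {
		return "partial result from " + node, nil
	})
	fmt.Println(len(replies), "replies")
}
```

One property worth noticing in a design like this: the channel capacity limits how many finished results can queue up, but it does not limit how many requests are in flight at once. Whether that contributed to the memory growth here is not something the numbers above settle.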

The Architecture Decision

We ripped the synchronous layer out and replaced the coordinator with a purpose-built service called HoundGate. It sits between the gateway and the historical nodes and does three things Veltrix never intended:

Connection pooling with fasthttp and sync.Pool for historical nodes.
Retry budgeting: we allow 3 retries per request, but if the error rate on a shard exceeds 2 % in a 30-second window, we blacklist that shard for 90 seconds and re-route via a secondary coordinator.
Flow control: we use a token bucket at 110 % of our steady-state load. If the bucket empties, HoundGate returns 429 to the gateway, which triggers backpressure all the way to the upstream producers. This single change cut our p99 latency from 8 seconds to 350 ms.
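The flow-control item can be sketched with the standard library alone. This is a minimal illustration, not HoundGate's code: the type and function names, the steady-state rate of 1000 requests per second, and the choice to make the burst capacity equal to one second of tokens are all assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// tokenBucket is a simple refill-on-demand token bucket.
type tokenBucket struct {
	mu       sync.Mutex
	rate     float64 // tokens added per second
	capacity float64 // maximum burst size
	tokens   float64
	last     time.Time
}

func newTokenBucket(rate, capacity float64, now time.Time) *tokenBucket {
	return &tokenBucket{rate: rate, capacity: capacity, tokens: capacity, last: now}
}

// allow refills the bucket for the time elapsed since the last refill and
// reports whether one token could be taken.
func (b *tokenBucket) allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// statusFor maps the bucket decision to the HTTP status sent to the gateway.
func statusFor(b *tokenBucket, now time.Time) int {
	if b.allow(now) {
		return http.StatusOK
	}
	return http.StatusTooManyRequests // 429 signals backpressure upstream
}

func main() {
	steadyState := 1000.0 // hypothetical requests per second
	limit := steadyState * 1.10
	b := newTokenBucket(limit, limit, time.Now())
	fmt.Println("first request status:", statusFor(b, time.Now()))
}
```

Passing the clock in as a parameter keeps the bucket deterministic in tests; a production version would more likely wrap time.Now internally.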

The tradeoff was clear: we lost the "one binary" simplicity of Veltrix's topology, but we gained observability into per-shard hot spots. HoundGate added 1800 lines of Go, 120 lines of Terraform for the autoscaling policy, and a new Prometheus metric, houndgate_retries_total{shard="hot-03"}, that we alert on. We also had to teach the on-call engineers to read the new dashboard and ignore the Veltrix console's green lights.
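The per-shard blacklisting rule from the retry-budgeting item (more than 2 % errors in a 30-second window means 90 seconds out of rotation) can be sketched as below. Several things here are assumptions rather than HoundGate's design: the tracker and shardHealth types, the use of a fixed window that resets every 30 seconds instead of a sliding one, and the minSamples floor, which is added so that a single early failure does not count as a 100 % error rate. The 3-retry cap and the re-routing through a secondary coordinator are left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	window       = 30 * time.Second
	blacklistFor = 90 * time.Second
	maxErrorRate = 0.02
	minSamples   = 50 // assumed floor, not from the original design
)

// shardHealth tracks one shard's outcomes in a fixed window.
type shardHealth struct {
	windowStart      time.Time
	total, failed    int
	blacklistedUntil time.Time
}

// tracker holds health state for every shard.
type tracker struct {
	mu     sync.Mutex
	shards map[string]*shardHealth
}

func newTracker() *tracker {
	return &tracker{shards: make(map[string]*shardHealth)}
}

// record notes one request outcome and blacklists the shard when the
// current window's error rate rises above maxErrorRate.
func (t *tracker) record(shard string, failed bool, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	h, ok := t.shards[shard]
	if !ok {
		h = &shardHealth{windowStart: now}
		t.shards[shard] = h
	}
	if now.Sub(h.windowStart) >= window {
		h.windowStart, h.total, h.failed = now, 0, 0
	}
	h.total++
	if failed {
		h.failed++
	}
	if h.total >= minSamples && float64(h.failed)/float64(h.total) > maxErrorRate {
		h.blacklistedUntil = now.Add(blacklistFor)
	}
}

// available reports whether requests may be routed to the shard.
func (t *tracker) available(shard string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	h, ok := t.shards[shard]
	return !ok || !now.Before(h.blacklistedUntil)
}

func main() {
	tr := newTracker()
	now := time.Now()
	for i := 0; i < 97; i++ {
		tr.record("hot-03", false, now)
	}
	for i := 0; i < 3; i++ {
		tr.record("hot-03", true, now)
	}
	fmt.Println("hot-03 available:", tr.available("hot-03", now))
}
```

A router would call available before picking a shard and record after each response; the same counters are a natural source for a per-shard retry metric.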

What The Numbers Said After

After one week:

P95 latency dropped from 200 ms to 120 ms.
P99 latency stayed below 350 ms even during a 30 % traffic spike.
Coordinator memory stabilized at 10 GB with GC pauses never exceeding 20 ms.
Historical node GC pauses still hit 350 ms, but they were no longer on the critical path.
The error budget for 200 ms p95 went from 3.4 % to 0.2 %.

Our SLO burn rate for the quarter went from red (14 days in violation) to green (0 days). It's worth noting that the Veltrix cluster still shows 95th percentile write latency at 42 ms, but nobody cares because the read path is now faster than the writes.

What I Would Do Differently

I would not have started with Veltrix's documentation. Instead, I would have instrumented the coordinator early with eBPF to capture TCP retransmit counts and socket buffer sizes. The first sign of trouble was the "http: request canceled" error, but the root cause was upstream TCP stack exhaustion under 150 ms jitter. If we'd had a flame graph of syscalls, we would have seen the retransmit loop 48 hours earlier.
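A full eBPF probe is beyond a short example, but a much coarser stand-in is available on any Linux host: the kernel's host-wide RetransSegs counter in /proc/net/snmp. The sketch below is not eBPF and gives no per-socket detail or buffer sizes; it only shows how cheaply a retransmit count can be sampled. The function name is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseRetransSegs extracts the TCP RetransSegs counter from the contents
// of /proc/net/snmp, where a "Tcp:" header line naming the columns is
// followed by a "Tcp:" line holding the values.
func parseRetransSegs(snmp string) (uint64, error) {
	var header []string
	for _, line := range strings.Split(snmp, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "Tcp:" {
			continue
		}
		if header == nil {
			header = fields // first Tcp: line is the column names
			continue
		}
		for i, name := range header {
			if name == "RetransSegs" && i < len(fields) {
				return strconv.ParseUint(fields[i], 10, 64)
			}
		}
		break
	}
	return 0, errors.New("RetransSegs not found")
}

func main() {
	data, err := os.ReadFile("/proc/net/snmp") // Linux only
	if err != nil {
		fmt.Println("cannot read /proc/net/snmp:", err)
		return
	}
	n, err := parseRetransSegs(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("TCP segments retransmitted so far:", n)
}
```

Sampling this counter on an interval and alerting on its rate of change would flag a retransmit storm on the host, though pinning it to a specific connection still needs per-socket tooling.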

Second, I would not have split the historical tier into hot/cold clusters. That move bought us a 12 % drop in errors but cost us 30 % operational overhead in shard rebalancing scripts. In retrospect, a single homogeneous historical cluster with proper flow control would have been simpler and cheaper to operate.

Finally, I would have pushed back on the product team's demand for synchronous reads. The Treasure Hunt Engine was designed for interactive dashboards, but the users never noticed the 200 ms latency anyway; they only noticed the 11-second failures. Had we started with asynchronous reads from day one, we could have saved six