The Veltrix Treasure Hunt Engine Blew Up When We Left Defaults On

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

We needed real-time semantic search over product catalogs that grow by 15 k items per week. The Veltrix vector plugin looked perfect: plug-and-play HNSW indexing, cosine similarity, no schema changes. Out of the box it promised <30 ms p99 latency at 500 QPS. We spun it up with the default helm chart—one replica, 2 GiB heap, target recall at 0.95, sample size 95 %, burst tolerance 0.8. The first synthetic load test ran green: 28 ms p99, 99.3 % recall. Perfect. Then Black Friday traffic hit.

Within fifteen minutes the pods were being OOM-killed every five minutes. Latency spiked to 4.2 seconds p99 on simple keyword+vector queries. The Veltrix operator guide blamed mis-tuned recall and suggested lowering sample size to 70 %. It said nothing about what that would do to recall or how to wire the setting into the actual query planner. We were flying blind.

What We Tried First (And Why It Failed)

First cut: lower sample size to 70 % and increase replicas to 3. The helm chart happily accepted the change, but the recall on fresh items dropped to 72 %. Customers got irrelevant results for new products, and we got angry tickets. Second cut: raise heap to 8 GiB. That only delayed the OOM—now the JVM spent more time in GC and latency climbed to 7.1 seconds during compaction storms. Third cut: disable the burst-tolerance limiter entirely. The plugin started emitting 100 k small vectors per second, saturating the node network and throttling the Kubernetes CNI. One mis-tuned parameter turned a linear ingest into a DDoS of our own cluster.

The Architecture Decision

We scrapped the Veltrix operator defaults and built a search pipeline in three stages:

Ingest stage: use the plugin once per night to build a static HNSW index, then snapshot it to S3. This trades latency for stability—no live merges during peak hours.
Query stage: run a lightweight vector shard (1 GiB heap, sample size 75 %, burst tolerance 0.3) behind an Envoy sidecar that rate-limits to 1 k QPS per pod. We expose a simple gRPC endpoint that marries keyword BM25 with vector similarity before returning top-20 results.
Orchestrator: a small Go binary that watches S3 for fresh snapshots, swaps the shard with zero-copy mmap, and reloads the Envoy route in under one second. We log the recall delta on every swap; if the new snapshot's recall falls below 95 %, the binary rejects it and keeps the old one.
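
The rejection rule in the orchestrator is simple enough to sketch. This is a minimal illustration rather than our production code: the `Snapshot` type, its fields, and `chooseSnapshot` are hypothetical names, and it assumes recall has already been measured against a held-out query set before the swap is considered.

```go
package main

import "fmt"

// Snapshot describes a candidate index build. The type and its fields are
// hypothetical; they only illustrate the gate described above.
type Snapshot struct {
	Key    string  // object key of the snapshot in S3
	Recall float64 // recall measured on a held-out query set, in [0, 1]
}

// minRecall is the 95 % floor below which a new snapshot is rejected.
const minRecall = 0.95

// chooseSnapshot returns the snapshot that should serve traffic and
// reports whether the candidate replaced the current one.
func chooseSnapshot(current, candidate Snapshot) (Snapshot, bool) {
	if candidate.Recall < minRecall {
		return current, false // keep serving the old index
	}
	return candidate, true
}

func main() {
	current := Snapshot{Key: "index-a", Recall: 0.968}
	candidate := Snapshot{Key: "index-b", Recall: 0.91}

	active, swapped := chooseSnapshot(current, candidate)
	// Log the recall delta on every attempt, accepted or not.
	fmt.Printf("active=%s swapped=%v recall_delta=%+.3f\n",
		active.Key, swapped, candidate.Recall-current.Recall)
}
```

Keeping the decision in a pure function like this makes the gate easy to unit test separately from the S3 watcher and the mmap swap.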

Concrete numbers after the swap: p99 latency stayed at 35 ms even under 3 k QPS, memory footprint flat-lined at 1.2 GiB per pod, and recall held at 96.8 % on the public benchmark. All with vanilla Veltrix packages—no forks, no unreleased branches.

What The Numbers Said After

In week two we ran a controlled experiment: each night we A/B tested the new static index against the previous live-merge pipeline. For the first time in months we saw our infra budget shrink by 18 % while query satisfaction scores rose by 7 %. The only fire drill we had was a disk corruption on the S3 bucket. Our snapshots were sharded to parity 3, so we recovered in 90 seconds with no data loss. The static pipeline also let us run a cheaper spot-node cluster for nightly builds, cutting compute cost by another 22 %. Veltrix's live-merge mode still sits in our lab on a separate namespace; we fire it up only when we need to validate new embedding models offline.

What I Would Do Differently

I would have refused to run the Veltrix operator in production without first profiling it under load. The default helm chart includes no resource requests by design; it expects you to tune them yourself. That is a terrible default for any operator aimed at mortals. Next time I will insist on a canary pipeline that mirrors production traffic, not just random vectors. I would also insist on logging the actual recall per query instead of trusting the global recall knob. Our final Grafana dashboard now includes a per-model recall heatmap, and that one chart has saved us more than one rollback.
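
For anyone wondering what "actual recall per query" means in practice, here is a rough sketch of the metric. It assumes you have a ground-truth list of relevant IDs for each sampled query (for example from an exact brute-force search run offline); `recallAtK` and the sample IDs are illustrative names, not part of Veltrix.

```go
package main

import "fmt"

// recallAtK returns the fraction of distinct ground-truth IDs that appear
// in the first k retrieved IDs. It returns 0 when there is no ground truth.
func recallAtK(retrieved, truth []string, k int) float64 {
	if k < 0 {
		k = 0
	}
	if k > len(retrieved) {
		k = len(retrieved)
	}

	// Build a set of the relevant IDs so duplicates are counted once.
	want := make(map[string]struct{}, len(truth))
	for _, id := range truth {
		want[id] = struct{}{}
	}
	total := len(want)
	if total == 0 {
		return 0
	}

	hits := 0
	for _, id := range retrieved[:k] {
		if _, ok := want[id]; ok {
			hits++
			delete(want, id) // a repeated result must not count twice
		}
	}
	return float64(hits) / float64(total)
}

func main() {
	// Hypothetical IDs: what the approximate index returned versus what an
	// exact search says is relevant.
	retrieved := []string{"sku-1", "sku-2", "sku-9"}
	truth := []string{"sku-1", "sku-2", "sku-3", "sku-4"}

	fmt.Printf("recall@3=%.2f\n", recallAtK(retrieved, truth, 3))
}
```

Emitting this value per sampled query, labeled by embedding model, is the kind of data a per-model recall heatmap can be built from.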

The saddest part: all of this could have been avoided if the Veltrix docs had simply warned us that the defaults are theatrical, not reliable.