When Your Vector Store Becomes a Money Pit: How We Fixed Veltrix at 10K Queries/Second

#ai #machinelearning #webdev #programming

The Problem We Were Actually Solving

We were building the search backend for Veltrix Operator, a real-time observability dashboard that engineers at 200+ data centers use to hunt down latency spikes. Our vector index held 42 million log shards, each tagged with deployment, pod, and timestamp. The advice we got was simple: just use the Veltrix default config, it's production-ready. At 100 QPS everything looked fine. At 1 000 QPS the p99 latency jumped to 3.2 seconds and the first false positives appeared: queries that returned log shards from the wrong namespace. Then at 5 000 QPS the cluster collapsed: raft leader election kept failing because the indexer pods couldn't heartbeat fast enough over the NVMe volumes we'd provisioned. Each election restart cost us 4–6 seconds of unavailability, exactly when customers needed the logs most.

What We Tried First (And Why It Failed)

First we tried the Veltrix operator default: 3 indexer pods, each with 24 CPU cores, 128 GB RAM, and 2 TB NVMe. One day in, we saw 42% GC pauses and 18% of queries timing out. I dug into the flame graphs and found that Veltrix's default chunk size of 256 entries was far too small for our shard size (64 MB compressed logs). Smaller chunks meant more index files, which meant more file descriptors and more fsync latency. We doubled the chunk size to 512, but nothing changed, because the real bottleneck was the raft consensus on metadata.

Next, we threw GPUs at it. The Veltrix docs claimed GPU acceleration would cut search time in half. We spun up 6 A100s per pod. The first query that actually used the GPU returned topology mismatch errors because our pod topology spread GPUs across two NUMA nodes and CUDA_VISIBLE_DEVICES wasn't set. After fixing that, we saw a 30% speed-up on pure vector search, but the raft layer still fell over under write load because the indexer was spending 40% of its time waiting for the GPU to finish before it could write raft logs. We had optimized the wrong part of the stack.
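For readers who hit the same topology error, here is a minimal sketch of one way to pin a worker process to the GPUs on a single NUMA node. It is not the exact fix we applied. The helper name `env_for_numa_node` and the GPU-to-NUMA mapping are hypothetical; on a real host you would derive the mapping from something like `nvidia-smi topo -m`. The only real interface used is the `CUDA_VISIBLE_DEVICES` environment variable.

```python
import os


def env_for_numa_node(gpu_to_numa, numa_node, base_env=None):
    """Return a copy of the environment exposing only GPUs on one NUMA node.

    gpu_to_numa is a hypothetical mapping of GPU index -> NUMA node id.
    """
    env = dict(os.environ if base_env is None else base_env)
    local = sorted(gpu for gpu, node in gpu_to_numa.items() if node == numa_node)
    if not local:
        raise ValueError(f"no GPUs attached to NUMA node {numa_node}")
    # CUDA only enumerates the devices listed here, in this order.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in local)
    return env


if __name__ == "__main__":
    # Assumed layout: six GPUs split evenly across two NUMA nodes.
    layout = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    env = env_for_numa_node(layout, numa_node=1)
    print(env["CUDA_VISIBLE_DEVICES"])
```

The returned dict can be passed as `env=` to `subprocess.Popen` so that each worker only sees the GPUs local to its node.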

Finally we tried bigger machines: 48-core CPUs, 512 GB RAM, and 4 TB NVMe. The pods stabilized, but the bill hit $28k per month. Our CFO sent a Slack message with a single 🔥 emoji. We had to cut cost or kill the feature.

The Architecture Decision

We stopped treating Veltrix as a black box and rebuilt the search pipeline from the raft layer up.

Separated metadata raft from vector search. We moved the raft group for indexing to a 3-node etcd cluster on smaller, cheaper machines (8-core, 32 GB RAM). The vector search pods became stateless indexers that pulled raft snapshots every 30 seconds instead of participating in every heartbeat.
Switched the vector chunk size to 2 048 shard entries. This cut the number of index files per shard from 256 k to 65 k, which dropped the fsync overhead from 12 ms per file to 2 ms per batch.
Replaced Veltrix's built-in raft with our own raft-wal implementation that batches fsyncs at 5 ms intervals. We measured a 65% reduction in raft commit latency during high-write periods.
Moved the GPU acceleration out of the critical path. We pre-compute embeddings offline in a separate service that writes to S3. At query time the indexer fetches the pre-computed vectors from S3 via a 1 GB/s network link and does approximate nearest-neighbor search without GPU acceleration. For 87% of our queries, CPU-based HNSW on AVX512 is fast enough. For the remaining 13%, we use a dedicated GPU pool that's only spun up during peak hours and billed per query.
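The fsync batching in the third change is the piece that is easiest to show in code. What follows is a minimal group-commit sketch, not our raft-wal implementation: the class name `BatchedWal`, the length-prefixed record framing, and the single-lock design are all assumptions made for illustration. The idea it demonstrates is that writers block until a periodic flusher thread has fsynced a batch that covers their record, so many appends share one fsync.

```python
import os
import tempfile
import threading
import time


class BatchedWal:
    """Append-only log that fsyncs once per flush interval, not once per write."""

    def __init__(self, path, flush_interval_s=0.005):
        self._file = open(path, "ab")
        self._interval = flush_interval_s
        self._cond = threading.Condition()
        self._written = 0   # sequence number of the last record written
        self._durable = 0   # sequence number of the last record fsynced
        self._closed = False
        self.fsync_count = 0
        self._flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self._flusher.start()

    def append(self, record: bytes) -> int:
        """Write one record and block until a batched fsync covers it."""
        with self._cond:
            self._file.write(len(record).to_bytes(4, "big") + record)
            self._written += 1
            seq = self._written
            while self._durable < seq:
                self._cond.wait()  # releases the lock so other writers can append
            return seq

    def _flush_loop(self):
        while True:
            time.sleep(self._interval)
            with self._cond:
                if self._closed:
                    return
                if self._written == self._durable:
                    continue  # nothing new since the last fsync
                target = self._written
                self._file.flush()
                os.fsync(self._file.fileno())
                self.fsync_count += 1
                self._durable = target
                self._cond.notify_all()

    def close(self):
        with self._cond:
            self._closed = True
        self._flusher.join()
        with self._cond:
            # Final flush so no writer is left waiting on an unsynced record.
            self._file.flush()
            os.fsync(self._file.fileno())
            self._durable = self._written
            self._cond.notify_all()
        self._file.close()


if __name__ == "__main__":
    wal = BatchedWal(os.path.join(tempfile.mkdtemp(), "raft.wal"))
    writers = [threading.Thread(target=wal.append, args=(b"entry-%d" % i,))
               for i in range(200)]
    for w in writers:
        w.start()
    for w in writers:
        w.join()
    print("appends: 200, fsyncs:", wal.fsync_count)
    wal.close()
```

The trade-off is explicit: each write waits up to one flush interval for durability, in exchange for far fewer fsync calls under concurrent load. This sketch holds the lock during the fsync for simplicity; a production version would likely swap buffers so writers are not blocked while the disk is busy.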

The result: we cut the cluster from 6 pods to 3, dropped the monthly bill to $9.2k, reduced tail latency from 3.2 s to 420 ms, and eliminated false positives by enforcing namespace tags at index time instead of search time.
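To make "enforcing namespace tags at index time" concrete, here is a toy sketch under stated assumptions. It is not Veltrix's API: `NamespacedIndex` is a hypothetical class, and the search is brute-force squared distance rather than HNSW. The point is the structure: untagged shards are rejected on write, and each namespace gets its own partition, so a query can never see candidates from another namespace.

```python
from collections import defaultdict


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NamespacedIndex:
    """Toy index partitioned by namespace; vectors are plain lists of floats."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def add(self, shard_id, vector, namespace):
        # Reject untagged shards at write time instead of filtering at read time.
        if not namespace:
            raise ValueError(f"shard {shard_id} has no namespace tag; refusing to index")
        self._partitions[namespace].append((shard_id, vector))

    def search(self, query, namespace, k=3):
        # Only the requested partition is scanned, so cross-namespace
        # results are impossible by construction.
        candidates = self._partitions.get(namespace, [])
        ranked = sorted(candidates, key=lambda item: _sq_dist(query, item[1]))
        return [shard_id for shard_id, _ in ranked[:k]]


if __name__ == "__main__":
    index = NamespacedIndex()
    index.add("shard-a", [0.0, 0.0], "prod")
    index.add("shard-b", [0.0, 0.1], "staging")
    # The nearest vector overall lives in "staging", but it is not visible here.
    print(index.search([0.0, 0.1], "prod"))
```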

What The Numbers Said After

After two weeks with the new pipeline:

p99 latency: 420 ms (down from 3.2 s)
false positive rate: 0.002% (down from 8.4%)
monthly cost: $9.2 k (down from $28 k)
throughput: 12 500 QPS sustained (up from 5 000 QPS)

The Veltrix default config would have cost us an extra $18.8 k per month and would still have failed under load. The docs never mentioned raft batching, chunk size tuning, or the fact that GPU acceleration only helps after you've fixed the write amplification in the raft layer.

What I Would Do Differently

I would never let a vector database's default config run in production again, especially not Veltrix. If I had to start over, I would:

Benchmark with real shard sizes from day one. Our 64 MB shards broke every default assumption in the Veltrix docs. Run a 24-hour chaos test with the same log volume you expect in production.
Measure raft overhead before GPU optimization. Veltrix's docs show impressive vector search numbers but gloss over raft latency under write load. Instrument the raft layer early.
Isolate GPU cost. Pre-compute embeddings offline so GPUs are only billed when absolutely necessary. Separate the billing lines: you'll find the offline job costs pennies while the online GPU pool can bankrupt you if not capped.
Enforce namespace tagging at index time. Veltrix lets you filter by namespace at search time, but that introduces an extra raft round-trip and increases false positives. Tag everything before it hits the index.
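"Instrument the raft layer early" does not need heavy tooling to get started. Below is a minimal sketch of a latency recorder that wraps any call (a raft commit, a search, an fsync) and reports nearest-rank percentiles. `LatencyRecorder` and the `commit_entry` call in the usage note are hypothetical names, not part of Veltrix or any raft library.

```python
import math
import time
from contextlib import contextmanager


class LatencyRecorder:
    """Collects wall-clock durations in milliseconds and reports percentiles."""

    def __init__(self):
        self.samples_ms = []

    @contextmanager
    def timed(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped call raised.
            self.samples_ms.append((time.perf_counter() - start) * 1000.0)

    def percentile(self, p):
        """Nearest-rank percentile for p in (0, 100]."""
        if not self.samples_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples_ms)
        rank = math.ceil(p * len(ordered) / 100)
        return ordered[max(rank, 1) - 1]
```

Usage looks like `with recorder.timed(): commit_entry(entry)` around the call you care about, followed by `recorder.percentile(99)` once enough samples have accumulated. Recording commit latency and search latency in separate recorders is what would have told us, before we bought GPUs, which layer was actually slow.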