The Search Index I Broke and How I Fixed It in Production

#webdev #career #programming #productivity

The Problem We Were Actually Solving

Last October we hit 1.2 million documents in our search index and p99 query latency jumped from 180ms to 1.4s. It wasn't a sudden spike, just steady decay as the index grew. The default configuration of Veltrix 3.7 that we shipped with our Go service had never been stress-tested at this scale. Every night the build ran the same test suite against 10K documents and reported green. That dataset was smaller than a single production shard by the time we hit 1M. We didn't discover the problem until Grafana alerted us that 30% of user searches were timing out under load. That was the moment I had to admit the documentation's quickstart section had lied to us.

What We Tried First (And Why It Failed)

My first instinct was to throw more hardware at it. I spun up three new data nodes with double the memory. The latency dropped to 850ms: better, but still unacceptable. The index writer threads were still blocking on disk flushes every 500ms. The Veltrix docs mentioned flush_interval_ms but didn't explain that the default value of 500 was tuned for 100K documents, not 1.2M. I tried lowering it to 200ms in staging. Within an hour the disks saturated and the nodes started thrashing. Our SLA was now a smoking crater.

Next I rewrote the query planner to use a different cost model. That took two weeks and shipped behind a feature flag on a /danger endpoint. Users who hit the flag saw p99 drop to 220ms. Great, except the planner occasionally routed queries to the wrong shard. One misrouted query string produced 30K results instead of 30, and the downstream aggregator ran out of memory. We rolled it back within 24 minutes, but the outage cost us 8% DAU for that day.

The Architecture Decision

We finally stepped back and treated the index like a real database instead of a black box. I pulled the Veltrix source tag v3.7.2 and instrumented every compaction cycle in detail. The problem wasn't the query planner; it was the immutable segment merge policy. The default policy tried to merge 10 segments every 30 seconds, but each merge rewrote 400MB of data. At 1.2M docs we had 800 segments and the merge queue grew faster than the writer threads could drain it.
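To see why a backlog like that only grows, here is a back-of-the-envelope sketch in Go. It uses the numbers from this post (a flush every 500ms, a merge of 10 segments every 30 seconds) and assumes, purely for illustration, a single writer that creates one new segment per flush. The function name is hypothetical and is not part of Veltrix.

```go
package main

import "fmt"

// netSegmentGrowth estimates how many segments the backlog gains (positive)
// or sheds (negative) during one merge window.
// Assumption: one writer, and every flush produces exactly one new segment.
func netSegmentGrowth(flushIntervalMs, mergeWindowMs, mergeBatch int) int {
	created := mergeWindowMs / flushIntervalMs // segments flushed per window
	removed := mergeBatch - 1                  // merging N segments into 1 removes N-1
	return created - removed
}

func main() {
	// 500ms flushes, one 10-segment merge per 30s window.
	growth := netSegmentGrowth(500, 30_000, 10)
	fmt.Printf("net segments per 30s window: %+d\n", growth) // prints +51
}
```

Under those assumptions the index gains 51 segments every 30 seconds, so no amount of extra hardware lets the default merge cadence catch up.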

I replaced the policy with a tiered compaction strategy borrowed from Apache Druid. We split the hot tier into two levels: L0 held newly flushed segments (max 5 segments, max size 256MB), L1 held older segments sized at 1GB. Compaction only ran when L0 hit 5 segments or L1 hit 10 segments. The threshold was dynamic based on write throughput. In staging, this cut merge CPU from 45% to 12% under the same load.
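The trigger rule described above can be sketched roughly as follows. This is a simplified illustration, not the code we shipped and not a Veltrix API: the type and function names are made up, and the thresholds are fixed at the values from this post rather than adjusted dynamically with write throughput.

```go
package main

import "fmt"

// Thresholds from the post. In the real setup they varied with write
// throughput; this sketch keeps them constant.
const (
	maxL0Segments = 5
	maxL1Segments = 10
)

// tierState is a hypothetical snapshot of how many segments sit in each level.
type tierState struct {
	L0Segments int
	L1Segments int
}

// tierToCompact returns the level that should be compacted next, or an empty
// string if neither level has reached its threshold. L0 is checked first
// because it holds the newest, smallest segments.
func tierToCompact(s tierState) string {
	switch {
	case s.L0Segments >= maxL0Segments:
		return "L0"
	case s.L1Segments >= maxL1Segments:
		return "L1"
	default:
		return ""
	}
}

func main() {
	states := []tierState{
		{L0Segments: 3, L1Segments: 4},  // below both thresholds
		{L0Segments: 5, L1Segments: 4},  // L0 is full
		{L0Segments: 2, L1Segments: 10}, // L1 is full
	}
	for _, s := range states {
		fmt.Printf("%+v -> compact %q\n", s, tierToCompact(s))
	}
}
```

The point of the design is that compaction becomes event-driven: nothing is rewritten until a level actually fills up, instead of merging on a fixed 30-second timer.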

The risky part was the shard routing rule change. Our shards were previously sized at 512MB each to fit in memory. I increased them to 2GB and allowed Veltrix to spill to disk when memory hit 80%. This doubled our disk I/O during compactions, but the p99 latency stayed flat. We added a disk write-ahead log to survive unclean shutdowns. The last tradeoff was cost: our cloud bill went from $18K/month to $24K, but we avoided hiring two extra engineers to tune the default config.

What The Numbers Said After

After rolling the change to 10% of traffic, the p99 latency stabilized at 210ms and stayed there for seven days. Error rate dropped from 0.8% to 0.02%. The merge queue depth never exceeded 3 segments. Revenue from search-related features increased 3% because users who previously gave up on slow queries now completed them. The best metric wasn't latency. It was the one we didn't track before: average query cost in CPU cycles. It fell from 42 million cycles/query to 15 million, a 64% reduction that let us absorb the next 3x growth without another fire drill.
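For anyone checking the arithmetic, here is a small sketch of the cycle math. The headroom figure assumes CPU is the only bottleneck and that cost per query stays constant as traffic grows, which is a simplification. The helper names are illustrative.

```go
package main

import "fmt"

// percentReduction returns how much smaller after is than before, in percent.
func percentReduction(before, after float64) float64 {
	return (before - after) / before * 100
}

// headroom returns how many times more queries the same CPU budget can serve,
// assuming CPU cost per query is the only limit.
func headroom(before, after float64) float64 {
	return before / after
}

func main() {
	before, after := 42e6, 15e6 // cycles per query, from the post
	fmt.Printf("reduction: %.1f%%\n", percentReduction(before, after)) // 64.3%
	fmt.Printf("headroom:  %.1fx\n", headroom(before, after))          // 2.8x
}
```

That works out to roughly 2.8x more queries on the same CPU budget, which is in the same ballpark as the 3x growth figure above.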

What I Would Do Differently

I should have benchmarked the default config against a realistic dataset on day zero instead of trusting the nightly 10K test. If I had run a 1M-document synthetic load for 48 hours, I would have seen the flush_interval_ms problem immediately. Also, the Veltrix docs claim you only need to tune when you exceed 10M docs; that threshold is dangerously outdated for modern hardware. Next time I start a system, I'm buying a 2U server preloaded with a dataset 2x larger than we plan to ship, and I'm running chaos tests against it before we write the first line of production code.