Treasure Hunt Engine: The Moment the Documentation Stopped Telling the Truth

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

Our SRE team ran the Treasure Hunt Engine against a corpus that grew from 8 TB to 147 TB in six months. Every Tuesday at 03:47 UTC the cluster would report 1.2 million search requests with zero latency violations—until the week operators noticed the UI freezing for 4.3 s while the backend returned 27 K completely unrelated document IDs. The Veltrix docs said this was impossible; the on-call runbook said to scale the query shards. Both were wrong.

The docs implied the engine used a deterministic BM25 variant, but the actual query plan showed a two-stage retrieval: an approximate nearest neighbor (ANN) filter built on DiskANN v1.2, followed by a reranker that ran on a GPU cluster. The ANN stage was supposed to cap latency at 80 ms, yet on that Tuesday it spiked to 412 ms. The reranker never had a chance to correct the error because the ANN layer had already shipped garbage.

What We Tried First (And Why It Fails)

We added six more GPU reranker nodes to absorb the load, which was Veltrix's recommended fix. Latency dropped back to 72 ms, but on the following Tuesday we saw 4 K hallucinated documents. The reranker was a 7-billion-parameter T5 model fine-tuned on MS MARCO, and the fine-tuning had been done on a static dataset from 2023. Our corpus evolved daily, so the reranker had never seen the new vocabulary. The score threshold of 0.75, hard-coded in the engine's YAML, was completely miscalibrated.

We tried bumping the threshold to 0.82. Precision improved from 63 % to 71 %, but recall fell from 89 % to 78 %. Our search KPI was strict: 85 % recall at 95 % precision. We were now below target and still hallucinating.
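
The precision/recall tension behind a fixed score cutoff can be sketched in a few lines. This is a minimal illustration, not the engine's evaluation code: the function name `precision_recall_at` and the labeled sample are hypothetical, and the numbers are made up to show the direction of the tradeoff rather than to reproduce the figures above.

```python
from typing import List, Tuple


def precision_recall_at(
    scores: List[float], relevant: List[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when only documents scoring >= threshold are returned."""
    returned = [rel for score, rel in zip(scores, relevant) if score >= threshold]
    true_positives = sum(returned)
    total_relevant = sum(relevant)
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall


# Hypothetical judged sample: reranker score and whether the document was relevant.
scores = [0.95, 0.90, 0.88, 0.84, 0.80, 0.78, 0.76, 0.60]
relevant = [True, True, True, False, True, False, False, True]

for threshold in (0.75, 0.82):
    p, r = precision_recall_at(scores, relevant, threshold)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

Raising the cutoff drops low-scoring irrelevant documents, which helps precision, but it also drops low-scoring relevant ones, which hurts recall. A single static threshold cannot satisfy both sides of the KPI once the score distribution drifts.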

The Architecture Decision

The real culprit was the ANN index refresh cadence. DiskANN v1.2 rebuilt the index every 48 hours, but our indexing pipeline pushed 30 GB of new documents every hour. The gap meant the ANN index was always 6–10 hours behind reality. The reranker then had to compensate for an index that no longer reflected the corpus.
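
To put the staleness gap in terms of data volume, here is a back-of-the-envelope sketch. It assumes the ingest rate is a constant 30 GB per hour (the figure quoted above); the function name `unindexed_backlog_gb` is hypothetical.

```python
INGEST_GB_PER_HOUR = 30.0  # ingest rate quoted in the article, assumed constant


def unindexed_backlog_gb(
    staleness_hours: float, ingest_gb_per_hour: float = INGEST_GB_PER_HOUR
) -> float:
    """Volume of documents that exist in the corpus but are not yet searchable."""
    return staleness_hours * ingest_gb_per_hour


for hours in (6, 10):
    backlog = unindexed_backlog_gb(hours)
    print(f"{hours} h behind -> {backlog:.0f} GB not yet in the ANN index")
```

Under that assumption, an index that is 6 to 10 hours behind is missing roughly 180 to 300 GB of documents at any given moment.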

We made a brutal call: switch from DiskANN to a custom HNSW implementation in Rust with an in-memory buffer for the latest 24 hours of documents. The index rebuilt every minute, not every two days. The ANN index grew from 1.8 GB to 2.4 GB, but indexing latency improved by an order of magnitude. We also switched the reranker to a distilled 220 M-parameter model fine-tuned weekly on a rolling 30-day window. The score threshold became dynamic, set by a Bayesian calibration service that ran every 15 minutes on a 1-hour slice of the query log.
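
The idea of pairing a main index with a buffer of recent documents can be sketched as follows. This is an illustration in Python rather than the Rust implementation, and several things are assumptions: the article does not say how the buffer is searched, so the sketch uses exact brute-force cosine distance; the main index results are stubbed as a plain list; and all names (`search_fresh_buffer`, `merge_results`, the document IDs) are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Tuple[float, str]  # (distance, doc_id); smaller distance means closer


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def search_fresh_buffer(buffer: Dict[str, Vector], query: Vector, k: int) -> List[Hit]:
    """Exact search over the buffer of recently ingested documents (assumed small)."""
    return heapq.nsmallest(
        k, ((cosine_distance(query, vec), doc_id) for doc_id, vec in buffer.items())
    )


def merge_results(indexed: List[Hit], fresh: List[Hit], k: int) -> List[Hit]:
    """Keep the k closest candidates; a fresh copy of a document overrides the indexed one."""
    best: Dict[str, float] = {doc_id: dist for dist, doc_id in indexed}
    for dist, doc_id in fresh:
        best[doc_id] = dist
    return heapq.nsmallest(k, ((dist, doc_id) for doc_id, dist in best.items()))


# Stand-in for what the main ANN index would return for this query.
indexed_hits = [(0.10, "old-1"), (0.30, "old-2")]
# Documents ingested since the last index build.
fresh_buffer = {"new-1": [1.0, 0.0], "new-2": [0.0, 1.0]}
query = [1.0, 0.0]

fresh_hits = search_fresh_buffer(fresh_buffer, query, k=2)
top = merge_results(indexed_hits, fresh_hits, k=3)
print([doc_id for _, doc_id in top])
```

The point of the design is that a document becomes searchable as soon as it lands in the buffer, so the reranker is no longer asked to compensate for candidates the ANN stage could not see.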

The tradeoff was memory: the HNSW index used 600 MB more RAM per node, but we were willing to burn 10 % more infrastructure to fix the hallucination rate.
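
For the dynamic score threshold mentioned above, one simple way to do Bayesian calibration is a Beta-Binomial estimate of precision at each candidate cutoff. The sketch below is an assumption about how such a service might work, not a description of the one we ran: it presumes the query log slice carries relevance judgments (for example from click feedback), it uses a uniform Beta(1, 1) prior, and every name in it is hypothetical.

```python
from typing import List, Optional, Tuple


def posterior_precision(
    true_pos: int, false_pos: int, prior_a: float = 1.0, prior_b: float = 1.0
) -> float:
    """Posterior mean of precision under a Beta(prior_a, prior_b) prior."""
    return (true_pos + prior_a) / (true_pos + false_pos + prior_a + prior_b)


def calibrate_threshold(
    judged: List[Tuple[float, bool]],
    candidates: List[float],
    target_precision: float,
) -> Optional[float]:
    """Lowest candidate threshold whose posterior-mean precision meets the target.

    The lowest qualifying threshold is chosen because recall can only fall as
    the threshold rises. Returns None if no candidate qualifies.
    """
    for threshold in sorted(candidates):
        true_pos = sum(1 for score, rel in judged if score >= threshold and rel)
        false_pos = sum(1 for score, rel in judged if score >= threshold and not rel)
        if posterior_precision(true_pos, false_pos) >= target_precision:
            return threshold
    return None


# Hypothetical 1-hour log slice: (reranker score, judged relevant?).
judged_slice = [
    (0.95, True), (0.92, True), (0.90, True), (0.88, True), (0.85, True),
    (0.83, True), (0.80, False), (0.78, True), (0.76, False),
]
chosen = calibrate_threshold(judged_slice, [0.75, 0.80, 0.82, 0.85], target_precision=0.85)
print(chosen)
```

The prior matters when the slice is thin: with few judged results above a cutoff, the estimate is pulled toward 0.5, so a threshold is not accepted on the strength of two or three lucky hits.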

What The Numbers Said After

After the change, Tuesday 03:47 UTC became boring again. The ANN stage stabilized at 58 ms ± 12 ms. The reranker precision hit 92 % and recall 87 %, beating our KPI. The hallucination rate—documents returned with a relevance score > 0.9 but completely unrelated—dropped from 0.98 % to 0.03 %. The memory tax was 640 MB per node, but we absorbed it by decommissioning the six GPU nodes we had added earlier. Net cost: +1.4 % infra budget, -60 % on-call pages.

What I Would Do Differently

I would have insisted on instrumenting the ANN index staleness metric on day one. The DiskANN refresh interval should have been surfaced as a first-class SLO, not buried in a JIRA ticket. We also trusted the reranker's fine-tuned weights for too long; a weekly regression test on a 24-hour slice of live data would have caught the calibration drift before it became an outage. Finally, we should have tested the entire pipeline with 10 % synthetic stale data injected into the ANN index, a 30-minute chaos experiment that would have exposed the failure mode months earlier. The docs never mentioned any of these checks; the failure found us because we assumed the engine was deterministic when it was, in fact, theatrical.
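
A staleness metric of the kind described here can be very small. The sketch below assumes two timestamps are available, the newest document accepted by the ingestion pipeline and the newest document present in the searchable index; the function names and the 5-minute SLO are hypothetical placeholders, not values from our setup.

```python
from datetime import datetime, timedelta, timezone


def index_staleness(newest_ingested: datetime, newest_indexed: datetime) -> timedelta:
    """How far the searchable index lags behind the ingestion pipeline."""
    return max(newest_ingested - newest_indexed, timedelta(0))


def violates_slo(staleness: timedelta, slo: timedelta) -> bool:
    return staleness > slo


# Hypothetical readings: the index is 8 hours behind ingestion.
newest_ingested = datetime(2024, 1, 9, 3, 47, tzinfo=timezone.utc)
newest_indexed = newest_ingested - timedelta(hours=8)
staleness_slo = timedelta(minutes=5)  # placeholder target, an assumption

lag = index_staleness(newest_ingested, newest_indexed)
print(f"staleness: {lag}, SLO breached: {violates_slo(lag, staleness_slo)}")
```

Exported as a gauge and alerted on, a number like this would have pointed at the index long before anyone thought to add GPU nodes.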