The bug that took me four hours to find had nothing to do with the model

#ai #debugging #llm #rag

It was 11pm. The AI assistant had been returning slightly wrong answers for three days and nobody could figure out why. Not wrong enough to obviously break anything. Wrong enough that two engineers had opened tickets saying "I think the AI is getting worse?"

I started where I always start: what actually got retrieved.

Added five lines of logging to dump the retrieval results before they hit the LLM. Ran the same query that had been producing bad answers.
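A minimal sketch of that kind of logging, not the exact five lines. It assumes results arrive as (document, score) pairs and that each document carries a `metadata` dict; the `source` and `last_modified` keys and the function names are placeholders for illustration.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("retrieval")

def format_retrieval(query, results):
    """Return one line per retrieved document, in ranked order."""
    lines = [f"query={query!r}"]
    for rank, (doc, score) in enumerate(results, start=1):
        meta = doc.metadata
        lines.append(
            f"#{rank} score={score:.2f} "
            f"source={meta.get('source', '?')} "
            f"last_modified={meta.get('last_modified', '?')}"
        )
    return lines

def log_retrieval(query, results):
    """Log what was retrieved, just before it is handed to the LLM."""
    for line in format_retrieval(query, results):
        logger.info(line)

# Stand-in document object, only for demonstration
doc = SimpleNamespace(metadata={"source": "old.md", "last_modified": "2023-01-01"})
for line in format_retrieval("example query", [(doc, 0.89)]):
    print(line)
```

Putting the date next to the score is the point: a stale document at rank one is obvious at a glance.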

The top result was from a document dated 14 months ago.

The current document, the one with the right information, was ranked fourth.

Similarity scores: 0.89 for the old document, 0.81 for the new one. The old document won because it was written more cleanly and the semantic match was slightly stronger. The model did exactly what it was supposed to do. It used the best matching document. The best matching document was outdated.

Not a model problem. Not a prompt problem. A data problem that looked like a model problem for three days.

The fix had two parts. First, metadata filtering, so that documents tagged as superseded never enter the retrieval pool. Second, a freshness signal in the ranking, so that when two documents match similarly, the newer one gets a small boost.

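from datetime import datetime

# What we added to the retrieval call.
# Assumes a LangChain-style vector store: this method returns (doc, score)
# pairs where a higher score means a closer match, which rerank() relies on.
results = vectorstore.similarity_search_with_relevance_scores(
    query=query,
    k=10,
    filter={"status": {"$ne": "superseded"}},  # superseded docs never enter the pool
)

# Re-rank by blending similarity score with freshness
def freshness_score(doc, max_age_days=365):
    # Linear decay: 1.0 for a doc modified today, 0.0 at max_age_days or older.
    # Assumes metadata["last_modified"] is a naive datetime, not a string.
    age = (datetime.now() - doc.metadata["last_modified"]).days
    return max(0, 1 - (age / max_age_days))

def rerank(results):
    # Each result is a (doc, similarity) pair
    return sorted(results, key=lambda r: (
        0.8 * r[1] +  # similarity
        0.2 * freshness_score(r[0])  # freshness
    ), reverse=True)

results = rerank(results)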
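
To see what the blend does to the two documents from the failing query, here is the arithmetic worked through. The 0.89 and 0.81 scores and the 14-month age come from above; the 30-day age of the current document is an assumption for illustration, and `blended` is a hypothetical helper that takes the age directly instead of a document.

```python
def freshness(age_days, max_age_days=365):
    # Same linear decay as freshness_score, taking the age in days directly
    return max(0.0, 1 - age_days / max_age_days)

def blended(similarity, age_days):
    # Same 0.8 / 0.2 weights as rerank
    return 0.8 * similarity + 0.2 * freshness(age_days)

old = blended(0.89, age_days=425)  # roughly 14 months: freshness clamps to 0
new = blended(0.81, age_days=30)   # assumed age of the current document

print(f"old={old:.3f} new={new:.3f}")  # prints "old=0.712 new=0.832"
```

With these weights and scores, the current document outranks the old one as long as it is less than about 248 days old. Past that point the similarity gap wins again, which is why the metadata filter, not the boost, is the part that actually keeps superseded documents out.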

The fix took forty minutes once I understood the actual problem.

The lesson I keep relearning: when an AI system gives bad answers, the instinct is to look at the model or the prompt. Start with the retrieval. Most of the time the model is doing exactly what you told it to do. The question is whether what you told it to do was right.

Top comments (1)

Vinicius Pereira • Jul 1

This is the right diagnosis, and it's the one nobody reaches for first. Everyone stares at the prompt while the retrieval quietly hands the model the wrong evidence. "The model did exactly what you told it to, the best matching document was outdated" is the whole thing in one line.

Two things I'd add from getting bitten by this same class of bug. First, the reranking is treating the symptom. The actual disease is that a superseded doc was retrievable at all, which makes it a document lifecycle problem more than a ranking one. The interesting question is who tags something superseded and when, because if that tagging is manual, it drifts the moment someone forgets, and you're back here in three months. Reranking saves you when the tag is missing, but the tag going stale is the real long-term risk.
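
One way to take the manual step out is to set the tag at ingestion time. This is only a sketch under assumptions: it supposes every version of a document shares a stable `doc_id` and carries a `version` number, and `DocRecord` and `ingest` are hypothetical names, not part of any particular vector store.

```python
from dataclasses import dataclass

@dataclass
class DocRecord:
    doc_id: str             # hypothetical stable id shared by all versions
    version: int
    status: str = "current"

def ingest(index, new_doc):
    """Add new_doc to index, marking whichever versions are older as superseded."""
    for existing in index:
        if existing.doc_id != new_doc.doc_id:
            continue
        if existing.version < new_doc.version:
            existing.status = "superseded"
        elif existing.version > new_doc.version:
            # A late-arriving old version must not displace the newer one
            new_doc.status = "superseded"
    index.append(new_doc)
    return index

index = []
ingest(index, DocRecord("policy/refunds", version=1))
ingest(index, DocRecord("policy/refunds", version=2))
print([(d.version, d.status) for d in index])
```

The tag then follows from the act of publishing a new version, so nobody has to remember to set it.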

Second, be a little careful with freshness as a ranking signal, because it can quietly trade your staleness bug for a recency bug. Newer isn't always more correct. Some of your best docs are old precisely because they're stable and canonical, and a linear decay will start burying a two-year-old policy that never changed under a fresher but worse match. The 0.8/0.2 blend is a reasonable default, but it really wants a small golden set to tune against, and honestly that same golden set is the thing that would've caught this on day one instead of after three days and two tickets. Subtly-wrong-for-days is exactly the failure that a handful of known-good query and answer pairs flags immediately. Good writeup. This is the part of RAG people underinvest in.
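
A golden set can be very small. The sketch below checks retrieval only: `retrieve` is assumed to be any callable that returns document ids in ranked order, and the queries, ids, and `stale_retriever` are made up for illustration.

```python
GOLDEN_SET = [
    # (query, id of the document that should appear in the top results)
    ("how do I rotate an API key", "security/key-rotation-v3"),
    ("what is the refund window", "policy/refunds"),
]

def check_golden_set(retrieve, golden_set, top_k=1):
    """Return (query, expected_id, got) for every query whose expected doc is missing."""
    failures = []
    for query, expected_id in golden_set:
        got = retrieve(query)[:top_k]
        if expected_id not in got:
            failures.append((query, expected_id, got))
    return failures

def stale_retriever(query):
    # Fake retriever that reproduces the bug: an old version ranks first
    if "API key" in query:
        return ["security/key-rotation-v1", "security/key-rotation-v3"]
    return ["policy/refunds"]

for failure in check_golden_set(stale_retriever, GOLDEN_SET):
    print("FAIL", failure)
```

Run on every index rebuild or ranking change, a check like this also gives the 0.8/0.2 weights something concrete to be tuned against.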