- Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
- Also by me: Database Playbook: Choosing the Right Store for Every System You Build
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The tutorial RAG works. You ingest a PDF, you ask it three pre-baked questions, the demo blog post writes itself. Then you put it in front of customers and watch the ground move under you.
Five failures keep showing up. They are not the ones you read about. They are the ones that survive your eval suite, slip past your CI, and only surface when an account manager forwards a screenshot of a confidently wrong answer. The RAG-as-data-engineering essay on Datalakehousehub and the field postmortem on Decompressed catalogue the same failure shapes. This post walks through each of the five, with a code mitigation small enough to drop into a real pipeline.
1. Stale embeddings nobody re-indexed
The corpus is alive. The pricing page changes in February. The contract template changes in March. The product taxonomy changes in April. Your embeddings were minted in January. Nothing errors. Retrieval just gets slowly, silently worse, until a customer asks a question whose answer is in the new copy and the model cites the old one with conviction. A team I spoke with hit this when they swapped embedding models without re-embedding the stored documents: retrieval broke without a single failed deploy, exactly the failure the Decompressed postmortem on embedding-model migrations describes.
The mitigation is a freshness filter that refuses to retrieve documents whose embeddings predate the last write to the source. It is the cheapest guard you will ever add, and it catches the boring 80% of stale-embedding bugs.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=14)  # your re-index SLO

def fresh_filter(now: datetime, max_age: timedelta = MAX_AGE):
    cutoff = now - max_age
    return {
        "must": [
            # refuse chunks whose source changed after they were embedded
            # (field-to-field comparison; exact filter syntax varies by store)
            {"range": {"source_modified_at": {"lte": "$embedded_at"}}},
            # refuse chunks embedded before the re-index SLO cutoff
            {"range": {"embedded_at": {"gte": cutoff.isoformat()}}},
        ]
    }

def retrieve(client, query_vec, k=5):
    return client.search(
        index="docs",
        vector=query_vec,
        filter=fresh_filter(datetime.now(timezone.utc)),
        top_k=k,
    )
Two clauses. The first refuses chunks whose source has been modified after the chunk was embedded. The second refuses chunks older than your re-index SLO. When a customer file changes, you want the retriever to return nothing rather than something wrong — a cold start beats a confident lie.
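The filter keeps stale chunks out of answers; the other half is the job that re-embeds them. A minimal sketch of that sweep, reusing the datetime imports above and assuming a hypothetical store client with query/upsert methods and the same metadata fields (the filter expression and method names are placeholders, not any particular vector store's API):

def reindex_stale(store, embedder, batch_size=128):
    # Find chunks whose source changed after they were embedded.
    # The filter expression is illustrative; write it in your store's syntax.
    stale = store.query(
        filter={"expr": "source_modified_at > embedded_at"},
        limit=batch_size,
    )
    for chunk in stale:
        store.upsert(
            id=chunk.id,
            vector=embedder.encode(chunk.text),
            metadata={
                **chunk.metadata,
                "embedded_at": datetime.now(timezone.utc).isoformat(),
            },
        )
    return len(stale)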
2. Tables and figures the splitter never saw
Recursive character splitters do not know what a table is. They see whitespace, newlines, and bullets, and they walk through a structured document the way a blindfolded person walks through a museum. The pricing table that answers the question gets sliced into header-only and body-only chunks; the figure caption ends up two chunks downstream of its figure. Every paper on chunking quality reaches the same conclusion — see the KX Systems table-heavy RAG writeup for a clean teardown.
The mitigation is to use a partitioner that knows what document elements are. unstructured produces typed elements (Title, NarrativeText, Table, ListItem, Image), and its chunk_by_title keeps tables intact, isolating them as their own chunks rather than smearing them across two halves of a recursive split. The Unstructured chunking docs lay out the contract: a Table is never combined with another element.
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# hi_res + infer_table_structure runs a layout model at ingest, so each
# table comes back with an HTML rendering in metadata.text_as_html
elements = partition(
    filename="contract.pdf",
    infer_table_structure=True,
    strategy="hi_res",
)

# chunk_by_title never merges a Table with another element
chunks = chunk_by_title(
    elements,
    max_characters=1200,
    combine_text_under_n_chars=200,
)

for c in chunks:
    if c.category == "Table":
        html = c.metadata.text_as_html
        embed(html)  # send the HTML to the embedder
The infer_table_structure=True flag costs you a layout-model call at ingest. You pay it once. In return, the table arrives at the embedder as HTML that survives intact through your retriever and into the prompt.
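To make that concrete, here is one way to index the chunks so the HTML travels with them; index.upsert and embedder are stand-ins for whatever store and embedding model you use, not a specific library's API:

def index_chunks(chunks, embedder, index):
    for i, c in enumerate(chunks):
        # Tables are indexed as HTML so rows and columns survive retrieval;
        # everything else is indexed as plain text.
        if c.category == "Table" and c.metadata.text_as_html:
            text = c.metadata.text_as_html
        else:
            text = c.text
        index.upsert(
            id=f"contract-{i}",
            vector=embedder.encode(text),
            metadata={"text": text, "category": c.category},
        )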
3. Two documents disagree and the LLM averages them
This one is sneaky. The retriever returns five chunks. Three say "the SLA is 99.9% uptime." Two say "the SLA is 99.95% uptime." The model, faced with five plausible authorities, blends them: answers "around 99.9%", or worse, picks one without flagging the disagreement. There is no warning. The eval set, written at one point in time, never had two versions of the SLA to disagree about.
You catch this before the LLM call. Run a tiny entailment check across the retrieved set. If the chunks contradict each other on the entities the question is about, surface that to the generator instead of letting it average. Cross-encoders are fast enough to do this in the request path.
from sentence_transformers import CrossEncoder

# for this model the output columns are [contradiction, entailment, neutral]
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def find_contradictions(chunks: list[str]) -> list[tuple[int, int]]:
    if len(chunks) < 2:
        return []
    pairs = [(i, j) for i in range(len(chunks))
             for j in range(i + 1, len(chunks))]
    inputs = [(chunks[i], chunks[j]) for i, j in pairs]
    scores = nli.predict(inputs)  # one [contradiction, entailment, neutral] row per pair
    return [pairs[k] for k, s in enumerate(scores) if s.argmax() == 0]

def guarded_generate(question, chunks, llm):
    conflicts = find_contradictions(chunks)
    if conflicts:
        prompt = (
            f"Sources disagree. Quote each source and flag the "
            f"conflict. Do not pick a side.\n\nQ: {question}\n\n"
            + "\n---\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
        )
    else:
        prompt = f"Q: {question}\n\n" + "\n---\n".join(chunks)
    return llm.complete(prompt)
A sub-second NLI pass per response, and a model that says "Source A says 99.9%, source B says 99.95%, these conflict" instead of inventing a number that satisfies neither.
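A quick smoke test with the SLA example from above (chunk texts invented for illustration; the exact pairs flagged depend on the NLI model):

chunks = [
    "The SLA is 99.9% uptime.",
    "The SLA is 99.95% uptime.",
    "Support tickets are answered within one business day.",
]
print(find_contradictions(chunks))
# expected something like [(0, 1)]: the two SLA chunks contradict each other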
4. Off-corpus questions answered with confidence
Your corpus is internal HR docs. A user asks who won the 2026 World Cup. The retriever, operating on cosine similarity, returns the five nearest chunks: they are about parental leave policies. The model dutifully tries to answer the question using the parental leave policy. You laugh. Then a user asks "is feature X scheduled for Q3?" (feature X is not in the corpus) and the model invents a roadmap.
The fix is a retrieval-confidence threshold and an explicit abstention path. The HALT-RAG paper on calibrated abstention shows precision climbing roughly 8 points on summarization (and a comparable lift on dialogue) when the bottom decile is refused rather than answered. The sufficient-context paper from late 2024 reaches the same conclusion from a different angle: when the retrieved set does not actually contain the answer, the model is far more likely to hallucinate than abstain.
ABSTAIN = "I don't have enough information to answer that."
MIN_TOP_SCORE = 0.42  # example only — calibrate to your eval set
MIN_AVG_TOP3 = 0.34   # example only — calibrate to your eval set

def with_abstention(retrieved, llm, question):
    # assumes the retriever returns results sorted by score, best first
    scores = [r.score for r in retrieved]
    if not scores:
        return ABSTAIN
    top = scores[0]
    avg3 = sum(scores[:3]) / max(len(scores[:3]), 1)
    if top < MIN_TOP_SCORE or avg3 < MIN_AVG_TOP3:
        return ABSTAIN
    chunks = [r.text for r in retrieved]
    return llm.complete(build_prompt(question, chunks))
Calibrate the thresholds against a labeled set of in-corpus and out-of-corpus questions. The exact number matters less than the existence of the gate. A RAG system that knows how to say "I don't know" buys back the trust a hallucinating one spends.
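A minimal way to run that calibration, assuming a labeled list of (question, in_corpus) pairs and a retrieve_fn that maps a question to scored results, best first (both names are placeholders for your own harness): sweep candidate thresholds and keep the one with the best F1 on the answer-vs-abstain decision.

def calibrate_top_score(labeled, retrieve_fn, candidates=None):
    candidates = candidates or [x / 100 for x in range(20, 61, 2)]
    # top retrieval score for each labeled question
    tops = [(max((r.score for r in retrieve_fn(q)), default=0.0), in_corpus)
            for q, in_corpus in labeled]
    best, best_f1 = candidates[0], -1.0
    for t in candidates:
        tp = sum(1 for s, y in tops if s >= t and y)      # answered and answerable
        fp = sum(1 for s, y in tops if s >= t and not y)  # answered but off-corpus
        fn = sum(1 for s, y in tops if s < t and y)       # abstained but answerable
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best, best_f1 = t, f1
    return best, best_f1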
5. Eval-set leakage hiding real degradation
You write an eval set in week one. You re-run it every deploy. Numbers go up. You ship faster. Six months later, somebody points out that 30% of the eval questions have crept into the corpus — added as FAQs, indexed as customer docs, embedded into the same store the retriever queries. The eval is now testing the retriever's ability to find a verbatim match, not its ability to do RAG. Your scores look great. Your real users see degradation that your dashboard cannot see.
The mitigation is a contamination check that runs in CI. Embed every question in the eval set and every chunk in the corpus. If the cosine similarity between any eval question and any corpus chunk is above a tight threshold, that question is contaminated and gets quarantined; a content hash of each flagged question goes into the report so you can track it across runs.
import hashlib
import numpy as np

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.lower().encode()).hexdigest()[:16]

# threshold=0.92 is an example only — calibrate to your eval set
def detect_leakage(eval_qs, corpus_chunks, embedder, threshold=0.92):
    eval_vecs = np.asarray(embedder.encode([q["question"] for q in eval_qs]))
    corpus_vecs = np.asarray(embedder.encode([c.text for c in corpus_chunks]))
    # normalize so the dot product below is cosine similarity
    eval_vecs = eval_vecs / np.linalg.norm(eval_vecs, axis=1, keepdims=True)
    corpus_vecs = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = eval_vecs @ corpus_vecs.T
    leaked = []
    for i, q in enumerate(eval_qs):
        top = sims[i].max()
        if top >= threshold:
            leaked.append({
                "qid": q["id"],
                "fingerprint": fingerprint(q["question"]),
                "max_sim": float(top),
                "match_idx": int(np.argmax(sims[i])),
            })
    return leaked

def assert_clean(eval_qs, corpus_chunks, embedder):
    leaked = detect_leakage(eval_qs, corpus_chunks, embedder)
    if leaked:
        raise AssertionError(
            f"{len(leaked)} eval questions leaked into corpus"
        )
Wire it into CI. Every corpus update triggers the check, every eval update triggers the check, and the build refuses to pass with leakage above your tolerance. The Microsoft Azure 2026 RAG-shifts piece lists eval-set hygiene as one of the underrated production controls; this is the cheapest version of it.
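The CI hook can be as small as a test that loads both artifacts and calls assert_clean on every run; load_eval_set, load_corpus_chunks, and get_embedder below are placeholders for however you load those artifacts.

# test_eval_hygiene.py: fails the build when eval questions leak into the corpus
def test_eval_set_not_leaked_into_corpus():
    eval_qs = load_eval_set("evals/rag_eval.jsonl")
    corpus_chunks = load_corpus_chunks("indexes/docs")
    assert_clean(eval_qs, corpus_chunks, get_embedder())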
These are the kinds of gates you add the second time you ship a RAG system. If you are about to ship your first one, save yourself the round trip and add them now.
If this was useful
The RAG Pocket Guide is the long version of this post — chunking, retrieval, reranking, eval, and the production controls that keep these failure modes from reaching customers. The Database Playbook is the companion: how to pick the right store for the corpus you actually have, when Postgres is enough, and when it isn't.

