Your RAG Agent Is Retrieving the Wrong Chunk: 5 Failure Modes We Fix in Production

#ai #rag #programming #machinelearning

A client called us last month with a simple complaint: "Our support agent confidently quotes the wrong refund policy." The model was fine. The prompt was fine. The problem was three layers down, in the part nobody demos: retrieval. The agent was pulling the wrong chunk of text and then reasoning beautifully over the wrong facts.

This is the quiet truth about Retrieval-Augmented Generation (RAG). When an agent gives a wrong answer, the instinct is to blame the model or "prompt it harder." But in production, the majority of bad answers we debug are retrieval failures, not generation failures. The model did exactly what it was told - it just got handed the wrong context. Here are the five failure modes we see most often, and how we fix them.

1. Chunking that splits a fact in half

The default move is to slice documents into fixed 500-token windows. That works until a fact straddles a boundary - the eligibility rule is in chunk 14, the exception that voids it is in chunk 15, and your retriever returns only chunk 14. The agent now states a rule with total confidence and zero awareness of the exception.

The fix: chunk on structure, not token count. Split on headings, table rows, clauses, and list items. Add a small overlap (10-15%) so a fact and its caveat never get cleanly severed. For policy and contract data, we often store the whole section as one chunk even if it is long - a slightly bloated context beats an amputated fact.
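Here is a minimal sketch of that idea in Python. It assumes blank lines mark structural boundaries (paragraphs, clauses, list items) and counts words as a rough stand-in for tokens. The function name `chunk_by_structure` and its parameters are hypothetical, chosen only for illustration.

```python
import re

def chunk_by_structure(text, max_words=200, overlap_ratio=0.15):
    """Pack whole blocks into chunks; never cut inside a block.

    Assumption: blank lines separate structural blocks. Swap the regex
    for heading or table-row boundaries if your documents have them.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks, current = [], []
    for block in blocks:
        words_so_far = sum(len(b.split()) for b in current)
        if current and words_so_far + len(block.split()) > max_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap,
            # so a rule and the exception that follows it stay connected.
            tail_words = " ".join(current).split()
            n = int(len(tail_words) * overlap_ratio)
            current = [" ".join(tail_words[-n:])] if n else []
        # A block longer than max_words is still kept whole on purpose.
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note the design choice: an oversized block is stored whole rather than split, which matches the "bloated context beats an amputated fact" trade-off above.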

2. Embeddings that confuse "similar words" with "same meaning"

Vector search retrieves what is semantically near the question. But "Can I cancel my subscription?" and "Can I cancel my appointment?" live close together in embedding space while meaning entirely different things in your system. Pure semantic search will happily hand back the appointment policy.

The fix: hybrid retrieval. Combine dense vector search with old-fashioned keyword (BM25) search and merge the results. Keywords catch the exact terms - product names, error codes, SKUs - that embeddings smudge together. In our experience this single change removes a large share of "close but wrong" retrievals.
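One common way to merge the two result lists is reciprocal rank fusion (RRF). That is our choice for this sketch, not the only option. The code assumes you already have two ranked lists of document IDs, one from the vector index and one from BM25. The IDs and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one ranked list.

    Each document scores 1 / (k + rank) per list it appears in, so a
    document ranked well by both retrievers beats one that only a
    single retriever liked.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for "Can I cancel my subscription?"
dense_hits = ["appointment_policy", "subscription_policy", "faq"]
bm25_hits = ["subscription_policy", "billing", "appointment_policy"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

In this toy case the keyword list pulls the subscription policy ahead of the appointment policy, even though the dense retriever ranked the appointment policy first.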

3. No re-ranking, so the best chunk sits at position seven

Your retriever returns the top 20 candidates. The genuinely correct chunk is in there - at rank 7. But you only pass the top 3 to the model, so it never sees it. Recall was fine; ranking failed.

The fix: add a re-ranker. Pull a generous candidate set (say 20-30), then run a cross-encoder re-ranker that scores each chunk against the actual question and reorders them. Pass only the top few after re-ranking. It is one extra step, and in our experience it lifts answer quality more reliably than swapping to a bigger LLM.
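The shape of the re-ranking step looks roughly like this. The sketch takes the scorer as a parameter: in a real pipeline `score_fn` would wrap a cross-encoder model, while `toy_overlap_score` below is only a stand-in so the example runs without any model. Both names are hypothetical.

```python
def rerank(question, candidates, score_fn, top_n=3):
    """Score every candidate against the question, keep the best top_n."""
    scored = [(score_fn(question, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_overlap_score(question, chunk):
    """Placeholder scorer: fraction of question words found in the chunk.

    Assumption: this replaces a real cross-encoder for demo purposes only.
    """
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

# Hypothetical retriever output: the right chunk sits at position seven.
candidates = [
    "Our office hours are nine to five",
    "Shipping takes five days",
    "cancel my appointment online",
    "Password reset steps",
    "Invoices are emailed monthly",
    "Contact support by chat",
    "To cancel my subscription open billing settings",
]
top = rerank("how do i cancel my subscription", candidates, toy_overlap_score)
```

After re-ranking, the subscription chunk moves from position seven to position one, so it survives the cut to the top 3.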

4. Stale or duplicated documents

The 2023 pricing PDF and the 2026 pricing PDF both live in the index. Retrieval finds the 2023 one because it happens to be a tighter semantic match. Now your agent quotes prices from three years ago, and it is not wrong about the document - it is wrong about which document.

The fix: treat the index as a living dataset, not a one-time dump. Attach metadata (effective date, version, source) and filter on it at query time. Run a de-duplication pass. Re-index on a schedule. The most expensive RAG bugs we have untangled were not algorithmic - they were a forgotten stale file nobody removed.
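A small sketch of the metadata and de-duplication steps, under simplifying assumptions: documents are in memory, duplicates are exact matches after whitespace and case normalization, and "current" means the newest effective date per document type. The `Doc` fields and function names are hypothetical, and a real vector store would apply the date filter inside the query.

```python
import hashlib
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: str
    text: str
    doc_type: str
    effective_date: date

def dedupe(docs):
    """Drop documents whose normalized text was already seen."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def latest_per_type(docs, as_of):
    """Keep only the newest document of each type effective on or before as_of."""
    latest = {}
    for doc in docs:
        if doc.effective_date > as_of:
            continue  # not in force yet
        best = latest.get(doc.doc_type)
        if best is None or doc.effective_date > best.effective_date:
            latest[doc.doc_type] = doc
    return list(latest.values())

docs = [
    Doc("pricing-2023", "Pro plan: $10", "pricing", date(2023, 1, 1)),
    Doc("pricing-2026", "Pro plan: $15", "pricing", date(2026, 1, 1)),
    Doc("pricing-2026-copy", "Pro  plan:  $15", "pricing", date(2026, 1, 1)),
]
current = latest_per_type(dedupe(docs), as_of=date(2026, 6, 1))
```

With this filter in front of retrieval, the 2023 file can no longer win on semantic similarity because it is never a candidate.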

5. No "I don't know" path

If retrieval returns nothing relevant, a naive pipeline still stuffs whatever it found into the prompt, and the model dutifully invents an answer. That is the hallucination everyone fears - except it was avoidable.

The fix: score the retrieval. If the top result's relevance is below a threshold, do not answer from it - say you do not have that information, or hand off to a human. An agent that knows the edge of its own knowledge is worth far more than one that bluffs.
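As a sketch, the gate can be a few lines placed between retrieval and generation. It assumes each result is a `(chunk, score)` pair with higher scores meaning more relevant. The function name and the `0.5` threshold are placeholders: the right cutoff depends on your scorer and should be tuned on your own data.

```python
def answer_or_abstain(results, threshold=0.5):
    """Decide whether retrieval is good enough to answer from.

    results: list of (chunk_text, relevance_score) pairs.
    Returns a dict telling the caller to answer or to abstain.
    """
    if not results:
        return {"action": "abstain", "context": None}
    best_chunk, best_score = max(results, key=lambda r: r[1])
    if best_score < threshold:
        # Nothing relevant enough: say so or hand off to a human
        # instead of letting the model improvise.
        return {"action": "abstain", "context": None}
    return {"action": "answer", "context": best_chunk}

# Hypothetical scored results
weak = [("Office hours are nine to five", 0.12), ("Shipping FAQ", 0.31)]
strong = [("Refunds are available within 30 days", 0.87), ("Shipping FAQ", 0.31)]
```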

How we test this before it ships

You cannot eyeball your way to a reliable RAG system. We build a small evaluation set - 50 to 100 real questions with known-correct source documents - and measure two things separately: retrieval accuracy (did we fetch the right chunk?) and answer accuracy (was the final response correct?). Splitting them tells you where the failure actually lives. Nine times out of ten, fixing retrieval fixes the answer, and you never needed a more expensive model at all.
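A minimal version of that split measurement might look like the following. It assumes each evaluation case records the expected source document, the documents the retriever returned, and whether a reviewer judged the final answer correct. The field names are hypothetical.

```python
def evaluate(cases):
    """Report retrieval accuracy and answer accuracy separately."""
    total = len(cases)
    if total == 0:
        return {"retrieval_accuracy": 0.0, "answer_accuracy": 0.0,
                "wrong_answers_with_retrieval_miss": 0}
    retrieval_hits = sum(
        1 for c in cases if c["expected_doc"] in c["retrieved_docs"]
    )
    answer_hits = sum(1 for c in cases if c["answer_correct"])
    # Of the wrong answers, how many never saw the right document?
    retrieval_misses_among_wrong = sum(
        1 for c in cases
        if not c["answer_correct"]
        and c["expected_doc"] not in c["retrieved_docs"]
    )
    return {
        "retrieval_accuracy": retrieval_hits / total,
        "answer_accuracy": answer_hits / total,
        "wrong_answers_with_retrieval_miss": retrieval_misses_among_wrong,
    }

# Hypothetical evaluation cases
cases = [
    {"expected_doc": "refund_v2", "retrieved_docs": ["refund_v2", "faq"], "answer_correct": True},
    {"expected_doc": "pricing_2026", "retrieved_docs": ["pricing_2023"], "answer_correct": False},
    {"expected_doc": "sla", "retrieved_docs": ["sla"], "answer_correct": False},
    {"expected_doc": "cancel_sub", "retrieved_docs": ["cancel_appt"], "answer_correct": False},
]
report = evaluate(cases)
```

In this toy set, two of the three wrong answers trace back to a retrieval miss, which points the debugging effort at retrieval rather than at the model.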

RAG is not magic and it is not plumbing you can ignore. It is the layer that decides whether your agent is grounded in your business or improvising. Get it right and a mid-sized model outperforms a frontier model running on bad context. Get it wrong and no model on earth will save you.

About Shanti Infosoft: Shanti Infosoft is a CMMI Level 5 AI development company that has delivered 700+ projects across 16+ industries. We help teams move from AI ideas to dependable, production-grade software - shantiinfosoft.com | machine learning development services.

If your agent is confidently retrieving the wrong context, we can audit your retrieval pipeline and tune it against your own documents. Talk to our team.

Rishabh Jain is a Director at Shanti Infosoft, where the team builds AI agents and automation for real business operations.