Gursharan Singh

RAG in Practice — Part 7: Your RAG System Is Wrong. Here's How to Find Out Why.

Part 7 of 8 — RAG Article Series

Previous: RAG, Fine-Tuning, or Long Context? (Part 6)

The Team That Blamed the Model

TechNova's RAG system worked well at launch. Return policy questions got correct answers. Troubleshooting queries surfaced the right procedures. The team shipped, moved on to other work, and checked the dashboard occasionally.

Three months later, support tickets started referencing bad AI answers. A customer was told the return window was thirty days. Another got a troubleshooting procedure that did not match their firmware version. The team's first instinct: the model must be degrading. They started evaluating newer, more expensive models.

The root cause was not the model. TechNova's return policy had changed from thirty days to fifteen days after launch, but the ingestion pipeline had not been re-run. The old chunks were still in the index. The retriever was faithfully returning outdated content. The model was faithfully generating from it. Both were doing their jobs. The data between them was stale.

This is the failure that evaluation exists to catch. Not "is the model good enough?" but "is the system returning the right answers, and if not, which part is wrong?"

Two failures can produce the same wrong answer. The retriever can return the wrong chunks, or the model can mishandle the right ones. To the user, both look identical — a confidently incorrect response. They are not the same problem and they do not have the same fix. The rest of this article separates them, because every useful debugging habit in RAG starts with knowing which one you are looking at.

Retrieval Metrics

Retrieval metrics answer one question: did the retriever return the right content? These metrics evaluate what happened before the model saw anything.

Context Precision

Of the chunks you retrieved, how many were actually relevant to the question? If you retrieve five chunks and three are useful, precision is 60%. The other two are noise — irrelevant content that the model has to read, reason about, and hopefully ignore. High noise means the retriever is casting too wide a net. The fix is usually in chunking (smaller, more focused chunks) or retrieval approach (adding reranking — a second pass that re-orders the retrieved chunks — or switching to hybrid search).

Context Recall

Of all the relevant content in your knowledge base, how much did you retrieve? If the correct answer requires information from two chunks and the retriever found both, recall is 100%. If it found only one, recall is 50% and the model is generating from incomplete information. Low recall means you are missing signal — the right content exists but the retriever did not find it. The fix is usually increasing the number of chunks retrieved (top_k), improving the embedding model, or adding query expansion — approaches that widen what the retriever finds.

Mean Reciprocal Rank

Was the best chunk ranked first? If the most relevant chunk is at position 1, MRR is 1.0. If it is at position 3, MRR is 0.33. This matters because many systems use only the top 1–3 chunks for prompt assembly. If the best chunk is consistently at position 4 or 5, it never reaches the model. And even when a low-ranked chunk does make it into the prompt, the model is more likely to overlook it — deeper positions in long contexts are easier for the model to miss, the "Lost in the Middle" effect. Low MRR is a signal that reranking would help — the retriever finds the right content but does not rank it well enough.
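Assuming each evaluation query carries a labeled set of relevant chunk IDs (the IDs and function name below are illustrative, not from any particular framework), all three retrieval metrics reduce to a few lines of set arithmetic:

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    """Context precision, context recall, and reciprocal rank for one query.

    retrieved_ids: chunk IDs the retriever returned, in rank order.
    relevant_ids:  chunk IDs a correct answer needs (ground truth).
    """
    relevant = set(relevant_ids)
    hits = [cid for cid in retrieved_ids if cid in relevant]

    # Precision: fraction of retrieved chunks that were relevant.
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    # Recall: fraction of needed chunks that were retrieved.
    recall = len(set(hits)) / len(relevant) if relevant else 0.0

    # Reciprocal rank: 1 / position of the first relevant chunk (1-indexed).
    rr = 0.0
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant:
            rr = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "reciprocal_rank": rr}

# The retriever returned five chunks; the answer needs two of them.
m = retrieval_metrics(
    retrieved_ids=["c9", "c2", "c7", "c4", "c5"],
    relevant_ids=["c2", "c4"],
)
```

Here precision is 2/5, recall is 2/2, and the first relevant chunk sits at position 2, so the reciprocal rank is 0.5. MRR is that last value averaged across every query in the evaluation set.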

Generation Metrics

Generation metrics answer a different question: did the model use the retrieved context correctly? These metrics only make sense after you have confirmed that retrieval is working. If the retriever returned the wrong chunks, generation metrics tell you nothing useful.

A note on what not to use. BLEU and ROUGE — common metrics for comparing generated text to a reference answer — are the wrong tool for RAG. They measure surface overlap with a reference answer, which works for translation and summarization, where a single correct output exists. RAG has no single correct answer; it has a correct answer for the retrieved context. A faithful, relevant response can score poorly on BLEU if its wording differs from the reference, and a plausible-sounding hallucination can score well. The three metrics below measure what actually matters: did the model stick to the retrieved context, did it answer the question, and did it cover what the context supports.

Faithfulness

Did the model stick to the retrieved context, or did it add facts that were not in any chunk? A faithful answer draws only from the provided context. An unfaithful answer introduces information the model pulled from its training data — which may be outdated or wrong. This is the RAG-specific version of hallucination: the model was given the right context but generated beyond it.

TechNova example: the retriever returns the correct return policy chunk (15 days), but the model adds "You can also exchange the product within 30 days" — a fact from its training data that is no longer true. The retrieval was correct. The generation was unfaithful.

Answer Relevance

Did the model actually answer the question that was asked? A relevant answer addresses the user's query directly. An irrelevant answer may be factually correct but off-topic. If the user asks about the return policy and the model responds with warranty information — even though the warranty chunk was correctly retrieved alongside the return policy chunk — the answer is irrelevant. The model chose to answer from the wrong chunk.

TechNova example: the customer asks "How do I reset my WH-1000?" The retriever returns both the troubleshooting guide and the return policy. The model answers with the return process. Factually correct, but irrelevant to the question.

Completeness

Did the answer cover what the context supports? A complete answer addresses all the conditions and details present in the retrieved chunks. An incomplete answer cherry-picks. If the return policy chunk says "15 days from date of delivery, original packaging required, open-box items have a 7-day window," and the model responds only with "15 days," it is faithful and relevant but incomplete. The customer may return an open-box item expecting 15 days and get denied.

Two Types of Metrics for Two Types of Problems — Retrieval (blue) vs Generation (purple)

The Diagnostic Spine

This is the single most important debugging habit in RAG: when the answer is wrong, inspect the retrieved chunks first.

If the chunks are wrong — irrelevant, stale, too broad, from the wrong document — the problem is retrieval. No amount of prompt engineering or model upgrading will fix it. The model is generating from bad input.

If the chunks are right but the answer is still wrong — the model hallucinated beyond the context, misinterpreted a condition, or ignored a relevant chunk — the problem is generation. Tighten the prompt, lower the model's temperature setting (the setting that controls randomness), or try a model that follows instructions more closely.

Four diagnostic signals have appeared across this series. Fluent but wrong answers — well-structured, confident, incorrect — almost always mean the retriever returned the wrong chunks. Vague or hedging answers ("the return policy may vary") usually mean the chunks are too broad or generic — a chunking problem. Contradictions across sessions ("thirty days" today, "fifteen days" tomorrow) point to stale data in the index alongside current data — the data freshness problem Part 8 addresses. And correct but irrelevant answers usually mean adjacent content was retrieved instead of the right chunk, or the model picked the wrong chunk from a correct retrieval — check retrieval first, and if the chunks are good, it is a generation-side selection issue.

The same four signals collapse into a quick lookup table when you are debugging in the middle of an incident:

| User-visible symptom | Likely issue area | First thing to inspect |
| --- | --- | --- |
| "AI says it doesn't know, but the answer is in the docs." | Retrieval — the right chunk was not returned | Context recall. Inspect the retrieved chunks for that query. |
| "Answer is detailed and confident but factually wrong." | Usually retrieval (wrong chunks); sometimes generation (hallucinated beyond context) | Inspect retrieved chunks first. If chunks are right, check faithfulness. |
| "Answer is correct but off-topic." | Retrieval (adjacent content) or generation (wrong chunk selected) | Context precision. Then answer relevance. |
| "System gives different answers across time for the same question." | Data freshness — stale and current chunks both in the index | Inspect the index for duplicates and version conflicts. (Covered in Part 8.) |

The Diagnostic Spine — wrong answer → inspect chunks first → retrieval problem or generation problem

LLM-as-a-Judge

Manually inspecting every answer is not sustainable. LLM-as-a-judge uses a model to evaluate another model's outputs automatically: you give the judge the question, the retrieved chunks, and the generated answer, and ask it to score faithfulness, relevance, and completeness on a 1–5 scale with a short written reason.

How LLM-as-a-Judge Works — three inputs in, three scored dimensions out, aggregate over the eval set

The shape of a faithfulness judge prompt is small enough to sketch:

```
You are evaluating a RAG answer for faithfulness.

Question: {question}
Retrieved context: {chunks}
Generated answer: {answer}

Score the answer's faithfulness from 1 to 5,
where 5 = every claim is supported by the context
and 1 = the answer contradicts the context.

Return: score, one-sentence reason.
```

The same shape works for answer relevance and completeness — only the criterion in the scoring instruction changes.
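One practical detail when automating this: the judge's reply comes back as free text and needs light parsing before it can feed a trend line. A minimal sketch, assuming the "score, one-sentence reason" output format requested above — models often add prefixes like "Score: 4 -", so the regex is deliberately tolerant:

```python
import re

def parse_judge_output(raw: str):
    """Extract a 1-5 integer score and a free-text reason from a judge reply.

    Assumes the judge was asked to reply as "score, one-sentence reason",
    but tolerates common prefixes such as "Score: 4 - ...".
    """
    match = re.search(r"\b([1-5])\b\s*[,:\-]?\s*(.*)", raw.strip(), re.DOTALL)
    if match is None:
        return None  # route to human review rather than guess a score
    return {"score": int(match.group(1)), "reason": match.group(2).strip()}

verdict = parse_judge_output(
    "2, the answer asserts a 30-day exchange window that is not in the context."
)
```

Replies that do not contain a parseable score come back as `None` — flag those for human review instead of coercing them into the aggregate.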

Two refinements worth knowing. Judge prompts are usually rubric-based — anchored at each score level rather than left to the model's interpretation, which usually improves evaluator consistency. And when comparing two versions of a system, teams often switch to pairwise evaluation ("which answer is better?"), which is more sensitive than absolute scores at small differences.

The value of running a judge is interpretation. When faithfulness drops week over week, something changed in the generation path — a new prompt, a new model, a prompt-injection slipped through (a user input crafted to override the system prompt). When answer relevance drops while faithfulness holds, the retriever is likely pulling adjacent-but-off-topic content. The trend line is what matters, not the single run.

The advantage is throughput — a judge can score thousands of answers in the time a human scores ten — at the cost of subtlety and consistency. A judge model can miss subtle hallucinations that sound plausible but are not in the context. It can be inconsistent: the same answer may score 4 on one run and 3 on the next. LLM-as-a-judge is a useful automation layer, not a replacement for human evaluation. Use it for continuous monitoring. Use human review for building and validating your evaluation set, and for investigating failures the judge flags. And don't overlook the cheapest form of human signal — thumbs-up/thumbs-down buttons in the production app give you a continuous stream of real-user feedback, and the negative ones are your next eval-set candidates.

Building an Evaluation Set

Every metric in this article requires test queries with known-good answers. Without them, you are measuring nothing.

Start with 20–50 queries, manually curated. For each query, record: the question, the expected answer, and which chunks should be retrieved. This is tedious but irreplaceable — the quality of your evaluation set determines whether your metrics catch real problems or generate false confidence.

Once you have a curated foundation, synthetic generation is a useful coverage extender — frameworks like RAGAS can generate test queries directly from your documents, including multi-hop questions that require combining chunks. Treat the generated set as a complement to the curated one, not a replacement: the curated set is your human-verified ground truth, the synthetic set is your reach. Whatever the synthetic generator produces, the answers it grades against should still be checked by a human.

A good evaluation set is not a long list of similar questions. It is a small, deliberate mix of query shapes that stress different parts of the pipeline. For TechNova's product support corpus, that mix looks roughly like this: a straightforward factual lookup ("What is the warranty period on the WH-1000?") tests whether the retriever can find a single canonical chunk; a boundary or condition question ("Can I return an open-box WH-1000 after 10 days?") tests whether the model honors qualifiers in the retrieved chunk instead of giving the headline answer; a multi-condition or multi-chunk question ("What is covered under warranty if I bought it refurbished?") tests whether the system can combine information from two chunks — warranty terms and refurbished-product policy; and a stale-data or version-sensitive question ("What does firmware v3.2 fix?") tests whether the index reflects the current changelog and not an older version. A handful of queries from each category will surface more failure modes than fifty variations of a single shape.

A "known-good answer" is not an exact reference string the model has to match word for word. It is a set of facts and conditions the answer must include to be considered correct. For the open-box question, that set might be: 15-day window, original packaging required, 7-day window for open-box items. The phrasing the model uses does not matter; the presence of those three facts does. This is also why faithfulness, answer relevance, and completeness are useful metrics here — they evaluate the answer against the retrieved context and the required facts, not against a fixed reference string.
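That fact-set idea can be made mechanical. The sketch below uses naive substring matching purely to show the shape of the check — a real system would delegate this to an LLM judge or embedding similarity, and the fact labels and phrasings here are illustrative:

```python
def completeness_check(answer: str, required_facts: dict[str, list[str]]):
    """Return the required facts the answer failed to mention.

    required_facts maps a fact label to phrasings that count as covering it.
    Substring matching is a placeholder for a semantic check; it only
    demonstrates how a known-good answer is stored as facts, not a string.
    """
    text = answer.lower()
    return [
        label
        for label, phrasings in required_facts.items()
        if not any(p.lower() in text for p in phrasings)
    ]

open_box_facts = {
    "15-day window": ["15 days", "15-day"],
    "original packaging": ["original packaging"],
    "7-day open-box window": ["7 days", "7-day"],
}

answer = "You can return it within 15 days of delivery in its original packaging."
# The headline answer covers two facts but omits the open-box condition.
missing = completeness_check(answer, open_box_facts)
```

A faithful, relevant answer that leaves `missing` non-empty is exactly the "15 days" failure described above: correct as far as it goes, and incomplete.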

Sources for good evaluation queries: real customer questions from your support logs, edge cases you discovered during the Part 5 build, and questions that exercise the specific retrieval challenges your documents create.

Run your retrieval pipeline against the evaluation set after every change. Compare retrieval metrics before and after. If precision dropped, you introduced noise. If recall dropped, you lost signal. If MRR dropped, ranking degraded. Without this discipline, optimization is guesswork. This is the offline half of evaluation; the other half is monitoring real production queries and responses and feeding the failures you find back into the curated set — the offline set defines what you measure, production tells you what you missed.
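That before/after comparison is worth automating so every pipeline change produces the same verdict. A minimal sketch, with an illustrative drop tolerance and the metric names used in this article:

```python
def regression_report(before: dict, after: dict, tolerance: float = 0.02):
    """Compare eval-set retrieval metrics before and after a pipeline change.

    Flags any metric that dropped by more than `tolerance`, mapping each
    drop to the diagnosis used in this series: precision -> added noise,
    recall -> lost signal, mrr -> degraded ranking.
    """
    diagnosis = {
        "precision": "introduced noise",
        "recall": "lost signal",
        "mrr": "ranking degraded",
    }
    regressions = {}
    for metric, old in before.items():
        drop = old - after.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = f"{diagnosis.get(metric, 'regressed')} (-{drop:.2f})"
    return regressions

report = regression_report(
    before={"precision": 0.72, "recall": 0.90, "mrr": 0.81},
    after={"precision": 0.74, "recall": 0.90, "mrr": 0.66},
)
```

In this example only ranking regressed, which points at the chunk-ordering side of retrieval rather than recall — the kind of signal that suggests trying a reranker before anything else.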

The evaluation set is not a one-time artifact. As documents change — the return policy is updated, a new firmware version ships, a product is retired — the expected answers and the chunks the retriever should return must be updated alongside them. An evaluation set that drifts out of sync with the corpus quietly produces false failures and, worse, false confidence.

In practice, most teams do not build every scorer from scratch. Common starting points are RAGAS (open-source, metric implementations, test-set generation), LangSmith (LangChain-ecosystem traces and evaluation workflows), and the evaluation features built into cloud platforms like Amazon Bedrock and Vertex AI. Pick whichever fits your stack — the patterns above apply either way.

Three Takeaways

1. Separate retrieval metrics from generation metrics — they diagnose different problems. Retrieval metrics tell you whether the right content was found. Generation metrics tell you whether the model used it correctly. Fix retrieval first.

2. When the answer is wrong, inspect the retrieved chunks first. Always. The diagnostic spine: wrong answer → inspect chunks → retrieval problem or generation problem. This is the single most important debugging habit in RAG.

3. Start with a small evaluation set of 20–50 manually curated queries with known-good answers, then expand it from real user questions. Run it after every change. Without measurement, optimization is guesswork.

You can measure it. Now ship it safely. Metrics tell you what is wrong today. They do not tell you what will quietly go wrong six months from now — when the policy changes, the index drifts, a prompt-injection slips past the judge, and the dashboard still looks green. Part 8 is about that gap: what it takes to keep a RAG system correct in production after the launch adrenaline wears off.

Next: RAG in Production: What Breaks After Launch (Part 8 of 8)


Part of AI in Practice.

TechNova is a fictional company used as the running example throughout this series.

Sample code: github.com/gursharanmakol/rag-in-practice-samples

Top comments (2)

PEACEBINFLOW

The diagnostic spine is the kind of thing that seems obvious after you read it, but I've watched teams do exactly the opposite in production incidents. Wrong answer? Must be the model. Swap it. Wrong again? Must be the prompt. Rewrite it. Nobody checks the chunks because the chunks are infrastructure and infrastructure is supposed to be boring.

What lands for me is the quiet implication that most RAG failures are actually data engineering problems dressed up as AI problems. The TechNova return policy example is almost too perfect—both the retriever and the model did their jobs correctly. The system was working. The data pipeline wasn't. But the support tickets don't say "data pipeline failure." They say "the AI is giving wrong answers." The nomenclature shapes where the debugging effort goes.

The evaluation set lifecycle is the other part I'm turning over. The idea that an eval set has a shelf life—that it can drift out of sync and start generating false confidence while the dashboard stays green—feels like the thing nobody budgets for. It's not hard to build an eval set once. It's hard to maintain it as the corpus changes, and the maintenance burden grows silently. Six months in, you're measuring against questions that no longer match the documents you're retrieving from, and the metrics are lying to you.

I wonder how much of this gets solved by tying eval set versioning to document versioning directly—like, when the return policy doc gets updated, the eval queries that reference it get flagged for review automatically. Otherwise it's a manual discipline, and manual disciplines decay. Has anyone built that coupling into their pipeline, or is it still in "we'll get to it" territory?

Gursharan Singh

Your "data engineering problems dressed up as AI problems" framing is sharper than anything I wrote. The nomenclature really does shape where engineers look first.
On eval-set versioning tied to documents, I have not seen anyone publish a really clean implementation yet. Most of the patterns I have come across are discipline-based, and disciplines decay.
That is exactly the territory Part 8 is going to push into.