"It Feels Off" Is Not a Diagnosis
You've deployed a RAG system. Users are saying the answers "aren't quite right."
So you tweak the Prompt — feels a bit better. Then you switch Embedding models — better again. After a few rounds of this, you have no idea which change actually helped, and the next time it breaks, you're back to square one.
This is the most common trap in RAG engineering: tuning by intuition without quantified diagnosis.
The previous article built an evaluation framework with RAGAS and explained the 4 core metrics. This article turns those 4 metrics into a diagnostic toolkit — by deliberately inducing 3 classic failure modes, we can use data to pinpoint root causes instead of guessing.
The Core Diagnostic Approach: A Decision Tree
When a RAG system gives poor answers, the root cause falls into one of two categories: retrieval failed or generation failed.
User reports "bad answer"
          ↓
Check context_recall
 ├─ Low ────→ Retrieval problem
 │             ├─ Key content not retrieved
 │             ├─ Chunks too small or too large
 │             └─ Top-K too low
 │
 └─ Normal ──→ Check faithfulness
                ├─ Low ────→ Generation problem (hallucination)
                │             └─ Prompt encourages model to go beyond context
                │
                └─ Normal ──→ Check answer_relevancy
                               └─ Low ──→ Off-topic answer
                                           └─ Prompt forces rigid structure
The logic is straightforward (see the sketch after this list):
- Check context_recall first — did we retrieve the relevant content? If not, no Prompt trick will save you; the problem is upstream.
- Then check faithfulness — does the answer contain claims that can't be found in the retrieved context? If so, you have hallucination, and the fix is in the Prompt.
- Finally check answer_relevancy — does the answer directly address the question? If not, the Prompt structure is the issue.
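The whole tree fits in a few lines of code. Below is a minimal sketch of it as a plain Python function; the metric names match RAGAS, but the 0.15 "significant drop" threshold and the score dictionaries are illustrative assumptions, not values from the repo.

# Minimal sketch of the decision tree. Expects two dicts of RAGAS scores,
# e.g. {"context_recall": 0.62, "faithfulness": 0.83, "answer_relevancy": 0.50}.
# The 0.15 threshold for a "significant" drop is an illustrative choice.
def diagnose(baseline: dict, current: dict, threshold: float = 0.15) -> str:
    def dropped(metric: str) -> bool:
        return baseline[metric] - current[metric] > threshold

    # Step 1: did retrieval fail to bring back the key content?
    if dropped("context_recall"):
        return "Retrieval problem: check chunk_size, chunk_overlap and top_k"

    # Step 2: retrieval is fine -- does the answer stray from the context?
    if dropped("faithfulness"):
        return "Generation problem (hallucination): tighten the prompt to stay grounded"

    # Step 3: grounded but unhelpful -- does the answer address the question?
    if dropped("answer_relevancy"):
        return "Off-topic answer: simplify the prompt, remove rigid formatting"

    return "No significant regression detected"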
Reproducing Three Classic Failure Modes
We use the same knowledge base and test set throughout, deliberately changing the configuration to induce each failure mode, then letting RAGAS quantify the metric signatures.
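For reference, the evaluation step behind each experiment looks roughly like the sketch below. It assumes the RAGAS 0.1-style API (evaluate plus the four metric objects) and a RAGPipeline exposing retrieve() and answer() methods; the repo's actual interface may differ.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

def run_ragas(pipeline, test_cases):
    # test_cases: list of {"question": ..., "ground_truth": ...}
    records = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for case in test_cases:
        docs = pipeline.retrieve(case["question"])         # assumed method name
        answer = pipeline.answer(case["question"], docs)   # assumed method name
        records["question"].append(case["question"])
        records["answer"].append(answer)
        records["contexts"].append([d.page_content for d in docs])  # LangChain Documents
        records["ground_truth"].append(case["ground_truth"])

    # One call scores all four metrics; the result behaves like a dict of floats
    return evaluate(
        Dataset.from_dict(records),
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )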
Baseline Configuration
baseline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="baseline",  # Normal, grounded prompt
)
The baseline Prompt explicitly instructs the model to stay grounded in context:
PROMPT_BASELINE = ChatPromptTemplate.from_messages([
    ("system", "You are a professional technical Q&A assistant. Answer strictly based on the provided reference material. "
               "If the material doesn't contain the answer, say so explicitly. Be concise and accurate."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nAnswer:"),
])
Problem 1: Low Retrieval Recall (Tiny Chunks + Insufficient Top-K)
p1_pipeline = RAGPipeline(
    chunk_size=64,    # Extremely small — docs shattered into fragments
    chunk_overlap=0,
    top_k=1,          # Only 1 chunk retrieved — most info is lost
    prompt_type="baseline",
)
Why does this cause problems?
Each document gets split into dozens of 64-character fragments. A single concept is scattered across multiple chunks. With top_k=1, we only retrieve 1 chunk — even if it's the most relevant one, it doesn't contain nearly enough context to answer the question completely.
Expected outcome: context_recall drops sharply — key content wasn't retrieved. faithfulness and answer_relevancy suffer as a knock-on effect.
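To see the shredding effect in isolation, here is a standalone sketch using LangChain's text splitter on a made-up paragraph (the sample text and variable names are illustrative):

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

doc = Document(page_content=(
    "Retrieval-Augmented Generation (RAG) combines a retriever with a generator. "
    "The retriever fetches relevant chunks from a knowledge base, and the LLM "
    "answers based on those chunks instead of relying only on its parametric memory."
))

# chunk_size=64 shreds the paragraph; chunk_overlap=0 removes any safety margin
splitter = RecursiveCharacterTextSplitter(chunk_size=64, chunk_overlap=0)
fragments = splitter.split_documents([doc])

print(len(fragments))              # one short paragraph becomes several fragments
print(fragments[0].page_content)   # the only chunk top_k=1 would hand to the LLM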
Problem 2: Hallucination (Prompt Encourages Going Beyond Context)
p2_pipeline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="hallucination",  # Hallucination-inducing prompt
)
The hallucination Prompt explicitly encourages the model to "expand":
PROMPT_HALLUCINATION = ChatPromptTemplate.from_messages([
    ("system", "You are an encyclopedic AI assistant with broad knowledge. Answer questions comprehensively "
               "drawing on your extensive training. The reference material below is just a starting point — "
               "feel free to expand with additional background knowledge beyond what's provided. "
               "Make the answer as rich and informative as possible."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nProvide a comprehensive answer:"),
])
Why does this induce hallucination?
The model is explicitly told it doesn't have to stay within the reference material. It will pull from its pre-training knowledge to generate additional content. That content might even be factually correct — but RAGAS's faithfulness metric asks: can every claim in the answer be traced back to the retrieved context? Any claim that can't is flagged as a hallucination.
Expected outcome: faithfulness drops sharply, while context_recall remains the same (retrieval is fine).
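Conceptually, faithfulness is a claim-level ratio: RAGAS breaks the answer into individual statements, asks an LLM judge whether each one is supported by the retrieved context, and scores supported over total. A toy illustration of that arithmetic (hand-labelled claims, not the actual RAGAS pipeline):

# Toy illustration of the faithfulness ratio. In real RAGAS an LLM judge
# decides per claim whether the retrieved context supports it.
claims = [
    ("RAG retrieves chunks from a knowledge base before generating", True),
    ("The retrieved chunks are passed to the LLM as context", True),
    ("RAG was first proposed by Facebook AI in 2020", False),  # true fact, but absent from the context
]

supported = sum(is_supported for _, is_supported in claims)
print(f"faithfulness = {supported / len(claims):.2f}")  # 2 of 3 claims supported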
Problem 3: Off-Topic Answers (Prompt Forces Rigid Academic Structure)
p3_pipeline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="offtopic",  # Forced academic survey format
)
The off-topic Prompt mandates a fixed academic structure:
PROMPT_OFFTOPIC = ChatPromptTemplate.from_messages([
    ("system", "You are a senior technical researcher writing academic surveys. "
               "For every question, structure your response exactly as follows:\n"
               "1. Technical Background and Historical Evolution\n"
               "2. Major Technical Schools and Comparative Analysis\n"
               "3. Current Challenges and Future Trends\n"
               "Each section must be at least 200 words. Use academic language."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nWrite the survey:"),
])
Why does this lead to off-topic answers?
The user asks "What is RAG?", but the model is forced to output an 800-word academic survey with three fixed sections. RAGAS's answer_relevancy metric measures how directly the answer addresses the actual question. A lengthy survey format scores poorly on relevancy for a direct factual question.
Expected outcome: answer_relevancy drops sharply, while faithfulness and context_recall remain normal (the content comes from context, it's just the format that's off).
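Under the hood, answer_relevancy works roughly like this: RAGAS asks an LLM to generate a few questions that the given answer would plausibly be answering, embeds them, and averages their cosine similarity to the original question. A rambling survey produces generated questions that look nothing like "What is RAG?". A toy sketch of that final scoring step, with made-up embedding vectors:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: the original question vs. questions an LLM
# reverse-generated from the answer. Real RAGAS uses your embedding model here.
original_question = np.array([0.9, 0.1, 0.0])
generated_from_answer = [
    np.array([0.2, 0.9, 0.1]),  # e.g. "How has retrieval technology evolved historically?"
    np.array([0.1, 0.3, 0.9]),  # e.g. "What are the future trends of RAG research?"
    np.array([0.8, 0.3, 0.1]),  # e.g. "What is RAG?" (the only on-topic one)
]

score = np.mean([cosine(original_question, q) for q in generated_from_answer])
print(f"answer_relevancy ≈ {score:.2f}")  # dragged down by the off-topic questions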
Experimental Results
After running diagnose.py, the comparison report looks like this:
================================================================================
RAG Diagnostic Comparison Report
================================================================================
Metric Baseline Problem 1: Low Recall Problem 2: Hallucination Problem 3: Off-Topic
──────────────────────────────────────────────────────────────────────────────────────────────────────
faithfulness 0.829 0.750 0.320 ← ✗ 0.817
answer_relevancy 0.502 0.191 0.487 0.183 ← ✗
context_precision 0.583 0.375 ← ⚠ 0.583 0.550
context_recall 0.625 0.250 ← ✗ 0.625 0.613
──────────────────────────────────────────────────────────────────────────────────────────────────────
Average 0.635 0.392 0.504 0.541
================================================================================
The diagnostic story in numbers:
| Scenario | Problem Metric (drop vs. baseline) | Other Metrics | Root Cause |
|---|---|---|---|
| Baseline | No issues | — | — |
| Problem 1 | context_recall ↓ 0.375 | context_precision also drops | Chunks too fragmented + top_k too low; key content never retrieved |
| Problem 2 | faithfulness ↓ 0.509 | context_recall unchanged | Prompt tells model to go beyond context; hallucination introduced |
| Problem 3 | answer_relevancy ↓ 0.319 | faithfulness unchanged | Forced academic format; answer doesn't directly address the question |
Note that in Problem 1, context_precision also drops. With tiny chunks, even retrieved ones carry too little information to be genuinely useful — they're "relevant in topic but useless in content."
Decision Tree Diagnostic Output
The script also automatically outputs the decision tree analysis:
================================================================================
Decision Tree Diagnosis
================================================================================
[Problem 1: Low Retrieval Recall] (Expected: context_recall low)
  Step 1 → Check context_recall: dropped +0.375 (significant)
  ✗ Diagnosis: Retrieval stage has a problem
    → Key content was not retrieved; check chunk_size and top_k
    → top_k may be too small, or chunks too fragmented to carry complete semantics
  ⚠ Additional finding: context_precision dropped +0.208 — noise in retrieved results

[Problem 2: Hallucination] (Expected: faithfulness low)
  Step 1 → Check context_recall: dropped +0.000 (normal)
  Step 2 → Check faithfulness: dropped +0.509 (significant)
  ✗ Diagnosis: Generation stage producing hallucinations
    → Answer contains claims not found in the retrieved context
    → Fix: Update Prompt to explicitly require staying within reference material

[Problem 3: Off-Topic] (Expected: answer_relevancy low)
  Step 1 → Check context_recall: dropped +0.012 (normal)
  Step 2 → Check faithfulness: dropped +0.012 (normal)
  Step 3 → Check answer_relevancy: dropped +0.319 (significant)
  ✗ Diagnosis: Answer is off-topic; doesn't directly address the question
    → Prompt forces rigid academic structure, answer unfocused
    → Fix: Simplify Prompt, remove forced formatting constraints
================================================================================
All three diagnostic paths fired correctly, matching the root cause we deliberately induced.
How to Fix Each Problem
Problem 1: context_recall Low → Fix Retrieval Config
# Before: chunks too small, top_k too low
RAGPipeline(chunk_size=64, chunk_overlap=0, top_k=1)
# After: reasonable chunk size + sufficient top_k
RAGPipeline(chunk_size=512, chunk_overlap=50, top_k=4)
Reference values:
| Use Case | chunk_size | overlap | top_k |
|---|---|---|---|
| Short Q&A (technical FAQ) | 256–512 | 20–50 | 3–5 |
| Long document comprehension | 512–1024 | 50–100 | 4–6 |
| Code repository search | Per function/class | 0 | 3–5 |
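If you are unsure where inside those ranges to land, let the metrics decide: sweep a few candidate configurations against the same test set and keep the one with the best context_recall. A sketch, reusing the run_ragas helper sketched earlier (the candidate values are examples, not tuned recommendations):

# Sketch: choose retrieval parameters by measurement instead of intuition.
# Reuses run_ragas() and test_cases from the earlier evaluation sketch.
candidates = [
    {"chunk_size": 256, "chunk_overlap": 20, "top_k": 3},
    {"chunk_size": 512, "chunk_overlap": 50, "top_k": 4},
    {"chunk_size": 1024, "chunk_overlap": 100, "top_k": 6},
]

for cfg in candidates:
    pipeline = RAGPipeline(prompt_type="baseline", **cfg)
    scores = run_ragas(pipeline, test_cases)
    print(cfg, "→ context_recall =", round(scores["context_recall"], 3))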
Problem 2: faithfulness Low → Harden the Prompt
The core principle: explicitly tell the model to only use the reference material, and specify what to do when the material is insufficient.
PROMPT_STRICT = ChatPromptTemplate.from_messages([
    ("system", """You are a technical Q&A assistant.
Rules:
1. Answer ONLY based on the provided reference material
2. If the reference doesn't contain the answer, respond: "The available material does not address this question"
3. Do NOT add information beyond the reference material
4. Keep responses concise — under 150 words"""),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}"),
])
Problem 3: answer_relevancy Low → Simplify Prompt Structure
# Fix: remove forced format, let the model answer naturally
PROMPT_FOCUSED = ChatPromptTemplate.from_messages([
    ("system", "You are a technical Q&A assistant. Answer the question directly and concisely. "
               "No unnecessary structure or headers."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}"),
])
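Whichever fix you apply, close the loop the same way the problem was found: re-run the evaluation and check that the offending metric recovered without dragging the others down. A sketch, assuming the fixed prompts are registered under their own prompt_type values in rag_pipeline.py (the name "strict" and the 0.75 threshold are illustrative):

# Sketch: verify that the fix moved the right metric.
# The "strict" prompt_type and the 0.75 threshold are assumptions, not repo values.
fixed = RAGPipeline(chunk_size=512, chunk_overlap=50, top_k=4, prompt_type="strict")
scores = run_ragas(fixed, test_cases)

assert scores["faithfulness"] > 0.75, "still hallucinating, tighten the prompt further"
print(scores)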
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/09-rag-diagnosis
Key files:
- rag_pipeline.py — RAG Pipeline with 3 configurable Prompt types
- diagnose.py — 3 failure scenarios + decision tree diagnosis
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/09-rag-diagnosis
cp .env.example .env # Fill in your LLM and Embedding API keys
pip install -r requirements.txt
python diagnose.py
Summary
The core idea behind this diagnostic framework:
- Use metrics, not intuition — context_recall, faithfulness, and answer_relevancy isolate the retrieval stage, the generation stage, and answer focus, respectively
- Follow the decision tree in order — check context_recall first, then faithfulness, then answer_relevancy; skipping steps leads to misdiagnosis
- Build controlled comparisons — the metric gap between a good config and a bad config is your evidence for root cause
In practice: when something breaks, run a RAGAS evaluation first. See which metric is lowest. Follow the decision tree to find the direction. That's a lot more reliable than tweaking things until it "feels better."