Faizal

RAG-Based Testing Series — Part 2: Testing Retrieval Quality — Are You Fetching the Right Data?

"A RAG system is only as good as what it retrieves. Get retrieval wrong — and everything that follows is built on sand."

In Part 1, we established the big picture.

We learned what RAG is, why traditional testing breaks down with AI systems, and what the major failure modes look like.

If you haven't read Part 1 yet — go read it first. This series builds on itself. 🔗

Now in Part 2 — we get our hands dirty. 🛠️

We're going to talk about retrieval quality — the first and most critical layer of any RAG system.

Because here's the thing most people miss 👇


🎯 Why Retrieval Is Everything

Think back to the RAG pipeline from Part 1:

User Query
    │
    ▼
[Embedding Model] — converts query to vector
    │
    ▼
[Vector Database] — finds most similar document chunks   ← WE ARE HERE
    │
    ▼
[Retrieved Context] — top N relevant chunks
    │
    ▼
[LLM Prompt] — question + context combined
    │
    ▼
[LLM] — generates final answer
    │
    ▼
Final Response to User

Notice where retrieval sits — right at the top of the chain.

If the retriever fetches the wrong documents, the LLM has no chance. It will either hallucinate, give an incomplete answer, or worse — give a confidently wrong answer based on irrelevant context.

Garbage in. Garbage out.

A brilliant LLM cannot save bad retrieval. 🗑️

This is why retrieval testing is not optional — it's the foundation of your entire RAG test strategy.


🧠 What Does "Good Retrieval" Even Mean?

Before we can test something, we need to define what success looks like.

For retrieval, success has two dimensions:

Precision — Of the documents I retrieved, how many were actually relevant?

Recall — Of all the relevant documents that exist, how many did I actually retrieve?

Let me make this concrete with an example. 👇

Imagine your knowledge base has 5 documents relevant to the question "What is the cancellation policy?"

Your retriever returns 4 documents.

  • 3 of those 4 are actually about cancellation policy ✅
  • 1 of those 4 is about shipping policy ❌
  • 2 relevant documents were never retrieved at all ❌

So:

  • Precision = 3 relevant retrieved / 4 total retrieved = 75%
  • Recall = 3 relevant retrieved / 5 total relevant = 60%

Both matter. But they pull in different directions.

High precision means you're not polluting the LLM's context with junk.
High recall means you're not missing critical information the LLM needs.

The goal is to balance both. ⚖️
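If you want to check the arithmetic from the cancellation-policy example yourself, here is a minimal sketch. The document IDs are made up for illustration; only the counts match the example above.

```python
# Hypothetical IDs: five relevant docs exist, the retriever returned four.
relevant = {"cancel_1", "cancel_2", "cancel_3", "cancel_4", "cancel_5"}
retrieved = ["cancel_1", "cancel_2", "shipping_1", "cancel_3"]

# Retrieved docs that are actually relevant
hits = [doc_id for doc_id in retrieved if doc_id in relevant]

precision = len(hits) / len(retrieved)   # 3 / 4
recall = len(hits) / len(relevant)       # 3 / 5

print(f"Precision: {precision:.0%}")  # Precision: 75%
print(f"Recall: {recall:.0%}")        # Recall: 60%
```

Note the two denominators: precision divides by what you retrieved, recall divides by what exists.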


📐 The Retrieval Metrics You Need to Know

Now let's go deeper. In production RAG systems, we measure retrieval quality with four key metrics.

1. Precision@K

"Of the top K documents I retrieved — how many were relevant?"

Precision@K = (Relevant documents in top K) / K

Example:
You ask the retriever for the top 5 documents (K=5).
3 of them are relevant.
Precision@5 = 3/5 = 0.6

This tells you how much noise is in your retrieved context. High noise = confused LLM.


2. Recall@K

"Of all relevant documents in the knowledge base — how many did I find in my top K?"

Recall@K = (Relevant documents in top K) / (Total relevant documents)

Example:
There are 8 relevant documents in total.
Your top 5 results contain 4 of them.
Recall@5 = 4/8 = 0.5

This tells you how complete your retrieval is. Low recall = incomplete answers.


3. MRR — Mean Reciprocal Rank

"How high up in the list is the FIRST relevant document?"

RR = 1 / (rank of first relevant document)
MRR = average of RR across all queries

Example:

  • Query 1: First relevant doc appears at rank 1 → RR = 1/1 = 1.0
  • Query 2: First relevant doc appears at rank 3 → RR = 1/3 = 0.33
  • Query 3: First relevant doc appears at rank 2 → RR = 1/2 = 0.5

MRR = (1.0 + 0.33 + 0.5) / 3 = 0.61

Why does this matter? Because ordering can affect how well the LLM uses its context. Models do not always weigh every position in a long prompt equally, so if your most relevant document is buried at rank 5, the LLM might not "see" it properly.
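The three-query example can be reproduced with a small helper. This is a sketch on hypothetical ranked lists (the IDs "a", "b", "c", "x", "y" are placeholders), independent of any vector database.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant ID, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked results reproducing the three queries above.
queries = [
    (["a", "x", "y"], {"a"}),   # first relevant doc at rank 1
    (["x", "y", "b"], {"b"}),   # first relevant doc at rank 3
    (["x", "c", "y"], {"c"}),   # first relevant doc at rank 2
]

rr_values = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
mrr = sum(rr_values) / len(rr_values)
print(round(mrr, 2))  # 0.61
```

A query with no relevant document in the list contributes 0, which pulls the average down. That is usually what you want in a test suite.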


4. NDCG — Normalized Discounted Cumulative Gain

"Are the most relevant documents appearing at the top of the list?"

This is the most sophisticated metric. It doesn't just ask "did you retrieve the right documents?" — it asks "did you retrieve them in the right order?"

NDCG rewards systems that put the most relevant documents first and penalizes systems that bury important documents at the bottom of the list.

Score ranges from 0 to 1. Higher is better. An NDCG of 1.0 means perfect ranking.

NDCG is a standard ranking metric in information retrieval research, and you'll see it in many RAG evaluations.
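NDCG is easier to trust once you have computed it by hand. A common formulation is DCG@K = sum of rel_i / log2(i + 1) over ranks i = 1..K, divided by the DCG of the ideal ordering. The sketch below assumes graded relevance labels (2, 1, 0) that you would supply yourself, and it computes the ideal ordering from the retrieved list only; a stricter variant would build the ideal from every labeled relevant document in the knowledge base.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) for ranks i = 1..k."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG of the actual ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels (2 = highly relevant, 1 = partly relevant, 0 = not),
# listed in the order the retriever returned the documents.
good_ranking = [2, 1, 0]
bad_ranking = [0, 1, 2]

print(round(ndcg_at_k(good_ranking, k=3), 2))  # 1.0
print(round(ndcg_at_k(bad_ranking, k=3), 2))   # 0.62
```

Same three documents, same labels, very different score. Precision@3 and Recall@3 would be identical for both rankings, which is exactly the gap NDCG fills.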


Quick Reference Table

Metric       | What It Measures               | Best For
-------------|--------------------------------|---------------------------
Precision@K  | Relevance of retrieved docs    | Noise control
Recall@K     | Completeness of retrieval      | Coverage
MRR          | Position of first relevant doc | Single-answer queries
NDCG         | Quality of ranked order        | Multi-doc, complex queries

🛠️ Let's Write Actual Retrieval Tests

Enough theory. Let's build something.

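For this series, we're using Python and RAGAS, a widely used open-source framework for RAG evaluation. The retrieval metrics in this part are written by hand so you can see exactly what each one computes.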

If you don't have Python set up yet, here's all you need:

pip install ragas
pip install openai
pip install chromadb
pip install pytest

Step 1 — Set Up a Simple Knowledge Base

First, we need something to retrieve from. Let's create a minimal knowledge base using ChromaDB (a lightweight vector database):

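import os

import chromadb
from chromadb.utils import embedding_functions

# Set up ChromaDB with OpenAI embeddings.
# The API key is read from the OPENAI_API_KEY environment variable
# instead of being hard-coded in the script.
client = chromadb.Client()  # in-memory client; data is gone when the process exits
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002"
)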

collection = client.create_collection(
    name="support_docs",
    embedding_function=embedding_fn
)

# Add documents to the knowledge base
documents = [
    "Premium subscribers are eligible for a full refund within 30 days of purchase. Requests must be submitted via the support portal.",
    "Standard subscribers can cancel their subscription at any time. Cancellation takes effect at the end of the billing period.",
    "Shipping for all orders is processed within 2-3 business days. Express shipping is available at checkout.",
    "To reset your password, click the Forgot Password link on the login page. A reset link will be sent to your registered email.",
    "Premium subscribers get access to priority customer support with a 2-hour response time guarantee.",
]

collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"]
)

print("Knowledge base created ✅")

Step 2 — Define Your Test Cases

This is the most important step. You need a ground truth dataset — a set of questions paired with the documents that should be retrieved.

This is what separates random testing from proper evaluation. 👇

# Ground truth: query → which doc IDs are actually relevant
test_cases = [
    {
        "query": "What is the refund policy for premium subscribers?",
        "relevant_doc_ids": ["doc1"]
    },
    {
        "query": "How do I cancel my subscription?",
        "relevant_doc_ids": ["doc2"]
    },
    {
        "query": "What support response time do premium members get?",
        "relevant_doc_ids": ["doc5", "doc1"]  # both are relevant
    },
    {
        "query": "How long does shipping take?",
        "relevant_doc_ids": ["doc3"]
    },
]

Note: Building this ground truth dataset is actual work — and it's work worth doing properly. The quality of your test suite is only as good as your ground truth. We'll talk more about how to build this at scale in Part 5.
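Since everything downstream depends on the ground truth, it can help to sanity-check the dataset itself before running any retrieval. Here is a minimal sketch; `validate_ground_truth` is a hypothetical helper, not part of RAGAS or ChromaDB, and it assumes the same dict shape as the test cases above.

```python
def validate_ground_truth(test_cases, known_doc_ids):
    """Return a list of human-readable problems found in the ground truth."""
    problems = []
    seen_queries = set()
    for i, case in enumerate(test_cases):
        query = case.get("query", "").strip()
        relevant = case.get("relevant_doc_ids", [])
        if not query:
            problems.append(f"case {i}: empty query")
        if query in seen_queries:
            problems.append(f"case {i}: duplicate query '{query}'")
        seen_queries.add(query)
        if not relevant:
            problems.append(f"case {i}: no relevant doc IDs")
        for doc_id in relevant:
            if doc_id not in known_doc_ids:
                problems.append(f"case {i}: unknown doc ID '{doc_id}'")
    return problems


known_ids = {"doc1", "doc2", "doc3", "doc4", "doc5"}
sample_cases = [
    {"query": "How long does shipping take?", "relevant_doc_ids": ["doc3"]},
    {"query": "How do I reset my password?", "relevant_doc_ids": ["doc9"]},  # wrong ID on purpose
]
print(validate_ground_truth(sample_cases, known_ids))
# ["case 1: unknown doc ID 'doc9'"]
```

A typo in a doc ID would otherwise show up as a mysterious recall failure, so catching it here saves debugging time later.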


Step 3 — Run Retrieval and Calculate Precision@K and Recall@K

def calculate_precision_at_k(retrieved_ids, relevant_ids, k):
    """
    Of the top K retrieved documents — how many were relevant?
    """
    retrieved_at_k = retrieved_ids[:k]
    relevant_retrieved = [doc for doc in retrieved_at_k if doc in relevant_ids]
    return len(relevant_retrieved) / k


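def calculate_recall_at_k(retrieved_ids, relevant_ids, k):
    """
    Of all relevant documents, how many did we retrieve in the top K?
    """
    if not relevant_ids:
        # Recall is undefined without ground truth, so fail with a clear message
        raise ValueError("relevant_ids must not be empty")
    retrieved_at_k = retrieved_ids[:k]
    relevant_retrieved = [doc for doc in retrieved_at_k if doc in relevant_ids]
    return len(relevant_retrieved) / len(relevant_ids)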


def run_retrieval_test(collection, test_cases, k=3):
    results = []

    for test in test_cases:
        # Retrieve top K documents
        result = collection.query(
            query_texts=[test["query"]],
            n_results=k
        )

        retrieved_ids = result["ids"][0]  # list of retrieved doc IDs
        relevant_ids = test["relevant_doc_ids"]

        precision = calculate_precision_at_k(retrieved_ids, relevant_ids, k)
        recall = calculate_recall_at_k(retrieved_ids, relevant_ids, k)

        results.append({
            "query": test["query"],
            "retrieved": retrieved_ids,
            "relevant": relevant_ids,
            "precision@k": round(precision, 2),
            "recall@k": round(recall, 2),
        })

    return results


# Run the tests
results = run_retrieval_test(collection, test_cases, k=3)

for r in results:
    print(f"\nQuery: {r['query']}")
    print(f"Retrieved: {r['retrieved']}")
    print(f"Precision@3: {r['precision@k']}")
    print(f"Recall@3: {r['recall@k']}")
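Per-query output gets hard to scan once the suite grows, so a suite-level summary is handy. The sketch below assumes result dicts with the same `precision@k`, `recall@k` and `query` keys as the ones built above; `summarize` is a hypothetical helper and the sample numbers are invented.

```python
from statistics import mean


def summarize(results):
    """Average the per-query scores and point at the weakest query by recall."""
    return {
        "mean_precision@k": round(mean(r["precision@k"] for r in results), 2),
        "mean_recall@k": round(mean(r["recall@k"] for r in results), 2),
        "worst_query": min(results, key=lambda r: r["recall@k"])["query"],
    }


# Hypothetical per-query results, same shape as the dicts from run_retrieval_test.
sample_results = [
    {"query": "refund policy", "precision@k": 0.33, "recall@k": 1.0},
    {"query": "support response time", "precision@k": 0.33, "recall@k": 0.5},
]
print(summarize(sample_results))
```

Averages hide outliers, which is why the summary also names the worst query. That one is usually where you start debugging.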

Step 4 — Calculate MRR Across Your Test Suite

def calculate_mrr(collection, test_cases, k=5):
    """
    Mean Reciprocal Rank — how high is the first relevant document?
    """
    reciprocal_ranks = []

    for test in test_cases:
        result = collection.query(
            query_texts=[test["query"]],
            n_results=k
        )

        retrieved_ids = result["ids"][0]
        relevant_ids = test["relevant_doc_ids"]

        rr = 0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1 / rank
                break  # only care about the FIRST relevant document

        reciprocal_ranks.append(rr)

    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return round(mrr, 4)


mrr_score = calculate_mrr(collection, test_cases)
print(f"\nMRR Score: {mrr_score}")

Step 5 — Add Assertions and Make It a Real Test

Now let's wrap this in proper test assertions so it can run in a test suite or CI/CD pipeline:

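import pytest

# Assumes collection, test_cases, run_retrieval_test and calculate_mrr from the
# earlier steps are defined in this file (or imported into it).

K = 3
# In this toy dataset each query has only 1 or 2 relevant docs, so Precision@3
# can never exceed 0.33 or 0.67. The threshold has to sit below that ceiling,
# otherwise the precision test fails even when retrieval is perfect.
MIN_PRECISION = 0.3   # at least 1 of the 3 retrieved docs must be relevant
MIN_RECALL = 0.7      # must retrieve at least 70% of relevant docs
MIN_MRR = 0.7         # first relevant doc should rank near the top on average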


def test_retrieval_precision():
    results = run_retrieval_test(collection, test_cases, k=K)
    for r in results:
        assert r["precision@k"] >= MIN_PRECISION, (
            f"Precision too low for query: '{r['query']}'\n"
            f"Expected >= {MIN_PRECISION}, Got: {r['precision@k']}\n"
            f"Retrieved: {r['retrieved']}\n"
            f"Relevant: {r['relevant']}"
        )


def test_retrieval_recall():
    results = run_retrieval_test(collection, test_cases, k=K)
    for r in results:
        assert r["recall@k"] >= MIN_RECALL, (
            f"Recall too low for query: '{r['query']}'\n"
            f"Expected >= {MIN_RECALL}, Got: {r['recall@k']}\n"
            f"Retrieved: {r['retrieved']}\n"
            f"Relevant: {r['relevant']}"
        )


def test_retrieval_mrr():
    mrr = calculate_mrr(collection, test_cases)
    assert mrr >= MIN_MRR, (
        f"MRR too low. Expected >= {MIN_MRR}, Got: {mrr}\n"
        f"The most relevant documents are not ranking high enough."
    )

Run with:

pytest test_retrieval.py -v

📊 What Do Your Scores Actually Mean?

Once you run these tests, here's how to interpret what you're seeing:

Score Range          | What It Means                          | What To Do
---------------------|----------------------------------------|------------------------------------------
Precision@K > 0.8    | Clean retrieval, low noise             | ✅ Good, maintain this
Precision@K 0.5–0.8  | Some irrelevant docs slipping in       | ⚠️ Review chunking strategy
Precision@K < 0.5    | Retriever is pulling mostly wrong docs | 🔴 Embedding model or chunking needs work
Recall@K > 0.8       | Capturing most relevant information    | ✅ Good coverage
Recall@K < 0.5       | Missing critical documents             | 🔴 Increase K or fix embeddings
MRR > 0.8            | Relevant docs surfacing near the top   | ✅ Good ranking
MRR < 0.5            | Relevant docs buried in the list       | 🔴 Ranking needs improvement

🔴 Real Failure Scenarios You'll Actually Encounter

Let me walk you through the retrieval failures I've seen in real RAG systems — and what the metrics look like when they happen.

Failure 1 — Semantic Mismatch

What happens: Your documents use formal language. Your users use casual language. The embedding model struggles to connect them.

Document says: "Subscription termination procedures for enterprise accounts"
User asks:     "How do I cancel my plan?"

Result: Wrong document retrieved. Precision tanks. 📉

Fix: Test with paraphrased versions of your queries. Add query rewriting to your pipeline.
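Testing with paraphrases can be as simple as checking how many rewordings of the same question still surface a relevant document. This is a sketch under stated assumptions: `paraphrase_hit_rate` is a hypothetical helper, and `keyword_retrieve` is a deliberately naive stand-in retriever so the example runs without an API key. In a real suite the `retrieve` argument would wrap your vector database query, and an embedding model would typically bridge more of the gap than keyword overlap does.

```python
def paraphrase_hit_rate(retrieve, paraphrases, relevant_ids, k=3):
    """Fraction of paraphrases for which at least one relevant doc is in the top k."""
    hits = 0
    for query in paraphrases:
        top_k = retrieve(query, k)
        if any(doc_id in relevant_ids for doc_id in top_k):
            hits += 1
    return hits / len(paraphrases)


# Stand-in corpus and retriever for illustration only.
DOCS = {
    "doc_cancel": "subscription termination procedures for enterprise accounts",
    "doc_ship": "shipping is processed within 2-3 business days",
}


def keyword_retrieve(query, k):
    """Rank docs by number of shared lowercase words; drop docs with no overlap."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), doc_id) for doc_id, text in DOCS.items()]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc_id for score, doc_id in ranked if score > 0][:k]


paraphrases = [
    "subscription termination for enterprise",  # formal wording, shares words with the doc
    "how do i cancel my plan",                  # casual wording, shares none
]
print(paraphrase_hit_rate(keyword_retrieve, paraphrases, {"doc_cancel"}))  # 0.5
```

A hit rate well below 1.0 across paraphrases is a signal that query rewriting (or a different embedding model) is worth trying.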


Failure 2 — Chunk Size Problems

What happens: Documents are chunked too small, splitting related information across multiple chunks. Or chunked too large, drowning the relevant sentence in noise.

Policy document is 2000 words.
Chunked into 50-word pieces.
The answer spans chunks 4 and 5 — but only chunk 4 gets retrieved.

Result: Recall drops. Answer is incomplete. 📉

Fix: Experiment with chunk sizes (256, 512, 1024 tokens) and measure recall impact with your test suite.
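To run that experiment you need a chunker you can parameterize. The sketch below is a simplification: it splits on whitespace words as a stand-in for tokens, and `chunk_words` is a hypothetical helper rather than a library function. The overlap parameter is one common way to reduce the "answer split across two chunks" problem.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of chunk_size words; neighbours share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


# Hypothetical 2000-word policy document (placeholder words stand in for real text).
policy = " ".join(f"w{i}" for i in range(2000))

# Compare how many chunks each setting produces, with 20% overlap.
for size in (50, 256, 512):
    chunks = chunk_words(policy, chunk_size=size, overlap=size // 5)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

For each setting you would re-index the chunks and rerun the recall test. Keep in mind that chunk IDs change when the chunk size changes, so the ground truth has to be mapped to the new chunks before the scores are comparable.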


Failure 3 — Knowledge Base Staleness

What happens: The policy changed. The new document was added to the knowledge base, but the outdated one was never removed, and it still ranks highest.

Old doc: "Refunds available within 14 days"  ← retrieved ✅ (but wrong)
New doc: "Refunds available within 30 days"  ← not retrieved ❌

Result: Precision looks fine. But the answer is factually wrong. 🔴

Fix: Add document versioning to your test cases. Assert that the most recently updated version of a document is retrieved for relevant queries. This is a regression test for your knowledge base.
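One way to express that regression test is to flag any retrieved document that has a newer sibling. This is a sketch, assuming you store a policy name and a last-updated date for each document; the metadata field names `policy` and `updated`, the doc IDs, and the dates are all hypothetical.

```python
from datetime import date


def stale_hits(retrieved_ids, doc_meta):
    """Return retrieved doc IDs for which a newer version of the same policy exists."""
    latest = {}
    for meta in doc_meta.values():
        key = meta["policy"]
        latest[key] = max(latest.get(key, meta["updated"]), meta["updated"])
    return [doc_id for doc_id in retrieved_ids
            if doc_meta[doc_id]["updated"] < latest[doc_meta[doc_id]["policy"]]]


# Hypothetical metadata for two versions of the same refund policy.
doc_meta = {
    "refund_v1": {"policy": "refund", "updated": date(2023, 1, 10)},
    "refund_v2": {"policy": "refund", "updated": date(2024, 6, 1)},
}

print(stale_hits(["refund_v1"], doc_meta))  # ['refund_v1']
print(stale_hits(["refund_v2"], doc_meta))  # []
```

In a test you would assert that `stale_hits(...)` is empty for every query. Precision and recall alone will not catch this failure, because the stale document is still "about" the right topic.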


🧩 Putting It All Together

Here's what a complete retrieval test run looks like end to end:

1. Define ground truth dataset (queries + relevant doc IDs)
2. Run retrieval for each query
3. Calculate Precision@K, Recall@K, MRR, NDCG
4. Assert scores meet minimum thresholds
5. Log failures with full context (what was retrieved vs. what should have been)
6. Run in CI/CD on every knowledge base update

By the end of Part 6, step 6 will be fully automated. For now, getting the metrics right is the foundation. 🏗️
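Step 5 in that list (logging failures with full context) deserves a concrete shape. Here is a minimal sketch that collects every failing query instead of stopping at the first failed assertion. `collect_failures` is a hypothetical helper, the sample result is invented, and the thresholds are passed in so you can use whatever values fit your system.

```python
def collect_failures(results, min_precision, min_recall):
    """Return one record per query that misses a threshold, with full context."""
    failures = []
    for r in results:
        reasons = []
        if r["precision@k"] < min_precision:
            reasons.append(f"precision {r['precision@k']} < {min_precision}")
        if r["recall@k"] < min_recall:
            reasons.append(f"recall {r['recall@k']} < {min_recall}")
        if reasons:
            failures.append({
                "query": r["query"],
                "reasons": reasons,
                # Relevant docs the retriever missed
                "missing": sorted(set(r["relevant"]) - set(r["retrieved"])),
                # Retrieved docs that are not in the ground truth
                "unexpected": sorted(set(r["retrieved"]) - set(r["relevant"])),
            })
    return failures


# Hypothetical result with retrieved/relevant IDs plus the two scores.
sample = [{
    "query": "What support response time do premium members get?",
    "retrieved": ["doc5", "doc3", "doc4"],
    "relevant": ["doc5", "doc1"],
    "precision@k": 0.33,
    "recall@k": 0.5,
}]

for failure in collect_failures(sample, min_precision=0.3, min_recall=0.7):
    print(failure)
```

The "missing" and "unexpected" lists are the part you will actually read when a build goes red, since they tell you which documents to inspect.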


🔖 Key Takeaways From Part 2

Let me leave you with the things that actually matter 👇

  • Retrieval is the foundation — every other RAG test depends on this working correctly
  • Ground truth datasets are not optional — you need to define what "correct retrieval" looks like before you can measure it
  • Use multiple metrics — Precision tells you about noise, Recall tells you about completeness, MRR tells you about ranking
  • Thresholds are yours to set — there's no universal "passing score." Set thresholds based on your system's risk level and iterate
  • Retrieval tests should run on every knowledge base change — this is your regression safety net

🚀 What's Next

In Part 3, we go one layer deeper.

We've now verified that the right documents are being retrieved.

But retrieved context being present doesn't mean the LLM is using it.

What if the LLM ignores the context and makes something up anyway?

That's hallucination — and it's exactly what Part 3 is about.

We'll cover:

  • What faithfulness actually means in a RAG context
  • How to measure it programmatically
  • How to detect hallucinations automatically using RAGAS
  • And how to write tests that catch them before they reach production

Series roadmap:
Part 1 — What Is RAG & Why It Needs Different Testing       ✅ Done
Part 2 — Testing Retrieval Quality: Are You Fetching Right? ← You are here
Part 3 — Faithfulness & Hallucination Detection             ← Up next
Part 4 — Edge Cases: What Breaks RAG & How to Catch It
Part 5 — Building a RAG Test Framework from Scratch
Part 6 — Automating RAG Quality Checks in CI/CD

Follow me so you don't miss Part 3 — it's where things get really interesting. Hallucination detection is one of the hardest problems in AI testing, and we're going to make it approachable. 🧠

Drop a comment below 👇

  • Have you tried measuring retrieval quality before — or was this new territory?
  • What chunk size are you using in your RAG system? Would love to compare notes.
  • Anything from Part 2 you want me to go deeper on before we move to Part 3?

All questions welcome. Let's learn this together. 🙌


Faizal Shaikh | Senior Automation Engineer | AI & RAG-Based Testing
Connect with me on LinkedIn
