Faizal

Posted on Jun 10

RAG-Based Testing Series — Part 2: Testing Retrieval Quality — Are You Fetching the Right Data?

#testing #ai #rag #python

"A RAG system is only as good as what it retrieves. Get retrieval wrong — and everything that follows is built on sand."

In Part 1, we established the big picture.

We learned what RAG is, why traditional testing breaks down with AI systems, and what the major failure modes look like.

If you haven't read Part 1 yet — go read it first. This series builds on itself. 🔗

Now in Part 2 — we get our hands dirty. 🛠️

We're going to talk about retrieval quality — the first and most critical layer of any RAG system.

Because here's the thing most people miss 👇

🎯 Why Retrieval Is Everything

Think back to the RAG pipeline from Part 1:

User Query
    │
    ▼
[Embedding Model] — converts query to vector
    │
    ▼
[Vector Database] — finds most similar document chunks   ← WE ARE HERE
    │
    ▼
[Retrieved Context] — top N relevant chunks
    │
    ▼
[LLM Prompt] — question + context combined
    │
    ▼
[LLM] — generates final answer
    │
    ▼
Final Response to User

Notice where retrieval sits: near the top of the chain, upstream of everything the LLM does.

If the retriever fetches the wrong documents, the LLM has no chance. It will either hallucinate, give an incomplete answer, or worse — give a confidently wrong answer based on irrelevant context.

Garbage in. Garbage out.

A brilliant LLM cannot save bad retrieval. 🗑️

This is why retrieval testing is not optional — it's the foundation of your entire RAG test strategy.

🧠 What Does "Good Retrieval" Even Mean?

Before we can test something, we need to define what success looks like.

For retrieval, success has two dimensions:

Precision — Of the documents I retrieved, how many were actually relevant?

Recall — Of all the relevant documents that exist, how many did I actually retrieve?

Let me make this concrete with an example. 👇

Imagine your knowledge base has 5 documents relevant to the question "What is the cancellation policy?"

Your retriever returns 4 documents.

3 of those 4 are actually about cancellation policy ✅
1 of those 4 is about shipping policy ❌
2 relevant documents were never retrieved at all ❌

So:

Precision = 3 relevant retrieved / 4 total retrieved = 75%
Recall = 3 relevant retrieved / 5 total relevant = 60%

Both matter. But they pull in different directions.

High precision means you're not polluting the LLM's context with junk.
High recall means you're not missing critical information the LLM needs.

The goal is to balance both. ⚖️
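To check the arithmetic from the cancellation-policy example, here is a minimal sketch using plain Python. The document IDs are made up for illustration (c1 to c5 stand for the five cancellation documents, s1 for the shipping document); only the counts come from the example.

```python
# Hypothetical IDs: c1..c5 are the five relevant cancellation docs,
# s1 is the irrelevant shipping doc that slipped into the results.
relevant = {"c1", "c2", "c3", "c4", "c5"}
retrieved = ["c1", "c2", "s1", "c3"]

# Retrieved documents that are actually relevant
hits = [doc_id for doc_id in retrieved if doc_id in relevant]

precision = len(hits) / len(retrieved)  # 3 relevant out of 4 retrieved
recall = len(hits) / len(relevant)      # 3 found out of 5 relevant

print(f"Precision: {precision:.0%}")  # Precision: 75%
print(f"Recall: {recall:.0%}")        # Recall: 60%
```

Returning more documents would likely raise recall while lowering precision, which is the tension described above.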

📐 The Retrieval Metrics You Need to Know

Now let's go deeper. In production RAG systems, we measure retrieval quality with four key metrics.

1. Precision@K

"Of the top K documents I retrieved — how many were relevant?"

Precision@K = (Relevant documents in top K) / K

Example:
You ask the retriever for top 5 documents (K=5).
3 of them are relevant.
Precision@5 = 3/5 = 0.6

This tells you how much noise is in your retrieved context. High noise = confused LLM.

2. Recall@K

"Of all relevant documents in the knowledge base — how many did I find in my top K?"

Recall@K = (Relevant documents in top K) / (Total relevant documents)

Example:
There are 8 relevant documents in total.
Your top 5 retrieved 4 of them.
Recall@5 = 4/8 = 0.5

This tells you how complete your retrieval is. Low recall = incomplete answers.

3. MRR — Mean Reciprocal Rank

"How high up in the list is the FIRST relevant document?"

RR = 1 / (rank of first relevant document)
MRR = average of RR across all queries

Example:

Query 1: First relevant doc appears at rank 1 → RR = 1/1 = 1.0
Query 2: First relevant doc appears at rank 3 → RR = 1/3 = 0.33
Query 3: First relevant doc appears at rank 2 → RR = 1/2 = 0.5

MRR = (1.0 + 0.33 + 0.5) / 3 = 0.61

Why does this matter? Position in the prompt can affect how well an LLM uses a piece of context, and lower-ranked chunks are the first to be dropped when the context is trimmed to fit a token budget. If your most relevant document is buried at rank 5, the LLM might not "see" it properly.
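The MRR example can be reproduced with a few lines of Python. This is a sketch that takes the rank of the first relevant document per query as input; using `None` for "no relevant document retrieved" is an assumption of this example, not part of the definition above.

```python
def reciprocal_rank(first_relevant_rank):
    """Return 1/rank, or 0.0 when no relevant document was retrieved (None)."""
    return 0.0 if first_relevant_rank is None else 1.0 / first_relevant_rank


def mean_reciprocal_rank(first_relevant_ranks):
    """Average the reciprocal ranks over all queries."""
    scores = [reciprocal_rank(rank) for rank in first_relevant_ranks]
    return sum(scores) / len(scores)


# Ranks of the first relevant doc for the three example queries
print(round(mean_reciprocal_rank([1, 3, 2]), 2))        # 0.61

# A query with no relevant result contributes 0 and pulls the average down
print(round(mean_reciprocal_rank([1, 3, 2, None]), 2))  # 0.46
```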

4. NDCG — Normalized Discounted Cumulative Gain

"Are the most relevant documents appearing at the top of the list?"

This is the most sophisticated metric. It doesn't just ask "did you retrieve the right documents?" — it asks "did you retrieve them in the right order?"

NDCG rewards systems that put the most relevant documents first and penalizes systems that bury important documents at the bottom of the list.

Score ranges from 0 to 1. Higher is better. An NDCG of 1.0 means perfect ranking.

NDCG is a standard ranking metric in information retrieval research, and it shows up in many RAG evaluations.
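The article does not spell out a formula for NDCG, so here is a minimal sketch of one common formulation: each document's relevance grade is divided by log2(rank + 1), the results are summed (DCG), and that sum is divided by the DCG of the ideal ordering. Other variants exist (for example, an exponential gain), and the graded judgments below are hypothetical values chosen for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(retrieved_ids, relevance_by_id, k):
    """NDCG@K with linear gains; IDs missing from relevance_by_id count as 0."""
    gains = [relevance_by_id.get(doc_id, 0) for doc_id in retrieved_ids[:k]]
    # Ideal ordering: the best possible grades, highest first
    ideal = sorted(relevance_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments: 2 = directly answers, 1 = partly relevant
judgments = {"doc1": 2, "doc5": 1}

# Best document first: perfect ranking
print(round(ndcg_at_k(["doc1", "doc5", "doc3"], judgments, k=3), 3))  # 1.0

# Same documents, best one buried at rank 3
print(round(ndcg_at_k(["doc3", "doc5", "doc1"], judgments, k=3), 3))  # 0.62
```

Both rankings contain the same documents, so Precision@3 and Recall@3 are identical for them. Only NDCG reflects that the second ordering is worse.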

Quick Reference Table

Metric	What It Measures	Best For
Precision@K	Relevance of retrieved docs	Noise control
Recall@K	Completeness of retrieval	Coverage
MRR	Position of first relevant doc	Single-answer queries
NDCG	Quality of ranked order	Multi-doc, complex queries

🛠️ Let's Write Actual Retrieval Tests

Enough theory. Let's build something.

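For this series, we're using Python and RAGAS, a widely used open-source framework for RAG evaluation. In this part we compute the retrieval metrics by hand so you can see exactly what each one measures.

If you don't have the packages installed yet, here's all you need:

pip install ragas
pip install openai
pip install chromadb
pip install pytest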

Step 1 — Set Up a Simple Knowledge Base

First, we need something to retrieve from. Let's create a minimal knowledge base using ChromaDB (a lightweight vector database):

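import os

import chromadb
from chromadb.utils import embedding_functions

# Set up ChromaDB with OpenAI embeddings.
# The API key is read from the OPENAI_API_KEY environment variable
# so it never ends up hardcoded in the test file.
client = chromadb.Client()
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002"
)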

collection = client.create_collection(
    name="support_docs",
    embedding_function=embedding_fn
)

# Add documents to the knowledge base
documents = [
    "Premium subscribers are eligible for a full refund within 30 days of purchase. Requests must be submitted via the support portal.",
    "Standard subscribers can cancel their subscription at any time. Cancellation takes effect at the end of the billing period.",
    "Shipping for all orders is processed within 2-3 business days. Express shipping is available at checkout.",
    "To reset your password, click the Forgot Password link on the login page. A reset link will be sent to your registered email.",
    "Premium subscribers get access to priority customer support with a 2-hour response time guarantee.",
]

collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"]
)

print("Knowledge base created ✅")

Step 2 — Define Your Test Cases

This is the most important step. You need a ground truth dataset — a set of questions paired with the documents that should be retrieved.

This is what separates random testing from proper evaluation. 👇

# Ground truth: query → which doc IDs are actually relevant
test_cases = [
    {
        "query": "What is the refund policy for premium subscribers?",
        "relevant_doc_ids": ["doc1"]
    },
    {
        "query": "How do I cancel my subscription?",
        "relevant_doc_ids": ["doc2"]
    },
    {
        "query": "What support response time do premium members get?",
        "relevant_doc_ids": ["doc5", "doc1"]  # both are relevant
    },
    {
        "query": "How long does shipping take?",
        "relevant_doc_ids": ["doc3"]
    },
]

Note: Building this ground truth dataset is actual work — and it's work worth doing properly. The quality of your test suite is only as good as your ground truth. We'll talk more about how to build this at scale in Part 5.

Step 3 — Run Retrieval and Calculate Precision@K

def calculate_precision_at_k(retrieved_ids, relevant_ids, k):
    """
    Of the top K retrieved documents — how many were relevant?
    """
    retrieved_at_k = retrieved_ids[:k]
    relevant_retrieved = [doc for doc in retrieved_at_k if doc in relevant_ids]
    return len(relevant_retrieved) / k


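def calculate_recall_at_k(retrieved_ids, relevant_ids, k):
    """
    Of all relevant documents, how many did we retrieve in top K?
    """
    # A test case without ground truth would otherwise divide by zero
    if not relevant_ids:
        raise ValueError("relevant_ids is empty: every test case needs ground truth")
    retrieved_at_k = retrieved_ids[:k]
    relevant_retrieved = [doc for doc in retrieved_at_k if doc in relevant_ids]
    return len(relevant_retrieved) / len(relevant_ids)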


def run_retrieval_test(collection, test_cases, k=3):
    results = []

    for test in test_cases:
        # Retrieve top K documents
        result = collection.query(
            query_texts=[test["query"]],
            n_results=k
        )

        retrieved_ids = result["ids"][0]  # list of retrieved doc IDs
        relevant_ids = test["relevant_doc_ids"]

        precision = calculate_precision_at_k(retrieved_ids, relevant_ids, k)
        recall = calculate_recall_at_k(retrieved_ids, relevant_ids, k)

        results.append({
            "query": test["query"],
            "retrieved": retrieved_ids,
            "relevant": relevant_ids,
            "precision@k": round(precision, 2),
            "recall@k": round(recall, 2),
        })

    return results


# Run the tests
results = run_retrieval_test(collection, test_cases, k=3)

for r in results:
    print(f"\nQuery: {r['query']}")
    print(f"Retrieved: {r['retrieved']}")
    print(f"Precision@3: {r['precision@k']}")
    print(f"Recall@3: {r['recall@k']}")

Step 4 — Calculate MRR Across Your Test Suite

def calculate_mrr(collection, test_cases, k=5):
    """
    Mean Reciprocal Rank — how high is the first relevant document?
    """
    reciprocal_ranks = []

    for test in test_cases:
        result = collection.query(
            query_texts=[test["query"]],
            n_results=k
        )

        retrieved_ids = result["ids"][0]
        relevant_ids = test["relevant_doc_ids"]

        rr = 0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1 / rank
                break  # only care about the FIRST relevant document

        reciprocal_ranks.append(rr)

    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return round(mrr, 4)


mrr_score = calculate_mrr(collection, test_cases)
print(f"\nMRR Score: {mrr_score}")

Step 5 — Add Assertions and Make It a Real Test

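Now let's wrap this in proper test assertions so it can run in a test suite or CI/CD pipeline. The tests below assume they live in the same file as the code from Steps 1 to 4 (test_retrieval.py), so collection, test_cases, and the metric functions are all in scope: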

import pytest

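K = 3
# Most test queries have exactly one relevant doc, so Precision@3
# cannot exceed 1/3 for them. A 0.6 threshold would always fail here.
MIN_PRECISION = 0.3   # at least 1 of the top 3 docs must be relevant
MIN_RECALL = 0.7      # must retrieve at least 70% of relevant docs
MIN_MRR = 0.7         # 1.0 means a relevant doc is always ranked first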


def test_retrieval_precision():
    results = run_retrieval_test(collection, test_cases, k=K)
    for r in results:
        assert r["precision@k"] >= MIN_PRECISION, (
            f"Precision too low for query: '{r['query']}'\n"
            f"Expected >= {MIN_PRECISION}, Got: {r['precision@k']}\n"
            f"Retrieved: {r['retrieved']}\n"
            f"Relevant: {r['relevant']}"
        )


def test_retrieval_recall():
    results = run_retrieval_test(collection, test_cases, k=K)
    for r in results:
        assert r["recall@k"] >= MIN_RECALL, (
            f"Recall too low for query: '{r['query']}'\n"
            f"Expected >= {MIN_RECALL}, Got: {r['recall@k']}\n"
            f"Retrieved: {r['retrieved']}\n"
            f"Relevant: {r['relevant']}"
        )


def test_retrieval_mrr():
    mrr = calculate_mrr(collection, test_cases)
    assert mrr >= MIN_MRR, (
        f"MRR too low. Expected >= {MIN_MRR}, Got: {mrr}\n"
        f"The most relevant documents are not ranking high enough."
    )

Run with:

pytest test_retrieval.py -v
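As written, these tests call the OpenAI embeddings API on every run. To exercise the metric code itself without network access, one option is a small stand-in object. The sketch below is an assumption, not part of ChromaDB: `FakeCollection` is a hypothetical class that ranks documents by shared words and mimics only the part of the result shape used above (`result["ids"][0]`).

```python
import re


class FakeCollection:
    """Hypothetical stand-in for a vector store: ranks docs by shared words."""

    def __init__(self, docs_by_id):
        self.docs_by_id = docs_by_id

    @staticmethod
    def _tokens(text):
        # Lowercase alphanumeric words, punctuation stripped
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def query(self, query_texts, n_results):
        # Mirror the nested shape used above: one list of IDs per query text
        all_ids = []
        for query in query_texts:
            q = self._tokens(query)
            ranked = sorted(
                self.docs_by_id,
                key=lambda doc_id: len(q & self._tokens(self.docs_by_id[doc_id])),
                reverse=True,
            )
            all_ids.append(ranked[:n_results])
        return {"ids": all_ids}


fake = FakeCollection({
    "doc3": "Shipping for all orders is processed within 2-3 business days.",
    "doc4": "To reset your password, click the Forgot Password link.",
})
result = fake.query(query_texts=["How do I reset my password?"], n_results=1)
print(result["ids"][0])  # ['doc4']
```

Because it accepts the same `query(query_texts=..., n_results=...)` call, it can be passed to `run_retrieval_test` in place of the real collection for a quick check of the metric functions. It says nothing about real retrieval quality, so the thresholds should still be enforced against the real embedding model.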

📊 What Do Your Scores Actually Mean?

Once you run these tests, here's how to interpret what you're seeing:

Score Range	What It Means	What To Do
Precision@K > 0.8	Clean retrieval, low noise	✅ Good — maintain this
Precision@K 0.5–0.8	Some irrelevant docs slipping in	⚠️ Review chunking strategy
Precision@K < 0.5	Retriever is pulling mostly wrong docs	🔴 Embedding model or chunking needs work
Recall@K > 0.8	Capturing most relevant information	✅ Good coverage
Recall@K < 0.5	Missing critical documents	🔴 Increase K or fix embeddings
MRR > 0.8	Relevant docs surfacing near the top	✅ Good ranking
MRR < 0.5	Relevant docs buried in the list	🔴 Ranking needs improvement

🔴 Real Failure Scenarios You'll Actually Encounter

Let me walk you through the retrieval failures I've seen in real RAG systems — and what the metrics look like when they happen.

Failure 1 — Semantic Mismatch

What happens: Your documents use formal language. Your users use casual language. The embedding model struggles to connect them.

Document says: "Subscription termination procedures for enterprise accounts"
User asks:     "How do I cancel my plan?"

Result: Wrong document retrieved. Precision tanks. 📉

Fix: Test with paraphrased versions of your queries. Add query rewriting to your pipeline.
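One way to make paraphrase testing concrete is to group several phrasings of the same question under one expected document and report which phrasings miss it. In the sketch below, `retrieve` is a hypothetical callable that returns ranked document IDs, and `keyword_retrieve` is a toy retriever that deliberately matches on literal words only, so it fails on synonyms the way a weak retriever might.

```python
def find_paraphrase_misses(retrieve, paraphrases, expected_id, k=3):
    """Return the phrasings for which expected_id is missing from the top k."""
    return [q for q in paraphrases if expected_id not in retrieve(q)[:k]]


# Toy knowledge base (assumption): one formally worded document
DOCS = {"doc_term": "subscription termination procedures for enterprise accounts"}


def keyword_retrieve(query):
    """Toy retriever: returns a doc only when it shares a literal word with the query."""
    words = set(query.lower().replace("?", "").split())
    return [doc_id for doc_id, text in DOCS.items() if words & set(text.split())]


paraphrases = [
    "What are the subscription termination procedures?",
    "How do I cancel my plan?",
    "How can I end my membership?",
]

misses = find_paraphrase_misses(keyword_retrieve, paraphrases, "doc_term")
print(misses)
# ['How do I cancel my plan?', 'How can I end my membership?']
```

With a real retriever, `retrieve` could wrap `collection.query` and return `result["ids"][0]`. A non-empty list of misses points to the phrasings worth adding to the ground truth dataset.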

Failure 2 — Chunk Size Problems

What happens: Documents are chunked too small, splitting related information across multiple chunks. Or chunked too large, drowning the relevant sentence in noise.

Policy document is 2000 words.
Chunked into 50-word pieces.
The answer spans chunks 4 and 5 — but only chunk 4 gets retrieved.

Result: Recall drops. Answer is incomplete. 📉

Fix: Experiment with chunk sizes (256, 512, 1024 tokens) and measure recall impact with your test suite.
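The boundary problem is easy to reproduce in isolation. This sketch splits on words rather than tokens to stay dependency-free (an assumption; a real pipeline would count tokens with its model's tokenizer), and the sample sentence is made up. It shows an answer phrase that straddles a chunk boundary and how overlapping chunks can keep it intact.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of `size` words, repeating `overlap` words between chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def chunks_containing(chunks, phrase):
    """Indexes of the chunks that contain the whole phrase."""
    return [i for i, chunk in enumerate(chunks) if phrase in chunk]


text = "Refunds are available for premium plans within 30 days of purchase via the portal"
answer = "within 30 days"

# No overlap: the phrase is cut between chunk 0 and chunk 1
print(chunks_containing(chunk_words(text, size=8), answer))             # []

# 4-word overlap: chunk 1 now holds the full phrase
print(chunks_containing(chunk_words(text, size=8, overlap=4), answer))  # [1]
```

Running the recall tests once per chunking configuration gives a measured comparison instead of a guess.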

Failure 3 — Knowledge Base Staleness

What happens: The policy changed and a new document was added, but the old one was never removed. The old document still ranks highest.

Old doc: "Refunds available within 14 days"  ← retrieved ✅ (but wrong)
New doc: "Refunds available within 30 days"  ← not retrieved ❌

Result: Precision looks fine. But the answer is factually wrong. 🔴

Fix: Add document versioning to your test cases. Assert that the most recently updated version of a document is retrieved for relevant queries. This is a regression test for your knowledge base.
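A version check like this can be sketched in plain Python. The metadata schema here (`source` and `version` fields) and the document IDs are hypothetical; in ChromaDB, such fields could be stored through the `metadatas` argument of `collection.add`. The function flags any retrieved document that has a newer version which either ranks below it or was not retrieved at all.

```python
def stale_rank_violations(retrieved_ids, metadata):
    """Return (stale_id, newer_id) pairs where an old version outranks or replaces the latest."""
    # Find the latest version of each source document
    latest = {}
    for doc_id, meta in metadata.items():
        current = latest.get(meta["source"])
        if current is None or meta["version"] > metadata[current]["version"]:
            latest[meta["source"]] = doc_id

    rank = {doc_id: i for i, doc_id in enumerate(retrieved_ids)}
    violations = []
    for doc_id in retrieved_ids:
        newest = latest[metadata[doc_id]["source"]]
        # A newest doc that was not retrieved is treated as ranked last
        if newest != doc_id and rank.get(newest, len(retrieved_ids)) > rank[doc_id]:
            violations.append((doc_id, newest))
    return violations


# Hypothetical metadata for the refund example above
metadata = {
    "refund_v1": {"source": "refund_policy", "version": 1},   # 14 days
    "refund_v2": {"source": "refund_policy", "version": 2},   # 30 days
    "shipping_v1": {"source": "shipping_policy", "version": 1},
}

print(stale_rank_violations(["refund_v1", "shipping_v1"], metadata))
# [('refund_v1', 'refund_v2')]

print(stale_rank_violations(["refund_v2", "refund_v1"], metadata))
# []
```

In a test, asserting that the returned list is empty turns staleness into a hard failure even when Precision@K looks healthy.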

🧩 Putting It All Together

Here's what a complete retrieval test run looks like end to end:

1. Define ground truth dataset (queries + relevant doc IDs)
2. Run retrieval for each query
3. Calculate Precision@K, Recall@K, MRR, NDCG
4. Assert scores meet minimum thresholds
5. Log failures with full context (what was retrieved vs. what should have been)
6. Run in CI/CD on every knowledge base update

By the end of Part 6, step 6 will be fully automated. For now, getting the metrics right is the foundation. 🏗️

🔖 Key Takeaways From Part 2

Let me leave you with the things that actually matter 👇

Retrieval is the foundation — every other RAG test depends on this working correctly
Ground truth datasets are not optional — you need to define what "correct retrieval" looks like before you can measure it
Use multiple metrics — Precision tells you about noise, Recall tells you about completeness, MRR tells you about ranking
Thresholds are yours to set — there's no universal "passing score." Set thresholds based on your system's risk level and iterate
Retrieval tests should run on every knowledge base change — this is your regression safety net

🚀 What's Next

In Part 3, we go one layer deeper.

We've now verified that the right documents are being retrieved.

But retrieved context being present doesn't mean the LLM is using it.

What if the LLM ignores the context and makes something up anyway?

That's hallucination — and it's exactly what Part 3 is about.

We'll cover:

What faithfulness actually means in a RAG context
How to measure it programmatically
How to detect hallucinations automatically using RAGAS
And how to write tests that catch them before they reach production

Part 1 — What Is RAG & Why It Needs Different Testing       ✅ Done
Part 2 — Testing Retrieval Quality: Are You Fetching Right? ← You are here
Part 3 — Faithfulness & Hallucination Detection             ← Up next
Part 4 — Edge Cases: What Breaks RAG & How to Catch It
Part 5 — Building a RAG Test Framework from Scratch
Part 6 — Automating RAG Quality Checks in CI/CD

Follow me so you don't miss Part 3 — it's where things get really interesting. Hallucination detection is one of the hardest problems in AI testing, and we're going to make it approachable. 🧠

Drop a comment below 👇

Have you tried measuring retrieval quality before — or was this new territory?
What chunk size are you using in your RAG system? Would love to compare notes.
Anything from Part 2 you want me to go deeper on before we move to Part 3?

All questions welcome. Let's learn this together. 🙌

Faizal Shaikh | Senior Automation Engineer | AI & RAG-Based Testing
Connect with me on LinkedIn

Top comments (6)

Gursharan Singh • Jun 16

Solid walkthrough. The Precision/Recall/MRR explanation is a good foundation.

One thing that may be worth making a little more explicit for people taking this beyond the demo is chunk provenance. In the examples, each document is basically one chunk with a clean ID, so the connection back to the source document is easy to follow. But with real documents, one source can turn into many chunks. At that point, each chunk needs metadata like parent document ID, and ideally section or version too.

That metadata also helps with two of the failure modes you mentioned: staleness is easier to catch when version is stored, and partial-retrieval debugging gets much easier when you can see which source document the retrieved chunks came from.

Faizal • Jun 16

You're right on both counts. The examples use clean single-chunk docs deliberately to keep the concept clear, but in production that mapping breaks down fast. One PDF becomes 40 chunks, and without parent document ID + version in the metadata, your retrieval test results become almost undebuggable.
The staleness point especially — version in metadata is the only reliable way to assert that the newer document is ranking above the older one. I actually cover a basic version of that in Part 4 (conflicting context test), but the metadata schema deserves its own dedicated section.
Thank you so much for the valuable feedback. Do check out the rest of the series.

Gursharan Singh • Jun 16

That makes sense. Thanks for clarifying. I’ll check Part 4 too.

Alex Shev • Jun 11

Retrieval quality is the part teams usually under-test. The answer may look fluent even when the wrong evidence was pulled.

A good workflow should make retrieval checks boring and repeatable: run a fixture set, inspect misses, compare sources, and fail loudly before the agent writes a confident answer on weak context.

Faizal • Jun 11

"Fail loudly before the agent writes a confident answer on weak context" is literally the thesis of this entire series. Well said.
The fluent-but-wrong answer is the most dangerous failure mode because it's invisible to the end user. Boring and repeatable retrieval checks are exactly the goal — fixture set, compare sources, hard thresholds, CI gate. Building that out in Parts 5 and 6.
Curious — have you run into teams that skipped retrieval checks and only caught it after the LLM started confidently hallucinating in prod?

Alex Shev • Jun 11

Yes, I have seen that pattern. Teams often test the final answer style first because it is easy to review, then discover later that retrieval was quietly pulling weak or stale context.

By then the model has already learned to sound confident around bad inputs. The fix is boring: fixtures, expected source sets, thresholds, and a CI failure before the answer layer even gets to be persuasive.