Faizal

Posted on Jun 12

RAG-Based Testing Series — Part 5: Building a RAG Test Framework from Scratch

#ai #python #testing #rag

"Individual tests tell you what broke. A framework tells you the health of the whole system."

We've come a long way in this series.

In Part 2, we measured retrieval quality — Precision@K, Recall@K, MRR.

In Part 3, we detected hallucinations — faithfulness scoring with RAGAS.

In Part 4, we stress-tested edge cases — empty retrieval, conflicting context, adversarial queries.

Every single one of those was written as an individual, isolated test.

That was intentional. Learning each concept in isolation makes it easier to understand.

But in a real project, isolated tests are a problem. 🔴

You end up with:

Tests scattered across multiple files with no shared structure
Repeated setup code everywhere (embedding model, DB connection, LLM client)
No unified way to run everything and see the full picture
No scoring — just green/red with no sense of how good the system actually is
Nothing configurable — every change requires hunting through multiple files

This is what Part 5 fixes.

We're taking everything we've built and assembling it into a proper, structured, reusable RAG test framework. 🏗️

🗺️ What We're Building

By the end of this article, you'll have a framework with this structure:

rag_test_framework/
├── config/
│   └── settings.py          ← all configuration in one place
├── core/
│   ├── retriever.py         ← retrieval logic + scoring
│   ├── evaluator.py         ← RAGAS evaluation wrapper
│   └── rag_pipeline.py      ← end-to-end RAG call
├── tests/
│   ├── conftest.py          ← shared pytest fixtures
│   ├── test_retrieval.py    ← retrieval quality tests
│   ├── test_faithfulness.py ← faithfulness + hallucination tests
│   └── test_edge_cases.py   ← edge case tests
├── data/
│   └── test_cases.json      ← your ground truth dataset
├── reports/
│   └── (auto-generated)     ← test run reports saved here
├── run_tests.py             ← single entry point to run everything
└── requirements.txt

One command to run the entire suite:

python run_tests.py

Or through pytest for CI/CD integration (Part 6):

pytest tests/ -v --tb=short

Let's build it. 🛠️

📦 Step 1 — Install Dependencies

pip install ragas
pip install openai
pip install chromadb
pip install datasets
pip install langchain-openai
pip install pytest
pip install pytest-json-report

Save this as requirements.txt:

ragas>=0.1.0
openai>=1.0.0
chromadb>=0.4.0
datasets>=2.0.0
langchain-openai>=0.0.1
pytest>=7.0.0
pytest-json-report>=1.5.0

⚙️ Step 2 — Configuration in One Place

The single biggest improvement you can make over scattered one-off tests is centralising all configuration.

# config/settings.py

import os

# ── API Keys ──────────────────────────────────────────────
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")

# ── Models ────────────────────────────────────────────────
EMBEDDING_MODEL = "text-embedding-ada-002"
EVALUATION_LLM  = "gpt-4o-mini"
RAG_LLM         = "gpt-4o-mini"

# ── Retrieval Settings ────────────────────────────────────
DEFAULT_TOP_K             = 3          # how many docs to retrieve
SIMILARITY_THRESHOLD      = 0.8        # max distance to consider relevant (Chroma's
                                        # default "l2" space reports squared L2);
                                        # tune this for your embedding model

# ── Quality Thresholds ────────────────────────────────────
# These are the minimum acceptable scores for your system.
# Adjust based on your risk tolerance.
MIN_PRECISION_AT_K        = 0.6        # >= 60% of retrieved docs must be relevant
                                        # (ceiling is relevant_docs / K, so lower this
                                        # or K when queries have few relevant docs)
MIN_RECALL_AT_K           = 0.7        # >= 70% of relevant docs must be retrieved
MIN_MRR                   = 0.7        # first relevant doc should rank in top 2 on average
MIN_FAITHFULNESS          = 0.8        # >= 80% of answer claims must be grounded
CRITICAL_FAITHFULNESS     = 0.3        # below this = outright fabrication, always fail

# ── Reporting ─────────────────────────────────────────────
REPORTS_DIR               = "reports"
REPORT_FILENAME           = "rag_test_report.json"

Every number here is a starting point, not gospel. Run your framework on your actual system, look at the score distributions, then set thresholds that match your system's risk level. High-stakes domains (medical, legal, financial) should have tighter thresholds. Internal tooling can be more lenient.
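
One way to act on "look at the score distributions" is to derive a starting threshold from a baseline run. The sketch below is illustrative only: `suggest_threshold` and the sample scores are hypothetical, and picking a low percentile is just one possible heuristic, not something this framework prescribes.

```python
def suggest_threshold(scores: list[float], percentile: float = 0.1) -> float:
    """Return the score found at a low percentile of a baseline run.

    Hypothetical helper (nearest-rank style, rounded down): a threshold set
    here would flag roughly the weakest share of baseline answers, so treat
    the result as a starting point to adjust by hand.
    """
    if not scores:
        raise ValueError("need at least one baseline score")
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0 and 1")
    ordered = sorted(scores)
    index = int(percentile * (len(ordered) - 1))
    return ordered[index]


# Made-up faithfulness scores from an imagined baseline run.
baseline_faithfulness = [0.92, 0.88, 0.95, 0.81, 0.90, 0.86, 0.97, 0.79, 0.93, 0.89]

print("suggested MIN_FAITHFULNESS:", suggest_threshold(baseline_faithfulness, 0.1))
```

A rule like this keeps the threshold tied to what your system actually produces, which is easier to defend than a number copied from an article.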

🔌 Step 3 — Core Modules

3a — Retriever

# core/retriever.py

import chromadb
from chromadb.utils import embedding_functions
from config.settings import (
    OPENAI_API_KEY,
    EMBEDDING_MODEL,
    DEFAULT_TOP_K,
    SIMILARITY_THRESHOLD
)


def build_collection(collection_name: str, documents: list[str], doc_ids: list[str]):
    """
    Build a ChromaDB collection from a list of documents.
    Call this once during test setup.
    """
    client = chromadb.Client()
    embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
        api_key=OPENAI_API_KEY,
        model_name=EMBEDDING_MODEL
    )

    # Delete collection if it already exists (clean slate for each test run)
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass

    collection = client.create_collection(
        name=collection_name,
        embedding_function=embedding_fn
    )

    collection.add(documents=documents, ids=doc_ids)
    return collection


def retrieve(collection, query: str, n_results: int = DEFAULT_TOP_K) -> dict:
    """
    Retrieve top N documents for a query.
    Returns documents, IDs, and distance scores.
    """
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "distances"]  # ids are always returned; "ids" is not a valid include value
    )
    return {
        "documents": results["documents"][0],
        "ids":       results["ids"][0],
        "distances": results["distances"][0]
    }


def filter_by_threshold(retrieval_result: dict, threshold: float = SIMILARITY_THRESHOLD) -> dict:
    """
    Filter out documents whose distance exceeds the similarity threshold.
    Returns only the documents considered relevant.
    """
    filtered = [
        (doc, doc_id, dist)
        for doc, doc_id, dist in zip(
            retrieval_result["documents"],
            retrieval_result["ids"],
            retrieval_result["distances"]
        )
        if dist < threshold
    ]

    if not filtered:
        return {"documents": [], "ids": [], "distances": []}

    docs, ids, dists = zip(*filtered)
    return {"documents": list(docs), "ids": list(ids), "distances": list(dists)}


def calculate_precision_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    retrieved_at_k = retrieved_ids[:k]
    hits = [doc_id for doc_id in retrieved_at_k if doc_id in relevant_ids]
    return len(hits) / k if k > 0 else 0.0


def calculate_recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    retrieved_at_k = retrieved_ids[:k]
    hits = [doc_id for doc_id in retrieved_at_k if doc_id in relevant_ids]
    return len(hits) / len(relevant_ids) if relevant_ids else 0.0


def calculate_mrr(collection, test_cases: list, k: int = 5) -> float:
    reciprocal_ranks = []
    for test in test_cases:
        result = retrieve(collection, test["query"], n_results=k)
        rr = 0.0
        for rank, doc_id in enumerate(result["ids"], start=1):
            if doc_id in test["relevant_doc_ids"]:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return round(sum(reciprocal_ranks) / len(reciprocal_ranks), 4) if reciprocal_ranks else 0.0
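
To make the three metrics concrete, here is a small worked example on a made-up ranking. The precision and recall functions mirror the ones in `core/retriever.py`; `reciprocal_rank` is a hypothetical per-query helper (the module only exposes the averaged `calculate_mrr`), and the document IDs are invented for illustration.

```python
def calculate_precision_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    hits = [doc_id for doc_id in retrieved_ids[:k] if doc_id in relevant_ids]
    return len(hits) / k if k > 0 else 0.0


def calculate_recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    hits = [doc_id for doc_id in retrieved_ids[:k] if doc_id in relevant_ids]
    return len(hits) / len(relevant_ids) if relevant_ids else 0.0


def reciprocal_rank(retrieved_ids: list, relevant_ids: list) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Invented ranking: the first hit is irrelevant, the next two are relevant.
retrieved = ["doc3", "doc5", "doc1"]
relevant = ["doc5", "doc1"]

precision = calculate_precision_at_k(retrieved, relevant, k=3)  # 2 hits out of 3 slots
recall = calculate_recall_at_k(retrieved, relevant, k=3)        # 2 of 2 relevant docs found
rr = reciprocal_rank(retrieved, relevant)                       # first hit at rank 2

print(f"precision@3={precision:.2f} recall@3={recall:.2f} reciprocal_rank={rr:.2f}")
```

Note how the same ranking scores perfectly on recall but not on precision or reciprocal rank: each metric answers a different question about the retriever.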

3b — Evaluator

# core/evaluator.py

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from config.settings import OPENAI_API_KEY, EVALUATION_LLM


def build_evaluator():
    """
    Build and return the LLM + embeddings used by RAGAS.
    Call once and reuse — avoids re-initialising on every test.
    """
    llm = ChatOpenAI(
        model=EVALUATION_LLM,
        openai_api_key=OPENAI_API_KEY
    )
    embeddings = OpenAIEmbeddings(
        openai_api_key=OPENAI_API_KEY
    )
    return llm, embeddings


def evaluate_faithfulness(test_data: list, llm, embeddings) -> list[dict]:
    """
    Run RAGAS faithfulness + answer_relevancy evaluation on a list of test cases.

    Each test case must have:
        - question  (str)
        - answer    (str)
        - contexts  (list of str)

    Returns a list of dicts with per-case scores added.
    """
    dataset = Dataset.from_list(test_data)

    results = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy],
        llm=llm,
        embeddings=embeddings
    )

    df = results.to_pandas()

    scored = []
    for i, row in df.iterrows():
        scored.append({
            **test_data[i],
            "faithfulness":     round(float(row["faithfulness"]), 4),
            "answer_relevancy": round(float(row["answer_relevancy"]), 4)
        })

    return scored

3c — RAG Pipeline

# core/rag_pipeline.py

from openai import OpenAI
from core.retriever import retrieve, filter_by_threshold
from config.settings import OPENAI_API_KEY, RAG_LLM, DEFAULT_TOP_K

_openai_client = OpenAI(api_key=OPENAI_API_KEY)

SYSTEM_PROMPT = """You are a helpful customer support assistant.
Answer questions using ONLY the information provided in the context below.
If the context does not contain enough information to answer the question,
say so clearly — do not make up information.
If the question is outside the scope of the provided context, say so politely."""


def run_rag(collection, question: str, n_results: int = DEFAULT_TOP_K) -> dict:
    """
    Run a full RAG call: retrieve context, then generate an answer.

    Returns:
        question   — the original question
        contexts   — list of retrieved document strings (after threshold filter)
        answer     — the LLM's generated answer
        raw_distances — distances for all retrieved docs (for debugging)
    """
    raw_result = retrieve(collection, question, n_results=n_results)
    filtered   = filter_by_threshold(raw_result)

    contexts = filtered["documents"]

    if not contexts:
        # No relevant documents found — return a structured "I don't know"
        return {
            "question":      question,
            "contexts":      [],
            "answer":        "I don't have relevant information to answer this question.",
            "raw_distances": raw_result["distances"]
        }

    context_block = "\n\n".join(contexts)
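    user_message = f"Context:\n{context_block}\n\nQuestion: {question}\n\nAnswer:"

    # Send the instructions in the system role and the context + question as the user turn
    response = _openai_client.chat.completions.create(
        model=RAG_LLM,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ]
    )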

    return {
        "question":      question,
        "contexts":      contexts,
        "answer":        response.choices[0].message.content,
        "raw_distances": raw_result["distances"]
    }

📋 Step 4 — Ground Truth Dataset

Store your test cases in one place — not hardcoded across test files.

data/test_cases.json (plain JSON, so the file itself must not contain comments):
{
  "knowledge_base": [
    {
      "id":   "doc1",
      "text": "Premium subscribers are eligible for a full refund within 30 days of purchase. Requests must be submitted via the support portal."
    },
    {
      "id":   "doc2",
      "text": "Standard subscribers can cancel their subscription at any time. Cancellation takes effect at the end of the billing period."
    },
    {
      "id":   "doc3",
      "text": "Shipping for all orders is processed within 2-3 business days. Express shipping is available at checkout."
    },
    {
      "id":   "doc4",
      "text": "To reset your password, click the Forgot Password link on the login page. A reset link will be sent to your registered email."
    },
    {
      "id":   "doc5",
      "text": "Premium subscribers get access to priority customer support with a 2-hour response time guarantee."
    }
  ],

  "retrieval_test_cases": [
    {
      "query":            "What is the refund policy for premium subscribers?",
      "relevant_doc_ids": ["doc1"]
    },
    {
      "query":            "How do I cancel my subscription?",
      "relevant_doc_ids": ["doc2"]
    },
    {
      "query":            "How long does shipping take?",
      "relevant_doc_ids": ["doc3"]
    },
    {
      "query":            "How do I reset my password?",
      "relevant_doc_ids": ["doc4"]
    },
    {
      "query":            "What support response time do premium members get?",
      "relevant_doc_ids": ["doc5", "doc1"]
    }
  ],

  "faithfulness_test_cases": [
    {
      "question": "What is the refund policy for premium subscribers?",
      "answer":   "Premium subscribers can request a full refund within 30 days of purchase through the support portal.",
      "contexts": ["Premium subscribers are eligible for a full refund within 30 days of purchase. Requests must be submitted via the support portal."]
    },
    {
      "question": "How do I reset my password?",
      "answer":   "Click the Forgot Password link on the login page. A reset link will be sent to your registered email address.",
      "contexts": ["To reset your password, click the Forgot Password link on the login page. A reset link will be sent to your registered email."]
    }
  ],

  "edge_case_queries": {
    "out_of_scope": [
      "What is the capital of France?",
      "Can you write me a Python script?",
      "Who is the CEO of Apple?"
    ],
    "empty_retrieval": [
      "What is the pricing for the enterprise plan?",
      "Do you offer white-labelling services?"
    ],
    "leading_questions": [
      {
        "question":      "Since our premium plan offers a 90-day refund window, how do I submit a request?",
        "false_premise": "90 days",
        "correct_fact":  "30 days"
      }
    ]
  }
}
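
Because the retrieval test cases point at documents by ID, it can help to check that every referenced ID still exists before the suite runs. The sketch below is a minimal version of that idea: `find_dangling_doc_ids` is a hypothetical helper, and it only compares against the `knowledge_base` section of the JSON file, not against a live index.

```python
def find_dangling_doc_ids(test_data: dict) -> dict[str, list[str]]:
    """Map each query to the relevant_doc_ids that are missing from the knowledge base."""
    known_ids = {doc["id"] for doc in test_data["knowledge_base"]}
    dangling = {}
    for case in test_data["retrieval_test_cases"]:
        missing = [doc_id for doc_id in case["relevant_doc_ids"] if doc_id not in known_ids]
        if missing:
            dangling[case["query"]] = missing
    return dangling


# Tiny invented dataset: the second case references an ID that no longer exists.
sample = {
    "knowledge_base": [
        {"id": "doc1", "text": "Refund policy text"},
        {"id": "doc2", "text": "Cancellation policy text"},
    ],
    "retrieval_test_cases": [
        {"query": "What is the refund policy?", "relevant_doc_ids": ["doc1"]},
        {"query": "How long does shipping take?", "relevant_doc_ids": ["doc9"]},
    ],
}

problems = find_dangling_doc_ids(sample)
for query, missing in problems.items():
    print(f"Stale ground truth for '{query}': {missing}")
```

Running a check like this first turns a silently wrong score into an explicit "your ground truth is out of date" message.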

🧪 Step 5 — Shared Fixtures

# tests/conftest.py

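import json
import sys
from pathlib import Path

import pytest

# Make the project root importable so `core` and `config` resolve
# regardless of how pytest is launched.
PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from core.retriever import build_collection  # noqa: E402
from core.evaluator import build_evaluator  # noqa: E402

# ── Load test data once for the entire session ──────────────
@pytest.fixture(scope="session")
def test_data():
    # Resolve the path from the project root, not the current working directory
    with open(PROJECT_ROOT / "data" / "test_cases.json", encoding="utf-8") as f:
        return json.load(f)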


# ── Build the knowledge base collection once ────────────────
@pytest.fixture(scope="session")
def collection(test_data):
    kb = test_data["knowledge_base"]
    return build_collection(
        collection_name="rag_test_kb",
        documents=[doc["text"] for doc in kb],
        doc_ids=[doc["id"] for doc in kb]
    )


# ── Build RAGAS evaluator once ──────────────────────────────
@pytest.fixture(scope="session")
def evaluator():
    llm, embeddings = build_evaluator()
    return llm, embeddings

scope="session" means each fixture is built once per test run rather than once per test function. This matters because building the collection sends every knowledge-base document to the embedding API, and constructing the evaluator clients is not free either. Without session scope, that setup cost would be paid again by every test that uses these fixtures.

🧪 Step 6 — The Tests

Retrieval Tests

# tests/test_retrieval.py

import pytest
from core.retriever import (
    retrieve,
    calculate_precision_at_k,
    calculate_recall_at_k,
    calculate_mrr
)
from config.settings import (
    DEFAULT_TOP_K,
    MIN_PRECISION_AT_K,
    MIN_RECALL_AT_K,
    MIN_MRR
)


def test_precision_at_k(collection, test_data):
    cases = test_data["retrieval_test_cases"]
    failures = []

    for case in cases:
        result    = retrieve(collection, case["query"], n_results=DEFAULT_TOP_K)
        precision = calculate_precision_at_k(result["ids"], case["relevant_doc_ids"], DEFAULT_TOP_K)

        if precision < MIN_PRECISION_AT_K:
            failures.append({
                "query":     case["query"],
                "precision": precision,
                "retrieved": result["ids"],
                "relevant":  case["relevant_doc_ids"]
            })

    assert not failures, (
        f"\n{len(failures)} Precision@{DEFAULT_TOP_K} failure(s):\n" +
        "\n".join([
            f"  ❌ Query: '{f['query']}'\n"
            f"     Score: {f['precision']} (min: {MIN_PRECISION_AT_K})\n"
            f"     Retrieved: {f['retrieved']}\n"
            f"     Expected:  {f['relevant']}"
            for f in failures
        ])
    )


def test_recall_at_k(collection, test_data):
    cases = test_data["retrieval_test_cases"]
    failures = []

    for case in cases:
        result = retrieve(collection, case["query"], n_results=DEFAULT_TOP_K)
        recall = calculate_recall_at_k(result["ids"], case["relevant_doc_ids"], DEFAULT_TOP_K)

        if recall < MIN_RECALL_AT_K:
            failures.append({
                "query":  case["query"],
                "recall": recall
            })

    assert not failures, (
        f"\n{len(failures)} Recall@{DEFAULT_TOP_K} failure(s):\n" +
        "\n".join([
            f"  ❌ Query: '{f['query']}'\n"
            f"     Score: {f['recall']} (min: {MIN_RECALL_AT_K})"
            for f in failures
        ])
    )


def test_mrr(collection, test_data):
    cases = test_data["retrieval_test_cases"]
    mrr   = calculate_mrr(collection, cases)

    assert mrr >= MIN_MRR, (
        f"\nMRR too low: {mrr} (min: {MIN_MRR})\n"
        f"Relevant documents are not ranking high enough in retrieval results."
    )
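
One thing worth checking before trusting a Precision@K threshold: the score can never exceed the number of relevant documents divided by K. In the sample dataset most queries list a single relevant document, so at K = 3 the best possible score is 1/3, which sits below a 0.6 minimum. The sketch below makes that arithmetic explicit; `precision_ceiling` is a hypothetical helper, not part of the framework.

```python
def precision_ceiling(num_relevant: int, k: int) -> float:
    """Best Precision@K a perfect retriever could reach for one query."""
    if k <= 0:
        return 0.0
    return min(num_relevant, k) / k


MIN_PRECISION_AT_K = 0.6  # value used in config/settings.py
K = 3                     # DEFAULT_TOP_K in config/settings.py

for num_relevant in (1, 2, 3):
    ceiling = precision_ceiling(num_relevant, K)
    reachable = ceiling >= MIN_PRECISION_AT_K
    print(f"{num_relevant} relevant doc(s): ceiling={ceiling:.2f}, threshold reachable={reachable}")
```

If the ceiling is below your threshold, the test fails no matter how good retrieval is. In that case you might lower K, lower the threshold, or label more relevant documents per query.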

Faithfulness Tests

# tests/test_faithfulness.py

import pytest
from core.evaluator import evaluate_faithfulness
from config.settings import MIN_FAITHFULNESS, CRITICAL_FAITHFULNESS


def test_faithfulness_above_threshold(evaluator, test_data):
    llm, embeddings = evaluator
    cases   = test_data["faithfulness_test_cases"]
    scored  = evaluate_faithfulness(cases, llm, embeddings)
    failures = [s for s in scored if s["faithfulness"] < MIN_FAITHFULNESS]

    assert not failures, (
        f"\n{len(failures)} faithfulness failure(s):\n" +
        "\n".join([
            f"  ❌ Question: '{f['question']}'\n"
            f"     Score: {f['faithfulness']} (min: {MIN_FAITHFULNESS})\n"
            f"     Answer: {f['answer'][:100]}..."
            for f in failures
        ])
    )


def test_no_critical_hallucinations(evaluator, test_data):
    """
    Any answer below CRITICAL_FAITHFULNESS is an outright fabrication.
    A failure here should block the run in every environment.
    """
    llm, embeddings = evaluator
    cases   = test_data["faithfulness_test_cases"]
    scored  = evaluate_faithfulness(cases, llm, embeddings)
    critical = [s for s in scored if s["faithfulness"] < CRITICAL_FAITHFULNESS]

    assert not critical, (
        f"\n🚨 {len(critical)} CRITICAL hallucination(s) detected:\n" +
        "\n".join([
            f"  🔴 Question: '{f['question']}'\n"
            f"     Score: {f['faithfulness']} (critical threshold: {CRITICAL_FAITHFULNESS})\n"
            f"     Answer: {f['answer'][:100]}..."
            for f in critical
        ])
    )
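
A subtlety in checks like `s["faithfulness"] < MIN_FAITHFULNESS`: if a score ever comes back as NaN, every comparison against it is False, so that case would slip through as a pass. Whether your evaluator can actually produce NaN is something to verify for your setup; the sketch below simply assumes it might, and uses a hypothetical helper, `find_low_scores`, that treats a missing or NaN score as a failure.

```python
import math


def find_low_scores(scored: list[dict], metric: str, minimum: float) -> list[dict]:
    """Return cases whose metric is below `minimum`, missing, or NaN."""
    failures = []
    for case in scored:
        value = case.get(metric)
        if value is None or math.isnan(value) or value < minimum:
            failures.append(case)
    return failures


# Invented scores: one good, one unscored (NaN), one genuinely low.
scored_cases = [
    {"question": "a", "faithfulness": 0.9},
    {"question": "b", "faithfulness": float("nan")},
    {"question": "c", "faithfulness": 0.5},
]

naive = [s for s in scored_cases if s["faithfulness"] < 0.8]
strict = find_low_scores(scored_cases, "faithfulness", 0.8)

print("naive check flags: ", [s["question"] for s in naive])
print("strict check flags:", [s["question"] for s in strict])
```

The naive comparison misses the unscored case entirely, which is the kind of silent pass a quality gate should avoid.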

Edge Case Tests

# tests/test_edge_cases.py

import pytest
from core.retriever import retrieve, filter_by_threshold
from core.rag_pipeline import run_rag
from config.settings import SIMILARITY_THRESHOLD


def test_empty_retrieval_detected(collection, test_data):
    """
    Queries with no relevant documents should return zero docs
    after threshold filtering.
    """
    out_of_kb_queries = test_data["edge_case_queries"]["empty_retrieval"]

    for query in out_of_kb_queries:
        raw      = retrieve(collection, query)
        filtered = filter_by_threshold(raw, threshold=SIMILARITY_THRESHOLD)

        assert len(filtered["documents"]) == 0, (
            f"\n❌ Expected no relevant docs for: '{query}'\n"
            f"   Got {len(filtered['documents'])} doc(s) below threshold.\n"
            f"   Distances: {raw['distances']}"
        )


def test_rag_response_on_empty_retrieval(collection, test_data):
    """
    When no relevant documents exist, the pipeline should return
    an uncertainty response — not a fabricated answer.
    """
    uncertainty_indicators = [
        "don't have", "no information", "unable to find",
        "not available", "contact", "cannot find"
    ]

    out_of_kb_queries = test_data["edge_case_queries"]["empty_retrieval"]

    for query in out_of_kb_queries:
        result       = run_rag(collection, query)
        answer_lower = result["answer"].lower()
        is_uncertain = any(ind in answer_lower for ind in uncertainty_indicators)

        assert is_uncertain, (
            f"\n❌ RAG pipeline did not express uncertainty for: '{query}'\n"
            f"   Answer: {result['answer']}\n"
            f"   Expected one of: {uncertainty_indicators}"
        )


def test_out_of_scope_handling(collection, test_data):
    """
    Out-of-scope queries should produce a decline or redirect — not a fabricated answer.
    """
    # "don't have" also matches the pipeline's canned reply when retrieval comes back empty
    decline_indicators = [
        "outside", "scope", "unable to help", "don't have",
        "not able to assist", "please contact", "beyond", "not within"
    ]

    for query in test_data["edge_case_queries"]["out_of_scope"]:
        result       = run_rag(collection, query)
        answer_lower = result["answer"].lower()
        declined     = any(ind in answer_lower for ind in decline_indicators)

        assert declined, (
            f"\n❌ System did not decline out-of-scope query: '{query}'\n"
            f"   Answer: {result['answer']}"
        )


def test_leading_question_correction(collection, test_data):
    """
    When a query contains a false premise, the generated answer should
    state the correct fact and should not repeat the false one.

    Note: this is a plain string check, so an answer that quotes the false
    figure while correcting it (e.g. "not 90 days") will also fail.
    """
    leading_cases = test_data["edge_case_queries"]["leading_questions"]

    for case in leading_cases:
        result = run_rag(collection, case["question"])

        # The answer should NOT contain the false premise
        assert case["false_premise"] not in result["answer"], (
            f"\n❌ LLM accepted false premise in leading question.\n"
            f"   Question:      {case['question']}\n"
            f"   False premise: {case['false_premise']}\n"
            f"   Answer:        {result['answer']}"
        )

        # The answer SHOULD contain the correct fact
        assert case["correct_fact"] in result["answer"], (
            f"\n❌ LLM did not correct the false premise.\n"
            f"   Question:     {case['question']}\n"
            f"   Correct fact: {case['correct_fact']}\n"
            f"   Answer:       {result['answer']}"
        )
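
Substring checks like the ones above are sensitive to surface details: a model may write "don’t" with a typographic apostrophe, which does not match the straight-quote "don't" in the indicator lists. Here is a minimal sketch of normalising the answer before matching; `normalise` and `mentions_any` are hypothetical helpers, not part of the framework.

```python
import unicodedata

# Typographic apostrophes that should be treated like a plain "'".
_APOSTROPHES = ("\u2019", "\u2018", "\u02bc")


def normalise(text: str) -> str:
    """Lowercase, unify apostrophes, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    for fancy in _APOSTROPHES:
        text = text.replace(fancy, "'")
    return " ".join(text.lower().split())


def mentions_any(answer: str, indicators: list[str]) -> bool:
    """True if any indicator phrase appears in the normalised answer."""
    cleaned = normalise(answer)
    return any(normalise(indicator) in cleaned for indicator in indicators)


indicators = ["don't have", "no information"]

curly_answer = "I don\u2019t  have that information."
print(mentions_any(curly_answer, indicators))
print(mentions_any("Paris is the capital of France.", indicators))
```

Keyword matching remains a coarse signal even after normalisation, so it is best treated as a cheap first check rather than proof that the model declined.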

📊 Step 7 — Unified Runner with Reporting

# run_tests.py

import os
import json
import subprocess
from datetime import datetime
from config.settings import REPORTS_DIR, REPORT_FILENAME

def run():
    os.makedirs(REPORTS_DIR, exist_ok=True)

    timestamp   = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = os.path.join(REPORTS_DIR, f"{timestamp}_{REPORT_FILENAME}")

    print("=" * 65)
    print("  RAG Test Framework — Full Suite")
    print(f"  Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 65)

    result = subprocess.run(
        [
            "pytest", "tests/",
            "-v",
            "--tb=short",
            "--json-report",
            f"--json-report-file={report_path}"
        ],
        capture_output=False  # stream output to terminal in real time
    )

    print("\n" + "=" * 65)

    # Parse and print summary from the JSON report
    if os.path.exists(report_path):
        with open(report_path) as f:
            report = json.load(f)

        summary  = report.get("summary", {})
        passed   = summary.get("passed", 0)
        failed   = summary.get("failed", 0)
        total    = summary.get("total", 0)
        duration = round(report.get("duration", 0), 2)

        print(f"  Results : {passed}/{total} passed | {failed} failed")
        print(f"  Duration: {duration}s")
        print(f"  Report  : {report_path}")

        if failed > 0:
            print("\n  ❌ Failed tests:")
            for test in report.get("tests", []):
                if test["outcome"] == "failed":
                    print(f"     • {test['nodeid']}")

        print("=" * 65)

    return result.returncode


if __name__ == "__main__":
    raise SystemExit(run())  # propagate pytest's exit code to the shell / CI

▶️ Running the Framework

# Run the full suite
python run_tests.py

# Run only retrieval tests
pytest tests/test_retrieval.py -v

# Run only faithfulness tests
pytest tests/test_faithfulness.py -v

# Run only edge case tests
pytest tests/test_edge_cases.py -v

# Run with full failure output
pytest tests/ -v --tb=long

Sample output:

=================================================================
  RAG Test Framework — Full Suite
  Started: 2025-06-01 14:32:01
=================================================================

tests/test_retrieval.py::test_precision_at_k    PASSED
tests/test_retrieval.py::test_recall_at_k       PASSED
tests/test_retrieval.py::test_mrr               PASSED
tests/test_faithfulness.py::test_faithfulness_above_threshold  PASSED
tests/test_faithfulness.py::test_no_critical_hallucinations    PASSED
tests/test_edge_cases.py::test_empty_retrieval_detected        PASSED
tests/test_edge_cases.py::test_rag_response_on_empty_retrieval PASSED
tests/test_edge_cases.py::test_out_of_scope_handling           PASSED
tests/test_edge_cases.py::test_leading_question_correction     PASSED

=================================================================
  Results : 9/9 passed | 0 failed
  Duration: 42.3s
  Report  : reports/20250601_143201_rag_test_report.json
=================================================================

🔌 Using This Framework on a Different RAG System

The whole point of a framework is reusability. Here's what you change to point it at a new system:

1. Update data/test_cases.json
Replace the knowledge base documents and ground truth test cases with your system's actual content.

2. Update config/settings.py
Adjust thresholds based on your new system's acceptable quality levels.

3. Update core/rag_pipeline.py
If your RAG system uses a different vector database (Pinecone, Weaviate, pgvector) or a different LLM — swap out those implementations in rag_pipeline.py and retriever.py. The tests themselves don't change.

4. Run

python run_tests.py

That's the power of the architecture. Tests are separated from infrastructure. Changing the underlying system doesn't mean rewriting your tests. 🎯
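
To illustrate the separation between tests and infrastructure, here is a hedged sketch of what a swappable retriever could look like. The `Retriever` protocol and the `KeywordRetriever` stand-in are hypothetical and not part of the framework above; a real adapter for Pinecone, Weaviate, or pgvector would wrap that database's client behind the same method and return the same `documents` / `ids` / `distances` dictionary that `retrieve` returns.

```python
from typing import Protocol


class Retriever(Protocol):
    """Anything with this method can stand in for the vector store in tests."""

    def retrieve(self, query: str, n_results: int) -> dict:
        ...


class KeywordRetriever:
    """Toy in-memory retriever that ranks documents by word overlap with the query."""

    def __init__(self, documents: dict[str, str]) -> None:
        self._documents = documents

    def retrieve(self, query: str, n_results: int) -> dict:
        query_words = set(query.lower().split())
        scored = []
        for doc_id, text in self._documents.items():
            overlap = len(query_words & set(text.lower().split()))
            # Fake "distance": smaller means more similar, as with a vector DB.
            distance = 1.0 / (1 + overlap)
            scored.append((distance, doc_id, text))
        scored.sort()
        top = scored[:n_results]
        return {
            "documents": [text for _, _, text in top],
            "ids": [doc_id for _, doc_id, _ in top],
            "distances": [distance for distance, _, _ in top],
        }


def top_id(retriever: Retriever, query: str) -> str:
    """Test-side code depends only on the protocol, not on a concrete backend."""
    return retriever.retrieve(query, n_results=1)["ids"][0]


toy_kb = {
    "doc3": "Shipping is processed within 2-3 business days",
    "doc4": "To reset your password click the Forgot Password link",
}
print(top_id(KeywordRetriever(toy_kb), "how do i reset my password"))
```

The design choice is that test code calls `top_id`-style functions typed against the protocol, so replacing the backend means writing one new adapter class rather than touching the tests.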

🧩 The Complete Testing Stack — All Five Layers

Layer 1 — RETRIEVAL QUALITY ✅
          tests/test_retrieval.py
          → Precision@K, Recall@K, MRR

Layer 2 — FAITHFULNESS & HALLUCINATION ✅
          tests/test_faithfulness.py
          → Faithfulness score, Answer Relevancy score

Layer 3 — EDGE CASES ✅
          tests/test_edge_cases.py
          → Empty retrieval, out-of-scope, leading questions

Layer 4 — FULL FRAMEWORK ✅  ← You are here
          All layers combined, one command to run

Layer 5 — CI/CD AUTOMATION ← Up next (Part 6)
          Running automatically on every knowledge base change

🔖 Key Takeaways From Part 5

Configuration in one place — settings.py is your single source of truth for thresholds, model names, and API keys
Shared fixtures with scope="session" — build your KB and evaluator once per run, not once per test
Ground truth in JSON — decouples your test data from your test logic; update cases without touching code
Modular structure — retriever, evaluator, and pipeline are separate; swap implementations without rewriting tests
One entry point — run_tests.py gives you a clean interface for humans and CI/CD alike
Reports are first-class — every run produces a timestamped JSON report; your team can track quality over time
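
As a sketch of what tracking quality over time might look like, the snippet below computes a pass rate from report dictionaries shaped like the ones `run_tests.py` reads (a `summary` with `passed` and `total`). The `pass_rate` and `regressed` helpers and the 0.05 tolerance are illustrative assumptions, and comparing against a previous run is just one possible way to use the reports.

```python
def pass_rate(report: dict) -> float:
    """Share of tests that passed, based on the report's summary section."""
    summary = report.get("summary", {})
    total = summary.get("total", 0)
    return summary.get("passed", 0) / total if total else 0.0


def regressed(previous: dict, current: dict, tolerance: float = 0.05) -> bool:
    """True if the pass rate dropped by more than `tolerance` since the previous run."""
    return pass_rate(current) < pass_rate(previous) - tolerance


# Invented summaries standing in for two timestamped report files.
last_week = {"summary": {"passed": 9, "failed": 0, "total": 9}}
today = {"summary": {"passed": 7, "failed": 2, "total": 9}}

print(f"last week: {pass_rate(last_week):.0%}, today: {pass_rate(today):.0%}")
print("regression detected:", regressed(last_week, today))
```

In practice you would load the two dictionaries with `json.load` from files in the reports directory; the comparison logic stays the same.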

🚀 What's Next

In Part 6 — the final part — we automate everything.

Right now, you run the framework manually. That's useful. But it's not enough.

Every time someone updates the knowledge base, changes the system prompt, or swaps the LLM — the quality of your RAG system could silently change.

Part 6 wires this framework into a GitHub Actions CI/CD pipeline so tests run automatically on every relevant change. Quality regressions get caught before they reach users. The team gets notified. The deployment is blocked if tests fail.

Part 1 — What Is RAG & Why It Needs Different Testing       ✅ Done
Part 2 — Testing Retrieval Quality: Are You Fetching Right? ✅ Done
Part 3 — Faithfulness & Hallucination Detection             ✅ Done
Part 4 — Edge Cases: What Breaks RAG & How to Catch It      ✅ Done
Part 5 — Building a RAG Test Framework from Scratch         ← You are here
Part 6 — Automating RAG Quality Checks in CI/CD             ← Up next

Follow me so you don't miss Part 6 — the final piece that turns this framework into a fully automated quality gate. 🚀

Drop a comment below 👇

Have you tried running this framework yet? What did your scores look like?
Are you using a different vector DB (Pinecone, Weaviate, pgvector)? Drop a comment — I'll cover alternative adapters in a bonus post.
What would you add to this framework for your specific use case?

All questions welcome. Let's learn this together. 🙌

Faizal Shaikh | Senior Automation Engineer | AI & RAG-Based Testing
Connect with me on LinkedIn

Top comments (2)

Max Quimby • Jun 16

Pulling P@K/Recall@K/MRR, RAGAS faithfulness, and edge cases into one suite is the right move — but the piece I'd flag from running these in anger is test_cases.json. The ground-truth dataset rots faster than the framework code. Your corpus changes, a doc gets re-chunked, and suddenly your "correct" retrieval target points at a chunk that no longer exists. Curious whether you version the test set alongside the index, or have any drift check that flags when ground truth and corpus diverge.

The other thing I've learned the hard way: gate CI on regression vs a baseline, not absolute thresholds. "Faithfulness must be > 0.8" is brittle — it's either so loose it never catches anything or so tight it blocks every PR on noise. "Faithfulness dropped 5 points vs main" catches real regressions without the flakiness.

Looking forward to Part 6 on the CI/CD wiring — that's usually where the per-PR latency of running RAGAS becomes the real constraint. How long does a full suite run take you?

Faizal • Jun 16

Both of these are hard-won lessons and you've articulated them better than I did in the article.
On ground truth rot — I haven't versioned the test set alongside the index yet, but you've just convinced me that's the right move. The chunk re-ID problem is real and sneaky. A doc gets re-chunked after an update, the old chunk ID disappears, and your ground truth is now pointing at nothing — but the test still runs, just silently wrong. A drift check that validates every relevant_doc_id in test_cases.json actually exists in the current index before the suite runs would catch this early. Adding that to the framework as a pre-flight check.
On regression vs absolute thresholds — this is the better mental model and honestly should have been in Part 6. Absolute thresholds made sense for teaching the concept but you're right that in practice they're either too loose or too tight depending on the day. Baseline comparison — 'faithfulness dropped X points vs main' — is how you'd actually gate a PR without constant noise. I'm going to write this up properly.
Full suite currently takes 40-60 seconds on the demo dataset. The real constraint as you said is RAGAS evaluation latency — each LLM-as-judge call adds up fast at scale. The CI subset strategy in Part 6 helps but it's a band-aid. Async batch evaluation is the proper fix — worth a dedicated post.
Genuinely useful thread, thank you.