Faizal

Posted on Jun 12

RAG-Based Testing Series — Part 6: Automating RAG Quality Checks in CI/CD

#ai #python #testing #rag

"A test that only runs when you remember to run it isn't really a test. It's a hope."

We've built something real over this series.

In Part 2, we gave retrieval quality a number — Precision@K, Recall@K, MRR.

In Part 3, we gave hallucination detection a number — faithfulness scoring with RAGAS.

In Part 4, we tested the edge cases that break RAG systems in production.

In Part 5, we assembled all of that into a structured, reusable framework with one command to run everything.

But there's still a problem. 🔴

The framework only runs when someone decides to run it.

And in a real team, "someone will run the tests before deploying" is not a guarantee. It's an assumption. And assumptions fail at the worst possible moments.

Someone updates the knowledge base at 5pm on a Friday.
Someone tweaks the system prompt and doesn't realise it changed retrieval behaviour.
Someone upgrades the embedding model and the similarity scores shift quietly.

None of these trigger a test run. None of these get caught. And your users discover the regression before your team does.

Part 6 fixes this.

We're wiring the framework from Part 5 into a GitHub Actions CI/CD pipeline so that RAG quality checks run automatically — on every relevant change, without anyone having to remember. 🤖

🗺️ What We're Building

By the end of this article, you'll have:

.github/
└── workflows/
    └── rag_quality_checks.yml   ← GitHub Actions workflow

rag_test_framework/
├── ci/
│   └── post_summary.py
├── config/
│   └── settings.py
├── core/
│   ├── retriever.py
│   ├── evaluator.py
│   └── rag_pipeline.py
├── tests/
│   ├── conftest.py
│   ├── test_retrieval.py
│   ├── test_faithfulness.py
│   └── test_edge_cases.py
├── data/
│   ├── test_cases.json
│   └── test_cases_ci.json
├── reports/
│   └── (auto-generated, uploaded as CI artifacts)
├── run_tests.py
└── requirements.txt

The workflow will:

Trigger automatically on pushes that touch relevant files
Install dependencies
Run the full test suite
Upload the test report as a downloadable artifact
Post a summary to the GitHub Actions summary page
Block the pipeline if any test fails — no silent regressions

Let's build it step by step. 🛠️

⚙️ Step 1 — Store Secrets Safely

Your framework needs an OpenAI API key. You never hardcode secrets in a repository.

In GitHub:

Go to your repository → Settings → Secrets and variables → Actions
Click New repository secret
Name: OPENAI_API_KEY
Value: your actual OpenAI API key

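That's it. GitHub encrypts the secret and masks its value in logs. Your workflow reads it as ${{ secrets.OPENAI_API_KEY }}, so the key never appears in your code. One thing to know: GitHub does not pass repository secrets to workflows triggered by pull requests from forks, so those runs will stop at the settings check below.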

Now update config/settings.py to read from the environment variable (this already works locally too if you set it with export OPENAI_API_KEY=...):

# config/settings.py

import os

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

if not OPENAI_API_KEY:
    raise EnvironmentError(
        "OPENAI_API_KEY environment variable is not set.\n"
        "Set it locally with: export OPENAI_API_KEY=your-key\n"
        "In CI, add it as a GitHub Actions secret."
    )

Failing loudly with a clear message is better than failing cryptically with an authentication error three steps later. ✅

📄 Step 2 — The GitHub Actions Workflow

Create this file at .github/workflows/rag_quality_checks.yml:

name: RAG Quality Checks

on:
  push:
    paths:
      # Run when the test cases or the knowledge base change
      - 'rag_test_framework/data/**'
      # Run when any core framework code changes
      - 'rag_test_framework/core/**'
      # Run when configuration (thresholds, models) changes
      - 'rag_test_framework/config/**'
      # Run when tests themselves change
      - 'rag_test_framework/tests/**'
      # Run when dependencies change
      - 'rag_test_framework/requirements.txt'

  pull_request:
    paths:
      - 'rag_test_framework/data/**'
      - 'rag_test_framework/core/**'
      - 'rag_test_framework/config/**'
      - 'rag_test_framework/tests/**'
      - 'rag_test_framework/requirements.txt'

  # Allow manual trigger from the GitHub Actions UI
  workflow_dispatch:

jobs:
  rag-quality-checks:
    name: RAG Quality Checks
    runs-on: ubuntu-latest

    steps:
      # ── 1. Check out the repository ──────────────────────────
      - name: Checkout repository
        uses: actions/checkout@v4

      # ── 2. Set up Python ─────────────────────────────────────
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'   # cache pip installs between runs to speed up the workflow

      # ── 3. Install dependencies ───────────────────────────────
      - name: Install dependencies
        working-directory: rag_test_framework
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      # ── 4. Run the RAG test suite ─────────────────────────────
      - name: Run RAG quality checks
        working-directory: rag_test_framework
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          mkdir -p reports
          # No --json-report-summary here: that flag drops the per-test
          # details that post_summary.py reads from the report.
          pytest tests/ \
            -v \
            --tb=short \
            --json-report \
            --json-report-file=reports/rag_test_report.json

      # ── 5. Upload report as a downloadable artifact ───────────
      - name: Upload test report
        if: always()   # upload even if tests failed — you want the report either way
        uses: actions/upload-artifact@v4
        with:
          name: rag-test-report-${{ github.run_number }}
          path: rag_test_framework/reports/rag_test_report.json
          retention-days: 30

      # ── 6. Post summary to GitHub Actions summary page ────────
      - name: Post test summary
        if: always()
        working-directory: rag_test_framework
        run: python ci/post_summary.py reports/rag_test_report.json

📋 Step 3 — The Summary Script

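The workflow calls ci/post_summary.py to write a clean summary to GitHub's built-in job summary page. Because that step runs with rag_test_framework as its working directory, create the file at rag_test_framework/ci/post_summary.py: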

# ci/post_summary.py

import json
import os
import sys


def post_summary(report_path: str):
    """
    Read the pytest JSON report and write a markdown summary
    to the GitHub Actions step summary page (GITHUB_STEP_SUMMARY).
    """
    if not os.path.exists(report_path):
        print(f"Report not found at {report_path}")
        sys.exit(1)

    with open(report_path) as f:
        report = json.load(f)

    summary  = report.get("summary", {})
    passed   = summary.get("passed", 0)
    failed   = summary.get("failed", 0)
    error    = summary.get("error", 0)
    total    = summary.get("total", 0)
    duration = round(report.get("duration", 0), 2)

    # Determine overall status
    if failed > 0 or error > 0:
        status_icon  = "❌"
        status_label = "FAILED"
    else:
        status_icon  = "✅"
        status_label = "PASSED"

    # Build the markdown summary
    lines = [
        f"## {status_icon} RAG Quality Checks — {status_label}",
        "",
        "| Metric | Value |",
        "|--------|-------|",
        f"| Total tests | {total} |",
        f"| Passed      | {passed} |",
        f"| Failed      | {failed} |",
        f"| Duration    | {duration}s |",
        "",
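    ]
    if error > 0:
        # Errors (e.g. a failing fixture) also fail the run, so list them in the table
        lines.insert(-1, f"| Errors      | {error} |")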

    if failed > 0 or error > 0:
        lines.append("### ❌ Failed Tests")
        lines.append("")
        for test in report.get("tests", []):
            if test["outcome"] in ("failed", "error"):
                lines.append(f"- `{test['nodeid']}`")
                # Include the failure message if available
                if "call" in test and "longrepr" in test["call"]:
                    # Truncate long failure output for readability
                    longrepr = test["call"]["longrepr"]
                    preview  = longrepr[:500] + "..." if len(longrepr) > 500 else longrepr
                    lines.append(f"  ```\n  {preview}\n  ```")
        lines.append("")

    lines += [
        "### Test Breakdown",
        "",
        "| Test File | Tests | Status |",
        "|-----------|-------|--------|",
    ]

    # Group tests by file for the breakdown table
    file_results: dict = {}
    for test in report.get("tests", []):
        file_name = test["nodeid"].split("::")[0]
        if file_name not in file_results:
            file_results[file_name] = {"total": 0, "failed": 0}
        file_results[file_name]["total"] += 1
        if test["outcome"] in ("failed", "error"):
            file_results[file_name]["failed"] += 1

    for file_name, counts in file_results.items():
        icon = "✅" if counts["failed"] == 0 else "❌"
        lines.append(f"| `{file_name}` | {counts['total']} | {icon} |")

    summary_text = "\n".join(lines)

    # Write to GitHub step summary if running in CI
    github_summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if github_summary_path:
        with open(github_summary_path, "a", encoding="utf-8") as f:
            # Trailing newline so anything appended later starts on its own line
            f.write(summary_text + "\n")
        print("✅ Summary written to GitHub Actions step summary.")
    else:
        # Running locally — just print it
        print(summary_text)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python ci/post_summary.py <path-to-report.json>")
        sys.exit(1)
    post_summary(sys.argv[1])

🔍 Step 4 — Understanding the Trigger Strategy

The paths filter in the workflow is one of the most important design decisions here. Let me explain why it's set up this way.

on:
  push:
    paths:
      - 'rag_test_framework/data/**'      # knowledge base changed
      - 'rag_test_framework/core/**'      # retrieval or pipeline logic changed
      - 'rag_test_framework/config/**'    # thresholds or models changed
      - 'rag_test_framework/tests/**'     # tests themselves changed
      - 'rag_test_framework/requirements.txt'

Why not trigger on every push?

RAG quality tests are expensive. Each test run calls the OpenAI API for embeddings and RAGAS evaluation. Running on every push to every file — including README changes, frontend code, unrelated scripts — wastes time and money.

What actually warrants a RAG quality check?

| Change | Should trigger? | Why |
|--------|-----------------|-----|
| `data/test_cases.json` updated | ✅ Yes | Ground truth changed, so verify scores still hold |
| New document added to knowledge base | ✅ Yes | Retrieval behaviour may shift |
| `config/settings.py` thresholds changed | ✅ Yes | You're redefining what "passing" means |
| Embedding model changed | ✅ Yes | Similarity scores will shift |
| System prompt changed | ✅ Yes | LLM behaviour may change |
| README.md updated | ❌ No | Documentation only |
| Frontend code changed | ❌ No | No impact on RAG pipeline |

The paths filter implements this logic, provided the knowledge base, prompts, and model settings all live under the filtered directories. Only relevant changes trigger the quality gate. 🎯
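To sanity-check which commits would pass the filter, here is a minimal sketch that tests a list of changed files against the same patterns. It uses Python's fnmatch as an approximation; GitHub's own glob matcher has more rules, so treat this as a rough local check rather than a faithful reimplementation. The function names are illustrative and not part of the framework.

```python
from fnmatch import fnmatchcase

# Patterns copied from the workflow's `paths` filter.
PATH_FILTERS = [
    "rag_test_framework/data/**",
    "rag_test_framework/core/**",
    "rag_test_framework/config/**",
    "rag_test_framework/tests/**",
    "rag_test_framework/requirements.txt",
]


def matching_files(changed_files, patterns=PATH_FILTERS):
    """Return the changed files that match at least one filter pattern.

    Approximation: fnmatch lets `*` cross `/`, which is close enough for the
    simple `dir/**` and exact-file patterns used here, but it is not
    GitHub's matcher.
    """
    return [
        path for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


def would_trigger(changed_files, patterns=PATH_FILTERS):
    """True if at least one changed file matches a filter."""
    return bool(matching_files(changed_files, patterns))
```

Calling would_trigger with only README.md and frontend files returns False, while any list that includes a file under rag_test_framework/core/ returns True.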

🚦 Step 5 — What Happens When Tests Fail

This is important to understand clearly.

When pytest exits with a non-zero return code (i.e., any test fails), GitHub Actions automatically marks the job as failed. You don't need to add any special logic for this.

What that means in practice:

On a push to main:
The commit is recorded but the workflow run is marked ❌ Failed. Your team sees it immediately in the repository's commit history.

On a pull request:
The PR's status checks show ❌ RAG Quality Checks — Failed. You can configure branch protection rules to block merging until this passes.

Setting up branch protection (strongly recommended):

Go to repository → Settings → Branches
Add a branch protection rule for main
Enable Require status checks to pass before merging
Add RAG Quality Checks as a required check

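Now no one can merge a change that breaks RAG quality, whether by accident or under deadline pressure. The gate is automated. 🔒 One caveat: when a paths filter skips the workflow, GitHub leaves the required check pending, so a pull request that touches none of the filtered paths cannot merge until you handle that case (for example, by running the workflow on every pull request and skipping only the expensive test step).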
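The gate depends entirely on the exit code reaching GitHub Actions. The workflow above calls pytest directly, so that happens automatically. If you later swap in a wrapper script such as run_tests.py, the wrapper has to pass pytest's exit code back to the shell, or a failing suite will show up green. A minimal sketch, where run_and_propagate and main are hypothetical helpers and not the published runner:

```python
import subprocess
import sys


def run_and_propagate(command):
    """Run a command and return its exit code unchanged."""
    return subprocess.run(command).returncode


def main():
    # Hypothetical wrapper body: run the suite and hand back pytest's exit code.
    return run_and_propagate(
        [sys.executable, "-m", "pytest", "tests/", "-v", "--tb=short"]
    )

# In a real wrapper script, finish with:
#     if __name__ == "__main__":
#         sys.exit(main())
```

The important part is the final sys.exit(main()): a wrapper that prints results and then returns normally always exits with 0.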

📊 Step 6 — Viewing Results

After a workflow run you have three places to check results:

1. GitHub Actions job logs
Full pytest output, line by line. Best for debugging a specific failure.

2. GitHub Actions step summary
The clean markdown table from post_summary.py. Best for a quick pass/fail overview. Visible directly on the workflow run page without opening logs.

3. Downloaded artifact
The full rag_test_report.json. Best for tracking scores over time or doing deeper analysis. Download it from the workflow run's Artifacts section.
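As a starting point for tracking results over time, here is a small sketch that reads a folder of downloaded reports and produces one row per run. It only uses the summary counts that post_summary.py already reads, so it tracks pass rate rather than raw metric scores. The folder name you pass in is up to you, and the "created" timestamp field is an assumption about the report format; the code tolerates its absence.

```python
import json
from pathlib import Path


def summarize_reports(report_dir):
    """Return one row per JSON report, ordered by timestamp when available."""
    rows = []
    for path in sorted(Path(report_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            report = json.load(f)
        summary = report.get("summary", {})
        total = summary.get("total", 0)
        passed = summary.get("passed", 0)
        rows.append({
            "file": path.name,
            "created": report.get("created"),  # assumed field; may be None
            "passed": passed,
            "total": total,
            "pass_rate": round(passed / total, 3) if total else None,
        })
    # Reports without a timestamp sort first
    rows.sort(key=lambda row: row["created"] or 0)
    return rows
```

Each row can then be printed, written to CSV, or plotted with whatever tool you prefer.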

💰 Step 7 — Managing API Costs in CI

Running RAGAS evaluations in CI means calling the OpenAI API on every trigger. Here's how to keep costs under control.

Use a Smaller Evaluation Dataset in CI

Your full ground truth dataset might have 50+ test cases. In CI, you don't need to run all of them on every push.

Create a separate, smaller CI dataset:

// data/test_cases_ci.json
{
  "knowledge_base": [ ... ],

  "retrieval_test_cases": [
    // Keep your 5 highest-signal retrieval cases
    // These should represent the most common and most critical query types
  ],

  "faithfulness_test_cases": [
    // 3-4 cases that cover your main faithfulness scenarios
  ],

  "edge_case_queries": {
    "out_of_scope":    ["What is the capital of France?"],
    "empty_retrieval": ["What is the pricing for the enterprise plan?"],
    "leading_questions": [ ... ]
  }
}
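The outline above is illustrative: a real file cannot contain the // comments or the [ ... ] placeholders, because json.load rejects them. One way to produce a valid subset is to generate it from the full dataset. The sketch below assumes the cases in test_cases.json are listed most-important-first, so slicing keeps the highest-signal ones; if yours are not ordered that way, pick the cases by hand instead. build_ci_subset and write_ci_subset are hypothetical helpers, not part of the framework from Part 5.

```python
import json


def build_ci_subset(full, max_retrieval=5, max_faithfulness=4):
    """Return a smaller copy of the dataset for CI runs.

    Assumption: cases are ordered by importance, so slicing keeps the most
    valuable ones. The knowledge base is kept whole so retrieval still
    searches the same documents.
    """
    edge = full.get("edge_case_queries", {})
    return {
        "knowledge_base": list(full.get("knowledge_base", [])),
        "retrieval_test_cases": full.get("retrieval_test_cases", [])[:max_retrieval],
        "faithfulness_test_cases": full.get("faithfulness_test_cases", [])[:max_faithfulness],
        # Keep one query per edge-case category
        "edge_case_queries": {name: queries[:1] for name, queries in edge.items()},
    }


def write_ci_subset(full_path="data/test_cases.json",
                    ci_path="data/test_cases_ci.json"):
    """Read the full dataset and write the CI subset next to it."""
    with open(full_path, encoding="utf-8") as f:
        full = json.load(f)
    with open(ci_path, "w", encoding="utf-8") as f:
        json.dump(build_ci_subset(full), f, indent=2)
```

Regenerating the subset whenever the full dataset changes keeps the two files from drifting apart.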

Then in conftest.py, read from an environment variable to decide which dataset to use:

# tests/conftest.py

import json
import os
import pytest
from core.retriever import build_collection
from core.evaluator import build_evaluator

@pytest.fixture(scope="session")
def test_data():
    # In CI, use the smaller dataset. Locally, use the full one.
    ci_mode       = os.environ.get("CI", "false").lower() == "true"
    dataset_path  = "data/test_cases_ci.json" if ci_mode else "data/test_cases.json"

    with open(dataset_path) as f:
        return json.load(f)

@pytest.fixture(scope="session")
def collection(test_data):
    kb = test_data["knowledge_base"]
    return build_collection(
        collection_name="rag_test_kb",
        documents=[doc["text"] for doc in kb],
        doc_ids=[doc["id"] for doc in kb]
    )

@pytest.fixture(scope="session")
def evaluator():
    llm, embeddings = build_evaluator()
    return llm, embeddings

GitHub Actions automatically sets CI=true in every workflow run — no extra configuration needed.

Result: CI runs a fast, cost-efficient subset. Full runs happen locally or on scheduled nightly jobs (see below). ✅

🌙 Step 8 — Scheduled Full Runs

For a complete quality audit, run the full dataset on a schedule, not just on push:

# Add this to the `on:` section of your workflow

  schedule:
    # Run every day at 2 AM UTC
    # This uses the full dataset, not the CI subset
    - cron: '0 2 * * *'

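And in your workflow, override CI so that conftest.py picks the full dataset on scheduled runs. GitHub currently allows a step to override CI, but its documentation says this is not guaranteed to remain possible: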

      - name: Run RAG quality checks
        working-directory: rag_test_framework
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          # Override CI mode for scheduled runs — use full dataset
          CI: ${{ github.event_name != 'schedule' && 'true' || 'false' }}
        run: |
          mkdir -p reports
          pytest tests/ -v --tb=short --json-report --json-report-file=reports/rag_test_report.json

The result:

| Trigger | Dataset | Purpose |
|---------|---------|---------|
| Push / PR | `test_cases_ci.json` (small) | Fast gate: catch obvious regressions |
| Scheduled (nightly) | `test_cases.json` (full) | Full quality audit: track score trends |
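If you would rather not tie the dataset choice to the CI flag, a dedicated variable does the same job. The sketch below is one possible shape for the lookup inside conftest.py. RAG_DATASET is a hypothetical variable name introduced here for illustration; it is not part of the original framework.

```python
import os


def resolve_dataset_path(environ=None):
    """Pick the dataset file from environment variables.

    RAG_DATASET (hypothetical) forces a dataset when set to "full" or "ci".
    Without it, fall back to the CI flag that GitHub Actions sets.
    """
    environ = os.environ if environ is None else environ
    choice = environ.get("RAG_DATASET", "").strip().lower()
    if choice == "full":
        return "data/test_cases.json"
    if choice == "ci":
        return "data/test_cases_ci.json"
    ci_mode = environ.get("CI", "false").lower() == "true"
    return "data/test_cases_ci.json" if ci_mode else "data/test_cases.json"
```

With this in place, the scheduled step would set RAG_DATASET to full via the same github.event_name expression and leave CI untouched.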

🧩 The Complete Final Architecture

Here's the full picture — everything we've built across all six parts:

.github/workflows/                 ← lives at the repository root
└── rag_quality_checks.yml         ← CI/CD trigger + orchestration

rag_test_framework/
│
├── ci/
│   └── post_summary.py            ← GitHub Actions summary writer
│
├── config/
│   └── settings.py                ← all thresholds, model names, API keys
│
├── core/
│   ├── retriever.py               ← retrieval + Precision@K, Recall@K, MRR
│   ├── evaluator.py               ← RAGAS faithfulness + answer_relevancy
│   └── rag_pipeline.py            ← end-to-end RAG call
│
├── tests/
│   ├── conftest.py                ← shared session-scoped fixtures
│   ├── test_retrieval.py          ← Part 2 tests
│   ├── test_faithfulness.py       ← Part 3 tests
│   └── test_edge_cases.py         ← Part 4 tests
│
├── data/
│   ├── test_cases.json            ← full ground truth dataset
│   └── test_cases_ci.json         ← smaller CI subset
│
├── reports/
│   └── (timestamped JSON reports)
│
├── run_tests.py                   ← local single-command runner
└── requirements.txt

One push. One workflow. Automated quality gate on every relevant change. 🎯

✅ End-to-End Flow — What Happens on Every Relevant Push

Let's walk through exactly what happens when a developer updates data/test_cases.json:

Developer pushes a commit that updates data/test_cases.json
    │
    ▼
GitHub detects the push matches a path filter
    │
    ▼
GitHub Actions spins up ubuntu-latest runner
    │
    ▼
Python 3.11 installed, pip cache restored
    │
    ▼
pip install -r requirements.txt
    │
    ▼
pytest tests/ runs with CI=true (uses test_cases_ci.json)
    │
    ├── test_retrieval.py   — Precision@K, Recall@K, MRR asserted
    ├── test_faithfulness.py — Faithfulness, no critical hallucinations
    └── test_edge_cases.py  — Empty retrieval, out-of-scope, leading questions
    │
    ▼
rag_test_report.json written to reports/
    │
    ▼
post_summary.py writes markdown table to GitHub step summary
    │
    ▼
Report uploaded as downloadable artifact (kept 30 days)
    │
    ├── All tests pass → ✅ Pipeline green, PR can merge
    └── Any test fails → ❌ Pipeline blocked, team notified

🔖 Key Takeaways From Part 6

Automation removes the "someone will remember" assumption — the gate runs regardless of deadline pressure or human error
paths filtering keeps costs under control — only trigger on changes that can actually affect RAG quality
Separate CI and full datasets — fast feedback on push, deep audit on schedule
scope="session" fixtures + CI dataset = fast CI runs — no repeated expensive setup, no unnecessary API calls
Branch protection rules complete the gate — automated tests mean nothing if merging is still allowed when they fail
Reports as artifacts — every run is recorded; you can track quality score trends over time
if: always() on artifact upload — you always want the report, especially when tests fail

🏁 Series Complete — What You've Built

Let's take a moment to look at how far we've come.

Six parts ago, most engineers testing RAG systems had no framework, no metrics, and no automated gate. They were hoping the final answer "looked right."

You now have something completely different. 👇

Part 1 ✅ — Understood what RAG is and why traditional testing breaks down
Part 2 ✅ — Gave retrieval quality a number: Precision@K, Recall@K, MRR
Part 3 ✅ — Gave hallucination detection a number: faithfulness scoring with RAGAS
Part 4 ✅ — Tested the edge cases that break RAG systems in production
Part 5 ✅ — Assembled everything into a structured, reusable framework
Part 6 ✅ — Automated the framework in CI/CD with GitHub Actions

You can plug this into any RAG system. Swap the vector database. Swap the LLM. The tests stay the same. The gate stays active. The quality stays measurable. 🎯

This is what production-grade RAG testing looks like.

🚀 What's Next — Beyond This Series

The framework you've built is a foundation, not a ceiling. Here's where to take it from here:

NDCG implementation — We covered NDCG conceptually in Part 2. Adding a proper implementation using sklearn.metrics.ndcg_score is a natural next step for more sophisticated retrieval ranking tests.

Alternative vector databases — The framework currently uses ChromaDB. If your production system uses Pinecone, Weaviate, or pgvector, the only change is in core/retriever.py. The tests are untouched.

Score trend tracking — Each run produces a JSON report. Building a simple script to parse historical reports and plot score trends over time will tell you if your RAG quality is improving or degrading with each knowledge base update.

Latency testing — We tested quality but not speed. Retrieval latency and end-to-end response time are worth measuring, especially as your knowledge base grows.

More RAGAS metrics: RAGAS ships metrics beyond faithfulness and answer relevancy. Context precision and context recall are worth exploring as your test suite matures, and custom metrics are an option after that.
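For the NDCG item above, here is a dependency-free sketch of the calculation, useful for understanding the metric before reaching for sklearn.metrics.ndcg_score. It uses linear gain (the relevance score itself) with a log2 rank discount. It also computes the ideal ranking from the retrieved list only, which is a simplification: a stricter version would build the ideal from every relevant document in the ground truth.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.

    `relevances` holds graded relevance scores in the order the retriever
    returned the documents. Returns 0.0 when nothing is relevant.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg
```

A ranking that already lists documents in descending relevance scores 1.0; pushing a relevant document further down the list lowers the score.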

Thank you for following this series all the way to the end. 🙏

Every part was built with real QA engineering principles — not just AI hype. The goal was always to make RAG testing feel like engineering, not magic.

I hope it does. 🎯

Drop a comment below 👇

Have you wired this into your own CI/CD pipeline? How did it go?
Which part of the series was most useful for your specific situation?
What would you like me to cover next — NDCG implementation, alternative vector DBs, score trend tracking?

All questions and feedback welcome. Let's keep building. 🙌

Faizal Shaikh | Senior Automation Engineer | AI & RAG-Based Testing
Connect with me on LinkedIn

Top comments (1)

Alex Shev • Jun 12

This is the step that makes RAG testing real. Retrieval quality can drift from content changes, prompt edits, embedding upgrades, or even chunking tweaks, so tying tests only to app code misses the actual risk surface. The useful CI trigger is “anything that can change the answer,” not just “anything that changes the code.”