Faizal

Posted on Jun 10

RAG-Based Testing Series — Part 1: What Is RAG & Why Your Old Testing Playbook Won't Work Here

#testing #ai #rag #automation

"We test what we understand. And most of us don't understand RAG yet."

I've spent 7.5 years in QA automation. API testing, performance testing, UI automation, database validation — you name it, I've tested it.

But nothing prepared me for the moment I had to test an AI system for the first time.

Not because it was complicated.

Because everything I knew about testing — suddenly didn't apply.

This series is my attempt to fix that. Not just for me. For every QA engineer, automation engineer, or developer who is about to walk into the world of AI-powered systems and realize the rulebook has changed.

By the end of this series, we will go from zero to a fully automated RAG-based test framework — built from scratch, step by step, in plain language that anyone can follow.

But before we build anything — we need to understand what we're dealing with.

Let's start from the very beginning. 🎯

🤖 First — What Even Is an LLM?

LLM stands for Large Language Model.

Think of it as an incredibly well-read entity that has consumed billions of pages of text — books, articles, websites, research papers — and learned patterns from all of it.

When you ask it a question, it doesn't "look up" the answer like a search engine. It generates an answer based on everything it has learned.

Examples you already know: ChatGPT, Claude, Gemini, Llama.

They're impressive. But they have one massive problem. 👇

🚨 The Problem With LLMs Alone

LLMs are trained on data up to a certain point in time. After that — they know nothing new.

Ask ChatGPT about something that happened last week? It might not know.

Ask it about your company's internal documents, your product's knowledge base, your customer support policies? It has absolutely no idea.

And worse — when it doesn't know something, it doesn't always say "I don't know."

Sometimes it just makes something up — confidently, fluently, and completely wrong.

This is called a hallucination. 👻

And in a production AI system, a hallucination isn't just embarrassing — it can be catastrophic.

So how do we fix this? Enter RAG. 👇

🔍 What Is RAG?

RAG stands for Retrieval-Augmented Generation.

Let me break that down word by word:

Retrieval — Fetch relevant information from somewhere
Augmented — Add that information to the context
Generation — Now generate an answer based on that context

In plain English 👇

Instead of relying only on what the AI already knows, RAG gives the AI fresh, relevant, specific information right before it answers — so the answer is grounded in real, up-to-date data.

🏫 The Classroom Analogy

Imagine two types of students taking an exam:

Student A (LLM without RAG):
Studies hard before the exam. Walks in. Answers purely from memory. If the question is outside what they studied — they guess. Sometimes confidently wrong.

Student B (LLM with RAG):
Walks into the exam with an open-book policy. Before answering, they quickly flip to the relevant page, read the context, and then write their answer — grounded in what's actually there.

Student B will almost always give a more accurate, reliable answer.

That's RAG. The AI gets to "open the book" before it responds. 📖

⚙️ How Does RAG Actually Work? Step by Step.

Let's get a little more technical — but still simple. Here's what happens under the hood every time a user asks a RAG-powered system a question:

Step 1 — The User Asks a Question

"What is the refund policy for premium subscribers?"

Step 2 — The Query Goes to a Retriever

The system doesn't immediately ask the LLM. First, it searches a knowledge base — a collection of documents, FAQs, manuals, policies — for the most relevant pieces of information.

This search is usually not plain keyword matching. It uses vector embeddings: numeric representations of text in which passages with similar meaning end up close together, so the match is on meaning, not just words.

So even if the document says "Premium members can get their money back within 30 days" — the retriever understands that this is relevant to a question about "refund policy for premium subscribers."
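To make "matching on meaning" a bit more concrete, here is a minimal sketch of how a retriever could rank chunks by cosine similarity. The 3-number vectors are hand-written and purely hypothetical (a real embedding model produces vectors with hundreds of dimensions), and the shipping sentence is a made-up second chunk for contrast.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", written by hand for illustration only.
query_vec = [0.9, 0.1, 0.0]  # "refund policy for premium subscribers"
chunks = {
    "Premium members can get their money back within 30 days": [0.8, 0.2, 0.1],
    "Standard shipping takes 5 to 7 business days": [0.1, 0.1, 0.9],
}

# Rank chunks from most to least similar to the query vector.
ranked = sorted(
    chunks,
    key=lambda text: cosine_similarity(query_vec, chunks[text]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```

The refund chunk ranks first even though it never uses the words "refund" or "policy", because its vector points in nearly the same direction as the query's.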

Step 3 — Top Documents Are Retrieved

The retriever returns the most relevant chunks of text. For example:

"Premium subscribers are eligible for a full refund within 30 days 
of purchase. Requests must be submitted via the support portal."

Step 4 — Context Is Passed to the LLM

The retrieved documents are now injected into the prompt that goes to the LLM:

Context: "Premium subscribers are eligible for a full refund within 
30 days of purchase. Requests must be submitted via the support portal."

Question: "What is the refund policy for premium subscribers?"
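In code, this "injection" is often nothing more than string formatting. A minimal sketch, assuming a hypothetical helper called build_prompt and an instruction line of my own wording (real systems use their own templates):

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n".join(retrieved_chunks)
    return (
        # The instruction wording below is an assumption, not a standard.
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}"
    )


chunk = (
    "Premium subscribers are eligible for a full refund within 30 days "
    "of purchase. Requests must be submitted via the support portal."
)
prompt = build_prompt("What is the refund policy for premium subscribers?", [chunk])
print(prompt)
```

From a testing point of view, the prompt is just another output you can inspect: if the right chunk never made it into this string, the LLM never had a chance.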

Step 5 — The LLM Generates a Grounded Answer

Now the LLM answers — but instead of guessing from memory, it uses the provided context:

"Premium subscribers can request a full refund within 30 days of 
purchase. You'll need to submit your request through the support portal."

Clean. Accurate. Grounded. ✅

🏗️ The Architecture at a Glance

Here's the full RAG pipeline visually:

User Query
    │
    ▼
[Embedding Model] — converts query to vector
    │
    ▼
[Vector Database] — finds most similar document chunks
    │
    ▼
[Retrieved Context] — top N relevant chunks
    │
    ▼
[LLM Prompt] — question + context combined
    │
    ▼
[LLM] — generates final answer
    │
    ▼
Final Response to User

Each one of these steps is a potential point of failure.
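One way to make each stage testable is to have the pipeline return everything it produced along the way, not just the final answer. The sketch below is a rough illustration under that assumption; RagTrace, run_rag, and the two fake_ functions are hypothetical names standing in for a real vector search and a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagTrace:
    """Everything one request produced, stage by stage."""
    query: str
    retrieved: List[str]
    prompt: str
    answer: str


def run_rag(query: str,
            retrieve: Callable[[str], List[str]],
            generate: Callable[[str], str]) -> RagTrace:
    retrieved = retrieve(query)  # embedding + vector search in a real system
    prompt = "Context: " + "\n".join(retrieved) + "\n\nQuestion: " + query
    answer = generate(prompt)    # LLM call in a real system
    return RagTrace(query, retrieved, prompt, answer)


# Hypothetical stand-ins so the sketch runs without a vector DB or an LLM.
def fake_retrieve(query: str) -> List[str]:
    return ["Premium subscribers are eligible for a full refund within 30 days of purchase."]


def fake_generate(prompt: str) -> str:
    return "Premium subscribers can request a full refund within 30 days of purchase."


trace = run_rag(
    "What is the refund policy for premium subscribers?",
    fake_retrieve,
    fake_generate,
)
print(trace.retrieved)
print(trace.answer)
```

With a trace like this, a test can check the retrieved chunks, the prompt, and the answer separately, which is what lets you tell a retrieval bug apart from a generation bug.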

And that's exactly where testing comes in. 👇

🧪 Why Does RAG Need a Different Testing Approach?

This is the core of this entire series. Read this carefully.

Traditional Software Testing

In traditional software:

Input: "GET /api/user/123"
Expected Output: { "id": 123, "name": "Jass", "role": "admin" }
Actual Output:   { "id": 123, "name": "Jass", "role": "admin" }
Result: ✅ PASS

The output is deterministic. Same input, same output. Every single time. Writing assertions is straightforward. Automation is clean.

RAG System Testing

In a RAG system:

Input: "What is the refund policy?"

Run 1: "You can get a full refund within 30 days via the support portal."
Run 2: "Premium members are entitled to refunds within a 30-day window."
Run 3: "Refund requests for premium subscribers should be raised on the portal within 30 days."

All three are: ✅ Correct. ✅ Different. ✅ Valid.

The output is non-deterministic. LLMs typically sample their wording token by token, so the same question can produce different, yet equally valid, answers.

Your assertEqual won't work here.
Your exact string matching won't work here.
Your traditional assertion libraries won't work here.

You need an entirely new way to think about what "correct" even means. 🤯
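As a first taste of what that shift looks like, here is a small sketch using the three runs above. Instead of comparing whole strings, it checks for the facts that matter. The function name and the two regex patterns are my own illustrative choices, and keyword checks like this are only a crude starting point.

```python
import re

runs = [
    "You can get a full refund within 30 days via the support portal.",
    "Premium members are entitled to refunds within a 30-day window.",
    "Refund requests for premium subscribers should be raised on the portal within 30 days.",
]


def mentions_required_facts(answer):
    """Check for the facts that matter instead of comparing whole strings."""
    has_refund = re.search(r"refund", answer, re.IGNORECASE) is not None
    has_window = re.search(r"\b30[- ]days?\b", answer) is not None
    return has_refund and has_window


# Exact matching against Run 1 only accepts Run 1 itself.
exact_matches = [answer == runs[0] for answer in runs]
# The fact-based check accepts all three phrasings.
fact_checks = [mentions_required_facts(answer) for answer in runs]

print(exact_matches)
print(fact_checks)
```

Exact matching rejects two perfectly good answers, while the fact-based check accepts all three and would still reject an answer that says "60 days".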

🔴 What Can Actually Go Wrong in a RAG System?

Let me walk you through the real failure modes — the ones that will reach production if nobody tests them properly.

❌ Retrieval Failure

The wrong documents are retrieved. The answer that follows is based on irrelevant or incorrect context.

Example: User asks about refund policy. The retriever returns shipping policy instead.
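A simple way to catch this in a test is to ask whether the document you expected shows up in the top k results. This is a minimal sketch; the document IDs are hypothetical labels, and hit_at_k is just a name I picked for the check.

```python
def hit_at_k(retrieved_ids, expected_id, k):
    """True if the expected document is among the first k retrieved results."""
    return expected_id in retrieved_ids[:k]


# Hypothetical retriever output for "What is the refund policy?",
# ordered from most to least similar.
retrieved_ids = ["shipping-policy", "refund-policy", "privacy-policy"]

top1_ok = hit_at_k(retrieved_ids, "refund-policy", k=1)
top3_ok = hit_at_k(retrieved_ids, "refund-policy", k=3)
print(top1_ok, top3_ok)
```

Here the refund policy is retrieved, but not in first place. Whether that counts as a failure depends on how many chunks your pipeline actually passes to the LLM.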

❌ Hallucination Despite Context

The LLM ignores the retrieved context and generates an answer from its own "memory" — which may be outdated or wrong.

Example: Context says "30-day refund." LLM says "60-day refund" anyway.
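One cheap heuristic for this specific example is to flag any number in the answer that never appears in the context. It is only a rough sketch (it will miss paraphrased or non-numeric hallucinations), and unsupported_numbers is a hypothetical helper name.

```python
import re


def unsupported_numbers(answer, context):
    """Return numbers that appear in the answer but nowhere in the context."""
    context_numbers = set(re.findall(r"\d+", context))
    return [n for n in re.findall(r"\d+", answer) if n not in context_numbers]


context = "Premium subscribers are eligible for a full refund within 30 days of purchase."

grounded = unsupported_numbers("You can request a refund within 30 days.", context)
hallucinated = unsupported_numbers("You can request a refund within 60 days.", context)
print(grounded, hallucinated)
```

The grounded answer comes back with an empty list, while the "60 days" answer gets flagged because 60 is nowhere in the retrieved context.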

❌ Partial Retrieval

Only half the relevant information is retrieved. The answer is incomplete and potentially misleading.

Example: Refund policy has two conditions. Only one chunk is retrieved. User gets half the information.
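If you know which chunks a complete answer needs, you can measure how many of them were actually retrieved. A minimal sketch, assuming hypothetical chunk IDs and a simplified recall calculation (evaluation frameworks define their own, more elaborate versions of this idea):

```python
def chunk_recall(retrieved_ids, required_ids):
    """Fraction of the required chunks that were actually retrieved."""
    required = set(required_ids)
    if not required:
        return 1.0  # nothing was required, so nothing is missing
    return len(required & set(retrieved_ids)) / len(required)


# Hypothetical IDs: the refund policy is split into two chunks.
required_ids = ["refund-window", "refund-portal"]
retrieved_ids = ["refund-window", "shipping-times"]

recall = chunk_recall(retrieved_ids, required_ids)
print(recall)
```

A recall of 0.5 here is exactly the "user gets half the information" situation, turned into a number a test can put a threshold on.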

❌ Context Confusion

Too many documents retrieved. The LLM gets confused by conflicting or overwhelming context and produces a muddled answer.

❌ Silent Failure

The knowledge base has no relevant document. Instead of saying "I don't know" — the system confidently fabricates an answer.

This is the most dangerous failure mode of all. 😬
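One common guard is to refuse to answer when nothing retrieved is similar enough to the question. The sketch below assumes the retriever returns (text, score) pairs; the 0.5 threshold, the NO_ANSWER wording, and the function names are all hypothetical and would need tuning for a real system.

```python
NO_ANSWER = "I don't know based on the available documents."


def answer_or_abstain(scored_chunks, generate, min_score=0.5):
    """Only call the LLM when at least one chunk clears the similarity bar."""
    relevant = [text for text, score in scored_chunks if score >= min_score]
    if not relevant:
        return NO_ANSWER
    return generate(relevant)


# Hypothetical stand-in for the LLM call.
def fake_generate(chunks):
    return "Answer based on: " + chunks[0]


weak = answer_or_abstain([("Standard shipping takes 5 to 7 business days", 0.12)], fake_generate)
strong = answer_or_abstain([("Full refund within 30 days of purchase", 0.91)], fake_generate)
print(weak)
print(strong)
```

The matching test case is the interesting part: ask a question the knowledge base cannot answer and assert that the system abstains instead of inventing something.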

❌ Staleness

The knowledge base hasn't been updated. The retrieved documents are outdated. The answer is factually wrong — but the system doesn't know that.
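The system can't know a document is stale unless something tells it. A simple check, assuming each document carries a last_updated date in its metadata (a hypothetical field name, along with the 90-day limit and the sample dates below):

```python
from datetime import date, timedelta


def stale_documents(docs, today, max_age_days=90):
    """Return the IDs of documents last updated more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return [doc["id"] for doc in docs if doc["last_updated"] < cutoff]


# Hypothetical knowledge base metadata.
docs = [
    {"id": "refund-policy", "last_updated": date(2024, 5, 1)},
    {"id": "shipping-policy", "last_updated": date(2023, 1, 15)},
]

stale = stale_documents(docs, today=date(2024, 6, 1))
print(stale)
```

A check like this can run on a schedule, or on the retrieved chunks at query time, so that old content at least raises a flag before it turns into a wrong answer.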

💡 What Does "Testing" Mean in This World?

If traditional assertions don't work — what does work?

Here's a preview of the testing dimensions we'll cover in this series:

What We Test	What We're Checking
Retrieval Quality	Are the right documents being fetched?
Answer Relevance	Is the answer actually related to the question?
Faithfulness	Is the answer grounded in the retrieved context?
Completeness	Does the answer cover all the necessary information?
Hallucination Detection	Is the model making things up?
Edge Case Handling	What happens when there's no relevant document?
Pipeline Regression	Did a knowledge base update break anything?
Latency & Performance	Is the retrieval + generation fast enough?

Each one of these gets covered somewhere in this series. 🗂️

🚀 What's Coming in This Series

Think of this series as a complete workshop.

We're not just talking theory. We're building.

Here's the roadmap 👇

Part 1 — What Is RAG & Why It Needs Different Testing       ← You are here
Part 2 — Testing Retrieval Quality: Are You Fetching Right?
Part 3 — Faithfulness & Hallucination Detection
Part 4 — Edge Cases: What Breaks RAG & How to Catch It
Part 5 — Building a RAG Test Framework from Scratch
Part 6 — Automating RAG Quality Checks in CI/CD

By Part 5, you'll have a working test framework — in code — that you can plug into any RAG system.

By Part 6, that framework will be running automatically in your CI/CD pipeline — catching RAG failures before they reach your users.

We're going from zero to automated AI quality assurance. 🏗️➡️🤖

🎯 Who Is This Series For?

QA engineers who are being asked to test AI systems and don't know where to start
Automation engineers who want to expand into AI testing
Developers building RAG systems who want to understand how to validate them
Anyone curious about what quality assurance looks like in the age of AI

No AI/ML background required. No data science degree needed.

If you can write a test case — you can learn RAG-based testing. I'll make sure of that. 💪

🔖 Before You Go

RAG is not the future. It's the present.

Right now, companies are shipping RAG-powered chatbots, internal knowledge tools, customer support systems, and AI copilots — and most of them have never been properly tested.

The engineers who understand how to test these systems are rare. And that gap is only going to grow as AI adoption accelerates.

This series is your head start. 🚀

Follow me so you don't miss Part 2 — where we get hands-on with testing retrieval quality, look at real metrics like NDCG and MRR, and write our first actual RAG test.

Drop a comment below 👇

Are you already working with RAG systems?
What's the biggest challenge you've faced testing AI?
Or are you completely new to this — just getting started?

All levels welcome. Let's learn this together. 🙌

Faizal Shaikh | Senior Automation Engineer | AI & RAG-Based Testing
Connect with me on LinkedIn

Top comments (5)

Alex Shev • Jun 11

This is where RAG testing starts to look like normal engineering again: define the failure modes, make retrieval quality observable, and keep the checks close to the workflow.

For agentic tools, I would rather see a small terminal command that proves retrieval quality than a dashboard full of vibes after the fact.

Faizal • Jun 11

Exactly this, Alex. Making failure modes observable is the whole game. The moment retrieval quality becomes a measurable signal in your pipeline, it stops being 'AI magic' and starts being engineering.
And fully agree on agentic tools: a clean CLI output that fails loudly beats any dashboard. That's actually the direction I'm heading with the framework in Part 5. Would love your take once that drops.

Alex Shev • Jun 11

Would be happy to read it when Part 5 lands. The CLI direction makes sense because it forces the framework to expose signals engineers can act on: missing sources, weak match, stale context, threshold failed.

Dashboards are useful later, but the first win is making the failure impossible to ignore in the workflow where it happens.

Kevin • Jun 10

Really strong introduction to RAG testing 👏 The comparison between traditional testing and non-deterministic AI responses explains the challenge perfectly. I also liked how you broke down the different failure modes in such a practical way. A lot of teams still underestimate how quickly hallucinations or retrieval issues can become production problems.

I’m curious: what tools or frameworks are you currently using for automated faithfulness and hallucination evaluation?

Faizal • Jun 10

Thank you! 🙌 Really glad the failure modes section landed. That's exactly where most teams get caught off guard in prod.
On the question of tools: I'm actually covering faithfulness evaluation and hallucination detection hands-on in Part 3 of this series. Short answer: I work with RAGAS as an evaluation framework, and I'll be breaking down exactly how to set it up and interpret the scores.
Stay tuned. Part 2 drops soon on retrieval quality first, then we get into faithfulness. Would love your take once it's out! 🚀