I've seen a lot of RAG setups that look great in demos. Clean answers, fast responses, confident tone. Then something breaks silently and nobody notices until a user gets a wrong answer served with full confidence.
The problem isn't the LLM. It's that the RAG layer rarely gets tested properly.
Here's the 5-test framework I use. Each one catches a different class of failure.
1. Faithfulness — Is the answer grounded in your docs?
What breaks: The model answers correctly... but pulls from its training memory, not your knowledge base. You can't audit it. You can't trust it.
How to test: Use an LLM-as-judge to map every claim in the answer back to a retrieved chunk. Score = supported_claims / total_claims. Set a threshold and fail anything below it.
Answer: "Minimum password length is 16 characters"
Chunk: "Passwords must be a minimum of 16 characters..."
→ Claim supported ✓
If the answer says "12 characters (industry standard)" and your doc says 16, that's hallucination. Even if 12 is technically reasonable.
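The scoring step can be sketched roughly like this. It assumes the LLM judge has already labelled each claim with the chunk that supports it (or `null`); the `ClaimVerdict` shape, the function names, and the 0.8 default threshold are all illustrative assumptions, not part of any particular library.

```typescript
// Hypothetical shape for one judged claim. The judge itself (an LLM call) is out of scope here.
interface ClaimVerdict {
  claim: string;
  supportingChunkId: string | null; // null = no retrieved chunk supports this claim
}

// Score = supported_claims / total_claims.
function faithfulnessScore(verdicts: ClaimVerdict[]): number {
  if (verdicts.length === 0) return 0; // assumption: an answer with no checkable claims scores 0
  const supported = verdicts.filter((v) => v.supportingChunkId !== null).length;
  return supported / verdicts.length;
}

// The 0.8 default is an assumed threshold; pick your own.
function isFaithful(verdicts: ClaimVerdict[], threshold = 0.8): boolean {
  return faithfulnessScore(verdicts) >= threshold;
}

const verdicts: ClaimVerdict[] = [
  { claim: "Minimum password length is 16 characters", supportingChunkId: "doc-003-security" },
  { claim: "12 characters is the industry standard", supportingChunkId: null },
];

console.log(faithfulnessScore(verdicts)); // 0.5
console.log(isFaithful(verdicts)); // false
```

Keeping the score a pure function of the judge's verdicts means the arithmetic and the threshold can be unit-tested without calling a model.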
2. Context Precision — Are you even retrieving the right chunks?
What breaks: Garbage in, garbage out. Your retriever pulls irrelevant docs and the LLM does its best with bad context.
How to test: For each query, score the relevance of every retrieved chunk (0 to 1) using an LLM judge. If fewer than 2 of the 3 chunks score above your threshold, your embedding/retrieval setup has alignment issues.
This one catches problems that faithfulness testing misses. It checks before the answer is generated.
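A minimal sketch of the pass/fail rule, assuming the judge's relevance scores are already attached to each chunk. `ScoredChunk`, the 0.5 relevance threshold, and the example scores are made up for illustration.

```typescript
// Hypothetical shape: relevance is the LLM judge's 0-to-1 score, not the vector similarity.
interface ScoredChunk {
  id: string;
  relevance: number;
}

// Passes when at least `minRelevant` chunks score above the threshold (2 of 3 by default).
function passesContextPrecision(
  chunks: ScoredChunk[],
  relevanceThreshold = 0.5, // assumed value
  minRelevant = 2,
): boolean {
  const relevantCount = chunks.filter((c) => c.relevance > relevanceThreshold).length;
  return relevantCount >= minRelevant;
}

// Illustrative judge scores only.
const oneRelevant: ScoredChunk[] = [
  { id: "doc-003-security", relevance: 0.9 },
  { id: "doc-002-remote-work", relevance: 0.2 },
  { id: "doc-005-reimbursement", relevance: 0.1 },
];

console.log(passesContextPrecision(oneRelevant)); // false: only 1 of 3 chunks is relevant
```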
3. Negative Testing — Does it know what it doesn't know?
What breaks: Someone asks about something not in your KB. The model fills the gap with training data and answers confidently. In compliance, legal, or medical contexts, this is genuinely dangerous.
How to test: Write a list of questions that are deliberately outside your knowledge base. For each one, check whether the response contains a refusal phrase like "I don't have information about...". If it doesn't, you've caught a live hallucination.
Query: "How many vacation days do employees get?" (not in KB)
✓ Pass: "I don't have that information in the knowledge base."
✗ Fail: "Employees typically receive 15 days per year."
Simple string matching. No LLM needed. Fast and deterministic.
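Here is one way the string check might look. The pattern list is an assumption: it has to mirror whatever refusal wording your own system prompt produces, so treat these two regexes as placeholders.

```typescript
// Assumed refusal wordings; align these with the phrasing your prompt instructs the model to use.
const REFUSAL_PATTERNS: RegExp[] = [
  /i don['’]t have (that |any )?information/i,
  /not (in|part of) (the|my) knowledge base/i,
];

// Deterministic check: true if the answer matches any known refusal wording.
function isRefusal(answer: string): boolean {
  return REFUSAL_PATTERNS.some((pattern) => pattern.test(answer));
}

console.log(isRefusal("I don't have that information in the knowledge base.")); // true
console.log(isRefusal("Employees typically receive 15 days per year.")); // false
```

The trade-off is that a refusal phrased in a way the list doesn't anticipate will show up as a failure, so it helps to pin the refusal wording in the prompt.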
4. Retrieval Unit Test — Did your re-index break anything?
What breaks: You swap your embedding model, change chunk size, or re-index. Now a query that used to find the right doc doesn't anymore. No errors thrown. Pipeline looks fine. It's just wrong.
How to test: Maintain a ground-truth lookup: query -> expected doc ID. After every infrastructure change, run this. Check that the expected doc ID appears somewhere in your top-K results.
Query: "What is the minimum password length?"
Expected: doc-003-security in top-3
Got: [doc-003-security (0.94), doc-001 (0.41), doc-002 (0.38)] ✓
No LLM needed. Pure regression testing for your retrieval layer.
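A sketch of that regression check follows. The `Retriever` type is a hypothetical stand-in for your own search call (kept synchronous here for brevity, and assumed to return hits sorted by descending score), and `fakeRetrieve` just replays the example result above.

```typescript
interface Hit {
  docId: string;
  score: number;
}

// Hypothetical retriever signature; a real one would likely be async.
type Retriever = (query: string) => Hit[];

// Ground truth: query -> expected doc ID.
const groundTruth: Record<string, string> = {
  "What is the minimum password length?": "doc-003-security",
};

// Returns the queries whose expected doc is missing from the top-K hits.
function findRegressions(retrieve: Retriever, truth: Record<string, string>, k = 3): string[] {
  return Object.keys(truth).filter((query) => {
    const topK = retrieve(query).slice(0, k);
    return !topK.some((hit) => hit.docId === truth[query]);
  });
}

// Stand-in retriever that replays the example result.
const fakeRetrieve: Retriever = () => [
  { docId: "doc-003-security", score: 0.94 },
  { docId: "doc-001", score: 0.41 },
  { docId: "doc-002", score: 0.38 },
];

console.log(findRegressions(fakeRetrieve, groundTruth)); // []
```

An empty array means no regressions; anything else is the list of queries to investigate after the infrastructure change.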
5. Stale Data — Did your update actually propagate?
What breaks: You update a policy doc and re-ingest it. But unless your pipeline upserts under the same chunk IDs or deletes the old entries first, many vector stores simply add the new embedding alongside the old one. Now both exist, and a query can surface either version depending on which chunk happens to rank higher.
How to test: Two phases.
- Phase 1: Ingest v1, query, confirm old value ("90 days") is returned
- Phase 2: Clear the collection, ingest v2, query, confirm new value ("60 days") is returned AND old value is absent
If you skip the clear step in phase 2, you'll reproduce the bug right in your test suite.
// Phase 2: the new value must be present and the stale v1 value ("90") absent.
expect(v2Answer).toMatch(/\b60\b/);
expect(v2Answer).not.toMatch(new RegExp(`\\b${stale.v1.assertion}\\b`));
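To see why the clear step matters, here is a toy simulation. `ToyStore` is not a real vector store: it is an assumed stand-in whose `add()` only appends, and the policy strings are invented around the "90 days" and "60 days" values.

```typescript
// Toy stand-in for a store whose add() appends and never overwrites (assumption for illustration).
class ToyStore {
  private chunks: string[] = [];
  add(text: string): void {
    this.chunks.push(text);
  }
  clear(): void {
    this.chunks = [];
  }
  all(): string[] {
    return this.chunks.slice();
  }
}

// Invented policy text around the values used in the two phases.
const v1Text = "Policy value: 90 days.";
const v2Text = "Policy value: 60 days.";

// Word-boundary match, assuming `value` is plain digits with no regex metacharacters.
function containsValue(chunks: string[], value: string): boolean {
  const pattern = new RegExp(`\\b${value}\\b`);
  return chunks.some((chunk) => pattern.test(chunk));
}

// Ingest v1, optionally clear, then ingest v2, and return what the store now holds.
function reingest(clearFirst: boolean): string[] {
  const store = new ToyStore();
  store.add(v1Text);
  if (clearFirst) store.clear();
  store.add(v2Text);
  return store.all();
}

console.log(containsValue(reingest(false), "90")); // true: stale value is still retrievable
console.log(containsValue(reingest(true), "90")); // false
console.log(containsValue(reingest(true), "60")); // true
```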
I ran this on a real KB. Here's what came back.
16 tests total. 3 failed. 81% pass rate.
| Test | Result |
|---|---|
| Faithfulness | 3/3 |
| Context Precision | 0/3 |
| Negative | 5/5 |
| Retrieval Unit | 4/4 |
| Stale Data | 1/1 |
Here's what the failures actually look like.
Context precision failed all 3 tests with the same pattern. Every query retrieved exactly 1 relevant chunk out of 3, short of the 2-of-3 bar. The right document always ranked first, but its similarity score was low and only narrowly ahead of the irrelevant ones. For a password policy query:
doc-003-security 0.374 relevant
doc-002-remote-work 0.209 irrelevant
doc-005-reimbursement 0.191 irrelevant
That 0.165 gap between the right doc and the runner-up isn't confidence, it's noise. The retriever is finding the right doc by a slim margin. The answers look fine because the LLM is working around weak retrieval. That won't hold as the KB grows.
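If you want to track this margin over time, a small helper along these lines could flag it. Both thresholds (`minTopScore` and `minMargin`) are assumptions picked for illustration; sensible values depend on your embedding model and score distribution.

```typescript
interface MarginCheck {
  topScore: number;
  margin: number; // top score minus runner-up
  weak: boolean;
}

// Both default thresholds are illustrative assumptions, not recommended values.
function checkMargin(scores: number[], minTopScore = 0.5, minMargin = 0.2): MarginCheck {
  const sorted = scores.slice().sort((a, b) => b - a);
  const topScore = sorted.length > 0 ? sorted[0] : 0;
  // With no runner-up, treat the top score itself as the margin.
  const margin = sorted.length > 1 ? sorted[0] - sorted[1] : topScore;
  return { topScore, margin, weak: topScore < minTopScore || margin < minMargin };
}

const result = checkMargin([0.374, 0.209, 0.191]);
console.log(result.margin.toFixed(3)); // "0.165"
console.log(result.weak); // true under the assumed thresholds
```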
The takeaway
Most RAG failures aren't spectacular crashes. They're silent. Confident wrong answers. Stale policy info. Hallucinated numbers. A retriever that quietly regressed after a model swap.
These 5 tests cover the whole pipeline:
| Layer | Test |
|---|---|
| Generation | Faithfulness |
| Retrieval quality | Context Precision |
| Out-of-scope handling | Negative |
| Retrieval regression | Retrieval Unit Test |
| Update propagation | Stale Data |




