Muhammad Zulqarnain

Posted on May 17

How to Build a RAG Evaluation Framework That Catches Real Problems

#ai #machinelearningpython #rag

Six months into running a production RAG system, I had a problem: my users kept complaining about wrong answers, but my evaluation metrics looked fine. Retrieval accuracy: 87%. User satisfaction: 82%. Everything looked good on paper.

Then I sat with users for a week and watched them interact with the system.

The real problems were invisible to my metrics.

The Metrics You're Probably Using (And Why They're Not Enough)

Most RAG evaluation setups check:

Retrieval recall: "Did we get the relevant documents?"
Answer faithfulness: "Is the answer based on the retrieved content?"
Answer relevance: "Does the answer address the question?"

These are fine starting points. But they miss the failure modes that actually hurt users.

The Failure Modes Nobody Talks About

1. Temporal Confusion

Your system retrieved a correct document from 8 months ago. The policy changed 3 months ago. Your answer is confidently wrong.

Standard metrics: pass. Users: furious.

Fix: Track document age in your evaluation. Flag answers based on documents older than 30 days for manual review. Update your vector store on a schedule.

2. Partial Retrieval

User asks a compound question: "What are the refund limits for premium vs standard accounts?"

Your system retrieves the premium refund policy perfectly. Misses the standard account policy entirely.

Your retrieval recall: 50%. Your answer: half wrong. Your metrics: don't catch this.

Fix: Decompose compound questions before evaluation. Each sub-question needs its own retrieval check.

3. Confidence Calibration

Your model says "Based on the policy, the limit is $500." The document actually says "approximately $400-600 depending on account history."

The model collapsed a range into a specific number. Confidently. Wrongly.

Fix: Add a confidence calibration metric. Compare model confidence signals against actual document specificity.

4. The Hallucination in Context

This one is subtle. The document says X. The model retrieves it correctly. But the model's response adds context from its training data that contradicts X.

Faithfulness metrics typically catch obvious hallucinations. They miss when the model supplements retrieved content with training data.

Fix: Run a "context sufficiency" check. If the retrieved documents don't fully answer the question, flag it rather than letting the model fill gaps.

Building the Evaluation Framework

Here's what actually works in production:

Layer 1: Automated Metrics (Fast, Cheap)

Retrieval recall against a golden dataset
BERTScore for semantic similarity
Answer length relative to question complexity
Response time (slow = bad UX = user churn)

Cost: ~$0.002 per query to evaluate

Layer 2: LLM-as-Judge (Medium Cost, Medium Accuracy)

Have GPT-4 evaluate faithfulness and relevance
Check for temporal issues by passing document metadata
Flag compound question handling

Cost: ~$0.08 per query to evaluate (use on 10% of traffic)

Layer 3: Human Review (Expensive, Ground Truth)

Weekly review of 50 random flagged queries
Monthly review of edge cases identified by Layer 2
Quarterly user satisfaction interviews

Cost: $800-2,000/month in person-hours

The Feedback Loop That Matters

Every failed evaluation goes into a "failure taxonomy." Not just "wrong answer" — but why it was wrong.

After 3 months of this taxonomy, our top failure modes were:

Temporal confusion: 34% of failures
Compound question partial retrieval: 28%
Confidence miscalibration: 19%
Hallucination in context: 14%
Other: 5%

This told us exactly where to invest engineering time.

The Numbers

Before framework: Users reported wrong answers 3-4 times per week.

After 3 months with the framework:

Temporal failures: down 71%
Compound question failures: down 58%
Overall wrong answer reports: down 64%

The evaluation framework cost us 40 engineering hours to build and $800/month to run. The reduction in support tickets saved $3,000/month.

What to Build First

If you're starting from scratch: build Layer 1 first. Get automated metrics running on every query. Create a golden dataset of 200 question-answer pairs that represent your most common use cases.

Run that for 30 days. The failure patterns will tell you what to build next.

Don't let perfect be the enemy of good. An imperfect evaluation framework running consistently beats a perfect one you're still designing.

Build the thing. Measure. Iterate.