Quick Reference: Terms You'll Encounter
Technical Acronyms:
- RAG: Retrieval-Augmented Generation—enhancing LLM responses with retrieved context
- LLM: Large Language Model—transformer-based text generation system
- RAGAS: RAG Assessment—popular open-source evaluation framework
- BLEU: Bilingual Evaluation Understudy—n-gram overlap metric
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation—summary comparison metric
Statistical & Mathematical Terms:
- Precision: Relevant items retrieved / Total items retrieved
- Recall: Relevant items retrieved / Total relevant items
- F1 Score: Harmonic mean of precision and recall
- Ground Truth: Known correct answers for evaluation
- Inter-rater Reliability: Agreement between human evaluators
Introduction: You Can't Improve What You Can't Measure
Imagine you're a restaurant owner. A customer complains: "The food was bad." That's useless feedback. Was it too salty? Undercooked? Wrong dish entirely? You need specific, measurable criteria to improve.
RAG systems face the same challenge. "The answer was wrong" doesn't tell you whether:
- The retrieval failed (wrong documents)
- The generation failed (right documents, wrong interpretation)
- The question was ambiguous
- The knowledge base was incomplete
RAG evaluation is like a medical diagnosis. You don't just ask "is the patient sick?" You measure temperature, blood pressure, heart rate, and specific biomarkers. Each metric isolates a different potential problem, guiding treatment.
Here's another analogy: Evaluation metrics are quality control checkpoints on an assembly line. You don't just inspect the final car—you check the engine, the transmission, the electrical system separately. A failed brake test tells you exactly where to look.
A third way to think about it: Metrics are unit tests for AI systems. Just as you wouldn't ship code without tests, you shouldn't ship RAG without evaluation. The difference is that AI "tests" are probabilistic, not deterministic.
The RAG Evaluation Stack: Four Layers of Quality
RAG quality breaks down into four distinct layers, each requiring different metrics:
┌────────────────────────────────────────────────┐
│ Layer 4: End-to-End Quality                    │
│ "Did we solve the user's actual problem?"      │
│ Metrics: Task success, user satisfaction       │
├────────────────────────────────────────────────┤
│ Layer 3: Answer Quality                        │
│ "Is the final answer correct and useful?"      │
│ Metrics: Correctness, completeness, relevance  │
├────────────────────────────────────────────────┤
│ Layer 2: Faithfulness                          │
│ "Does the answer match the retrieved context?" │
│ Metrics: Faithfulness, hallucination rate      │
├────────────────────────────────────────────────┤
│ Layer 1: Retrieval Quality                     │
│ "Did we find the right documents?"             │
│ Metrics: Precision, recall, MRR, nDCG          │
└────────────────────────────────────────────────┘
Critical insight: Problems cascade upward. Bad retrieval guarantees bad answers. But good retrieval doesn't guarantee good answers—the generation can still fail. You need metrics at every layer.
Layer 1: Retrieval Metrics—Did We Find the Right Documents?
Context Precision
What it measures: Of the documents we retrieved, how many were actually relevant?
Why it matters: Low precision means you're stuffing the context window with noise. The LLM has to work harder to find the signal, increasing hallucination risk.
The analogy: You're researching a legal case. Context precision asks: "Of the 10 documents your assistant pulled, how many are actually relevant to this case?" If 3 are relevant, precision is 30%.
Calculation:
Context Precision = Relevant chunks retrieved / Total chunks retrieved
Target: > 0.7 for most applications. Below 0.5 suggests retrieval needs work.
Context Recall
What it measures: Of all the relevant documents that exist, how many did we find?
Why it matters: Low recall means you're missing important information. The answer might be technically accurate but incomplete.
The analogy: You're studying for an exam on World War II. Context recall asks: "Of all the important facts you need to know, how many did your study materials cover?" Missing the D-Day invasion means low recall.
Calculation:
Context Recall = Relevant chunks retrieved / Total relevant chunks in corpus
Target: > 0.8 for comprehensive answers. Can be lower for simple factual queries.
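Both formulas reduce to a few lines of code. Here's a minimal sketch of the two metrics just described, assuming retrieved and relevant chunks are identified by IDs (the function names and sample data are illustrative, not from any particular library):

def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of all relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    retrieved_set = set(retrieved_ids)
    hits = sum(1 for chunk_id in relevant_ids if chunk_id in retrieved_set)
    return hits / len(relevant_ids)

# 10 chunks retrieved, 3 of them relevant; 5 relevant chunks exist in the corpus
retrieved = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"]
relevant = {"c1", "c4", "c9", "c11", "c12"}
print(context_precision(retrieved, relevant))  # 0.3
print(context_recall(retrieved, relevant))     # 0.6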
Mean Reciprocal Rank (MRR)
What it measures: How high does the first relevant document appear in results?
Why it matters: If the best document is ranked #47, the LLM might never see it (context window limits). Position matters enormously.
The analogy: You Google a question. MRR measures whether the answer is in the first result (score: 1.0), the second (score: 0.5), the tenth (score: 0.1), or buried on page 5 (score: ~0).
Calculation:
MRR = Average of (1 / rank of first relevant result) across queries
Target: > 0.6. Below 0.4 means relevant results are buried too deep.
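As a sketch, the same calculation in code: each query contributes 1/rank of its first relevant result, or 0 if nothing relevant was retrieved. The data shapes here are assumptions for illustration:

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Each run is (ranked chunk IDs, set of relevant IDs) for one query."""
    scores = []
    for ranked_ids, relevant_ids in runs:
        score = 0.0
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant_ids:
                score = 1.0 / rank  # only the first relevant hit counts
                break
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

# Query 1: first relevant doc at rank 1; query 2: at rank 2 → (1.0 + 0.5) / 2
runs = [(["a", "b", "c"], {"a"}), (["x", "y", "z"], {"y"})]
print(mean_reciprocal_rank(runs))  # 0.75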
Normalized Discounted Cumulative Gain (nDCG)
What it measures: Overall ranking quality, accounting for position and graded relevance.
Why it matters: Not all relevant documents are equally relevant. nDCG captures whether highly relevant documents rank above somewhat relevant ones.
The analogy: You're ranking restaurants. nDCG rewards putting the 5-star restaurant first, the 4-star second, and the 3-star third—not just having all three somewhere in the list.
Target: > 0.7 for production systems.
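Here's a sketch of one common nDCG formulation, using graded relevance labels (0 = irrelevant, 1 = somewhat relevant, 2 = highly relevant) and a log2 position discount. Variants exist (e.g. exponential gain), so treat this as illustrative rather than the canonical definition:

import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: lower-ranked documents count for less."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances: list[float]) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The highly relevant document (grade 2) is ranked third instead of first
print(ndcg([1, 0, 2]))  # ~0.76; the perfect ordering [2, 1, 0] would score 1.0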
Layer 2: Faithfulness—Does the Answer Match the Context?
Faithfulness is the hallucination detector. It measures whether the generated answer is actually supported by the retrieved documents.
The Faithfulness Problem
Consider this scenario:
- Retrieved context: "The company was founded in 2015 in Austin, Texas."
- Generated answer: "The company was founded in 2015 in Austin, Texas by John Smith."
The founder's name is hallucinated—it's not in the context. Faithfulness metrics catch this.
Measuring Faithfulness
Claim decomposition approach:
- Break the answer into atomic claims
- For each claim, check if it's supported by the context
- Faithfulness = Supported claims / Total claims
Example:
Answer: "Python was created by Guido van Rossum in 1991. It's the most popular programming language."
Claims:
1. "Python was created by Guido van Rossum" → Check context → Supported ✓
2. "Python was created in 1991" → Check context → Supported ✓
3. "Python is the most popular programming language" → Check context → NOT FOUND ✗
Faithfulness = 2/3 = 0.67
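A sketch of the claim-decomposition loop: in a real pipeline, extract_claims and claim_is_supported would be LLM prompts or an NLI model; the naive string-based stand-ins below exist only to make the scoring logic runnable.

def extract_claims(answer: str) -> list[str]:
    # Stand-in: in practice, prompt an LLM to split the answer into atomic claims
    return [s.strip() for s in answer.split(".") if s.strip()]

def claim_is_supported(claim: str, context: str) -> bool:
    # Stand-in: in practice, ask an LLM or NLI model whether the context entails the claim
    return claim.lower() in context.lower()

def faithfulness_score(answer: str, context: str) -> dict:
    claims = extract_claims(answer)
    verdicts = [claim_is_supported(claim, context) for claim in claims]
    return {
        "faithfulness": sum(verdicts) / len(claims) if claims else 1.0,
        "unsupported_claims": [c for c, ok in zip(claims, verdicts) if not ok],
    }

The evaluator class later in this post wraps these same two steps behind a FaithfulnessChecker interface.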
Hallucination Categories
Not all hallucinations are equal:
| Type | Severity | Example |
|---|---|---|
| Fabricated facts | High | Inventing statistics, names, dates |
| Exaggeration | Medium | "Always" when context says "often" |
| Conflation | Medium | Mixing details from different sources |
| Extrapolation | Low | Reasonable inference not explicitly stated |
Target faithfulness: > 0.9 for factual applications. Financial, medical, and legal domains should aim for > 0.95.
Layer 3: Answer Quality—Is the Response Actually Good?
Answer Relevance
What it measures: Does the answer actually address the question asked?
The problem it catches: The answer might be faithful to the context but completely miss the point.
Example:
- Question: "What is the return policy?"
- Retrieved: Company FAQ about returns
- Answer: "Our company was founded in 2010 and has grown to serve millions of customers."
Technically faithful (if that's in the FAQ), but completely irrelevant.
Measurement approach:
- Generate questions that the answer would address
- Compare to the original question
- Higher similarity = higher relevance
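A sketch of the scoring step in that approach, assuming the reverse questions have already been generated by an LLM (that call isn't shown). The bag-of-words cosine here is a stand-in for real embedding similarity:

import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    # Stand-in for embedding similarity: bag-of-words cosine
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def answer_relevance(original_question: str, generated_questions: list[str]) -> float:
    """Average similarity between the user's question and the questions the answer actually addresses."""
    sims = [cosine_sim(original_question, q) for q in generated_questions]
    return sum(sims) / len(sims) if sims else 0.0

# The off-topic answer from the example above yields questions nothing like the original
print(answer_relevance(
    "What is the return policy?",
    ["When was the company founded?", "How many customers does the company serve?"],
))  # low score → irrelevant answer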
Answer Correctness
What it measures: Is the answer factually correct according to ground truth?
When you can measure it: Only when you have known correct answers (golden dataset).
The challenge: Ground truth is expensive to create and maintain. But without it, you're flying blind.
Answer Completeness
What it measures: Does the answer cover all aspects of the question?
Example:
- Question: "What are the pros and cons of React?"
- Incomplete answer: "React has a large ecosystem and component reusability."
- Complete answer: Lists both advantages AND disadvantages
Measurement approach: Compare answer coverage against a reference answer or checklist of expected points.
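A minimal sketch of the checklist approach. The keyword matching is deliberately naive (an LLM or embedding check is more robust in practice), and the checklist contents are just an example:

def answer_completeness(answer: str, expected_points: list[str]) -> float:
    """Fraction of expected points that the answer covers."""
    answer_lower = answer.lower()
    covered = sum(1 for point in expected_points if point.lower() in answer_lower)
    return covered / len(expected_points) if expected_points else 1.0

# Question: "What are the pros and cons of React?"
checklist = ["large ecosystem", "component reusability", "steep learning curve", "frequent updates"]
answer = "React has a large ecosystem and component reusability."
print(answer_completeness(answer, checklist))  # 0.5 → pros covered, cons missing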
Layer 4: End-to-End Metrics—Did We Actually Help?
Task Completion Rate
What it measures: Did the user accomplish their goal?
Why it's the ultimate metric: All other metrics are proxies. This is the outcome that matters.
How to measure:
- Explicit signals: User clicks "resolved," completes purchase, etc.
- Implicit signals: User doesn't ask follow-up questions, doesn't contact support
User Satisfaction
What it measures: Subjective quality as perceived by users.
Methods:
- Thumbs up/down on responses
- Follow-up surveys
- Implicit signals (session length, return rate)
The challenge: Low response rates and selection bias. Users who bother to rate skew negative.
Building Golden Evaluation Sets
A golden set is your ground truth—questions with known correct answers that you use to benchmark your system.
What Makes a Good Golden Set
Diversity: Cover different question types, topics, and difficulty levels.
Question Types to Include:
├── Factual lookup ("What is X's revenue?")
├── Comparison ("How does A differ from B?")
├── Procedural ("How do I configure X?")
├── Reasoning ("Why did X happen?")
├── Multi-hop ("What's the CEO's alma mater's mascot?")
└── Unanswerable ("What will revenue be in 2030?")
Realistic distribution: Match your production traffic. If 60% of questions are "how do I," your golden set should reflect that.
Edge cases: Include the hard stuff—ambiguous questions, questions requiring multiple documents, questions with no good answer.
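One way to make this concrete is a small schema for each golden example. The field names below are an assumption for illustration, not a standard format:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GoldenExample:
    question: str
    question_type: str              # factual, comparison, procedural, reasoning, multi-hop, unanswerable
    ground_truth: Optional[str]     # None for unanswerable questions
    relevant_doc_ids: list[str] = field(default_factory=list)
    source: str = "expert"          # expert | user_feedback | synthetic | adversarial
    created_at: str = ""            # lets you retire stale examples later

example = GoldenExample(
    question="How do I configure SSO for my team?",
    question_type="procedural",
    ground_truth="Go to Settings, then Security, then SSO, and upload your IdP metadata.",
    relevant_doc_ids=["admin_guide_04", "sso_faq"],
)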
Golden Set Size Guidelines
| Use Case | Minimum Size | Recommended |
|---|---|---|
| Quick sanity check | 20-50 | 50 |
| Development iteration | 100-200 | 200 |
| Pre-release validation | 300-500 | 500 |
| Comprehensive benchmark | 500-1000 | 1000+ |
Rule of thumb: More is better, but 200 well-chosen examples beat 1000 random ones.
Creating Ground Truth
Option 1: Expert annotation
- Have domain experts write ideal answers
- Most accurate, most expensive
- Best for high-stakes domains
Option 2: User feedback mining
- Extract from support tickets, chat logs
- "Real" questions with known resolutions
- Watch for privacy concerns
Option 3: Synthetic generation
- Use LLMs to generate Q&A pairs from your documents
- Scale easily but quality varies
- Always human-validate a sample
Option 4: Adversarial generation
- Deliberately create hard cases
- Questions that sound similar but have different answers
- Edge cases that have broken the system before
LLM-as-Judge: Using AI to Evaluate AI
When you can't afford human evaluation at scale, LLMs can serve as automated judges.
How It Works
Prompt to Judge LLM:
"You are evaluating a RAG system response.
Question: {question}
Retrieved Context: {context}
Generated Answer: {answer}
Rate the following on a scale of 1-5:
1. Faithfulness: Is the answer supported by the context?
2. Relevance: Does the answer address the question?
3. Completeness: Does the answer cover all aspects?
Provide scores and brief justification."
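Here's a sketch of wiring that prompt into code. The call_llm parameter stands in for whichever client you use (OpenAI, Anthropic, a local model); asking for JSON output makes the scores easy to parse, though production code should validate and retry:

import json
from typing import Callable

JUDGE_PROMPT = """You are evaluating a RAG system response.
Question: {question}
Retrieved Context: {context}
Generated Answer: {answer}

Rate faithfulness, relevance, and completeness on a scale of 1-5.
Respond as JSON: {{"faithfulness": 0, "relevance": 0, "completeness": 0, "justification": ""}}"""

def judge_response(question: str, context: str, answer: str,
                   call_llm: Callable[[str], str]) -> dict:
    """Returns the judge's scores as a dict, e.g. {"faithfulness": 4, ...}."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    raw = call_llm(prompt)  # one prompt string in, one text completion out
    return json.loads(raw)  # validate and retry on malformed JSON in production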
Strengths and Weaknesses
Strengths:
- Scales infinitely
- Consistent (no inter-rater variability)
- Can evaluate nuanced criteria
Weaknesses:
- Biased toward verbose, confident-sounding answers
- May miss subtle factual errors
- Can't catch errors the judge LLM would also make
Calibrating LLM Judges
Critical step: Validate LLM judgments against human judgments.
Process:
1. Have humans rate 100-200 examples
2. Have LLM judge the same examples
3. Calculate correlation
4. If correlation < 0.7, adjust prompts or criteria
5. Document known blind spots
Best practice: Use a stronger model as judge than the model being evaluated. GPT-4 judging GPT-3.5, Claude Opus judging Claude Haiku, etc.
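Step 3 of the calibration process is a one-liner with scipy; rank correlation (Spearman) fits because the ratings are ordinal. The scores below are made up to show the shape of the check:

from scipy.stats import spearmanr

def judge_agreement(human_scores: list[int], llm_scores: list[int]) -> float:
    """Rank correlation between human and LLM-judge ratings of the same examples."""
    correlation, _p_value = spearmanr(human_scores, llm_scores)
    return correlation

# 1-5 ratings of the same ten examples by humans and by the judge LLM
human = [5, 4, 4, 2, 5, 3, 1, 4, 2, 5]
judge = [5, 5, 4, 3, 4, 3, 2, 4, 2, 5]
print(judge_agreement(human, judge))  # ~0.87 → above the 0.7 bar, so the judge is usable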
Evaluation Pipeline Architecture
┌────────────────────────────────────────────────────────────┐
│                    Evaluation Pipeline                      │
├────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Golden    │───▶│     RAG     │───▶│   Metrics   │      │
│  │   Dataset   │    │   System    │    │   Compute   │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│         │                                     │             │
│         ▼                                     ▼             │
│  ┌─────────────┐                       ┌─────────────┐      │
│  │   Ground    │                       │   Results   │      │
│  │    Truth    │                       │    Store    │      │
│  └─────────────┘                       └─────────────┘      │
│                                               │             │
│                                               ▼             │
│                                        ┌─────────────┐      │
│                                        │  Dashboard  │      │
│                                        │  & Alerts   │      │
│                                        └─────────────┘      │
│                                                             │
└────────────────────────────────────────────────────────────┘
Key Components
Golden Dataset Store: Version-controlled, with metadata about when/how each example was created.
Metrics Compute: Calculates all metrics for each evaluation run. Should be deterministic (same inputs = same outputs).
Results Store: Historical record of all evaluation runs. Enables trend analysis and regression detection.
Dashboard & Alerts: Visualize metrics over time. Alert when metrics drop below thresholds.
Metric Selection: What to Track When
For Development (Daily)
Fast metrics that catch obvious regressions:
| Metric | Target | Why |
|---|---|---|
| Context Precision@5 | > 0.6 | Quick retrieval sanity check |
| Faithfulness (sampled) | > 0.85 | Catch hallucination spikes |
| Answer Relevance | > 0.7 | Ensure answers address questions |
For Release Validation (Weekly)
Comprehensive evaluation before deployments:
| Metric | Target | Why |
|---|---|---|
| Full retrieval suite | Various | Complete retrieval quality picture |
| Faithfulness (full) | > 0.9 | No hallucination regressions |
| Answer Correctness | > 0.85 | Accuracy against ground truth |
| Latency p95 | < 3s | Performance hasn't degraded |
For Production Monitoring (Continuous)
Lightweight signals that work without ground truth:
| Signal | Alert Threshold | Why |
|---|---|---|
| User feedback ratio | < 0.7 thumbs up | Direct user sentiment |
| Follow-up question rate | > 0.3 | Users aren't getting answers |
| "I don't know" rate | Significant change | Retrieval may be failing |
| Avg response length | Significant change | Generation behavior shift |
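These checks are simple enough to run in a scheduled job. A sketch, where the signal names and the thresholds for "significant change" (5 points for the "I don't know" rate, 25% for response length) are assumptions you'd tune for your own traffic:

def check_signals(current: dict, baseline: dict) -> list[str]:
    """Return the signals that should trigger an alert."""
    alerts = []
    if current["thumbs_up_ratio"] < 0.7:
        alerts.append("thumbs_up_ratio")
    if current["followup_rate"] > 0.3:
        alerts.append("followup_rate")
    if abs(current["idk_rate"] - baseline["idk_rate"]) > 0.05:
        alerts.append("idk_rate")
    if abs(current["avg_response_length"] - baseline["avg_response_length"]) / baseline["avg_response_length"] > 0.25:
        alerts.append("avg_response_length")
    return alerts

current  = {"thumbs_up_ratio": 0.64, "followup_rate": 0.22, "idk_rate": 0.19, "avg_response_length": 410}
baseline = {"thumbs_up_ratio": 0.78, "followup_rate": 0.21, "idk_rate": 0.08, "avg_response_length": 420}
print(check_signals(current, baseline))  # ['thumbs_up_ratio', 'idk_rate']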
Common Evaluation Pitfalls
Pitfall 1: Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure."
If you optimize purely for faithfulness, the system learns to give vague, hedged answers that are technically faithful but useless.
Solution: Balance multiple metrics. No single metric should dominate.
Pitfall 2: Test Set Leakage
Your golden set accidentally overlaps with training data or retrieval corpus in ways that inflate scores.
Solution: Strict separation. Date-based splits where possible. Regular audits for overlap.
Pitfall 3: Distribution Shift
Your golden set was created six months ago. User questions have evolved. Metrics look great but users complain.
Solution: Continuously add new examples from production traffic. Retire stale examples.
Pitfall 4: Over-Reliance on Automatic Metrics
BLEU, ROUGE, and embedding similarity are cheap to compute but poorly correlated with human judgment for open-ended generation.
Solution: Always include some human evaluation. Use automatic metrics for quick feedback, not final decisions.
Pitfall 5: Ignoring Confidence Calibration
Your system says it's 90% confident but is only right 60% of the time.
Solution: Track calibration (accuracy at each confidence level). Well-calibrated confidence enables smart escalation.
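Tracking calibration takes very little code once you log a confidence score and an eventual correct/incorrect label per answer. A minimal sketch, bucketing confidence to the nearest 0.1:

from collections import defaultdict

def calibration_report(confidences: list[float], correct: list[bool]) -> dict[float, float]:
    """Actual accuracy at each stated confidence level."""
    buckets: dict[float, list[bool]] = defaultdict(list)
    for confidence, was_correct in zip(confidences, correct):
        buckets[round(confidence, 1)].append(was_correct)
    return {bucket: sum(outcomes) / len(outcomes) for bucket, outcomes in sorted(buckets.items())}

# The system claims 0.9 confidence on five answers, but only three are right
confs   = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6, 0.6]
correct = [True, True, True, False, False, True, True, True, False, False]
print(calibration_report(confs, correct))  # {0.6: 0.6, 0.9: 0.6} → overconfident at 0.9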
Implementing an Evaluation Framework
Here's a conceptual framework—the GitHub repo will have the full implementation:
class RAGEvaluator:
    """
    Core evaluation framework structure.

    Evaluates: Retrieval quality, faithfulness, answer quality
    Supports: Golden sets, LLM-as-judge, custom metrics
    """

    def __init__(self, config: EvalConfig):
        self.retrieval_metrics = RetrievalMetrics()
        self.faithfulness_checker = FaithfulnessChecker()
        self.answer_evaluator = AnswerEvaluator()
        self.golden_set = GoldenDataset(config.golden_set_path)
        self.rag_system = config.rag_system  # the system under test (assumed to come from config)

    def evaluate_retrieval(self, query, retrieved, relevant) -> dict:
        """Layer 1: Did we find the right documents?"""
        return {
            "precision": self.retrieval_metrics.precision(retrieved, relevant),
            "recall": self.retrieval_metrics.recall(retrieved, relevant),
            "mrr": self.retrieval_metrics.mrr(retrieved, relevant),
            "ndcg": self.retrieval_metrics.ndcg(retrieved, relevant),
        }

    def evaluate_faithfulness(self, answer, context) -> dict:
        """Layer 2: Is the answer grounded in context?"""
        claims = self.faithfulness_checker.extract_claims(answer)
        supported = self.faithfulness_checker.verify_claims(claims, context)
        return {
            "faithfulness": len(supported) / len(claims) if claims else 1.0,
            "unsupported_claims": [c for c in claims if c not in supported],
        }

    def evaluate_answer(self, question, answer, ground_truth=None) -> dict:
        """Layer 3: Is the answer good?"""
        result = {
            "relevance": self.answer_evaluator.relevance(question, answer)
        }
        if ground_truth:
            result["correctness"] = self.answer_evaluator.correctness(answer, ground_truth)
        return result

    def run_full_evaluation(self) -> EvalReport:
        """Run evaluation on the entire golden set."""
        results = []
        for example in self.golden_set:
            # Run the RAG system on the golden question
            retrieved, answer = self.rag_system.query(example.question)

            # Evaluate all layers
            result = {
                "retrieval": self.evaluate_retrieval(
                    example.question, retrieved, example.relevant_docs
                ),
                "faithfulness": self.evaluate_faithfulness(answer, retrieved),
                "answer": self.evaluate_answer(
                    example.question, answer, example.ground_truth
                ),
            }
            results.append(result)
        return EvalReport(results)
Data Engineer's ROI Lens: The Business Impact
The Cost of Not Measuring
| Failure Mode | Business Impact | Detection Without Metrics |
|---|---|---|
| Retrieval degradation | Wrong answers increase 40% | Weeks (user complaints) |
| Hallucination spike | Trust erosion, potential liability | Days to weeks |
| Relevance drift | User satisfaction drops | Months (gradual) |
| Completeness issues | Support tickets increase | Weeks |
The Value of Good Evaluation
Scenario: E-commerce product Q&A system handling 50,000 queries/day.
Without evaluation:
- Undetected hallucination rate: 8%
- Bad answers per day: 4,000
- Support tickets generated: 400/day (assuming ~10% of bad answers become tickets)
- Cost per ticket: $15
- Daily cost: $6,000
- Monthly cost: $180,000
With evaluation:
- Hallucination detected in 2 days, fixed in 1 week
- Hallucination rate after fix: 1%
- Bad answers per day: 500
- Support tickets: 50/day
- Daily cost: $750
- Monthly cost: $22,500
Monthly savings: $157,500
Evaluation system cost: ~$5,000/month (compute + maintenance)
Net monthly benefit: $152,500
ROI Calculation
def calculate_eval_roi(
    daily_queries: int,
    error_rate_without_eval: float,
    error_rate_with_eval: float,
    cost_per_error: float,
    eval_system_monthly_cost: float,
) -> dict:
    monthly_queries = daily_queries * 30
    errors_without = monthly_queries * error_rate_without_eval
    errors_with = monthly_queries * error_rate_with_eval

    cost_without = errors_without * cost_per_error
    cost_with = errors_with * cost_per_error + eval_system_monthly_cost

    return {
        "monthly_savings": cost_without - cost_with,
        "error_reduction": f"{(1 - error_rate_with_eval / error_rate_without_eval) * 100:.0f}%",
        "roi": f"{(cost_without - cost_with) / eval_system_monthly_cost:.0f}x",
    }

# Example: cost_per_error is the effective cost per bad answer
# ($15 per ticket x ~10% of bad answers becoming tickets = $1.50)
roi = calculate_eval_roi(
    daily_queries=50000,
    error_rate_without_eval=0.08,
    error_rate_with_eval=0.01,
    cost_per_error=1.50,
    eval_system_monthly_cost=5000,
)
# Result: ~$152,500/month saved, 88% error reduction, ~30x ROI
Key Takeaways
Evaluate every layer: Retrieval, faithfulness, answer quality, and end-to-end outcomes each require different metrics.
Faithfulness is non-negotiable: Hallucination detection must be part of every RAG evaluation.
Golden sets are investments: Spend time building high-quality evaluation data. It pays dividends forever.
LLM-as-judge scales, humans validate: Use AI for volume, humans for calibration.
Multiple metrics prevent gaming: No single metric captures quality. Balance retrieval, generation, and outcome metrics.
Continuous evaluation catches drift: Production quality degrades silently. Regular evaluation makes it visible.
The ROI is clear: Catching errors before users do saves orders of magnitude more than evaluation costs.
Start with a 50-example golden set and three metrics (precision, faithfulness, relevance). Expand as you learn what breaks in your specific domain.
Next in this series: Production AI: Monitoring, Cost Optimization, and Operations—building observable, efficient AI systems that scale reliably.