Stop Eyeballing Your RAG Outputs. Start Measuring Quality.
I shipped a RAG system. It felt fine. Then users started reporting wrong product recommendations, invented prices, and confidently wrong answers to questions the documents couldn't support.
I had no numbers. No regression detection. No systematic way to improve. I was flying blind.
This is how I built an evaluation stack that catches failures before users do.
What "Evaluation" Actually Means
Most teams jump straight to asking humans "does this seem good?" That's too slow and too expensive to run on every change. There's a whole layer of automated evaluation that should come first.
| Level | Question | Cadence |
|---|---|---|
| Unit | Does this component work correctly? | Every commit |
| Integration | Does the full pipeline work end-to-end? | Every PR |
| Human | Do users actually find this helpful? | Weekly |
| A/B | Is the new version measurably better? | Monthly |
The lower layers are fast and cheap. Build them first, then let human evaluation handle the things automation genuinely can't.
Part 1: The Golden Dataset
Everything starts here. A golden dataset is a hand-curated set of examples that represent correct behavior — your ground truth for all automated metrics.
golden_examples = [
{
"id": "g_001",
"query": "moisturizer for oily skin under $30",
"context": {"user_skin_type": "oily", "budget": 30},
"expected_retrieved_ids": ["prod_123", "prod_456"],
"expected_response_contains": ["non-comedogenic", "oil-free", "lightweight"],
"expected_citations": [1, 2],
"difficulty": "medium"
},
{
"id": "g_002",
"query": "foundation",
"context": {},
"expected_action": "CLARIFY",
"expected_clarifying_question_contains": ["skin type", "shade", "coverage"],
"difficulty": "hard"
}
]
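Whatever schema you settle on, validate examples at load time so a malformed entry fails loudly instead of silently skewing your metrics. A minimal sketch against the fields shown above (`validate_golden_example` is a hypothetical helper, not part of the pipeline):

```python
def validate_golden_example(ex: dict) -> list:
    """Return a list of schema problems (empty list = valid example)."""
    issues = []
    for field in ("id", "query", "difficulty"):
        if field not in ex:
            issues.append(f"missing required field: {field}")
    # Every example needs at least one expectation to score against
    if not any(key.startswith("expected_") for key in ex):
        issues.append("no expected_* field to evaluate against")
    return issues
```

Run it over the whole file on load and refuse to start the evaluation if any example comes back with issues.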
Building It Without Guessing
Don't invent examples from your imagination. Sample real queries from production traffic, then label them.

import random

# Pull from recent logs (load_logs and word_count are stand-ins for your own helpers)
production_queries = load_logs(last_weeks=1, n=1000)

# Stratified sample by complexity
simple = [q for q in production_queries if word_count(q) < 5]
medium = [q for q in production_queries if 5 <= word_count(q) < 15]
complex_queries = [q for q in production_queries if word_count(q) >= 15]  # don't shadow the builtin `complex`

sample = (
    random.sample(simple, 30) +
    random.sample(medium, 50) +
    random.sample(complex_queries, 20)
)
Then have two annotators label each example independently. Target inter-annotator agreement above 0.8. Resolve disagreements with a third reviewer.
One rule: never modify your golden set in place. Version it. golden_v1.jsonl → golden_v2.jsonl. Track the diff. Your historical metrics are meaningless if the benchmark silently changes under them.
Part 2: Automated Metrics
Retrieval Metrics
These answer the question: did we fetch the right documents?
def mean_reciprocal_rank(retrieved_ids: list, relevant_ids: list) -> float:
    """Reciprocal rank of the first relevant result (0.0 if none in the top 10)."""
    for rank, doc_id in enumerate(retrieved_ids[:10], 1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
import numpy as np

def ndcg_at_k(retrieved_ids: list, relevant_ids: list, k: int = 5) -> float:
    """Ranking quality. Binary relevance here; swap in graded gains if you have them."""
    def dcg(ids):
        return sum(
            (1.0 if rid in relevant_ids else 0.0) / np.log2(i + 2)
            for i, rid in enumerate(ids[:k])
        )
    ideal_dcg = dcg(relevant_ids)  # every id relevant, so this is the best possible ordering
    actual_dcg = dcg(retrieved_ids)
    return actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0
def precision_at_k(retrieved_ids: list, relevant_ids: list, k: int = 5) -> float:
"""How many of top-k results are actually relevant?"""
hits = sum(1 for rid in retrieved_ids[:k] if rid in relevant_ids)
return hits / k
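Precision@k has a natural counterpart the snippets above leave out: recall@k, the fraction of all relevant documents that made it into the top k. Low precision with high recall points to a noisy but complete retriever; the reverse points to one that's too conservative. A sketch in the same style:

```python
def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int = 5) -> float:
    """What fraction of the relevant documents made it into the top k?"""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for rid in retrieved_ids[:k] if rid in relevant_ids)
    return hits / len(relevant_ids)
```

Track both: tuning k or the similarity threshold usually trades one against the other.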
Generation Metrics
Faithfulness — does the response follow from the sources, or is the model adding things it invented?
from sentence_transformers import CrossEncoder
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")
def faithfulness_score(response: str, sources: list) -> float:
    """Average probability that the top sources entail the response."""
    pairs = [(source["text"], response) for source in sources[:3]]
    # Softmax over the model's three NLI classes; for this checkpoint the
    # label order is [contradiction, entailment, neutral]
    probs = nli_model.predict(pairs, apply_softmax=True)
    return float(np.mean([p[1] for p in probs]))  # average entailment probability
Citation accuracy — are the [1], [2] references in the response pointing to real sources?
import re
def citation_accuracy(response: str, sources: list) -> dict:
citations = re.findall(r'\[(\d+)\]', response)
citation_indices = [int(c) - 1 for c in citations]
issues = []
for idx in citation_indices:
if idx < 0 or idx >= len(sources):
issues.append(f"Invalid citation [{idx + 1}]")
return {
"citation_count": len(citations),
"valid_citations": len(citations) - len(issues),
"accuracy": (len(citations) - len(issues)) / len(citations) if citations else 1.0,
"issues": issues
}
Answer relevance — does the response actually address the query?
def answer_relevance(query: str, response: str) -> float:
    """Embedding similarity between query and response (client is an OpenAI client)."""
    query_emb = client.embeddings.create(
        input=query, model="text-embedding-3-small"
    ).data[0].embedding
    response_emb = client.embeddings.create(
        input=response, model="text-embedding-3-small"
    ).data[0].embedding
    return cosine_similarity(query_emb, response_emb)
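The snippet above assumes a `cosine_similarity` helper is already in scope. A minimal NumPy version, in case you'd rather not pull one in from scikit-learn:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```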
Latency Metrics
Speed is a quality signal. Measure every stage.
import time
from functools import wraps

def measure_latency(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = (time.perf_counter() - start) * 1000
        return {"result": result, "latency_ms": elapsed}
    return wrapper
@measure_latency
def embed_query(query: str):
return embedding_model.encode(query)
@measure_latency
def retrieve(query_emb):
return vector_store.search(query_emb)
Track P50, P95, and P99. P95 is usually the most actionable — it's the experience your worst-off users are getting, without being dominated by outliers.
Part 3: The Evaluation Pipeline
Daily Automated Run
class EvaluationPipeline:
def __init__(self, golden_path: str, system_under_test):
self.golden = load_jsonl(golden_path)
self.sut = system_under_test
def run(self) -> dict:
results = []
for example in tqdm(self.golden):
start = time.time()
output = self.sut.process(example["query"])
latency = (time.time() - start) * 1000
            results.append({
                "example_id": example["id"],
                "query": example["query"],
                "difficulty": example["difficulty"],  # required for the by_difficulty breakdown
                "latency_ms": latency,
                "mrr": mean_reciprocal_rank(
                    output["retrieved_ids"],
                    example.get("expected_retrieved_ids", [])  # clarify-style examples have none
                ),
                "ndcg_5": ndcg_at_k(
                    output["retrieved_ids"],
                    example.get("expected_retrieved_ids", []),
                    k=5
                ),
"faithfulness": faithfulness_score(
output["response"],
output["sources"]
),
"citation_accuracy": citation_accuracy(
output["response"],
output["sources"]
)["accuracy"],
"answer_relevance": answer_relevance(
example["query"],
output["response"]
),
"passes_guardrails": output["guardrail_passed"],
                # Examples with no expected_action shouldn't count against the rate
                "correct_action": output["action"] == example.get(
                    "expected_action", output["action"]
                )
})
return self.aggregate(results)
def aggregate(self, results) -> dict:
df = pd.DataFrame(results)
return {
"retrieval": {
"mrr_mean": df["mrr"].mean(),
"mrr_p10": df["mrr"].quantile(0.10),
"ndcg_mean": df["ndcg_5"].mean()
},
"generation": {
"faithfulness_mean": df["faithfulness"].mean(),
"citation_acc_mean": df["citation_accuracy"].mean(),
"relevance_mean": df["answer_relevance"].mean()
},
"latency": {
"p50": df["latency_ms"].median(),
"p95": df["latency_ms"].quantile(0.95),
"p99": df["latency_ms"].quantile(0.99)
},
"reliability": {
"guardrail_pass_rate": df["passes_guardrails"].mean(),
"correct_action_rate": df["correct_action"].mean()
},
"by_difficulty": df.groupby("difficulty")[["mrr", "faithfulness"]].mean().to_dict()
}
Regression Detection
A daily run is only useful if something happens when metrics drop. Here's a regression detector with configurable tolerance:
class RegressionDetector:
    def __init__(self, baseline_metrics: dict, tolerance: float = 0.05):
        self.baseline = baseline_metrics
        self.tolerance = tolerance

    @staticmethod
    def _get_nested(d: dict, path: str):
        """Resolve a dotted path like 'retrieval.mrr_mean' in a nested dict."""
        for key in path.split("."):
            d = d[key]
        return d

    def check(self, new_metrics: dict) -> list:
regressions = []
checks = [
("retrieval.mrr_mean", "Retrieval MRR"),
("generation.faithfulness_mean", "Faithfulness"),
("latency.p95", "P95 Latency"),
("reliability.guardrail_pass_rate", "Guardrail Pass Rate")
]
for path, name in checks:
baseline = self._get_nested(self.baseline, path)
current = self._get_nested(new_metrics, path)
# Latency: lower is better
if "latency" in path:
if current > baseline * (1 + self.tolerance):
regressions.append({
"metric": name,
"baseline": baseline,
"current": current,
"change": f"+{((current/baseline - 1) * 100):.1f}%"
})
else:
# Everything else: higher is better
if current < baseline * (1 - self.tolerance):
regressions.append({
"metric": name,
"baseline": baseline,
"current": current,
"change": f"-{((1 - current/baseline) * 100):.1f}%"
})
return regressions
def alert(self, regressions: list):
if regressions:
message = "REGRESSION DETECTED:\n" + "\n".join([
f"- {r['metric']}: {r['baseline']:.3f} → {r['current']:.3f} ({r['change']})"
for r in regressions
])
send_slack_alert("#ml-alerts", message)
A 5% tolerance is a reasonable starting point. Tighten it as your baselines stabilize and the system matures.
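The core of that asymmetric check — quality metrics regress downward, latency regresses upward — distills to a few lines, shown here standalone for clarity:

```python
def is_regression(baseline: float, current: float,
                  tolerance: float = 0.05, lower_is_better: bool = False) -> bool:
    """Flag a change beyond `tolerance`, in whichever direction is bad for this metric."""
    if lower_is_better:  # e.g. latency
        return current > baseline * (1 + tolerance)
    return current < baseline * (1 - tolerance)
```

Keeping the direction logic in one place like this makes it harder to accidentally alert on a latency *improvement*.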
Part 4: Human Evaluation (Done Right)
Automated metrics can't catch everything. Response helpfulness, tone, and nuanced faithfulness edge cases all need human judgment. The key is using humans efficiently.
What Automation Can and Can't Do
| Task | Automated | Human |
|---|---|---|
| MRR, NDCG calculation | ✅ | ❌ |
| Faithfulness (clear cases) | ✅ | ❌ |
| Faithfulness (edge cases) | ⚠️ | ✅ |
| Response helpfulness | ❌ | ✅ |
| Tone, style, brand voice | ❌ | ✅ |
Sample Strategically
Don't review a random 50 examples. Review the examples that are most likely to surface issues:
import random
from collections import defaultdict

def select_for_human_eval(all_results: list, n: int = 50) -> list:
    # Failures first
    failures = [r for r in all_results
                if not r["passes_guardrails"] or r["faithfulness"] < 0.6]
# Uncertain cases — where the model might be right or wrong
uncertain = [r for r in all_results if 0.4 < r["faithfulness"] < 0.8]
# Diverse sample across query types
by_type = defaultdict(list)
for r in all_results:
by_type[classify_query(r["query"])].append(r)
diverse = []
for qtype, items in by_type.items():
diverse.extend(random.sample(items, min(5, len(items))))
selected = list({r["example_id"]: r
for r in failures + uncertain + diverse}.values())
return selected[:n]
A Rubric Worth Using
Unstructured "is this good?" questions produce inconsistent ratings. Give annotators something concrete:
Rate this response on 5 dimensions (1–5):
1. ACCURACY — Information is correct and grounded in sources
2. COMPLETENESS — Addresses all parts of the question
3. CLARITY — Easy to understand, well-structured
4. HELPFULNESS — Actually helps the user make progress
5. SAFETY — No harmful, biased, or inappropriate content
Overall: Would you be satisfied with this response?
Provide brief justification for each score.
Check inter-annotator agreement with Cohen's kappa. Target above 0.6 (substantial agreement). If you're consistently below that, the rubric needs refinement before the ratings mean anything.
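If you want the kappa number without reaching for a library, a minimal two-rater implementation looks like this (in practice, `sklearn.metrics.cohen_kappa_score` does the same with more guardrails):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two annotators, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in counts_a.keys() | counts_b.keys())
    return (observed - expected) / (1 - expected)
```

A raw agreement rate of 75% can still be mediocre kappa if one label dominates, which is exactly why the chance correction matters.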
Part 5: Continuous Integration
Evaluation that only runs on demand gets skipped. Put it in CI so it runs on every PR automatically.
# .github/workflows/eval.yml
name: Evaluation
on: [pull_request]
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
- name: Run evaluation
run: python -m evaluation.run --golden golden_v2.jsonl --output results.json
- name: Check for regressions
run: python -m evaluation.check_regression --baseline baseline.json --current results.json
- name: Comment results on PR
uses: actions/github-script@v6
with:
script: |
const results = JSON.parse(require('fs').readFileSync('results.json', 'utf8'));
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `## Evaluation Results\n\n` +
`| Metric | Value | Status |\n` +
`|--------|-------|--------|\n` +
`| MRR | ${results.retrieval.mrr_mean.toFixed(3)} | ✅ |\n` +
`| Faithfulness | ${results.generation.faithfulness_mean.toFixed(3)} | ✅ |\n` +
`| P95 Latency | ${results.latency.p95.toFixed(0)}ms | ${results.latency.p95 < 500 ? '✅' : '⚠️'} |`
});
Every PR now gets an automated evaluation comment. Reviewers can see metric changes alongside code changes.
The Streamlit Dashboard
import json
import streamlit as st

st.title("RAG Evaluation Dashboard")
results = json.load(open("latest_eval.json"))

col1, col2, col3 = st.columns(3)
# Deltas hardcoded for illustration — in practice, compute them against the previous run
col1.metric("MRR", f"{results['retrieval']['mrr_mean']:.3f}", "+0.02")
col2.metric("Faithfulness", f"{results['generation']['faithfulness_mean']:.3f}", "-0.01")
col3.metric("P95 Latency", f"{results['latency']['p95']:.0f}ms", "-50ms")
st.line_chart(load_historical_metrics())
failures = [r for r in results["per_example"] if r["faithfulness"] < 0.6]
st.table(failures[:10])
The Full Stack
┌─────────────────────────────────────────┐
│ GOLDEN DATASET │
│ Versioned, diverse, expert-labeled │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ AUTOMATED METRICS (CI) │
│ • Retrieval: MRR, NDCG, Precision │
│ • Generation: Faithfulness, Citations │
│ • Latency: P50, P95, breakdown │
│ • Reliability: Guardrails, errors │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ REGRESSION DETECTION │
│ Compare to baseline, alert on degrade │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ HUMAN EVALUATION (Weekly) │
│ Sampled, rubric-based, IAA-checked │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ A/B TESTING (Monthly) │
│ New model vs. production, business KPIs│
└─────────────────────────────────────────┘
The thing nobody tells you about building LLM systems: getting the model to generate output is 20% of the work. Understanding whether that output is any good — and knowing the moment it gets worse — is the other 80%.
Build the evaluation stack early. It's what turns a prototype you're guessing about into a system you can actually improve.
Next up: A/B testing LLM systems — when your new model "looks better" but the metrics disagree.