Ritwika Kancharla
Building an LLM Evaluation Framework That Actually Works

Stop Eyeballing Your RAG Outputs. Start Measuring Quality.

I shipped a RAG system. It felt fine. Then users started reporting wrong product recommendations, invented prices, and confidently wrong answers to questions the documents couldn't support.

I had no numbers. No regression detection. No systematic way to improve. I was flying blind.

This is how I built an evaluation stack that catches failures before users do.


What "Evaluation" Actually Means

Most teams jump straight to asking humans "does this seem good?" That's too slow and too expensive to run on every change. There's a whole layer of automated evaluation that should come first.

Level        Question                                   Cadence
Unit         Does this component work correctly?        Every commit
Integration  Does the full pipeline work end-to-end?    Every PR
Human        Do users actually find this helpful?       Weekly
A/B          Is the new version measurably better?      Monthly

The lower layers are fast and cheap. Build them first, then let human evaluation handle the things automation genuinely can't.
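The unit layer needs no special tooling: plain assertions against a couple of golden examples, run on every commit. A minimal sketch (the `retrieve` function here is a hypothetical stand-in for your real retriever):

```python
# test_retrieval.py -- runs on every commit
def retrieve(query: str) -> list[str]:
    # Stand-in for the real retriever; returns ranked document IDs.
    index = {"moisturizer": ["prod_123", "prod_456"], "foundation": ["prod_789"]}
    for keyword, ids in index.items():
        if keyword in query.lower():
            return ids
    return []

def test_known_query_hits_expected_docs():
    assert "prod_123" in retrieve("moisturizer for oily skin")

def test_unknown_query_returns_empty():
    assert retrieve("zzz") == []
```

Cheap tests like these catch "the retriever returns nothing" before any metric ever runs.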


Part 1: The Golden Dataset

Everything starts here. A golden dataset is a hand-curated set of examples that represent correct behavior — your ground truth for all automated metrics.

golden_examples = [
    {
        "id": "g_001",
        "query": "moisturizer for oily skin under $30",
        "context": {"user_skin_type": "oily", "budget": 30},
        "expected_retrieved_ids": ["prod_123", "prod_456"],
        "expected_response_contains": ["non-comedogenic", "oil-free", "lightweight"],
        "expected_citations": [1, 2],
        "difficulty": "medium"
    },
    {
        "id": "g_002",
        "query": "foundation",
        "context": {},
        "expected_action": "CLARIFY",
        "expected_clarifying_question_contains": ["skin type", "shade", "coverage"],
        "difficulty": "hard"
    }
]

Building It Without Guessing

Don't invent examples from your imagination. Sample from real production traffic, then label them.

import random

# Pull from recent logs
production_queries = load_logs(last_weeks=1, n=1000)

# Stratified sample by complexity (avoid shadowing the built-in `complex`)
simple_q  = [q for q in production_queries if word_count(q) < 5]
medium_q  = [q for q in production_queries if 5 <= word_count(q) < 15]
complex_q = [q for q in production_queries if word_count(q) >= 15]

sample = (
    random.sample(simple_q, 30) +
    random.sample(medium_q, 50) +
    random.sample(complex_q, 20)
)

Then have two annotators label each example independently. Target inter-annotator agreement above 0.8. Resolve disagreements with a third reviewer.

One rule: never modify your golden set in place. Version it: golden_v1.jsonl → golden_v2.jsonl. Track the diff. Your historical metrics are meaningless if the benchmark silently changes under them.
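A cheap way to enforce the versioning rule is to fingerprint the golden file at load time and record the hash alongside every metrics run; if the hash changes, the benchmark moved. A minimal sketch, assuming the JSONL format from the filenames above:

```python
import hashlib
import json

def load_golden(path: str) -> tuple[list, str]:
    """Load a golden set and fingerprint it, so each eval run
    records exactly which benchmark version produced its metrics."""
    with open(path, "rb") as f:
        raw = f.read()
    version_hash = hashlib.sha256(raw).hexdigest()[:12]
    examples = [json.loads(line) for line in raw.decode().splitlines() if line.strip()]
    return examples, version_hash
```

Store `version_hash` in the same row as the aggregated metrics; a mismatch between two runs tells you a score change might be the benchmark, not the system.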


Part 2: Automated Metrics

Retrieval Metrics

These answer the question: did we fetch the right documents?

import numpy as np

def mean_reciprocal_rank(retrieved_ids: list, relevant_ids: list) -> float:
    """Reciprocal rank of the first relevant result (0 if none in top 10)."""
    for rank, doc_id in enumerate(retrieved_ids[:10], 1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids: list, relevant_ids: list, k: int = 5) -> float:
    """Ranking quality with binary relevance (a doc is relevant or it isn't)."""
    def dcg(ids):
        return sum(
            (1.0 if rid in relevant_ids else 0.0) / np.log2(i + 2)
            for i, rid in enumerate(ids[:k])
        )
    ideal_dcg = dcg(relevant_ids[:k])
    actual_dcg = dcg(retrieved_ids)
    return actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0

def precision_at_k(retrieved_ids: list, relevant_ids: list, k: int = 5) -> float:
    """How many of the top-k results are actually relevant?"""
    hits = sum(1 for rid in retrieved_ids[:k] if rid in relevant_ids)
    return hits / k

Generation Metrics

Faithfulness — does the response follow from the sources, or is the model adding things it invented?

import numpy as np
from sentence_transformers import CrossEncoder

nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def faithfulness_score(response: str, sources: list) -> float:
    """Average entailment probability of the response given the top sources.
    Check your model card for label order; for this model the three scores
    correspond to (contradiction, entailment, neutral)."""
    source_sents = [s["text"] for s in sources]
    entailment_probs = []

    for source in source_sents[:3]:
        logits = nli_model.predict([(source, response)])[0]
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the 3 labels
        entailment_probs.append(probs[1])  # entailment index per model card

    return float(np.mean(entailment_probs))

Citation accuracy — are the [1], [2] references in the response pointing to real sources?

import re

def citation_accuracy(response: str, sources: list) -> dict:
    citations = re.findall(r'\[(\d+)\]', response)
    citation_indices = [int(c) - 1 for c in citations]

    issues = []
    for idx in citation_indices:
        if idx < 0 or idx >= len(sources):
            issues.append(f"Invalid citation [{idx + 1}]")

    return {
        "citation_count": len(citations),
        "valid_citations": len(citations) - len(issues),
        "accuracy": (len(citations) - len(issues)) / len(citations) if citations else 1.0,
        "issues": issues
    }

Answer relevance — does the response actually address the query?

import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(query: str, response: str) -> float:
    query_emb = client.embeddings.create(
        input=query, model="text-embedding-3-small"
    ).data[0].embedding

    response_emb = client.embeddings.create(
        input=response, model="text-embedding-3-small"
    ).data[0].embedding

    return cosine_similarity(query_emb, response_emb)

Latency Metrics

Speed is a quality signal. Measure every stage.

import time
from functools import wraps

def measure_latency(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = (time.perf_counter() - start) * 1000  # milliseconds
        return {"result": result, "latency_ms": elapsed}
    return wrapper

@measure_latency
def embed_query(query: str):
    return embedding_model.encode(query)

@measure_latency
def retrieve(query_emb):
    return vector_store.search(query_emb)

Track P50, P95, and P99. P95 is usually the most actionable — it's the experience your worst-off users are getting, without being dominated by outliers.
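Once the per-request latencies are collected, all three percentiles are a single `np.percentile` call. The sample numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical per-request latencies in ms, including two slow outliers.
latencies_ms = [120, 95, 110, 480, 105, 98, 2100, 101, 99, 115]

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```

Notice how the mean would be dragged up by the 2100 ms outlier while P50 barely moves; that gap between P50 and P95 is exactly the signal worth tracking.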


Part 3: The Evaluation Pipeline

Daily Automated Run

class EvaluationPipeline:
    def __init__(self, golden_path: str, system_under_test):
        self.golden = load_jsonl(golden_path)
        self.sut = system_under_test

    def run(self) -> dict:
        results = []

        for example in tqdm(self.golden):
            start = time.time()
            output = self.sut.process(example["query"])
            latency = (time.time() - start) * 1000

            results.append({
                "example_id": example["id"],
                "query": example["query"],
                "difficulty": example.get("difficulty", "unknown"),
                "latency_ms": latency,
                "mrr": mean_reciprocal_rank(
                    output["retrieved_ids"],
                    example["expected_retrieved_ids"]
                ),
                "ndcg_5": ndcg_at_k(
                    output["retrieved_ids"],
                    example["expected_retrieved_ids"],
                    k=5
                ),
                "faithfulness": faithfulness_score(
                    output["response"],
                    output["sources"]
                ),
                "citation_accuracy": citation_accuracy(
                    output["response"],
                    output["sources"]
                )["accuracy"],
                "answer_relevance": answer_relevance(
                    example["query"],
                    output["response"]
                ),
                "passes_guardrails": output["guardrail_passed"],
                "correct_action": output["action"] == example.get("expected_action")
            })

        return self.aggregate(results)

    def aggregate(self, results) -> dict:
        df = pd.DataFrame(results)

        return {
            "retrieval": {
                "mrr_mean": df["mrr"].mean(),
                "mrr_p10": df["mrr"].quantile(0.10),
                "ndcg_mean": df["ndcg_5"].mean()
            },
            "generation": {
                "faithfulness_mean": df["faithfulness"].mean(),
                "citation_acc_mean": df["citation_accuracy"].mean(),
                "relevance_mean": df["answer_relevance"].mean()
            },
            "latency": {
                "p50": df["latency_ms"].median(),
                "p95": df["latency_ms"].quantile(0.95),
                "p99": df["latency_ms"].quantile(0.99)
            },
            "reliability": {
                "guardrail_pass_rate": df["passes_guardrails"].mean(),
                "correct_action_rate": df["correct_action"].mean()
            },
            "by_difficulty": df.groupby("difficulty")[["mrr", "faithfulness"]].mean().to_dict()
        }

Regression Detection

A daily run is only useful if something happens when metrics drop. Here's a regression detector with configurable tolerance:

class RegressionDetector:
    def __init__(self, baseline_metrics: dict, tolerance: float = 0.05):
        self.baseline = baseline_metrics
        self.tolerance = tolerance

    @staticmethod
    def _get_nested(metrics: dict, path: str):
        """Resolve a dotted path like 'retrieval.mrr_mean' in a nested dict."""
        value = metrics
        for key in path.split("."):
            value = value[key]
        return value

    def check(self, new_metrics: dict) -> list:
        regressions = []

        checks = [
            ("retrieval.mrr_mean", "Retrieval MRR"),
            ("generation.faithfulness_mean", "Faithfulness"),
            ("latency.p95", "P95 Latency"),
            ("reliability.guardrail_pass_rate", "Guardrail Pass Rate")
        ]

        for path, name in checks:
            baseline = self._get_nested(self.baseline, path)
            current  = self._get_nested(new_metrics, path)

            # Latency: lower is better
            if "latency" in path:
                if current > baseline * (1 + self.tolerance):
                    regressions.append({
                        "metric": name,
                        "baseline": baseline,
                        "current": current,
                        "change": f"+{((current/baseline - 1) * 100):.1f}%"
                    })
            else:
                # Everything else: higher is better
                if current < baseline * (1 - self.tolerance):
                    regressions.append({
                        "metric": name,
                        "baseline": baseline,
                        "current": current,
                        "change": f"-{((1 - current/baseline) * 100):.1f}%"
                    })

        return regressions

    def alert(self, regressions: list):
        if regressions:
            message = "REGRESSION DETECTED:\n" + "\n".join([
                f"- {r['metric']}: {r['baseline']:.3f} → {r['current']:.3f} ({r['change']})"
                for r in regressions
            ])
            send_slack_alert("#ml-alerts", message)

A 5% tolerance is a reasonable starting point. Tighten it as your baselines stabilize and the system matures.


Part 4: Human Evaluation (Done Right)

Automated metrics can't catch everything. Response helpfulness, tone, and nuanced faithfulness edge cases all need human judgment. The key is using humans efficiently.

What Automation Can and Can't Do

Task                        Automated   Human
MRR, NDCG calculation       ✅
Faithfulness (clear cases)  ✅
Faithfulness (edge cases)   ⚠️          ✅
Response helpfulness                    ✅
Tone, style, brand voice                ✅

Sample Strategically

Don't review a random 50 examples. Review the examples that are most likely to surface issues:

import random
from collections import defaultdict

def select_for_human_eval(all_results: list, n: int = 50) -> list:
    # Failures first
    failures = [r for r in all_results
                if not r["passes_guardrails"] or r["faithfulness"] < 0.6]

    # Uncertain cases — where the model might be right or wrong
    uncertain = [r for r in all_results if 0.4 < r["faithfulness"] < 0.8]

    # Diverse sample across query types
    # (classify_query is your application-specific query-type classifier)
    by_type = defaultdict(list)
    for r in all_results:
        by_type[classify_query(r["query"])].append(r)

    diverse = []
    for qtype, items in by_type.items():
        diverse.extend(random.sample(items, min(5, len(items))))

    # Dedupe by example_id, keeping priority order: failures, uncertain, diverse
    selected = list({r["example_id"]: r
                     for r in failures + uncertain + diverse}.values())
    return selected[:n]

A Rubric Worth Using

Unstructured "is this good?" questions produce inconsistent ratings. Give annotators something concrete:

Rate this response on 5 dimensions (1–5):

1. ACCURACY        — Information is correct and grounded in sources
2. COMPLETENESS    — Addresses all parts of the question
3. CLARITY         — Easy to understand, well-structured
4. HELPFULNESS     — Actually helps the user make progress
5. SAFETY          — No harmful, biased, or inappropriate content

Overall: Would you be satisfied with this response?
Provide brief justification for each score.

Check inter-annotator agreement with Cohen's kappa. Target above 0.6 (substantial agreement). If you're consistently below that, the rubric needs refinement before the ratings mean anything.
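Cohen's kappa is simple enough to compute without a dependency: it's observed agreement, corrected for the agreement two annotators would reach by chance. A minimal sketch for two raters:

```python
def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(ratings_a)
    labels = set(ratings_a) | set(ratings_b)

    # Observed agreement: fraction of items both annotators scored identically
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies
    p_e = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```

A kappa of 1.0 is perfect agreement; 0 is no better than chance. Run it per rubric dimension, not just on the overall satisfaction score, so you can see which dimension the rubric fails to pin down.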


Part 5: Continuous Integration

Evaluation that only runs on demand gets skipped. Put it in CI so it runs on every PR automatically.

# .github/workflows/eval.yml
name: Evaluation

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run evaluation
        run: python -m evaluation.run --golden golden_v2.jsonl --output results.json

      - name: Check for regressions
        run: python -m evaluation.check_regression --baseline baseline.json --current results.json

      - name: Comment results on PR
        uses: actions/github-script@v6
        with:
          script: |
            const results = JSON.parse(require('fs').readFileSync('results.json', 'utf8'));
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Evaluation Results\n\n` +
                    `| Metric | Value | Status |\n` +
                    `|--------|-------|--------|\n` +
                    `| MRR | ${results.retrieval.mrr_mean.toFixed(3)} | ✅ |\n` +
                    `| Faithfulness | ${results.generation.faithfulness_mean.toFixed(3)} | ✅ |\n` +
                    `| P95 Latency | ${results.latency.p95.toFixed(0)}ms | ${results.latency.p95 < 500 ? '✅' : '⚠️'} |`
            });

Every PR now gets an automated evaluation comment. Reviewers can see metric changes alongside code changes.

The Streamlit Dashboard

import json

import streamlit as st

st.title("RAG Evaluation Dashboard")
results = json.load(open("latest_eval.json"))

col1, col2, col3 = st.columns(3)
col1.metric("MRR", f"{results['retrieval']['mrr_mean']:.3f}", "+0.02")
col2.metric("Faithfulness", f"{results['generation']['faithfulness_mean']:.3f}", "-0.01")
col3.metric("P95 Latency", f"{results['latency']['p95']:.0f}ms", "-50ms")

st.line_chart(load_historical_metrics())  # your own metrics-history loader

failures = [r for r in results["per_example"] if r["faithfulness"] < 0.6]
st.table(failures[:10])

The Full Stack

┌─────────────────────────────────────────┐
│         GOLDEN DATASET                  │
│  Versioned, diverse, expert-labeled     │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│      AUTOMATED METRICS (CI)             │
│  • Retrieval: MRR, NDCG, Precision      │
│  • Generation: Faithfulness, Citations  │
│  • Latency: P50, P95, breakdown         │
│  • Reliability: Guardrails, errors      │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│      REGRESSION DETECTION               │
│  Compare to baseline, alert on degrade  │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│      HUMAN EVALUATION (Weekly)          │
│  Sampled, rubric-based, IAA-checked     │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│      A/B TESTING (Monthly)              │
│  New model vs. production, business KPIs│
└─────────────────────────────────────────┘

The thing nobody tells you about building LLM systems: getting the model to generate output is 20% of the work. Understanding whether that output is any good — and knowing the moment it gets worse — is the other 80%.

Build the evaluation stack early. It's what turns a prototype you're guessing about into a system you can actually improve.


Next up: A/B testing LLM systems — when your new model "looks better" but the metrics disagree.
