Gabriel Anhaia
Your RAG Eval Set Is Probably Wrong. The Test That Catches It.


Picture a composite scenario drawn from several RAG postmortems: a team ships a legal-research assistant with a Ragas faithfulness score in the mid-0.9s and an answer-relevance score not far behind. Two weeks after launch, customer-success starts forwarding screenshots of the bot citing the wrong jurisdiction. The eval scores never move.

The eval set is a few hundred questions hand-written months earlier. Production users are running tax-court queries with citation patterns that do not exist anywhere in the eval set. The evals are measuring how well the system answered last quarter's questions, not this morning's. The dashboard is green for a system nobody is actually using.

Most teams blame the retriever. It is almost never the retriever. Your eval set is wrong in three ways at once, and only one of them shows up in your metrics. Below: the three failure modes, and a single drift test that catches the worst of them in 40 lines of Python.

The three ways your eval set lies to you

All three ship to production regularly, and each one hides behind a great-looking eval score on the dashboard.

1. Leakage into training data

If you generated your eval questions with GPT-4 against your corpus, and your corpus is on the public internet, your eval is half-leaked already. The model has seen the documents and probably seen something close to your "synthetic" questions. The PremAI 2026 RAG eval guide says it directly: optimizing for Ragas scores on a leaked set produces systems that look good in CI and break the day a real user asks a real question.

The signal is subtle. Faithfulness is high because the model has memorized the answers. Context recall is high because the retriever finds the chunks the model already knows. The eval is measuring memorization, with a thin retrieval check on top.

To check, ask your model the eval question with no retrieved context. If the closed-book answer score is within 5 points of the RAG answer score, the retriever is not adding value, and the eval is measuring something else.
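
A minimal sketch of that closed-book comparison, assuming you already have a rag_answer(question) pipeline and a 0-to-1 score(question, answer) grader from your existing eval harness (both names are placeholders for whatever you run in CI):

from openai import OpenAI

client = OpenAI()


def closed_book_answer(question: str) -> str:
    # No retrieved context at all: whatever scores well here, the model already knew.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def leakage_gap(question, rag_answer, score):
    # rag_answer and score are your existing pipeline and grader (placeholders here).
    rag_score = score(question, rag_answer(question))
    closed_score = score(question, closed_book_answer(question))
    return rag_score - closed_score  # below ~0.05 on a 0-1 scale: retrieval adds little

Run it over the whole eval set and look at the distribution of gaps, not a single question.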

2. Drift from real production queries

This is the one that bit the legal team above, and it is the one that is most often invisible. Your eval set was hand-written or LLM-generated against a snapshot of intent that existed at one point in time. Your production traffic is a moving target. New segments show up, the vocabulary shifts, novel failure modes appear. The eval set freezes; reality does not.

Atlan's framework comparison makes the same point: knowledge bases that change daily drift faster than the eval sets watching them. The evals stay green while the product gets worse.

This is the failure mode the 40-line test below is designed to catch.

3. Judge bias

Most modern RAG evals (Ragas, DeepEval, TruLens) use an LLM as the grader. That works until the judge has the same blind spot as the system under test. If both are GPT-4o, both share opinions about what "faithful" looks like, both prefer answers that hedge, both penalize the same kind of curt phrasing. You are scoring with a ruler made by the same factory that produced the thing being measured.

The Ragas paper discusses this directly, and the LLM-judge bias literature (Zheng et al., "Judging LLM-as-a-Judge") shows that swapping the judge model can move metric scores meaningfully, often into the high single digits, without any change to the system being evaluated. Run one cross-judge sanity check: grade with gpt-4o, grade with claude-sonnet-4, compare. Without that, your scores have a confidence interval you cannot see.
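
A minimal version of that cross-judge check, assuming a shared rubric prompt and the two model IDs below (all of it illustrative; swap in whatever grader prompt your eval framework already uses):

import anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

RUBRIC = (
    "Rate how faithful the ANSWER is to the CONTEXT on a 0-1 scale. "
    "Reply with only the number.\n\nCONTEXT:\n{ctx}\n\nANSWER:\n{ans}"
)


def judge_gpt(ctx, ans):
    r = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": RUBRIC.format(ctx=ctx, ans=ans)}],
    )
    return float(r.choices[0].message.content.strip())


def judge_claude(ctx, ans):
    r = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model ID
        max_tokens=8,
        messages=[{"role": "user", "content": RUBRIC.format(ctx=ctx, ans=ans)}],
    )
    return float(r.content[0].text.strip())


def cross_judge_gap(samples):
    # samples: (context, answer) pairs from one eval run
    gaps = [abs(judge_gpt(c, a) - judge_claude(c, a)) for c, a in samples]
    return sum(gaps) / len(gaps)

If the mean gap is a meaningful fraction of the headroom you think you have, the judge is part of your metric.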

Drift is the one that hits production hardest because it gets worse over time without any code change. You ship green. Traffic shifts. The eval stays green. Customers churn. The other two are bad on day one. Drift is bad on day 90.

The query-distribution drift test

Here is the test, in 40 lines. It compares the distribution of your eval-set queries against the distribution of your production queries along three axes (embedding-space centroid distance, length distribution, and intent-class distribution) and alerts when any of the three crosses a threshold.

import os
from collections import Counter

import numpy as np
from openai import OpenAI
from scipy.stats import wasserstein_distance

client = OpenAI()
EMBED_MODEL = "text-embedding-3-large"


def embed(texts):
    out = client.embeddings.create(input=texts, model=EMBED_MODEL)
    return np.array([d.embedding for d in out.data])


def centroid_cosine(a, b):
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    return float(
        1 - (ca @ cb) / (np.linalg.norm(ca) * np.linalg.norm(cb))
    )


def length_emd(a, b):
    la = np.array([len(x.split()) for x in a], dtype=float)
    lb = np.array([len(x.split()) for x in b], dtype=float)
    return float(wasserstein_distance(la, lb))


def intent_tv(a_labels, b_labels):
    keys = set(a_labels) | set(b_labels)
    ca, cb = Counter(a_labels), Counter(b_labels)
    pa = np.array([ca[k] / len(a_labels) for k in keys])
    pb = np.array([cb[k] / len(b_labels) for k in keys])
    return float(0.5 * np.abs(pa - pb).sum())  # total variation


def drift_report(eval_q, eval_intents, prod_q, prod_intents):
    ea, eb = embed(eval_q), embed(prod_q)
    return {
        "centroid_cosine": centroid_cosine(ea, eb),
        "length_emd":      length_emd(eval_q, prod_q),
        "intent_tv":       intent_tv(eval_intents, prod_intents),
    }


THRESHOLDS = {
    "centroid_cosine": 0.05,
    "length_emd":      4.0,
    "intent_tv":       0.15,
}


def alert_if_drift(report):
    fired = [
        (k, v) for k, v in report.items()
        if v > THRESHOLDS[k]
    ]
    if fired:
        print("DRIFT", fired)
        return True
    return False

Three signals, one report. Each one catches a different shape of drift.

Centroid cosine distance is the headline number. It says: the average semantic location of your eval queries has moved this far from the average semantic location of your production queries. A sane starting point is around 0.05 in text-embedding-3-large space for short queries, but tune it on a baseline week of your own data. The Evidently embedding drift writeup covers the same idea with reference data versus current data.

The second signal, length earth-mover distance, catches a specific failure mode: an eval set of short, well-formed questions paired with production traffic that is long, copy-pasted dumps from internal Slack threads. Same intent, very different retrieval shape. The retriever performs differently on a 6-word question than on a 90-word one, and a length-blind eval misses that entirely.

Intent total-variation distance catches the case where the vocabulary stays similar but the user task changes. If 30% of your eval set is "definition of X" and 50% of your prod traffic is "compare X and Y," the eval score on definition questions does not predict the system's behavior on comparison questions. Use a small classifier (even a 5-class zero-shot one, sketched below) and compare label distributions. As a starting threshold, anything past 0.15 total variation is a different product.
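
A sketch of that intent-labeling step with an off-the-shelf zero-shot classifier from Hugging Face transformers; the five class names are illustrative and should come from your own product:

from transformers import pipeline

# NLI-based zero-shot classifier: no training data, just candidate labels.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

INTENTS = ["definition", "comparison", "procedure", "troubleshooting", "citation lookup"]


def classify_intents(queries):
    results = classifier(queries, candidate_labels=INTENTS)
    # One dict per query, labels sorted by score; keep the top one.
    return [r["labels"][0] for r in results]

The resulting labels are exactly what intent_tv() above expects.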

Wiring it into a periodic check

The 40 lines above are just functions. The piece you actually ship is the cron job that runs them and the alert that pages someone.

# Run nightly. Sample 500 prod queries from the last 24h.
import json
import sqlite3

# drift_report and alert_if_drift come from the snippet above,
# assumed saved next to this script as drift_check.py (name is illustrative).
from drift_check import drift_report, alert_if_drift


def load_eval():
    with open("eval/queries.jsonl") as f:
        rows = [json.loads(line) for line in f]
    return [r["q"] for r in rows], [r["intent"] for r in rows]


def sample_prod(db_path, n=500):
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        "SELECT query, intent FROM rag_logs"
        " WHERE ts > datetime('now', '-1 day')"
        " ORDER BY RANDOM() LIMIT ?", (n,))
    rows = cur.fetchall()
    conn.close()
    return [r[0] for r in rows], [r[1] for r in rows]


if __name__ == "__main__":
    eq, ei = load_eval()
    pq, pi = sample_prod("rag.db")
    report = drift_report(eq, ei, pq, pi)
    print(report)
    if alert_if_drift(report):
        # post to Slack, page on-call, open a ticket.
        ...

The thresholds above are sane defaults for a corpus of business documents and short user queries. Tune them on your own data: run the script for seven days against a frozen eval set with stable traffic, take the 95th percentile of each signal, set the threshold there.
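
That tuning step is mechanical. A sketch, assuming you save each nightly report as a JSON file in a baseline_reports/ directory (the layout is illustrative):

import glob
import json

import numpy as np


def thresholds_from_baseline(report_dir="baseline_reports"):
    # One report per night from the baseline week of stable traffic.
    reports = [json.load(open(path)) for path in glob.glob(f"{report_dir}/*.json")]
    return {
        key: float(np.percentile([r[key] for r in reports], 95))
        for key in ("centroid_cosine", "length_emd", "intent_tv")
    }

Drop the result straight into THRESHOLDS and re-derive it whenever the eval set itself changes.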

What the alert should actually do

A drift alert is not a bug ticket. It is a "your evaluation is no longer measuring what your users do" alert. The right response is to refresh the eval set, not to chase a fix in the system.

Three actions to bake into the runbook:

  1. Sample 200 production queries from the drift window for human review (a smaller, hand-checked slice than the 500 the cron uses for the drift signal), label their intents with the same classifier, and add them to a prod_addendum slice of the eval set.
  2. Re-run all metrics against the addendum specifically. If your faithfulness drops 10 points on the new slice, the system is failing on the new query shape. That is the bug to fix.
  3. Decay the original eval set. Mark every question with a created_at. After 90 days, drop questions whose intent class no longer appears in production. Eval sets should age out the same way feature flags do; a minimal sketch of the rule follows this list.
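
A minimal sketch of that decay rule, assuming each eval row carries a created_at ISO timestamp and an intent label (field names are illustrative):

from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)


def decay_eval_set(eval_rows, live_intents):
    # eval_rows: dicts with "q", "intent", "created_at".
    # live_intents: set of intent labels seen in recent production traffic.
    now = datetime.now(timezone.utc)
    kept = []
    for row in eval_rows:
        created = datetime.fromisoformat(row["created_at"]).replace(tzinfo=timezone.utc)
        if now - created > MAX_AGE and row["intent"] not in live_intents:
            continue  # aged out: stale question for an intent nobody asks anymore
        kept.append(row)
    return kept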

Run this loop and you have a system you can trust instead of a dashboard you watch. The Galileo data drift docs push the same idea: an eval set is a living artifact, not a fixture file.

The headline

Your eval scores can be high while your product is broken. The likeliest reason is not your retriever or your model. It is that your eval set has drifted away from the queries your users are actually sending. The 40 lines above will not fix the drift. They will tell you it happened, the day it happened, before customer-success forwards the screenshot.

If your dashboard is green and your support queue is red, the test runs in 40 lines. Start there.

If this was useful

The eval and observability chapters of RAG Pocket Guide and LLM Observability Pocket Guide cover the full version of the test above: proper baselines, judge-swap protocols, the slice-by-slice breakdown that catches the failure modes Ragas alone misses, and the on-call runbook for drift alerts. If your evals are green and your users are not, both books are written for that gap.
