A RAG evaluator that admits what it can't judge

Melissa D. Ellison — Fri, 03 Jul 2026 02:08:43 +0000

Fail-closed groundedness, deterministic corroborators, and a self-test — because an evaluator should be more trustworthy than the thing it grades.

The quiet flaw in "LLM-as-judge" evals

Most tools that score AI output are an LLM grading an LLM, and they report every number in the same confident voice — the verified ones and the guessed ones alike. For evaluation that's backwards. An evaluator's whole job is to be more trustworthy than the model it grades, not equally credulous.

rag-triad is a small local evaluator for retrieval-augmented answers built on one rule: lean on a deterministic check wherever one exists, and abstain — out loud — wherever one doesn't.

Localizing the failure, not just scoring it

A RAG answer can fail in three different places (bad retrieval, hallucinated generation, or an off-topic reply), and each needs a different fix. A single quality score can't tell them apart. The triad can:

context relevance ✗ → retrieval miss (fix chunking / embeddings / top-k)
groundedness ✗ → hallucination (fix generation, or enforce cite-and-verify)
answer relevance ✗ → off-topic (fix the prompt)
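
To make the mapping above concrete, here is a minimal sketch of turning three per-leg verdicts into a diagnosis. The leg names, the `localize` function, and the use of `None` for "abstained" are all assumptions for illustration, not rag-triad's actual API.

```python
from typing import Optional

# Hypothetical leg names; the fix hints are copied from the mapping above.
FIXES = {
    "context_relevance": "retrieval miss: fix chunking / embeddings / top-k",
    "groundedness": "hallucination: fix generation, or enforce cite-and-verify",
    "answer_relevance": "off-topic: fix the prompt",
}


def localize(verdicts: dict[str, Optional[bool]]) -> list[str]:
    """Return one finding per leg that failed or could not be judged.

    True = pass, False = fail, None (or a missing key) = not judged.
    """
    findings = []
    for leg, fix in FIXES.items():
        verdict = verdicts.get(leg)
        if verdict is False:
            findings.append(f"{leg}: {fix}")
        elif verdict is None:
            # Surface the gap instead of silently counting it as a pass.
            findings.append(f"{leg}: not judged (abstained)")
    return findings
```

Calling `localize({"context_relevance": True, "groundedness": False, "answer_relevance": True})` returns a single finding that points at generation, which is the piece of information a single blended score would hide.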

The discipline (this is the actual contribution)

The triad framing is standard (TruLens, RAGAS). What's different:

Fail-closed groundedness: the judge must cite a quote, and code verifies that the quote appears in the context, so a fabricated citation can't pass. The worst case is an honest DEFER.
A deterministic corroborator matched to each leg's failure mode: an embedding-similarity floor for context relevance, and an answer-type gate for answer relevance. Reusing the embedding trick on the answer leg would backfire, because cosine similarity rewards topical-but-evasive answers. The signal has to fit the failure.
Judges abstain instead of bluffing: each judge is sampled N times, and disagreement yields ABSTAIN, not a fake score.
Validate the validator: --selftest runs planted failures that the evaluator must catch before you trust it.
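
Two of these rules are easy to sketch in a few lines. The code below is an illustration under stated assumptions, not the repository's implementation: the function names, the `SUPPORTED` / `UNSUPPORTED` labels, the whitespace normalization, and the `min_agreement` parameter are hypothetical. Only the DEFER and ABSTAIN outcomes come from the description above.

```python
from collections import Counter


def _normalize(text: str) -> str:
    # Collapse whitespace and case so line wrapping cannot fail a real quote.
    return " ".join(text.split()).casefold()


def check_groundedness(judge_verdict: str, quote: str, context: str) -> str:
    """Fail closed: a SUPPORTED verdict stands only if its quote is in the context."""
    if judge_verdict != "SUPPORTED":
        return judge_verdict
    q = _normalize(quote)
    if q and q in _normalize(context):
        return "SUPPORTED"
    # Missing or fabricated citation: refuse to pass it.
    return "DEFER"


def aggregate(samples: list[str], min_agreement: float = 1.0) -> str:
    """Return the shared verdict, or ABSTAIN when the samples disagree.

    min_agreement=1.0 (an assumed default) requires unanimity.
    """
    if not samples:
        return "ABSTAIN"
    verdict, count = Counter(samples).most_common(1)[0]
    return verdict if count / len(samples) >= min_agreement else "ABSTAIN"
```

The design choice worth noticing is that the string check can only downgrade a verdict. The judge's claim of support is never trusted on its own, so the deterministic step cannot make the evaluator more credulous than the LLM was.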

Why calibration is the point

The property that makes downstream AI safe isn't raw capability — it's calibration. A more capable model that's confidently wrong is more dangerous than a weaker one that abstains. (I've watched a newer model generation shift on hard computational questions from confident-wrong to honestly-inconclusive — exactly the move an evaluator should reward and a naive scorer misses.) So rag-triad prizes the honest "I can't tell" over the confident guess.

Code and a one-command demo: github.com/MonongahelaHellbender/rag-triad. It runs locally on Ollama and is MIT-licensed.

DEV Community: Melissa D. Ellison
