A RAG evaluator that admits what it can't judge

#ai #llm #opensource #rag

Fail-closed groundedness, deterministic corroborators, and a self-test — because an evaluator should be more trustworthy than the thing it grades.

The quiet flaw in "LLM-as-judge" evals

Most tools that score AI output amount to one LLM grading another, and they report every number in the same confident voice, the verified ones and the guessed ones alike. For evaluation, that's backwards. An evaluator's whole job is to be more trustworthy than the model it grades, not equally credulous.

rag-triad is a small local evaluator for retrieval-augmented answers built on one rule: lean on a deterministic check wherever one exists, and abstain — out loud — wherever one doesn't.

Localizing the failure, not just scoring it

A RAG answer can fail in three different places: bad retrieval, hallucinated generation, or an off-topic reply. Each needs a different fix, and a single quality score can't tell them apart. The triad can:

context relevance ✗ → retrieval miss (fix chunking / embeddings / top-k)
groundedness ✗ → hallucination (fix generation, or enforce cite-and-verify)
answer relevance ✗ → off-topic (fix the prompt)
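As a rough illustration of the mapping above, here is a minimal Python sketch that turns per-leg verdicts into a diagnosis. The leg keys, the "FAIL" verdict string, and the `localize` helper are assumptions made for this example and are not taken from the rag-triad codebase.

```python
from typing import Dict, List

# Hypothetical leg names and verdict strings; the real rag-triad output may differ.
FIXES = {
    "context_relevance": "retrieval miss: fix chunking / embeddings / top-k",
    "groundedness": "hallucination: fix generation, or enforce cite-and-verify",
    "answer_relevance": "off-topic: fix the prompt",
}


def localize(verdicts: Dict[str, str]) -> List[str]:
    """Return one diagnosis per leg that was marked FAIL, in triad order."""
    return [
        f"{leg}: {FIXES[leg]}"
        for leg in FIXES
        if verdicts.get(leg) == "FAIL"
    ]


if __name__ == "__main__":
    report = localize({
        "context_relevance": "PASS",
        "groundedness": "FAIL",
        "answer_relevance": "PASS",
    })
    print("\n".join(report))
```

The point of the sketch is only that three separate verdicts carry information a single blended score would throw away.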

The discipline (this is the actual contribution)

The triad framing is standard (TruLens, RAGAS). What's different:

Fail-closed groundedness — the judge must cite a quote; code verifies it's in the context, so a fabricated citation can't pass. Worst case is an honest DEFER.
A deterministic corroborator matched to each leg's failure mode — an embedding-similarity floor for context relevance; an answer-type gate for answer relevance (reusing the embedding trick on the answer leg would backfire — cosine rewards topical-but-evasive answers). The signal has to fit the failure.
Judges abstain instead of bluffing — sample N times; disagreement → ABSTAIN, not a fake score.
Validate the validator — --selftest runs planted failures it must catch before you trust it.
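Two of the rules above, the verified citation and abstention on disagreement, can be sketched in a few lines of Python. This is a sketch under stated assumptions and not the project's actual implementation: the function names, the verdict strings, the whitespace and case normalization, and the choice to require unanimous agreement across samples are all made up for illustration.

```python
from collections import Counter
from typing import List, Optional


def _normalize(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences don't matter."""
    return " ".join(text.split()).lower()


def verify_groundedness(judge_verdict: str, cited_quote: Optional[str], context: str) -> str:
    """Fail closed: a judge's PASS survives only if its quote really is in the context."""
    if judge_verdict != "PASS":
        return judge_verdict
    quote = _normalize(cited_quote or "")
    # A missing or fabricated quote cannot pass; the worst case is DEFER.
    if not quote or quote not in _normalize(context):
        return "DEFER"
    return "PASS"


def aggregate(samples: List[str]) -> str:
    """Return the verdict only when every sample agrees; otherwise ABSTAIN.

    Unanimity is an assumption of this sketch; a real threshold could be looser.
    """
    if not samples:
        return "ABSTAIN"
    counts = Counter(samples)
    return samples[0] if len(counts) == 1 else "ABSTAIN"


if __name__ == "__main__":
    ctx = "The Eiffel Tower is 330 metres tall."
    print(verify_groundedness("PASS", "The tower is 500 metres tall", ctx))
    print(aggregate(["PASS", "FAIL", "PASS"]))
```

The design choice worth noticing is that the string check runs in plain code, outside the model, so the judge cannot talk its way past it.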

Why calibration is the point

The property that makes downstream AI safe isn't raw capability — it's calibration. A more capable model that's confidently wrong is more dangerous than a weaker one that abstains. (I've watched a newer model generation shift on hard computational questions from confident-wrong to honestly-inconclusive — exactly the move an evaluator should reward and a naive scorer misses.) So rag-triad prizes the honest "I can't tell" over the confident guess.

Code + a one-command demo: github.com/MonongahelaHellbender/rag-triad. Runs locally on Ollama; MIT-licensed.

Top comments (3)

Aly • Jul 3

I appreciate your exploration of fail-closed groundedness in RAG evaluators. It's crucial to ensure that evaluators can identify their limitations, especially in high-stakes applications. One aspect you might consider is how to implement tamper-evident capture of evaluation results. This can enhance trust in the evaluation process by providing a verifiable audit trail. For instance, using evidence bundles with SHA-256 hashes can ensure that the evaluation data remains unchanged and can be verified offline. This is particularly useful in compliance-heavy environments. If you're interested in this approach, check out how DocImprint's MCP tool can assist in integrating such features into your RAG pipelines: docimprint.com/mcp.

Tae Kim • Jul 4

The asymmetry in corroborator design is the sharpest part of this -- using cosine for context relevance but not for answer relevance is exactly the right call, because a topical-but-evasive answer (one that mentions the right domain but deflects the question) is high-cosine and wrong, and an evaluator that rewards it defeats its own purpose. The ABSTAIN-on-disagreement design also captures something single-shot LLM judges systematically miss: evaluator confidence is itself a measurement carrying signal about whether the input is in the evaluator's distributional support. One thing worth adding to the selftest suite: a planted case where the answer is faithful to the context but the context itself is irrelevant to the question -- that exposes whether your context-relevance and groundedness legs are genuinely independent or whether passing one masks a failure in the other. For teams that need a confidence interval rather than a pass/fail, logging abstention rates by query category over time is worth doing because a rising abstention rate in a specific category is an early signal of distributional shift before your aggregate scores have moved.

Melissa D. Ellison • Jul 4

Thank you @hannune for catching the leg-independence case. The fixture was actually already in the selftest ("irrelevant context": Eiffel Tower context, boiling-point question, answer faithful to the tower text), but the assertion only pinned the context-relevance leg. Groundedness passing on that same case (the thing that would actually demonstrate the legs are independent) wasn't asserted, so a leg-coupling regression could have slipped through unnoticed.

Fixed: every leg is now pinned on that case, and the live selftest shows exactly the signature you're describing: context ABSTAIN, grounded PASS, answer IRRELEVANT.

github.com/MonongahelaHellbender/r...
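For readers who want the shape of that fix without opening the repo, pinning every leg on one case might look roughly like the Python below. The fixture layout, the leg keys, and `check_case` are hypothetical and only mirror the signature quoted above; the real selftest may be structured differently.

```python
from typing import Dict, List

# Hypothetical fixture shape; only the three expected verdicts come from the post.
IRRELEVANT_CONTEXT_CASE = {
    "name": "irrelevant context",
    "expected": {
        "context": "ABSTAIN",
        "grounded": "PASS",
        "answer": "IRRELEVANT",
    },
}


def check_case(case: dict, result: Dict[str, str]) -> List[str]:
    """Compare every expected leg against the result and return the mismatches."""
    problems = []
    for leg, want in case["expected"].items():
        # A leg missing from the result counts as a mismatch, not a silent pass.
        got = result.get(leg, "<missing>")
        if got != want:
            problems.append(f"{case['name']}: {leg} expected {want}, got {got}")
    return problems


if __name__ == "__main__":
    observed = {"context": "ABSTAIN", "grounded": "PASS", "answer": "IRRELEVANT"}
    print(check_case(IRRELEVANT_CONTEXT_CASE, observed) or "all legs pinned")
```

Because every leg is listed under `expected`, a regression that couples two legs shows up as a named mismatch.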

Taking the abstention-by-category idea too. It matches something I found from the other side in a calibration study: the most honest model never used its abstention token at all... its honesty was structural, visible only in the API stop reason. Same lesson in reverse: abstention behavior is measurement, not etiquette. Logging it per category as an early drift signal is going on the list for the production-facing evaluator.
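A minimal sketch of that per-category logging, assuming each evaluation is recorded as a (category, verdict) pair. The record shape, the category labels, and `abstention_rates` are illustrative assumptions and not part of rag-triad today.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def abstention_rates(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each query category to the fraction of its verdicts that were ABSTAIN."""
    totals = defaultdict(int)
    abstains = defaultdict(int)
    for category, verdict in records:
        totals[category] += 1
        if verdict == "ABSTAIN":
            abstains[category] += 1
    return {cat: abstains[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    log = [("math", "ABSTAIN"), ("math", "PASS"), ("geography", "PASS")]
    for category, rate in abstention_rates(log).items():
        print(f"{category}: {rate:.0%} abstained")
```

Comparing these rates across time windows is what would surface a category drifting before the aggregate score moves.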
