Fail-closed groundedness, deterministic corroborators, and a self-test — because an evaluator should be more trustworthy than the thing it grades.
The quiet flaw in "LLM-as-judge" evals
Most tools that score AI output are an LLM grading an LLM, and they report every number in the same confident voice — the verified ones and the guessed ones alike. For evaluation that's backwards. An evaluator's whole job is to be more trustworthy than the model it grades, not equally credulous.
rag-triad is a small local evaluator for retrieval-augmented answers built on one rule: lean on a deterministic check wherever one exists, and abstain — out loud — wherever one doesn't.
Localizing the failure, not just scoring it
A RAG answer fails in three different places — bad retrieval, hallucinated generation, or an off-topic reply — and each needs a different fix. A single quality score can't tell them apart. The triad can:
- context relevance ✗ → retrieval miss (fix chunking / embeddings / top-k)
- groundedness ✗ → hallucination (fix generation, or enforce cite-and-verify)
- answer relevance ✗ → off-topic (fix the prompt)
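As a rough illustration of the mapping above, the sketch below turns per-leg pass/fail results into the fix each failure points at. The names `DIAGNOSES` and `localize` are hypothetical and chosen for this example; they are not taken from rag-triad's code.

```python
from typing import Dict, List

# Leg name -> what a failure on that leg points at (wording from the list above).
DIAGNOSES: Dict[str, str] = {
    "context_relevance": "retrieval miss (fix chunking / embeddings / top-k)",
    "groundedness": "hallucination (fix generation, or enforce cite-and-verify)",
    "answer_relevance": "off-topic (fix the prompt)",
}


def localize(results: Dict[str, bool]) -> List[str]:
    """Return one diagnosis per failed leg; an empty list means all legs passed.

    A missing leg raises KeyError rather than being silently treated as a pass.
    """
    return [DIAGNOSES[leg] for leg in DIAGNOSES if not results[leg]]


if __name__ == "__main__":
    example = {"context_relevance": True, "groundedness": False, "answer_relevance": True}
    for diagnosis in localize(example):
        print(diagnosis)
```

A single aggregate score would collapse these three outcomes into one number; keeping the legs separate is what lets the failure be routed to retrieval, generation, or the prompt.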
The discipline (this is the actual contribution)
The triad framing is standard (TruLens, RAGAS). What's different:
- Fail-closed groundedness — the judge must cite a quote; code verifies it's in the context, so a fabricated citation can't pass. Worst case is an honest DEFER.
- A deterministic corroborator matched to each leg's failure mode — an embedding-similarity floor for context relevance; an answer-type gate for answer relevance (reusing the embedding trick on the answer leg would backfire — cosine rewards topical-but-evasive answers). The signal has to fit the failure.
- Judges abstain instead of bluffing — sample N times; disagreement → ABSTAIN, not a fake score.
- Validate the validator: --selftest runs planted failures it must catch before you trust it.
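The fail-closed check, the abstain-on-disagreement rule, and the self-test idea can be sketched in a few lines. This is a minimal sketch under stated assumptions, not rag-triad's implementation: the function names, the verdict strings, the whitespace and case normalization, and the requirement that samples be unanimous are all illustrative choices. The judge itself (the LLM call) is left out; only the deterministic code around it is shown.

```python
from collections import Counter
from typing import List, Optional

# Verdict labels (assumed for this sketch).
PASS, FAIL, DEFER, ABSTAIN = "PASS", "FAIL", "DEFER", "ABSTAIN"


def _norm(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences don't matter.
    return " ".join(text.lower().split())


def verify_groundedness(judge_verdict: str, quote: Optional[str], context: str) -> str:
    """Fail closed: a judge's PASS survives only if its quote is in the context."""
    if judge_verdict != PASS:
        return judge_verdict
    needle = _norm(quote or "")
    if not needle or needle not in _norm(context):
        return DEFER  # a missing or fabricated citation can never pass
    return PASS


def aggregate(verdicts: List[str]) -> str:
    """Unanimous samples keep their verdict; any disagreement becomes ABSTAIN."""
    if not verdicts:
        return ABSTAIN
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count == len(verdicts) else ABSTAIN


def selftest() -> bool:
    """Planted cases the checks must handle before their output is trusted."""
    context = "The Eiffel Tower was completed in 1889."
    planted = [
        # (actual result, expected result)
        (verify_groundedness(PASS, "completed in 1925", context), DEFER),
        (verify_groundedness(PASS, None, context), DEFER),
        (verify_groundedness(PASS, "completed in 1889", context), PASS),
        (aggregate([PASS, FAIL, PASS]), ABSTAIN),
    ]
    return all(actual == expected for actual, expected in planted)


if __name__ == "__main__":
    print("selftest ok" if selftest() else "selftest FAILED")
```

The design point is that the judge's opinion is never the last word: a PASS is downgraded to DEFER unless plain string matching confirms the citation, and split samples are reported as ABSTAIN rather than averaged into a score.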
Why calibration is the point
The property that makes downstream AI safe isn't raw capability — it's calibration. A more capable model that's confidently wrong is more dangerous than a weaker one that abstains. (I've watched a newer model generation shift on hard computational questions from confident-wrong to honestly-inconclusive — exactly the move an evaluator should reward and a naive scorer misses.) So rag-triad prizes the honest "I can't tell" over the confident guess.
Code + a one-command demo: github.com/MonongahelaHellbender/rag-triad. Runs locally on Ollama; MIT-licensed.