- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You ship a RAG bot. Recall@5 on the eval set is 0.92. You sleep well for a week. Then a finance user pings support: the bot quoted a number that does not appear in any of the documents the trace shows it pulled. You re-run the eval. Recall@5 is still 0.92. The number it made up is still made up.
Recall@K tells you whether the right document is somewhere in the top K. It does not tell you whether the model used the chunk, whether the answer is supported, or whether a paraphrase changes the result. Recall@K is silent about all three, and your users notice every one of them.
This post is about those three: faithfulness, coverage, and robustness. A 100-line Python harness at the end scores all four metrics (Recall@K plus these three) over a small golden set in one pass. No framework lock-in; the dependencies are openai, numpy, and the standard library.
Why Recall@K is necessary but not sufficient
Recall@K is a retrieval metric. It treats the system like a search engine and asks one question: did the relevant chunk land in the top K? That is a real thing to measure and you should keep measuring it. The problem is everything that happens after retrieval is invisible to it.
A retriever can score 0.95 on Recall@5 and your bot can still:
- Pull the right chunk and ignore it. The model answers from its prior, the chunk sits in the prompt unused, the user gets a hallucination wearing a citation badge.
- Pull five chunks and lean on one. The other four are in the prompt for show. Re-rank them, swap them, delete them — same answer. Your retrieval changes do nothing.
- Answer the same question two ways depending on phrasing. A user types `refund window for digital goods?` and gets one answer. A teammate types `how long do I have to refund a download?` and gets another. Same corpus. Same retriever. Different answer.
None of those failures move Recall@K. All of them move customer support volume.
The three metrics below cover the gap. They are not new; the TruLens RAG Triad puts groundedness and answer relevance next to context relevance for exactly this reason, and Ragas ships a faithfulness metric that decomposes the answer into claims and checks each one against the retrieved context. What is often missing in practice is the wiring: a single harness that runs all four together so a regression in one shows up in the same dashboard as a regression in the others.
Faithfulness: is the answer actually in the chunks
Faithfulness asks one question: for every claim in the answer, can you point to the chunk that supports it? If yes, score is 1. If half the claims are supported and half come from the model's prior, score is 0.5. If the answer is a confident paragraph about facts that appear nowhere in the retrieved context, score is 0 and your bot is hallucinating with citations.
The standard implementation has two LLM calls per answer. First, decompose the answer into atomic claims. Second, for each claim, check whether the retrieved context entails it (NLI-style: does context imply claim, contradict it, or neither). The score is the fraction of claims marked entailed.
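A minimal sketch of that two-call shape, reusing the `judge_json` helper defined in the harness at the end of this post; the prompt wording and the batching of all claims into one entailment call are assumptions, not any specific framework's implementation:

```python
# Hedged sketch of the two-call faithfulness check. Reuses judge_json from
# the harness at the end of the post; prompt wording is illustrative.
def faithfulness_two_call(answer: str, ctx: str) -> float:
    # Call 1: decompose the answer into atomic factual claims.
    claims = judge_json(
        "Break the ANSWER into atomic factual claims. "
        'Return JSON: {"claims": ["..."]}.\n\n'
        f"ANSWER:\n{answer}"
    ).get("claims", [])
    if not claims:
        return 1.0
    # Call 2: one entailment verdict per claim against the retrieved context.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    verdicts = judge_json(
        "For each numbered CLAIM, decide whether the CONTEXT entails it. "
        'Return JSON: {"entailed": [true, false, ...]} in the same order.\n\n'
        f"CONTEXT:\n{ctx}\n\nCLAIMS:\n{numbered}"
    ).get("entailed", [])
    # Score is the fraction of claims the judge marked as entailed.
    return sum(bool(v) for v in verdicts) / len(claims)
```

The harness at the end skips the decomposition step and uses the cheaper single-call variant described below.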
Ragas calls this Faithfulness. DeepEval calls it FaithfulnessMetric and frames it as the generator-side check in its coverage of the RAG triad. TruLens calls it Groundedness. They differ in prompt wording and aggregation, not in shape.
Threshold to ship: faithfulness below 0.85 on the eval set is a regression. Below 0.75 and the bot is reliably making things up.
The cheap version of this you can run today, without pulling in a framework, is to ask one judge model: list every factual claim in this answer that is not present in the retrieved context. If the list is empty, the answer is faithful. If it has items, those items are exactly what to show in the failure trace. Code in the harness below.
Coverage: are you using more than one chunk
Coverage is the metric that catches the silent failure. The retriever returns five chunks. The answer uses one. The other four are dead weight. You can know this is happening because you can ablate: drop chunks 2-5 from the prompt, regenerate, compare. If the answer is identical, the retriever did one chunk's worth of work to fetch four chunks of latency.
The reason this matters is that retrieval changes only move the needle if the model actually consumes more than one chunk. Better re-ranker, larger top-K, hybrid search, MMR diversity: none of it matters if the answer is a one-chunk answer. A team can spend a quarter on retrieval and ship zero quality improvement because the answers were always one-chunk answers.
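A minimal sketch of that ablation, reusing `generate()` and `judge_json()` from the harness at the end; comparing the two answers on "same facts" via the judge is an assumption, and in practice you would run this offline on a sample rather than per request:

```python
# Hedged ablation sketch: regenerate with only the top chunk and ask the
# judge whether the answer changed. Reuses generate() and judge_json() from
# the harness below.
def one_chunk_ablation(question: str, chunks: list[dict]) -> bool:
    full = generate(question, chunks)          # answer with all k chunks
    top_only = generate(question, chunks[:1])  # answer with chunk 1 only
    verdict = judge_json(
        "Do these two answers convey the same facts? "
        'Return JSON: {"same": true|false}.\n\n'
        f"A: {full}\n\nB: {top_only}"
    )
    # True means chunks 2..k contributed nothing the model actually used.
    return bool(verdict.get("same", False))
```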
A clean way to score this: count, per answer, how many distinct chunks the model cited at least once, divided by the number of retrieved chunks. Call it chunk_attribution_rate. A score of 0.2 on top-5 means the model leaned on one chunk in five.
This is not a metric most frameworks ship out of the box. Ragas has context_recall and context_precision, which are close cousins; DeepEval has ContextualPrecision for ranking quality. Neither is exactly the per-answer attribution rate. The harness below computes it directly.
The threshold here is a judgment call. A QA bot answering single-fact questions can sit low: most answers will be one-chunk answers because most questions are. A summarization-style assistant pulling from policy documents should sit higher; if it averages 1.2 chunks used out of 5 retrieved, the retriever is doing work the model will not consume. Set the threshold against your own baseline week, not a global default.
Robustness: does paraphrase change the answer
Robustness asks whether the same question asked five different ways produces five answers that mean the same thing. It is the metric users notice without naming. They retype a question because the first answer was unhelpful, get a different answer the second time, and lose trust in the system regardless of which answer was right.
The mechanic: take each eval question. Use a separate model (different family from the generator if you can; same family will paraphrase the same way) to produce three to five paraphrases. Run the full RAG pipeline on each paraphrase. Compare the answers pairwise: do they entail each other.
Two things make this cheap to run. First, the paraphrases can be generated once and cached against the eval set. They do not need to regenerate every CI run. Second, the entailment check is the same NLI-style judge call you already use for faithfulness, just over answer pairs instead of answer-context pairs.
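A sketch of that caching step, assuming the OpenAI `client` from the harness at the end; the cache path, prompt wording, and model name are placeholders, and ideally the paraphraser is a different model family from the generator:

```python
# One-time paraphrase generation, cached to a local JSON file keyed by qid so
# CI runs never regenerate it. PARAPHRASER and CACHE are illustrative values.
import json
import os

PARAPHRASER = "gpt-4o-mini"   # swap for a different family than GEN if you can
CACHE = "paraphrase_cache.json"

def paraphrases_for(qid: str, question: str, n: int = 4) -> list[str]:
    cache = {}
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            cache = json.load(f)
    if qid not in cache:
        r = client.chat.completions.create(
            model=PARAPHRASER,
            response_format={"type": "json_object"},
            temperature=0.7,  # some variety is the point here
            messages=[{"role": "user", "content": (
                f"Rewrite this question {n} different ways, preserving its "
                'meaning. Return JSON: {"paraphrases": ["..."]}.\n\n'
                f"{question}")}],
        )
        cache[qid] = json.loads(r.choices[0].message.content).get(
            "paraphrases", [])
        with open(CACHE, "w") as f:
            json.dump(cache, f, indent=2)
    return cache[qid]
```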
Two failure shapes show up here:
- Brittle retrieval. Paraphrases retrieve different chunks. Different chunks produce different answers. The fix is upstream: query expansion, hybrid search, a stronger re-ranker.
- Brittle generation. Paraphrases retrieve the same chunks but the model answers them differently. The fix is in the prompt: tighter system instructions, a clearer answer schema, fewer degrees of freedom.
The harness below computes robustness as the average pairwise entailment rate across paraphrases of the same question. Below 0.8 and the bot is contradicting itself for non-pathological inputs.
The harness
Below is a single Python file that scores all four metrics over a golden set of (question, gold_chunk_ids, paraphrases) rows. Output is per-row metrics plus the aggregate. The judge model and embed model are pinned at the top.
import json
import re
from dataclasses import dataclass
from typing import Iterable

import numpy as np
from openai import OpenAI

client = OpenAI()

# Pinned models: judge and generator share a model here; split them if you
# want an independent judge.
JUDGE = "gpt-4o-mini"
GEN = "gpt-4o-mini"
EMBED = "text-embedding-3-large"


@dataclass
class Row:
    qid: str
    question: str
    paraphrases: list[str]
    gold_chunk_ids: list[str]


def embed(texts: list[str]) -> np.ndarray:
    out = client.embeddings.create(input=texts, model=EMBED)
    return np.array([d.embedding for d in out.data])


def retrieve(corpus, q_emb, k=5):
    # Cosine similarity against the pre-embedded corpus, top-k by score.
    sims = corpus["embeds"] @ q_emb / (
        np.linalg.norm(corpus["embeds"], axis=1)
        * np.linalg.norm(q_emb)
    )
    idx = np.argsort(-sims)[:k]
    return [
        {"id": corpus["ids"][i], "text": corpus["texts"][i]}
        for i in idx
    ]


def generate(question: str, chunks: list[dict]) -> str:
    ctx = "\n\n".join(
        f"[{c['id']}] {c['text']}" for c in chunks
    )
    msg = [
        {"role": "system",
         "content": "Answer only from the context. "
                    "Cite chunk ids in [brackets]."},
        {"role": "user",
         "content": f"Context:\n{ctx}\n\nQ: {question}"},
    ]
    r = client.chat.completions.create(
        model=GEN, messages=msg, temperature=0)
    return r.choices[0].message.content


def judge_json(prompt: str) -> dict:
    # Single judge call, temperature 0, JSON-only output.
    r = client.chat.completions.create(
        model=JUDGE,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(r.choices[0].message.content)


def recall_at_k(retrieved, gold_ids):
    hits = {c["id"] for c in retrieved} & set(gold_ids)
    return len(hits) / max(len(gold_ids), 1)


def faithfulness(answer, chunks):
    # Single-call version: ask the judge to list unsupported claims.
    ctx = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    p = (
        "List the factual claims in the ANSWER that are NOT "
        "supported by the CONTEXT. Return JSON: "
        '{"unsupported": [...], "total": N} where N is the total '
        "number of factual claims in the ANSWER.\n\n"
        f"CONTEXT:\n{ctx}\n\nANSWER:\n{answer}"
    )
    j = judge_json(p)
    total = max(j.get("total", 0), 1)
    return 1.0 - len(j.get("unsupported", [])) / total


def coverage(answer, chunks):
    # Citation tags as a proxy for attribution: which chunk ids the answer cites.
    ids = [c["id"] for c in chunks]
    used = sum(
        1 for cid in ids
        if re.search(rf"\[{re.escape(cid)}\]", answer)
    )
    return used / max(len(ids), 1)


def robustness(answers: list[str]) -> float:
    # Pairwise agreement across the original answer and its paraphrase answers.
    if len(answers) < 2:
        return 1.0
    pairs = []
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            p = (
                "Do these two answers convey the same facts? "
                'Return JSON: {"same": true|false}.\n\n'
                f"A: {answers[i]}\n\nB: {answers[j]}"
            )
            pairs.append(judge_json(p).get("same", False))
    return sum(pairs) / len(pairs)


def score_row(row: Row, corpus) -> dict:
    q_emb = embed([row.question])[0]
    chunks = retrieve(corpus, q_emb, k=5)
    answer = generate(row.question, chunks)
    # Re-run the full pipeline for every paraphrase, retrieval included.
    para_answers = [answer]
    for p in row.paraphrases:
        p_emb = embed([p])[0]
        p_chunks = retrieve(corpus, p_emb, k=5)
        para_answers.append(generate(p, p_chunks))
    return {
        "qid": row.qid,
        "recall@5": recall_at_k(chunks, row.gold_chunk_ids),
        "faithfulness": faithfulness(answer, chunks),
        "coverage": coverage(answer, chunks),
        "robustness": robustness(para_answers),
    }


def run(rows: Iterable[Row], corpus) -> dict:
    out = [score_row(r, corpus) for r in rows]
    keys = ["recall@5", "faithfulness", "coverage", "robustness"]
    agg = {k: float(np.mean([r[k] for r in out])) for k in keys}
    return {"per_row": out, "aggregate": agg}
A few things worth pointing at. The judge runs at temperature=0 and asks for JSON; that is the cheapest way to keep the metric stable across runs. Coverage uses citation tags as a proxy for chunk attribution. It works if your prompt requires citations, which it should. Robustness compares answers pairwise; on five paraphrases plus the original that is fifteen judge calls per question, so do not run it on every CI pass. Run it weekly, or on a small slice nightly.
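For completeness, a minimal invocation of the harness over a toy corpus; the document texts, ids, and the golden row are made up for illustration, and the corpus dict layout is the one `retrieve()` expects:

```python
# Toy end-to-end run; ids, texts, and the golden row are illustrative only.
texts = [
    "Digital goods can be refunded within 14 days of purchase.",
    "Refunds for physical items require the original packaging.",
]
ids = ["policy-refund-digital", "policy-refund-physical"]
corpus = {"ids": ids, "texts": texts, "embeds": embed(texts)}

rows = [
    Row(
        qid="q1",
        question="What is the refund window for digital goods?",
        paraphrases=["How long do I have to refund a download?"],
        gold_chunk_ids=["policy-refund-digital"],
    ),
]

report = run(rows, corpus)
print(json.dumps(report["aggregate"], indent=2))
```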
What to do with the four numbers
Read them as a 2×2: Recall@K on the retrieval axis, the other three (faithfulness, coverage, robustness) on the generation-and-stability axis. Most teams instrument the first and ignore the second.
- Recall@K drops, faithfulness stable: retriever regression. Look at the index, the embedder version, the chunking change someone shipped on Tuesday.
- Recall@K stable, faithfulness drops: generator regression. Look at the model version, the prompt, the system message, the temperature someone bumped while debugging something else.
- Coverage at 0.2 and stable: your retrieval changes will not move quality. Stop tuning the retriever, look at the prompt or the chunk size.
- Robustness drops, others stable: brittle generation. Tighten the system prompt or add an answer schema.
The four numbers tell you which slice of human review to do this week.
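If you want that triage as code, here is a hedged sketch that compares this run's aggregate against a stored baseline; the 0.05 delta, the 0.25 coverage floor, and the note wording are assumptions to replace with your own baseline week:

```python
# Hedged sketch: compare this run's aggregate against a stored baseline and
# name the failure shape. Thresholds and wording are assumptions.
def diagnose(agg: dict, baseline: dict, delta: float = 0.05) -> list[str]:
    drop = {k: baseline[k] - agg[k] > delta for k in baseline}
    notes = []
    if drop["recall@5"] and not drop["faithfulness"]:
        notes.append("retriever regression: check index, embedder, chunking")
    if drop["faithfulness"] and not drop["recall@5"]:
        notes.append("generator regression: check model, prompt, temperature")
    if agg["coverage"] < 0.25:
        notes.append("one-chunk answers: retrieval tuning will not move quality")
    if drop["robustness"] and not (drop["recall@5"] or drop["faithfulness"]):
        notes.append("brittle generation: tighten the prompt or answer schema")
    return notes
```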
If your dashboard shows green Recall@K and your support queue shows red, the gap is in the three metrics it does not see.
If this was useful
The eval and observability chapters of LLM Observability Pocket Guide cover the production version of the harness above: judge-model selection, cross-judge sanity checks, slice-by-slice breakdowns, and how to wire the four metrics into OpenTelemetry GenAI spans so a regression shows up in the same dashboard as your latency and cost numbers. The harness above is the floor; the book covers what production looks like when the four numbers have to survive a real release cadence.
