Dave Graham

Posted on May 24 • Originally published at benchwright.polsia.app

How to Evaluate Your RAG Pipeline

#ai #rag #llm #machinelearning

RAG has two places to fail: retrieval and generation. Most teams only catch one. Here's the complete evaluation framework.

Your RAG-powered feature returns a confident, well-formatted answer. The problem: it's wrong. Not hallucinated in an obvious way — it cites a real document, uses correct terminology, and sounds authoritative. But the document it retrieved was from six months ago, before the policy changed, and the answer is no longer valid.

This is the failure mode that makes RAG evaluation hard. Unlike a pure LLM where you're testing one system, RAG is a pipeline: a retriever that finds relevant context, and a generator that synthesizes an answer from that context. Each component can fail independently. Evaluating only the final answer misses half the problem.

This post covers the complete RAG evaluation framework: how to evaluate retrieval and generation separately, the three core metrics (context precision, faithfulness, answer relevance), the hyperparameters worth sweeping, and how to detect the silent failures that only appear in production.

Why RAG Requires Separate Evaluation of Retrieval and Generation

When a RAG system fails, the root cause is almost always one of two things: the retriever returned the wrong chunks, or the LLM generated an answer that wasn't supported by the chunks it received. These require different fixes, so you need to know which failure you're dealing with.

Consider the failure modes:

Retrieval failure: The relevant document exists in your corpus, but the retriever doesn't surface it. The LLM never sees the right context, so even a perfect generator can't produce the right answer.
Generation failure: The retriever returns excellent context, but the LLM ignores it or contradicts it — synthesizing an answer from prior training weights instead of the retrieved chunks.
Compound failure: The retriever returns only partially relevant context, and the LLM extrapolates beyond what that context supports.

Evaluating only the final answer tells you that something went wrong. It doesn't tell you where to fix it. If your retrieval is broken, optimizing your prompt won't help. If your generation is hallucinating, upgrading your embedding model won't help.

The silent failure pattern: Teams often report their RAG system "works fine" because end-to-end answer quality is acceptable on their test set. But their test set only covers queries where retrieval is easy. On long-tail queries — niche topics, recent documents, ambiguous phrasing — retrieval silently degrades, and the LLM fills the gap with plausible-sounding fabrications.

The RAG Triad: The Core Evaluation Framework

The RAG Triad is a widely used framework for RAG evaluation. It measures three relationships: between the query and retrieved context, between the context and the answer, and between the query and the answer.

Metric	What It Measures	Failure Signal
Context Relevance	Are the retrieved chunks relevant to the query?	Retriever returning noise
Faithfulness	Is the answer grounded in the retrieved context?	LLM hallucinating beyond context
Answer Relevance	Does the answer actually address the query?	Correct but non-responsive answers

All three metrics need to be high simultaneously. A high faithfulness score with low context relevance means the LLM faithfully reproduced irrelevant content. High answer relevance with low faithfulness means the LLM addressed the question, but with claims the retrieved context does not support.
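The combinations above can be turned into a small triage helper. This is a minimal sketch, not part of any particular library: the function name, the labels, and the 0.7 threshold are illustrative assumptions, and it expects all three scores on a 0 to 1 scale.

```python
def diagnose_triad(context_relevance, faithfulness, answer_relevance, threshold=0.7):
    """Map RAG Triad scores (each in [0, 1]) to the component most likely at fault.

    The 0.7 threshold is an assumed default; calibrate it on your own eval set.
    """
    low_context = context_relevance < threshold
    low_faithfulness = faithfulness < threshold
    low_answer = answer_relevance < threshold

    if low_context and low_faithfulness:
        return "compound"                   # weak context and unsupported claims
    if low_context:
        return "retrieval"                  # retriever is returning noise
    if low_faithfulness:
        return "generation:unsupported"     # answer goes beyond the context
    if low_answer:
        return "generation:non_responsive"  # grounded, but misses the question
    return "ok"
```

Checking retrieval first reflects the point made earlier: if the context is wrong, generation-side fixes will not help.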

Retrieval Quality Metrics

Context Precision

Context precision measures what fraction of your retrieved chunks are actually relevant to the query. If you retrieve 5 chunks and only 2 are relevant, your precision is 0.4. Low precision means you're feeding your LLM noisy context — irrelevant information that increases the chance of confused or distracted generation.

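// Fraction of retrieved chunks whose id appears in the known-relevant set.
function contextPrecision(retrievedChunks, relevantChunkIds) {
    if (retrievedChunks.length === 0) return 0; // avoid 0/0 when nothing was retrieved
    const relevantSet = new Set(relevantChunkIds);
    const relevant = retrievedChunks.filter(chunk => relevantSet.has(chunk.id));
    return relevant.length / retrievedChunks.length;
}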

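// Uses an LLM evaluator to label each chunk, then computes precision from those labels.
async function scoreRetrieval(query, retrievedChunks, evaluatorLLM) {
    // One YES/NO relevance judgment per chunk, issued in parallel.
    const relevanceJudgments = await Promise.all(
        retrievedChunks.map(chunk =>
            evaluatorLLM.judge({
                query,
                chunk: chunk.text,
                task: 'Is this chunk relevant to answering the query? Answer YES or NO.'
            })
        )
    );
    // Normalize the raw reply so "yes", "Yes." and " YES" all count as relevant.
    const isYes = reply => String(reply).trim().toUpperCase().startsWith('YES');
    const relevantChunkIds = retrievedChunks
        .filter((_, i) => isYes(relevanceJudgments[i]))
        .map(c => c.id);
    const total = retrievedChunks.length;
    return {
        // Guard against 0/0 when the retriever returned nothing.
        precision: total === 0 ? 0 : relevantChunkIds.length / total,
        relevant_count: relevantChunkIds.length,
        total_retrieved: total
    };
}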

Context Recall

Context recall is the counterpart to precision: of all the relevant chunks that exist in your corpus, how many did your retriever actually find? High precision with low recall means you're retrieving a clean set of relevant chunks, but missing others that could have improved the answer. You need both.

Recall requires knowing the ground truth — which chunks in your corpus are relevant for a given query. For evaluation sets, you build this ground truth manually (or with LLM-assisted annotation) on a representative sample of queries, then measure how often your retriever surfaces those chunks.
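Once the ground truth exists, the recall computation itself is short. The sketch below assumes chunks are identified by hashable ids and that each eval query comes with a hand-built list of relevant ids; the function name is illustrative.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of ground-truth relevant chunks that the retriever surfaced.

    Returns None when the query has no annotated relevant chunks,
    because recall is undefined in that case.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    found = relevant & set(retrieved_ids)
    return len(found) / len(relevant)
```

For example, if the annotated relevant chunks are `["a", "d"]` and the retriever returns `["a", "b", "c"]`, recall is 0.5: one of the two relevant chunks was found.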

Entity Recall

Entity recall is a cheaper proxy when full ground-truth annotation isn't feasible. Extract the named entities from the correct answer, then check what fraction of those entities appear in the retrieved context.

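// Assumes extractEntities(text) is defined elsewhere (for example, an NER call)
// and returns an array of entity strings.
function entityRecall(answerText, retrievedChunksText) {
    const answerEntities = extractEntities(answerText);
    // Case-insensitive substring match against the concatenated context.
    const contextText = retrievedChunksText.join(' ').toLowerCase();
    const foundEntities = answerEntities.filter(entity =>
        contextText.includes(entity.toLowerCase())
    );
    return {
        // Recall is undefined when the answer contains no entities.
        recall: answerEntities.length === 0 ? null : foundEntities.length / answerEntities.length,
        found: foundEntities,
        missing: answerEntities.filter(e => !foundEntities.includes(e))
    };
}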

Retrieval evaluation shortcut: If you have ground-truth question-answer pairs, you can score retrieval without human chunk annotation. Use the gold answer to check whether the retrieved context contains the information needed to derive that answer — an LLM evaluator can judge this cheaply at scale.
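A minimal sketch of that shortcut follows. It assumes a hypothetical `judge` callable (prompt string in, raw model text out) and a hypothetical `retrieve` callable that returns a list of chunk texts for a question; neither is a real library API, and the prompt wording is only a starting point.

```python
from typing import Callable, Sequence


def context_supports_answer(
    question: str,
    gold_answer: str,
    chunks: Sequence[str],
    judge: Callable[[str], str],
) -> bool:
    """Ask an LLM judge whether the chunks contain what the gold answer needs.

    `judge` is a hypothetical callable wrapping your evaluator model.
    """
    context = "\n---\n".join(chunks)
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Context:\n{context}\n\n"
        "Does the context contain the information needed to derive the "
        "reference answer? Answer YES or NO."
    )
    # Tolerate casing and trailing punctuation in the model's reply.
    return judge(prompt).strip().upper().startswith("YES")


def retrieval_hit_rate(examples, retrieve, judge) -> float:
    """Share of (question, gold_answer) pairs whose retrieved context suffices."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(
        context_supports_answer(q, gold, retrieve(q), judge) for q, gold in examples
    )
    return hits / len(examples)
```

The hit rate is a coarser signal than chunk-level recall, but it needs only question-answer pairs, which most teams already have.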

Generation Quality Metrics

Faithfulness (Groundedness)

Faithfulness is the most critical generation metric. It measures whether every claim in the generated answer is supported by the retrieved context — not by the LLM's training data, not by plausible inference, but directly supported by what was retrieved.

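import json

def score_faithfulness(answer, retrieved_context, evaluator_llm):
    # Step 1: split the answer into atomic, individually checkable claims.
    claims_response = evaluator_llm.complete(f"""
    Break the following answer into individual factual claims.
    Return a JSON list of strings, each a single verifiable claim.
    Answer: {answer}
    """)
    claims = json.loads(claims_response)  # assumes the evaluator returns valid JSON
    if not claims:
        # No factual claims means nothing is unsupported; keep the return shape consistent.
        return {
            "faithfulness": 1.0,
            "claims_total": 0,
            "claims_supported": 0,
            "unsupported_claims": []
        }
    # Step 2: check each claim against the retrieved context only.
    verdicts = []
    for claim in claims:
        verdict = evaluator_llm.complete(f"""
        Context: {retrieved_context}
        Claim: {claim}
        Is this claim directly supported by the context? Answer YES or NO only.
        """)
        # Normalize so "yes" or "YES." are still counted as supported.
        verdicts.append(verdict.strip().upper().startswith("YES"))
    faithful_count = sum(verdicts)
    return {
        "faithfulness": faithful_count / len(claims),
        "claims_total": len(claims),
        "claims_supported": faithful_count,
        "unsupported_claims": [c for c, v in zip(claims, verdicts) if not v]
    }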

Hallucination Detection

Hallucination is the inverse of faithfulness: it measures what fraction of the answer's claims are not supported by the retrieved context. Pay special attention to precise factual claims: numbers, dates, names, percentages, and direct quotes. LLMs tend to hallucinate specific details more often than general concepts, so an aggregate score can hide the problem. A pipeline that scores 95% faithfulness overall might still be fabricating specific figures 20% of the time.
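One way to surface this gap is to score the specific claims separately from the rest. The sketch below assumes you already have a list of claims and a parallel list of supported/unsupported verdicts from a faithfulness check. The regex is a deliberately crude assumption: it only catches claims containing digits, percent signs, or double quotes, so names and spelled-out dates would need an entity extractor instead.

```python
import re

# Crude heuristic for "specific" claims: any digit, percent sign, or double quote.
SPECIFIC_DETAIL = re.compile(r'\d|%|"')


def specific_claim_hallucination_rate(claims, verdicts):
    """Unsupported share among claims that contain specific details.

    claims:   list of claim strings
    verdicts: parallel list of bools, True meaning "supported by context"
    Returns None when no claim matches the heuristic.
    """
    specific = [(c, v) for c, v in zip(claims, verdicts) if SPECIFIC_DETAIL.search(c)]
    if not specific:
        return None
    unsupported = sum(1 for _, supported in specific if not supported)
    return unsupported / len(specific)
```

Tracking this number next to overall faithfulness makes it visible when a healthy aggregate is masking fabricated figures.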

Answer Relevance

Answer relevance is deceptively tricky. An answer can be faithful to the context and still completely miss the question. This happens most often when the retrieved context is technically related but not directly responsive.

The evaluation approach: ask an LLM evaluator to generate the question that the answer is most directly responding to, then measure semantic similarity between that generated question and the original query.
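A minimal sketch of that approach, assuming two hypothetical callables you would supply yourself: `generate_question(answer)` wraps the LLM evaluator and returns the question it thinks the answer responds to, and `embed(text)` returns an embedding vector as a list of floats. Cosine similarity is used here as the similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(query, answer, generate_question, embed):
    """Similarity between the original query and the question inferred from the answer.

    generate_question and embed are hypothetical callables, not a library API.
    """
    inferred_question = generate_question(answer)
    return cosine_similarity(embed(query), embed(inferred_question))
```

Averaging over several generated questions per answer is a common way to reduce the variance of a single generation, at the cost of more evaluator calls.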

The completeness trap: An answer can score high on faithfulness and answer relevance but still be dangerously incomplete. If the retrieved context only contains half the relevant information and the LLM faithfully summarizes that half, you get a confident, grounded, relevant — but incomplete — answer. This is why context recall matters: incompleteness starts at retrieval.

Hyperparameter Sweeps: What to Tune and How to Measure It

Chunk Size

At a fixed top-K, smaller chunks tend to improve precision but hurt recall. Larger chunks tend to improve recall but lower precision and add noise to the LLM's context. The optimal chunk size varies by document type.

Sweep strategy: fix top-K, vary chunk size from 256 to 2048 tokens in steps, measure context precision and recall on your eval set. Pick the chunk size at the knee of the precision-recall curve.
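The sweep can be sketched as below. `evaluate` is a hypothetical callable that re-chunks and re-indexes the corpus at the given size, runs the eval set at a fixed top-K, and returns mean `(precision, recall)`. Note the substitution: instead of locating the knee of the curve, this sketch picks the size with the best F1 score, which is a simpler stand-in and may not land on the same point.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def sweep_chunk_sizes(evaluate, chunk_sizes=(256, 512, 1024, 2048)):
    """Score each chunk size and return (all rows, best row by F1).

    evaluate(chunk_size) -> (precision, recall) is assumed to rebuild the
    index at that chunk size and score it with top-K held constant.
    """
    rows = []
    for size in chunk_sizes:
        precision, recall = evaluate(size)
        rows.append({
            "chunk_size": size,
            "precision": precision,
            "recall": recall,
            "f1": f1(precision, recall),
        })
    best = max(rows, key=lambda row: row["f1"])
    return rows, best
```

Keeping the full `rows` list lets you plot precision against recall and confirm by eye that the F1 pick sits near the knee.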

Top-K Retrieval

Retrieving more chunks increases recall but decreases precision and can overwhelm the LLM context window. There's also a position bias effect in many LLMs — information near the beginning and end of the context is more likely to be used than information in the middle.

// Arithmetic mean of a numeric array (0 for an empty array).
const mean = xs => (xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0);

// Assumes generateAnswer(query, chunks) is your pipeline's generation step.
async function sweepTopK(evalQueries, vectorStore, evaluator) {
    const kValues = [1, 3, 5, 10, 15, 20];
    const results = [];
    for (const k of kValues) {
        // Score every eval query at this K in parallel.
        const queryScores = await Promise.all(
            evalQueries.map(async ({ query }) => {
                const chunks = await vectorStore.retrieve(query, { topK: k });
                const answer = await generateAnswer(query, chunks);
                return {
                    contextPrecision: await evaluator.precision(query, chunks),
                    faithfulness: await evaluator.faithfulness(answer, chunks),
                    answerRelevance: await evaluator.relevance(query, answer)
                };
            })
        );
        results.push({
            k,
            // Unweighted average of the three triad metrics across all queries.
            composite: mean(queryScores.map(s =>
                (s.contextPrecision + s.faithfulness + s.answerRelevance) / 3
            ))
        });
    }
    return results;
}

Embedding Model

The embedding model determines how well semantic similarity maps to actual relevance for your domain. When evaluating embedding models, hold everything else constant and vary only the embedding model. Measure context recall and precision — you want to isolate the retrieval contribution.
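A minimal comparison harness might look like the following. It assumes each candidate embedding model has been wrapped in a hypothetical `retrieve(query)` callable that returns a list of unique chunk ids, with the same corpus, chunking, and top-K behind every one of them, and that the eval set carries annotated relevant ids per query.

```python
def compare_embedders(retrievers, eval_set):
    """Mean context precision and recall per embedding model.

    retrievers: {model_name: retrieve(query) -> list of unique chunk ids}
    eval_set:   list of (query, relevant_chunk_ids) pairs
    Only the embedding model should differ between retrievers.
    """
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    report = {}
    for name, retrieve in retrievers.items():
        precisions, recalls = [], []
        for query, relevant_ids in eval_set:
            retrieved = retrieve(query)
            relevant = set(relevant_ids)
            hits = len(relevant & set(retrieved))
            precisions.append(hits / len(retrieved) if retrieved else 0.0)
            recalls.append(hits / len(relevant) if relevant else 0.0)
        report[name] = {
            "precision": sum(precisions) / len(eval_set),
            "recall": sum(recalls) / len(eval_set),
        }
    return report
```

Generation metrics are left out on purpose, so a difference between models can be attributed to retrieval alone.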

Production Monitoring: Silent RAG Failures

Lab evaluation covers the queries you anticipated. Production covers everything else. Three signals worth monitoring:

Average context relevance score: If this drops, your corpus has likely grown stale or your query distribution has shifted.
Faithfulness rate: An uptick in low-faithfulness answers signals the LLM is increasingly ignoring context — often triggered by context window saturation or corpus contamination.
No-retrieval rate: Queries that return zero relevant chunks above your confidence threshold may receive fabricated responses instead of honest "I don't know" answers.

class RAGMonitor {
    constructor(evaluator, alertThresholds) {
        this.evaluator = evaluator;        // exposes precision(), faithfulness(), relevance()
        this.thresholds = alertThresholds; // shape: { faithfulness, contextRelevance }
        this.window = [];                  // rolling buffer of sampled scores
    }
    async logQuery(query, retrievedChunks, answer) {
        // Score only a 5% sample of traffic to keep evaluator cost bounded.
        const shouldEval = Math.random() < 0.05;
        if (!shouldEval) return;
        const scores = {
            timestamp: Date.now(),
            context_relevance: await this.evaluator.precision(query, retrievedChunks),
            faithfulness: await this.evaluator.faithfulness(answer, retrievedChunks),
            answer_relevance: await this.evaluator.relevance(query, answer)
        };
        this.window.push(scores);
        if (this.window.length > 500) this.window.shift(); // cap the buffer at 500 samples
        this.checkAlerts();
        return scores;
    }
    checkAlerts() {
        const recent = this.window.slice(-50);
        // Average one metric over the 50 most recent sampled queries.
        const avg = key => recent.reduce((sum, s) => sum + s[key], 0) / recent.length;
        const avgFaithfulness = avg('faithfulness');
        const avgRelevance = avg('context_relevance');
        if (avgFaithfulness < this.thresholds.faithfulness) {
            this.alert('FAITHFULNESS_DEGRADATION', { avg: avgFaithfulness });
        }
        if (avgRelevance < this.thresholds.contextRelevance) {
            this.alert('RETRIEVAL_DEGRADATION', { avg: avgRelevance });
        }
    }
    // Placeholder sink: swap in your own paging or logging integration.
    alert(type, details) {
        console.warn(`[RAGMonitor] ${type}`, details);
    }
}
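The monitor above alerts on the first two signals. The third, no-retrieval rate, can be computed from logged similarity scores without any LLM calls. This is a minimal Python sketch that assumes you log the retriever's similarity scores for each query; the 0.75 cutoff is an illustrative assumption and depends entirely on your embedding model and distance metric.

```python
def no_retrieval_rate(query_logs, min_score=0.75):
    """Share of queries with no retrieved chunk at or above the confidence threshold.

    query_logs: one list of similarity scores per query (empty list = nothing retrieved)
    min_score:  assumed cutoff; tune it to your retriever's score distribution
    Returns None when there are no logged queries.
    """
    if not query_logs:
        return None
    empty = sum(
        1 for scores in query_logs
        if not any(score >= min_score for score in scores)
    )
    return empty / len(query_logs)
```

Queries counted here are the ones where the generator should decline to answer, so a rising rate is worth pairing with a check that it actually does.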

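The corpus drift problem: Your corpus changes even when your code doesn't, so retrieval quality can degrade without any deployment. Re-run your evaluation suite monthly, and treat a 5% drop in context recall as a deployment-blocking regression, not a minor inconvenience.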
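A regression gate for this can be a few lines in CI. The sketch below makes one assumption explicit: it reads the 5% as a relative drop against the stored baseline rather than five absolute percentage points. The function name is illustrative.

```python
def recall_regressed(baseline, current, max_relative_drop=0.05):
    """True when context recall fell by at least max_relative_drop versus baseline.

    Assumes the 5% figure is relative to the baseline (0.80 -> 0.76), not absolute.
    A non-positive baseline cannot regress, so it returns False.
    """
    if baseline <= 0:
        return False
    return (baseline - current) / baseline >= max_relative_drop
```

For example, a baseline recall of 0.80 followed by a monthly run at 0.70 is a 12.5% relative drop and would block the deployment, while 0.79 would pass.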

How Benchwright Handles RAG Evaluation

The evaluation framework above requires infrastructure: an evaluation dataset, an LLM evaluator, a metrics pipeline, and a monitoring layer. Building this from scratch adds weeks of engineering time that isn't core to your product.

Benchwright provides this infrastructure out of the box. Connect your RAG pipeline — any vector store, any LLM generator — define your evaluation dataset, and Benchwright runs the full RAG Triad evaluation on a schedule. You get context precision, context recall, faithfulness, and answer relevance scores tracked over time, with regression alerts when any metric drops below your threshold.

When you change your chunk size, swap embedding models, or update your system prompt, Benchwright automatically re-evaluates and flags regressions before they reach users. No spreadsheets, no manual scoring runs, no finding out from support tickets.


→ Evaluate your RAG pipeline on Benchwright — free, no credit card required
