Anna Danilec

Posted on May 18 • Edited on May 23 • Originally published at invra.co

RAG Evaluation with RAGAS: Measuring Faithfulness, Context Precision, and Recall in Production

#ai #rag #llm #agents

Key takeaways:

RAGAS gives you four core metrics that split RAG failures into retrieval vs. generation problems

Faithfulness catches hallucinations; Context Recall catches retrieval gaps

Most metrics require no human-labeled data

Treat RAGAS like unit tests: run it in CI every time you change your pipeline

You've shipped a RAG-based product. Your engineers say it "seems to work well." Your users occasionally complain it gives wrong answers. You have no idea which part is broken: the retrieval, the generation, or both.

This is the state of most RAG deployments today. And it's a problem you can solve with a proper evaluation framework.

Let's talk about RAGAS.

The Problem With "It Seems Fine"

Building a RAG system is the easy part. Dozens of tutorials get you from zero to a working demo in an afternoon. But production RAG is a different beast. You're dealing with:

Retrieval failures - the system pulls irrelevant chunks from your vector store
Hallucinations - the LLM generates facts not present in the retrieved documents
Incomplete coverage - the retrieval misses key information needed to answer the question
Irrelevant answers - the response doesn't actually address what the user asked

Traditional NLP metrics like BLEU and ROUGE won't catch any of these. They measure surface-level text similarity to a reference answer, which is useful for machine translation but not for knowledge-grounded generation. They completely ignore whether the LLM is actually using the retrieved context or just making things up.

You need metrics designed specifically for the RAG pipeline.

What is RAGAS?

RAGAS (Retrieval Augmented Generation Assessment) is an open-source Python framework for evaluating RAG systems. It was introduced by Shahul Es, Jithin James, and collaborators in a paper published in late 2023 and presented at EACL 2024.

The key design decision that makes RAGAS practical: most of its metrics require no human-labeled ground truth. It uses LLMs as judges: the same type of model you're evaluating also does the evaluating. Yes, this is a meta-game, but it works surprisingly well in practice.

RAGAS processes over 5 million evaluations monthly for companies including AWS, Microsoft, Databricks, and Moody's. It has 4,000+ GitHub stars and is backed by a Y Combinator company. It's become the de facto standard for RAG evaluation.

The Two-Axis Mental Model

Before diving into specific metrics, understand the RAG pipeline as having two distinct components, each with its own failure modes:

Retriever failures: Wrong chunks, missing chunks, poorly ranked chunks
Generator failures: Hallucination, ignoring context, irrelevant response

RAGAS gives you metrics for both axes. If you only measure end-to-end output quality, you can't tell which half is broken.

The Core Four Metrics

1. Faithfulness: Does the answer stay true to the retrieved context?

What it catches: Hallucinations from the generator

How it works:
RAGAS extracts individual statements from the generated answer, then asks an LLM judge whether each statement can be logically inferred from the retrieved context. The score is the fraction of statements that can be supported.

Faithfulness = Supported Statements / Total Statements in Answer

Score range: 0 to 1 (higher is better)

Example:
Question: "What is our refund policy?"
Context: "Refunds are available within 30 days of purchase."
Answer: "Refunds are available within 30 days. We also offer exchanges for 60 days."

The second sentence isn't in the context → Faithfulness < 1
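To make the arithmetic concrete, here is a minimal sketch of the scoring step only. It assumes the LLM judge has already split the answer into two statements and returned a verdict for each; the function name and the verdicts are illustrative, not part of the RAGAS API.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer statements the judge marked as supported by the context."""
    if not verdicts:
        raise ValueError("the answer produced no statements to verify")
    return sum(verdicts) / len(verdicts)


# Hypothetical judge verdicts for the refund example above.
statements = {
    "Refunds are available within 30 days.": True,   # stated in the context
    "We also offer exchanges for 60 days.": False,   # not in the context
}

score = faithfulness_score(list(statements.values()))
print(score)  # 0.5
```

Under that assumed split, one of two statements is supported, so the answer scores 0.5.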

What low Faithfulness scores tell you, and how to fix it:

1. Tighten your system prompt: The most immediate lever. Add explicit grounding instructions:

"Answer only using the information provided in the context below. If the context does not contain enough information to answer, say so explicitly."

"Do not use any prior knowledge. Every claim in your answer must be traceable to the context."

Negative framing helps too: "Do not speculate. Do not add information not present in the provided documents."

2. Lower the temperature: A higher temperature makes output more varied and more likely to drift from the context. For factual RAG tasks, set temperature to 0 or close to it. There's rarely a good reason to have randomness in a document Q&A system.

3. Switch or downgrade your model: Counter-intuitively, more capable models sometimes hallucinate more confidently. A model like GPT-4o has seen so much training data that it may "helpfully" fill gaps from its parametric memory rather than admitting the context is insufficient. Sometimes a smaller, instruction-tuned model with a strict prompt outperforms a frontier model on Faithfulness specifically.
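The first two fixes can be sketched as a small prompt builder. This is one possible layout, not a prescribed template: the function name, the chunk numbering, and the settings dictionary are assumptions, and the exact parameter name for temperature depends on your LLM provider.

```python
# Grounding instructions taken from the suggestions above.
GROUNDING_RULES = (
    "Answer only using the information provided in the context below. "
    "If the context does not contain enough information to answer, say so explicitly. "
    "Do not speculate. Do not add information not present in the provided documents."
)


def build_grounded_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a prompt that puts the grounding rules ahead of the retrieved chunks."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(contexts, start=1))
    return f"{GROUNDING_RULES}\n\nContext:\n{numbered}\n\nQuestion: {question}\nAnswer:"


prompt = build_grounded_prompt(
    "What is our refund policy?",
    ["Refunds are available within 30 days of purchase."],
)

# Hypothetical generation settings; pass the equivalent option to your LLM client.
generation_settings = {"temperature": 0.0}

print(prompt)
```

Numbering the chunks is optional, but it makes it easier to ask the model to cite which chunk supports each claim.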

2. Answer Relevancy: Does the answer actually address the question?

What it catches: Verbose, off-topic, or evasive answers from the generator

How it works:
An LLM generates several hypothetical questions that the given answer could plausibly be responding to. RAGAS then computes the cosine similarity between the embeddings of those generated questions and the embedding of the original question. High similarity means the answer is directly addressing what was asked.

Answer Relevancy = avg(cosine_similarity(generated_questions, original_question))

Score range: 0 to 1 in practice (higher is better)

Example of a low-relevancy response:
Question: "When does the system auto-scale?"
Answer: "Our platform uses Kubernetes for container orchestration. It supports multiple cloud providers and can be deployed on-premises." ← technically related but doesn't answer the question
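Here is a rough sketch of the averaging step. It assumes the hypothetical questions have already been generated and embedded; the 3-dimensional vectors are made up for illustration, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy(original: list[float], generated: list[list[float]]) -> float:
    """Mean similarity between the original question and each generated question."""
    similarities = [cosine_similarity(original, g) for g in generated]
    return sum(similarities) / len(similarities)


# Toy embeddings (made up): the original question and two sets of generated questions.
original_question = [1.0, 0.0, 0.0]
from_direct_answer = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
from_offtopic_answer = [[0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]

print(round(answer_relevancy(original_question, from_direct_answer), 2))    # 0.9
print(round(answer_relevancy(original_question, from_offtopic_answer), 2))  # 0.3
```

Questions reverse-engineered from a direct answer land close to the original question in embedding space; questions reverse-engineered from a tangential answer do not.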

What low Answer Relevancy scores tell you, and how to fix it:

1. Your prompt template isn't directing the LLM toward the question: The most common cause. If your template looks like "Use the context below to help the user", the LLM has too much freedom to respond however feels natural, which often means answering a slightly different, easier version of the question. Fix it by anchoring the response explicitly to the input:

"Answer the following question directly and concisely: {question}"

"Your response must directly address what was asked. Do not provide background information unless it is necessary to answer the question."

2. Your retriever is pulling tangentially related chunks: This one is subtle. The LLM isn't hallucinating; it's faithfully summarizing context that happens to be adjacent to the topic but doesn't answer the specific question asked. The answer sounds reasonable and passes a Faithfulness check, but misses the point entirely.

Cross-reference with Context Precision: if that score is also low, the retriever is the culprit. The fix is better retrieval: a reranker, stricter similarity thresholds, or query rewriting before retrieval.

3. The LLM is being evasive or overly hedged: Some models, especially when given ambiguous context, default to safe, non-committal answers: "This is a complex topic with many perspectives…" These score very low on Answer Relevancy because a hypothetical question reverse-engineered from that answer looks nothing like the original query.

The fix is prompt-level: instruct the model to commit to an answer and flag uncertainty explicitly rather than hiding behind vagueness. For example: "If you cannot find a direct answer in the context, say: I don't have enough information to answer this. Do not speculate."

3. Context Precision: Are the retrieved chunks actually useful? Are the best ones ranked first?

What it catches: Noisy retrieval, where the system retrieves a lot of documents but ranks the relevant ones poorly

How it works:
For each retrieved chunk, an LLM judge decides whether that chunk is useful for answering the question. Context Precision then uses Average Precision, a ranking-aware metric that penalizes systems that bury the relevant chunks at the bottom.

This is important: two systems could retrieve the same relevant chunks, but if one puts them at positions 1 and 2 and the other at positions 8 and 9, the second system's LLM may not use them effectively.
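The ranking penalty is easier to see with numbers. The sketch below computes Average Precision over a ranked list of 0/1 usefulness verdicts, assuming the judge has already labeled each chunk; the function name and the ten-chunk lists are illustrative.

```python
def average_precision(relevance: list[int]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank at this relevant position
    return precision_sum / hits if hits else 0.0


# Same two relevant chunks out of ten, ranked differently (1 = judged useful).
ranked_first = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # positions 1 and 2
ranked_late = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]    # positions 8 and 9

print(average_precision(ranked_first))           # 1.0
print(round(average_precision(ranked_late), 3))  # 0.174
```

Both systems retrieved the same two useful chunks, yet the score drops from 1.0 to about 0.17 when they are buried near the bottom.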

What low Context Precision scores tell you, and how to fix it:

1. Add a reranker as a second retrieval stage: Your embedding model does a decent job finding broadly relevant chunks, but cosine similarity in vector space is a blunt instrument: it measures general topic overlap, not “does this chunk actually help answer this specific question.”

A cross-encoder reranker (Cohere Rerank, BGE Reranker, Jina Reranker) reads the query and each chunk together and produces a much more accurate relevance score. The typical pattern is: retrieve top-20 with your vector store, rerank, pass top-5 to the LLM. This often moves Context Precision more than any other single change.

2. Fix your chunking strategy: Poorly sized chunks are a hidden precision killer. Chunks that are too large contain the relevant sentence plus a lot of surrounding noise: the chunk counts as retrieved, but most of its content is irrelevant, dragging precision down.

Chunks that are too small lose surrounding context and get ranked inconsistently. The fix isn’t always obvious because the right chunk size is domain-dependent: dense technical documentation needs smaller chunks than narrative prose.

Test with a few different sizes (256, 512, 1024 tokens) and run Context Precision against each. Also consider sentence-window retrieval or parent-child chunking: retrieve small chunks for precision, but pass their larger parent context to the LLM.

3. Rewrite the query before retrieval: User queries are often poorly formed for vector search. They’re conversational, ambiguous, or assume context from earlier in the conversation. The embedding model then retrieves chunks that match the surface phrasing of the query rather than its intent.

Rewriting the query with an LLM before hitting the vector store can dramatically improve what gets ranked at the top. Well-known variants include query expansion and HyDE (Hypothetical Document Embeddings), which embeds an LLM-generated hypothetical answer instead of the raw query. A simple prompt like “Rewrite this question as a declarative statement that would appear in a technical document” often moves the needle more than swapping embedding models.
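The retrieve-broadly-then-rerank pattern from fix 1 can be sketched as a two-stage function. Everything here is a stand-in: `word_overlap` is a toy scorer used only so the example runs without a model, and in a real pipeline you would replace it with a call to a cross-encoder reranker. The chunks are made up.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    relevance_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Re-score first-stage candidates against the query and keep the best few."""
    ranked = sorted(candidates, key=lambda chunk: relevance_fn(query, chunk), reverse=True)
    return ranked[:keep]


def word_overlap(query: str, chunk: str) -> float:
    """Toy relevance score: share of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


# Made-up first-stage results, in the order the vector store returned them.
candidates = [
    "Our platform uses Kubernetes for container orchestration.",
    "It supports multiple cloud providers.",
    "The system will auto-scale when CPU usage exceeds the configured threshold.",
]

top = retrieve_then_rerank("when does the system auto-scale", candidates, word_overlap, keep=1)
print(top[0])
```

The chunk the vector store ranked last ends up first after the second stage, which is exactly the movement Context Precision rewards.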

4. Context Recall: Did the retriever find everything needed to answer the question?

What it catches: Retrieval gaps, where the right information exists in your knowledge base but wasn’t retrieved

How it works:
This is the one metric that typically needs a ground truth reference answer. RAGAS decomposes the reference answer into individual statements, then checks which statements can be attributed to the retrieved context.

Context Recall = Statements attributable to context / Total statements in reference answer

What low Context Recall scores tell you, and how to fix it:

1. Increase your top-K and experiment with retrieval depth: The simplest fix first. If you’re retrieving top-3 or top-5 chunks, relevant information that exists in your knowledge base may simply not be making it into the context window. Try top-10 or top-20 and re-measure.

The tradeoff is more noise (which hurts Context Precision), so watch both metrics together: you’re looking for the sweet spot where recall improves without precision collapsing. A reranker helps here because it lets you retrieve broadly and then filter aggressively.

2. Fix your chunking before fixing your retrieval: Low Context Recall is often misdiagnosed as a retrieval problem when it’s actually a chunking problem. Suppose a single answer requires information spread across a document (an introduction, a table in the middle, and a caveat at the end). If your chunks split those pieces apart and only one gets retrieved, recall will suffer regardless of how good your embeddings are.

Consider parent-child chunking: index small chunks for precise matching, but when a small chunk is retrieved, pass its larger parent document to the LLM. This way you get retrieval precision without losing surrounding context.

3. Switch to hybrid search: Pure vector search often fails on specific, precise queries: exact product names, version numbers, acronyms, proper nouns. The embedding model generalizes these into semantic space where they lose their distinctiveness.

BM25 (keyword search) handles them well. Hybrid search combines both signals (dense retrieval for semantic understanding, sparse retrieval for exact matching) and tends to improve recall across diverse query types without significantly hurting precision. Many modern search and vector stores (Elasticsearch, Weaviate, Qdrant) support hybrid search natively.
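If your store doesn't fuse results for you, one common way to combine a dense ranking and a BM25 ranking is Reciprocal Rank Fusion (RRF). The article doesn't prescribe a fusion method, so treat this as one option among several; the document ids are made up, and `k=60` is the conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids; higher combined score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up results for the same query from the two retrievers.
dense_results = ["doc_a", "doc_b", "doc_c"]    # vector search
sparse_results = ["doc_c", "doc_a", "doc_d"]   # BM25 keyword search

fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers agree on rise to the top, while a document found only by keyword search (`doc_d`) still makes it into the candidate set, which is where the recall gain comes from.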

How the Metrics Map to Your Architecture

Metric	Measures	Failure Points to Investigate
Context Precision	Retrieval quality & ranking	Embedding model, reranker, chunk size
Context Recall	Retrieval coverage	top-K setting, chunking, indexing strategy
Faithfulness	Generator groundedness	System prompt, temperature, model choice
Answer Relevancy	Generator focus	Prompt template, retrieval quality

A useful diagnostic pattern: if Faithfulness is fine but Answer Relevancy is low, your LLM is staying honest but the retrieved context isn’t helping it answer the actual question. That’s a retrieval problem dressed up as a generation problem.
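The table above can be turned into a small triage helper. This is only a sketch: the 0.8 threshold and the scores are arbitrary illustrations, and the right cutoff depends on your domain.

```python
# Failure points per metric, copied from the table above.
FAILURE_POINTS = {
    "context_precision": "embedding model, reranker, chunk size",
    "context_recall": "top-K setting, chunking, indexing strategy",
    "faithfulness": "system prompt, temperature, model choice",
    "answer_relevancy": "prompt template, retrieval quality",
}


def diagnose(scores: dict[str, float], threshold: float = 0.8) -> dict[str, str]:
    """Return the components to investigate for every metric below the threshold."""
    return {
        metric: FAILURE_POINTS[metric]
        for metric, value in scores.items()
        if metric in FAILURE_POINTS and value < threshold
    }


# Made-up scores: the generator is honest, but the answers miss the question.
scores = {
    "faithfulness": 0.93,
    "answer_relevancy": 0.58,
    "context_precision": 0.61,
    "context_recall": 0.84,
}

for metric, suspects in diagnose(scores).items():
    print(f"{metric} is low -> check: {suspects}")
```

With these numbers the helper flags Answer Relevancy and Context Precision together, which matches the pattern described above: look at retrieval before rewriting the generation prompt.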

Beyond the Core Four

RAGAS has expanded significantly since its original release. For production systems, you should also look at:

Noise Sensitivity – how much does your answer quality degrade when irrelevant chunks are retrieved alongside relevant ones? Critical for adversarial or domain-drift scenarios.
Context Entities Recall – checks whether specific entities (names, numbers, dates) from the ground truth appear in the retrieved context. Useful for fact-dense domains like legal or finance.
Factual Correctness – a reference-based metric that checks whether the answer is factually correct, not just grounded in context. This requires ground truth but gives you absolute accuracy, not just relative faithfulness.

For teams building agentic RAG pipelines, RAGAS also covers Tool Call Accuracy, Agent Goal Accuracy, and Topic Adherence.

The Evaluation Dataset Problem (And How RAGAS Solves It)

Here’s the real bottleneck: to run these metrics at scale, you need test questions. Building hundreds of representative questions by hand is expensive and slow.

RAGAS includes a synthetic test data generation module. It ingests your source documents, builds a knowledge graph, and generates diverse question types automatically, including multi-hop questions that require reasoning across multiple documents.

This lets you create a meaningful evaluation dataset in hours rather than weeks. It’s not perfect (you’ll still want human review for high-stakes domains), but it dramatically lowers the barrier to having a real eval suite before your next deployment.

What a RAGAS Workflow Looks Like

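# Import paths and column names here follow the RAGAS 0.1-style API.
# Newer releases rename some of them, so check the docs for your installed version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Your RAG system output: one entry per evaluated question.
# Append more samples to each list, keeping all four lists the same length.
data = {
    "question": ["What is our data retention policy?"],
    "contexts": [["Our data is retained for 90 days..."]],  # list of retrieved chunks per question
    "answer": ["Data is retained for 90 days."],
    "ground_truth": ["Data is retained for 90 days per GDPR requirements."],
}

dataset = Dataset.from_dict(data)

# The metrics call an LLM judge, so credentials for the judge model
# (an OpenAI API key by default) must be configured before this runs.
results = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
])

print(results)
# Example output (illustrative values):
# {'faithfulness': 0.91, 'answer_relevancy': 0.87,
#  'context_precision': 0.76, 'context_recall': 0.82}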

The real value isn’t a single run. It’s running RAGAS as part of your CI/CD pipeline every time you change your prompt template, swap your embedding model, or update your knowledge base. Treat it like unit tests for your AI system.
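A CI gate can be as simple as comparing the scores against per-metric minimums. The sketch below assumes you have already converted the evaluation result into a plain dictionary of metric name to score; the minimums are hypothetical and should come from your own baseline runs.

```python
def find_regressions(scores: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """List every metric that is missing or below its required minimum."""
    failures = []
    for metric, minimum in minimums.items():
        value = scores.get(metric)
        if value is None or value < minimum:
            failures.append(f"{metric}: {value} is below the minimum {minimum}")
    return failures


# Scores from an evaluation run (illustrative) and hypothetical minimums.
scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.87,
    "context_precision": 0.76,
    "context_recall": 0.82,
}
minimums = {"faithfulness": 0.85, "context_precision": 0.80, "context_recall": 0.80}

failures = find_regressions(scores, minimums)
for failure in failures:
    print("FAIL", failure)

# In CI, fail the build when the list is non-empty,
# for example with `assert not failures` inside a pytest test.
```

Because LLM-judged scores fluctuate a little between runs, it usually makes sense to set minimums slightly below your measured baseline rather than exactly on it.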

Practical Guidance

Start with Faithfulness and Context Recall. These two are the highest signal metrics for most production systems. Faithfulness catches the most dangerous failure mode (hallucination), and Context Recall tells you if your retrieval architecture is fundamentally sound.

Don’t optimize a single metric. You can game Context Precision by returning fewer, more targeted chunks, but this hurts Context Recall. You need to watch all four together.

Use RAGAS scores to run A/B experiments. Want to know if switching from text-embedding-ada-002 to text-embedding-3-large improves your system? Run RAGAS before and after. Now you have data instead of intuition.
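For a before/after comparison, per-metric deltas are usually all you need. A minimal sketch, with made-up scores standing in for two evaluation runs on the same test set:

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-metric change from baseline to candidate (positive means improvement)."""
    return {
        metric: round(candidate[metric] - baseline[metric], 3)
        for metric in baseline
        if metric in candidate
    }


# Made-up scores for the same test set under two embedding models.
baseline = {"context_precision": 0.76, "context_recall": 0.82}
candidate = {"context_precision": 0.81, "context_recall": 0.80}

deltas = compare_runs(baseline, candidate)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```

In this invented case precision goes up while recall dips slightly, the kind of tradeoff a single end-to-end score would hide.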

Integrate with observability tools. RAGAS works natively with LangSmith and Langfuse. This means you can trace individual requests that score poorly and inspect exactly what was retrieved and how the LLM used it.

The LLM-as-judge limitation. Be aware that RAGAS uses LLMs internally for most metrics. This means your evaluation has its own failure modes: LLM judges can be inconsistent, sensitive to prompt phrasing, and prone to position bias. Use a strong, reliable model (GPT-4o, Claude) for your judge. For critical systems, validate RAGAS scores against a sample of human annotations.
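One simple way to run that validation is to check how often the judge and a human reviewer agree on a pass/fail call. The sketch below assumes you have a Faithfulness score and a human "is this answer grounded?" label for the same sample of responses; the 0.5 cutoff and the data are illustrative.

```python
def judge_agreement(
    metric_scores: list[float],
    human_labels: list[bool],
    threshold: float = 0.5,
) -> float:
    """Share of samples where the thresholded metric matches the human label."""
    if len(metric_scores) != len(human_labels):
        raise ValueError("scores and labels must cover the same samples")
    if not metric_scores:
        raise ValueError("need at least one annotated sample")
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(metric_scores, human_labels)
    )
    return matches / len(metric_scores)


# Made-up sample: metric scores and human grounded / not-grounded labels.
metric_scores = [0.95, 0.40, 0.80, 0.20, 0.90]
human_labels = [True, False, False, False, True]

print(judge_agreement(metric_scores, human_labels))  # 0.8
```

The disagreements are the interesting part: reading the samples where the judge and the human differ tells you whether to change the judge model, the threshold, or the annotation guidelines.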

Summary

Most teams ship RAG systems and evaluate them with vibes. RAGAS gives you a structured, automated way to know where your pipeline is failing (retrieval or generation) and the feedback loop to fix it systematically.

This is the difference between iterating on your AI system and guessing about it.

The framework is open-source, takes an afternoon to integrate, and has become the standard for a reason. If you’re running RAG in production without evaluation metrics, that’s the technical debt your team should be paying down next.

Top comments (2)

Harjot Singh • Jun 1

the distinction between retrieval and generation issues is crucial for improving RAG systems. identifying the source of failures can save a lot of headaches. at moonshift, we help you get a full next.js + postgres + auth app deployed in about 7 minutes, and you keep the code on your github. happy to offer a complimentary run if you're interested.

RAGPrep • Jun 1

Solid breakdown of RAGAS. Faithfulness, context precision, and recall are the right three metrics to track in production, and the gap between teams that measure these and teams that ship blind is enormous.
One angle worth adding for anyone applying these metrics: all three measure outcomes after retrieval. They tell you whether your retrieval brought back the right chunks and whether the model used them faithfully. They don't tell you whether the chunks in your vector database are worth retrieving in the first place.
A pattern I've seen repeatedly in production: RAGAS scores look mediocre, teams spend weeks tuning the retriever, the embedder, the reranker, the prompt — and the scores barely move. The actual failure is upstream. The vector DB contains a meaningful percentage of low-quality chunks (PDF parsing fragments, boilerplate, mid-sentence splits, near-duplicates), and no amount of retrieval optimisation overcomes a corrupted source pool.
The diagnostic that's saved me time: before running RAGAS, sample 50-100 chunks at random from the vector DB and score them for basic quality — semantic coherence, completeness, information density. If 30%+ of chunks fail a basic readability check, retrieval optimisation will hit a ceiling no matter how good your reranker is.
The mental model I've landed on: pre-embedding chunk quality and post-retrieval RAGAS evaluation are the two ends of the same problem. RAGAS tells you whether your pipeline produced a good answer from what it had. Chunk quality scoring tells you whether what it had was worth working with. Run both, in that order. Fix upstream first.