RAG Series (8): RAG Evaluation System — Speaking with Data

Why "It Feels Fine" Is Not a Standard

In the previous seven articles, we built a complete RAG pipeline: chunking, embeddings, vector stores, and retrieval strategies. The system is running, and when you ask a few questions, the answers look "pretty good."

But then problems arise:

  • Did it actually get better after iteration? You swapped the embedding model, tuned chunk_size, added MMR — but did answer quality really improve? Or did it just "feel" better?
  • Where is the problem? A particular question gets a terrible answer — is it because the retrieval phase failed to fetch relevant documents, or because the generation phase is hallucinating?
  • How do you report to your boss? "I think our RAG system is pretty good" — this statement carries zero weight in a data-driven team.

RAG system evaluation cannot rely on feelings; it must rely on data.

This article will walk you through building a quantifiable RAG evaluation system from scratch using the RAGAS framework, so you can clearly know whether your system is good, where it falls short, and how to fix it.


What Is RAGAS?

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source evaluation framework designed specifically for RAG systems. Its core idea is simple: use an LLM as a judge to automatically assess the quality of a RAG system's output.

Why use an LLM as a judge? Traditional NLP metrics like BLEU and ROUGE are designed for translation or summarization tasks — they judge similarity through string matching and completely fail to understand semantics. RAG evaluation, however, requires understanding whether "this answer is grounded in the context" or "this response missed the point" — exactly what LLMs excel at.

RAGAS proposes four core metrics that together cover the two key phases of a RAG system: retrieval and generation.


The Four Core Metrics Explained

1. Faithfulness

Question: Is the answer making things up?

Faithfulness measures whether the generated answer is faithful to the retrieved context. If the model adds information not present in the context, that's a "hallucination," and Faithfulness will be low.

In plain terms: High Faithfulness on an exam means every sentence in your answer is based on the provided reference materials — no made-up facts.

How it works: The judge LLM first breaks the answer into individual claims, then checks whether each claim can be inferred from the retrieved context. Faithfulness = (inferable claims) / (total claims).
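
Once the judge LLM has labeled each claim, the scoring step is a simple ratio. A minimal sketch (not RAGAS's internal code; the claims and verdicts here are toy placeholders):

# Toy claims with verdicts that would normally come from the judge LLM
claims = [
    {"text": "RAGAS has four core metrics", "supported_by_context": True},
    {"text": "RAGAS was released in 2010",  "supported_by_context": False},  # hallucination
]

faithfulness = sum(c["supported_by_context"] for c in claims) / len(claims)
print(faithfulness)  # 0.5 -> half of the claims are grounded in the retrieved context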

2. Answer Relevancy

Question: Is the answer actually answering the question?

Answer Relevancy measures how relevant the answer is to the question. Even if the content is factually correct, if it strays from the core of the question, this metric will be low.

In plain terms: You ask "How do I learn Python?" and the other person gives you a lecture on Java's history — the content may be correct, but it's completely off-topic.

How it works: The LLM generates several question variants based on the answer, then computes the embedding similarity between these variants and the original question, taking the average.
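
A runnable toy sketch of that averaging step, with small vectors standing in for real embeddings (the question generation itself is done by the judge LLM and is not shown):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: the original question's embedding and embeddings of the
# questions the judge LLM generated back from the answer.
original_question = np.array([0.9, 0.1, 0.2])
generated_variants = [np.array([0.8, 0.2, 0.1]), np.array([0.1, 0.9, 0.3])]

answer_relevancy = np.mean([cosine(original_question, v) for v in generated_variants])
print(round(answer_relevancy, 3))  # higher when the answer "implies" the original question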

3. Context Precision

Question: How much of the retrieved content is noise?

Context Precision measures the proportion of relevant document chunks in the retrieval results. If 2 out of 4 retrieved contexts are completely irrelevant, Context Precision is 0.5.

In plain terms: You go to the library and borrow 4 books, but only 2 are useful — your retrieval precision is 50%.

How it works: The LLM judges each retrieved chunk individually to determine whether it is relevant to the question, then takes (relevant chunks) / (total chunks). RAGAS's actual implementation also weights by rank, so relevant chunks that appear earlier in the results count for more.
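
A toy sketch of the rank-weighted form (not RAGAS's internal code), assuming the per-chunk relevance verdicts have already come back from the judge LLM:

# verdicts[i] = 1 if the chunk at rank i was judged relevant, else 0
verdicts = [1, 0, 1, 0]  # 4 retrieved chunks, 2 relevant

# precision@k at each rank where a relevant chunk appears, averaged over the hits
precisions_at_hits = [
    sum(verdicts[: k + 1]) / (k + 1) for k, v in enumerate(verdicts) if v
]
context_precision = sum(precisions_at_hits) / max(sum(verdicts), 1)
print(round(context_precision, 3))  # 0.833 -> the same 2-of-4 case scores above 0.5 because both hits rank high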

4. Context Recall

Question: Did we find everything we should have?

Context Recall measures how much of the information relevant to the question was successfully retrieved. This is the most critical metric for the retrieval phase.

In plain terms: The exam covers 10 knowledge points, but your study notes only cover 6 — your recall rate is 60%.

How it works: The LLM breaks the ground_truth answer into multiple key claims, then checks whether each claim can be inferred from the retrieved contexts. (Inferable claims) / (Total claims in ground_truth) = Context Recall.
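
The arithmetic mirrors Faithfulness, but the claims come from the ground_truth rather than from the generated answer. A minimal sketch with placeholder claims and verdicts:

# Each entry: a key claim from the ground_truth, plus the judge LLM's verdict on
# whether that claim is supported by the retrieved contexts.
ground_truth_claims = [("claim A", True), ("claim B", True), ("claim C", False)]

context_recall = sum(found for _, found in ground_truth_claims) / len(ground_truth_claims)
print(round(context_recall, 3))  # 0.667 -> one key fact never made it into the retrieval results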


How the Four Metrics Relate

User asks a question
    ├─→ Context Recall low? → Retrieval phase issue (chunk/embedding/top-k)
    ├─→ Context Precision low? → Noise mixed into retrieval results
    ├─→ Faithfulness low? → Generation phase hallucination (insufficient context or model disobedience)
    └─→ Answer Relevancy low? → Answer is off-topic

These four metrics are independent yet complementary, together forming a "health report" for your RAG system.


Step One of Evaluation: Building a Test Set

Evaluation needs "exam questions" and "reference answers." The quality of your test set directly determines the credibility of your evaluation results.

What Does a Test Sample Look Like?

A standard RAG test sample contains four fields:

{
  "question": "What are the four core evaluation metrics in the RAGAS framework?",
  "ground_truth": "The four core RAGAS metrics are: Faithfulness, Answer Relevancy, Context Precision, and Context Recall.",
  "contexts": ["...", "..."],
  "answer": "..."
}
  • question: The user's question
  • ground_truth: The standard answer (manually written, model-independent)
  • contexts: Contexts retrieved by the RAG system (filled automatically after running)
  • answer: The answer generated by the RAG system (filled automatically after running)

Two Ways to Build Test Sets

| Method | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Manual labeling | High quality, clear boundaries | High cost, time-consuming | Core test sets, production acceptance |
| LLM generation | High efficiency, scalable | Requires human sampling and correction | Rapid construction, expanding coverage |
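
For the LLM-generation route, here is a hedged sketch of what a generator helper might look like; the prompt, the model name, and the generate_qa_pair function are illustrative placeholders, not the article's generate_qa.py:

import json
from langchain_openai import ChatOpenAI  # any chat model with .invoke() would work

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model choice is illustrative

QA_PROMPT = """Based only on the following document excerpt, write one question a user
might ask and a reference answer fully supported by the excerpt.
Return JSON with keys "question" and "ground_truth".

Excerpt:
{chunk}"""

def generate_qa_pair(chunk: str) -> dict:
    raw = llm.invoke(QA_PROMPT.format(chunk=chunk)).content
    return json.loads(raw)  # in practice, validate the JSON and have a human spot-check each pair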

Our Test Data

We use 8 RAG-related technical articles as the knowledge base (covering vector databases, embeddings, chunking strategies, hybrid search, etc.) and manually craft 1 QA pair for each:

[
  {"question": "What is RAG and what problems does it solve?", "ground_truth": "RAG..."},
  {"question": "Which vector database should enterprises choose?", "ground_truth": "Qdrant and Weaviate..."},
  {"question": "Which embedding model is best for Chinese?", "ground_truth": "BGE series..."},
  ...
]

The full data is in data/knowledge_base.json and data/manual_testset.json.


Hands-On (Part 1): Building an Evaluable RAG System

Before evaluation, we need a configurable RAG Pipeline. Why configurable? Because we want to compare evaluation results under different parameters (chunk_size, top_k).

Core Structure of rag_pipeline.py

class RAGPipeline:
    def __init__(self, chunk_size=512, chunk_overlap=50, top_k=4):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.top_k = top_k

    def build_index(self, docs, force_rebuild=False):
        # 1. Split documents
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=["\n\n", "\n", "。", "，", " ", ""],  # paragraphs, lines, Chinese sentence punctuation
        )
        chunks = splitter.split_documents(docs)

        # 2. Build vector index (embeddings is defined at module level)
        self.vectorstore = Chroma.from_documents(
            documents=chunks, embedding=embeddings
        )
        self.retriever = self.vectorstore.as_retriever(
            search_kwargs={"k": self.top_k}
        )

        # 3. Assemble LCEL Chain (prompt / llm / format_docs are defined at module level)
        self.chain = (
            {"context": self.retriever | format_docs, "question": RunnablePassthrough()}
            | prompt | llm | StrOutputParser()
        )

    def query(self, question):
        contexts = self.retriever.invoke(question)
        answer = self.chain.invoke(question)
        return {"question": question, "answer": answer, "contexts": contexts}

Key design: All quality-impacting parameters are exposed as constructor arguments, making it easy to switch configurations in diagnostic experiments.


Hands-On (Part 2): Running RAGAS Evaluation

Evaluation Flow in evaluate.py

def main():
    # 1. Load 8 manually labeled test samples
    testset = load_testset("./data/manual_testset.json")

    # 2. Build RAG Pipeline (default config: chunk=512, overlap=50, top_k=4)
    pipeline = RAGPipeline(chunk_size=512, chunk_overlap=50, top_k=4)
    pipeline.build_index(force_rebuild=True)

    # 3. Run RAG on each test sample, collect answers and contexts
    dataset = prepare_dataset(pipeline, testset)
    # dataset format: {question, answer, contexts, ground_truth}

    # 4. Run RAGAS evaluation
    result = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
        llm=ragas_llm,        # LLM used as evaluation judge
        embeddings=ragas_emb, # Embedding used for evaluation
    )

    # 5. Print and save report
    print_report(result, dataset)
    save_report(result, dataset, "./evaluation_report.json")
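
prepare_dataset itself is not shown above; one plausible implementation, assuming the older RAGAS column names used in this article (question / answer / contexts / ground_truth) and a HuggingFace Dataset passed to evaluate():

from datasets import Dataset

def prepare_dataset(pipeline, testset):
    rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for sample in testset:
        result = pipeline.query(sample["question"])
        rows["question"].append(sample["question"])
        rows["answer"].append(result["answer"])
        # RAGAS expects contexts as a list of strings, not Document objects
        rows["contexts"].append([doc.page_content for doc in result["contexts"]])
        rows["ground_truth"].append(sample["ground_truth"])
    return Dataset.from_dict(rows)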

Note: The LLM and embeddings used for evaluation can differ from those used by the RAG system itself. The RAG system can use a lightweight model (e.g., glm-4-flash), while the evaluation judge can use a stronger model (e.g., GPT-4) for more accurate judgments. In practice, however, the same model is often used to save costs.
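
How ragas_llm and ragas_emb get built depends on your RAGAS version; recent releases can wrap any LangChain model, roughly like this (the model names are placeholders):

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# A stronger judge model than the one inside the RAG pipeline, if budget allows
ragas_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
ragas_emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))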

Actual Run Results

$ python evaluate.py

Output:

============================================================
 RAGAS Evaluation Report
============================================================

📊 Overall Scores:
   faithfulness          : 0.792 ███████████████
   answer_relevancy      : 0.406 ████████
   context_precision     : 0.583 ███████████
   context_recall        : 0.625 ████████████

   Average Score              : 0.602

📋 Per-Question Breakdown:
     # Question                        Faith AnsRel CtxPre CtxRec
   ---------------------------------------------------------------
     1 What is RAG and...               1.00   0.68   1.00   1.00
     2 Which vector DB for...           0.33   0.09   0.33   0.00
     3 Which embedding for...           1.00   0.78   1.00   1.00
     4 Recommended chunk size...        1.00   0.86   0.50   1.00
     5 Four RAGAS metrics...            1.00   0.64   1.00   1.00
     6 RRF fusion formula...            0.00   0.00   0.00   0.00
     7 HyDE query optimization...       1.00   0.13   0.00   0.00
     8 Multi-tenant isolation...        1.00   0.07   0.83   1.00

⚠️  Weakest Metric: answer_relevancy (0.406)
============================================================

Interpreting the Results

This report lets us pinpoint problems immediately:

  1. Faithfulness is decent (0.792): The model generally answers based on retrieved context, rarely making things up.
  2. Answer Relevancy is very low (0.406): This is the biggest problem! Many answers are grounded in context but fail to address the core of the question.
  3. Context Recall (0.625) and Context Precision (0.583) are moderate: The retrieval phase has room for improvement.
  4. Question 6 completely collapses (0/0/0/0): Asked about the RRF formula, but the system failed to retrieve any relevant document — a classic retrieval failure.

Without this report, you would only think "Question 6 didn't turn out great." With the report, you know precisely that retrieval failed, not generation.


Hands-On (Part 3): Diagnostic Experiment — Good Config vs Bad Config

The greatest value of evaluation is not scoring but diagnosis. Below, we deliberately create a "bad configuration" to see if RAGAS can catch the problem.

Design of diagnose.py

| Parameter | Good Config | Bad Config | Expected Problem |
| --- | --- | --- | --- |
| chunk_size | 512 | 128 | Context fragmentation, semantic breaks |
| chunk_overlap | 50 | 0 | Boundary information loss |
| top_k | 4 | 2 | Insufficient recall, missing relevant info |

# Good config
good = RAGPipeline(chunk_size=512, chunk_overlap=50, top_k=4)

# Bad config: tiny chunks + no overlap + only retrieve 2
bad = RAGPipeline(chunk_size=128, chunk_overlap=0, top_k=2)
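
The comparison itself can be a thin wrapper around the same evaluate() call, run once per configuration. A sketch reusing the prepare_dataset helper, metrics, and judge models from evaluate.py, assuming result.to_pandas() returns one column per metric (as in the RAGAS versions used here):

def score(pipeline, testset):
    pipeline.build_index(force_rebuild=True)
    dataset = prepare_dataset(pipeline, testset)
    result = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
        llm=ragas_llm,
        embeddings=ragas_emb,
    )
    return result.to_pandas()  # one row per test sample, one column per metric

good_df, bad_df = score(good, testset), score(bad, testset)
for m in ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]:
    diff = good_df[m].mean() - bad_df[m].mean()
    print(f"{m:20s} good={good_df[m].mean():.3f} bad={bad_df[m].mean():.3f} diff={diff:+.3f}")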

Diagnostic Comparison Results

Running python diagnose.py:

============================================================
 Diagnostic Comparison: Good Config vs Bad Config
============================================================

  Metric                      Good      Bad        Diff      Diagnosis
  --------------------------------------------------------------------
  faithfulness                0.830     0.750     +0.080    ✓ Normal
  answer_relevancy            0.502     0.191     +0.312    ✗ Severe
  context_precision           0.583     0.375     +0.208    ⚠ Warning
  context_recall              0.625     0.250     +0.375    ✗ Severe
  --------------------------------------------------------------------
  Average Score                 0.635     0.391     +0.244
============================================================

  📋 Diagnostic Conclusions:
     → Context Recall severely dropped: retrieval phase issue (chunk too small / top-k too few)
     → Context Precision dropped: noise mixed into retrieval results
     → Answer Relevancy severely dropped: answers deviated from questions

Diagnostic Analysis

| Dropped Metric | Good → Bad | Diagnosis |
| --- | --- | --- |
| Context Recall ↓↓ | 0.625 → 0.250 | Tiny chunks cause semantic breaks; top_k=2 leaves large amounts of relevant info unretrieved |
| Context Precision ↓ | 0.583 → 0.375 | After fragmentation, low-quality chunks are more easily falsely matched to the query |
| Answer Relevancy ↓↓ | 0.502 → 0.191 | Insufficient context → the model can only answer from limited info → the answer drifts from the question |
| Faithfulness ↓ | 0.830 → 0.750 | With fragmented context, the model sometimes has to "hallucinate" to fill information gaps |

This experiment perfectly demonstrates RAGAS's diagnostic power: instead of vaguely saying "the bad config is worse," it precisely tells you which component failed and how severely.


Complete Source Code

The complete code for this article is open-sourced, including:

  • rag_pipeline.py — Configurable RAG Pipeline
  • evaluate.py — Main RAGAS evaluation script
  • diagnose.py — Good config vs bad config diagnostic experiment
  • generate_qa.py — LLM-automated test set generation
  • data/knowledge_base.json — 8 knowledge base documents
  • data/manual_testset.json — 8 manually labeled test samples

Source code:

https://github.com/chendongqi/llm-in-action/tree/main/08-ragas-eval


When to Use RAGAS and What to Watch Out For

When Should You Run RAGAS Evaluation?

  • After system iteration: After swapping embedding models, adjusting chunking strategies, or upgrading the LLM, run an evaluation to see metric changes
  • Before production launch: Establish a baseline to ensure the system meets acceptable quality thresholds
  • During issue triage: When users complain about poor answer quality, use metrics to pinpoint whether it's a retrieval or generation issue

Important Notes

  1. Evaluation is not cheap: 4 metrics × N test samples, each requiring multiple LLM calls. 8 samples take roughly 3-5 minutes and hundreds of API calls. Production environments should use async batch processing.
  2. Ground truth quality determines the evaluation ceiling: If your reference answers are poorly written, metrics like Context Recall will be skewed.
  3. Judge model and business model can differ: Evaluation can use a stronger model (e.g., GPT-4) while the RAG system itself uses a lightweight model (e.g., glm-4-flash). This yields more accurate judgments while keeping costs controllable.
  4. Update test sets regularly: When the knowledge base is updated, the test set should be synchronized to cover new domains.

Summary

This article walked you through building a RAG evaluation system from scratch:

  1. Why evaluate: "Feels fine" is not quantifiable; you won't know if iterations actually improved anything
  2. RAGAS four core metrics: Faithfulness (anti-hallucination), Answer Relevancy (anti-off-topic), Context Precision (noise reduction), Context Recall (recall guarantee)
  3. Test set construction: Manual labeling as primary, LLM generation as supplementary
  4. Hands-on code: Complete RAG Pipeline + RAGAS evaluation + diagnostic comparison experiment
  5. Real data: Good config averaged 0.635, bad config averaged 0.391 — the gap is obvious

Key insight: Don't optimize RAG by gut feeling. Run RAGAS first, identify the weakest metric, then optimize specifically — that's the most efficient improvement path.

