<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sridhar Tondapi</title>
    <description>The latest articles on DEV Community by sridhar Tondapi (@sridhar_tondapi).</description>
    <link>https://dev.to/sridhar_tondapi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3637635%2F73831221-2a78-4166-bd14-c5f030c199ed.png</url>
      <title>DEV Community: sridhar Tondapi</title>
      <link>https://dev.to/sridhar_tondapi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sridhar_tondapi"/>
    <language>en</language>
    <item>
      <title>Master RAG Evaluation with RAGAS</title>
      <dc:creator>sridhar Tondapi</dc:creator>
      <pubDate>Sun, 30 Nov 2025 16:46:18 +0000</pubDate>
      <link>https://dev.to/sridhar_tondapi/master-rag-evaluation-with-ragas-5403</link>
      <guid>https://dev.to/sridhar_tondapi/master-rag-evaluation-with-ragas-5403</guid>
      <description>&lt;p&gt;In recent years, Retrieval-Augmented Generation (RAG) has become one of the go-to architectures for enterprises building AI assistants, knowledge search systems, and domain-specific agents. As companies ingest more documents, RAG systems grow complex — with custom retrievers, vector stores, and prompt engineering. To ensure the RAG pipeline performs well and returns results without hallucinations, the open-source framework Ragas (Retrieval-Augmented Generation Assessment Suite) is widely used to evaluate RAG pipelines.&lt;/p&gt;


&lt;h2&gt;
  What is Ragas?
&lt;/h2&gt;


&lt;p&gt;Ragas is an open-source evaluation framework designed specifically for RAG pipelines. Instead of evaluating the LLM output in isolation, Ragas assesses the full loop:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User question → Retrieved context → LLM-generated answer → Final answer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below are the core metrics used for evaluating this full loop. As of 2025, four metrics are the standard choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Precision, Context Recall, Faithfulness, and Answer Relevancy.&lt;/strong&gt; These four cover &amp;gt;95% of real-world evaluation needs without requiring hard-to-get ground-truth answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Older tutorials sometimes mention metrics like answer_correctness or context_relevance, but these are rarely used in modern pipelines.&lt;/p&gt;

&lt;p&gt;Below are the four core Ragas metrics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Context Recall&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Out of everything that should have been retrieved to answer the question correctly, how much did we actually manage to retrieve?”&lt;/p&gt;

&lt;p&gt;An LLM extracts relevant statements from the ground-truth answer and checks if they appear in the retrieved context (no need for full corpus labels).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perfect Score:&lt;/strong&gt; 1.0&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common real-world range:&lt;/strong&gt; 0.70–0.95&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Score = 1.0 → We retrieved all the relevant pieces of information.&lt;/li&gt;
&lt;li&gt;Score = 0.6 → We missed 40% of the truly relevant context.&lt;/li&gt;
&lt;/ul&gt;
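
&lt;p&gt;The idea behind context recall can be sketched in a few lines. This is only a toy illustration using substring matching; the real Ragas metric uses an LLM to extract statements from the ground truth and match them against the context:&lt;/p&gt;

```python
# Toy sketch of context recall: the fraction of ground-truth statements
# that can be found in the retrieved context. Ragas does this matching
# with an LLM judge; plain substring matching stands in for it here.

def toy_context_recall(ground_truth_statements, retrieved_chunks):
    """Fraction of ground-truth statements present in the retrieved context."""
    context_text = " ".join(retrieved_chunks).lower()
    found = sum(1 for s in ground_truth_statements if s.lower() in context_text)
    return found / len(ground_truth_statements)

statements = [
    "RAG retrieves documents before generating",
    "RAG reduces hallucinations",
]
chunks = ["RAG retrieves documents before generating an answer."]
print(toy_context_recall(statements, chunks))  # 0.5 -> half the needed facts were retrieved
```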

&lt;p&gt;&lt;strong&gt;2. Context Precision&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Of all the chunks we brought back, how many are truly useful vs. how many are just noise or irrelevant?”&lt;/p&gt;

&lt;p&gt;Verifies the relevance of each retrieved chunk to the question using an LLM judge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perfect Score:&lt;/strong&gt; 1.0&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common real-world range:&lt;/strong&gt; 0.85–0.98&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Score = 1.0 → Everything we retrieved is relevant (zero noise).&lt;/li&gt;
&lt;li&gt;Score = 0.8 → 20% of the retrieved chunks are irrelevant or distracting.&lt;/li&gt;
&lt;li&gt;Score = 0.4 → Only 40% of what we retrieved is actually helpful; the other 60% is junk.&lt;/li&gt;
&lt;/ul&gt;
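
&lt;p&gt;Context precision is rank-aware: a relevant chunk ranked first counts for more than one buried at the bottom of the results. A simplified sketch, assuming per-chunk relevance judgments are already available (Ragas obtains them from an LLM judge):&lt;/p&gt;

```python
# Simplified rank-aware context precision: average precision@k over the
# positions of relevant chunks. Relevance flags (1 = relevant) would come
# from an LLM judge in Ragas; here they are given directly.

def toy_context_precision(relevance_flags):
    precisions_at_hits = []
    hits = 0
    for k, relevant in enumerate(relevance_flags, start=1):
        if relevant:
            hits += 1
            precisions_at_hits.append(hits / k)
    return sum(precisions_at_hits) / len(precisions_at_hits) if precisions_at_hits else 0.0

print(toy_context_precision([1, 1, 0]))  # 1.0 -> relevant chunks ranked first
print(toy_context_precision([0, 0, 1]))  # ~0.33 -> the only relevant chunk is ranked last
```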

&lt;p&gt;&lt;strong&gt;3. Faithfulness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Does the model make stuff up, or is every single claim backed by the retrieved chunks?”&lt;/p&gt;

&lt;p&gt;Breaks the answer into atomic claims and verifies each one against the context (fine-grained hallucination check; replaced the older “groundedness”).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perfect Score:&lt;/strong&gt; 1.0&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common real-world range:&lt;/strong&gt; 0.90–1.00&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Score = 1.0 for Faithfulness → 100% faithful: no hallucinated claims.&lt;/li&gt;
&lt;li&gt;Score = 0.85 → 15% of the claims are not supported → some hallucination.&lt;/li&gt;
&lt;/ul&gt;
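
&lt;p&gt;A minimal sketch of the faithfulness check, again with the two LLM steps (claim extraction and verification) replaced by naive substring matching purely for illustration:&lt;/p&gt;

```python
# Toy faithfulness: verify each answer claim against the context and report
# the unsupported (hallucinated) ones. Ragas performs both the claim split
# and the verification with an LLM; substring checks stand in for them here.

def toy_faithfulness(answer_claims, context):
    ctx = context.lower()
    supported = [c for c in answer_claims if c.lower() in ctx]
    unsupported = [c for c in answer_claims if c not in supported]
    return len(supported) / len(answer_claims), unsupported

claims = [
    "RAG combines retrieval and generation",
    "RAG guarantees perfect accuracy",
]
context = "RAG combines retrieval and generation to ground answers in documents."
score, hallucinated = toy_faithfulness(claims, context)
print(score, hallucinated)  # 0.5 ['RAG guarantees perfect accuracy']
```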

&lt;p&gt;&lt;strong&gt;4. Answer Relevancy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Does the answer actually answer what was asked, or is it off-topic or incomplete?”&lt;/p&gt;

&lt;p&gt;An LLM judge evaluates completeness, focus, and lack of tangents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perfect Score:&lt;/strong&gt; 1.0&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common real-world range:&lt;/strong&gt; 0.88–0.99&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Score = 1.0 → Fully relevant: answers the question completely and stays focused.&lt;/li&gt;
&lt;li&gt;Score = 0.8 → Mostly relevant but slightly incomplete or contains minor irrelevant parts.&lt;/li&gt;
&lt;li&gt;Score = 0.3 → Largely irrelevant, evasive, or misses the main point of the question.&lt;/li&gt;
&lt;/ul&gt;
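
&lt;p&gt;Under the hood, Ragas computes answer relevancy by generating candidate questions from the answer with an LLM and comparing their embeddings to the original question. The sketch below substitutes a bag-of-words cosine similarity for real embeddings, purely for illustration:&lt;/p&gt;

```python
# Toy answer relevancy: average similarity between the original question and
# questions "regenerated" from the answer. Ragas uses an LLM to regenerate
# questions and an embedding model for similarity; a bag-of-words cosine is
# a stand-in here.
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(va) * norm(vb))

question = "what is rag"
regenerated = [  # questions an LLM might derive from the generated answer
    "what is rag",
    "how does rag combine retrieval and generation",
]
relevancy = sum(cosine(question, q) for q in regenerated) / len(regenerated)
print(round(relevancy, 2))
```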

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import Dataset  # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import (
    context_precision, 
    context_recall,
    answer_relevancy,
    faithfulness,
)

data = {
    "question": ["What is RAG?"],
    "answer": ["RAG combines retrieval and generation for better AI responses."],
    "contexts": [["Retrieval-Augmented Generation (RAG) is a technique..."]],
    "ground_truth": ["RAG enhances LLMs by fetching external knowledge."],  # Optional
}
test_dataset = Dataset.from_dict(data)

result = evaluate(
    dataset=test_dataset,  # Or 'data=' in some older versions
    metrics=[context_precision, context_recall, answer_relevancy, faithfulness],
)

print(result)
# Output example: {'context_precision': 0.92, 'context_recall': 0.85, 'faithfulness': 0.98, 'answer_relevancy': 0.94}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Real Use Case:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparing Versions of Your RAG Pipeline:&lt;/strong&gt; The true power of Ragas emerges when you use it to compare versions of your RAG pipeline. A RAG system has two major components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval:&lt;/strong&gt; Pulling the right documents.&lt;br&gt;
&lt;strong&gt;Generation:&lt;/strong&gt; Producing a correct, grounded answer.&lt;/p&gt;

&lt;p&gt;Ragas evaluates both.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate Retrieval Quality:&lt;/strong&gt; Ragas checks whether your retriever is bringing back the right information. It answers questions like:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Are we retrieving relevant chunks?&lt;br&gt;
Are we retrieving too much noise or irrelevant text?&lt;br&gt;
Metrics that help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context Recall: Did we retrieve what we should retrieve?&lt;/li&gt;
&lt;li&gt;Context Precision: What proportion of what we retrieved was relevant?&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate Generation Quality:&lt;/strong&gt; Even if retrieval is perfect, the LLM may hallucinate, skip details, or misinterpret the context. Ragas checks the quality of the generated answer. It answers:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Is the answer factually correct?&lt;br&gt;
Did the model stay grounded in the retrieved context?&lt;br&gt;
Metrics that help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faithfulness: Checks for hallucinations (no unsupported claims).&lt;/li&gt;
&lt;li&gt;Answer Relevancy: Ensures the answer is on-topic and complete.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the score combinations below to pinpoint the issue and iterate on the RAG pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c2cx4qzkkuqzllqj5d2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c2cx4qzkkuqzllqj5d2.png" alt=" " width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;
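
&lt;p&gt;In practice, this comparison can be automated: run the same test set through each pipeline version and flag which component regressed. The thresholds below are illustrative assumptions chosen for this sketch, not Ragas defaults:&lt;/p&gt;

```python
# Hypothetical sketch: diagnose two pipeline versions from their Ragas scores.
# Threshold values are illustrative assumptions, not Ragas defaults.

THRESHOLDS = {
    "context_recall": 0.80,     # below -> retriever is missing relevant chunks
    "context_precision": 0.80,  # below -> retrieval returns too much noise
    "faithfulness": 0.90,       # below -> generation is hallucinating
    "answer_relevancy": 0.85,   # below -> answers drift off-topic
}

def diagnose(scores):
    """Return the metrics that fall below their threshold."""
    issues = [m for m, t in THRESHOLDS.items() if scores[m] < t]
    return issues or ["all metrics above thresholds"]

v1 = {"context_precision": 0.90, "context_recall": 0.70, "faithfulness": 0.95, "answer_relevancy": 0.92}
v2 = {"context_precision": 0.88, "context_recall": 0.85, "faithfulness": 0.96, "answer_relevancy": 0.93}

print("v1:", diagnose(v1))  # v1: ['context_recall'] -> fix the retriever, not the prompt
print("v2:", diagnose(v2))  # v2: ['all metrics above thresholds']
```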

</description>
      <category>testing</category>
      <category>llm</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
