vishalmysore

RAG Evaluation in Java: A Comprehensive Guide

Quick Introduction to RAG

Retrieval-Augmented Generation (RAG) is a powerful approach that combines document retrieval with Large Language Models (LLMs). By first retrieving relevant documents from a knowledge base and then using them to inform the LLM's response, RAG systems ensure more accurate, context-aware, and factual outputs while reducing hallucinations.
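
To make the flow concrete, here is a minimal sketch of retrieve-then-generate in Java. The VectorStore and LlmClient interfaces are illustrative placeholders standing in for whatever vector database and LLM client you actually use; they are not part of any specific library.

import java.util.List;

// Minimal sketch of the retrieve-then-generate flow.
// VectorStore and LlmClient are illustrative placeholders, not a specific library.
public class SimpleRag {

    interface VectorStore {
        List<String> search(String query, int topK); // returns the most similar chunks
    }

    interface LlmClient {
        String complete(String prompt);
    }

    private final VectorStore store;
    private final LlmClient llm;

    public SimpleRag(VectorStore store, LlmClient llm) {
        this.store = store;
        this.llm = llm;
    }

    public String answer(String question) {
        // 1. Retrieve the most relevant chunks from the knowledge base
        List<String> context = store.search(question, 5);

        // 2. Ground the LLM's answer in the retrieved context
        String prompt = "Answer using only this context:\n"
                + String.join("\n", context)
                + "\n\nQuestion: " + question;
        return llm.complete(prompt);
    }
}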

Code for this article is here

Why RAG Evaluation is Important

Evaluating RAG systems is crucial because:

  • It helps ensure the accuracy and reliability of AI applications
  • It verifies that the right documents are being retrieved from the knowledge base
  • It confirms that LLM responses are faithful to source documents
  • It helps identify and minimize hallucinations and incorrect information
  • It enables continuous optimization of the system's performance

RAG Evaluation Parameters

Core Retrieval Metrics

  1. Precision (✅)

    • Definition: How many retrieved documents are actually relevant
    • Formula: Precision = Relevant Retrieved / Total Retrieved
    • Use Case: Critical when wrong document retrieval is costly
    • Impact: Helps minimize irrelevant information in responses
  2. Recall (✅)

    • Definition: Proportion of relevant documents retrieved from all possible ones
    • Formula: Recall = Relevant Retrieved / All Relevant
    • Use Case: Important when missing important information is risky
    • Impact: Ensures comprehensive coverage of relevant information
  3. F1 Score

    • Definition: Harmonic mean of precision and recall
    • Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
    • Use Case: When you need a single comprehensive metric
    • Impact: Provides balanced evaluation of retrieval performance
  4. MRR (Mean Reciprocal Rank)

    • Definition: Average, over all queries, of the reciprocal rank of the first relevant document
    • Formula: MRR = average of (1 / rank of first relevant document) across queries
    • Use Case: Particularly important for Q&A systems
    • Impact: Ensures most relevant documents appear first
  5. nDCG (Normalized Discounted Cumulative Gain)

    • Definition: Quality of document ranking
    • Use Case: Especially relevant for vector DB + LLM pipelines
    • Impact: Measures how well the system ranks documents by relevance
  6. Hit Rate / Recall@k

    • Definition: Presence of relevant documents in top k results
    • Formula: (#Queries with ≥1 relevant doc in top k) / Total Queries
    • Use Case: Optimizing chunk sizes in RAG
    • Impact: Ensures relevant information is within the top results (a Java sketch computing these metrics follows this list)
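
To make these formulas concrete, here is a small, self-contained Java sketch that computes them for a single query. The document IDs and the binary-relevance nDCG are illustrative; in practice MRR and Hit Rate are averaged over many queries.

import java.util.List;
import java.util.Set;

// Illustrative metric calculations for one query; not tied to any evaluation library.
public class RetrievalMetrics {

    // Precision = relevant retrieved / total retrieved
    static double precision(List<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
    }

    // Recall = relevant retrieved / all relevant
    static double recall(List<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    // F1 = 2 * (Precision * Recall) / (Precision + Recall)
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Reciprocal rank of the first relevant document (averaged over queries, this is MRR)
    static double reciprocalRank(List<String> retrieved, Set<String> relevant) {
        for (int i = 0; i < retrieved.size(); i++) {
            if (relevant.contains(retrieved.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    // Hit@k = 1 if at least one relevant document appears in the top k
    static double hitAtK(List<String> retrieved, Set<String> relevant, int k) {
        return retrieved.stream().limit(k).anyMatch(relevant::contains) ? 1.0 : 0.0;
    }

    // Binary nDCG@k: DCG of the retrieved ranking divided by the ideal DCG
    static double ndcg(List<String> retrieved, Set<String> relevant, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, retrieved.size()); i++) {
            if (relevant.contains(retrieved.get(i))) dcg += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        double idcg = 0.0;
        for (int i = 0; i < Math.min(k, relevant.size()); i++) {
            idcg += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("doc3", "doc7", "doc1"); // ranked results
        Set<String> relevant = Set.of("doc1", "doc2");            // ground truth

        double p = precision(retrieved, relevant);
        double r = recall(retrieved, relevant);
        System.out.printf("Precision=%.3f Recall=%.3f F1=%.3f RR=%.3f Hit@3=%.1f nDCG@3=%.3f%n",
                p, r, f1(p, r),
                reciprocalRank(retrieved, relevant),
                hitAtK(retrieved, relevant, 3),
                ndcg(retrieved, relevant, 3));
    }
}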

Implementation Example: Document Storage

The following Java code demonstrates how documents are stored in the RAG system using an A2A agent:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// A2AAgent and GroundTruthData come from the article's project; their imports are omitted here.
public class StoreDocumentsWithA2A {

    private static final Logger log = LoggerFactory.getLogger(StoreDocumentsWithA2A.class);

    public static void main(String[] args) {
        // Connect to the locally running A2A/RAG server
        A2AAgent agent = new A2AAgent();
        agent.connect("http://localhost:7860");

        // Store all ground truth documents in RAG
        for (String instructions : GroundTruthData.GROUND_TRUTH_DOCS) {
            log.info("Storing instructions: {}", instructions);
            String response = agent.remoteMethodCall(
                "Store these instructions: " + instructions
            ).getTextResult();
            log.info("Agent response: {}", response);
        }
    }
}

This implementation:

  1. Creates an A2A agent instance
  2. Connects to the local RAG server
  3. Iterates through ground truth documents
  4. Stores each document with logging for tracking

Document Retrieval

Documents can be retrieved using a REST endpoint:

GET http://localhost:7860/getDocuments?documentText=dishwasher

This endpoint:

  • Accepts a query parameter documentText
  • Returns relevant documents based on semantic similarity
  • Uses vector embeddings for matching
  • Supports natural language queries (a small client sketch follows this list)
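
Assuming the endpoint above, the documents can be fetched from Java with the JDK's built-in HTTP client. The URL and the documentText parameter are taken from the example; the response format depends on the server, so the body is printed as-is.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class GetDocumentsClient {
    public static void main(String[] args) throws Exception {
        String query = "dishwasher";
        // Endpoint and parameter name as shown above
        String url = "http://localhost:7860/getDocuments?documentText="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("Status: " + response.statusCode());
        System.out.println("Documents: " + response.body());
    }
}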

RAG Evaluation Results

When evaluating the system with a query for "dishwasher", we obtained the following metrics:

RAG Metrics:
- Precision: 0.2 (20% of retrieved documents were relevant)
- Recall: 0.011 (1.1% of all relevant documents were retrieved)
- F1 Score: 0.021 (Combined precision/recall performance)
- MRR: 0.0 (No relevant documents in top positions)
- nDCG: 0.0 (Poor ranking quality)
- Hit Rate: 0.0 (No relevant documents in top-k results)
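
As a quick consistency check, the F1 score follows directly from the reported precision and recall: F1 = 2 * (0.2 * 0.011) / (0.2 + 0.011) ≈ 0.021.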

Analysis of Results:

  1. Low Precision (0.2)

    • Only 20% of retrieved documents were relevant
    • Indicates high noise in retrieval results
    • Suggests need for better filtering
  2. Very Low Recall (0.011)

    • System is missing 98.9% of relevant documents
    • Indicates potential issues with:
      • Document embedding quality
      • Similarity threshold settings
      • Query processing
  3. Poor F1 Score (0.021)

    • Confirms overall suboptimal performance
    • Shows need for system-wide improvements
  4. Zero Metrics (MRR, nDCG, Hit Rate)

    • Indicates serious ranking issues
    • Suggests need for:
      • Re-evaluation of embedding model
      • Adjustment of similarity thresholds
      • Review of document preprocessing
      • Optimization of ranking algorithm

Recommendations for Improvement:

  1. Embedding Quality

    • Consider using domain-adapted embedding models
    • Experiment with different embedding dimensions
  2. Retrieval Strategy

    • Implement hybrid retrieval (semantic + keyword); see the scoring sketch after this list
    • Adjust similarity thresholds
    • Consider using multiple retrieval stages
  3. Document Processing

    • Review document chunking strategy
    • Implement better text preprocessing
    • Consider adding metadata enrichment
  4. System Optimization

    • Fine-tune vector store parameters
    • Implement results reranking
    • Add relevance feedback mechanisms
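
As one way to act on the hybrid-retrieval and reranking recommendations, the sketch below blends a semantic-similarity score with a naive keyword-overlap score and re-sorts candidates by the combined value. The ScoredDoc record, the scoring functions, and the 0.6 weighting are all hypothetical and would need tuning against the real vector store.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical hybrid scoring: blends semantic similarity with keyword overlap.
public class HybridRetriever {

    record ScoredDoc(String id, String text, double semanticScore) {}

    // Fraction of query terms that appear in the document text (very naive keyword signal)
    static double keywordScore(String query, String text) {
        String lowered = text.toLowerCase();
        String[] terms = query.toLowerCase().split("\\s+");
        long matches = Arrays.stream(terms).filter(lowered::contains).count();
        return terms.length == 0 ? 0.0 : (double) matches / terms.length;
    }

    // Re-rank candidates by a weighted blend of the two signals
    static List<ScoredDoc> rerank(String query, List<ScoredDoc> candidates, double alpha) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (ScoredDoc d) -> alpha * d.semanticScore()
                                + (1 - alpha) * keywordScore(query, d.text()))
                        .reversed())
                .toList();
    }

    public static void main(String[] args) {
        List<ScoredDoc> candidates = List.of(
                new ScoredDoc("d1", "dishwasher installation guide", 0.62),
                new ScoredDoc("d2", "refrigerator troubleshooting", 0.71));

        // alpha controls how much weight the semantic score gets versus keywords
        rerank("dishwasher manual", candidates, 0.6)
                .forEach(d -> System.out.println(d.id() + " " + d.text()));
    }
}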

These results highlight the importance of continuous monitoring and iterative improvement in RAG systems. Regular evaluation using these metrics helps maintain system quality and guides optimization efforts.
