vishalmysore

RAG Evaluation in Java: A Comprehensive Guide

Quick Introduction to RAG

Retrieval-Augmented Generation (RAG) is a powerful approach that combines document retrieval with Large Language Models (LLMs). By first retrieving relevant documents from a knowledge base and then using them to inform the LLM's response, RAG systems ensure more accurate, context-aware, and factual outputs while reducing hallucinations.
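
To make the flow concrete, here is a minimal sketch of retrieve-then-generate in Java. The VectorStore and LlmClient interfaces are illustrative placeholders standing in for whatever vector database and LLM client you actually use; they are not part of any specific library.

import java.util.List;

// Minimal sketch of the retrieve-then-generate flow.
// VectorStore and LlmClient are illustrative placeholders, not a specific library.
public class SimpleRag {

    interface VectorStore {
        List<String> search(String query, int topK); // returns the most similar chunks
    }

    interface LlmClient {
        String complete(String prompt);
    }

    private final VectorStore store;
    private final LlmClient llm;

    public SimpleRag(VectorStore store, LlmClient llm) {
        this.store = store;
        this.llm = llm;
    }

    public String answer(String question) {
        // 1. Retrieve the most relevant chunks from the knowledge base
        List<String> context = store.search(question, 5);

        // 2. Ground the LLM's answer in the retrieved context
        String prompt = "Answer using only this context:\n"
                + String.join("\n", context)
                + "\n\nQuestion: " + question;
        return llm.complete(prompt);
    }
}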

Code for this article is here

Why RAG Evaluation is Important

Evaluating RAG systems is crucial because:

  • It helps ensure the accuracy and reliability of AI applications
  • It verifies that the right documents are being retrieved from the knowledge base
  • It confirms that LLM responses are faithful to source documents
  • It helps identify and minimize hallucinations and incorrect information
  • It enables continuous optimization of the system's performance

RAG Evaluation Parameters

Core Retrieval Metrics

  1. Precision (✅)

    • Definition: How many retrieved documents are actually relevant
    • Formula: Precision = Relevant Retrieved / Total Retrieved
    • Use Case: Critical when wrong document retrieval is costly
    • Impact: Helps minimize irrelevant information in responses
  2. Recall (✅)

    • Definition: Proportion of relevant documents retrieved from all possible ones
    • Formula: Recall = Relevant Retrieved / All Relevant
    • Use Case: Important when missing important information is risky
    • Impact: Ensures comprehensive coverage of relevant information
  3. F1 Score

    • Definition: Harmonic mean of precision and recall
    • Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
    • Use Case: When you need a single comprehensive metric
    • Impact: Provides balanced evaluation of retrieval performance
  4. MRR (Mean Reciprocal Rank)

    • Definition: Average, over all queries, of the reciprocal rank of the first relevant document
    • Formula: MRR = average of (1 / rank of first relevant document) across queries
    • Use Case: Particularly important for Q&A systems
    • Impact: Ensures most relevant documents appear first
  5. nDCG (Normalized Discounted Cumulative Gain)

    • Definition: Quality of document ranking
    • Use Case: Especially relevant for vector DB + LLM pipelines
    • Impact: Measures how well the system ranks documents by relevance
  6. Hit Rate / Recall@k

    • Definition: Presence of relevant documents in top k results
    • Formula: (#Queries with ≥1 relevant doc in top k) / Total Queries
    • Use Case: Optimizing chunk sizes in RAG
    • Impact: Ensures relevant information is within the top results (a Java sketch computing these metrics follows this list)
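
To make these formulas concrete, here is a small, self-contained Java sketch that computes them for a single query. The document IDs and the binary-relevance nDCG are illustrative; in practice MRR and Hit Rate are averaged over many queries.

import java.util.List;
import java.util.Set;

// Illustrative metric calculations for one query; not tied to any evaluation library.
public class RetrievalMetrics {

    // Precision = relevant retrieved / total retrieved
    static double precision(List<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
    }

    // Recall = relevant retrieved / all relevant
    static double recall(List<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    // F1 = 2 * (Precision * Recall) / (Precision + Recall)
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Reciprocal rank of the first relevant document (averaged over queries, this is MRR)
    static double reciprocalRank(List<String> retrieved, Set<String> relevant) {
        for (int i = 0; i < retrieved.size(); i++) {
            if (relevant.contains(retrieved.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    // Hit@k = 1 if at least one relevant document appears in the top k
    static double hitAtK(List<String> retrieved, Set<String> relevant, int k) {
        return retrieved.stream().limit(k).anyMatch(relevant::contains) ? 1.0 : 0.0;
    }

    // Binary nDCG@k: DCG of the retrieved ranking divided by the ideal DCG
    static double ndcg(List<String> retrieved, Set<String> relevant, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, retrieved.size()); i++) {
            if (relevant.contains(retrieved.get(i))) dcg += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        double idcg = 0.0;
        for (int i = 0; i < Math.min(k, relevant.size()); i++) {
            idcg += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("doc3", "doc7", "doc1"); // ranked results
        Set<String> relevant = Set.of("doc1", "doc2");            // ground truth

        double p = precision(retrieved, relevant);
        double r = recall(retrieved, relevant);
        System.out.printf("Precision=%.3f Recall=%.3f F1=%.3f RR=%.3f Hit@3=%.1f nDCG@3=%.3f%n",
                p, r, f1(p, r),
                reciprocalRank(retrieved, relevant),
                hitAtK(retrieved, relevant, 3),
                ndcg(retrieved, relevant, 3));
    }
}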

Implementation Example: Document Storage

The following Java code demonstrates how documents are stored in the RAG system using an A2A agent:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// A2AAgent and GroundTruthData come from the article's project; their imports are omitted here.
public class StoreDocumentsWithA2A {

    private static final Logger log = LoggerFactory.getLogger(StoreDocumentsWithA2A.class);

    public static void main(String[] args) {
        // Connect to the locally running A2A/RAG server
        A2AAgent agent = new A2AAgent();
        agent.connect("http://localhost:7860");

        // Store all ground truth documents in RAG
        for (String instructions : GroundTruthData.GROUND_TRUTH_DOCS) {
            log.info("Storing instructions: {}", instructions);
            String response = agent.remoteMethodCall(
                "Store these instructions: " + instructions
            ).getTextResult();
            log.info("Agent response: {}", response);
        }
    }
}

This implementation:

  1. Creates an A2A agent instance
  2. Connects to the local RAG server
  3. Iterates through ground truth documents
  4. Stores each document with logging for tracking

Document Retrieval

Documents can be retrieved using a REST endpoint:

GET http://localhost:7860/getDocuments?documentText=dishwasher

This endpoint:

  • Accepts a query parameter documentText
  • Returns relevant documents based on semantic similarity
  • Uses vector embeddings for matching
  • Supports natural language queries (a small client sketch follows this list)
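
Assuming the endpoint above, the documents can be fetched from Java with the JDK's built-in HTTP client. The URL and the documentText parameter are taken from the example; the response format depends on the server, so the body is printed as-is.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class GetDocumentsClient {
    public static void main(String[] args) throws Exception {
        String query = "dishwasher";
        // Endpoint and parameter name as shown above
        String url = "http://localhost:7860/getDocuments?documentText="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("Status: " + response.statusCode());
        System.out.println("Documents: " + response.body());
    }
}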

RAG Evaluation Results

When evaluating the system with a query for "dishwasher", we obtained the following metrics:

RAG Metrics:
- Precision: 0.2 (20% of retrieved documents were relevant)
- Recall: 0.011 (1.1% of all relevant documents were retrieved)
- F1 Score: 0.021 (Combined precision/recall performance)
- MRR: 0.0 (No relevant documents in top positions)
- nDCG: 0.0 (Poor ranking quality)
- Hit Rate: 0.0 (No relevant documents in top-k results)
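
As a quick consistency check, the F1 score follows directly from the reported precision and recall: F1 = 2 * (0.2 * 0.011) / (0.2 + 0.011) ≈ 0.021.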

Analysis of Results:

  1. Low Precision (0.2)

    • Only 20% of retrieved documents were relevant
    • Indicates high noise in retrieval results
    • Suggests need for better filtering
  2. Very Low Recall (0.011)

    • System is missing 98.9% of relevant documents
    • Indicates potential issues with:
      • Document embedding quality
      • Similarity threshold settings
      • Query processing
  3. Poor F1 Score (0.021)

    • Confirms overall suboptimal performance
    • Shows need for system-wide improvements
  4. Zero Metrics (MRR, nDCG, Hit Rate)

    • Indicates serious ranking issues
    • Suggests need for:
      • Re-evaluation of embedding model
      • Adjustment of similarity thresholds
      • Review of document preprocessing
      • Optimization of ranking algorithm

Recommendations for Improvement:

  1. Embedding Quality

    • Consider using domain-adapted embedding models
    • Experiment with different embedding dimensions
  2. Retrieval Strategy

    • Implement hybrid retrieval (semantic + keyword); see the scoring sketch after this list
    • Adjust similarity thresholds
    • Consider using multiple retrieval stages
  3. Document Processing

    • Review document chunking strategy
    • Implement better text preprocessing
    • Consider adding metadata enrichment
  4. System Optimization

    • Fine-tune vector store parameters
    • Implement results reranking
    • Add relevance feedback mechanisms
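
As one way to act on the hybrid-retrieval and reranking recommendations, the sketch below blends a semantic-similarity score with a naive keyword-overlap score and re-sorts candidates by the combined value. The ScoredDoc record, the scoring functions, and the 0.6 weighting are all hypothetical and would need tuning against the real vector store.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical hybrid scoring: blends semantic similarity with keyword overlap.
public class HybridRetriever {

    record ScoredDoc(String id, String text, double semanticScore) {}

    // Fraction of query terms that appear in the document text (very naive keyword signal)
    static double keywordScore(String query, String text) {
        String lowered = text.toLowerCase();
        String[] terms = query.toLowerCase().split("\\s+");
        long matches = Arrays.stream(terms).filter(lowered::contains).count();
        return terms.length == 0 ? 0.0 : (double) matches / terms.length;
    }

    // Re-rank candidates by a weighted blend of the two signals
    static List<ScoredDoc> rerank(String query, List<ScoredDoc> candidates, double alpha) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (ScoredDoc d) -> alpha * d.semanticScore()
                                + (1 - alpha) * keywordScore(query, d.text()))
                        .reversed())
                .toList();
    }

    public static void main(String[] args) {
        List<ScoredDoc> candidates = List.of(
                new ScoredDoc("d1", "dishwasher installation guide", 0.62),
                new ScoredDoc("d2", "refrigerator troubleshooting", 0.71));

        // alpha controls how much weight the semantic score gets versus keywords
        rerank("dishwasher manual", candidates, 0.6)
                .forEach(d -> System.out.println(d.id() + " " + d.text()));
    }
}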

These results highlight the importance of continuous monitoring and iterative improvement in RAG systems. Regular evaluation using these metrics helps maintain system quality and guides optimization efforts.
