Quick Introduction to RAG
Retrieval-Augmented Generation (RAG) is a powerful approach that combines document retrieval with Large Language Models (LLMs). By first retrieving relevant documents from a knowledge base and then using them to inform the LLM's response, RAG systems produce more accurate, context-aware, and factual outputs while reducing hallucinations.
Code for this article is here
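The basic flow can be pictured in a few lines of Java. This is only a sketch of the retrieve-then-generate pattern; the Retriever and LlmClient interfaces below are illustrative placeholders and are not part of this article's code:

import java.util.List;

interface Retriever {
    // hypothetical: return the topK documents most similar to the query
    List<String> retrieve(String query, int topK);
}

interface LlmClient {
    // hypothetical: return the model's completion for a prompt
    String complete(String prompt);
}

public class RagSketch {
    public static String answer(Retriever retriever, LlmClient llm, String question) {
        // 1. Retrieve the documents most relevant to the question
        List<String> context = retriever.retrieve(question, 5);
        // 2. Ground the LLM's answer in the retrieved context
        String prompt = "Answer using only this context:\n"
                + String.join("\n", context)
                + "\nQuestion: " + question;
        return llm.complete(prompt);
    }
}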
Why RAG Evaluation is Important
Evaluating RAG systems is crucial because:
- It helps ensure the accuracy and reliability of AI applications
- It verifies that the right documents are being retrieved from the knowledge base
- It confirms that LLM responses are faithful to source documents
- It helps identify and minimize hallucinations and incorrect information
- It enables continuous optimization of the system's performance
RAG Evaluation Parameters
Core Retrieval Metrics
- Precision (✅)
  - Definition: How many retrieved documents are actually relevant
  - Formula: Precision = Relevant Retrieved / Total Retrieved
  - Use Case: Critical when wrong document retrieval is costly
  - Impact: Helps minimize irrelevant information in responses
- Recall (✅)
  - Definition: Proportion of relevant documents retrieved out of all relevant documents in the knowledge base
  - Formula: Recall = Relevant Retrieved / All Relevant
  - Use Case: Important when missing important information is risky
  - Impact: Ensures comprehensive coverage of relevant information
- F1 Score
  - Definition: Harmonic mean of precision and recall
  - Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
  - Use Case: When you need a single comprehensive metric
  - Impact: Provides balanced evaluation of retrieval performance
- MRR (Mean Reciprocal Rank)
  - Definition: Average reciprocal rank of the first relevant document across queries
  - Formula: MRR = average over queries of 1 / (rank of first relevant document)
  - Use Case: Particularly important for Q&A systems
  - Impact: Ensures the most relevant documents appear first
- nDCG (Normalized Discounted Cumulative Gain)
  - Definition: Quality of the document ranking, with relevant documents near the top weighted more heavily
  - Use Case: Especially relevant for vector DB + LLM pipelines
  - Impact: Measures how well the system ranks documents by relevance
- Hit Rate / Recall@k
  - Definition: Whether at least one relevant document appears in the top k results
  - Formula: Hit Rate = (#Queries with ≥1 relevant doc in top k) / Total Queries
  - Use Case: Optimizing chunk sizes in RAG
  - Impact: Ensures relevant information is within the top results (all of these metrics are computed in the sketch that follows this list)
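These formulas are straightforward to implement directly. The sketch below computes them for a single query with binary relevance judgments; the class and method names are my own, not taken from the article's code, and MRR and Hit Rate would be averaged over all evaluation queries.

import java.util.List;
import java.util.Set;

public class RetrievalMetrics {

    // Precision = relevant retrieved / total retrieved
    static double precision(List<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
    }

    // Recall = relevant retrieved / all relevant
    static double recall(List<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    // F1 = harmonic mean of precision and recall
    static double f1(double precision, double recall) {
        return (precision + recall) == 0.0 ? 0.0
                : 2 * precision * recall / (precision + recall);
    }

    // Reciprocal rank of the first relevant document (0 if none was retrieved);
    // MRR is this value averaged over all queries
    static double reciprocalRank(List<String> retrieved, Set<String> relevant) {
        for (int i = 0; i < retrieved.size(); i++) {
            if (relevant.contains(retrieved.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    // Binary-relevance nDCG@k: DCG of the actual ranking divided by the DCG of an ideal ranking
    static double ndcgAtK(List<String> retrieved, Set<String> relevant, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, retrieved.size()); i++) {
            if (relevant.contains(retrieved.get(i))) {
                dcg += 1.0 / (Math.log(i + 2) / Math.log(2)); // 1 / log2(rank + 1)
            }
        }
        double idcg = 0.0;
        for (int i = 0; i < Math.min(k, relevant.size()); i++) {
            idcg += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    // Hit@k for one query: 1 if any relevant document is in the top k, else 0;
    // Hit Rate is this value averaged over all queries
    static double hitAtK(List<String> retrieved, Set<String> relevant, int k) {
        return retrieved.stream().limit(k).anyMatch(relevant::contains) ? 1.0 : 0.0;
    }
}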
Implementation Example: Document Storage
The following Java code demonstrates how documents are stored in the RAG system using an A2A agent:
public class StoreDocumentsWithA2A {

    // `log` is assumed to be an SLF4J-style logger (for example via Lombok's @Slf4j);
    // A2AAgent and GroundTruthData come from the article's project.
    public static void main(String[] args) {
        // Connect to the locally running A2A agent
        A2AAgent agent = new A2AAgent();
        agent.connect("http://localhost:7860");

        // Store all ground truth documents in RAG
        for (String instructions : GroundTruthData.GROUND_TRUTH_DOCS) {
            log.info("Storing instructions: {}", instructions);
            String response = agent.remoteMethodCall(
                    "Store these instructions: " + instructions
            ).getTextResult();
            log.info("Agent response: {}", response);
        }
    }
}
This implementation:
- Creates an A2A agent instance
- Connects to the local RAG server
- Iterates through ground truth documents
- Stores each document with logging for tracking
Document Retrieval
Documents can be retrieved using a REST endpoint (a small client-side sketch follows the list below):
GET http://localhost:7860/getDocuments?documentText=dishwasher
This endpoint:
- Accepts a documentText query parameter
- Returns relevant documents based on semantic similarity
- Uses vector embeddings for matching
- Supports natural language queries
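For reference, calling this endpoint from Java could look like the sketch below. It uses the JDK's built-in HttpClient (Java 11+); the query value is the article's "dishwasher" example, and the response body is simply printed since its exact format depends on the server.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class GetDocumentsClient {
    public static void main(String[] args) throws Exception {
        // URL-encode the natural language query before appending it to the endpoint
        String query = URLEncoder.encode("dishwasher", StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:7860/getDocuments?documentText=" + query))
                .GET()
                .build();
        // Send the request and print the retrieved documents
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}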
RAG Evaluation Results
When evaluating the system with a query for "dishwasher", we obtained the following metrics:
RAG Metrics:
- Precision: 0.2 (20% of retrieved documents were relevant)
- Recall: 0.011 (1.1% of all relevant documents were retrieved)
- F1 Score: 0.021 (Combined precision/recall performance)
- MRR: 0.0 (No relevant documents in top positions)
- nDCG: 0.0 (Poor ranking quality)
- Hit Rate: 0.0 (No relevant documents in top-k results)
Analysis of Results:
- Low Precision (0.2)
  - Only 20% of retrieved documents were relevant
  - Indicates high noise in retrieval results
  - Suggests need for better filtering
- Very Low Recall (0.011)
  - System is missing 98.9% of relevant documents
  - Indicates potential issues with:
    - Document embedding quality
    - Similarity threshold settings
    - Query processing
- Poor F1 Score (0.021)
  - Confirms overall suboptimal performance
  - Shows need for system-wide improvements
- Zero Metrics (MRR, nDCG, Hit Rate)
  - Indicates serious ranking issues
  - Suggests need for:
    - Re-evaluation of embedding model
    - Adjustment of similarity thresholds
    - Review of document preprocessing
    - Optimization of ranking algorithm
Recommendations for Improvement:
- Embedding Quality
  - Consider using domain-adapted embedding models
  - Experiment with different embedding dimensions
- Retrieval Strategy (a small sketch of hybrid scoring follows these recommendations)
  - Implement hybrid retrieval (semantic + keyword)
  - Adjust similarity thresholds
  - Consider using multiple retrieval stages
- Document Processing
  - Review document chunking strategy
  - Implement better text preprocessing
  - Consider adding metadata enrichment
- System Optimization
  - Fine-tune vector store parameters
  - Implement results reranking
  - Add relevance feedback mechanisms
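As one concrete illustration of the hybrid retrieval suggestion above, a simple approach is a weighted blend of a vector-similarity score and a keyword-overlap score. The sketch below is illustrative only: the VectorStore interface is a stand-in rather than part of the article's project, and a production system would typically use BM25 or a full-text index for the keyword side.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HybridScorer {

    interface VectorStore {
        // hypothetical: cosine similarity between query and document embeddings, in [0, 1]
        double similarity(String query, String document);
    }

    // Fraction of query terms that appear in the document (a crude keyword signal)
    static double keywordScore(String query, String document) {
        Set<String> docTerms = new HashSet<>(Arrays.asList(document.toLowerCase().split("\\W+")));
        String[] queryTerms = query.toLowerCase().split("\\W+");
        long matches = Arrays.stream(queryTerms).filter(docTerms::contains).count();
        return queryTerms.length == 0 ? 0.0 : (double) matches / queryTerms.length;
    }

    // Weighted blend of the two signals; alpha would be tuned against evaluation metrics
    static double hybridScore(VectorStore store, String query, String document, double alpha) {
        return alpha * store.similarity(query, document)
                + (1 - alpha) * keywordScore(query, document);
    }
}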
These results highlight the importance of continuous monitoring and iterative improvement in RAG systems. Regular evaluation using these metrics helps maintain system quality and guides optimization efforts.