Most RAG demos stop at:
“Look, it answers correctly.”
I wanted to go further.
Instead of building a flashy Retrieval-Augmented Generation system, I built a baseline RAG architecture and focused heavily on evaluation:
- Context adherence
- Context precision
- Answer relevance
- Groundedness
This post walks through:
- The architecture
- The dataset
- The evaluation framework
- The real failure modes
- And what I’d fix next
🧠 The Goal
Build and evaluate a structured RAG system that:
- Extracts and chunks PDFs
- Creates a vector retrieval layer
- Generates grounded answers
- Evaluates answers using LLM-as-Judge
- Produces measurable metrics
This was not about "chatbot performance".
It was about architectural clarity + measurable quality.
🏗 Architecture Overview
PDFs → Chunking → Embeddings → FAISS → Retrieval
Retrieval → Context + Question → gpt-4o-mini → Answer
Answer → LLM-as-Judge → Evaluation Metrics
Stack used:
- LangChain
- FAISS (locally persisted)
- sentence-transformers/all-MiniLM-L6-v2
- gpt-4o-mini
- Windows local environment
Simple. Reproducible. Baseline-first.
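The three-stage flow above can be wired end to end as a toy sketch. Here `retrieve`, `generate`, and `judge` are hypothetical stand-ins (keyword overlap instead of MiniLM + FAISS, an echo instead of gpt-4o-mini, a trivial check instead of the LLM judge), not the project's actual code:

```python
def retrieve(question: str, chunks: list[str]) -> str:
    # Stand-in for MiniLM embeddings + FAISS: naive keyword overlap.
    words = set(question.lower().replace("?", "").split())
    return max(chunks, key=lambda c: len(words & set(c.lower().split())))

def generate(context: str, question: str) -> str:
    # Stand-in for gpt-4o-mini: just echoes the retrieved context.
    return f"Based on the context: {context}"

def judge(question: str, context: str, answer: str) -> dict:
    # Stand-in for LLM-as-Judge: trivially checks grounding.
    return {"groundedness": 1.0 if context in answer else 0.0}

def rag_pipeline(question: str, chunks: list[str]) -> dict:
    context = retrieve(question, chunks)       # chunks → retrieval
    answer = generate(context, question)       # context + question → answer
    scores = judge(question, context, answer)  # answer → evaluation metrics
    return {"answer": answer, "scores": scores}
```

Each stand-in is swappable for the real component without changing the pipeline shape, which is exactly why a baseline-first architecture is easy to evaluate.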
📂 Dataset
I intentionally used complex, table-heavy documents:
| Source | Type |
|---|---|
| NVIDIA 10-K | Financial |
| Microsoft 10-K | Financial + Business |
| AWS Well-Architected Framework | Cloud Architecture |
Total PDFs: 3
🪓 Chunking Strategy
- Recursive character splitting
- Chunk size: 1000
- Overlap: 200
Why 1000?
To reduce embedding cost and maintain context continuity.
What happened?
Precision dropped.
Financial documents contain large multi-column tables. Large chunks diluted retrieval precision.
Lesson:
Bigger chunks ≠ better RAG.
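The size/overlap mechanics above can be sketched with a simplified sliding-window chunker. LangChain's `RecursiveCharacterTextSplitter` additionally prefers paragraph and sentence boundaries before falling back to raw characters, but the arithmetic is the same:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Sliding-window chunking: consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With 1000/200, each window advances 800 characters, so content near a chunk boundary still appears whole in the next chunk — the continuity benefit that motivated the large size, and also the reason a single chunk can swallow most of a multi-column table.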
🔎 Retrieval Layer
Embedding model
sentence-transformers/all-MiniLM-L6-v2
Chosen because:
- Fast
- Strong semantic baseline
- Lightweight for local experiments
Vector store
FAISS (local persistent index)
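Conceptually, a flat FAISS index over normalized embeddings is exact nearest-neighbor search by cosine/inner-product similarity. A pure-Python stand-in (toy 2-D vectors in place of MiniLM's 384-dimensional embeddings):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """index: list of (chunk_id, vector) pairs. A FAISS flat index over
    normalized vectors performs this same exact search, just faster."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```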
✨ Answer Generation
Model:
gpt-4o-mini
Prompt strategy:
- Strictly answer from context
- Avoid hallucination
- Say “I don’t know” if answer absent
This conservative approach reduced hallucination — but introduced new behavior (we’ll get to that).
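A hypothetical reconstruction of that conservative prompt (the exact wording used in the project may differ):

```python
def build_prompt(context: str, question: str) -> str:
    # Conservative grounding prompt: answer only from the retrieved
    # context, and refuse when the answer is absent.
    return (
        "Answer the question using ONLY the context below.\n"
        "Do not use outside knowledge.\n"
        'If the answer is not in the context, reply exactly: "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Stricter refusal instructions trade hallucination for hesitation: the model refuses more often, including on questions the context could have answered.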
📊 Evaluation Framework (LLM-as-Judge)
I evaluated 20 questions across documents.
Each answer was scored on:
- Context Adherence
- Context Precision
- Answer Relevance
- Groundedness
This separation is critical.
Most RAG systems fail because teams don’t know where the failure happens:
- Retrieval?
- Generation?
- Alignment?
- Table parsing?
From 20 evaluated questions:
- Context Adherence: ~76%
- Context Precision: ~0.48 average
- Answer Relevance: ~0.74
- Groundedness: High (except temporal mismatch cases)
Overall maturity:
7.5 / 10 Baseline RAG
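The judge side of this framework can be sketched as a JSON scoring contract plus aggregation. The instruction text and key names here are illustrative assumptions, not the project's actual judge prompt:

```python
import json
import statistics

JUDGE_INSTRUCTIONS = (
    "Score the answer against the retrieved context. "
    "Return JSON with keys: context_adherence, context_precision, "
    "answer_relevance, groundedness, each in [0, 1]."
)

def parse_judge(raw: str) -> dict[str, float]:
    """Validate one judge response against the expected schema."""
    scores = json.loads(raw)
    expected = {"context_adherence", "context_precision",
                "answer_relevance", "groundedness"}
    assert set(scores) == expected, f"unexpected keys: {set(scores)}"
    return scores

def aggregate(all_scores: list[dict[str, float]]) -> dict[str, float]:
    """Mean of each metric across all evaluated questions."""
    return {k: statistics.mean(s[k] for s in all_scores) for k in all_scores[0]}
```

Keeping the four metrics as separate keys is what makes the failure localization below possible: a low `context_precision` with a high `groundedness` points at retrieval, not generation.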
🔎 What Actually Broke?
This is where things get interesting.
1️⃣ Temporal Misalignment (High Risk)
Example:
The system extracted an operating income value from the wrong fiscal year column.
The answer:
- Looked correct
- Existed in context
- Was grounded
But belonged to the wrong year.
This is dangerous.
Financial tables with multiple years introduce alignment risk that naive RAG systems fail to detect.
2️⃣ “I Don’t Know” Even When Context Exists
Several cases where:
- Context contained the answer
- Model still said: “I don’t know”
Likely causes:
- Chunk too large
- Table parsing ambiguity
- Conservative prompt
This is not hallucination.
This is extraction hesitation.
3️⃣ Low Context Precision
Many correct answers had low precision scores because:
- Chunk size = 1000
- Financial tables = noisy
The answer was present, but buried inside large irrelevant context.
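One common way to compute this metric — the fraction of retrieved chunks that are actually relevant — makes the "present but buried" pattern concrete. (The judge in this post may weight ranks differently; this is the unweighted definition.)

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunk IDs that are relevant to the question.
    Unweighted definition; rank-aware variants also exist."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)
```

Retrieving four large chunks where only one contains the answer yields 0.25 — a correct answer with poor precision, which is exactly the ~0.48 average reported above.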
🧠 Key Insight
Most RAG failures are not hallucinations.
They are:
- Retrieval precision failures
- Column alignment failures
- Temporal reasoning failures
- Overly conservative generation
Evaluation-first design makes these visible.
Without metrics, you’d never see this.
🚀 What I Would Improve
- Reduce chunk size to 600–800
- Increase overlap to maintain continuity
- Add year-alignment guardrail in prompt
- Add table-aware extraction logic
- Add reranker (hybrid retrieval or cross-encoder)
Baseline RAG works.
Architected RAG works better.
🏁 Why This Project Matters
There’s a difference between:
“RAG that answers”
and
“RAG that can be trusted”
This experiment focused on trust:
- Measuring grounding
- Detecting temporal misalignment
- Identifying precision loss
- Structuring evaluation signals
📌 Final Rating
| Category | Rating |
|---|---|
| Retrieval | ⭐⭐⭐⭐☆ |
| Generation | ⭐⭐⭐⭐☆ |
| Grounding | ⭐⭐⭐⭐☆ |
| Precision | ⭐⭐⭐☆☆ |
| Temporal Robustness | ⭐⭐☆☆☆ |
Baseline: Strong
Production-ready: Not yet
If you're building RAG systems, I strongly recommend:
- Separate retrieval metrics from generation metrics
- Always test on table-heavy documents
- Measure groundedness independently
- Add temporal alignment checks
RAG is easy to build.
Reliable RAG is engineering.
That’s the difference between demo-level AI and production-level AI.