Mukesh Z
I Built a Baseline RAG System — Then Measured Where It Actually Breaks

Most RAG demos stop at:

“Look, it answers correctly.”

I wanted to go further.

Instead of building a flashy Retrieval-Augmented Generation system, I built a baseline RAG architecture and focused heavily on evaluation:

  • Context adherence
  • Context precision
  • Answer relevance
  • Groundedness

This post walks through:

  • The architecture
  • The dataset
  • The evaluation framework
  • The real failure modes
  • And what I’d fix next

🧠 The Goal

Build and evaluate a structured RAG system that:

  1. Extracts and chunks PDFs
  2. Creates a vector retrieval layer
  3. Generates grounded answers
  4. Evaluates answers using LLM-as-Judge
  5. Produces measurable metrics

This was not about "chatbot performance".

It was about architectural clarity + measurable quality.


🏗 Architecture Overview

```
PDFs → Chunking → Embeddings → FAISS → Retrieval
Retrieval → Context + Question → gpt-4o-mini → Answer
Answer → LLM-as-Judge → Evaluation Metrics
```

Stack used:

  • LangChain
  • FAISS (locally persisted)
  • sentence-transformers/all-MiniLM-L6-v2
  • gpt-4o-mini
  • Windows local environment

Simple. Reproducible. Baseline-first.


📂 Dataset

I intentionally used complex, table-heavy documents:

| Source | Type |
| --- | --- |
| NVIDIA 10-K | Financial |
| Microsoft 10-K | Financial + Business |
| AWS Well-Architected Framework | Cloud Architecture |

Total PDFs: 3
Chunk size: 1000
Overlap: 200


🪓 Chunking Strategy

  • Recursive character splitting
  • Chunk size: 1000
  • Overlap: 200

Why 1000?

To reduce embedding cost and maintain context continuity.
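As a rough illustration, a sliding-window splitter with these parameters behaves like the sketch below. This is a simplification: LangChain's `RecursiveCharacterTextSplitter` additionally tries to break on paragraph, sentence, and word boundaries before hard-cutting.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Simplified sliding-window chunker (illustrative, not the exact
    RecursiveCharacterTextSplitter algorithm, which prefers natural
    boundaries over hard character cuts)."""
    step = chunk_size - overlap  # each new chunk starts 800 chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

At 1000 characters, a single chunk can easily span several rows and columns of a financial table — which is exactly where the precision loss below came from.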

What happened?

Precision dropped.

Financial documents contain large multi-column tables. Large chunks diluted retrieval precision.

Lesson:

Bigger chunks ≠ better RAG.


🔎 Retrieval Layer

Embedding model

```
sentence-transformers/all-MiniLM-L6-v2
```

Chosen because:

  • Fast
  • Strong semantic baseline
  • Lightweight for local experiments

**Vector store**

```
FAISS (local persistent index)
```
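Conceptually, retrieval against a flat FAISS index is nearest-neighbor search over embedding vectors. Here is a dependency-free sketch of that idea (cosine-similarity top-k); FAISS accelerates exactly this search at scale, and MiniLM supplies the real vectors:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """index: list of (chunk_text, embedding) pairs. Returns the k
    chunks whose embeddings are closest to the query embedding."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```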

✨ Answer Generation

Model:

```
gpt-4o-mini
```

Prompt strategy:

  • Strictly answer from context
  • Avoid hallucination
  • Say “I don’t know” if answer absent

This conservative approach reduced hallucination — but introduced new behavior (we’ll get to that).
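A minimal version of this kind of grounded prompt looks something like the following — a paraphrase of the strategy, not my exact wording:

```python
def build_prompt(context: str, question: str) -> str:
    # Instruct the model to stay inside the retrieved context and to
    # abstain rather than guess — the source of the "I don't know"
    # behavior discussed later in this post.
    return (
        "Answer the question using ONLY the context below.\n"
        "If the answer is not in the context, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```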


📊 Evaluation Framework (LLM-as-Judge)

I evaluated 20 questions across documents.

Each answer was scored on:

  1. Context Adherence
  2. Context Precision
  3. Answer Relevance
  4. Groundedness

This separation is critical.

Most RAG systems fail because teams don’t know where the failure happens:

  1. Retrieval?
  2. Generation?
  3. Alignment?
  4. Table parsing?
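The judge loop itself is simple: build a rubric prompt per metric, send it to the judge model, and parse a numeric score from the reply. The sketch below is illustrative (the rubric wording and JSON format are assumptions, not my exact implementation); the model call itself is omitted.

```python
import json

# One rubric question per metric — hypothetical phrasings for illustration.
RUBRIC = {
    "context_adherence": "Does the answer use only facts present in the context?",
    "context_precision": "What fraction of the retrieved context was actually needed?",
    "answer_relevance": "Does the answer address the question asked?",
    "groundedness": "Is every claim in the answer traceable to the context?",
}

def judge_prompt(metric: str, question: str, context: str, answer: str) -> str:
    """Format one evaluation request for the judge model."""
    return (
        f"You are an evaluator. Metric: {metric}. {RUBRIC[metric]}\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        'Reply with JSON: {"score": <float between 0 and 1>}'
    )

def parse_score(judge_reply: str) -> float:
    """Extract the score and clamp it, defending against judge drift."""
    score = float(json.loads(judge_reply)["score"])
    return min(max(score, 0.0), 1.0)
```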

📈 Results Summary

From 20 evaluated questions:

  • Context Adherence: ~76%
  • Context Precision: ~0.48 average
  • Answer Relevance: ~0.74
  • Groundedness: High (except temporal mismatch cases)

Overall maturity:
7.5 / 10 Baseline RAG
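The summary numbers above are just per-metric averages over the 20 judged questions, along these lines:

```python
def summarize(results: list[dict]) -> dict:
    """Average each judge metric across all evaluated questions."""
    metrics = results[0].keys()
    return {m: sum(r[m] for r in results) / len(results) for m in metrics}
```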


🔎 What Actually Broke?

This is where things get interesting.

1️⃣ Temporal Misalignment (High Risk)

Example:

The system extracted an operating income value from the wrong fiscal year column.

The answer:

  • Looked correct
  • Existed in context
  • Was grounded

But belonged to the wrong year.

This is dangerous.

Financial tables with multiple years introduce alignment risk that naive RAG systems fail to detect.


2️⃣ “I Don’t Know” Even When Context Exists

Several cases where:

  • Context contained the answer
  • Model still said: “I don’t know”

Likely causes:

  • Chunk too large
  • Table parsing ambiguity
  • Conservative prompt

This is not hallucination.

This is extraction hesitation.

3️⃣ Low Context Precision

Many correct answers had low precision scores because:

  • Chunk size = 1000
  • Financial tables = noisy

The answer was present, but buried inside large irrelevant context.
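One way to see this precision loss concretely: score each retrieved chunk for relevance and take the fraction that was actually needed. In the toy version below a substring check stands in for the judge's relevance call — enough to show why one useful chunk buried among noisy ones scores 0.25, not 1.0.

```python
def context_precision(retrieved_chunks: list[str], needed: str) -> float:
    """Fraction of retrieved chunks containing the needed evidence.

    A plain substring check stands in for the LLM judge here —
    illustrative only, not the judge's actual scoring logic.
    """
    if not retrieved_chunks:
        return 0.0
    hits = sum(needed in chunk for chunk in retrieved_chunks)
    return hits / len(retrieved_chunks)
```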


🧠 Key Insight

Most RAG failures are not hallucinations.

They are:

  • Retrieval precision failures
  • Column alignment failures
  • Temporal reasoning failures
  • Overly conservative generation

Evaluation-first design makes these visible.

Without metrics, you’d never see this.


🚀 What I Would Improve

  1. Reduce chunk size to 600–800
  2. Increase overlap to maintain continuity
  3. Add year-alignment guardrail in prompt
  4. Add table-aware extraction logic
  5. Add reranker (hybrid retrieval or cross-encoder)
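Of these, the year-alignment guardrail is the cheapest to add. One possible shape — a sketch, not a battle-tested implementation — is to compare fiscal years mentioned in the answer against those in the question before returning it:

```python
import re

YEAR = re.compile(r"\b(19|20)\d{2}\b")

def years_in(text: str) -> set[str]:
    """All four-digit years (19xx/20xx) mentioned in the text."""
    return {m.group(0) for m in YEAR.finditer(text)}

def year_mismatch(question: str, answer: str) -> bool:
    """Flag answers that cite a fiscal year the question never asked about —
    the temporal-misalignment failure mode seen above."""
    asked, answered = years_in(question), years_in(answer)
    return bool(asked) and bool(answered) and not (answered <= asked)
```

A flagged answer could then be regenerated with an explicit "use only fiscal year X" instruction, or surfaced with a warning instead of silently returned.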

Baseline RAG works.

Architected RAG works better.


🏁 Why This Project Matters

There’s a difference between:

“RAG that answers”
and
“RAG that can be trusted”

This experiment focused on trust:

  • Measuring grounding
  • Detecting temporal misalignment
  • Identifying precision loss
  • Structuring evaluation signals

📌 Final Rating

| Category | Rating |
| --- | --- |
| Retrieval | ⭐⭐⭐⭐☆ |
| Generation | ⭐⭐⭐⭐☆ |
| Grounding | ⭐⭐⭐⭐☆ |
| Precision | ⭐⭐⭐☆☆ |
| Temporal Robustness | ⭐⭐☆☆☆ |

Baseline: Strong
Production-ready: Not yet


If you're building RAG systems, I strongly recommend:

  • Separate retrieval metrics from generation metrics
  • Always test on table-heavy documents
  • Measure groundedness independently
  • Add temporal alignment checks

RAG is easy to build.

Reliable RAG is engineering.

That’s the difference between demo-level AI and production-level AI.

GitHub