Mukesh Z
I Built a Baseline RAG System — Then Measured Where It Actually Breaks

Most RAG demos stop at:

“Look, it answers correctly.”

I wanted to go further.

Instead of building a flashy Retrieval-Augmented Generation system, I built a baseline RAG architecture and focused heavily on evaluation:

  • Context adherence
  • Context precision
  • Answer relevance
  • Groundedness

This post walks through:

  • The architecture
  • The dataset
  • The evaluation framework
  • The real failure modes
  • And what I’d fix next

🧠 The Goal

Build and evaluate a structured RAG system that:

  1. Extracts and chunks PDFs
  2. Creates a vector retrieval layer
  3. Generates grounded answers
  4. Evaluates answers using LLM-as-Judge
  5. Produces measurable metrics

This was not about "chatbot performance".

It was about architectural clarity + measurable quality.


🏗 Architecture Overview

```
PDFs → Chunking → Embeddings → FAISS → Retrieval
Retrieval → Context + Question → gpt-4o-mini → Answer
Answer → LLM-as-Judge → Evaluation Metrics
```

Stack used:

  • LangChain
  • FAISS (locally persisted)
  • sentence-transformers/all-MiniLM-L6-v2
  • gpt-4o-mini
  • Windows local environment

Simple. Reproducible. Baseline-first.


📂 Dataset

I intentionally used complex, table-heavy documents:

| Source | Type |
| --- | --- |
| NVIDIA 10-K | Financial |
| Microsoft 10-K | Financial + Business |
| AWS Well-Architected Framework | Cloud Architecture |

Total PDFs: 3
Chunk size: 1000
Overlap: 200


🪓 Chunking Strategy

  • Recursive character splitting
  • Chunk size: 1000
  • Overlap: 200

Why 1000?

To reduce embedding cost and maintain context continuity.
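As a rough illustration, a sliding-window splitter with these parameters behaves like the sketch below. This is a simplification: LangChain's `RecursiveCharacterTextSplitter` additionally tries to break on paragraph, sentence, and word boundaries before hard-cutting.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Simplified sliding-window chunker (illustrative, not the exact
    RecursiveCharacterTextSplitter algorithm, which prefers natural
    boundaries over hard character cuts)."""
    step = chunk_size - overlap  # each new chunk starts 800 chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

At 1000 characters, a single chunk can easily span several rows and columns of a financial table — which is exactly where the precision loss below came from.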

What happened?

Precision dropped.

Financial documents contain large multi-column tables. Large chunks diluted retrieval precision.

Lesson:

Bigger chunks ≠ better RAG.


🔎 Retrieval Layer

Embedding model

```
sentence-transformers/all-MiniLM-L6-v2
```

Chosen because:

  • Fast
  • Strong semantic baseline
  • Lightweight for local experiments

**Vector store**

```
FAISS (local persistent index)
```
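Conceptually, retrieval against a flat FAISS index is nearest-neighbor search over embedding vectors. Here is a dependency-free sketch of that idea (cosine-similarity top-k); FAISS accelerates exactly this search at scale, and MiniLM supplies the real vectors:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """index: list of (chunk_text, embedding) pairs. Returns the k
    chunks whose embeddings are closest to the query embedding."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```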

✨ Answer Generation

Model:

```
gpt-4o-mini
```

Prompt strategy:

  • Strictly answer from context
  • Avoid hallucination
  • Say “I don’t know” if answer absent

This conservative approach reduced hallucination — but introduced new behavior (we’ll get to that).
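A minimal version of this kind of grounded prompt looks something like the following — a paraphrase of the strategy, not my exact wording:

```python
def build_prompt(context: str, question: str) -> str:
    # Instruct the model to stay inside the retrieved context and to
    # abstain rather than guess — the source of the "I don't know"
    # behavior discussed later in this post.
    return (
        "Answer the question using ONLY the context below.\n"
        "If the answer is not in the context, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```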


📊 Evaluation Framework (LLM-as-Judge)

I evaluated 20 questions across documents.

Each answer was scored on:

  1. Context Adherence
  2. Context Precision
  3. Answer Relevance
  4. Groundedness

This separation is critical.

Most RAG systems fail because teams don’t know where the failure happens:

  1. Retrieval?
  2. Generation?
  3. Alignment?
  4. Table parsing?
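The judge loop itself is simple: build a rubric prompt per metric, send it to the judge model, and parse a numeric score from the reply. The sketch below is illustrative (the rubric wording and JSON format are assumptions, not my exact implementation); the model call itself is omitted.

```python
import json

# One rubric question per metric — hypothetical phrasings for illustration.
RUBRIC = {
    "context_adherence": "Does the answer use only facts present in the context?",
    "context_precision": "What fraction of the retrieved context was actually needed?",
    "answer_relevance": "Does the answer address the question asked?",
    "groundedness": "Is every claim in the answer traceable to the context?",
}

def judge_prompt(metric: str, question: str, context: str, answer: str) -> str:
    """Format one evaluation request for the judge model."""
    return (
        f"You are an evaluator. Metric: {metric}. {RUBRIC[metric]}\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        'Reply with JSON: {"score": <float between 0 and 1>}'
    )

def parse_score(judge_reply: str) -> float:
    """Extract the score and clamp it, defending against judge drift."""
    score = float(json.loads(judge_reply)["score"])
    return min(max(score, 0.0), 1.0)
```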

📈 Results Summary

From 20 evaluated questions:

  • Context Adherence: ~76%
  • Context Precision: ~0.48 average
  • Answer Relevance: ~0.74
  • Groundedness: High (except temporal mismatch cases)

Overall maturity:
7.5 / 10 Baseline RAG
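The summary numbers above are just per-metric averages over the 20 judged questions, along these lines:

```python
def summarize(results: list[dict]) -> dict:
    """Average each judge metric across all evaluated questions."""
    metrics = results[0].keys()
    return {m: sum(r[m] for r in results) / len(results) for m in metrics}
```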


🔎 What Actually Broke?

This is where things get interesting.

1️⃣ Temporal Misalignment (High Risk)

Example:

The system extracted an operating income value from the wrong fiscal year column.

The answer:

  • Looked correct
  • Existed in context
  • Was grounded

But belonged to the wrong year.

This is dangerous.

Financial tables with multiple years introduce alignment risk that naive RAG systems fail to detect.


2️⃣ “I Don’t Know” Even When Context Exists

Several cases where:

  • Context contained the answer
  • Model still said: “I don’t know”

Likely causes:

  • Chunk too large
  • Table parsing ambiguity
  • Conservative prompt

This is not hallucination.

This is extraction hesitation.

3️⃣ Low Context Precision

Many correct answers had low precision scores because:

  • Chunk size = 1000
  • Financial tables = noisy

The answer was present, but buried inside large irrelevant context.
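One way to see this precision loss concretely: score each retrieved chunk for relevance and take the fraction that was actually needed. In the toy version below a substring check stands in for the judge's relevance call — enough to show why one useful chunk buried among noisy ones scores 0.25, not 1.0.

```python
def context_precision(retrieved_chunks: list[str], needed: str) -> float:
    """Fraction of retrieved chunks containing the needed evidence.

    A plain substring check stands in for the LLM judge here —
    illustrative only, not the judge's actual scoring logic.
    """
    if not retrieved_chunks:
        return 0.0
    hits = sum(needed in chunk for chunk in retrieved_chunks)
    return hits / len(retrieved_chunks)
```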


🧠 Key Insight

Most RAG failures are not hallucinations.

They are:

  • Retrieval precision failures
  • Column alignment failures
  • Temporal reasoning failures
  • Overly conservative generation

Evaluation-first design makes these visible.

Without metrics, you’d never see this.


🚀 What I Would Improve

  1. Reduce chunk size to 600–800
  2. Increase overlap to maintain continuity
  3. Add year-alignment guardrail in prompt
  4. Add table-aware extraction logic
  5. Add reranker (hybrid retrieval or cross-encoder)
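Of these, the year-alignment guardrail is the cheapest to add. One possible shape — a sketch, not a battle-tested implementation — is to compare fiscal years mentioned in the answer against those in the question before returning it:

```python
import re

YEAR = re.compile(r"\b(19|20)\d{2}\b")

def years_in(text: str) -> set[str]:
    """All four-digit years (19xx/20xx) mentioned in the text."""
    return {m.group(0) for m in YEAR.finditer(text)}

def year_mismatch(question: str, answer: str) -> bool:
    """Flag answers that cite a fiscal year the question never asked about —
    the temporal-misalignment failure mode seen above."""
    asked, answered = years_in(question), years_in(answer)
    return bool(asked) and bool(answered) and not (answered <= asked)
```

A flagged answer could then be regenerated with an explicit "use only fiscal year X" instruction, or surfaced with a warning instead of silently returned.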

Baseline RAG works.

Architected RAG works better.


🏁 Why This Project Matters

There’s a difference between:

“RAG that answers”
and
“RAG that can be trusted”

This experiment focused on trust:

  • Measuring grounding
  • Detecting temporal misalignment
  • Identifying precision loss
  • Structuring evaluation signals

📌 Final Rating

| Category | Rating |
| --- | --- |
| Retrieval | ⭐⭐⭐⭐☆ |
| Generation | ⭐⭐⭐⭐☆ |
| Grounding | ⭐⭐⭐⭐☆ |
| Precision | ⭐⭐⭐☆☆ |
| Temporal Robustness | ⭐⭐☆☆☆ |

Baseline: Strong
Production-ready: Not yet


If you're building RAG systems, I strongly recommend:

  • Separate retrieval metrics from generation metrics
  • Always test on table-heavy documents
  • Measure groundedness independently
  • Add temporal alignment checks

RAG is easy to build.

Reliable RAG is engineering.

That’s the difference between demo-level AI and production-level AI.

GitHub