I Shipped a Strict-Source RAG System to Production in 8 Weeks: A Full-Stack Engineering Retrospective

#ai #rag #showdev #softwareengineering

This is a story about "getting RAG right" — not a demo, but a production system under real business pressure, with real failures and real data.

📦 Source code: production-rag-engineering

Why I'm Writing This Series

There's no shortage of RAG articles online. Most of them look like this:

"Load documents with LangChain → split → embed → retrieve → feed to GPT → get answer"

That pipeline works fine for a demo. But the moment you push it to production, things fall apart:

Documents are full of tables — parsing turns them into garbage
Chunking splits a complete rule in half — retrieval never finds it
Vector similarity hits 0.9 — but the conclusion is completely wrong
Something breaks and you have no idea where to look, so you guess

This series isn't about demos. It's a complete engineering retrospective of a RAG system built from zero to production.

What This System Does

The entry point is ESG compliance detection.

Every year, companies publish an ESG report aligned to GRI (Global Reporting Initiative) standards, demonstrating compliance across environmental, social, and governance dimensions. The GRI framework contains 250+ rules, each with specific disclosure requirements.

The traditional approach: compliance teams manually cross-check each rule. One report takes 3–5 days, the miss rate sits at 15%, and every time GRI updates its standards, maintaining the knowledge base takes another 2 weeks.

This isn't a headcount problem — it's a scalability bottleneck that shows up in any document-intensive compliance workflow.

The goal: let the system handle this automatically, and produce conclusions that are quantifiable and auditable.

Why This Scenario Is an Extreme Stress Test for RAG Engineering

I didn't choose this scenario because ESG is special. I chose it because its constraints force every hard RAG engineering problem to the surface:

Constraint	Engineering Challenge
250+ structured rules, each with explicit required elements	Semantic matching alone isn't enough — element completeness must be verified
Two completely different document types: rule docs + corporate reports	A single chunking strategy won't work for both
Dense domain terminology (Scope 1/2/3, GHG Protocol…)	General-purpose embedding models will drift on specialized terms
Conclusions must be auditable — clients can challenge them	A complete traceability chain is non-negotiable
GRI standards update annually	The knowledge base must support incremental updates — full rebuilds aren't viable
Privacy-sensitive data (employee compensation, environmental incidents)	Some scenarios require local deployment; data cannot leave the premises

Every one of these constraints maps to an engineering decision you'll never face in a demo — but can't avoid in production.
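
To make the first constraint concrete, here is a minimal sketch of what "element completeness must be verified" could look like. It is not the production implementation: the `Rule` class, the `judge` function, the phrase lists, the 0.8 threshold, and the `demo-emissions` rule are all hypothetical names and values made up for illustration, and plain substring matching stands in for whatever element extraction the real system uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A disclosure rule plus the elements a passage must contain (illustrative)."""
    rule_id: str
    # element name -> phrases that count as evidence for it (hypothetical lists)
    required_elements: dict[str, tuple[str, ...]]


def missing_elements(rule: Rule, passage: str) -> list[str]:
    """Return the names of required elements with no evidence in the passage."""
    text = passage.lower()
    return [
        name
        for name, phrases in rule.required_elements.items()
        if not any(phrase.lower() in text for phrase in phrases)
    ]


def judge(rule: Rule, passage: str, similarity: float,
          threshold: float = 0.8) -> tuple[str, list[str]]:
    """Combine the similarity score with the completeness check.

    A high similarity score only makes the passage a candidate;
    the verdict still depends on the required elements being present.
    """
    missing = missing_elements(rule, passage)
    if similarity < threshold:
        return "not_matched", missing
    return ("compliant" if not missing else "incomplete"), missing


# Made-up rule and passage, not actual GRI text.
rule = Rule(
    rule_id="demo-emissions",
    required_elements={
        "scope": ("scope 1",),
        "unit": ("tco2e", "tonnes co2"),
        "base_year": ("base year",),
    },
)
passage = "Our Scope 1 emissions fell sharply thanks to fleet electrification."
verdict, missing = judge(rule, passage, similarity=0.91)
print(verdict, missing)
```

The passage is clearly on topic, so a similarity of 0.91 is plausible, yet it states no unit and no base year. Under these assumptions the verdict is `incomplete`, which is exactly the gap between "retrieved the right paragraph" and "the disclosure satisfies the rule".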

That's why this scenario is worth a full series: it's a natural stress test for RAG engineering.

Full-Stack Architecture

The system is divided into six modules. Data flows top to bottom:

Raw Documents (PDF / HTML / Structured Data)
        ↓
[Module 1] Document Ingestion
Parse → Clean → Dual storage (PostgreSQL + Milvus)
        ↓
[Module 2] Text Chunking
Document-type routing → Differentiated chunking strategies → Anti-truncation defense
        ↓
[Module 3] Hybrid Retrieval
Embedding model selection → Terminology augmentation → Dual validation
        ↓
[Module 4] Judgment Engine
Rule engine filtering → Multi-model routing → NER element verification → Quantified scoring
        ↓
[Module 5] Full-Chain Traceability
4-layer metadata → 3-level verification → Auto-repair
        ↓
[Module 6] Evaluation & Iteration Loop
Golden test set → 3-tier metrics → Regression gate → Continuous iteration

Module 5 (Full-Chain Traceability) appears as a stage in the diagram, but it is not a standalone step: it's a cross-cutting observability layer that runs through every stage. Every operation writes a traceability record, so any conclusion can be traced back to the exact paragraph in the original document.
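
As a rough illustration of that idea, the sketch below chains one record per stage through a parent pointer, so a judgment can be walked back to its source paragraph. Everything here is an assumption for illustration: `TraceRecord`, `TraceLog`, the field names, and the in-memory dict are hypothetical, and the sketch makes no claim about the system's actual 4-layer metadata schema or storage.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    """One record written by one pipeline stage (hypothetical schema)."""
    stage: str
    detail: dict
    parent_id: str | None = None
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class TraceLog:
    """In-memory stand-in for a trace store; a real system would persist this."""

    def __init__(self) -> None:
        self._records: dict[str, TraceRecord] = {}

    def write(self, stage: str, detail: dict, parent_id: str | None = None) -> str:
        """Store a record and return its id so the next stage can link to it."""
        record = TraceRecord(stage=stage, detail=detail, parent_id=parent_id)
        self._records[record.record_id] = record
        return record.record_id

    def lineage(self, record_id: str) -> list[TraceRecord]:
        """Follow parent links back to the root; return records oldest first."""
        chain: list[TraceRecord] = []
        current: str | None = record_id
        while current is not None:
            record = self._records[current]
            chain.append(record)
            current = record.parent_id
        return chain[::-1]


# Illustrative values only.
log = TraceLog()
doc = log.write("ingestion", {"source": "report.pdf", "page": 12, "paragraph": 3})
chunk = log.write("chunking", {"chunk_id": "c-42"}, parent_id=doc)
hit = log.write("retrieval", {"score": 0.91}, parent_id=chunk)
verdict = log.write("judgment", {"rule": "demo-emissions", "result": "incomplete"},
                    parent_id=hit)

for record in log.lineage(verdict):
    print(record.stage, record.detail)
```

Because every stage passes its record id forward, answering "where did this conclusion come from?" becomes a lookup rather than an investigation.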

Results

8 weeks from zero to production. Core metrics before and after:

Metric	Before	After
Detection time per report	3–5 days	2 hours
Miss rate	15%	3%
Audit pass rate	70%	100%
Response time per client challenge	2 hours	5 minutes
Cost per judgment	$0.58	$0.23
Manual review rate	100%	15%

These numbers weren't achieved by throwing more resources at the problem. Every improvement traces back to a specific engineering decision — and this series will break each one down.

Where This Methodology Transfers

The entry point is ESG, but every layer of this architecture is general-purpose:

Technical Module	Transferable Scenarios
Differentiated chunking strategies	Any system processing both rule documents and long-form text
Domain terminology-augmented retrieval	Legal, medical, financial — any terminology-dense domain
Three-layer judgment engine	Any pipeline requiring "retrieval → rule verification → quantified conclusion"
Four-layer metadata traceability	Observability infrastructure for any production-grade RAG system
Evaluation loop + regression gate	Any LLM system that needs continuous iteration

Reading this series, you're not learning "how to do ESG compliance." You're studying a RAG engineering methodology — validated against an extreme real-world scenario.

Series Navigation

Part	Title	Core Engineering Decision
Part 1	How do unstructured documents become a searchable knowledge base?	Multi-source heterogeneous document ingestion + storage selection by elimination
Part 2	Why does one system need three different chunking strategies?	Atomic semantic unit identification + two-layer anti-truncation defense
Part 3	Where does vector retrieval break down in domain-specific terminology scenarios?	Semantic drift mitigation + dual validation mechanism
Part 4	High semantic similarity score ≠ correct business conclusion	Three gaps from retrieval to decision + three-layer judgment engine
Part 5	When a RAG conclusion is challenged, can you produce evidence in 5 minutes?	4-layer metadata + 3-level verification + auto-repair
Part 6	Miss rate dropped from 60% to 7% — not tuned by gut feeling	Golden test set + 3-tier metrics + regression gate
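
As a preview of the "golden test set + regression gate" idea from Part 6, here is a minimal sketch. It rests on assumptions: the post does not define how miss rate is computed, so this version takes it to be the share of true findings in the golden set that a run fails to flag. The function names, the 0.01 tolerance, and the sample data are hypothetical.

```python
from __future__ import annotations


def miss_rate(golden: dict[str, bool], predicted: dict[str, bool]) -> float:
    """Share of true findings (golden value True) that the run did not flag.

    Assumed definition; cases absent from `predicted` count as not flagged.
    """
    positives = [case_id for case_id, is_finding in golden.items() if is_finding]
    if not positives:
        return 0.0
    missed = sum(1 for case_id in positives if not predicted.get(case_id, False))
    return missed / len(positives)


def regression_gate(baseline: float, candidate: float,
                    tolerance: float = 0.01) -> bool:
    """Pass only if the candidate miss rate is at most baseline + tolerance."""
    return candidate <= baseline + tolerance


# Tiny made-up golden set: case id -> "this case is a real finding".
golden = {"a": True, "b": True, "c": False, "d": True, "e": True}
predicted = {"a": True, "b": False, "c": False, "d": True, "e": True}

candidate_rate = miss_rate(golden, predicted)
print(candidate_rate)
print(regression_gate(baseline=0.10, candidate=candidate_rate))
```

Wired into CI, a gate like this blocks any prompt, chunking, or retrieval change that makes the miss rate worse than the last accepted baseline, which is what keeps tuning from drifting into guesswork.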