This is a story about "getting RAG right" — not a demo, but a production system under real business pressure, with real failures and real data.
📦 Source code: production-rag-engineering
Why I'm Writing This Series
There's no shortage of RAG articles online. Most of them look like this:
"Load documents with LangChain → split → embed → retrieve → feed to GPT → get answer"
That pipeline works fine for a demo. But the moment you push it to production, things fall apart:
- Documents are full of tables — parsing turns them into garbage
- Chunking splits a complete rule in half — retrieval never finds it
- Vector similarity hits 0.9 — but the conclusion is completely wrong
- Something breaks and you have no idea where to look, so you guess
This series isn't about demos. It's a complete engineering retrospective of a RAG system built from zero to production.
What This System Does
The entry point is ESG compliance detection.
Every year, companies publish an ESG report aligned to GRI (Global Reporting Initiative) standards, demonstrating compliance across environmental, social, and governance dimensions. The GRI framework contains 250+ rules, each with specific disclosure requirements.
The traditional approach: compliance teams manually cross-check each rule. One report takes 3–5 days, the miss rate sits at 15%, and every time GRI updates its standards, bringing the knowledge base up to date takes another 2 weeks.
This isn't a headcount problem — it's a scalability bottleneck that shows up in any document-intensive compliance workflow.
The goal: let the system handle this automatically, and produce conclusions that are quantifiable and auditable.
Why This Scenario Is an Extreme Stress Test for RAG Engineering
I didn't choose this scenario because ESG is special. I chose it because its constraints force every hard RAG engineering problem to the surface:
| Constraint | Engineering Challenge |
|---|---|
| 250+ structured rules, each with explicit required elements | Semantic matching alone isn't enough — element completeness must be verified |
| Two completely different document types: rule docs + corporate reports | A single chunking strategy won't work for both |
| Dense domain terminology (Scope 1/2/3, GHG Protocol…) | General-purpose embedding models will drift on specialized terms |
| Conclusions must be auditable — clients can challenge them | A complete traceability chain is non-negotiable |
| GRI standards update annually | The knowledge base must support incremental updates — full rebuilds aren't viable |
| Privacy-sensitive data (employee compensation, environmental incidents) | Some scenarios require local deployment; data cannot leave the premises |
Every one of these constraints maps to an engineering decision you'll never face in a demo — but can't avoid in production.
That's why this scenario is worth a full series: it's a natural stress test for RAG engineering.
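To make the first row of the table concrete, here is a minimal sketch of an element completeness check. It is not the series' implementation: the judgment engine described later uses NER for element verification, while this sketch swaps in plain keyword matching to stay self-contained. The `Rule` class, the rule ID, the element names, and the keywords are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    # element name -> keywords that count as evidence for that element (hypothetical)
    required_elements: dict[str, tuple[str, ...]]


def check_completeness(rule: Rule, passage: str) -> dict[str, bool]:
    """Report, per required element, whether any of its keywords appears in the passage."""
    text = passage.lower()
    return {
        name: any(keyword.lower() in text for keyword in keywords)
        for name, keywords in rule.required_elements.items()
    }


def missing_elements(rule: Rule, passage: str) -> list[str]:
    """List the required elements for which no keyword was found."""
    return [name for name, found in check_completeness(rule, passage).items() if not found]


# Hypothetical rule and passage, for illustration only.
EXAMPLE_RULE = Rule(
    rule_id="example-emissions-rule",
    required_elements={
        "scope": ("scope 1",),
        "unit": ("tco2e", "metric tons"),
        "base_year": ("base year",),
    },
)
PASSAGE = "Our Scope 1 emissions were 12,400 tCO2e in the reporting year."

print(missing_elements(EXAMPLE_RULE, PASSAGE))
```

The passage is a strong semantic match for an emissions rule, yet it never states a base year, so the check flags `base_year` as missing. That gap between "similar" and "complete" is what the table row refers to.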
Full-Stack Architecture
The system is divided into six modules. Data flows from top to bottom:
Raw Documents (PDF / HTML / Structured Data)
↓
[Module 1] Document Ingestion
Parse → Clean → Dual storage (PostgreSQL + Milvus)
↓
[Module 2] Text Chunking
Document-type routing → Differentiated chunking strategies → Anti-truncation defense
↓
[Module 3] Hybrid Retrieval
Embedding model selection → Terminology augmentation → Dual validation
↓
[Module 4] Judgment Engine
Rule engine filtering → Multi-model routing → NER element verification → Quantified scoring
↓
[Module 5] Full-Chain Traceability
4-layer metadata → 3-level verification → Auto-repair
↓
[Module 6] Evaluation & Iteration Loop
Golden test set → 3-tier metrics → Regression gate → Continuous iteration
Module 5 (Full-Chain Traceability) is drawn as a stage in the diagram, but it is not a standalone step: it is a cross-cutting observability layer that runs through every stage. Every operation writes a traceability record, so any conclusion can be traced back to the exact paragraph in the original document.
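One way to picture "every operation writes a traceability record" is a chain of records linked by parent IDs. The sketch below assumes that shape. `TraceRecord`, `TraceLog`, and all field names are hypothetical, and they are not the 4-layer metadata schema covered in Part 5.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceRecord:
    record_id: str
    stage: str                             # e.g. "ingestion", "chunking", "retrieval", "judgment"
    parent_id: Optional[str]               # record this one was derived from; None at the source
    doc_id: str
    paragraph_index: Optional[int] = None  # set only on the source (ingestion) record


class TraceLog:
    """In-memory stand-in for wherever traceability records would be stored."""

    def __init__(self) -> None:
        self._records: dict[str, TraceRecord] = {}

    def write(self, record: TraceRecord) -> None:
        self._records[record.record_id] = record

    def trace_back(self, record_id: str) -> list[TraceRecord]:
        """Follow parent links from a record back to its source record."""
        chain: list[TraceRecord] = []
        seen: set[str] = set()
        current = self._records.get(record_id)
        while current is not None and current.record_id not in seen:
            seen.add(current.record_id)  # guard against accidental cycles
            chain.append(current)
            current = self._records.get(current.parent_id) if current.parent_id else None
        return chain


# Each stage writes one record that points at the record it consumed.
log = TraceLog()
log.write(TraceRecord("r1", "ingestion", None, "report-2023", paragraph_index=42))
log.write(TraceRecord("r2", "chunking", "r1", "report-2023"))
log.write(TraceRecord("r3", "retrieval", "r2", "report-2023"))
log.write(TraceRecord("r4", "judgment", "r3", "report-2023"))

chain = log.trace_back("r4")
print([record.stage for record in chain])
print(chain[-1].doc_id, chain[-1].paragraph_index)
```

Starting from the judgment record, walking the chain ends at the ingestion record, which carries the document ID and paragraph index. A production version would persist these records rather than hold them in a dictionary.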
Results
8 weeks from zero to production. Core metrics before and after:
| Metric | Before | After |
|---|---|---|
| Detection time per report | 3–5 days | 2 hours |
| Miss rate | 15% | 3% |
| Audit pass rate | 70% | 100% |
| Response time per client challenge | 2 hours | 5 minutes |
| Cost per judgment | $0.58 | $0.23 |
| Manual review rate | 100% | 15% |
These numbers weren't achieved by throwing more resources at the problem. Every improvement traces back to a specific engineering decision — and this series will break each one down.
Where This Methodology Transfers
The entry point is ESG, but every layer of this architecture is general-purpose:
| Technical Module | Transferable Scenarios |
|---|---|
| Differentiated chunking strategies | Any system processing both rule documents and long-form text |
| Domain terminology-augmented retrieval | Legal, medical, financial — any terminology-dense domain |
| Three-layer judgment engine | Any pipeline requiring "retrieval → rule verification → quantified conclusion" |
| Four-layer metadata traceability | Observability infrastructure for any production-grade RAG system |
| Evaluation loop + regression gate | Any LLM system that needs continuous iteration |
Reading this series, you're not learning "how to do ESG compliance." You're studying a RAG engineering methodology — validated against an extreme real-world scenario.
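As an illustration of the last row in the table, a regression gate can be as small as a comparison between two metric snapshots taken on the same golden test set. This is a sketch under assumptions, not the gate built in Part 6: the function name, the metric names, the values, and the tolerance are hypothetical, and it assumes higher is better for every metric.

```python
def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return the metrics on which the candidate regresses beyond the tolerance.

    Assumes higher is better for every metric. A metric missing from the
    candidate counts as a failure.
    """
    failures = []
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None or cand_value < base_value - tolerance:
            failures.append(name)
    return failures


# Hypothetical metric snapshots from two runs on the same golden test set.
baseline_metrics = {"recall": 0.93, "precision": 0.90}
candidate_metrics = {"recall": 0.95, "precision": 0.85}

failed = regression_gate(baseline_metrics, candidate_metrics)
print("blocked by:" if failed else "gate passed", failed)
```

Here recall improves but precision drops by more than the tolerance, so the change is blocked. A gain on one metric should not be able to hide a regression on another.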
Series Navigation
| Part | Title | Core Engineering Decision |
|---|---|---|
| Part 1 | How do unstructured documents become a searchable knowledge base? | Multi-source heterogeneous document ingestion + storage selection by elimination |
| Part 2 | Why does one system need three different chunking strategies? | Atomic semantic unit identification + two-layer anti-truncation defense |
| Part 3 | Where does vector retrieval break down in domain-specific terminology scenarios? | Semantic drift mitigation + dual validation mechanism |
| Part 4 | High semantic similarity score ≠ correct business conclusion | Three gaps from retrieval to decision + three-layer judgment engine |
| Part 5 | When a RAG conclusion is challenged, can you produce evidence in 5 minutes? | 4-layer metadata + 3-level verification + auto-repair |
| Part 6 | Miss rate dropped from 60% to 7% — not tuned by gut feeling | Golden test set + 3-tier metrics + regression gate |
Source Code
All implementations referenced in this series are available here:
👉 github.com/muzinan123/production-rag-engineering
The repo contains two complete production implementations:
- esg/: ESG compliance detection pipeline
- medical/: Medical terminology standardization pipeline
Start with Part 1, where we break down the ingestion pipeline.