DEV Community

James Lee


I Shipped a Strict-Source RAG System to Production in 8 Weeks: A Full-Stack Engineering Retrospective

This is a story about "getting RAG right" — not a demo, but a production system under real business pressure, with real failures and real data.

📦 Source code: production-rag-engineering


Why I'm Writing This Series

There's no shortage of RAG articles online. Most of them look like this:

"Load documents with LangChain → split → embed → retrieve → feed to GPT → get answer"

That pipeline works fine for a demo. But the moment you push it to production, things fall apart:

  • Documents are full of tables — parsing turns them into garbage
  • Chunking splits a complete rule in half — retrieval never finds it
  • Vector similarity hits 0.9 — but the conclusion is completely wrong
  • Something breaks and you have no idea where to look, so you guess

This series isn't about demos. It's a complete engineering retrospective of a RAG system built from zero to production.
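
To make the chunking failure above concrete, here is a minimal sketch. The two rule sentences, the "Rule X:" marker, and both function names are made up for illustration; they are not taken from the GRI standards or from the project repo. It only shows that a fixed-size splitter cuts a rule in half, while a splitter that respects rule boundaries keeps it whole.

```python
from __future__ import annotations

import re

# Hypothetical rule text, invented for this example.
RULES = (
    "Rule A: Report total Scope 1 emissions in tonnes of CO2 equivalent. "
    "Rule B: Report the base year used for the emissions calculation."
)


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` characters, ignoring rule boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def rule_aware_chunks(text: str) -> list[str]:
    """Split only where a new 'Rule X:' marker starts, so each rule stays whole."""
    parts = re.split(r"(?=Rule [A-Z]:)", text)
    return [part.strip() for part in parts if part.strip()]


naive = fixed_size_chunks(RULES, 40)
aware = rule_aware_chunks(RULES)

# With the naive splitter, no single chunk holds all of Rule A.
rule_a = aware[0]
print("Rule A intact in a naive chunk:", any(rule_a in c for c in naive))
print("Rule-aware chunk count:", len(aware))
```

A retriever that indexes the naive chunks can only ever return a fragment of Rule A, which is the failure described in the second bullet.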


What This System Does

The entry point is ESG compliance detection.

Every year, companies publish an ESG report aligned to GRI (Global Reporting Initiative) standards, demonstrating compliance across environmental, social, and governance dimensions. The GRI framework contains 250+ rules, each with specific disclosure requirements.

The traditional approach: compliance teams manually cross-check each rule. One report takes 3–5 days, the miss rate sits at 15%, and every GRI standards update costs another 2 weeks of knowledge-base maintenance.

This isn't a headcount problem — it's a scalability bottleneck that shows up in any document-intensive compliance workflow.

The goal: let the system handle this automatically, and produce conclusions that are quantifiable and auditable.


Why This Scenario Is an Extreme Stress Test for RAG Engineering

I didn't choose this scenario because ESG is special. I chose it because its constraints force every hard RAG engineering problem to the surface:

| Constraint | Engineering Challenge |
| --- | --- |
| 250+ structured rules, each with explicit required elements | Semantic matching alone isn't enough; element completeness must be verified |
| Two completely different document types: rule docs + corporate reports | A single chunking strategy won't work for both |
| Dense domain terminology (Scope 1/2/3, GHG Protocol…) | General-purpose embedding models will drift on specialized terms |
| Conclusions must be auditable; clients can challenge them | A complete traceability chain is non-negotiable |
| GRI standards update annually | The knowledge base must support incremental updates; full rebuilds aren't viable |
| Privacy-sensitive data (employee compensation, environmental incidents) | Some scenarios require local deployment; data cannot leave the premises |

Every one of these constraints maps to an engineering decision you'll never face in a demo — but can't avoid in production.

That's why this scenario is worth a full series: it's a natural stress test for RAG engineering.
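
The first row of the table is the easiest one to illustrate in code. The sketch below is a simplified stand-in, not the project's implementation: the series uses NER for element verification, while this example uses plain keyword matching so it stays self-contained. The `Rule` class, the rule ID, the element names, and the keywords are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: str
    # Each required element maps to keywords that count as evidence for it.
    required_elements: dict[str, tuple[str, ...]]


def check_completeness(rule: Rule, passage: str) -> dict[str, bool]:
    """Return, per required element, whether any of its keywords appears."""
    text = passage.lower()
    return {
        element: any(keyword.lower() in text for keyword in keywords)
        for element, keywords in rule.required_elements.items()
    }


def completeness_score(result: dict[str, bool]) -> float:
    """Fraction of required elements found (1.0 means fully covered)."""
    return sum(result.values()) / len(result) if result else 0.0


rule = Rule(
    rule_id="EXAMPLE-1",  # hypothetical ID, not a real GRI rule
    required_elements={
        "quantity": ("tonnes", "tco2e"),
        "scope": ("scope 1",),
        "base_year": ("base year",),
    },
)
passage = "Our Scope 1 emissions were 12,400 tonnes in the reporting period."

result = check_completeness(rule, passage)
missing = [name for name, found in result.items() if not found]
print("missing elements:", missing)
print("completeness:", round(completeness_score(result), 2))
```

The passage would likely score well on semantic similarity to the rule, yet it never states a base year. That gap between "looks relevant" and "contains every required element" is what the first constraint is about.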


Full-Stack Architecture

The system is divided into six modules. Data flows top to bottom:

Raw Documents (PDF / HTML / Structured Data)
        ↓
[Module 1] Document Ingestion
Parse → Clean → Dual storage (PostgreSQL + Milvus)
        ↓
[Module 2] Text Chunking
Document-type routing → Differentiated chunking strategies → Anti-truncation defense
        ↓
[Module 3] Hybrid Retrieval
Embedding model selection → Terminology augmentation → Dual validation
        ↓
[Module 4] Judgment Engine
Rule engine filtering → Multi-model routing → NER element verification → Quantified scoring
        ↓
[Module 5] Full-Chain Traceability
4-layer metadata → 3-level verification → Auto-repair
        ↓
[Module 6] Evaluation & Iteration Loop
Golden test set → 3-tier metrics → Regression gate → Continuous iteration

Although the diagram draws it as a stage, Module 5 (Full-Chain Traceability) is not a standalone step: it is a cross-cutting observability layer that runs through every stage. Every operation writes a traceability record, so any conclusion can be traced back to the exact paragraph in the original document.
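
One way to picture "every operation writes a traceability record" is a chain of records linked by parent IDs. This is a rough sketch under my own assumptions: `TraceRecord`, `TraceLog`, the field names, and the sample values are hypothetical and do not reproduce the 4-layer metadata schema covered in Part 5.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    record_id: str
    stage: str             # e.g. "ingestion", "chunking", "retrieval", "judgment"
    parent_id: str | None  # record this one was derived from; None at the source
    detail: str            # free-form description, e.g. document and paragraph


class TraceLog:
    """In-memory store; a real system would persist these records."""

    def __init__(self) -> None:
        self._records: dict[str, TraceRecord] = {}

    def write(self, record: TraceRecord) -> None:
        self._records[record.record_id] = record

    def trace_back(self, record_id: str) -> list[TraceRecord]:
        """Walk parent links from a conclusion back to its source record."""
        chain: list[TraceRecord] = []
        seen: set[str] = set()
        current = self._records.get(record_id)
        # The `seen` set guards against accidental cycles in parent links.
        while current is not None and current.record_id not in seen:
            chain.append(current)
            seen.add(current.record_id)
            parent = current.parent_id
            current = self._records.get(parent) if parent else None
        return chain


log = TraceLog()
log.write(TraceRecord("p1", "ingestion", None, "report.pdf, page 12, paragraph 3"))
log.write(TraceRecord("c1", "chunking", "p1", "chunk 0 of paragraph"))
log.write(TraceRecord("r1", "retrieval", "c1", "matched rule EXAMPLE-1"))
log.write(TraceRecord("j1", "judgment", "r1", "conclusion: partially disclosed"))

chain = log.trace_back("j1")
for record in chain:
    print(f"{record.stage}: {record.detail}")
```

Because each stage only needs to know the ID of the record it consumed, the layer can cut across all modules without any of them depending on each other's internals.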


Results

8 weeks from zero to production. Core metrics before and after:

| Metric | Before | After |
| --- | --- | --- |
| Detection time per report | 3–5 days | 2 hours |
| Miss rate | 15% | 3% |
| Audit pass rate | 70% | 100% |
| Response time per client challenge | 2 hours | 5 minutes |
| Cost per judgment | $0.58 | $0.23 |
| Manual review rate | 100% | 15% |

These numbers weren't achieved by throwing more resources at the problem. Every improvement traces back to a specific engineering decision — and this series will break each one down.


Where This Methodology Transfers

The entry point is ESG, but every layer of this architecture is general-purpose:

| Technical Module | Transferable Scenarios |
| --- | --- |
| Differentiated chunking strategies | Any system processing both rule documents and long-form text |
| Domain terminology-augmented retrieval | Legal, medical, financial: any terminology-dense domain |
| Three-layer judgment engine | Any pipeline requiring "retrieval → rule verification → quantified conclusion" |
| Four-layer metadata traceability | Observability infrastructure for any production-grade RAG system |
| Evaluation loop + regression gate | Any LLM system that needs continuous iteration |

Reading this series, you're not learning "how to do ESG compliance." You're studying a RAG engineering methodology — validated against an extreme real-world scenario.
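
The last row of the table, the regression gate, is the most domain-independent piece, so here is a minimal sketch of the idea. Everything in it is an assumption for illustration: the `GoldenCase` fields, the definition of miss rate as the share of truly positive cases the pipeline failed to flag, and the 0.01 tolerance. Part 6 covers the metrics actually used.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    expected: bool   # ground truth label from the golden test set
    predicted: bool  # what the candidate pipeline concluded


def miss_rate(cases: list[GoldenCase]) -> float:
    """Share of expected-positive cases that the pipeline failed to flag."""
    positives = [case for case in cases if case.expected]
    if not positives:
        return 0.0
    missed = sum(1 for case in positives if not case.predicted)
    return missed / len(positives)


def regression_gate(baseline: float, candidate: float, tolerance: float = 0.01) -> bool:
    """Pass only if the candidate miss rate is no worse than baseline + tolerance."""
    return candidate <= baseline + tolerance


cases = [
    GoldenCase("a", expected=True, predicted=True),
    GoldenCase("b", expected=True, predicted=False),
    GoldenCase("c", expected=True, predicted=True),
    GoldenCase("d", expected=True, predicted=True),
    GoldenCase("e", expected=False, predicted=False),
]

candidate = miss_rate(cases)
print("candidate miss rate:", candidate)
print("gate passed:", regression_gate(baseline=0.20, candidate=candidate))
```

Nothing here is specific to ESG: swap in a different golden set and metric, and the same gate can block a regression in any LLM pipeline before it ships.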


Series Navigation

| Part | Title | Core Engineering Decision |
| --- | --- | --- |
| Part 1 | How do unstructured documents become a searchable knowledge base? | Multi-source heterogeneous document ingestion + storage selection by elimination |
| Part 2 | Why does one system need three different chunking strategies? | Atomic semantic unit identification + two-layer anti-truncation defense |
| Part 3 | Where does vector retrieval break down in domain-specific terminology scenarios? | Semantic drift mitigation + dual validation mechanism |
| Part 4 | High semantic similarity score ≠ correct business conclusion | Three gaps from retrieval to decision + three-layer judgment engine |
| Part 5 | When a RAG conclusion is challenged, can you produce evidence in 5 minutes? | 4-layer metadata + 3-level verification + auto-repair |
| Part 6 | Miss rate dropped from 60% to 7%, not tuned by gut feeling | Golden test set + 3-tier metrics + regression gate |

Source Code

All implementations referenced in this series are available here:

👉 github.com/muzinan123/production-rag-engineering

The repo contains two complete production implementations:

  • esg/ — ESG compliance detection pipeline
  • medical/ — Medical terminology standardization pipeline

Start with Part 1, where we break down the ingestion pipeline.
