This article covers the fifth layer of the full-stack architecture: Full-Chain Traceability. This is not a standalone module — it's observability infrastructure embedded into every layer. Core engineering value: turning "something broke, let's guess" into "root cause identified in 5 minutes."
📦 Source code: production-rag-engineering (esg/services/embedding_service.py, esg/routers/evaluation.py)
0. The Pain Point
After Part 4's judgment engine went live, the system could produce quantified scores and missing element breakdowns. But a new problem emerged almost immediately:
Companies started pushing back.
"Pages 12–13 of our report explicitly state the environmental incident impact scope. Why was this flagged as non-compliant?"
The system's response: "Retrieval results indicate missing impact scope disclosure."
The company followed up: "Where exactly is it missing?"
The system went silent — because the retrieval process hadn't been recorded, and there was nothing to show.
Real numbers: company challenge rate was 35% (roughly 1 in every 3 reports was disputed). Manual investigation: 2 hours per case. Audit pass rate: 70%.
The investigation workflow looked like this: check original report (30 min) → check retrieval logs (30 min) → check chunking logs (30 min) → check adjacent chunks (30 min). Total: 2 hours. Success rate: 80%. The remaining 20% had no identifiable root cause at all.
This wasn't a judgment logic problem. The system had no observability infrastructure.
A car without a dashcam can only guess what happened after an accident. A RAG system without full-chain traceability means 2 hours of blind investigation every time something goes wrong.
1. What Traceability Needs to Solve
Production-grade RAG systems have three core tensions that appear in any domain:
Tension 1: Conclusions can't be traced — no one can convince anyone
The system says "missing impact scope disclosure." The user says "it's right there on page 12." Neither side has evidence. Manual review is the only option.
Tension 2: Debugging is guesswork — no idea which layer to start from
The same "missed detection" symptom could be caused by: a parsing step that dropped a page, a chunking step that truncated key content, or a retrieval parameter set too low. Without traceability data, you're guessing layer by layer from scratch.
Tension 3: Metadata doesn't match source text — location drift
Metadata says "pages 10–11" but the actual content is on page 12. This affects 5% of cases. Each investigation takes 1 hour, with only a 70% success rate.
The solution is to embed traceability data at every layer of the system — not adding logs after the fact, but recording at every node in real time as data flows through.
2. Four-Layer Metadata: What to Record and Why
Design principle: design around the audit chain. Each layer records only critical information — no redundant fields.
Early on, we recorded 20+ fields (including server IP, processing duration, user ID, etc.). 90% of them were never used. Storage cost increased 30%. After trimming to 12 core fields, storage cost dropped to +15% with zero loss in traceability capability.
Four-layer structure:
| Layer | Core fields (12 total) | Design purpose | Typical use case |
|---|---|---|---|
| Identity layer | chunk_id, doc_id, session_id | Unique identifiers across the full pipeline | Use chunk_id to locate the exact fragment when a company challenges a conclusion |
| Position layer | page_range, char_offset, block_index | Physical location in source document | When a company asks "which page?", return the page number directly |
| Technical layer | embedding_model, vector_dim, chunk_strategy | Record technical parameters for debugging | When accuracy drops, check whether a model version change caused it |
| Business layer | similarity_score, gri_code, confidence_level | Link to business attributes, explain judgment logic | Explain "this chunk had similarity 0.69 < threshold 0.7, so it wasn't retrieved" |
Complete metadata example:
chunk_metadata = {
    # Identity layer
    "chunk_id": "C158",
    "doc_id": "ESG_2023_001",
    "session_id": "session_20231015_143000",
    # Position layer
    "page_range": "15-15",
    "char_offset": [120, 350],
    "block_index": 3,
    # Technical layer
    "embedding_model": "text-embedding-3-large",
    "vector_dim": 1536,  # the model's default is 3072; 1536 implies a reduced "dimensions" setting
    "chunk_strategy": "2000chars+300overlap",
    # Business layer
    "similarity_score": 0.92,
    "gri_code": "GRI-305-1",
    "confidence_level": "high"
}
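To keep the 12 fields from drifting as the pipeline evolves, the record can be given a fixed schema. The following is a minimal sketch, not the repository's implementation: the `ChunkMetadata` class name, the tuple form of `char_offset`, and the validation rules (including the assumption that similarity scores fall in 0 to 1) are illustrative choices.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class ChunkMetadata:
    # Identity layer
    chunk_id: str
    doc_id: str
    session_id: str
    # Position layer
    page_range: str               # "start-end", e.g. "15-15"
    char_offset: Tuple[int, int]  # (start, end) character offsets
    block_index: int
    # Technical layer
    embedding_model: str
    vector_dim: int
    chunk_strategy: str
    # Business layer
    similarity_score: float
    gri_code: str
    confidence_level: str

    def __post_init__(self) -> None:
        # Reject records whose position or score fields are internally inconsistent.
        start_page, end_page = (int(p) for p in self.page_range.split("-"))
        if start_page > end_page:
            raise ValueError(f"page_range is reversed: {self.page_range}")
        if self.char_offset[0] > self.char_offset[1]:
            raise ValueError(f"char_offset is reversed: {self.char_offset}")
        if not 0.0 <= self.similarity_score <= 1.0:
            raise ValueError(f"similarity_score out of range: {self.similarity_score}")


example = ChunkMetadata(
    chunk_id="C158", doc_id="ESG_2023_001", session_id="session_20231015_143000",
    page_range="15-15", char_offset=(120, 350), block_index=3,
    embedding_model="text-embedding-3-large", vector_dim=1536,
    chunk_strategy="2000chars+300overlap",
    similarity_score=0.92, gri_code="GRI-305-1", confidence_level="high",
)
record = asdict(example)  # plain dict, ready to be written to the metadata store
```

Validating at write time means a reversed page range or offset is caught when the chunk is created rather than during a later investigation.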
Why four layers, not three or five?
| Option | Traceability | Investigation time | Storage cost | Verdict |
|---|---|---|---|---|
| Three layers (no technical layer) | Can't debug technical issues | 1 hour | +10% | ❌ Insufficient |
| Four layers (current) | Complete traceability | 5 minutes | +15% | ✅ Optimal |
| Five layers (add time/user layer) | Same as four layers | 5 minutes | +25% | ❌ Over-engineered |
The technical layer is the critical differentiator. Without it, when accuracy drops, you can't determine whether the cause was a model version change, a chunking strategy adjustment, or something else entirely.
This four-layer design is universal: Identity + Position + Technical + Business layers are domain-agnostic. For legal documents, medical records, or financial reports, you only need to replace the business layer field definitions.
3. Three-Level Verification: Trace the Data Flow, Don't Guess
With four-layer metadata in place, debugging shifts from "guessing" to "following the data flow layer by layer."
Design logic: ordered by data flow — parsing → chunking → retrieval. This matches the direction of error propagation: a parsing error corrupts everything downstream; a chunking error corrupts retrieval; a retrieval parameter error only affects recall results.
Three-Level Verification Flow
Level 1 — Parsing verification (link via doc_id to parse log)
├─ Check 1: parsed page count vs. original PDF page count
│ └─ Pages missing → PDF dropped pages → trigger repair
└─ Check 2: text coverage rate
└─ < 95% → scanned document not OCR'd → trigger repair
→ Identifies 40% of issues
Level 2 — Chunking verification (link via chunk_id to chunk log)
├─ Check 1: page_range in metadata vs. actual page number
│ └─ Mismatch → chunk location drift → trigger repair
└─ Check 2: key term cross-chunk rate
└─ > 10% → chunk boundary error → trigger repair
→ Identifies 45% of issues
Level 3 — Retrieval verification (link via session_id across full pipeline)
├─ Check 1: top_k parameter
│ └─ Relevant chunk ranked outside top_k → parameter too small → trigger repair
└─ Check 2: similarity score distribution
└─ All chunks below 0.7 → query issue → trigger repair
→ Identifies 15% of issues
85% of issues are identified in the first two levels. Retrieval-layer issues account for only 15%, and they're typically configuration problems — the easiest to fix.
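The article does not spell out how the key term cross-chunk rate in Level 2 is computed. One plausible reading, sketched here as an assumption, is the share of key-term occurrences in the parsed text that no single chunk contains in full. The function names and the (start, end) chunk span format are hypothetical.

```python
from typing import List, Tuple


def find_occurrences(text: str, term: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of every occurrence of term in text."""
    spans = []
    start = text.find(term)
    while start != -1:
        spans.append((start, start + len(term)))
        start = text.find(term, start + 1)
    return spans


def term_cross_rate(text: str,
                    chunk_spans: List[Tuple[int, int]],
                    key_terms: List[str]) -> float:
    """Share of key-term occurrences that no single chunk contains in full."""
    total = 0
    split = 0
    for term in key_terms:
        for t_start, t_end in find_occurrences(text, term):
            total += 1
            contained = any(c_start <= t_start and t_end <= c_end
                            for c_start, c_end in chunk_spans)
            if not contained:
                split += 1
    return split / total if total else 0.0


text = "The spill changed the impact scope. The impact scope covers 0.5 km2."
chunks = [(0, 28), (28, len(text))]  # boundary falls inside the first "impact scope"
rate = term_cross_rate(text, chunks, ["impact scope"])  # 0.5: one of two occurrences is split
```

Because the check asks whether any chunk contains the whole term, overlapping chunks naturally lower the rate: a term cut by one boundary still counts as intact if the overlap region of the next chunk covers it.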
Three-level verification code skeleton:
# get_parse_log, get_chunk_log and get_retrieval_log are log-lookup helpers
# assumed to be defined elsewhere in the project; each returns a dict.
def three_level_check(doc_id: str, chunk_id: str, session_id: str) -> dict:
    issues = []

    # Level 1: Parsing layer
    parse_log = get_parse_log(doc_id)
    if parse_log["parsed_pages"] < parse_log["original_pages"]:
        issues.append({
            "level": "parsing",
            "type": "missing_pages",
            "detail": f"Lost {parse_log['original_pages'] - parse_log['parsed_pages']} pages"
        })
    if parse_log["text_coverage"] < 0.95:
        issues.append({
            "level": "parsing",
            "type": "ocr_needed",
            "detail": f"Text coverage: {parse_log['text_coverage']:.1%}"
        })

    # Level 2: Chunking layer
    chunk_log = get_chunk_log(chunk_id)
    if chunk_log["page_range"] != chunk_log["actual_page"]:
        issues.append({
            "level": "chunking",
            "type": "page_mismatch",
            "detail": f"Metadata page {chunk_log['page_range']} vs actual {chunk_log['actual_page']}"
        })
    if chunk_log["term_cross_rate"] > 0.1:
        issues.append({
            "level": "chunking",
            "type": "term_split",
            "detail": f"Term cross-chunk rate: {chunk_log['term_cross_rate']:.1%}"
        })

    # Level 3: Retrieval layer
    retrieval_log = get_retrieval_log(session_id)
    if retrieval_log["relevant_chunk_rank"] > retrieval_log["top_k"]:
        issues.append({
            "level": "retrieval",
            "type": "top_k_too_small",
            "detail": f"Relevant chunk rank: {retrieval_log['relevant_chunk_rank']}, top_k={retrieval_log['top_k']}"
        })

    # Issues are appended in data-flow order, so the first one is the most upstream.
    return {
        "issues": issues,
        "root_cause_level": issues[0]["level"] if issues else None
    }
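The skeleton covers Level 3's first check (top_k) but not its second, the similarity score distribution. A minimal sketch of that check might look like the following; the function name, the `low_similarity` and `no_candidates` type labels, and the shape of the input are assumptions, while the 0.7 threshold is the one quoted in the verification flow.

```python
from typing import List, Optional

SIMILARITY_THRESHOLD = 0.7  # threshold quoted in the verification flow


def check_similarity_distribution(scores: List[float],
                                  threshold: float = SIMILARITY_THRESHOLD) -> Optional[dict]:
    """Return an issue dict when every candidate scores below the threshold."""
    if not scores:
        return {
            "level": "retrieval",
            "type": "no_candidates",
            "detail": "Retrieval returned no chunks",
        }
    best = max(scores)
    if best < threshold:
        return {
            "level": "retrieval",
            "type": "low_similarity",
            "detail": f"Best score {best:.2f} < threshold {threshold:.2f}",
        }
    return None  # at least one chunk cleared the threshold


issue = check_similarity_distribution([0.62, 0.69])
```

Checking only the best score is enough here: if the maximum is below the threshold, every other candidate is too.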
4. Auto-Repair: Fix It Automatically Whenever Possible
Three-level verification locates the problem. The auto-repair module applies the right fix for each problem type:
Four problem types + four repair strategies:
Problem type Coverage Repair strategy
─────────────────────────────────────────────────────────────────
PDF dropped pages 10% Switch parsing tool (PyMuPDF → pdfplumber)
Term cross-chunk 45% Merge adjacent chunks to restore complete expression
Low similarity 20% 0.6–0.7 → rewrite query
< 0.6 → notify ops team to expand knowledge base
top_k too small 10% Dynamically adjust by clause type
─────────────────────────────────────────────────────────────────
Auto-repair coverage 85%
Requires human 15% (knowledge base gaps 10% + complex logic errors 5%)
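The table above can be read as a dispatch from problem type to repair action. Below is a hedged sketch of that routing. The action names, the `best_score` key, and the `low_similarity` type label are hypothetical; the other type labels reuse those from the verification skeleton, and the 0.6 and 0.7 bands come from the table.

```python
from typing import Callable, Dict


def repair_low_similarity(best_score: float) -> str:
    """Pick the low-similarity action from the table's two score bands."""
    if best_score < 0.6:
        return "notify_ops_expand_knowledge_base"
    if best_score < 0.7:
        return "rewrite_query"
    return "no_action"  # score already clears the retrieval threshold


# One handler per problem type; each takes the issue dict and returns an action name.
REPAIR_STRATEGIES: Dict[str, Callable[[dict], str]] = {
    "missing_pages": lambda issue: "switch_parser",
    "term_split": lambda issue: "merge_adjacent_chunks",
    "low_similarity": lambda issue: repair_low_similarity(issue["best_score"]),
    "top_k_too_small": lambda issue: "raise_top_k_for_clause_type",
}


def plan_repair(issue: dict) -> str:
    """Return the repair action for an issue, or escalate when none is registered."""
    handler = REPAIR_STRATEGIES.get(issue["type"])
    return handler(issue) if handler else "escalate_to_human"


action = plan_repair({"type": "low_similarity", "best_score": 0.65})  # "rewrite_query"
```

In this sketch, any problem type without a registered strategy falls through to human review, which mirrors the 15% that the table leaves to manual handling.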
Dynamic top_k adjustment — design detail:
Different clause types need different top_k values. Multi-dimensional disclosure clauses (e.g., GRI 305-1, requiring total emissions + calculation method + data source) need more candidate chunks. Single data-point clauses (e.g., GRI 301-1, requiring only a materials usage figure) need fewer:
CLAUSE_TOP_K_CONFIG = {
    "multi_dimension": 8,  # multi-dimensional disclosure clauses (305-1, 306-3, etc.)
    "single_point": 5,     # single data-point clauses (301-1, 302-5, etc.)
    "default": 5
}

def get_dynamic_top_k(gri_code: str) -> int:
    clause_type = get_clause_type(gri_code)  # look up clause attributes from knowledge base
    return CLAUSE_TOP_K_CONFIG.get(clause_type, CLAUSE_TOP_K_CONFIG["default"])
Auto-repair results:
- Auto-repair rate: 0% → 85%
- Human intervention rate: 100% → 15%
- Operations cost reduced by 80%
- Multi-dimensional clause miss rate: 18% → 3%
5. Two Real Cases
These cases are from the ESG compliance scenario, but the investigation process itself — identify the problem layer, inspect the corresponding stage, trigger repair — is universal.
Case 1: GRI 306-3 Missed Detection (Chunk Boundary Error)
Company challenge: "Pages 12–13 of our report explicitly disclose the environmental incident impact scope. Why was this flagged as non-compliant?"
Traceability walkthrough (5 minutes):
Step 1 — Pull retrieval log (0.5 min)
Query business layer metadata using conclusion_id
Finding: system only matched "emergency response" fragment
missing "impact scope" — the core disclosure point
Step 2 — Check chunking log (1 min)
Query position layer metadata using chunk_id=chunk_162
Finding: content spans pages 12–13, split into:
chunk_162: "...affecting the flow..." (similarity 0.69)
chunk_163: "...area of approximately 0.5km²" (similarity 0.71)
Step 3 — Technical layer analysis (1 min)
chunk_162 similarity 0.69 < threshold 0.7 → not retrieved
chunk_163 > 0.7, but content is incomplete — cannot stand alone
Step 4 — Three-level verification locates root cause (1.5 min)
Parsing layer: page count normal, text coverage 96% — no issue
Chunking layer: mixed table/text layout caused boundary error,
complete expression was truncated ← ROOT CAUSE
Retrieval layer: no issue
Step 5 — Auto-repair (1 min)
Merge chunk_162 + chunk_163 → complete expression restored
Re-retrieve: hit rate 100%, similarity 0.84
Conclusion revised to "Compliant"
Total time: 5 minutes (vs. 2 hours with traditional manual investigation)
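The merge in Step 5 has to account for the overlap between consecutive chunks, otherwise the shared text is duplicated. The following is a minimal sketch under stated assumptions: the `Chunk` class, its field names, and the use of document-level character offsets (end exclusive) are illustrative, and the sample sentence is invented.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Chunk:
    chunk_id: str
    text: str
    pages: Tuple[int, int]      # (first_page, last_page)
    char_span: Tuple[int, int]  # (start, end) offsets in the parsed document, end exclusive


def merge_adjacent(first: Chunk, second: Chunk) -> Chunk:
    """Join two consecutive chunks, dropping the text they share through overlap."""
    if second.char_span[0] < first.char_span[0]:
        raise ValueError("chunks must be passed in document order")
    if second.char_span[0] > first.char_span[1]:
        raise ValueError("chunks are not adjacent: there is a gap between them")
    overlap = first.char_span[1] - second.char_span[0]  # 0 when the chunks merely touch
    return Chunk(
        chunk_id=f"{first.chunk_id}+{second.chunk_id}",
        text=first.text + second.text[overlap:],
        pages=(min(first.pages[0], second.pages[0]), max(first.pages[1], second.pages[1])),
        char_span=(first.char_span[0], max(first.char_span[1], second.char_span[1])),
    )


doc = "The incident affected the river flow over an area of approximately 0.5 km2."
a = Chunk("chunk_162", doc[0:40], (12, 12), (0, 40))
b = Chunk("chunk_163", doc[35:], (13, 13), (35, len(doc)))
merged = merge_adjacent(a, b)  # merged.text equals doc, merged.pages == (12, 13)
```

The merged chunk would then need to be re-embedded and its metadata rewritten before re-retrieval, since the stored similarity scores belong to the original fragments.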
Case 2: GRI 305-1 False Negative (top_k Too Small)
Company challenge: "We disclosed Scope 1/2/3 carbon emissions data. Why was this flagged as non-compliant?"
Traceability walkthrough (5 minutes):
Step 1 — Pull retrieval log (0.5 min)
Query business layer metadata using conclusion_id
Finding: only chunk_158 retrieved (contains Scope 1/2 data)
missing "data source" — a core disclosure point
Step 2 — Check related chunks (1 min)
Query all chunks associated with gri_code=305-1
Finding: chunk_170 ("Data source: third-party carbon verification report")
similarity 0.72 ≥ threshold — but top_k=5,
chunk_170 ranked 6th — not retrieved
Step 3 — Three-level verification locates root cause (1.5 min)
Parsing layer: no issue
Chunking layer: no issue
Retrieval layer: top_k=5 insufficient for multi-dimensional clause 305-1 ← ROOT CAUSE
Step 4 — Auto-repair (1 min)
Dynamic adjustment: 305-1 is a multi-dimensional disclosure clause → top_k adjusted to 8
Re-retrieve: chunk_170 now retrieved
Conclusion revised to "Compliant"
Step 5 — Notify company (1 min)
"GRI 305-1 requires multi-dimensional disclosure. Retrieval parameters have been
dynamically adjusted. Conclusion now aligns with standard requirements."
Total time: 5 minutes. Multi-dimensional clause miss rate: 18% → 3%.
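The failure in Case 2 is easy to reproduce in isolation: a chunk can clear the similarity threshold and still be lost to the top_k cutoff. In the sketch below, the `retrieve` function and most of the candidate scores are hypothetical; only chunk_170's score of 0.72 and its rank of 6th come from the case.

```python
from typing import List, Tuple


def retrieve(scored: List[Tuple[str, float]], top_k: int, threshold: float = 0.7) -> List[str]:
    """Rank by score, cut the list at top_k, and drop anything below the threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [chunk_id for chunk_id, score in ranked[:top_k] if score >= threshold]


# Illustrative candidates for one GRI 305-1 query; chunk_170 sits in 6th place.
candidates = [
    ("chunk_158", 0.92), ("chunk_159", 0.88), ("chunk_160", 0.85),
    ("chunk_161", 0.80), ("chunk_165", 0.76), ("chunk_170", 0.72),
    ("chunk_171", 0.65),
]

hits_before = retrieve(candidates, top_k=5)  # chunk_170 is cut despite scoring 0.72
hits_after = retrieve(candidates, top_k=8)   # chunk_170 is kept; chunk_171 fails the threshold
```

Raising top_k widens the candidate window without loosening the threshold, so low-scoring chunks such as chunk_171 are still excluded.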
6. Storage and Performance
Why store metadata separately in PostgreSQL rather than mixing it into Milvus?
Milvus supports scalar field metadata storage, but two problems arise:
- With 12 metadata fields, traceability needs complex joint queries (e.g., "filter by time range + clause ID + similarity score range simultaneously"). Milvus's scalar filtering is aimed at narrowing vector searches and is a poor fit for this kind of relational query.
- Weak transactional guarantees: metadata updates (e.g., rewriting similarity_score after a repair) can't be made atomic.
PostgreSQL's SQL query capability and transaction support make it the right choice for metadata storage.
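To make the two requirements concrete, here is a small sketch of a joint filter and an atomic repair update. It uses Python's built-in sqlite3 as a stand-in so the example runs without a database server; it is not PostgreSQL and not the article's schema. The `created_at` column and the merged ID `chunk_162+163` are hypothetical, and the scores mirror Case 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metadata (
        chunk_id TEXT, gri_code TEXT, similarity_score REAL, created_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        ("chunk_162", "GRI-306-3", 0.69, "2023-10-15"),
        ("chunk_163", "GRI-306-3", 0.71, "2023-10-15"),
        ("chunk_158", "GRI-305-1", 0.92, "2023-10-15"),
    ],
)
conn.commit()

# Joint filter: time range + clause ID + similarity score range in one query.
near_misses = conn.execute(
    """
    SELECT chunk_id FROM metadata
    WHERE created_at BETWEEN ? AND ?
      AND gri_code = ?
      AND similarity_score BETWEEN ? AND ?
    ORDER BY chunk_id
    """,
    ("2023-10-01", "2023-10-31", "GRI-306-3", 0.6, 0.75),
).fetchall()

# Atomic repair: both statements commit together, or roll back together on error.
with conn:
    conn.execute("DELETE FROM metadata WHERE chunk_id = ?", ("chunk_163",))
    conn.execute(
        "UPDATE metadata SET chunk_id = ?, similarity_score = ? WHERE chunk_id = ?",
        ("chunk_162+163", 0.84, "chunk_162"),
    )

repaired = conn.execute(
    "SELECT chunk_id, similarity_score FROM metadata WHERE gri_code = ?",
    ("GRI-306-3",),
).fetchall()
```

The same pattern carries over to PostgreSQL through a driver's transaction block; the point is that a repair which removes one record and rewrites another never leaves the store half-updated.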
Performance problem: 10,000+ metadata records per report, queries were slow
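A single ESG report averages 200–500 chunks. Each chunk retrieves its top 5 clauses, and each resulting record has 12 fields. That comes to 1,000–2,500 chunk-clause records, or roughly 12,000–30,000 stored field values, per report, which is the scale behind the 10,000+ figure. Initially, querying full-chain traceability for one report took 2 seconds.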
Three-step optimization:
-- Optimization 1: monthly table partitioning to reduce single-table size
-- Table naming: metadata_2023_10, metadata_2023_11...
CREATE TABLE metadata_2023_10 PARTITION OF metadata
    FOR VALUES FROM ('2023-10-01') TO ('2023-11-01');

-- Optimization 2: composite index for common query patterns
CREATE INDEX idx_chunk_clause ON metadata_2023_10
    (chunk_id, gri_code, similarity_score);

-- Optimization 3: hot/cold data separation
-- Data older than 3 months migrated to object storage, loaded on demand
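Monthly partitions have to be created ahead of time, so the DDL is usually generated rather than typed by hand. The following is a sketch of such a generator; the `monthly_partition_ddl` helper is hypothetical, and it assumes the parent table is already declared with range partitioning on a date column.

```python
from datetime import date


def monthly_partition_ddl(year: int, month: int, parent: str = "metadata") -> str:
    """Build the CREATE TABLE statement for one monthly range partition."""
    start = date(year, month, 1)
    # The upper bound is the first day of the next month, rolling over in December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF {parent}\n"
        f"    FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


october = monthly_partition_ddl(2023, 10)
december = monthly_partition_ddl(2023, 12)
```

In PostgreSQL range partitioning the upper bound is exclusive, which is why each partition ends on the first day of the following month and adjacent partitions do not overlap.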
Optimization results:
| Metric | Before | After |
|---|---|---|
| Query time per report | 2 seconds | 300ms |
| Concurrent reports supported | 1 | 10 |
| Storage cost | Baseline | -30% (cold data to object storage) |
7. Wrapping Up: The Traceability Architecture Decision Tree
When building a new production-grade RAG system, three questions determine how much to invest in traceability infrastructure:
Q1: Do conclusions need to be auditable?
├─ Yes (compliance / legal / medical / financial scenarios)
│ → Four-layer metadata is required.
│ Minimum: identity layer + position layer + business layer.
└─ No (internal tools / prototype validation)
→ Simplified recording: just chunk_id + page_range is sufficient.
Q2: Do problems need to be located quickly?
├─ Yes (production environment, high SLA requirements)
│ → Three-level verification is required.
│ Order by data flow direction.
└─ No (offline batch processing, slow investigation is acceptable)
→ Manual investigation is fine.
Q3: Does repair need to be automated?
├─ Yes (operations cost is a concern, scaled deployment)
│ → Build auto-repair strategies by problem type.
│ Target 80%+ coverage of common issues.
└─ No (small scale, human intervention is acceptable)
→ Manual repair is fine.
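The three questions are independent of each other, so the tree reduces to three yes/no switches. A small sketch of that mapping follows, with a hypothetical function name and recommendation strings paraphrased from the tree:

```python
from typing import List


def traceability_plan(auditable: bool, fast_diagnosis: bool, automated_repair: bool) -> List[str]:
    """Map the three decision-tree answers to the recommended investment."""
    return [
        "four-layer metadata" if auditable else "chunk_id + page_range only",
        "three-level verification" if fast_diagnosis else "manual investigation",
        "auto-repair by problem type" if automated_repair else "manual repair",
    ]


# A compliance deployment that needs fast diagnosis but is still small in scale.
plan = traceability_plan(auditable=True, fast_diagnosis=True, automated_repair=False)
```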
This four-layer metadata + three-level verification design is the observability baseline for any production-grade RAG system.
When transferring to a new domain, only two things need to change:
- Business layer fields: replace gri_code with law_article_id (legal), icd_code (medical), or regulation_id (financial)
- Three-level verification rules: replace "term cross-chunk rate > 10%" with the quality indicators appropriate for your scenario
The identity layer, position layer, and technical layer designs are fully universal — no changes needed.
Source Code
All implementations referenced in this article are available here:
👉 github.com/muzinan123/production-rag-engineering
Relevant files for this part:
- esg/services/embedding_service.py: 4-layer metadata recording at write time
- esg/routers/evaluation.py: evaluation API entry point with traceability hooks
Next up: The system is live. Traceability is in place. But how do you know whether the system is getting better or getting worse? The miss rate dropped from 60% to 38% — not by gut feeling. Behind that improvement is a golden test set + three-tier metrics + regression gate evaluation loop. → Part 6 — Evaluation & Iteration