James Lee

Posted on Jun 18

Part 5 — Installing a Black Box Recorder in Your RAG System: 4-Layer Metadata + 3-Level Verification, Root Cause in 5 Minutes

#ai #architecture #monitoring #rag

This article covers the fifth layer of the full-stack architecture: Full-Chain Traceability. This is not a standalone module — it's observability infrastructure embedded into every layer. Core engineering value: turning "something broke, let's guess" into "root cause identified in 5 minutes."

📦 Source code: production-rag-engineering — esg/services/embedding_service.py, esg/routers/evaluation.py

0. The Pain Point

After Part 4's judgment engine went live, the system could produce quantified scores and missing element breakdowns. But a new problem emerged almost immediately:

Companies started pushing back.

"Pages 12–13 of our report explicitly state the environmental incident impact scope. Why was this flagged as non-compliant?"

The system's response: "Retrieval results indicate missing impact scope disclosure."

The company followed up: "Where exactly is it missing?"

The system went silent — because the retrieval process hadn't been recorded, and there was nothing to show.

Real numbers: company challenge rate was 35% (1 in every 3 reports was disputed). Manual investigation: 2 hours per case. Audit pass rate: 70%.

The investigation workflow looked like this: check original report (30 min) → check retrieval logs (30 min) → check chunking logs (30 min) → check adjacent chunks (30 min). Total: 2 hours. Success rate: 80%. The remaining 20% had no identifiable root cause at all.

This wasn't a judgment logic problem. The system had no observability infrastructure.

A car without a dashcam can only guess what happened after an accident. A RAG system without full-chain traceability means 2 hours of blind investigation every time something goes wrong.

1. What Traceability Needs to Solve

Production-grade RAG systems have three core tensions that appear in any domain:

Tension 1: Conclusions can't be traced — no one can convince anyone

The system says "missing impact scope disclosure." The user says "it's right there on page 12." Neither side has evidence. Manual review is the only option.

Tension 2: Debugging is guesswork — no idea which layer to start from

The same "missed detection" symptom could be caused by: a parsing step that dropped a page, a chunking step that truncated key content, or a retrieval parameter set too low. Without traceability data, you're guessing layer by layer from scratch.

Tension 3: Metadata doesn't match source text — location drift

Metadata says "pages 10–11" but the actual content is on page 12. This affects 5% of cases. Each investigation takes 1 hour, with only a 70% success rate.

The solution is to embed traceability data at every layer of the system — not adding logs after the fact, but recording at every node in real time as data flows through.

2. Four-Layer Metadata: What to Record and Why

Design principle: design around the audit chain. Each layer records only critical information — no redundant fields.

Early on, we recorded 20+ fields (including server IP, processing duration, user ID, etc.). 90% of them were never used. Storage cost increased 30%. After trimming to 12 core fields, storage cost dropped to +15% with zero loss in traceability capability.

Four-layer structure:

Layer	Core fields (12 total)	Design purpose	Typical use case
Identity layer	`chunk_id`, `doc_id`, `session_id`	Unique identifiers across the full pipeline	Use `chunk_id` to locate the exact fragment when a company challenges a conclusion
Position layer	`page_range`, `char_offset`, `block_index`	Physical location in source document	When a company asks "which page?", return the page number directly
Technical layer	`embedding_model`, `vector_dim`, `chunk_strategy`	Record technical parameters for debugging	When accuracy drops, check whether a model version change caused it
Business layer	`similarity_score`, `gri_code`, `confidence_level`	Link to business attributes, explain judgment logic	Explain "this chunk had similarity 0.69 < threshold 0.7, so it wasn't retrieved"

Complete metadata example:

chunk_metadata = {
    # Identity layer
    "chunk_id": "C158",
    "doc_id": "ESG_2023_001",
    "session_id": "session_20231015_143000",

    # Position layer
    "page_range": "15-15",
    "char_offset": [120, 350],
    "block_index": 3,

    # Technical layer
    "embedding_model": "text-embedding-3-large",
    "vector_dim": 1536,
    "chunk_strategy": "2000chars+300overlap",

    # Business layer
    "similarity_score": 0.92,
    "gri_code": "GRI-305-1",
    "confidence_level": "high"
}

Why four layers, not three or five?

Option	Traceability	Investigation time	Storage cost	Verdict
Three layers (no technical layer)	Can't debug technical issues	1 hour	+10%	❌ Insufficient
Four layers (current)	Complete traceability	5 minutes	+15%	✅ Optimal
Five layers (add time/user layer)	Same as four layers	5 minutes	+25%	❌ Over-engineered

The technical layer is the critical differentiator. Without it, when accuracy drops, you can't determine whether the cause was a model version change, a chunking strategy adjustment, or something else entirely.

This four-layer design is universal: Identity + Position + Technical + Business layers are domain-agnostic. For legal documents, medical records, or financial reports, you only need to replace the business layer field definitions.

3. Three-Level Verification: Trace the Data Flow, Don't Guess

With four-layer metadata in place, debugging shifts from "guessing" to "following the data flow layer by layer."

Design logic: ordered by data flow — parsing → chunking → retrieval. This matches the direction of error propagation: a parsing error corrupts everything downstream; a chunking error corrupts retrieval; a retrieval parameter error only affects recall results.

Three-Level Verification Flow

Level 1 — Parsing verification (link via doc_id to parse log)
  ├─ Check 1: parsed page count vs. original PDF page count
  │   └─ Pages missing → PDF dropped pages → trigger repair
  └─ Check 2: text coverage rate
      └─ < 95% → scanned document not OCR'd → trigger repair
  → Identifies 40% of issues

Level 2 — Chunking verification (link via chunk_id to chunk log)
  ├─ Check 1: page_range in metadata vs. actual page number
  │   └─ Mismatch → chunk location drift → trigger repair
  └─ Check 2: key term cross-chunk rate
      └─ > 10% → chunk boundary error → trigger repair
  → Identifies 45% of issues

Level 3 — Retrieval verification (link via session_id across full pipeline)
  ├─ Check 1: top_k parameter
  │   └─ Relevant chunk ranked outside top_k → parameter too small → trigger repair
  └─ Check 2: similarity score distribution
      └─ All chunks below 0.7 → query issue → trigger repair
  → Identifies 15% of issues

85% of issues are identified in the first two levels. Retrieval-layer issues account for only 15%, and they're typically configuration problems — the easiest to fix.

Three-level verification code skeleton:

def three_level_check(doc_id: str, chunk_id: str, session_id: str) -> dict:
    issues = []

    # Level 1: Parsing layer
    parse_log = get_parse_log(doc_id)
    if parse_log["parsed_pages"] < parse_log["original_pages"]:
        issues.append({
            "level": "parsing",
            "type": "missing_pages",
            "detail": f"Lost {parse_log['original_pages'] - parse_log['parsed_pages']} pages"
        })

    if parse_log["text_coverage"] < 0.95:
        issues.append({
            "level": "parsing",
            "type": "ocr_needed",
            "detail": f"Text coverage: {parse_log['text_coverage']:.1%}"
        })

    # Level 2: Chunking layer
    chunk_log = get_chunk_log(chunk_id)
    if chunk_log["page_range"] != chunk_log["actual_page"]:
        issues.append({
            "level": "chunking",
            "type": "page_mismatch",
            "detail": f"Metadata page {chunk_log['page_range']} vs actual {chunk_log['actual_page']}"
        })

    if chunk_log["term_cross_rate"] > 0.1:
        issues.append({
            "level": "chunking",
            "type": "term_split",
            "detail": f"Term cross-chunk rate: {chunk_log['term_cross_rate']:.1%}"
        })

    # Level 3: Retrieval layer
    retrieval_log = get_retrieval_log(session_id)
    if retrieval_log["relevant_chunk_rank"] > retrieval_log["top_k"]:
        issues.append({
            "level": "retrieval",
            "type": "top_k_too_small",
            "detail": f"Relevant chunk rank: {retrieval_log['relevant_chunk_rank']}, top_k={retrieval_log['top_k']}"
        })

    return {
        "issues": issues,
        "root_cause_level": issues[0]["level"] if issues else None
    }

4. Auto-Repair: Fix It Automatically Whenever Possible

Three-level verification locates the problem. The auto-repair module applies the right fix for each problem type:

Four problem types + four repair strategies:

Problem type          Coverage    Repair strategy
─────────────────────────────────────────────────────────────────
PDF dropped pages     10%         Switch parsing tool (PyMuPDF → pdfplumber)
Term cross-chunk      45%         Merge adjacent chunks to restore complete expression
Low similarity        20%         0.6–0.7 → rewrite query
                                  < 0.6   → notify ops team to expand knowledge base
top_k too small       10%         Dynamically adjust by clause type
─────────────────────────────────────────────────────────────────
Auto-repair coverage  85%
Requires human        15% (knowledge base gaps 10% + complex logic errors 5%)

Dynamic top_k adjustment — design detail:

Different clause types need different top_k values. Multi-dimensional disclosure clauses (e.g., GRI 305-1, requiring total emissions + calculation method + data source) need more candidate chunks. Single data-point clauses (e.g., GRI 301-1, requiring only a materials usage figure) need far fewer:

CLAUSE_TOP_K_CONFIG = {
    "multi_dimension": 8,   # multi-dimensional disclosure clauses (305-1, 306-3, etc.)
    "single_point": 5,      # single data-point clauses (301-1, 302-5, etc.)
    "default": 5
}

def get_dynamic_top_k(gri_code: str) -> int:
    clause_type = get_clause_type(gri_code)  # look up clause attributes from knowledge base
    return CLAUSE_TOP_K_CONFIG.get(clause_type, CLAUSE_TOP_K_CONFIG["default"])

Auto-repair results:

Auto-repair rate: 0% → 85%
Human intervention rate: 100% → 15%
Operations cost reduced by 80%
Multi-dimensional clause miss rate: 18% → 3%

5. Two Real Cases

These cases are from the ESG compliance scenario, but the investigation process itself — identify the problem layer, inspect the corresponding stage, trigger repair — is universal.

Case 1: GRI 306-3 Missed Detection (Chunk Boundary Error)

Company challenge: "Pages 12–13 of our report explicitly disclose the environmental incident impact scope. Why was this flagged as non-compliant?"

Traceability walkthrough (5 minutes):

Step 1 — Pull retrieval log (0.5 min)
  Query business layer metadata using conclusion_id
  Finding: system only matched "emergency response" fragment
           missing "impact scope" — the core disclosure point

Step 2 — Check chunking log (1 min)
  Query position layer metadata using chunk_id=chunk_162
  Finding: content spans pages 12–13, split into:
    chunk_162: "...affecting the flow..." (similarity 0.69)
    chunk_163: "...area of approximately 0.5km²" (similarity 0.71)

Step 3 — Technical layer analysis (1 min)
  chunk_162 similarity 0.69 < threshold 0.7 → not retrieved
  chunk_163 > 0.7, but content is incomplete — cannot stand alone

Step 4 — Three-level verification locates root cause (1.5 min)
  Parsing layer: page count normal, text coverage 96% — no issue
  Chunking layer: mixed table/text layout caused boundary error,
                  complete expression was truncated ← ROOT CAUSE
  Retrieval layer: no issue

Step 5 — Auto-repair (1 min)
  Merge chunk_162 + chunk_163 → complete expression restored
  Re-retrieve: hit rate 100%, similarity 0.84
  Conclusion revised to "Compliant"

Total time: 5 minutes (vs. 2 hours with traditional manual investigation)

Case 2: GRI 305-1 False Negative (top_k Too Small)

Company challenge: "We disclosed Scope 1/2/3 carbon emissions data. Why was this flagged as non-compliant?"

Traceability walkthrough (5 minutes):

Step 1 — Pull retrieval log (0.5 min)
  Query business layer metadata using conclusion_id
  Finding: only chunk_158 retrieved (contains Scope 1/2 data)
           missing "data source" — a core disclosure point

Step 2 — Check related chunks (1 min)
  Query all chunks associated with gri_code=305-1
  Finding: chunk_170 ("Data source: third-party carbon verification report")
           similarity 0.72 ≥ threshold — but top_k=5,
           chunk_170 ranked 6th — not retrieved

Step 3 — Three-level verification locates root cause (1.5 min)
  Parsing layer: no issue
  Chunking layer: no issue
  Retrieval layer: top_k=5 insufficient for multi-dimensional clause 305-1 ← ROOT CAUSE

Step 4 — Auto-repair (1 min)
  Dynamic adjustment: 305-1 is a multi-dimensional disclosure clause → top_k adjusted to 8
  Re-retrieve: chunk_170 now retrieved
  Conclusion revised to "Compliant"

Step 5 — Notify company (1 min)
  "GRI 305-1 requires multi-dimensional disclosure. Retrieval parameters have been
   dynamically adjusted. Conclusion now aligns with standard requirements."

Total time: 5 minutes. Multi-dimensional clause miss rate: 18% → 3%.

6. Storage and Performance

Why store metadata separately in PostgreSQL rather than mixing it into Milvus?

Milvus supports scalar field metadata storage, but two problems arise:

With 12 metadata fields, complex joint queries are required (e.g., "filter by time range + clause ID + similarity score range simultaneously"). Milvus's query capability doesn't support this.
Weak transactional guarantees — metadata updates (e.g., rewriting similarity_score after a repair) can't be made atomic.

PostgreSQL's SQL query capability and transaction support make it the right choice for metadata storage.

Performance problem: 10,000+ metadata records per report, queries were slow

A single ESG report averages 200–500 chunks. Each chunk retrieves Top 5 clauses. Each record has 12 fields. Total: ~10,000 metadata records per report. Initially, querying full-chain traceability for one report took 2 seconds.

Three-step optimization:

-- Optimization 1: monthly table partitioning to reduce single-table size
-- Table naming: metadata_2023_10, metadata_2023_11...
CREATE TABLE metadata_2023_10 PARTITION OF metadata
FOR VALUES FROM ('2023-10-01') TO ('2023-11-01');

-- Optimization 2: composite index for common query patterns
CREATE INDEX idx_chunk_clause ON metadata_2023_10
(chunk_id, gri_code, similarity_score);

-- Optimization 3: hot/cold data separation
-- Data older than 3 months migrated to object storage, loaded on demand

Optimization results:

Metric	Before	After
Query time per report	2 seconds	300ms
Concurrent reports supported	1	10
Storage cost	Baseline	-30% (cold data to object storage)

7. Wrapping Up: The Traceability Architecture Decision Tree

When building a new production-grade RAG system, three questions determine how much to invest in traceability infrastructure:

Q1: Do conclusions need to be auditable?
  ├─ Yes (compliance / legal / medical / financial scenarios)
  │   → Four-layer metadata is required.
  │     Minimum: identity layer + position layer + business layer.
  └─ No (internal tools / prototype validation)
      → Simplified recording: just chunk_id + page_range is sufficient.

Q2: Do problems need to be located quickly?
  ├─ Yes (production environment, high SLA requirements)
  │   → Three-level verification is required.
  │     Order by data flow direction.
  └─ No (offline batch processing, slow investigation is acceptable)
      → Manual investigation is fine.

Q3: Does repair need to be automated?
  ├─ Yes (operations cost is a concern, scaled deployment)
  │   → Build auto-repair strategies by problem type.
  │     Target 80%+ coverage of common issues.
  └─ No (small scale, human intervention is acceptable)
      → Manual repair is fine.

This four-layer metadata + three-level verification design is the observability baseline for any production-grade RAG system.

When transferring to a new domain, only two things need to change:

Business layer fields: replace gri_code with law_article_id (legal), icd_code (medical), or regulation_id (financial)
Three-level verification rules: replace "term cross-chunk rate > 10%" with the quality indicators appropriate for your scenario

The identity layer, position layer, and technical layer designs are fully universal — no changes needed.

Source Code

All implementations referenced in this article are available here:

👉 github.com/muzinan123/production-rag-engineering

Relevant files for this part:

esg/services/embedding_service.py — 4-layer metadata recording at write time
esg/routers/evaluation.py — evaluation API entry point with traceability hooks

Next up: The system is live. Traceability is in place. But how do you know whether the system is getting better or getting worse? The miss rate dropped from 60% to 38% — not by gut feeling. Behind that improvement is a golden test set + three-tier metrics + regression gate evaluation loop. → Part 6 — Evaluation & Iteration

DEV Community