James Lee

Posted on Jun 18

Part 6 — RAG Recall Quality from 60% to 93%: Building a Continuous Evaluation Loop (Not Gut Feeling)

#llm #performance #rag #softwareengineering

This article covers the sixth and final layer of the full-stack architecture: the Evaluation & Iteration Loop. Without it, every optimization in the previous five layers is a one-time event. Core engineering value: turning "feels better" into "data proves it's better" — giving the system the ability to improve itself.

📦 Source code: production-rag-engineering — esg/services/evaluation_service.py, esg/routers/evaluation.py

0. The Pain Point

After all five layers went live, the system ran for two weeks. Miss rate was still 60%.

The team started tuning parameters: chunk size from 512 to 1024 — miss rate dropped a little, but no one knew whether it was the chunking change or something else. Then the similarity threshold from 0.8 to 0.7 — miss rate changed again, but false positive rate went up at the same time.

After every change: no idea what actually improved, by how much, or why.

This isn't a technical problem. It's an engineering methodology problem.

Optimization without an evaluation framework is fundamentally blind shooting — sometimes you hit the target by luck, sometimes you miss, and either way you still don't know where to aim next time.

The evaluation framework doesn't answer "how to optimize." It answers "how to know the optimization worked."

1. What Evaluation Needs to Solve

Three core tensions appear in any production-grade RAG system:

Tension 1: No baseline for optimization

Changed the chunking strategy, and miss rate dropped from 60% to 55%. Is that 5-point improvement from the chunking change, or was this batch of test data just easier? Without a fixed golden test set, you can't rule out data variance.

Tension 2: Problem location is unclear

The same "missed detection" symptom could come from: a chunking step that truncated a clause, a retrieval threshold set too high, or a prompt that doesn't handle vague language. Without layered metrics, looking at a single "miss rate" number gives you no idea which layer to start fixing.

Tension 3: Iteration has no closed loop

One round of optimization worked well — shipped it. Two months later, a manufacturing industry client was onboarded and miss rate climbed back to 50%. Because the test set was never updated, the evaluation baseline drifted from the business reality, and the system degraded on new scenarios without anyone noticing.

These three tensions define exactly what the evaluation framework needs to do: golden test set (fixed baseline) + three-tier metrics (layered diagnosis) + regression gate (closed-loop guarantee).

2. The Golden Test Set: The Foundation of the Evaluation Framework

The foundation of the entire evaluation framework is a fixed, human-annotated golden test set.

Construction method:

We sampled 10 representative reports from real business documents, covering three industries (manufacturing, financial services, energy). Human annotators labeled 80 rules across Environmental, Social, and Governance categories. Each annotation includes "the content that should be retrieved" and "the correct judgment conclusion" — this is the ground truth.

golden_test_set = [
    {
        "query": "GRI 305-1 direct greenhouse gas emissions",
        "expected_chunks": ["chunk_245", "chunk_246"],  # chunks that must be retrieved
        "expected_result": "Fully Met",
        "industry": "manufacturing",
        "clause_type": "Environmental"
    },
    {
        "query": "GRI 306-3 significant spill incidents",
        "expected_chunks": ["chunk_162", "chunk_163"],
        "expected_result": "Partially Met",
        "missing_elements": ["spill volume"],
        "industry": "chemical",
        "clause_type": "Environmental"
    },
    # 80 entries total...
]
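One quick sanity check on a set like this is to count entries per industry and clause type, so coverage gaps show up before any evaluation runs. A minimal sketch, assuming entries shaped like the ones above; `coverage_summary` is a hypothetical helper and not part of the repository:

```python
from collections import Counter

def coverage_summary(test_set: list[dict]) -> dict[str, Counter]:
    """Count golden-set entries per industry and per clause type."""
    return {
        "industry": Counter(case["industry"] for case in test_set),
        "clause_type": Counter(case["clause_type"] for case in test_set),
    }

# Two illustrative entries in the same shape as the set above
sample = [
    {"query": "GRI 305-1 direct greenhouse gas emissions",
     "industry": "manufacturing", "clause_type": "Environmental"},
    {"query": "GRI 306-3 significant spill incidents",
     "industry": "chemical", "clause_type": "Environmental"},
]

summary = coverage_summary(sample)
print(summary["industry"])
print(summary["clause_type"])
```

Running this over the full set makes it easy to see which industries or categories are thin before trusting a metric computed on them.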

Why human annotation is irreplaceable:

Someone proposed using GPT-4 to auto-generate annotations to save time. This approach has a fundamental flaw: the standard used to evaluate the system cannot be generated by the system itself.

If GPT-4's annotations contain biases, the "92% accuracy" measured against that standard is meaningless — the system has only learned to make the same mistakes as GPT-4.

Human annotation is a one-time cost (approximately 2 weeks). What it buys is a trustworthy evaluation baseline. The credibility of that baseline is the prerequisite for everything else in the evaluation framework.

Ongoing test set updates:

The test set is not a one-time artifact. We add 10–20 new annotated entries per month from new documents. Two reasons:

Business scenarios evolve — new industries, updated GRI standards, new linguistic patterns
If the test set goes stale, the system may overfit to old data and degrade on new inputs without triggering any alerts

Update triggers:

New industry client onboarded (need coverage of new industry terminology)
User reports a missed detection (new miss case added to test set)
Accuracy drops below threshold (signals the current test set no longer covers the real problem space)

3. Three-Tier Metrics

With a golden test set in place, a layered metrics system is needed to pinpoint which layer a problem lives in.

Design logic: good quality but poor efficiency isn't acceptable. Good quality and good efficiency but users aren't satisfied isn't acceptable either. The three tiers answer three different questions.

Quality tier — Is the system accurate?

Metric	Definition	Why it matters
Miss rate	% of clauses that should be detected but weren't	Core metric — directly affects compliance risk
False positive rate	% of clauses incorrectly judged as satisfied	False confidence — makes companies think they're compliant when they're not
Precision (Top1)	% of cases where the first retrieved result is correct	Measures retrieval precision
Recall (Top3)	% of cases where the correct answer appears in Top 3	Measures retrieval coverage
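The quality-tier definitions above can be sketched as a single pass over evaluated golden-set cases. This is an illustration under stated assumptions, not the repository's implementation: the field names `retrieved_chunks` and `predicted_result` are hypothetical, every rate uses the total number of cases as its denominator, a "miss" is taken to mean an annotated Fully or Partially Met clause judged Not Met, and a "false positive" is a Fully Met judgment that the annotation does not support.

```python
def quality_metrics(cases: list[dict]) -> dict[str, float]:
    """Compute quality-tier metrics from evaluated golden-set cases.

    Assumed (hypothetical) fields per case:
      expected_chunks  - chunk ids the annotators marked as required
      retrieved_chunks - chunk ids returned by retrieval, best first
      expected_result  - annotated judgment, e.g. "Fully Met"
      predicted_result - the system's judgment
    """
    n = len(cases)
    top1 = sum(
        1 for c in cases
        if c["retrieved_chunks"] and c["retrieved_chunks"][0] in c["expected_chunks"]
    )
    top3 = sum(
        1 for c in cases
        if set(c["retrieved_chunks"][:3]) & set(c["expected_chunks"])
    )
    # Miss: the clause is at least partially met, but the system said Not Met
    misses = sum(
        1 for c in cases
        if c["expected_result"] != "Not Met" and c["predicted_result"] == "Not Met"
    )
    # False positive: judged Fully Met when the annotation says otherwise
    false_pos = sum(
        1 for c in cases
        if c["predicted_result"] == "Fully Met" and c["expected_result"] != "Fully Met"
    )
    return {
        "precision_top1": top1 / n,
        "recall_top3": top3 / n,
        "miss_rate": misses / n,
        "false_positive_rate": false_pos / n,
    }

# Four made-up cases, for illustration only
cases = [
    {"expected_chunks": ["chunk_245", "chunk_246"], "retrieved_chunks": ["chunk_245", "chunk_9", "chunk_8"],
     "expected_result": "Fully Met", "predicted_result": "Fully Met"},
    {"expected_chunks": ["chunk_162", "chunk_163"], "retrieved_chunks": ["chunk_7", "chunk_162", "chunk_3"],
     "expected_result": "Partially Met", "predicted_result": "Fully Met"},
    {"expected_chunks": ["chunk_50"], "retrieved_chunks": ["chunk_1", "chunk_2", "chunk_3"],
     "expected_result": "Fully Met", "predicted_result": "Not Met"},
    {"expected_chunks": ["chunk_60"], "retrieved_chunks": ["chunk_60"],
     "expected_result": "Not Met", "predicted_result": "Not Met"},
]
metrics = quality_metrics(cases)
print(metrics)
```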

Efficiency tier — Is the system fast?

Metric	Definition	Target
Chunking time	Time to complete chunking for one report	< 5 minutes
Retrieval latency	Latency per rule retrieval	< 100ms
End-to-end latency	Total time from upload to report delivery	< 2 hours

Business tier — Is the system useful?

Metric	Definition	Target
User satisfaction	% of companies that accept the detection conclusions	> 90%
Manual review rate	% of conclusions requiring human intervention	< 15%
Remediation clarity rate	% of companies that can act directly on the report	> 85%

How to use the three tiers:

Don't look at all metrics every time. Route to the relevant tier based on the symptom:

User reports "conclusions are wrong"  → Quality tier (miss rate / false positive rate)
User reports "system is too slow"     → Efficiency tier (retrieval latency / chunking time)
User reports "report isn't useful"    → Business tier (remediation clarity rate)

4. Three Rounds of Evaluation-Driven Iteration

With a golden test set and three-tier metrics, optimization shifts from blind shooting to targeted improvement.

Round 1: Chunking strategy

Quality tier shows miss rate at 60%. Three-level verification (from Part 5) locates the issue in the chunking layer.

Controlled test across three chunking strategies:

Strategy	Miss rate	Precision	Recall	Issue
Fixed 512 tokens	60%	85%	40%	Cross-paragraph clauses truncated
Fixed 1024 tokens	50%	87%	52%	Less truncation, but irrelevant content introduced
Semantic chunking (paragraph boundary)	38%	92%	62%	✅ Optimal

Why does semantic chunking win?

The problem with fixed-token chunking is "not knowing where to cut." A GRI clause may span two natural paragraphs. Fixed chunking may split it in the middle — retrieval then surfaces only half the clause. The similarity score looks acceptable (0.65–0.70), but the content is incomplete.

Semantic chunking splits at paragraph and section boundaries, recognizing headings, paragraphs, and list structures to preserve semantic integrity. Average chunk size drops from 512 to 280 tokens — but every chunk is a complete semantic unit.
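The core idea, never cutting inside a paragraph, can be sketched in a few lines. This is a simplified illustration rather than the chunking service itself: it splits only on blank lines (no heading or list detection), approximates token counts with whitespace-separated words, and keeps an oversized paragraph whole instead of splitting it. The function name `chunk_by_paragraph` is hypothetical.

```python
import re

def chunk_by_paragraph(text: str, max_tokens: int = 512) -> list[str]:
    """Split on blank lines, then pack whole paragraphs into chunks.

    A chunk is closed when adding the next paragraph would exceed
    max_tokens. Token counts are approximated by word counts.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        para_len = len(para.split())
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks

demo = "A B C\n\nD E\n\nF G H I"
demo_chunks = chunk_by_paragraph(demo, max_tokens=5)
print(demo_chunks)
```

With `max_tokens=5`, the first two paragraphs fit together and the third starts a new chunk, so no paragraph is ever cut in the middle.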

Round 2: Retrieval strategy

After chunking optimization, miss rate dropped to 38%. Quality tier shows Top1 precision still has room to improve. Issue located in the retrieval layer.

Top K calibration:

Top1: precision 85%, but miss rate high (relevant content at rank 2–3 gets missed)
Top3: precision 92%, miss rate drops to 38%  ← optimal
Top5: precision 92%, but more noise introduced — LLM judgment degrades

Similarity threshold calibration (tested against the 80 golden test set rules):

Threshold 0.7: precision 70% (too low — too much irrelevant content retrieved)
Threshold 0.8: precision 92%  ← optimal
Threshold 0.9: precision 93%, but recall drops 15% (too high — relevant content missed)

Among the tested thresholds, 0.8 gives the best balance between precision and recall: moving to 0.9 gains one point of precision but costs 15% of recall.
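A calibration like this amounts to sweeping the threshold over scored candidates and recomputing precision and recall each time. A minimal sketch, assuming each candidate is a `(similarity, is_relevant)` pair where relevance comes from the golden-set annotation; the function name and the sample scores are made up for illustration and are not the measurements reported above:

```python
def sweep_thresholds(
    scored: list[tuple[float, bool]], thresholds: list[float]
) -> dict[float, tuple[float, float]]:
    """Return {threshold: (precision, recall)} over scored candidates."""
    total_relevant = sum(1 for _, rel in scored if rel)
    out: dict[float, tuple[float, float]] = {}
    for t in thresholds:
        # Keep only candidates at or above the similarity threshold
        kept = [rel for sim, rel in scored if sim >= t]
        tp = sum(kept)
        precision = tp / len(kept) if kept else 0.0
        recall = tp / total_relevant if total_relevant else 0.0
        out[t] = (precision, recall)
    return out

# Illustrative scores only
scored = [(0.95, True), (0.85, True), (0.82, False),
          (0.75, True), (0.72, False), (0.71, False)]
sweep = sweep_thresholds(scored, [0.7, 0.8, 0.9])
for t, (p, r) in sweep.items():
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}")
```

Raising the threshold trades recall for precision, which is why the choice has to be made against a fixed test set rather than by inspection.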

Round 3: Prompt optimization

After chunking and retrieval optimization, miss rate dropped to 25%. But two recurring problem types kept appearing in the quality tier metrics:

Problem 1: Vague language miss

Report states: "Some suppliers have completed ESG assessments." The prompt had no rule for handling vague language — the model returned "Not Met" (no specific numbers). The correct answer is "Partially Met."

Fix — added rule to prompt:

"If the report uses vague language (some / a portion / most), classify as Partially Met and note 'specific figures/percentages required' — do not classify as Not Met."

Problem 2: Cross-chapter miss

Scope 3 emissions data across 11 categories was distributed across different chapters. A single retrieval only surfaced one chapter. The model returned "incomplete."

Fix — added rule to prompt:

"If the same element appears across multiple chapters, evaluate holistically. Do not mark an element as missing because a single paragraph is incomplete."

Cumulative effect across three rounds:

$$\text{Miss rate: } 60\% \xrightarrow{\text{chunking}} 38\% \xrightarrow{\text{retrieval}} 25\% \xrightarrow{\text{prompt}} 7\%$$

$$\text{Accuracy: } 85\% \xrightarrow{\text{three rounds}} 93\%$$

5. A Counterintuitive Finding

After semantic chunking, average chunk size dropped from 512 tokens to 280 tokens — chunks got smaller.

Intuitively, smaller chunks mean more fragmented information, which should make retrieval harder.

The actual result: smaller chunks produced better retrieval performance and lower cost.

Why:

The problem with fixed 512-token chunking wasn't "chunks too small" — it was "cutting in the wrong place." A GRI clause spanning two paragraphs gets split in the middle. Retrieval surfaces half a clause. Similarity looks okay (0.65–0.70), but the content is incomplete.

Semantic chunking splits at paragraph boundaries. Each chunk is only 280 tokens, but it's a complete clause. Top3 is sufficient — Top5 is no longer needed.

Quantified comparison:

Metric	Fixed 512 tokens	Semantic 280 tokens
Average chunk size	512 tokens	280 tokens
Top K required	Top5	Top3
Token consumption	Baseline	-30%
Miss rate	60%	38%

Quality improved and cost dropped simultaneously. This is the most important counterintuitive finding the evaluation framework surfaced. Without controlled testing, you'd never discover that "smaller chunks" actually perform better.
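A rough estimate of retrieved context per query, using only the Top K and average chunk sizes from the table, shows where the saving comes from. This sketch assumes context size is simply Top K times average chunk size; `context_tokens` is a hypothetical helper.

```python
def context_tokens(top_k: int, avg_chunk_tokens: int) -> int:
    """Estimate retrieved context size per query as top_k * average chunk size."""
    return top_k * avg_chunk_tokens

fixed = context_tokens(top_k=5, avg_chunk_tokens=512)
semantic = context_tokens(top_k=3, avg_chunk_tokens=280)
reduction = 1 - semantic / fixed
print(f"{fixed} -> {semantic} tokens ({reduction:.0%} less retrieved context)")
```

Retrieved context is only one part of each request. The prompt template, the rule text, and the model output are not affected by chunking, which would be consistent with the overall token consumption figure in the table being smaller than this estimate.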

6. Regression Gate: Ensuring Optimization Never Goes Backward

The final safeguard in the evaluation framework is the regression gate — every change must pass the golden test set before it can go to production.

Gate logic:

# Allowed drop before a release is blocked. Tightens as the test set grows.
REGRESSION_THRESHOLD = 0.02  # current: 2% (was 5% early on)

# run_golden_test_set, get_production_metrics, and trigger_rollback are
# assumed to be defined elsewhere in the evaluation service.

def regression_gate(change_type: str, new_config: dict) -> bool:
    """
    change_type: prompt_update / chunk_strategy / retrieval_params
    Returns True = approved for release, False = blocked
    """
    # Run golden test set
    results = run_golden_test_set(new_config)

    # Compare against current production metrics
    baseline = get_production_metrics()

    # Core metric drops beyond threshold → block release
    if results["accuracy"] < baseline["accuracy"] - REGRESSION_THRESHOLD:
        trigger_rollback(
            reason=f"Accuracy dropped {baseline['accuracy'] - results['accuracy']:.1%}"
        )
        return False

    if results["recall_rate"] < baseline["recall_rate"] - REGRESSION_THRESHOLD:
        trigger_rollback(
            reason=f"Recall dropped {baseline['recall_rate'] - results['recall_rate']:.1%}"
        )
        return False

    return True

Why did the threshold tighten from 5% to 2%?

Early on, the test set had only 30 entries. Statistical noise was high — a 5% swing could just be sample variance. As the test set grew to 80 entries, statistical significance improved. A 2% drop now reliably signals real degradation rather than noise.
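It can help to translate each percentage threshold into a number of test cases. The following sketch uses only the two test set sizes and thresholds mentioned above; `cases_for_threshold` is a hypothetical helper.

```python
def cases_for_threshold(n_cases: int, threshold: float) -> float:
    """How many flipped test cases it takes to move accuracy by `threshold`."""
    return n_cases * threshold

for n, thr in [(30, 0.05), (80, 0.02)]:
    per_case = 1 / n
    needed = cases_for_threshold(n, thr)
    print(f"n={n}: one case moves accuracy by {per_case:.2%}; "
          f"a {thr:.0%} threshold equals {needed:.1f} cases")
```

In both configurations a single flipped case stays under the threshold and two flipped cases exceed it, so tightening the percentage as the set grows keeps the gate's sensitivity roughly constant in absolute terms.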

Regression gate trigger log:

Over 8 months in production, the regression gate triggered 3 times:

A prompt update introduced a new vague language rule, but caused a 3.2% accuracy drop on a specific class of precise disclosures — blocked, revised, re-released
A chunking parameter change (chunk_size from 2000 to 1500) caused a 2.8% recall drop — blocked
A similarity threshold change from 0.8 to 0.75 caused a 4.1% false positive rate increase — blocked

These 3 blocks prevented 3 production regressions.
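The third block in this log concerned false positive rate, a metric where an increase is the regression, while the gate logic shown earlier compares accuracy and recall. One way to cover both directions is a single comparison that knows which metrics are lower-is-better. The sketch below is an assumption about how that could look, not the repository's code; `find_regressions` and the baseline numbers are hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    threshold: float = 0.02,
    lower_is_better: tuple[str, ...] = ("false_positive_rate", "miss_rate"),
) -> dict[str, float]:
    """Return metrics that worsened by more than `threshold`, with the amount."""
    regressions: dict[str, float] = {}
    for name, base_value in baseline.items():
        delta = candidate[name] - base_value
        # For lower-is-better metrics an increase is the regression
        worsening = delta if name in lower_is_better else -delta
        if worsening > threshold:
            regressions[name] = worsening
    return regressions

# Illustrative values only
baseline = {"accuracy": 0.93, "recall_rate": 0.90, "false_positive_rate": 0.050}
candidate = {"accuracy": 0.93, "recall_rate": 0.91, "false_positive_rate": 0.091}
blocked = find_regressions(baseline, candidate)
print(blocked)
```

An empty result means the change may be released; any entry names the metric that should block it.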

7. Cost Optimization: Cache + Evaluation-Guided Direction

The evaluation framework doesn't just improve quality — it also points to where cost optimization is possible.

Efficiency tier metrics flagged API call costs as elevated. Two optimization opportunities identified:

Optimization 1: Redis cache for high-frequency retrieval results

The GRI rule library is relatively static (updated annually). Retrieval results for the same rule can be reused:

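import hashlib
import json

# redis_client and embedding_search are assumed to be defined elsewhere in the service.

def cached_embedding_search(query: str, gri_code: str) -> list:
    # Built-in hash() is salted per process for strings, so it would produce
    # different keys after a restart or on another worker. Use a stable digest.
    query_digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cache_key = f"search:{gri_code}:{query_digest}"

    # Check cache first
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: call the Embedding API
    results = embedding_search(query)

    # Write to cache, TTL 24 hours (business data updates daily)
    redis_client.setex(cache_key, 86400, json.dumps(results))
    return results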

Cache hit rate: 60%, meaning 60% of retrieval calls were served from Redis without calling the Embedding API.
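A hit rate like this has to be measured somewhere. A minimal sketch of the bookkeeping, using an in-memory dictionary as a stand-in for Redis so the example runs on its own; `CountingCache` is a hypothetical class, and the query sequence is invented for illustration:

```python
from typing import Callable

class CountingCache:
    """In-memory stand-in for a cache that also tracks its hit rate."""

    def __init__(self) -> None:
        self._store: dict[str, list] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], list]) -> list:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Miss: compute the value once and remember it
        self.misses += 1
        value = compute()
        self._store[key] = value
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = CountingCache()
queries = ["GRI 305-1", "GRI 306-3", "GRI 305-1", "GRI 305-1", "GRI 306-3"]
for q in queries:
    cache.get_or_compute(f"search:{q}", lambda q=q: [f"result for {q}"])
print(f"hit rate: {cache.hit_rate:.0%}")
```

Here two distinct queries are computed once each and the three repeats are served from the cache, giving three hits out of five lookups.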

Optimization 2: The cost side effect of chunking optimization

Semantic chunking reduced average chunk size from 512 to 280 tokens. Top K dropped from 5 to 3. Token consumption per retrieval dropped by nearly half.

Both optimizations combined: total token consumption fell by more than 30%, while accuracy rose by 8 percentage points (85% to 93%).

8. The Continuous Iteration Loop

The final form of the evaluation framework is a continuously running loop — not a one-time optimization project:

System running in production
    ↓
Collect problem signals
(user-reported misses / accuracy drops / new industry documents)
    ↓
Classify and archive
(chunking issue / retrieval issue / prompt issue)
    ↓
Update golden test set
(add test cases matching the new problem type)
    ↓
Update few-shot examples
(add new miss cases to prompt examples)
    ↓
Targeted optimization
(chunking strategy / retrieval parameters / prompt rules)
    ↓
Regression gate validation
(run golden test set — must pass before release)
    ↓
Release → continue collecting problem signals

Loop trigger conditions (any one triggers the loop):

Trigger	Action
New industry client onboarded	Add industry-specific test cases; verify existing strategies apply
User reports missed detection	Archive miss case, add to test set, update few-shot examples
Accuracy drops > 2%	Trigger three-level verification, locate degraded layer, targeted fix
Monthly routine evaluation	Update test set with new documents, re-run full metrics
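The trigger table can be read as a check that runs on a schedule and returns the actions that are due. A small sketch under stated assumptions: the function name and parameters are hypothetical, "monthly" is approximated as 30 days, and the action strings are shortened from the table.

```python
def due_actions(
    new_industry_onboarded: bool,
    reported_misses: int,
    accuracy_drop: float,
    days_since_last_eval: int,
) -> list[str]:
    """Return the loop actions whose trigger condition currently holds."""
    actions: list[str] = []
    if new_industry_onboarded:
        actions.append("add industry-specific test cases")
    if reported_misses > 0:
        actions.append("archive miss cases and add them to the test set")
    if accuracy_drop > 0.02:
        actions.append("run three-level verification")
    if days_since_last_eval >= 30:
        actions.append("run monthly evaluation")
    return actions

actions = due_actions(
    new_industry_onboarded=True,
    reported_misses=0,
    accuracy_drop=0.03,
    days_since_last_eval=12,
)
print(actions)
```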

Core value of the loop: the system cannot silently degrade due to business changes. Every regression is caught by metrics. Every fix is validated by data. Every release is gated by the regression check.

9. Closing: What the Evaluation Framework Really Is

Looking back at the engineering journey across this series:

Part 1 solved "how to turn documents into a searchable knowledge base"
Part 2 solved "how to chunk without destroying semantic structure"
Part 3 solved "how to retrieve accurately in domain-specific terminology scenarios"
Part 4 solved "how to produce quantifiable conclusions from retrieval results"
Part 5 solved "how to identify root cause in 5 minutes when something breaks"
Part 6 solved "how to make all of the above continuously improve"

Without an evaluation framework, the optimizations in the first five layers are one-time events. The day of launch is the system's peak performance — after that, it can only degrade as business conditions change.

With an evaluation framework, the system gains the ability to improve itself:

$$\text{Golden test set (fixed baseline)} + \text{Three-tier metrics (layered diagnosis)} + \text{Regression gate (closed-loop guarantee)}$$

$$= \text{Turning RAG optimization from intuition into engineering}$$

This evaluation loop applies to any production-grade LLM system. The only things you replace are the content of the golden test set (swap GRI clauses for legal statutes / medical guidelines / financial regulations) and the business tier metric definitions. The three-tier metrics structure, the regression gate logic, the continuous test set update mechanism — these are universal engineering practices, independent of any specific business domain.

Source Code

The complete implementation for all six parts is available here:

👉 github.com/muzinan123/production-rag-engineering

Relevant files for this part:

esg/services/evaluation_service.py — golden dataset evaluation (score_hit + score_find)
esg/routers/evaluation.py — evaluation API entry point

Full series index:

Part 1 — Ingestion Pipeline: esg/services/loading_service.py, parsing_service.py
Part 2 — Chunking Service: esg/services/chunking_service.py
Part 3 — Hybrid Retrieval: esg/services/embedding_service.py, search_service.py
Part 4 — Judgment Engine: esg/services/generation_service.py
Part 5 — Full-Chain Traceability: esg/services/embedding_service.py, routers/evaluation.py
Part 6 — Evaluation & Iteration: esg/services/evaluation_service.py, routers/evaluation.py

This completes the full six-layer breakdown of the production RAG system. From data ingestion to evaluation loop, every layer has explicit engineering decisions and quantifiable outcomes. This methodology has been validated across three different industry scenarios. Whether your domain is legal contracts, financial audits, or medical records — if you need knowledge-based decisions that are traceable, high-precision, and auditable, this architecture is your production-grade baseline.

Top comments (2)

Max Quimby • Jun 21

The test-set-drift story (new client onboards, miss rate climbs back to 50%, baseline silently lied) is the part most teams underestimate, and it's the one I'd lead with. A golden set is a snapshot of a traffic distribution, and the moment that distribution moves — new doc type, new domain, new phrasing — your "fixed" baseline stops representing reality while still reporting green. What's worked for us is making the golden set a living thing: continuously sample production queries into a candidate pool, then periodically promote the hard/novel ones (low-confidence retrievals, user-flagged misses) into the labeled set, so the baseline tracks the business instead of drifting from it. Your three-tier metrics point is the other underrated half — separating chunking recall from retrieval recall from generation faithfulness is the only way to attribute a delta to a layer instead of chasing a single global miss-rate number that could move for three unrelated reasons. How do you size the golden set against labeling cost, and do you weight it by business-criticality of query types or keep it uniform?

James Lee • Jun 22 • Edited

The silent drift is exactly what bit us — a manufacturing client onboarded in month 3, miss rate quietly climbed from 7% back to 19% before the monthly eval caught it.
Golden set had zero manufacturing terminology, so it kept reporting green.
On sizing: we split it into a stable core (80 cases, what the regression gate runs against) and a growing expansion pool fed by low-confidence retrievals (similarity 0.65–0.75) and user-flagged misses. Labeling budget goes to the decision boundary, not the easy cases — that's where the ROI is.
On weighting: not uniform. Environmental clauses are overrepresented (~40% vs. ~30% of actual traffic) because a false negative there carries regulatory risk. We accept the bias deliberately — it matches the client's risk profile.