This article covers the sixth and final layer of the full-stack architecture: the Evaluation & Iteration Loop. Without it, every optimization in the previous five layers is a one-time event. Core engineering value: turning "feels better" into "data proves it's better" — giving the system the ability to improve itself.
📦 Source code: production-rag-engineering (esg/services/evaluation_service.py, esg/routers/evaluation.py)
0. The Pain Point
After all five layers went live, the system ran for two weeks. Miss rate was still 60%.
The team started tuning parameters: chunk size from 512 to 1024 — miss rate dropped a little, but no one knew whether it was the chunking change or something else. Then the similarity threshold from 0.8 to 0.7 — miss rate changed again, but false positive rate went up at the same time.
After every change: no idea what actually improved, by how much, or why.
This isn't a technical problem. It's an engineering methodology problem.
Optimization without an evaluation framework is shooting blind: sometimes you hit the target by luck, sometimes you miss, and either way you still don't know where to aim next time.
The evaluation framework doesn't answer "how to optimize." It answers "how to know the optimization worked."
1. What Evaluation Needs to Solve
Three core tensions appear in any production-grade RAG system:
Tension 1: No baseline for optimization
Changed the chunking strategy, and miss rate dropped from 60% to 55%. Is that 5-point improvement from the chunking change, or was this batch of test data just easier? Without a fixed golden test set, you can't rule out data variance.
Tension 2: Problem location is unclear
The same "missed detection" symptom could come from: a chunking step that truncated a clause, a retrieval threshold set too high, or a prompt that doesn't handle vague language. Without layered metrics, looking at a single "miss rate" number gives you no idea which layer to start fixing.
Tension 3: Iteration has no closed loop
One round of optimization worked well — shipped it. Two months later, a manufacturing industry client was onboarded and miss rate climbed back to 50%. Because the test set was never updated, the evaluation baseline drifted from the business reality, and the system degraded on new scenarios without anyone noticing.
These three tensions define exactly what the evaluation framework needs to do: golden test set (fixed baseline) + three-tier metrics (layered diagnosis) + regression gate (closed-loop guarantee).
2. The Golden Test Set: The Foundation of the Evaluation Framework
The foundation of the entire evaluation framework is a fixed, human-annotated golden test set.
Construction method:
We sampled 10 representative reports from real business documents, covering three industries (manufacturing, financial services, energy). Human annotators labeled 80 rules across Environmental, Social, and Governance categories. Each annotation includes "the content that should be retrieved" and "the correct judgment conclusion" — this is the ground truth.
golden_test_set = [
    {
        "query": "GRI 305-1 direct greenhouse gas emissions",
        "expected_chunks": ["chunk_245", "chunk_246"],  # chunks that must be retrieved
        "expected_result": "Fully Met",
        "industry": "manufacturing",
        "clause_type": "Environmental",
    },
    {
        "query": "GRI 306-3 significant spill incidents",
        "expected_chunks": ["chunk_162", "chunk_163"],
        "expected_result": "Partially Met",
        "missing_elements": ["spill volume"],
        "industry": "chemical",
        "clause_type": "Environmental",
    },
    # 80 entries total...
]
Why human annotation is irreplaceable:
Someone proposed using GPT-4 to auto-generate annotations to save time. This approach has a fundamental flaw: the standard used to evaluate the system cannot be generated by the system itself.
If GPT-4's annotations contain biases, the "92% accuracy" measured against that standard is meaningless — the system has only learned to make the same mistakes as GPT-4.
Human annotation is a one-time cost (approximately 2 weeks). What it buys is a trustworthy evaluation baseline. The credibility of that baseline is the prerequisite for everything else in the evaluation framework.
Ongoing test set updates:
The test set is not a one-time artifact. We add 10–20 new annotated entries per month from new documents. Two reasons:
- Business scenarios evolve — new industries, updated GRI standards, new linguistic patterns
- If the test set goes stale, the system may overfit to old data and degrade on new inputs without triggering any alerts
Update triggers:
- New industry client onboarded (need coverage of new industry terminology)
- User reports a missed detection (new miss case added to test set)
- Accuracy drops below threshold (signals the current test set no longer covers the real problem space)
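The update flow above can be sketched as a small helper that appends a newly annotated case and reports coverage per industry. This is a minimal sketch, not the repository's implementation: `GoldenCase`, `add_case`, and `coverage_by_industry` are hypothetical names, and the fields simply mirror the dictionary example shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One human-annotated entry (hypothetical type mirroring the dict example)."""
    query: str
    expected_chunks: tuple[str, ...]
    expected_result: str
    industry: str
    clause_type: str
    missing_elements: tuple[str, ...] = ()


def add_case(test_set: list[GoldenCase], case: GoldenCase) -> bool:
    """Append a case unless the same query/industry pair is already covered."""
    seen = {(c.query, c.industry) for c in test_set}
    if (case.query, case.industry) in seen:
        return False
    test_set.append(case)
    return True


def coverage_by_industry(test_set: list[GoldenCase]) -> dict[str, int]:
    """Count cases per industry to spot under-covered segments."""
    counts: dict[str, int] = {}
    for c in test_set:
        counts[c.industry] = counts.get(c.industry, 0) + 1
    return counts
```

Checking coverage after onboarding a new industry client is one way to see whether the monthly additions actually reach the new terminology.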
3. Three-Tier Metrics
With a golden test set in place, a layered metrics system is needed to pinpoint which layer a problem lives in.
Design logic: good quality but poor efficiency isn't acceptable. Good quality and good efficiency but users aren't satisfied isn't acceptable either. The three tiers answer three different questions.
Quality tier — Is the system accurate?
| Metric | Definition | Why it matters |
|---|---|---|
| Miss rate | % of clauses that should be detected but weren't | Core metric — directly affects compliance risk |
| False positive rate | % of clauses incorrectly judged as satisfied | False confidence — makes companies think they're compliant when they're not |
| Precision (Top1) | % of cases where the first retrieved result is correct | Measures retrieval precision |
| Recall (Top3) | % of cases where the correct answer appears in Top 3 | Measures retrieval coverage |
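To make the four definitions concrete, here is a minimal sketch of how they might be computed from per-case results. The `CaseResult` type and `quality_metrics` function are hypothetical, and two interpretations are assumptions rather than the article's stated rules: a "miss" is taken to be a disclosed clause judged "Not Met", and a "false positive" is a "Fully Met" verdict on a clause annotated otherwise.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    expected_chunks: list[str]   # ground-truth chunk IDs from the golden test set
    retrieved_chunks: list[str]  # chunk IDs returned by retrieval, best first
    expected_result: str         # annotated conclusion, e.g. "Fully Met"
    predicted_result: str        # conclusion produced by the system


def quality_metrics(results: list[CaseResult]) -> dict[str, float]:
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    # Top1 precision: the first retrieved chunk is one of the expected chunks.
    top1 = sum(
        bool(r.retrieved_chunks) and r.retrieved_chunks[0] in r.expected_chunks
        for r in results
    )
    # Top3 recall: any expected chunk appears among the first three results.
    top3 = sum(
        any(c in r.expected_chunks for c in r.retrieved_chunks[:3]) for r in results
    )
    # Assumption: a miss is a disclosed clause that the system judged "Not Met".
    disclosed = [r for r in results if r.expected_result != "Not Met"]
    missed = sum(r.predicted_result == "Not Met" for r in disclosed)
    # Assumption: a false positive is "Fully Met" on a clause annotated otherwise.
    not_full = [r for r in results if r.expected_result != "Fully Met"]
    false_pos = sum(r.predicted_result == "Fully Met" for r in not_full)
    return {
        "top1_precision": top1 / n,
        "top3_recall": top3 / n,
        "miss_rate": missed / len(disclosed) if disclosed else 0.0,
        "false_positive_rate": false_pos / len(not_full) if not_full else 0.0,
    }
```

Keeping retrieval metrics (Top1, Top3) separate from judgment metrics (miss rate, false positive rate) is what lets a single run point at the retrieval layer or the prompt layer.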
Efficiency tier — Is the system fast?
| Metric | Definition | Target |
|---|---|---|
| Chunking time | Time to complete chunking for one report | < 5 minutes |
| Retrieval latency | Latency per rule retrieval | < 100ms |
| End-to-end latency | Total time from upload to report delivery | < 2 hours |
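A latency target is only useful if it is measured the same way every time. The sketch below times a callable and compares the 95th percentile against the 100 ms retrieval target from the table. Using p95 rather than the mean is an assumption (the table does not say which statistic the target refers to), and `measure_latency_ms` and `meets_target` are hypothetical helper names.

```python
import statistics
import time
from typing import Callable

RETRIEVAL_TARGET_MS = 100.0  # target from the efficiency-tier table


def measure_latency_ms(fn: Callable[[], object], runs: int = 20) -> list[float]:
    """Call fn repeatedly and return each wall-clock duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def meets_target(samples_ms: list[float], target_ms: float) -> bool:
    """Compare the 95th percentile to the target so slow outliers are not hidden."""
    p95 = statistics.quantiles(samples_ms, n=20)[-1]  # last of 19 cut points
    return p95 < target_ms
```

In use, `fn` would wrap one rule retrieval against a warmed-up index, for example `measure_latency_ms(lambda: search(rule))` where `search` stands in for whatever retrieval call the system exposes.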
Business tier — Is the system useful?
| Metric | Definition | Target |
|---|---|---|
| User satisfaction | % of companies that accept the detection conclusions | > 90% |
| Manual review rate | % of conclusions requiring human intervention | < 15% |
| Remediation clarity rate | % of companies that can act directly on the report | > 85% |
How to use the three tiers:
Don't look at all metrics every time. Route to the relevant tier based on the symptom:
User reports "conclusions are wrong" → Quality tier (miss rate / false positive rate)
User reports "system is too slow" → Efficiency tier (retrieval latency / chunking time)
User reports "report isn't useful" → Business tier (remediation clarity rate)
4. Three Rounds of Evaluation-Driven Iteration
With a golden test set and three-tier metrics, optimization shifts from shooting blind to targeted improvement.
Round 1: Chunking strategy
Quality tier shows miss rate at 60%. Three-level verification (from Part 5) locates the issue in the chunking layer.
Controlled test across three chunking strategies:
| Strategy | Miss rate | Precision | Recall | Issue |
|---|---|---|---|---|
| Fixed 512 tokens | 60% | 85% | 40% | Cross-paragraph clauses truncated |
| Fixed 1024 tokens | 50% | 87% | 52% | Less truncation, but irrelevant content introduced |
| Semantic chunking (paragraph boundary) | 38% | 92% | 62% | ✅ Optimal |
Why does semantic chunking win?
The problem with fixed-token chunking is "not knowing where to cut." A GRI clause may span two natural paragraphs. Fixed chunking may split it in the middle — retrieval then surfaces only half the clause. The similarity score looks acceptable (0.65–0.70), but the content is incomplete.
Semantic chunking splits at paragraph and section boundaries, recognizing headings, paragraphs, and list structures to preserve semantic integrity. Average chunk size drops from 512 to 280 tokens — but every chunk is a complete semantic unit.
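The idea of cutting only at structural boundaries can be sketched in a few lines. This is an illustration under stated assumptions, not the chunking service from Part 2: paragraphs are separated by blank lines, headings start with `#`, and token counts are approximated by whitespace-separated words. `paragraph_chunks` is a hypothetical name.

```python
import re


def paragraph_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Pack whole paragraphs into chunks, never splitting inside a paragraph.

    Token counts are approximated by word counts (an assumption; a real
    pipeline would use the embedding model's tokenizer).
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        para_len = len(para.split())
        is_heading = para.startswith("#")
        # Start a new chunk at a heading or when the budget would be exceeded.
        if current and (is_heading or current_len + para_len > max_tokens):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than `max_tokens` is kept whole here. That trades a strict size limit for intact clauses, which is the same trade-off the comparison table rewards.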
Round 2: Retrieval strategy
After chunking optimization, miss rate dropped to 38%. Quality tier shows Top1 precision still has room to improve. Issue located in the retrieval layer.
Top K calibration:
Top1: precision 85%, but miss rate high (relevant content at rank 2–3 gets missed)
Top3: precision 92%, miss rate drops to 38% ← optimal
Top5: precision 92%, but more noise introduced — LLM judgment degrades
Similarity threshold calibration (tested against the golden test set):
Threshold 0.7: precision 70% (too low — too much irrelevant content retrieved)
Threshold 0.8: precision 92% ← optimal
Threshold 0.9: precision 93%, but recall drops 15% (too high — relevant content missed)
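Among the tested values, 0.8 gives the best trade-off: raising the threshold to 0.9 buys 1 point of precision at the cost of a 15% drop in recall, while lowering it to 0.7 costs 22 points of precision.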
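A calibration like this amounts to a sweep: for each candidate threshold, keep the retrieved candidates at or above it and score what remains against the annotations. The sketch below shows that loop on synthetic data. The function names and the sample scores are illustrative assumptions and do not reproduce the numbers reported above.

```python
def precision_recall_at(
    scored: list[tuple[float, bool]], threshold: float
) -> tuple[float, float]:
    """scored holds (similarity, is_relevant) pairs for retrieved candidates."""
    kept = [relevant for similarity, relevant in scored if similarity >= threshold]
    total_relevant = sum(relevant for _, relevant in scored)
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


def sweep(
    scored: list[tuple[float, bool]], thresholds: list[float]
) -> dict[float, tuple[float, float]]:
    """Evaluate every candidate threshold on the same fixed candidate list."""
    return {t: precision_recall_at(scored, t) for t in thresholds}


# Synthetic candidates for illustration only.
SAMPLE = [(0.95, True), (0.85, True), (0.82, False),
          (0.75, True), (0.72, False), (0.71, False)]

for threshold, (precision, recall) in sweep(SAMPLE, [0.7, 0.8, 0.9]).items():
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
```

Running every threshold against the same fixed candidates is the point: the only variable that changes between rows is the threshold itself.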
Round 3: Prompt optimization
After chunking and retrieval optimization, miss rate dropped to 25%. But two recurring problem types kept appearing in the quality tier metrics:
Problem 1: Vague language miss
Report states: "Some suppliers have completed ESG assessments." The prompt had no rule for handling vague language — the model returned "Not Met" (no specific numbers). The correct answer is "Partially Met."
Fix — added rule to prompt:
"If the report uses vague language (some / a portion / most), classify as Partially Met and note 'specific figures/percentages required' — do not classify as Not Met."
Problem 2: Cross-chapter miss
Scope 3 emissions data across 11 categories was distributed across different chapters. A single retrieval only surfaced one chapter. The model returned "incomplete."
Fix — added rule to prompt:
"If the same element appears across multiple chapters, evaluate holistically. Do not mark an element as missing because a single paragraph is incomplete."
Cumulative effect across three rounds:
$$\text{Miss rate: } 60\% \xrightarrow{\text{chunking}} 38\% \xrightarrow{\text{retrieval}} 25\% \xrightarrow{\text{prompt}} 7\%$$
$$\text{Accuracy: } 85\% \xrightarrow{\text{three rounds}} 93\%$$
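The per-round contribution is easier to compare when each step is expressed both in absolute points and relative to the misses that remained before that round. The short calculation below uses only the miss rates from the summary above; `round_deltas` is a hypothetical helper.

```python
def round_deltas(stages: list[tuple[str, float]]) -> list[tuple[str, float, float]]:
    """For each round, return (name, absolute drop in points, relative drop)."""
    deltas = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        absolute = (before - after) * 100.0
        relative = (before - after) / before
        deltas.append((name, absolute, relative))
    return deltas


# Miss rates as reported in the cumulative-effect summary.
STAGES = [("baseline", 0.60), ("chunking", 0.38), ("retrieval", 0.25), ("prompt", 0.07)]

for name, absolute, relative in round_deltas(STAGES):
    print(f"{name:<10} -{absolute:.0f} points ({relative:.0%} relative)")
```

Read this way, chunking removes the most points (22), while the prompt round removes the largest share of the misses still left at that stage (18 of the remaining 25 points).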
5. A Counterintuitive Finding
After semantic chunking, average chunk size dropped from 512 tokens to 280 tokens — chunks got smaller.
Intuitively, smaller chunks mean more fragmented information, which should make retrieval harder.
The actual result: smaller chunks produced better retrieval performance and lower cost.
Why:
The problem with fixed 512-token chunking wasn't "chunks too small" — it was "cutting in the wrong place." A GRI clause spanning two paragraphs gets split in the middle. Retrieval surfaces half a clause. Similarity looks okay (0.65–0.70), but the content is incomplete.
Semantic chunking splits at paragraph boundaries. Each chunk is only 280 tokens, but it's a complete clause. Top3 is sufficient — Top5 is no longer needed.
Quantified comparison:
| Metric | Fixed 512 tokens | Semantic 280 tokens |
|---|---|---|
| Average chunk size | 512 tokens | 280 tokens |
| Top K required | Top5 | Top3 |
| Token consumption | Baseline | -30% |
| Miss rate | 60% | 38% |
Quality improved and cost dropped simultaneously. This is the most important counterintuitive finding the evaluation framework surfaced. Without controlled testing, you'd never discover that "smaller chunks" actually performs better.
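The cost side follows from simple arithmetic on the table's own figures. The sketch below estimates retrieved context per query as average chunk size times Top K. It covers only the retrieved chunks, which is an assumption about scope: instructions, rule text, and model output also consume tokens, which is one plausible reason the overall saving in the table (-30%) is smaller than the saving on retrieved context alone.

```python
def retrieved_context_tokens(avg_chunk_tokens: int, top_k: int) -> int:
    """Approximate tokens of retrieved context sent to the model per query."""
    return avg_chunk_tokens * top_k


fixed = retrieved_context_tokens(512, 5)     # fixed 512-token chunks, Top5
semantic = retrieved_context_tokens(280, 3)  # semantic chunks, Top3
reduction = 1 - semantic / fixed             # fraction of retrieved context saved

print(f"fixed: {fixed} tokens, semantic: {semantic} tokens, saved: {reduction:.0%}")
```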
6. Regression Gate: Ensuring Optimization Never Goes Backward
The final safeguard in the evaluation framework is the regression gate — every change must pass the golden test set before it can go to production.
Gate logic:
# Maximum tolerated drop in a core metric; tightens as the test set grows
REGRESSION_THRESHOLD = 0.02  # current: 2% (was 5% early on)


def regression_gate(change_type: str, new_config: dict) -> bool:
    """
    change_type: prompt_update / chunk_strategy / retrieval_params
    Returns True = approved for release, False = blocked
    """
    # run_golden_test_set, get_production_metrics and trigger_rollback
    # are project helpers that are not shown in this excerpt.
    # Run golden test set
    results = run_golden_test_set(new_config)
    # Compare against current production metrics
    baseline = get_production_metrics()
    # Core metric drops beyond threshold: block release
    if results["accuracy"] < baseline["accuracy"] - REGRESSION_THRESHOLD:
        trigger_rollback(
            reason=f"Accuracy dropped {baseline['accuracy'] - results['accuracy']:.1%}"
        )
        return False
    if results["recall_rate"] < baseline["recall_rate"] - REGRESSION_THRESHOLD:
        trigger_rollback(
            reason=f"Recall dropped {baseline['recall_rate'] - results['recall_rate']:.1%}"
        )
        return False
    return True
Why did the threshold tighten from 5% to 2%?
Early on, the test set had only 30 entries, so a single flipped case moved accuracy by about 3.3 points and a 5% swing could just be sample variance. With 80 entries, one case is worth 1.25 points, so a drop of more than 2% means at least two cases regressed. That is much more likely to be real degradation than noise.
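The link between test set size and a defensible threshold can be checked with two small calculations: the binomial standard error of an accuracy estimate, and the number of regressed cases needed to exceed a threshold. This is a rough guide under an independence assumption. The gate compares two configurations on the same cases, so the real noise depends on how many individual cases flip, and both function names are hypothetical.

```python
import math


def accuracy_std_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent cases."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def cases_to_trip(threshold: float, n: int) -> int:
    """Smallest number of net regressed cases whose accuracy drop exceeds threshold."""
    # The small epsilon guards against floating-point error when threshold * n
    # is meant to be a whole number.
    return math.floor(threshold * n + 1e-9) + 1


for n, threshold in [(30, 0.05), (80, 0.02)]:
    se = accuracy_std_error(0.93, n)
    print(f"n={n}: std error {se:.1%}, gate trips at {cases_to_trip(threshold, n)} cases")
```

At 93% accuracy the standard error is still close to 3% with 80 entries, which is an argument for continuing to grow the test set before tightening the threshold further.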
Regression gate trigger log:
Over 8 months in production, the regression gate triggered 3 times:
- A prompt update introduced a new vague language rule, but caused a 3.2% accuracy drop on a specific class of precise disclosures — blocked, revised, re-released
- A chunking parameter change (chunk_size from 2000 to 1500) caused a 2.8% recall drop — blocked
- A similarity threshold change from 0.8 to 0.75 caused a 4.1% false positive rate increase — blocked
These 3 blocks prevented 3 production regressions.
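The third block in the log was triggered by false positive rate, a metric where an increase is the regression. One way to handle both directions in a single gate is to describe each metric with a small rule and compare candidate against baseline. The sketch below is hypothetical: `MetricRule`, `RULES`, and `find_regressions` are not taken from the repository, and the tolerances are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    name: str
    higher_is_better: bool
    tolerance: float  # maximum allowed regression, as a fraction


# Hypothetical rule set; names and tolerances are illustrative.
RULES = (
    MetricRule("accuracy", True, 0.02),
    MetricRule("recall_rate", True, 0.02),
    MetricRule("false_positive_rate", False, 0.02),
)


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    rules: tuple[MetricRule, ...] = RULES,
) -> list[str]:
    """Return a human-readable reason for every metric that regressed too far."""
    reasons = []
    for rule in rules:
        delta = candidate[rule.name] - baseline[rule.name]
        # For lower-is-better metrics, an increase counts as the regression.
        regression = -delta if rule.higher_is_better else delta
        if regression > rule.tolerance:
            reasons.append(f"{rule.name} regressed by {regression:.1%}")
    return reasons
```

Returning every reason instead of stopping at the first failure gives the person revising the change the full picture in one run. An empty list means the release can proceed.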
7. Cost Optimization: Cache + Evaluation-Guided Direction
The evaluation framework doesn't just improve quality — it also points to where cost optimization is possible.
Efficiency tier metrics flagged API call costs as elevated. Two optimization opportunities identified:
Optimization 1: Redis cache for high-frequency retrieval results
The GRI rule library is relatively static (updated annually). Retrieval results for the same rule can be reused:
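import hashlib
import json

CACHE_TTL_SECONDS = 86400  # 24 hours (business data updates daily)


def cached_embedding_search(query: str, gri_code: str) -> list:
    # Use a stable digest: the built-in hash() of a str is salted per process,
    # so keys built with it would not match across workers or restarts
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cache_key = f"search:{gri_code}:{digest}"
    # Check cache first (redis_client is assumed to be configured elsewhere)
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: call Embedding API
    results = embedding_search(query)
    # Write to cache with a 24-hour TTL
    redis_client.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(results))
    return results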
Cache hit rate: 60%. Six out of ten searches are served from Redis and never reach the Embedding API.
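A hit rate like this has to be counted somewhere. The sketch below is an in-memory stand-in for the Redis path that tracks hits and misses, which makes the figure reproducible in a test. `CountingCache` is a hypothetical class, and it builds keys from a SHA-256 digest because Python's built-in `hash()` for strings is salted per process and would not give stable keys.

```python
import hashlib
import json
from typing import Callable


class CountingCache:
    """In-memory stand-in for Redis that tracks hits and misses (illustrative)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, gri_code: str, query: str, compute: Callable[[str], list]
    ) -> list:
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        key = f"search:{gri_code}:{digest}"
        if key in self._store:
            self.hits += 1
            return json.loads(self._store[key])
        # Cache miss: run the expensive search and remember the result.
        self.misses += 1
        results = compute(query)
        self._store[key] = json.dumps(results)
        return results

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

With real Redis the same counters could be incremented around `get` and `setex`, then exported alongside the efficiency-tier metrics.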
Optimization 2: The cost side effect of chunking optimization
Semantic chunking reduced average chunk size from 512 to 280 tokens, and Top K dropped from 5 to 3. At those averages, retrieved context per query falls from about 2,560 tokens (5 × 512) to about 840 tokens (3 × 280).
Both optimizations combined: total token consumption reduced by 30%+, while accuracy improved by 8 percentage points (85% to 93%).
8. The Continuous Iteration Loop
The final form of the evaluation framework is a continuously running loop — not a one-time optimization project:
System running in production
↓
Collect problem signals
(user-reported misses / accuracy drops / new industry documents)
↓
Classify and archive
(chunking issue / retrieval issue / prompt issue)
↓
Update golden test set
(add test cases matching the new problem type)
↓
Update few-shot examples
(add new miss cases to prompt examples)
↓
Targeted optimization
(chunking strategy / retrieval parameters / prompt rules)
↓
Regression gate validation
(run golden test set — must pass before release)
↓
Release → continue collecting problem signals
Loop trigger conditions (any one triggers the loop):
| Trigger | Action |
|---|---|
| New industry client onboarded | Add industry-specific test cases; verify existing strategies apply |
| User reports missed detection | Archive miss case, add to test set, update few-shot examples |
| Accuracy drops > 2% | Trigger three-level verification, locate degraded layer, targeted fix |
| Monthly routine evaluation | Update test set with new documents, re-run full metrics |
Core value of the loop: the system cannot silently degrade due to business changes. Every regression is caught by metrics. Every fix is validated by data. Every release is gated by the regression check.
9. Closing: What the Evaluation Framework Really Is
Looking back at the engineering journey across this series:
- Part 1 solved "how to turn documents into a searchable knowledge base"
- Part 2 solved "how to chunk without destroying semantic structure"
- Part 3 solved "how to retrieve accurately in domain-specific terminology scenarios"
- Part 4 solved "how to produce quantifiable conclusions from retrieval results"
- Part 5 solved "how to identify root cause in 5 minutes when something breaks"
- Part 6 solved "how to make all of the above continuously improve"
Without an evaluation framework, the optimizations in the first five layers are one-time events. The day of launch is the system's peak performance — after that, it can only degrade as business conditions change.
With an evaluation framework, the system gains the ability to improve itself:
$$\text{Golden test set (fixed baseline)} + \text{Three-tier metrics (layered diagnosis)} + \text{Regression gate (closed-loop guarantee)}$$
$$= \text{Turning RAG optimization from intuition into engineering}$$
This evaluation loop applies to any production-grade LLM system. The only things you replace are the content of the golden test set (swap GRI clauses for legal statutes / medical guidelines / financial regulations) and the business tier metric definitions. The three-tier metrics structure, the regression gate logic, the continuous test set update mechanism — these are universal engineering practices, independent of any specific business domain.
Source Code
The complete implementation for all six parts is available here:
👉 github.com/muzinan123/production-rag-engineering
Relevant files for this part:
- esg/services/evaluation_service.py: golden dataset evaluation (score_hit + score_find)
- esg/routers/evaluation.py: evaluation API entry point
Full series index:
- Part 1, Ingestion Pipeline: esg/services/loading_service.py, parsing_service.py
- Part 2, Chunking Service: esg/services/chunking_service.py
- Part 3, Hybrid Retrieval: esg/services/embedding_service.py, search_service.py
- Part 4, Judgment Engine: esg/services/generation_service.py
- Part 5, Full-Chain Traceability: esg/services/embedding_service.py, routers/evaluation.py
- Part 6, Evaluation & Iteration: esg/services/evaluation_service.py, routers/evaluation.py
This completes the full six-layer breakdown of the production RAG system. From data ingestion to evaluation loop, every layer has explicit engineering decisions and quantifiable outcomes. This methodology has been validated across three different industry scenarios. Whether your domain is legal contracts, financial audits, or medical records — if you need knowledge-based decisions that are traceable, high-precision, and auditable, this architecture is your production-grade baseline.