This is Part 8 of the series 8 Weeks from Zero to One: Building a Production-Grade LLM-Powered AI Customer Service System — Full-Stack Engineering Practice. In the previous seven parts, we covered MVP architecture, GraphRAG data pipelines, multi-agent orchestration, safety guardrails, hybrid retrieval, and inference cost optimization. But one question remained unanswered throughout: How do we know the system is "good enough" to ship? And when we change a Prompt, how do we confirm we haven't broken something that was working before?
Note: The evaluation framework and methodology in this article apply to the entire series' tech stack. To keep examples concrete and data-driven, some cases are drawn from the conversational data analysis module (Text2SQL) built on the same stack — sharing the same LangGraph multi-agent architecture, GraphRAG knowledge retrieval, and LangSmith behavior tracking as the customer service system. The Prompt engineering methods and evaluation mechanisms are identical. Everything described here — Golden Dataset construction, regression gates, and feedback loops — has been deployed in both systems.
1. The Problem: Why "It Works on My Machine" Isn't Enough
In the early stages of the project, we validated changes manually. Each time we tweaked a Prompt, we'd run a handful of queries that felt "representative," eyeball the results, and ship if nothing looked obviously broken.
This worked fine at v1. It started breaking down at v2.
The SQL generation Prompt went through three iterations:
v1 (60% accuracy): Only injected table schema
→ Typical failure: JOIN on mismatched column types
(e.g., joining sales.product_id with store.store_id)
v2 (85% accuracy): Added GraphRAG to inject table-field relationships
→ Improvement: LLM correctly identified join conditions
→ Remaining issue: Complex nested JOINs still missed steps
v3 (93% accuracy): Added CoT 5-step reasoning chain
→ Complex SQL accuracy: from 40% to 85%
Every single time, we discovered problems after shipping — relying on user feedback to trace back the root cause. There was no mechanism to tell us, before going live, whether a change had introduced a regression.
That was the core problem we needed to solve: replace "gut feeling" with "data."
2. The Architecture: A Three-Layer Quality Assurance Loop
We designed a system covering "before, during, and after" deployment:
┌─────────────────────────────────────────────────────────────┐
│ Three-Layer Quality Assurance System │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [Pre-Deploy] Offline Evaluation Layer │ │
│ │ Golden Dataset (200 cases) → Regression Test │ │
│ │ → Release Gate │ │
│ └──────────────────────────┬──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [In-Production] Online Monitoring Layer │ │
│ │ LangSmith Tracing → Four Metric Types → Alerting │ │
│ └──────────────────────────┬──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [Post-Incident] Feedback Loop Layer │ │
│ │ Failure Archiving → Root Cause Analysis │ │
│ │ → Dataset Expansion → Next Iteration │ │
│ └──────────────────────────┬──────────────────────────┘ │
│ │ │
│ Continuous iteration: accuracy from 60% → 93% │
└─────────────────────────────────────────────────────────────┘
The three layers are not independent: offline evaluation prevents shipping broken versions, online monitoring catches issues after deployment, and the feedback loop ensures the same problem never recurs.
3. Layer One: Offline Evaluation — The Pre-Deploy Quality Gate
3.1 Building the Golden Dataset
The test set is the foundation of the entire evaluation system. We built it from two sources:
Source 1: Real Historical Queries (150 cases)
Extracted from the archival system, filtered by:
- Queries where users explicitly marked results as "correct" or "incorrect"
- Coverage across 8 core business scenarios (sales analysis, inventory queries, period-over-period comparison, department benchmarking, promotion effectiveness, staff performance, cost analysis, anomaly detection)
- Manually annotated with the correct SQL and expected output
The value here is simple: these are real business scenarios, not edge cases we invented.
Source 2: Manually Constructed Edge Cases (50 cases)
High-risk scenarios that real queries don't naturally cover:
# Edge case categories
EDGE_CASES = {
"nested_multi_table_join": [
"List employees in each department whose sales exceeded "
"the department average over the past 3 months",
# Requires: subquery + GROUP BY + HAVING
],
"window_functions": [
"Rank all products by sales amount and return the top 10",
# Requires: RANK() OVER (ORDER BY ...)
],
"complex_date_calculation": [
"Compare this year's 618 promotion sales against the same period last year",
# Requires: DATE_FORMAT + cross-year date ranges
],
"multi_intent_query": [
"Identify underperforming departments and suggest improvement actions",
# Requires: SQL query + text generation combined
],
}
Final result: 200-case test set, covering 8 scenario types, 25 cases each.
3.2 Defining Accuracy: Two Layers
In a Text2SQL context, "accuracy" is not a single number. We defined two layers:
Layer 1: SQL Execution Accuracy (automated, weight 70%)
Definition: Generated SQL produces results identical to manually annotated answers
Validation: Auto-execute SQL, compare result sets (row count + column values)
Layer 2: Semantic Accuracy (rule checks + manual sampling, weight 30%)
Definition: SQL runs correctly AND correctly captures user intent
Counter-example: User asks for "sales including promotions" but SQL
doesn't JOIN the promotions table
Validation: Key field validation rules + 5% manual spot-check
Combined Accuracy = SQL Execution Accuracy × 0.7 + Semantic Accuracy × 0.3
3.3 Prompt Version Tracking
Every Prompt change is registered with full metadata, making every modification traceable:
class PromptVersionRegistry:
def register(self, version_id: str, change_description: str):
self.registry[version_id] = {
"description": change_description,
"created_at": datetime.now().isoformat(),
"accuracy": None, # filled after running test set
"status": "pending" # pending → approved → deprecated
}
def compare(self, version_a: str, version_b: str) -> dict:
delta = (self.registry[version_b]["accuracy"]
- self.registry[version_a]["accuracy"])
return {
"delta": delta,
"recommendation": "approve" if delta >= -0.02 else "reject"
}
Actual version history:
| Version | Key Change | Typical Failure | Combined Accuracy |
|---|---|---|---|
| v1 | Table schema only | JOIN type mismatch | 60% |
| v2 | Added GraphRAG table-field relationships | Complex nested JOIN missing steps | 85% |
| v3 | Added CoT 5-step reasoning chain | — | 93% |
3.4 The Regression Gate: Let Data Decide
class RegressionGate:
ACCURACY_DROP_THRESHOLD = 0.02 # reject if accuracy drops more than 2%
def evaluate(self, new_version: str, current_version: str) -> dict:
new_acc = self._run_on_golden_dataset(new_version)
cur_acc = self._run_on_golden_dataset(current_version)
delta = new_acc - cur_acc
passed = delta >= -self.ACCURACY_DROP_THRESHOLD
return {
"gate_passed": passed,
"decision": "APPROVE" if passed else "REJECT",
"reason": None if passed else (
f"Accuracy dropped {abs(delta)*100:.1f}%, exceeds threshold. Rejected."
)
}
What does this gate actually catch?
Consider this scenario: a Prompt update introduces a Few-shot example with an inconsistent date format. Date-related query accuracy drops from 88% to 61%, pulling combined accuracy down by more than 2%. The gate rejects the deployment. The production incident never happens.
4. Layer Two: Online Monitoring — Continuous Observation After Deployment
4.1 Four Categories of Data Collected by LangSmith
┌─────────────────────────────────────────────────────────────┐
│ LangSmith Monitoring System │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Node Traces │ │ Model Calls │ │ Error Logs │ │
│ │ │ │ │ │ │ │
│ │ · Input / │ │ · Full │ │ · Execution │ │
│ │ Output │ │ Prompt │ │ errors │ │
│ │ · Exec time │ │ · Token use │ │ · Logic bugs │ │
│ │ · Node state │ │ · Model ver │ │ · Timeouts │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Performance: Latency P50/P95 | Success Rate │ │
│ │ | Retry Count | Cost │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
4.2 Async Upload: Monitoring Can't Slow Down the Response
Early on, we used synchronous log uploads. Under high concurrency, response time ballooned from 1.2s to 2s.
Solution: local queue + background batch upload
class AsyncLangSmithLogger:
"""Write logs to local queue first; background thread batch-uploads to LangSmith."""
def __init__(self, batch_size=50, flush_interval=5.0):
self.queue = deque()
self.batch_size = batch_size
asyncio.create_task(self._background_flush(flush_interval))
async def log(self, trace_data: dict):
self.queue.append(trace_data)
if len(self.queue) >= self.batch_size:
await self._flush() # flush immediately when queue is full
async def _background_flush(self, interval: float):
while True:
await asyncio.sleep(interval)
if self.queue:
await self._flush() # periodic fallback flush
Result: response time back to 1.2s, zero log loss.
4.3 Tiered Retention: Controlling Monitoring Costs
# Different retention policies by log type
RETENTION_POLICY = {
"error": 90, # keep 90 days for root cause analysis
"slow_query": 30, # keep 30 days for performance tuning
"normal": 7, # keep 7 days for short-term monitoring
}
Result: storage costs reduced by 60%, critical error logs fully preserved.
5. Layer Three: The Feedback Loop — Making the System Better Over Time
5.1 Three-Step Failure Tracing
Every execution failure is automatically archived with full context:
Failure Record:
User query: "Find departments with MoM growth rate below -10% in the last 3 months"
Error type: ColumnNotFound
Error message: "column 'department_name' does not exist"
Correct field: dept_name (from schema validation)
Prompt version: v2.0
Auto-fixed: True (resolved in 3-round CoT repair)
Three-step trace:
Step 1: Locate the failing node
Use LangSmith execution trace to find which node produced the bad output
Average time to locate: < 5 minutes (previously: 2 hours)
Step 2: Analyze root cause
ColumnNotFound → field name in Prompt doesn't match actual schema
Wrong JOIN → Few-shot examples missing this type of relationship
Date format bug → example date format doesn't match production environment
Step 3: Categorize and archive
Group by error type, count weekly frequency
Frequency > 3/week → trigger Prompt optimization task
Frequency ≤ 3/week → add to test set as edge case
5.2 Auto-Debug: 3-Round CoT Repair
When execution fails, the system doesn't just "retry." It uses 3 rounds of CoT prompting to guide the LLM to locate and fix the issue:
[Standard Mode: 1-Round Quick Fix]
Prompt: "The code you wrote threw an error. Please fix it based on
the error message and re-execute."
[Expert Mode: 3-Round Deep Fix]
Round 1 — Locate the failing step:
"The code you wrote threw an error. Which part do you think went wrong?
Please use the 5-step reasoning chain to locate the issue:
Was it the intent? The table/field? The JOIN? The aggregation?
Or the calculation logic?"
LLM response: "Step 2 is wrong — field 'department_name' doesn't exist,
it should be 'dept_name'"
Round 2 — Confirm the fix approach:
"Got it. Based on your analysis, what's the correct approach in theory?"
LLM response: "Replace 'department_name' with 'dept_name' in the SQL.
Only modify Step 2 — leave everything else unchanged."
Round 3 — Execute the fix:
"Perfect. Now write the corrected code and run it."
→ LLM only modifies the failing step, doesn't rewrite the entire query
Key design insight: Round 2 forces the LLM to articulate the fix in plain language before writing code. This prevents the LLM from introducing new errors while trying to fix the original one.
Fixed within 3 rounds → continue execution. Still failing after 3 rounds → flag as "low confidence," route to human review (review rate: 5%).
5.3 Closing the Loop: From Failures Back to Prompt Improvements
Production failures
│
▼
Group by error type (extracted from archive weekly)
│
├─ High-frequency (> 3/week) → Add new Few-shot examples
│ (covering field aliases, cross-month
│ dates, multi-table nesting)
│
└─ Low-frequency (≤ 3/week) → Add to test set as edge cases
│
▼
New Prompt version → Regression test → Gate check → Deploy
│
▼
Test set grows continuously (200 cases → ongoing)
Actual numbers from the v2 → v3 iteration:
- Extracted 47 high-frequency failure cases from the archive
- Added 12 new Few-shot examples
- Combined accuracy: 85% → 93%
- Execution failure rate: 15% → 5%
6. A Complete End-to-End Example
Here's the full execution chain with all three layers in action:
User query: "Find departments with MoM growth rate below -10% in the last 3 months"
│
▼
[Stage 0: Query Clarification] (triggered by ambiguous term "underperforming")
Generate clarifying questions → User refines → Optimized query
│
▼
[Stage 1: Task Decomposition] Few-shot + CoT
Step 1: Load department sales data table
Step 2: Calculate MoM growth rate for each department over 3 months
Step 3: Filter departments with growth rate below -10%
Step 4: Generate anomaly analysis report
│
▼
[Stage 2: GraphRAG Retrieval]
Retrieved: department_sales table
Fields: department_id / department_name / sales_amount / order_time
│
▼
[Stage 3: SQL Generation] CoT 5-step reasoning chain
Step 1: Identify intent → calculate MoM growth rate, filter anomalies
Step 2: Identify table/fields → department_sales, sales_amount
Step 3: Define JOIN → self-join (current month vs. previous month)
Step 4: Define aggregation → SUM(sales_amount)
Step 5: Define logic → (current - previous) / previous < -0.1
│
▼
Generated SQL:
SELECT
t1.department_name,
(SUM(t1.sales_amount) - SUM(t2.sales_amount))
/ SUM(t2.sales_amount) AS growth_rate
FROM department_sales t1
JOIN department_sales t2
ON t1.department_id = t2.department_id
AND t1.order_time BETWEEN '2024-11-01' AND '2025-01-31'
AND t2.order_time BETWEEN '2024-08-01' AND '2024-10-31'
GROUP BY t1.department_name
HAVING growth_rate < -0.1
│
▼
[Stage 4: Execution + Auto-Debug]
Success → continue
Failure → 3-round CoT repair (80% auto-fix rate)
│
▼
[Stage 5: Result Interpretation] Markdown output
## Results
| Department | MoM Growth Rate |
|------------|-----------------|
| Region A | -15.2% |
| Region B | -12.8% |
## Conclusion
Region A and Region B recorded MoM growth rates of -15.2% and -12.8%
respectively over the past 3 months, both below the -10% threshold.
Recommend prioritizing a review of sales strategy and market conditions
in these two regions.
│
▼
[Archive] LangSmith records full trace + archival system stores result
7. Results
| Metric | Before (v1) | After (v3) |
|---|---|---|
| SQL Combined Accuracy | 60% | 93% |
| Complex SQL Accuracy | 40% | 85% |
| Execution Failure Rate | 28% | 5% |
| Mean Time to Root Cause | 2 hours | 5 minutes |
| Human Review Rate | ~100% | 5% |
| Monthly Inference Cost | $1,000 | $400 |
| Average Response Time | 2.5s | 1.2s |
8. Three Counter-Intuitive Lessons
Lesson 1: The test set matters more than the evaluation algorithm
We spent a lot of time early on exploring "LLM-as-a-judge" approaches. What we learned in practice: a high-quality 200-case test set is worth more than any sophisticated evaluation algorithm. The test set is your explicit definition of "what good looks like." If that definition is fuzzy, no algorithm can save you.
Lesson 2: Set the gate threshold conservatively, not aggressively
We initially set the gate at "accuracy can't drop more than 5%." That let several small regressions slip through, and the cumulative effect was significant. We changed it to 2%. It's better to block a few versions that have net gains but introduce partial regressions than to let any regression through.
Lesson 3: The feedback loop only works if it's enforced
Many teams have monitoring. Few teams actually use the data. Our key practice: every week, we mandatorily extract failure cases from the archive, mandatorily perform root cause analysis, and mandatorily expand the test set. The word "mandatory" matters — without it, business pressure will always push this work to next week.
What's Next
The final article in this series (Part 9) will be a complete architecture retrospective and best practices summary: what architectural decisions we got right over 8 weeks, what we got wrong, and what we'd do differently if we started over.
This article is Part 8 of the series "8 Weeks from Zero to One: Building a Production-Grade LLM-Powered AI Customer Service System — Full-Stack Engineering Practice."
Top comments (0)