DEV Community

James Lee
James Lee

Posted on

From 60% to 93%: How We Built a Continuous Evaluation Framework for LLM Systems

This is Part 8 of the series 8 Weeks from Zero to One: Building a Production-Grade LLM-Powered AI Customer Service System — Full-Stack Engineering Practice. In the previous seven parts, we covered MVP architecture, GraphRAG data pipelines, multi-agent orchestration, safety guardrails, hybrid retrieval, and inference cost optimization. But one question remained unanswered throughout: How do we know the system is "good enough" to ship? And when we change a Prompt, how do we confirm we haven't broken something that was working before?

Note: The evaluation framework and methodology in this article apply to the entire series' tech stack. To keep examples concrete and data-driven, some cases are drawn from the conversational data analysis module (Text2SQL) built on the same stack — sharing the same LangGraph multi-agent architecture, GraphRAG knowledge retrieval, and LangSmith behavior tracking as the customer service system. The Prompt engineering methods and evaluation mechanisms are identical. Everything described here — Golden Dataset construction, regression gates, and feedback loops — has been deployed in both systems.


1. The Problem: Why "It Works on My Machine" Isn't Enough

In the early stages of the project, we validated changes manually. Each time we tweaked a Prompt, we'd run a handful of queries that felt "representative," eyeball the results, and ship if nothing looked obviously broken.

This worked fine at v1. It started breaking down at v2.

The SQL generation Prompt went through three iterations:

v1 (60% accuracy): Only injected table schema
  → Typical failure: JOIN on mismatched column types
    (e.g., joining sales.product_id with store.store_id)

v2 (85% accuracy): Added GraphRAG to inject table-field relationships
  → Improvement: LLM correctly identified join conditions
  → Remaining issue: Complex nested JOINs still missed steps

v3 (93% accuracy): Added CoT 5-step reasoning chain
  → Complex SQL accuracy: from 40% to 85%
Enter fullscreen mode Exit fullscreen mode

Every single time, we discovered problems after shipping — relying on user feedback to trace back the root cause. There was no mechanism to tell us, before going live, whether a change had introduced a regression.

That was the core problem we needed to solve: replace "gut feeling" with "data."


2. The Architecture: A Three-Layer Quality Assurance Loop

We designed a system covering "before, during, and after" deployment:

┌─────────────────────────────────────────────────────────────┐
│               Three-Layer Quality Assurance System            │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  [Pre-Deploy] Offline Evaluation Layer               │    │
│  │  Golden Dataset (200 cases) → Regression Test        │    │
│  │  → Release Gate                                      │    │
│  └──────────────────────────┬──────────────────────────┘    │
│                             │                                 │
│                             ▼                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  [In-Production] Online Monitoring Layer             │    │
│  │  LangSmith Tracing → Four Metric Types → Alerting    │    │
│  └──────────────────────────┬──────────────────────────┘    │
│                             │                                 │
│                             ▼                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  [Post-Incident] Feedback Loop Layer                 │    │
│  │  Failure Archiving → Root Cause Analysis             │    │
│  │  → Dataset Expansion → Next Iteration                │    │
│  └──────────────────────────┬──────────────────────────┘    │
│                             │                                 │
│         Continuous iteration: accuracy from 60% → 93%        │
└─────────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

The three layers are not independent: offline evaluation prevents shipping broken versions, online monitoring catches issues after deployment, and the feedback loop ensures the same problem never recurs.


3. Layer One: Offline Evaluation — The Pre-Deploy Quality Gate

3.1 Building the Golden Dataset

The test set is the foundation of the entire evaluation system. We built it from two sources:

Source 1: Real Historical Queries (150 cases)

Extracted from the archival system, filtered by:

  • Queries where users explicitly marked results as "correct" or "incorrect"
  • Coverage across 8 core business scenarios (sales analysis, inventory queries, period-over-period comparison, department benchmarking, promotion effectiveness, staff performance, cost analysis, anomaly detection)
  • Manually annotated with the correct SQL and expected output

The value here is simple: these are real business scenarios, not edge cases we invented.

Source 2: Manually Constructed Edge Cases (50 cases)

High-risk scenarios that real queries don't naturally cover:

# Edge case categories
EDGE_CASES = {
    "nested_multi_table_join": [
        "List employees in each department whose sales exceeded "
        "the department average over the past 3 months",
        # Requires: subquery + GROUP BY + HAVING
    ],
    "window_functions": [
        "Rank all products by sales amount and return the top 10",
        # Requires: RANK() OVER (ORDER BY ...)
    ],
    "complex_date_calculation": [
        "Compare this year's 618 promotion sales against the same period last year",
        # Requires: DATE_FORMAT + cross-year date ranges
    ],
    "multi_intent_query": [
        "Identify underperforming departments and suggest improvement actions",
        # Requires: SQL query + text generation combined
    ],
}
Enter fullscreen mode Exit fullscreen mode

Final result: 200-case test set, covering 8 scenario types, 25 cases each.

3.2 Defining Accuracy: Two Layers

In a Text2SQL context, "accuracy" is not a single number. We defined two layers:

Layer 1: SQL Execution Accuracy (automated, weight 70%)
  Definition: Generated SQL produces results identical to manually annotated answers
  Validation: Auto-execute SQL, compare result sets (row count + column values)

Layer 2: Semantic Accuracy (rule checks + manual sampling, weight 30%)
  Definition: SQL runs correctly AND correctly captures user intent
  Counter-example: User asks for "sales including promotions" but SQL
                   doesn't JOIN the promotions table
  Validation: Key field validation rules + 5% manual spot-check

Combined Accuracy = SQL Execution Accuracy × 0.7 + Semantic Accuracy × 0.3
Enter fullscreen mode Exit fullscreen mode

3.3 Prompt Version Tracking

Every Prompt change is registered with full metadata, making every modification traceable:

class PromptVersionRegistry:

    def register(self, version_id: str, change_description: str):
        self.registry[version_id] = {
            "description": change_description,
            "created_at": datetime.now().isoformat(),
            "accuracy": None,       # filled after running test set
            "status": "pending"     # pending → approved → deprecated
        }

    def compare(self, version_a: str, version_b: str) -> dict:
        delta = (self.registry[version_b]["accuracy"]
                 - self.registry[version_a]["accuracy"])
        return {
            "delta": delta,
            "recommendation": "approve" if delta >= -0.02 else "reject"
        }
Enter fullscreen mode Exit fullscreen mode

Actual version history:

Version Key Change Typical Failure Combined Accuracy
v1 Table schema only JOIN type mismatch 60%
v2 Added GraphRAG table-field relationships Complex nested JOIN missing steps 85%
v3 Added CoT 5-step reasoning chain 93%

3.4 The Regression Gate: Let Data Decide

class RegressionGate:

    ACCURACY_DROP_THRESHOLD = 0.02  # reject if accuracy drops more than 2%

    def evaluate(self, new_version: str, current_version: str) -> dict:
        new_acc = self._run_on_golden_dataset(new_version)
        cur_acc = self._run_on_golden_dataset(current_version)
        delta = new_acc - cur_acc
        passed = delta >= -self.ACCURACY_DROP_THRESHOLD

        return {
            "gate_passed": passed,
            "decision": "APPROVE" if passed else "REJECT",
            "reason": None if passed else (
                f"Accuracy dropped {abs(delta)*100:.1f}%, exceeds threshold. Rejected."
            )
        }
Enter fullscreen mode Exit fullscreen mode

What does this gate actually catch?

Consider this scenario: a Prompt update introduces a Few-shot example with an inconsistent date format. Date-related query accuracy drops from 88% to 61%, pulling combined accuracy down by more than 2%. The gate rejects the deployment. The production incident never happens.


4. Layer Two: Online Monitoring — Continuous Observation After Deployment

4.1 Four Categories of Data Collected by LangSmith

┌─────────────────────────────────────────────────────────────┐
│                   LangSmith Monitoring System                 │
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ Node Traces  │  │ Model Calls  │  │  Error Logs  │      │
│  │              │  │              │  │              │      │
│  │ · Input /    │  │ · Full       │  │ · Execution  │      │
│  │   Output     │  │   Prompt     │  │   errors     │      │
│  │ · Exec time  │  │ · Token use  │  │ · Logic bugs │      │
│  │ · Node state │  │ · Model ver  │  │ · Timeouts   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Performance: Latency P50/P95 | Success Rate         │    │
│  │              | Retry Count | Cost                    │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

4.2 Async Upload: Monitoring Can't Slow Down the Response

Early on, we used synchronous log uploads. Under high concurrency, response time ballooned from 1.2s to 2s.

Solution: local queue + background batch upload

class AsyncLangSmithLogger:
    """Write logs to local queue first; background thread batch-uploads to LangSmith."""

    def __init__(self, batch_size=50, flush_interval=5.0):
        self.queue = deque()
        self.batch_size = batch_size
        asyncio.create_task(self._background_flush(flush_interval))

    async def log(self, trace_data: dict):
        self.queue.append(trace_data)
        if len(self.queue) >= self.batch_size:
            await self._flush()  # flush immediately when queue is full

    async def _background_flush(self, interval: float):
        while True:
            await asyncio.sleep(interval)
            if self.queue:
                await self._flush()  # periodic fallback flush
Enter fullscreen mode Exit fullscreen mode

Result: response time back to 1.2s, zero log loss.

4.3 Tiered Retention: Controlling Monitoring Costs

# Different retention policies by log type
RETENTION_POLICY = {
    "error":       90,  # keep 90 days for root cause analysis
    "slow_query":  30,  # keep 30 days for performance tuning
    "normal":       7,  # keep 7 days for short-term monitoring
}
Enter fullscreen mode Exit fullscreen mode

Result: storage costs reduced by 60%, critical error logs fully preserved.


5. Layer Three: The Feedback Loop — Making the System Better Over Time

5.1 Three-Step Failure Tracing

Every execution failure is automatically archived with full context:

Failure Record:
  User query:      "Find departments with MoM growth rate below -10% in the last 3 months"
  Error type:      ColumnNotFound
  Error message:   "column 'department_name' does not exist"
  Correct field:   dept_name (from schema validation)
  Prompt version:  v2.0
  Auto-fixed:      True (resolved in 3-round CoT repair)
Enter fullscreen mode Exit fullscreen mode

Three-step trace:

Step 1: Locate the failing node
  Use LangSmith execution trace to find which node produced the bad output
  Average time to locate: < 5 minutes (previously: 2 hours)

Step 2: Analyze root cause
  ColumnNotFound   → field name in Prompt doesn't match actual schema
  Wrong JOIN       → Few-shot examples missing this type of relationship
  Date format bug  → example date format doesn't match production environment

Step 3: Categorize and archive
  Group by error type, count weekly frequency
  Frequency > 3/week  → trigger Prompt optimization task
  Frequency ≤ 3/week  → add to test set as edge case
Enter fullscreen mode Exit fullscreen mode

5.2 Auto-Debug: 3-Round CoT Repair

When execution fails, the system doesn't just "retry." It uses 3 rounds of CoT prompting to guide the LLM to locate and fix the issue:

[Standard Mode: 1-Round Quick Fix]
Prompt: "The code you wrote threw an error. Please fix it based on
         the error message and re-execute."

[Expert Mode: 3-Round Deep Fix]

Round 1 — Locate the failing step:
"The code you wrote threw an error. Which part do you think went wrong?
 Please use the 5-step reasoning chain to locate the issue:
 Was it the intent? The table/field? The JOIN? The aggregation?
 Or the calculation logic?"

LLM response: "Step 2 is wrong — field 'department_name' doesn't exist,
               it should be 'dept_name'"

Round 2 — Confirm the fix approach:
"Got it. Based on your analysis, what's the correct approach in theory?"

LLM response: "Replace 'department_name' with 'dept_name' in the SQL.
               Only modify Step 2 — leave everything else unchanged."

Round 3 — Execute the fix:
"Perfect. Now write the corrected code and run it."

→ LLM only modifies the failing step, doesn't rewrite the entire query
Enter fullscreen mode Exit fullscreen mode

Key design insight: Round 2 forces the LLM to articulate the fix in plain language before writing code. This prevents the LLM from introducing new errors while trying to fix the original one.

Fixed within 3 rounds → continue execution. Still failing after 3 rounds → flag as "low confidence," route to human review (review rate: 5%).

5.3 Closing the Loop: From Failures Back to Prompt Improvements

Production failures
    │
    ▼
Group by error type (extracted from archive weekly)
    │
    ├─ High-frequency (> 3/week) → Add new Few-shot examples
    │                               (covering field aliases, cross-month
    │                                dates, multi-table nesting)
    │
    └─ Low-frequency (≤ 3/week) → Add to test set as edge cases
    │
    ▼
New Prompt version → Regression test → Gate check → Deploy
    │
    ▼
Test set grows continuously (200 cases → ongoing)
Enter fullscreen mode Exit fullscreen mode

Actual numbers from the v2 → v3 iteration:

  • Extracted 47 high-frequency failure cases from the archive
  • Added 12 new Few-shot examples
  • Combined accuracy: 85% → 93%
  • Execution failure rate: 15% → 5%

6. A Complete End-to-End Example

Here's the full execution chain with all three layers in action:

User query: "Find departments with MoM growth rate below -10% in the last 3 months"
    │
    ▼
[Stage 0: Query Clarification] (triggered by ambiguous term "underperforming")
  Generate clarifying questions → User refines → Optimized query
    │
    ▼
[Stage 1: Task Decomposition] Few-shot + CoT
  Step 1: Load department sales data table
  Step 2: Calculate MoM growth rate for each department over 3 months
  Step 3: Filter departments with growth rate below -10%
  Step 4: Generate anomaly analysis report
    │
    ▼
[Stage 2: GraphRAG Retrieval]
  Retrieved: department_sales table
  Fields: department_id / department_name / sales_amount / order_time
    │
    ▼
[Stage 3: SQL Generation] CoT 5-step reasoning chain
  Step 1: Identify intent → calculate MoM growth rate, filter anomalies
  Step 2: Identify table/fields → department_sales, sales_amount
  Step 3: Define JOIN → self-join (current month vs. previous month)
  Step 4: Define aggregation → SUM(sales_amount)
  Step 5: Define logic → (current - previous) / previous < -0.1
    │
    ▼
  Generated SQL:
Enter fullscreen mode Exit fullscreen mode
SELECT
    t1.department_name,
    (SUM(t1.sales_amount) - SUM(t2.sales_amount))
        / SUM(t2.sales_amount) AS growth_rate
FROM department_sales t1
JOIN department_sales t2
    ON t1.department_id = t2.department_id
    AND t1.order_time BETWEEN '2024-11-01' AND '2025-01-31'
    AND t2.order_time BETWEEN '2024-08-01' AND '2024-10-31'
GROUP BY t1.department_name
HAVING growth_rate < -0.1
Enter fullscreen mode Exit fullscreen mode
    │
    ▼
[Stage 4: Execution + Auto-Debug]
  Success → continue
  Failure → 3-round CoT repair (80% auto-fix rate)
    │
    ▼
[Stage 5: Result Interpretation] Markdown output
  ## Results
  | Department | MoM Growth Rate |
  |------------|-----------------|
  | Region A   | -15.2%          |
  | Region B   | -12.8%          |

  ## Conclusion
  Region A and Region B recorded MoM growth rates of -15.2% and -12.8%
  respectively over the past 3 months, both below the -10% threshold.
  Recommend prioritizing a review of sales strategy and market conditions
  in these two regions.
    │
    ▼
[Archive] LangSmith records full trace + archival system stores result
Enter fullscreen mode Exit fullscreen mode

7. Results

Metric Before (v1) After (v3)
SQL Combined Accuracy 60% 93%
Complex SQL Accuracy 40% 85%
Execution Failure Rate 28% 5%
Mean Time to Root Cause 2 hours 5 minutes
Human Review Rate ~100% 5%
Monthly Inference Cost $1,000 $400
Average Response Time 2.5s 1.2s

8. Three Counter-Intuitive Lessons

Lesson 1: The test set matters more than the evaluation algorithm

We spent a lot of time early on exploring "LLM-as-a-judge" approaches. What we learned in practice: a high-quality 200-case test set is worth more than any sophisticated evaluation algorithm. The test set is your explicit definition of "what good looks like." If that definition is fuzzy, no algorithm can save you.

Lesson 2: Set the gate threshold conservatively, not aggressively

We initially set the gate at "accuracy can't drop more than 5%." That let several small regressions slip through, and the cumulative effect was significant. We changed it to 2%. It's better to block a few versions that have net gains but introduce partial regressions than to let any regression through.

Lesson 3: The feedback loop only works if it's enforced

Many teams have monitoring. Few teams actually use the data. Our key practice: every week, we mandatorily extract failure cases from the archive, mandatorily perform root cause analysis, and mandatorily expand the test set. The word "mandatory" matters — without it, business pressure will always push this work to next week.


What's Next

The final article in this series (Part 9) will be a complete architecture retrospective and best practices summary: what architectural decisions we got right over 8 weeks, what we got wrong, and what we'd do differently if we started over.


This article is Part 8 of the series "8 Weeks from Zero to One: Building a Production-Grade LLM-Powered AI Customer Service System — Full-Stack Engineering Practice."

Top comments (0)