James Lee

Posted on Jun 9 • Edited on Jun 22

How We Replaced Gut Feeling with Data: A Production Evaluation Framework for LLM Systems (Text2SQL Case Study)

#ai #llm #testing #machinelearning

This is Part 8 of the series 8 Weeks from Zero to One: Full-Stack Engineering Practice for a Production-Grade LLM Application. In the previous seven parts, we covered MVP architecture, GraphRAG data pipelines, multi-agent orchestration, safety guardrails, hybrid retrieval, and inference cost optimization. But one question remained unanswered throughout: How do we know the system is "good enough" to ship? And when we change a Prompt, how do we confirm we haven't broken something that was working before?

Note: The evaluation framework and methodology in this article apply to the entire series' tech stack. To keep examples concrete and data-driven, some cases are drawn from the conversational data analysis module (Text2SQL) built on the same stack — sharing the same LangGraph multi-agent architecture, GraphRAG knowledge retrieval, and LangSmith behavior tracking as the core system. The Prompt engineering methods and evaluation mechanisms are identical. Everything described here — Golden Dataset construction, regression gates, and feedback loops — has been deployed in both systems.

1. The Problem: Why "It Works on My Machine" Isn't Enough

In the early stages of the project, we validated changes manually. Each time we tweaked a Prompt, we'd run a handful of queries that felt "representative," eyeball the results, and ship if nothing looked obviously broken.

This worked fine at v1. It started breaking down at v2.

The SQL generation Prompt went through three iterations:

v1 (60% accuracy): Only injected table schema
  → Typical failure: JOIN on mismatched column types
    (e.g., joining sales.product_id with store.store_id)

v2 (85% accuracy): Added GraphRAG to inject table-field relationships
  → Improvement: LLM correctly identified join conditions
  → Remaining issue: Complex nested JOINs still missed steps

v3 (93% accuracy): Added CoT 5-step reasoning chain
  → Complex SQL accuracy: from 40% to 85%

Every single time, we discovered problems after shipping — relying on user feedback to trace back the root cause. There was no mechanism to tell us, before going live, whether a change had introduced a regression.

That was the core problem we needed to solve: replace "gut feeling" with "data."

2. The Architecture: A Three-Layer Quality Assurance Loop

We designed a system covering "before, during, and after" deployment:

┌─────────────────────────────────────────────────────────────┐
│               Three-Layer Quality Assurance System            │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  [Pre-Deploy] Offline Evaluation Layer               │    │
│  │  Golden Dataset (200 cases) → Regression Test        │    │
│  │  → Release Gate                                      │    │
│  └──────────────────────────┬──────────────────────────┘    │
│                             │                                 │
│                             ▼                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  [In-Production] Online Monitoring Layer             │    │
│  │  LangSmith Tracing → Four Metric Types → Alerting    │    │
│  └──────────────────────────┬──────────────────────────┘    │
│                             │                                 │
│                             ▼                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  [Post-Incident] Feedback Loop Layer                 │    │
│  │  Failure Archiving → Root Cause Analysis             │    │
│  │  → Dataset Expansion → Next Iteration                │    │
│  └──────────────────────────┬──────────────────────────┘    │
│                             │                                 │
│         Continuous iteration: accuracy from 60% → 93%        │
└─────────────────────────────────────────────────────────────┘

The three layers are not independent: offline evaluation prevents shipping broken versions, online monitoring catches issues after deployment, and the feedback loop ensures the same problem never recurs.

3. Layer One: Offline Evaluation — The Pre-Deploy Quality Gate

3.1 Building the Golden Dataset

The test set is the foundation of the entire evaluation system. We built it from two sources:

Source 1: Real Historical Queries (150 cases)

Extracted from the archival system, filtered by:

Queries where users explicitly marked results as "correct" or "incorrect"
Coverage across 8 core business scenarios (sales analysis, inventory queries, period-over-period comparison, department benchmarking, promotion effectiveness, staff performance, cost analysis, anomaly detection)
Manually annotated with the correct SQL and expected output

The value here is simple: these are real business scenarios, not edge cases we invented.

Source 2: Manually Constructed Edge Cases (50 cases)

High-risk scenarios that real queries don't naturally cover:

# Edge case categories
EDGE_CASES = {
    "nested_multi_table_join": [
        "List employees in each department whose sales exceeded "
        "the department average over the past 3 months",
        # Requires: subquery + GROUP BY + HAVING
    ],
    "window_functions": [
        "Rank all products by sales amount and return the top 10",
        # Requires: RANK() OVER (ORDER BY ...)
    ],
    "complex_date_calculation": [
        "Compare this year's 618 promotion sales against the same period last year",
        # Requires: DATE_FORMAT + cross-year date ranges
    ],
    "multi_intent_query": [
        "Identify underperforming departments and suggest improvement actions",
        # Requires: SQL query + text generation combined
    ],
}

Final result: 200-case test set, covering 8 scenario types, 25 cases each.

3.2 Defining Accuracy: Two Layers

In a Text2SQL context, "accuracy" is not a single number. We defined two layers:

Layer 1: SQL Execution Accuracy (automated, weight 70%)
  Definition: Generated SQL produces results identical to manually annotated answers
  Validation: Auto-execute SQL, compare result sets (row count + column values)

Layer 2: Semantic Accuracy (rule checks + manual sampling, weight 30%)
  Definition: SQL runs correctly AND correctly captures user intent
  Counter-example: User asks for "sales including promotions" but SQL
                   doesn't JOIN the promotions table
  Validation: Key field validation rules + 5% manual spot-check

Combined Accuracy = SQL Execution Accuracy × 0.7 + Semantic Accuracy × 0.3

3.3 Prompt Version Tracking

Every Prompt change is registered with full metadata, making every modification traceable:

class PromptVersionRegistry:

    def register(self, version_id: str, change_description: str):
        self.registry[version_id] = {
            "description": change_description,
            "created_at": datetime.now().isoformat(),
            "accuracy": None,       # filled after running test set
            "status": "pending"     # pending → approved → deprecated
        }

    def compare(self, version_a: str, version_b: str) -> dict:
        delta = (self.registry[version_b]["accuracy"]
                 - self.registry[version_a]["accuracy"])
        return {
            "delta": delta,
            "recommendation": "approve" if delta >= -0.02 else "reject"
        }

Actual version history:

Version	Key Change	Typical Failure	Combined Accuracy
v1	Table schema only	JOIN type mismatch	60%
v2	Added GraphRAG table-field relationships	Complex nested JOIN missing steps	85%
v3	Added CoT 5-step reasoning chain	—	93%

3.4 The Regression Gate: Let Data Decide

class RegressionGate:

    ACCURACY_DROP_THRESHOLD = 0.02  # reject if accuracy drops more than 2%

    def evaluate(self, new_version: str, current_version: str) -> dict:
        new_acc = self._run_on_golden_dataset(new_version)
        cur_acc = self._run_on_golden_dataset(current_version)
        delta = new_acc - cur_acc
        passed = delta >= -self.ACCURACY_DROP_THRESHOLD

        return {
            "gate_passed": passed,
            "decision": "APPROVE" if passed else "REJECT",
            "reason": None if passed else (
                f"Accuracy dropped {abs(delta)*100:.1f}%, exceeds threshold. Rejected."
            )
        }

What does this gate actually catch?

Consider this scenario: a Prompt update introduces a Few-shot example with an inconsistent date format. Date-related query accuracy drops from 88% to 61%, pulling combined accuracy down by more than 2%. The gate rejects the deployment. The production incident never happens.

4. Layer Two: Online Monitoring — Continuous Observation After Deployment

4.1 Four Categories of Data Collected by LangSmith

┌─────────────────────────────────────────────────────────────┐
│                   LangSmith Monitoring System                 │
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ Node Traces  │  │ Model Calls  │  │  Error Logs  │      │
│  │              │  │              │  │              │      │
│  │ · Input /    │  │ · Full       │  │ · Execution  │      │
│  │   Output     │  │   Prompt     │  │   errors     │      │
│  │ · Exec time  │  │ · Token use  │  │ · Logic bugs │      │
│  │ · Node state │  │ · Model ver  │  │ · Timeouts   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Performance: Latency P50/P95 | Success Rate         │    │
│  │              | Retry Count | Cost                    │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

4.2 Async Upload: Monitoring Can't Slow Down the Response

Early on, we used synchronous log uploads. Under high concurrency, response time ballooned from 1.2s to 2s.

Solution: local queue + background batch upload

class AsyncLangSmithLogger:
    """Write logs to local queue first; background thread batch-uploads to LangSmith."""

    def __init__(self, batch_size=50, flush_interval=5.0):
        self.queue = deque()
        self.batch_size = batch_size
        asyncio.create_task(self._background_flush(flush_interval))

    async def log(self, trace_data: dict):
        self.queue.append(trace_data)
        if len(self.queue) >= self.batch_size:
            await self._flush()  # flush immediately when queue is full

    async def _background_flush(self, interval: float):
        while True:
            await asyncio.sleep(interval)
            if self.queue:
                await self._flush()  # periodic fallback flush

Result: response time back to 1.2s, zero log loss.

4.3 Tiered Retention: Controlling Monitoring Costs

# Different retention policies by log type
RETENTION_POLICY = {
    "error":       90,  # keep 90 days for root cause analysis
    "slow_query":  30,  # keep 30 days for performance tuning
    "normal":       7,  # keep 7 days for short-term monitoring
}

Result: storage costs reduced by 60%, critical error logs fully preserved.

5. Layer Three: The Feedback Loop — Making the System Better Over Time

5.1 Three-Step Failure Tracing

Every execution failure is automatically archived with full context:

Failure Record:
  User query:      "Find departments with MoM growth rate below -10% in the last 3 months"
  Error type:      ColumnNotFound
  Error message:   "column 'department_name' does not exist"
  Correct field:   dept_name (from schema validation)
  Prompt version:  v2.0
  Auto-fixed:      True (resolved in 3-round CoT repair)

Three-step trace:

Step 1: Locate the failing node
  Use LangSmith execution trace to find which node produced the bad output
  Average time to locate: < 5 minutes (previously: 2 hours)

Step 2: Analyze root cause
  ColumnNotFound   → field name in Prompt doesn't match actual schema
  Wrong JOIN       → Few-shot examples missing this type of relationship
  Date format bug  → example date format doesn't match production environment

Step 3: Categorize and archive
  Group by error type, count weekly frequency
  Frequency > 3/week  → trigger Prompt optimization task
  Frequency ≤ 3/week  → add to test set as edge case

5.2 Auto-Debug: 3-Round CoT Repair

When execution fails, the system doesn't just "retry." It uses 3 rounds of CoT prompting to guide the LLM to locate and fix the issue:

[Standard Mode: 1-Round Quick Fix]
Prompt: "The code you wrote threw an error. Please fix it based on
         the error message and re-execute."

[Expert Mode: 3-Round Deep Fix]

Round 1 — Locate the failing step:
"The code you wrote threw an error. Which part do you think went wrong?
 Please use the 5-step reasoning chain to locate the issue:
 Was it the intent? The table/field? The JOIN? The aggregation?
 Or the calculation logic?"

LLM response: "Step 2 is wrong — field 'department_name' doesn't exist,
               it should be 'dept_name'"

Round 2 — Confirm the fix approach:
"Got it. Based on your analysis, what's the correct approach in theory?"

LLM response: "Replace 'department_name' with 'dept_name' in the SQL.
               Only modify Step 2 — leave everything else unchanged."

Round 3 — Execute the fix:
"Perfect. Now write the corrected code and run it."

→ LLM only modifies the failing step, doesn't rewrite the entire query

Key design insight: Round 2 forces the LLM to articulate the fix in plain language before writing code. This prevents the LLM from introducing new errors while trying to fix the original one.

Fixed within 3 rounds → continue execution. Still failing after 3 rounds → flag as "low confidence," route to human review (review rate: 5%).

5.3 Closing the Loop: From Failures Back to Prompt Improvements

Production failures
    │
    ▼
Group by error type (extracted from archive weekly)
    │
    ├─ High-frequency (> 3/week) → Add new Few-shot examples
    │                               (covering field aliases, cross-month
    │                                dates, multi-table nesting)
    │
    └─ Low-frequency (≤ 3/week) → Add to test set as edge cases
    │
    ▼
New Prompt version → Regression test → Gate check → Deploy
    │
    ▼
Test set grows continuously (200 cases → ongoing)

Actual numbers from the v2 → v3 iteration:

Extracted 47 high-frequency failure cases from the archive
Added 12 new Few-shot examples
Combined accuracy: 85% → 93%
Execution failure rate: 15% → 5%

6. A Complete End-to-End Example

Here's the full execution chain with all three layers in action:

User query: "Find departments with MoM growth rate below -10% in the last 3 months"
    │
    ▼
[Stage 0: Query Clarification] (triggered by ambiguous term "underperforming")
  Generate clarifying questions → User refines → Optimized query
    │
    ▼
[Stage 1: Task Decomposition] Few-shot + CoT
  Step 1: Load department sales data table
  Step 2: Calculate MoM growth rate for each department over 3 months
  Step 3: Filter departments with growth rate below -10%
  Step 4: Generate anomaly analysis report
    │
    ▼
[Stage 2: GraphRAG Retrieval]
  Retrieved: department_sales table
  Fields: department_id / department_name / sales_amount / order_time
    │
    ▼
[Stage 3: SQL Generation] CoT 5-step reasoning chain
  Step 1: Identify intent → calculate MoM growth rate, filter anomalies
  Step 2: Identify table/fields → department_sales, sales_amount
  Step 3: Define JOIN → self-join (current month vs. previous month)
  Step 4: Define aggregation → SUM(sales_amount)
  Step 5: Define logic → (current - previous) / previous < -0.1
    │
    ▼
  Generated SQL:

SELECT
    t1.department_name,
    (SUM(t1.sales_amount) - SUM(t2.sales_amount))
        / SUM(t2.sales_amount) AS growth_rate
FROM department_sales t1
JOIN department_sales t2
    ON t1.department_id = t2.department_id
    AND t1.order_time BETWEEN '2024-11-01' AND '2025-01-31'
    AND t2.order_time BETWEEN '2024-08-01' AND '2024-10-31'
GROUP BY t1.department_name
HAVING growth_rate < -0.1

    │
    ▼
[Stage 4: Execution + Auto-Debug]
  Success → continue
  Failure → 3-round CoT repair (80% auto-fix rate)
    │
    ▼
[Stage 5: Result Interpretation] Markdown output
  ## Results
  | Department | MoM Growth Rate |
  |------------|-----------------|
  | Region A   | -15.2%          |
  | Region B   | -12.8%          |

  ## Conclusion
  Region A and Region B recorded MoM growth rates of -15.2% and -12.8%
  respectively over the past 3 months, both below the -10% threshold.
  Recommend prioritizing a review of sales strategy and market conditions
  in these two regions.
    │
    ▼
[Archive] LangSmith records full trace + archival system stores result

7. Results

Metric	Before (v1)	After (v3)
SQL Combined Accuracy	60%	93%
Complex SQL Accuracy	40%	85%
Execution Failure Rate	28%	5%
Mean Time to Root Cause	2 hours	5 minutes
Human Review Rate	~100%	5%
Monthly Inference Cost	$1,000	$400
Average Response Time	2.5s	1.2s

8. Three Counter-Intuitive Lessons

Lesson 1: The test set matters more than the evaluation algorithm

We spent a lot of time early on exploring "LLM-as-a-judge" approaches. What we learned in practice: a high-quality 200-case test set is worth more than any sophisticated evaluation algorithm. The test set is your explicit definition of "what good looks like." If that definition is fuzzy, no algorithm can save you.

Lesson 2: Set the gate threshold conservatively, not aggressively

We initially set the gate at "accuracy can't drop more than 5%." That let several small regressions slip through, and the cumulative effect was significant. We changed it to 2%. It's better to block a few versions that have net gains but introduce partial regressions than to let any regression through.

Lesson 3: The feedback loop only works if it's enforced

Many teams have monitoring. Few teams actually use the data. Our key practice: every week, we mandatorily extract failure cases from the archive, mandatorily perform root cause analysis, and mandatorily expand the test set. The word "mandatory" matters — without it, business pressure will always push this work to next week.

What's Next

The final article in this series (Part 9) will be a complete architecture retrospective and best practices summary: what architectural decisions we got right over 8 weeks, what we got wrong, and what we'd do differently if we started over.

This article is Part 8 of the series "8 Weeks from Zero to One: Full-Stack Engineering Practice for a Production-Grade LLM Application."

Top comments (9)

Max Quimby • Jun 14

The regression-gate-before-ship part resonates hard. We hit the same wall running multi-agent pipelines — a prompt tweak to one agent would silently degrade a downstream agent three hops away, and we'd only find out from the final output. The 150-case golden set built from marked-correct historical queries is a smart cold-start. The thing nobody warns you about is that the golden set rots: as real usage drifts, your "representative" set quietly stops being representative, and your 93% becomes a comfortable lie. We ended up re-sampling from fresh production traffic on a cadence to keep it honest. Curious how you're grading at scale — exact-match on the SQL, execution-result equivalence, or an LLM judge? We found LLM judges add their own variance (same answer, different score across runs), so we had to pin the judge model + temperature and keep a small human-audited slice to catch judge drift. How are you handling the "who grades the grader" problem?

James Lee • Jun 16

Max, both points hit home.

On grading: we're not using an LLM judge — execution-result equivalence at 70% weight, plus rule-based checks and 5% human spot-check. The variance problem you described is exactly why we dropped the LLM judge idea early. We actually called this out in Lesson 1: spent a while exploring evaluation algorithms, and kept coming back to the same conclusion — a clean, high-quality test set beats any fancy grading approach.

On golden set rot: our mechanism is different from yours. You re-sample from fresh production traffic on a cadence, we go failure-driven — every week we pull last week's failure archive, and anything low-frequency (≤3/week) gets added to the test set. The upside is every new case is something that actually hurt us. The downside is exactly what you flagged: drift that hasn't caused a failure yet stays invisible.

Your re-sampling approach covers that blind spot. Might steal it.

Ken • Jun 16

Strong piece. The part I’d add to Max and Theo’s comments is that the eval system needs its own release artifact, not just a score.

For every prompt/model/provider change I like to see: fixture set version, expected-output/schema version, scorer version, judge model and temperature if an LLM judge is used, a small human-audited slice, and the lifecycle decision that came out of it: promote, pin, retest, rollback, or expand the dataset.

That makes “93%” less of a dashboard number and more of a reproducible decision.

James Lee • Jun 17

Ken, we track prompt version + dataset version, but scorer version and the lifecycle decision (promote / pin / rollback) aren't formalized yet.

They're in Slack threads and commit messages.

Basically the same problem as "it works on my machine" but for the eval system itself.

Ken • Jun 17

Exactly. That is the failure mode I'd want to close before the eval system starts carrying more release authority.

A lightweight release record is often enough: prompt version, dataset version, scorer version, model/provider version, run id, decision, decision owner, and rollback/pin note. The key is that the lifecycle decision lives beside the eval result, not only in Slack archaeology.

Otherwise six weeks later the team can reproduce the score but not the reason the score was trusted.

James Lee • Jun 18

Ken, that last line is the exact thing that bit us.

We had a v2 → v3 promotion where the gate passed, but three weeks later nobody could reconstruct why we trusted the 93% — the scorer had been quietly updated, the human-audited slice was different, and the decision was in a Slack thread that had scrolled off.

Your "release record" framing clicks. We're treating the eval run as a measurement, not a decision artifact. Going to fix that — at minimum: scorer version + decision owner + rollback note, living next to the score, not in Slack archaeology.

Ken • Jun 18

Good plan. This is also why I eventually drifted from wood shop toward software: I still get to build things, but most mistakes are easier to roll back and less threatening to limbs. 😅

Theo Valmis • Jun 15

The golden-dataset-plus-regression-gate is the moment an LLM system becomes software again, and the underrated part is what it buys beyond catching regressions: it makes prompt changes reversible. Without the gate, every tweak is a bet you can't settle until production complains, so you accumulate prompt edits you're afraid to touch, the prompt version of code nobody refactors. With it, a prompt change behaves like a code change: try it, measure it, roll back on a number. Your 60 to 85 to 93 story is only legible because the dataset held the baseline still; the same three edits without it are "v2 felt better." One caveat worth holding: the dataset becomes the spec, so its blind spots become the system's. The queries nobody thought to enumerate are exactly the ones that sail through the gate and break in prod, which is why the golden set has to keep growing from real failures, not stay frozen at launch.

James Lee • Jun 16

Theo, "prompt changes become reversible" — that's a better way to put it than anything I wrote.

Your caveat is exactly what we hit. Multi-intent queries sailed through the gate, then broke in prod. The gate was green, users were not happy.

What we do now: every week, pull last week's failures, high-frequency ones fix the Prompt, low-frequency ones go into the Golden Set. Not perfect, but at least the dataset grows from real pain instead of staying frozen at launch.

The lag is still there though. You're right — the gate only knows what it's seen.