TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 19

The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work

#ai #programming #machinelearning #webdev

0.02 → 0.61 Cohen's Kappa — LLM judge calibration from near-random to near-human agreement
0.69 — human evaluator Kappa baseline; the meaningful ceiling for any judge
300 examples — hand-crafted benchmark for the Flow agent; necessary but not sufficient
2 weeks — time to close the benchmark-to-production gap using the production mirroring flywheel
Weekly — Qwen3-32B retraining cadence on H200 GPUs (12h full run)
1,200 production LLM deployments analysed by ZenML — Shopify's findings are not exceptional, they are universal

Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The final 15% — the gap between "impressive demo" and "product I'd trust with my customers" — takes the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.

The Story

ZenML analysed 1,200 production LLM deployments and found a pattern so consistent it has become a rule: reaching 80% quality happens quickly, but pushing past 95% requires the majority of total development time. The teams that hit 80% in four weeks and spend the next six months trying to reach 95% are not failing — they are experiencing the standard engineering curve for AI systems. The teams that mistake 80% for done are the ones shipping products that quietly erode user trust.

Shopify's engineering teams, building both Sidekick (the merchant AI assistant) and the Flow agent (automated workflow generation from natural language), lived this curve in production. The Flow agent generates Shopify Flow automations from merchant descriptions — "when an order is over $200, add the customer to my VIP segment" — and produces a structured workflow. It uses tool calling (a pattern where an LLM is given a set of available functions with descriptions and can request that a specific tool be executed by generating a structured function call — enabling LLMs to take real-world actions beyond text generation) and operates in a domain-specific format. The task sounds well-bounded. In practice, the diversity of merchant intent is vast, edge cases accumulate rapidly, and subtle errors — a wrong condition operator, a missing trigger — produce silently incorrect automations that only fail when a merchant's order actually arrives.

Why Evaluation Is the Hard Part

Traditional software has a truth oracle: does the function return the correct value? LLM systems have no such oracle. A response can be grammatically correct, semantically reasonable, formatted perfectly — and still be wrong in ways only a domain expert would notice, or only appear wrong on the tenth interaction in a specific workflow. Without a reliable way to measure quality, you cannot improve systematically. You are optimising blind, hoping the next prompt change or model upgrade makes things better without making other things worse. Evaluation infrastructure is not overhead — it is the prerequisite for all other AI engineering work.

Problem

Benchmarks Said Ready; Production Said Otherwise

Shopify's fine-tuned Flow agent passed a hand-crafted 300-example benchmark at high accuracy. When deployed to production shadow traffic, performance on real merchant workflows diverged from the benchmark. The benchmark had been crafted by engineers who knew the system well and implicitly sampled from the distribution they understood. Real merchant intent had a long tail the benchmark didn't capture.

Cause

No Quality Signal Trustworthy Enough to Drive Iteration

The early LLM judge had a Cohen's Kappa (a statistical measure of agreement between two raters that corrects for chance — Kappa of 0 means agreement no better than random, 1.0 means perfect agreement) of 0.02 — barely better than random agreement with human evaluators. Engineering decisions based on its verdicts were effectively noise. Human evaluation at scale was impractical. Without a trustworthy quality signal, iteration was slow and direction was unclear.

Solution

Calibrated LLM Judge + Production Mirroring Flywheel

The team iteratively improved the LLM judge through systematic calibration against human labels (Kappa 0.02 → 0.61), then used it to score production traffic at scale. Production mirroring — routing real traffic through both current and candidate models — generated the failure cases that didn't appear in benchmarks. Those failures were fed back into the training dataset, closing the benchmark-to-production gap.

Result

Production Gap Closed in Two Weeks with the Flywheel

The gap from "benchmark-ready" to "production-ready" closed in two weeks using the production mirroring flywheel. The fine-tuned Flow agent now serves the majority of production traffic. Weekly retraining cycles on H200 GPUs mean the model continuously improves from new production signal rather than drifting as merchant behaviour evolves.

The Fix

Building the Evaluation Flywheel

Shopify's evaluation architecture is best understood as a flywheel: production traffic generates failures, failures feed the training pipeline, retraining improves the model, the improved model generates fewer failures, and the cycle continues. Each turn reduces the gap between benchmark performance and production performance. The flywheel only works if each component — quality measurement (LLM judge), failure collection (production mirroring), training (fine-tuning pipeline), deployment (shadow traffic + promotion) — is production-grade itself. A miscalibrated judge produces misleading signal. A flaky training pipeline slows iteration.

0.61 — Cohen's Kappa achieved after iterative calibration — close to the human evaluator baseline of 0.69, sufficient to drive reliable engineering decisions
300 — hand-crafted benchmark examples, covering the breadth of expected usage; initial quality gate before shadow testing
2 weeks — time to close the benchmark-to-production gap using the production mirroring flywheel
Weekly — Qwen3-32B retraining cadence on H200 GPUs; 12-hour full training run per cycle

# LLM judge calibration: the process from Kappa 0.02 to 0.61
# A judge is only useful if it agrees with humans. Measure agreement first.

from sklearn.metrics import cohen_kappa_score

def calibrate_llm_judge(judge_prompt: str, calibration_set: list[dict]) -> float:
    """
    calibration_set: list of {conversation, human_label} pairs
    human_label: 'good' | 'bad' | 'needs_improvement'
    Returns Cohen's Kappa between judge and human labels.
    Target: Kappa >= 0.60 before trusting judge at scale.
    """
    judge_labels = []
    for sample in calibration_set:
        verdict = call_llm(judge_prompt, sample['conversation'])
        judge_labels.append(verdict)

    human_labels = [s['human_label'] for s in calibration_set]
    return cohen_kappa_score(human_labels, judge_labels)

# The calibration loop — iterate until the judge is trustworthy
kappa = 0.02  # initial judge is barely better than random
while kappa < 0.60:
    # Analyse where judge and humans disagree
    disagreements = find_disagreements(calibration_set, current_judge_labels)

    # Improve judge prompt based on disagreement patterns:
    # - Add clarifying criteria for ambiguous cases
    # - Add few-shot examples where human label is the ground truth
    # - Adjust rubric language to match human intuitions
    new_judge_prompt = improve_prompt(current_judge_prompt, disagreements)

    kappa = calibrate_llm_judge(new_judge_prompt, calibration_set)
    print(f"Kappa: {kappa:.2f}")  # progression: 0.02 → 0.15 → 0.31 → 0.48 → 0.61

# Once Kappa >= 0.60: use judge to score production traffic at scale
# Once judge is calibrated: production mirroring generates the failure cases
# that benchmarks never captured — feed those failures back into training data

Production Mirroring: The Ground Truth Test

Benchmarks are necessary but not sufficient. A benchmark reflects the understanding of the engineers who created it. Production traffic reflects the actual diversity of user intent — including all edge cases, unusual phrasings, and unexpected use patterns no engineer anticipated. Production mirroring routes a percentage of real traffic through both the current model and the candidate model simultaneously, comparing outputs. Differences trigger human review of high-value or uncertain cases. This is the only way to discover whether a model improvement that looks good on a benchmark actually performs better for real users — or merely performs better on what engineers think real users want.

Synthetic training data: how Shopify generated the Flow agent dataset

The Flow agent's fine-tuning training data was almost entirely synthetic — generated by an LLM, not labelled by humans. The three-step pipeline: (1) sample a diverse set of validated production workflows — at least one per unique workflow descriptor, from merchants with two or more qualifying workflows; (2) use a stronger LLM to generate a plausible natural-language merchant request that would lead to that workflow; (3) construct the ideal multi-turn tool call trajectory from request to completed workflow. The resulting dataset had two properties manual annotation lacks: scale (the production workflow corpus is large) and grounding (every training example was derived from a real workflow that actually ran). Synthetic data from real production outputs is the emerging standard for fine-tuning domain specialists.

Tangle: the ML pipeline that enables weekly retraining

The full training pipeline — data collection, synthetic data generation, fine-tuning, evaluation, deployment — runs on Tangle, Shopify's open-source ML experimentation platform. Tangle composes each pipeline step as a reproducible workflow with intelligent caching: only the steps affected by a change re-run. A change to the synthetic data generator doesn't trigger a full pipeline rerun — only the data generation step and its downstream steps re-execute. The caching infrastructure is what makes weekly retraining economically and operationally viable. Without it, the iteration cycle would be measured in months, not weeks.

Architecture

The evaluation architecture for production LLM systems has four components that form a cycle. Benchmark evaluation provides fast, reproducible quality gates during development. LLM-as-judge scoring provides continuous quality measurement at production traffic scale. Production mirroring provides ground truth about whether a candidate model performs better for real users. The training flywheel converts production failures into training examples, closing the gap each cycle. Each component is necessary; none is sufficient alone.

The Production LLM Evaluation Flywheel

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

LLM Judge Architecture: From Random Agreement to Near-Human

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

The Merchant Simulator as Pre-Deployment Safety Net

The merchant simulator sits between benchmark evaluation and production mirroring — a synthetic production environment. It replays real merchant intents (extracted from production conversations) against candidate systems in a controlled environment, before any real merchant sees the new system. This catches the specific failure mode benchmarks miss: correct behaviour on engineer-anticipated test cases, incorrect behaviour on the realistic distribution of merchant intent. The simulator doesn't replace production mirroring — it prevents the worst regressions from reaching the production mirroring stage at all.

Golden datasets: why they are non-negotiable

ZenML's analysis is unambiguous: every successful production LLM deployment they analysed maintains human-in-the-loop golden datasets for critical domains. LLM judges are used for velocity — scoring production traffic at scale. But they drift. A judge trained on last month's quality standards may give wrong verdicts on today's outputs. Golden datasets — small, carefully curated, human-labelled examples representing ground truth — anchor judge calibration and detect judge drift. Without a golden dataset, you have no way to know when your quality measurement system itself has stopped working.

Lessons

You will spend more time building evaluation infrastructure than the application logic itself. This is not inefficiency — it is the correct allocation of engineering effort for probabilistic systems. Accept it before starting. Budget for it explicitly. ZenML's summary from 1,200 deployments: "Perhaps this is a truism by now, but you'll spend more time building evaluation infrastructure than you will on the actual application logic. And if you're not, you're probably shipping broken features."
LLM-as-judge (using a language model to evaluate the outputs of another language model, calibrated against human labels to produce quality scores at scale) is the scalable evaluation pattern. But an uncalibrated judge (Kappa 0.02) is worse than useless — it gives false confidence. Calibrate your judge against human labels before trusting its verdicts. Target Kappa ≥ 0.6. The human evaluator baseline (0.69 for Shopify) is the meaningful ceiling — don't optimise past it.
A benchmark that passes is a necessary condition, not a sufficient one. Benchmarks reflect what engineers anticipated; production reflects what users actually do. Always follow benchmark success with production mirroring — routing real traffic through both current and candidate systems and comparing outputs. Two weeks of shadow traffic is the standard cost of this final validation step.
Synthetic data generation (using an LLM to create training examples from real production outputs — generating natural-language merchant requests from real production workflows) is the path to scalable fine-tuning training data. Manual annotation doesn't scale. Synthetic data derived from production outputs does — and it's grounded in real-world distribution rather than engineer-imagined distribution.
Retraining cycle speed determines how fast you can respond to production drift. Merchant behaviour changes, new workflow patterns emerge, new merchant categories join Shopify — a model trained on last quarter's data will drift from current reality. Weekly retraining on production signal, made economically viable by efficient infrastructure (intelligent caching, H200 GPUs, 12h runs), keeps the model aligned with the world it serves.

Engineering Glossary

Cohen's Kappa — a statistical measure of agreement between two raters that corrects for chance agreement. Kappa of 0 means agreement no better than random; 1.0 means perfect agreement; 0.6+ is generally considered the threshold for a trustworthy judge. The Shopify LLM judge improved from 0.02 to 0.61; the human evaluator baseline was 0.69.

Fine-tuning — the process of further training a pre-trained LLM on a domain-specific dataset to improve performance on a specific task. Used by Shopify to specialise a base model (Qwen3-32B) for Shopify Flow workflow generation, with weekly retraining cycles to keep pace with evolving merchant behaviour.

Golden dataset — a small, carefully curated set of human-labelled evaluation examples representing ground truth for a specific domain. Used to calibrate LLM judges and detect judge drift over time. The anchor of any reliable LLM evaluation system.

LLM-as-judge — the pattern of using a language model to evaluate the outputs of another language model, calibrated against human labels to produce quality scores at scale without requiring manual human evaluation of every production interaction.

Production mirroring — routing a percentage of real production traffic through both the current deployed model and a candidate model simultaneously, comparing outputs to measure whether the candidate performs better for real users. The ground truth test that benchmark evaluation cannot replicate.

Synthetic data generation — using an LLM to create training examples from a production data source — for example, generating plausible natural-language merchant requests from real validated production workflows. Enables scalable training data creation grounded in real-world distribution.

Tool calling — a pattern where an LLM is given a set of available functions (tools) with descriptions and can request that a specific tool be executed by generating a structured function call. Enables LLMs to take real-world actions beyond text generation — used by Shopify's Flow agent to generate and execute workflow operations.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

DEV Community