Shopify · Reliability · 19 May 2026
Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The final 15% — the gap between 'impressive demo' and 'product I'd trust with my customers' — takes the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.
- LLM judge: 0.02 → 0.61 Kappa
- 300-example hand-crafted benchmark
- Production mirroring closes gap in 2 weeks
- Merchant simulator pre-deployment
- Weekly Qwen3-32B retraining cycle
- ZenML: 1,200 prod deployments analyzed
The Story
ZenML analyzed 1,200 production LLM deployments across companies ranging from startups to large enterprises and found a pattern so consistent it has become a rule: reaching 80% quality happens quickly, but pushing past 95% requires the majority of total development time. The teams that hit 80% in four weeks and spend the next six months trying to reach 95% are not failing — they are experiencing the standard engineering curve for AI systems. The teams that mistake 80% for done are the ones shipping products that quietly erode user trust. Shopify's engineering teams, building both Sidekick (the merchant AI assistant) and the Flow agent (automated workflow generation), lived this curve in production. Their solution was not a better model. It was a better measurement system.
WHY EVALUATION IS THE HARD PART
Traditional software has a truth oracle: does the function return the correct value? LLM systems have no such oracle. A response can be grammatically correct, semantically reasonable, formatted perfectly — and still be wrong in ways that only a domain expert would notice, or only appear wrong on the tenth interaction in a specific workflow. Without a reliable way to measure quality, you cannot improve systematically. You are optimizing blind, hoping that the next prompt change or model upgrade makes things better without making other things worse. Evaluation infrastructure is not overhead — it is the prerequisite for all other AI engineering work.
Shopify's Flow agent generates Shopify Flow automations from natural language: merchants describe what they want to happen ('when an order is over $200, add the customer to my VIP segment'), and the agent produces the workflow. The task requires tool calling and produces structured output in a domain-specific format. In tool calling, an LLM is given a set of available functions (tools) with descriptions and can request that a specific tool be executed by generating a structured function call, which lets it take real-world actions beyond text generation. It sounds well-bounded. In practice, the diversity of merchant intent is vast, the edge cases accumulate rapidly, and subtle errors in the generated workflow (a wrong condition operator, a missing trigger) produce silently incorrect automations that only fail when a merchant's order actually arrives.
📏
Shopify calibrated their LLM judge from a Cohen's Kappa of 0.02 (essentially random: the judge agreed with human evaluators no more than chance would predict) to 0.61, close to the human evaluator baseline of 0.69. That the human baseline is 0.69 rather than 1.0 is a reminder that human evaluators don't perfectly agree with each other either. The goal is not a perfect judge; it's a judge trustworthy enough that its signals drive reliable engineering decisions.
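To make the statistic concrete, here is a minimal from-scratch sketch of Cohen's Kappa for two raters. The label lists are toy data invented for illustration, not Shopify's, and the function name is ours.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's Kappa, (p_o - p_e) / (1 - p_e), for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each rater labelled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): a judge that says "good" to almost everything.
human = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
judge = ["good", "good", "good", "good", "good", "good", "good", "bad"]
print(f"kappa: {cohen_kappa(human, judge):.2f}")
```

In this toy case the judge matches the human on 5 of 8 items, yet Kappa is only 0.25, because a rater who almost always answers "good" would match half the items by chance. That gap between raw agreement and Kappa is why the metric is used for judge calibration.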
Problem
Benchmarks Said Ready; Production Said Otherwise
Shopify's fine-tuned Flow agent passed a hand-crafted 300-example benchmark at high accuracy. When deployed to production shadow traffic, performance on real merchant workflows diverged from the benchmark. The benchmark had been crafted by engineers who knew the system well and implicitly sampled from the distribution they understood. Real merchant intent had a long tail the benchmark didn't capture.
Cause
No Quality Signal Trustworthy Enough to Drive Iteration
The early LLM judge had a Cohen's Kappa of 0.02 — barely better than random agreement with human evaluators. This meant the judge's verdicts could not reliably distinguish good responses from bad ones. Engineering decisions based on judge verdicts were effectively noise. Human evaluation at scale was impractical. Without a trustworthy quality signal, iteration was slow and direction was unclear.
Solution
Calibrated LLM Judge + Production Mirroring Flywheel
The team iteratively improved the LLM judge through systematic calibration against human labels (Kappa 0.02 → 0.61), then used it to score production traffic at scale. Production mirroring — routing real traffic through both current and candidate models — generated the failure cases that didn't appear in benchmarks. Those failures were fed back into the training dataset, closing the benchmark-to-production gap.
Result
Production Gap Closed in Two Weeks with the Flywheel
The gap from 'benchmark-ready' to 'production-ready' closed in two weeks using the production mirroring flywheel. The fine-tuned Flow agent now serves the majority of production traffic. Weekly retraining cycles on H200 GPUs mean the model continuously improves from new production signal rather than drifting as merchant behavior evolves.
⚠️
The Human Agreement Ceiling
One of the most grounding facts in Shopify's evaluation system is that human evaluators agreed with each other at Cohen's Kappa of 0.69 — not 1.0. Humans disagree about quality. This is not a failure of the evaluation process; it reflects genuine ambiguity in what 'correct' means for natural language tasks. The practical implication: don't try to build a perfect judge. Build a judge that matches or approaches human agreement levels, and treat that as the meaningful ceiling. Optimizing a judge past the human agreement level is overfitting to individual human annotators, not finding truth.
The merchant simulator deserves particular attention as an engineering pattern. Before any system change ships to production, it is tested against simulated merchant interactions derived from real production conversations. The simulator captures the 'essence' of each real conversation (the underlying merchant goal) and replays that goal against the new system. This is fundamentally different from benchmark evaluation: it tests the new system against realistic merchant intent distributions, including the long tail that engineering-crafted benchmarks consistently miss. It is also fundamentally different from A/B testing: it catches regressions before any real merchant sees them, without requiring a traffic split.
ℹ️
Synthetic Data: Closing the Training Data Gap
The Flow agent's fine-tuning training data was almost entirely synthetic — generated by an LLM, not labeled by humans. The process: sample a real production workflow, use a stronger model to generate a plausible natural-language request that would produce it, construct the ideal multi-turn tool trajectory. The synthetic data generation was the majority of the engineering effort. The resulting dataset covered the breadth of Flow's usage in a way that manual annotation never could — because the diversity of real workflows provided the supervision signal, and the LLM provided the scale. This is the emerging pattern for fine-tuning specialists: synthetic data from real production outputs, not expensive human annotation from scratch.
🔬
The Industry Pattern: ZenML's 1,200 Case Studies
ZenML's LLMOps database of 1,200+ production deployments suggests that Shopify's experience is typical, not exceptional. The summary from their analysis: 'Perhaps this is a truism by now, but you'll spend more time building evaluation infrastructure than you will on the actual application logic. And if you're not, you're probably shipping broken features.' LLM-as-judge has emerged as the dominant pattern for scalable quality measurement. But the successful deployments in ZenML's analysis maintain human-in-the-loop golden datasets for critical domains. The dual-layer approach (LLM judges for velocity, human ground truth for calibration) is the de facto standard.
ℹ️
The A/B Test That Isn't
Teams new to LLM evaluation often reach for A/B testing as the measurement tool: split traffic, measure conversion, pick the winner. A/B testing has a fatal problem for LLM evaluation: a 5% improvement in a downstream metric like merchant click-through might take weeks of traffic to reach statistical significance — and you may have introduced a subtle quality regression in a different dimension that the metric doesn't capture. Production mirroring with direct output comparison is faster and richer: you see whether response quality improved for the same inputs, without waiting for downstream business metric movement. Business metrics confirm value; output comparison guides engineering.
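A rough sketch of why the A/B route is slow, using the standard sample-size formula for comparing two proportions. The 10% baseline click-through rate, the 5% significance level, and the 80% power are assumptions chosen for illustration; the source does not give these figures.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(p_base: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in each arm to detect a relative lift in a rate."""
    p_new = p_base * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2)


# Hypothetical baseline: 10% click-through, looking for a 5% relative improvement.
n = samples_per_arm(p_base=0.10, relative_lift=0.05)
print(f"about {n:,} merchants per arm")
```

Under these assumptions the answer is roughly 58,000 merchants per arm. How many weeks that represents depends entirely on traffic volume, but the order of magnitude shows why direct output comparison on mirrored traffic gives engineering feedback sooner.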
THE COST CURVE IS ASYMMETRIC
The 80% → 95% quality journey is asymmetric in effort. The first 80% comes from model capability: the LLM already knows how to generate text, use tools, and follow instructions. The final 15% comes from understanding the specific failure modes of your specific application on your specific user distribution — and that knowledge cannot be bought or downloaded. It is earned through measurement, systematic failure analysis, and targeted training data creation. The companies that invest in this domain-specific evaluation work build durable advantages over those that simply upgrade to the next model version and hope.
🏭
Notion AI (referenced in ZenML's analysis) built a multi-layer evaluation stack that balances speed and cost: cheap heuristic checks on every commit, LLM judge scoring on every merge, and expensive human evaluation on every release candidate. Teams that adopted this tiered approach reported 10x faster development velocity compared to running full human evaluation on every change. The insight: match eval depth to the stakes of the change, not to a uniform 'always run everything' policy.
The Fix
Building the Evaluation Flywheel
Shopify's evaluation architecture is best understood as a flywheel: production traffic generates failures, failures feed the training pipeline, retraining improves the model, the improved model generates fewer failures, and the cycle continues. Each turn of the flywheel reduces the gap between benchmark performance and production performance. The flywheel only works if each component — quality measurement (LLM judge), failure collection (production mirroring), training (fine-tuning pipeline), deployment (shadow traffic + promotion) — is production-grade itself. A miscalibrated judge produces misleading signal. A flaky training pipeline slows iteration. A low-coverage benchmark misses the failures that actually matter.
- 0.61 — Cohen's Kappa achieved for Shopify's LLM judge after iterative calibration — close to the human evaluator baseline of 0.69 and sufficient to drive reliable engineering decisions
- 300 — Hand-crafted benchmark examples for the Flow agent — covering breadth of expected usage, used as the initial quality gate before production shadow testing
- 2 weeks — Time to close the benchmark-to-production gap using the production mirroring flywheel — from 'benchmark says ready' to 'production confirms ready'
- Weekly — Qwen3-32B retraining cadence on H200 GPUs (12h full training run) — keeping the model aligned with evolving merchant behavior without months-long release cycles
```python
# LLM judge calibration: the process that takes you from Kappa 0.02 to 0.61.
# A judge is only useful if it agrees with humans. Measure this first.
from sklearn.metrics import cohen_kappa_score

# NOTE: call_llm, find_disagreements and improve_prompt are placeholders for
# your own LLM client and prompt-revision tooling, as are initial_judge_prompt
# and calibration_set; none of them are library functions.


def calibrate_llm_judge(judge_prompt: str, calibration_set: list[dict]) -> tuple[float, list[str]]:
    """
    calibration_set: list of {conversation, human_label} dicts
    human_label: 'good' | 'bad' | 'needs_improvement'
    Returns Cohen's Kappa between judge and human labels, plus the judge labels.
    """
    judge_labels = []
    for sample in calibration_set:
        # Ask the judge to evaluate this conversation
        judge_verdict = call_llm(judge_prompt, sample['conversation'])
        judge_labels.append(judge_verdict)
    human_labels = [s['human_label'] for s in calibration_set]
    kappa = cohen_kappa_score(human_labels, judge_labels)
    return kappa, judge_labels  # target: kappa >= 0.60 before trusting judge at scale


# The calibration loop:
current_judge_prompt = initial_judge_prompt
# Shopify's initial judge scored 0.02 here: essentially random
kappa, current_judge_labels = calibrate_llm_judge(current_judge_prompt, calibration_set)
while kappa < 0.60:
    # Analyze where judge and humans disagree
    disagreements = find_disagreements(calibration_set, current_judge_labels)
    # Improve judge prompt based on disagreement patterns:
    # - Add clarifying criteria for ambiguous cases
    # - Add few-shot examples from disagreements (human = ground truth)
    # - Adjust rubric language to match human intuitions
    # Keep any few-shot examples out of the set you measure on, or Kappa is inflated.
    current_judge_prompt = improve_prompt(current_judge_prompt, disagreements)
    kappa, current_judge_labels = calibrate_llm_judge(current_judge_prompt, calibration_set)
    print(f"Kappa after iteration: {kappa:.2f}")  # e.g. 0.15 → 0.31 → 0.48 → 0.61
```
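The disagreement-analysis step in the loop above is left abstract. One possible shape for it, as a hedged sketch: `find_disagreements` here is our own illustrative implementation of the placeholder name, and `confusion_pairs` is an extra hypothetical helper for spotting the most common kind of mismatch.

```python
from collections import Counter


def find_disagreements(calibration_set: list[dict], judge_labels: list[str]) -> list[dict]:
    """Return the samples where the judge's label differs from the human label."""
    return [
        {**sample, "judge_label": judge_label}
        for sample, judge_label in zip(calibration_set, judge_labels)
        if sample["human_label"] != judge_label
    ]


def confusion_pairs(disagreements: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """Count (human_label, judge_label) pairs, most frequent first."""
    counts = Counter((d["human_label"], d["judge_label"]) for d in disagreements)
    return counts.most_common()


# Toy calibration data (hypothetical).
calibration_set = [
    {"conversation": "c1", "human_label": "bad"},
    {"conversation": "c2", "human_label": "bad"},
    {"conversation": "c3", "human_label": "good"},
    {"conversation": "c4", "human_label": "needs_improvement"},
]
judge_labels = ["good", "good", "good", "bad"]

disagreements = find_disagreements(calibration_set, judge_labels)
print(confusion_pairs(disagreements))
```

Sorting mismatches by frequency points the prompt revision at the dominant failure first. In this toy data the judge mostly rates "bad" conversations as "good", which suggests the rubric needs sharper criteria for what counts as a failure.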
PRODUCTION MIRRORING: THE GROUND TRUTH TEST
Benchmarks are necessary but not sufficient. A benchmark is a fixed dataset that reflects the understanding of the engineers who created it. Production traffic reflects the actual diversity of user intent, including all the edge cases, unusual phrasings, and unexpected use patterns that no engineer anticipated. Production mirroring routes a percentage of real traffic through both the current model and the candidate model simultaneously and compares the outputs. Differences trigger human review of high-value or uncertain cases. This is the most direct way to discover whether a model improvement that looks good on a benchmark actually performs better for real users, or merely performs better on what engineers think real users want.
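A minimal sketch of the mirroring idea, assuming both models can be called as plain functions. `ShadowMirror`, `MirrorResult`, and the toy models are hypothetical names for illustration, not Shopify's implementation; a real system would run the candidate asynchronously and compare outputs with a judge rather than by string equality.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MirrorResult:
    request: str
    current_output: str
    candidate_output: str

    @property
    def differs(self) -> bool:
        return self.current_output != self.candidate_output


@dataclass
class ShadowMirror:
    current_model: Callable[[str], str]
    candidate_model: Callable[[str], str]
    sample_rate: float  # fraction of traffic that is also sent to the candidate
    rng: random.Random = field(default_factory=random.Random)
    review_queue: list[MirrorResult] = field(default_factory=list)

    def handle(self, request: str) -> str:
        served = self.current_model(request)  # the user only ever sees this output
        if self.rng.random() < self.sample_rate:
            try:
                shadow = self.candidate_model(request)
            except Exception as exc:  # a candidate crash must never affect the user
                shadow = f"<candidate error: {exc}>"
            result = MirrorResult(request, served, shadow)
            if result.differs:
                self.review_queue.append(result)  # candidates for human review
        return served


def current(request: str) -> str:
    return f"workflow-v1({request})"


def candidate(request: str) -> str:
    # Toy stand-in: the candidate behaves differently on VIP requests only.
    return f"workflow-v2({request})" if "VIP" in request else f"workflow-v1({request})"


mirror = ShadowMirror(current, candidate, sample_rate=1.0)
for req in ["tag large orders", "add customer to VIP segment"]:
    mirror.handle(req)
print([r.request for r in mirror.review_queue])
```

The key property is that the candidate's output is recorded but never served, so a bad candidate costs inference budget and nothing else.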
ℹ️
The Synthetic Data Generation Pipeline
Shopify's Flow agent training data was generated through a three-step pipeline: Step 1 — sample a diverse set of validated production workflows (at least one workflow per unique workflow descriptor, from merchants with two or more qualifying workflows). Step 2 — use a stronger LLM to generate a plausible natural-language merchant request that would lead to that workflow. Step 3 — construct the ideal multi-turn tool call trajectory from request to completed workflow. The resulting dataset had two properties manual annotation lacks: scale (the production workflow corpus is large) and grounding (every training example was derived from a real workflow that actually ran).
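The three steps can be sketched as follows. The field names (`merchant_id`, `descriptor`), the two callables, and the choice to keep exactly one workflow per descriptor are assumptions made for illustration; the source only states the sampling rule in prose.

```python
from collections import defaultdict
from typing import Callable


def sample_workflows(workflows: list[dict]) -> list[dict]:
    """Step 1: one workflow per unique descriptor, drawn only from merchants
    that have two or more qualifying workflows."""
    per_merchant: dict[str, list[dict]] = defaultdict(list)
    for wf in workflows:
        per_merchant[wf["merchant_id"]].append(wf)
    chosen: dict[str, dict] = {}
    for merchant_workflows in per_merchant.values():
        if len(merchant_workflows) < 2:
            continue  # merchant does not meet the two-workflow threshold
        for wf in merchant_workflows:
            chosen.setdefault(wf["descriptor"], wf)  # first workflow seen wins
    return list(chosen.values())


def build_example(workflow: dict,
                  generate_request: Callable[[dict], str],
                  build_trajectory: Callable[[str, dict], list[dict]]) -> dict:
    """Steps 2 and 3: synthesize a merchant request, then the ideal tool-call trajectory."""
    request = generate_request(workflow)  # in practice, a call to a stronger LLM
    return {
        "request": request,
        "trajectory": build_trajectory(request, workflow),
        "target": workflow,
    }


# Toy corpus (hypothetical).
workflows = [
    {"merchant_id": "m1", "descriptor": "tag-order", "id": 1},
    {"merchant_id": "m1", "descriptor": "vip-segment", "id": 2},
    {"merchant_id": "m2", "descriptor": "tag-order", "id": 3},
    {"merchant_id": "m2", "descriptor": "refund-alert", "id": 4},
    {"merchant_id": "m3", "descriptor": "low-stock", "id": 5},
]
sampled = sample_workflows(workflows)
example = build_example(
    sampled[0],
    generate_request=lambda wf: f"please set up {wf['descriptor']}",
    build_trajectory=lambda req, wf: [{"tool": "create_workflow", "args": wf["descriptor"]}],
)
print([wf["id"] for wf in sampled], example["request"])
```

Because the target workflow comes first and the request is generated to match it, every example has a known-correct answer without a human labeling step.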
✅
Tangle: The ML Pipeline That Enables Weekly Retraining
The full training pipeline — data collection, synthetic data generation, fine-tuning, evaluation, deployment — runs on Tangle, Shopify's open-source ML experimentation platform. Tangle composes each pipeline step as a reproducible workflow with intelligent caching: only the steps affected by a change re-run. This means a change to the synthetic data generator doesn't trigger a full pipeline rerun from scratch — only the data generation step and its downstream steps re-execute. The caching infrastructure is what makes weekly retraining economically and operationally viable. Without it, the iteration cycle would be measured in months, not weeks.
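The caching behavior described here can be illustrated with a small content-addressed step cache. This is a generic sketch of the idea, not Tangle's API; `CachedPipeline`, `run_step`, and the step names are hypothetical.

```python
import hashlib
import json
from typing import Callable


class CachedPipeline:
    def __init__(self) -> None:
        self.cache: dict[str, object] = {}
        self.executed: list[str] = []  # names of steps that actually ran

    def run_step(self, name: str, version: str, fn: Callable, *upstream_keys: str) -> str:
        """Run fn on its upstream outputs unless an identical invocation is cached.

        The cache key covers the step name, its code version, and the keys of its
        inputs, so a change anywhere upstream invalidates everything downstream.
        """
        payload = json.dumps([name, version, list(upstream_keys)])
        key = hashlib.sha256(payload.encode()).hexdigest()
        if key not in self.cache:
            inputs = [self.cache[k] for k in upstream_keys]
            self.cache[key] = fn(*inputs)
            self.executed.append(name)
        return key


def run(pipeline: CachedPipeline, generator_version: str) -> object:
    collect = pipeline.run_step("collect", "v1", lambda: ["wf-a", "wf-b"])
    synth = pipeline.run_step(
        "synthesize", generator_version,
        lambda wfs: [f"request for {w}" for w in wfs], collect,
    )
    train = pipeline.run_step("train", "v1", lambda data: f"model({len(data)} examples)", synth)
    return pipeline.cache[train]


pipeline = CachedPipeline()
run(pipeline, generator_version="v1")  # cold run: every step executes
pipeline.executed.clear()
run(pipeline, generator_version="v2")  # only the synthetic data generator changed
print(pipeline.executed)
```

On the second run the data collection step is served from cache, while the generator and the training step that depends on it re-execute, which matches the behavior the paragraph describes.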
⚠️
Golden Datasets Are Non-Negotiable
ZenML's analysis is unambiguous: every successful production LLM deployment they analyzed maintains human-in-the-loop golden datasets for critical domains. LLM judges are used for velocity — scoring production traffic at scale. But they drift. A judge trained on last month's quality standards may give wrong verdicts on today's outputs. Golden datasets — small, carefully curated, human-labeled examples that represent ground truth — anchor the judge calibration and detect judge drift. Without a golden dataset, you have no way to know when your quality measurement system itself has stopped working.
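A hedged sketch of a drift check against a golden dataset. It uses raw agreement to stay short, where Cohen's Kappa would be the better metric in practice; the function names, the 0.05 tolerance, and the toy data are all assumptions for illustration.

```python
from typing import Callable


def judge_agreement(golden_set: list[dict], judge: Callable[[str], str]) -> float:
    """Fraction of golden examples where the judge matches the human label."""
    if not golden_set:
        raise ValueError("golden set must not be empty")
    hits = sum(judge(ex["conversation"]) == ex["human_label"] for ex in golden_set)
    return hits / len(golden_set)


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag the judge when agreement falls more than `tolerance` below the baseline."""
    return baseline - current > tolerance


# Toy golden dataset (hypothetical).
golden_set = [
    {"conversation": "c1", "human_label": "good"},
    {"conversation": "c2", "human_label": "bad"},
    {"conversation": "c3", "human_label": "good"},
    {"conversation": "c4", "human_label": "bad"},
]


def lenient_judge(conversation: str) -> str:
    # Stand-in for a judge that has drifted toward approving everything.
    return "good"


baseline = 1.0  # hypothetical agreement recorded when the judge was calibrated
current = judge_agreement(golden_set, lenient_judge)
print(current, has_drifted(baseline, current))
```

Running a check like this on a schedule turns judge drift from a silent failure into an alert, which is the role the golden dataset plays in the text.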
✅
The Two-Week Rule
Shopify's experience with the production mirroring flywheel produced a rule of thumb that has since appeared in multiple other teams' postmortems: if your candidate model passes benchmark evaluation, it takes approximately two weeks of production mirroring to confirm whether it's truly production-ready. Two weeks of real traffic at a shadow percentage generates enough diverse examples to surface the tail failures that the benchmark didn't cover. If the flywheel is working, those failures are incorporated into the training data and the model improves. If the failures are systematic, indicating a training distribution problem rather than isolated edge cases, the two weeks reveal this before the model is promoted to full production.
Architecture
The evaluation architecture for production LLM systems has four components that form a cycle. Benchmark evaluation provides fast, reproducible quality gates during development. LLM-as-judge scoring provides continuous quality measurement at production traffic scale. Production mirroring provides ground truth about whether a candidate model performs better for real users. The training flywheel converts production failures into training examples, closing the gap each cycle. Each component is necessary; none is sufficient alone.
Diagram: The Production LLM Evaluation Flywheel (interactive version available on TechLogStack)
Diagram: LLM Judge Architecture: From Random Agreement to Near-Human (interactive version available on TechLogStack)
THE MERCHANT SIMULATOR AS PRE-DEPLOYMENT SAFETY NET
The merchant simulator sits between benchmark evaluation and production mirroring — it's a synthetic production environment. It replays real merchant intents (extracted from production conversations) against candidate systems in a controlled environment, before any real merchant sees the new system. This catches the specific failure mode that benchmarks miss: correct behavior on engineer-anticipated test cases, incorrect behavior on the realistic distribution of merchant intent in production. The simulator doesn't replace production mirroring — it prevents the worst regressions from reaching the production mirroring stage at all.
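A deliberately simplified sketch of the replay idea. In the system described, an LLM plays the merchant over multiple turns; here each extracted goal is reduced to a description plus a check function. `MerchantGoal`, `replay_goals`, and the toy candidate are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MerchantGoal:
    goal_id: str
    description: str                     # the intent distilled from a real conversation
    is_satisfied: Callable[[str], bool]  # checks a candidate's output against the goal


def replay_goals(goals: list[MerchantGoal], candidate: Callable[[str], str]) -> list[str]:
    """Replay each goal against the candidate and return the ids of goals it fails."""
    failures = []
    for goal in goals:
        output = candidate(goal.description)
        if not goal.is_satisfied(output):
            failures.append(goal.goal_id)
    return failures


# Toy goals (hypothetical).
goals = [
    MerchantGoal("g1", "tag orders over $200 as VIP", lambda out: "VIP" in out),
    MerchantGoal("g2", "email me when stock is low", lambda out: "send_email" in out),
]


def toy_candidate(description: str) -> str:
    # Stand-in for the system under test: it handles tagging but not notifications.
    return "add_tag(VIP)" if "tag" in description else "noop()"


print(replay_goals(goals, toy_candidate))
```

A non-empty failure list blocks the change before it reaches mirrored production traffic, which is the gatekeeping role the simulator plays.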
⚠️
Eval Budget vs Training Budget: The Cost Trap
ZenML's analysis of 1,200 production deployments found that teams frequently discover that running comprehensive evaluations on every commit burns through inference budget faster than production traffic does. Running a full eval suite (an LLM judge on 1,000 examples, multiplied by several iterations per PR) can cost more per day than serving users. The solution is a tiered eval strategy: fast, cheap unit evals on every commit; comprehensive judge-scored evals on every merge; full production mirroring only for release candidates. Eval should be sized to the stakes of what's being changed, not run at maximum coverage on every code change.
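The tiering can be expressed as a small lookup from change stage to eval suites. The suite names and the assumption that higher tiers also run the cheaper suites are ours, not from the source.

```python
from enum import Enum


class ChangeStage(Enum):
    COMMIT = "commit"
    MERGE = "merge"
    RELEASE_CANDIDATE = "release_candidate"


# Assumption: each tier includes the cheaper suites below it.
EVAL_TIERS: dict[ChangeStage, list[str]] = {
    ChangeStage.COMMIT: ["unit_evals"],
    ChangeStage.MERGE: ["unit_evals", "judge_scored_evals"],
    ChangeStage.RELEASE_CANDIDATE: ["unit_evals", "judge_scored_evals", "production_mirroring"],
}


def evals_for(stage: ChangeStage) -> list[str]:
    """Return the eval suites to run for a change at the given stage."""
    return list(EVAL_TIERS[stage])


print(evals_for(ChangeStage.MERGE))
```

Keeping the mapping in one place makes the cost policy explicit and reviewable, rather than scattered across CI configuration.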
🧪
The Multi-LLM Annotation Pattern
For high-stakes quality assessments (like Shopify's Global Catalogue product taxonomy), a single LLM judge has too much variance. The production pattern is to run multiple LLMs independently on the same evaluation task, then use an arbitration system (a specialized model) to resolve disagreements. This ensemble approach dramatically reduces false positives in quality assessment: a response that confuses one model but is rated correctly by three others is probably correct. The arbitration model applies structured ruling logic for edge cases that simple voting would misclassify. The pattern adds cost but reduces the error rate of the quality signal for critical decisions.
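A minimal sketch of the vote-then-arbitrate flow. The 0.75 agreement threshold, the function name, and the toy judges are assumptions for illustration; the source does not specify how the arbitration model is invoked.

```python
from collections import Counter
from typing import Callable


def ensemble_verdict(item: str,
                     judges: list[Callable[[str], str]],
                     arbiter: Callable[[str, list[str]], str],
                     min_agreement: float = 0.75) -> str:
    """Accept the majority label when enough judges agree; otherwise ask the arbiter."""
    votes = [judge(item) for judge in judges]
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return arbiter(item, votes)  # the arbiter sees the item and the conflicting votes


# Toy judges (hypothetical): three agree, one is confused.
judges = [lambda x: "correct", lambda x: "correct", lambda x: "correct", lambda x: "incorrect"]
split_judges = [lambda x: "correct", lambda x: "incorrect"] * 2


def toy_arbiter(item: str, votes: list[str]) -> str:
    return "needs_review"  # stand-in for a specialized arbitration model


print(ensemble_verdict("product-123", judges, toy_arbiter))
print(ensemble_verdict("product-456", split_judges, toy_arbiter))
```

The arbiter is only called on contested items, so the extra cost of the specialized model is paid where the ensemble is least certain.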
Lessons
The 80% quality curve is the defining challenge of production AI engineering. The teams that accept it and build systematic measurement infrastructure navigate it successfully. The teams that are surprised by it and try to push past it with more prompting and model upgrades are still on it.
- 01. You will spend more time building evaluation infrastructure than the application logic itself. This is not inefficiency — it is the correct allocation of engineering effort for probabilistic systems. Accept it before starting. Budget for it explicitly. The teams shipping reliable AI products have evaluation as a first-class engineering investment, not an afterthought.
- 02. LLM-as-judge is the scalable evaluation pattern: a language model evaluates the outputs of another language model and, once calibrated against human labels, produces quality scores at scale without manual human evaluation of every production interaction. But an uncalibrated judge (Cohen's Kappa 0.02) is worse than useless because it gives false confidence. Calibrate your judge against human labels before trusting its verdicts. Target Kappa ≥ 0.6.
- 03. A benchmark that passes is a necessary condition, not a sufficient one. Benchmarks reflect what engineers anticipated; production reflects what users actually do. Always follow benchmark success with production mirroring — routing real traffic through both current and candidate systems and comparing outputs. The two weeks Shopify needed to close the benchmark-to-production gap is the standard cost of this final validation step.
- 04. Synthetic data generation from real production outputs is the path to scalable training data for domain-specific fine-tuning. Here that means using an LLM to create training examples from a production data source, such as generating natural-language merchant requests from real production workflows. Manual annotation doesn't scale. Synthetic data derived from production workflows does, and it's grounded in the real-world distribution rather than an engineer-imagined one.
- 05. The retraining cycle speed determines how fast you can respond to production drift. Merchant behavior changes, new workflow patterns emerge, new merchant categories join Shopify — and a model trained on last quarter's data will drift from current reality. Weekly retraining on production signal, made economically viable by efficient infrastructure (Tangle's intelligent caching, H200 GPUs, 12h run), keeps the model in alignment with the world it serves.
✅
The Universal Pattern Across 1,200 Deployments
ZenML's analysis of 1,200 production LLM deployments confirms Shopify's findings are not unique: the organizations extracting real value from AI are not the ones with the most innovative demos — they are the ones doing the less glamorous engineering work: building evaluation pipelines, implementing guardrails, designing for uncertainty, and treating their LLM systems with the same rigor they'd apply to any critical infrastructure. The pattern is consistent across startups, mid-market, and enterprise. Model quality is table stakes. Evaluation infrastructure is competitive differentiation.
EVALUATION INFRASTRUCTURE IS PRODUCT
The merchant simulator, the calibrated LLM judge, the production mirroring pipeline, the golden dataset maintenance process — these are not internal tooling that engineers built for themselves. They are the product quality infrastructure that Shopify's merchants depend on, even though they will never see it. Every improvement to the evaluation system is an improvement to Sidekick's and Flow's reliability. Building evaluation infrastructure is building the product. Teams that separate 'evaluation tooling' from 'product work' are misclassifying one of their highest-value investments.
Shopify's engineers discovered that getting an AI to produce a correct Shopify Flow automation 80% of the time takes a few weeks, and getting it to 95% takes the rest of the year. That is either a profound insight about probabilistic systems or a profound insight about how hard it is to write good evals for commerce automation, and it turns out to be both.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.