The Synthetic Data Trap: When It Helps, When It Lies

The Forward Pass — Wed, 20 May 2026 09:44:50 +0000

This article originally appeared in The Forward Pass, a weekly newsletter for ML engineers who ship. Get a free issue every month →

When it helps, when it lies, and what to watch for in your eval set.

by Maxim Enis · 5 min read

Synthetic data has quietly become the default answer to every data problem. Not enough training examples? Generate them. Edge cases underrepresented? Synthesize them. Eval set feels thin? Ask the model to write more questions. I have seen this pattern across dozens of teams in the last year, and the results are uneven in a very specific way: synthetic data works well when you use it for training augmentation, and it corrupts your evals in ways that are invisible until you ship something embarrassingly wrong.

When synthetic data genuinely earns its keep. Low-resource tasks are the clearest win. If you are fine-tuning a model on a task where labeled examples do not exist (niche regulatory summarization, proprietary code patterns, a narrow support domain), synthetic data can get you from 50 examples to 5,000 without asking your annotators to work a second job. The key is that you are using synthesis for training, not for measurement. Edge-case augmentation also works well: take your real distribution, identify the tail that causes failures, generate adversarial variants of those cases, and add them back to your training set. Your eval stays human-sourced and real; only the fine-tuning diet gets synthetic. Controlled distribution shifts are the third legitimate use: if you need to probe how your model behaves on a slightly different dialect or domain than your training data, carefully generated synthetic prompts can stand in for real data you do not have yet, as long as you treat the result as a stress test rather than a headline eval number. The pattern across all three: synthetic data as a lever on the training side, with a clean separation from the eval you report.

Synthetic data works when you use it for training augmentation. It corrupts evals when you use it to measure the same model that generated it. The line matters more than most teams think.

Where it quietly corrupts your evals. Distribution leakage is the most common failure mode and the hardest to notice. The pattern: you generate 10,000 synthetic Q&A pairs using GPT-4, use 9,000 to fine-tune your model, and hold out 1,000 as your eval set. The eval number looks great. Of course it does — the model has seen examples generated by the same process as the eval. The held-out 1,000 came from the same generative distribution as the 9,000 you trained on, so a good train/test split in that space does not give you independence from the source. You have measured how well your model approximates the original generator, not how well it handles real-world inputs.
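
One way to avoid that trap is to split on provenance instead of at random. The sketch below assumes every record carries a `source` tag; the `Example` class, the tag values, and `provenance_split` are hypothetical names for illustration, not part of any library.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    prompt: str
    answer: str
    source: str  # assumed provenance tag: "human" or "synthetic"


def provenance_split(examples, eval_size, seed=0):
    """Draw the eval set only from human-sourced examples.

    Everything not selected for eval (synthetic or human) goes to training.
    """
    human = [ex for ex in examples if ex.source == "human"]
    if len(human) < eval_size:
        raise ValueError("not enough human-sourced examples for this eval size")
    rng = random.Random(seed)
    eval_set = rng.sample(human, eval_size)
    held_out = set(eval_set)
    train_set = [ex for ex in examples if ex not in held_out]
    return train_set, eval_set


if __name__ == "__main__":
    data = [Example(f"q{i}", f"a{i}", "synthetic") for i in range(90)]
    data += [Example(f"hq{i}", f"ha{i}", "human") for i in range(10)]
    train, evals = provenance_split(data, eval_size=5)
    print(len(train), "train examples,", len(evals), "eval examples")
```

The split is only as good as the tags. If synthetic records get labeled as human anywhere upstream, the eval is contaminated again and nothing in this function will notice.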

Model-generated ground truth is the subtler version of this. Teams use a strong model, often GPT-4 or Claude, to produce "gold" labels for their eval set, then evaluate whether their fine-tuned model agrees with those labels. This is circular validation. You are checking whether your student can reproduce its teacher's outputs, not whether either of them is right. If the teacher has a systematic bias (it favors verbose answers, consistently hedges in a particular way, or has a blind spot for a specific reasoning pattern), your eval inherits that bias invisibly. The eval score does not measure correctness; it measures agreement with a distribution you defined by fiat.

Model-generated ground truth is circular validation. You are checking whether your student matches the teacher's outputs, not whether either of them is actually right.

Practical heuristics for auditing your own eval set. First, the provenance test: for every example in your eval, ask whether a human ever verified that the answer is right. If the honest answer is "the model said so," that example is measuring model agreement, not ground truth. Second, the overlap test: run a nearest-neighbor search between your eval set and your training set. If you used the same generator for both, you will find high semantic similarity even across the formal split. Third, the disagreement test: take 50 examples from your eval, bring in a human annotator who has no context on your model, and ask them to produce the correct answer cold. Compare their answers to your eval labels. If they disagree on more than 15-20% of examples, your eval has a label-quality problem that will obscure whatever you are trying to measure. Fourth, the invariance test: perturb a sample of your eval prompts by reordering the question, rephrasing with synonyms, or adding an irrelevant preamble. If your model's performance swings dramatically on semantically equivalent inputs, you have found brittleness that your current eval is papering over.
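
The overlap and disagreement tests are the two that reduce most directly to code. Here is a minimal sketch under loud assumptions: token-set Jaccard similarity is a crude stand-in for the semantic nearest-neighbor search described above (a real audit would more likely use embeddings), labels are compared as case-insensitive strings, and all function names are hypothetical.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap in [0, 1]; a rough proxy for semantic similarity."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def nearest_train_similarity(eval_prompts, train_prompts):
    """Overlap test: for each eval prompt, the highest similarity to any training prompt."""
    return [
        max((jaccard(e, t) for t in train_prompts), default=0.0)
        for e in eval_prompts
    ]


def disagreement_rate(eval_labels, human_labels):
    """Disagreement test: fraction of examples where the cold human answer differs."""
    if len(eval_labels) != len(human_labels) or not eval_labels:
        raise ValueError("label lists must be non-empty and the same length")
    mismatches = sum(
        1
        for e, h in zip(eval_labels, human_labels)
        if e.strip().lower() != h.strip().lower()
    )
    return mismatches / len(eval_labels)


if __name__ == "__main__":
    sims = nearest_train_similarity(
        ["How do I reset my password?"],
        ["How can I reset my password?", "What is your refund policy?"],
    )
    print("max similarity per eval prompt:", sims)
    rate = disagreement_rate(["yes", "no", "yes", "no"], ["Yes", "no", "no", "no"])
    print("disagreement rate:", rate)
```

Exact string matching will overstate disagreement on free-form answers, so for anything beyond short labels the comparison step probably needs a human or a rubric.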

The meta-point I keep coming back to: synthetic data is a power tool, and power tools hurt people who do not respect them. The teams doing it well maintain a strict firewall — human-verified examples for eval, synthetic augmentation only on the training side, and explicit audit passes before any eval set is called production-ready. The teams getting burned are the ones who reach for synthesis to solve both sides of the data problem at once. The shortcut is real; so is the cost.

Maintain a strict firewall: human-verified examples for eval, synthetic augmentation only for training. The teams getting burned are the ones using synthesis to solve both problems at once.

Next issue: how to build a continuous eval pipeline that catches model regressions before your users do — without spending your entire inference budget on it.

Enjoyed this? The Forward Pass publishes practitioner-grade AI/ML signal every week. Free tier available.

Why Your LLM Evals Are Lying to You

The Forward Pass — Wed, 20 May 2026 09:44:49 +0000

Three failure modes that make most LLM benchmarks decoration, not science.

by Maxim Enis · 4 min read

You ran your model on MMLU. The score went up. You shipped. A week later, the support tickets are different in shape but identical in volume. What happened?

LLM evaluation is the most quietly broken part of the stack right now. The benchmarks have not kept up with what production models can do, and the evals teams build internally rarely correlate with anything users actually care about. Three failure modes to watch.

Contamination is everywhere. The base model has seen MMLU. It has seen GSM8K. It has seen HumanEval. The 2024-era leak audits showed material training-set contamination in every public eval more than 18 months old. An eval that was a good signal before is now at least partly a memorization test. The fix is not to write a new MMLU; it is to write evals that are dynamic. Generate questions at eval time from a templated grammar, or freshly translate your private eval set into a synthetic distribution the model cannot have seen. Static evals have a half-life now.
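
To make "generate questions at eval time from a templated grammar" concrete, here is a toy sketch. The arithmetic templates, `generate_eval`, and `score` are assumptions for illustration; the grammar for a real product would be domain-specific. The part that matters is that each answer is computed alongside its prompt, so grading is programmatic and a new seed yields a fresh set.

```python
import random

# Each template pairs a prompt pattern with a function that computes the answer.
TEMPLATES = [
    ("What is {a} plus {b}?", lambda a, b: a + b),
    ("What is {a} minus {b}?", lambda a, b: a - b),
    ("What is {a} times {b}?", lambda a, b: a * b),
]


def generate_eval(n, seed):
    """Build n prompt/answer pairs; the same seed reproduces the same set."""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        template, solve = rng.choice(TEMPLATES)
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        items.append(
            {"prompt": template.format(a=a, b=b), "answer": str(solve(a, b))}
        )
    return items


def score(items, predictions):
    """Exact-match accuracy of predictions against the computed answers."""
    if len(items) != len(predictions) or not items:
        raise ValueError("need one prediction per item")
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred.strip() == item["answer"]
    )
    return correct / len(items)


if __name__ == "__main__":
    items = generate_eval(5, seed=2026)
    for item in items:
        print(item["prompt"], "->", item["answer"])
```

Logging the seed with each run keeps a dynamic eval reproducible: you can regenerate the exact set that produced a given score, while the next run still uses questions nobody has published.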

Single-number scores are lossy. Aggregate accuracy hides every interesting failure mode. A model that improves from 68% to 72% on a benchmark might be regressing on the 5% of prompts your users actually send. You need a stratified eval: break the score down by prompt difficulty, prompt length, language, domain, and whatever else differs in your traffic. Most teams discover their "improved" model is worse on the long tail when they finally instrument this. Instrument it before the launch, not after the rollback.
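
A stratified breakdown is a small amount of code. This sketch assumes each eval result is a dict with a boolean `correct` field plus whatever metadata you want to slice on; the field names and both helper functions are hypothetical.

```python
from collections import defaultdict


def stratified_accuracy(results, key):
    """Accuracy per bucket, where the bucket is the value of results[i][key]."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in results:
        bucket = row[key]
        totals[bucket] += 1
        hits[bucket] += int(row["correct"])
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def regressed_buckets(baseline, candidate, tolerance=0.0):
    """Buckets where the candidate scores below the baseline by more than tolerance."""
    return sorted(
        bucket
        for bucket in baseline
        if bucket in candidate and candidate[bucket] < baseline[bucket] - tolerance
    )


if __name__ == "__main__":
    old = [{"domain": "billing", "correct": True}] * 6
    old += [{"domain": "legal", "correct": True}] * 3
    old += [{"domain": "legal", "correct": False}] * 1
    new = [{"domain": "billing", "correct": True}] * 6
    new += [{"domain": "legal", "correct": True}] * 2
    new += [{"domain": "legal", "correct": False}] * 2
    before = stratified_accuracy(old, "domain")
    after = stratified_accuracy(new, "domain")
    print("regressed:", regressed_buckets(before, after))
```

Small buckets are noisy, so a nonzero `tolerance` (or a minimum bucket size, which this sketch does not enforce) is probably worth adding before you gate a launch on the output.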

LLM-as-a-judge is correlated noise. Using GPT-4 to grade your model is convenient and seductive and wrong about 15% of the time, with biases that are stable per-grader. A judge rates responses from its own model family more favorably. Verbose responses win on style. Order of presentation matters by 5-8 points. If you are going to use a judge, randomize the grader across at least three model families, randomize position, and calibrate against human-graded held-out samples. Otherwise you are measuring grader preference and calling it model quality.
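
Here is a minimal sketch of the position-randomization and calibration steps. The judges are plain Python callables standing in for wrappers around different model families; their signature (prompt, first response, second response, returning "first" or "second") is an assumption for illustration and not any vendor's API.

```python
import random


def judge_pair(judge, prompt, response_a, response_b, rng):
    """Query one judge with randomized presentation order.

    Returns "a" or "b" in the caller's original labeling, whatever order
    the judge actually saw.
    """
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    verdict = judge(prompt, first, second)  # assumed to return "first" or "second"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    if swapped:
        return "b" if picked_first else "a"
    return "a" if picked_first else "b"


def majority_vote(judges, prompt, response_a, response_b, rng):
    """Combine several judges (ideally an odd number, from different families)."""
    votes = [judge_pair(j, prompt, response_a, response_b, rng) for j in judges]
    return max(("a", "b"), key=votes.count)


def agreement_with_humans(judge_verdicts, human_verdicts):
    """Calibration: fraction of held-out items where the judge matches the human grade."""
    if len(judge_verdicts) != len(human_verdicts) or not judge_verdicts:
        raise ValueError("verdict lists must be non-empty and the same length")
    matches = sum(1 for j, h in zip(judge_verdicts, human_verdicts) if j == h)
    return matches / len(judge_verdicts)


if __name__ == "__main__":
    # Toy judge with pure position bias: it always prefers whatever it sees first.
    def always_first(prompt, first, second):
        return "first"

    rng = random.Random(0)
    verdicts = [judge_pair(always_first, "q", "x", "y", rng) for _ in range(1000)]
    print("share of 'a' verdicts:", verdicts.count("a") / len(verdicts))
```

Randomizing position does not remove a judge's position bias; it turns it into noise that averages out across many comparisons. That is why the calibration step against human grades still matters.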

The eval setup we run internally: a private, weekly-rotated test set of 500 prompts drawn from anonymized user traffic, hand-graded against a rubric, plus an automated regression suite of 5,000 templated prompts where the correct answer is checkable programmatically. The first catches qualitative drift; the second catches catastrophic regressions. We stopped trusting any benchmark we did not write within the last three months.

The deeper point: if you cannot articulate the specific user behavior your eval is supposed to predict, your eval is decoration. Evals are not science fairs. They are production decisions in disguise.

Next week: how to build a templated regression suite that actually runs in CI without bankrupting your inference budget.
