This article originally appeared in The Forward Pass, a weekly newsletter for ML engineers who ship.
The Synthetic Data Trap: When It Helps, When It Lies
When it helps, when it lies, and what to watch for in your eval set.
by Maxim Enis · 5 min read
Synthetic data has quietly become the default answer to every data problem. Not enough training examples? Generate them. Edge cases underrepresented? Synthesize them. Eval set feels thin? Ask the model to write more questions. I have seen this pattern across dozens of teams in the last year, and the results are uneven in a very specific way: synthetic data works well when you use it for training augmentation, and it corrupts your evals in ways that are invisible until you ship something embarrassingly wrong.
When synthetic data earns its keep. Low-resource tasks are the clearest win. If you are fine-tuning a model on a task where labeled examples genuinely do not exist (niche regulatory summarization, proprietary code patterns, a narrow support domain), synthetic data can get you from 50 examples to 5,000 without asking your annotators to work a second job. The key is that you are using synthesis for training, not for measurement. Edge-case augmentation also works well: take your real distribution, identify the tail that causes failures, generate adversarial variants of those cases, and add them back to your training set. Your eval stays human-sourced and real; only the fine-tuning diet gets synthetic. Controlled distribution shifts are the third legitimate use: if you need to probe how your model behaves on a slightly different dialect or domain than your training data, carefully generated synthetic prompts can stand in for real data you do not have yet. Treat that as a diagnostic probe kept apart from your headline eval, not a replacement for it. The pattern across all three: synthetic data as a targeted lever, mostly on the training side, with a clean separation from the eval you report.
Where it quietly corrupts your evals. Distribution leakage is the most common failure mode and the hardest to notice. The pattern: you generate 10,000 synthetic Q&A pairs using GPT-4, use 9,000 to fine-tune your model, and hold out 1,000 as your eval set. The eval number looks great. Of course it does — the model has seen examples generated by the same process as the eval. The held-out 1,000 came from the same generative distribution as the 9,000 you trained on, so a good train/test split in that space does not give you independence from the source. You have measured how well your model approximates the original generator, not how well it handles real-world inputs.
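One way to make this failure visible is to record where every example came from and check whether the eval set shares a generator with the training set. The snippet below is a minimal sketch under assumptions: the `source` field, the `shared_generators` helper, and the toy records are all hypothetical, and the 9,000 / 1,000 split is scaled down to 9 / 1 for readability.

```python
from collections import Counter


def shared_generators(train, eval_set, source_key="source"):
    """Count eval examples whose source also produced training examples.

    `source_key` is a hypothetical provenance field; use whatever your
    pipeline actually records (generator name, prompt template ID, etc.).
    """
    train_sources = {ex[source_key] for ex in train}
    counts = Counter(
        ex[source_key] for ex in eval_set if ex[source_key] in train_sources
    )
    return dict(counts)


# Toy data: nine synthetic training examples, one held-out synthetic
# eval example from the same generator, and one human-sourced eval example.
train = [{"text": f"synthetic question {i}", "source": "llm_generator"} for i in range(9)]
eval_set = [
    {"text": "held-out synthetic question", "source": "llm_generator"},
    {"text": "real user ticket", "source": "human"},
]

leaky = shared_generators(train, eval_set)
print(leaky)  # sources that appear on both sides of the split
```

A non-empty result does not prove the eval is useless, but it does mean the split alone is not giving you independence from the generator.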
Model-generated ground truth is the subtler version of this. Teams use a strong model, often GPT-4 or Claude, to produce "gold" labels for their eval set, then evaluate whether their fine-tuned model agrees with those labels. This is circular validation. You are checking whether your student can reproduce its teacher's outputs, not whether either of them is right. If the teacher has a systematic bias (it favors verbose answers, consistently hedges in a particular way, or has a blind spot for a specific reasoning pattern), your eval inherits that bias invisibly. The eval score does not measure correctness; it measures alignment to a distribution you defined by fiat.
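The gap between agreement and correctness is easy to see with a few made-up labels. This is an illustrative sketch only: the five labels below are invented for the example, not results from any real model, and `match_rate` is a hypothetical helper.

```python
def match_rate(preds, refs):
    """Fraction of positions where two label lists agree."""
    if len(preds) != len(refs):
        raise ValueError("label lists must have the same length")
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


# Invented labels for five eval items.
human = ["A", "B", "A", "C", "B"]    # human-verified answers
teacher = ["A", "B", "B", "C", "C"]  # teacher is wrong on items 3 and 5
student = ["A", "B", "B", "C", "C"]  # student copies the teacher, errors included

agreement_with_teacher = match_rate(student, teacher)
student_accuracy = match_rate(student, human)
teacher_accuracy = match_rate(teacher, human)

print(agreement_with_teacher, student_accuracy, teacher_accuracy)
```

In this toy case the student scores perfectly against teacher labels while being right on only three of five items, which is the number you would never see if the teacher's labels were your only ground truth.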
Practical heuristics for auditing your own eval set. First, the provenance test: for every example in your eval, ask whether a human ever verified that its answer is correct. If the honest answer is "the model said so," that example is measuring model agreement, not ground truth. Second, the overlap test: run a nearest-neighbor search between your eval set and your training set. If you used the same generator for both, you will find high semantic similarity even across the formal split. Third, the disagreement test: take 50 examples from your eval, bring in a human annotator who has no context on your model, and ask them to produce the correct answer cold. Compare their answers to your eval labels. If they disagree on more than 15-20% of examples, your eval has a label-quality problem that will obscure whatever you are trying to measure. Fourth, the invariance test: perturb a sample of your eval prompts by reordering the question, rephrasing with synonyms, or adding an irrelevant preamble. If your model's performance swings dramatically on semantically equivalent inputs, you have found brittleness that your current eval is papering over.
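The overlap and disagreement tests are the easiest two to script. What follows is a rough sketch, not a recommended implementation: token-level Jaccard similarity stands in for a real semantic nearest-neighbor search (in practice you would likely use embeddings), the 0.7 similarity cutoff is an arbitrary assumption, the 0.15 disagreement cutoff is simply the lower end of the 15-20% range above, and all texts and labels are invented.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set (crude lexical proxy)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def nearest_train_similarity(eval_texts, train_texts):
    """For each eval text, the highest similarity to any training text."""
    return [max(jaccard(e, t) for t in train_texts) for e in eval_texts]


def disagreement_rate(eval_labels, human_labels):
    """Fraction of examples where the cold human answer differs from the eval label."""
    pairs = list(zip(eval_labels, human_labels))
    return sum(a != b for a, b in pairs) / len(pairs)


# Overlap test on invented examples.
train_texts = [
    "what is the refund window for annual plans",
    "how do i reset my password",
]
eval_texts = [
    "what is the refund window for monthly plans",
    "can i export invoices as csv",
]
SIMILARITY_CUTOFF = 0.7  # assumption: tune on your own data
sims = nearest_train_similarity(eval_texts, train_texts)
flagged = [t for t, s in zip(eval_texts, sims) if s >= SIMILARITY_CUTOFF]

# Disagreement test on invented labels (five items instead of 50).
eval_labels = ["yes", "no", "yes", "no", "yes"]
human_labels = ["yes", "no", "no", "no", "yes"]
DISAGREEMENT_CUTOFF = 0.15  # lower end of the 15-20% range in the text
rate = disagreement_rate(eval_labels, human_labels)
needs_label_review = rate > DISAGREEMENT_CUTOFF

print(flagged, rate, needs_label_review)
```

Lexical overlap will miss paraphrases that an embedding search would catch, so treat a clean result here as weak evidence at best.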
The meta-point I keep coming back to: synthetic data is a power tool, and power tools hurt people who do not respect them. The teams doing it well maintain a strict firewall — human-verified examples for eval, synthetic augmentation only on the training side, and explicit audit passes before any eval set is called production-ready. The teams getting burned are the ones who reach for synthesis to solve both sides of the data problem at once. The shortcut is real; so is the cost.
Next issue: how to build a continuous eval pipeline that catches model regressions before your users do — without spending your entire inference budget on it.