Maya Andersson

Posted on Jun 29 • Originally published at Medium

We added synthetic data to our eval set. The pass rate rose, and so did our production incidents.

#ai #machinelearning #llm #mlops

We needed a bigger eval set, so we generated one. A model wrote a few thousand test cases that looked like our traffic, we scored against them, the pass rate went up, and we felt good. Then production incidents went up too, on exactly the inputs the synthetic set said we handled. The test set had grown and its predictive value had dropped, at the same time.

That is the trap with synthetic eval data, and it is not a tooling problem. Generating cases is easy now. Every framework will hand you a thousand. The hard part, the part none of the generators do for you, is proving the synthetic set behaves like the traffic you actually get. A test set that does not match your distribution is not a smaller version of production. It is a different test, and it can pass while production fails.

So when I compare the tools that generate eval data, I do not grade them on how many cases they spit out, or how clean the prompts are. I grade them on one question: how much do they help me check that the generated set looks like reality before I trust a number it produces?

The criterion, stated precisely

A synthetic eval set is trustworthy when two things hold. First, coverage: the cases span the same kinds of inputs your real traffic contains, in roughly the same proportions, including the messy and rare ones. Second, difficulty calibration: the synthetic cases are about as hard as real cases, so the pass rate on synthetic data tracks the pass rate on real data.

Both are measurable, and neither is measured by default. Coverage you check by embedding real and synthetic inputs and comparing the distributions, or by labeling both with the same taxonomy and comparing the histograms. Calibration you check by holding out a labeled slice of real data and confirming the model's pass rate on it lands near its pass rate on the synthetic set. If those two numbers diverge, the synthetic set is lying to you, and no amount of volume fixes it.
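
A minimal sketch of the taxonomy version of that coverage check, assuming you have already labeled a real sample and the synthetic set with the same category names. The labels, the function names, and the choice of total variation distance as the comparison are illustrative assumptions, not something any of the tools below provide.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the sample as a fraction of the total."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(real_labels, synthetic_labels):
    """Half the summed absolute difference between the two label histograms.

    0.0 means identical proportions; 1.0 means the sets share no category.
    """
    real = label_shares(real_labels)
    synth = label_shares(synthetic_labels)
    categories = set(real) | set(synth)
    return 0.5 * sum(abs(real.get(c, 0.0) - synth.get(c, 0.0)) for c in categories)


def missing_categories(real_labels, synthetic_labels):
    """Categories that appear in real traffic but never in the synthetic set."""
    return sorted(set(real_labels) - set(synthetic_labels))


# Hypothetical labels, for illustration only.
real = ["billing"] * 50 + ["refund"] * 30 + ["typo_heavy"] * 20
synthetic = ["billing"] * 70 + ["refund"] * 30

print(total_variation_distance(real, synthetic))
print(missing_categories(real, synthetic))  # ['typo_heavy']
```

The distance gives one number to watch between regenerations, and the missing-category list points at which generation prompts to fix. What counts as "close enough" is a judgment call for your own traffic.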

That is the lens for everything below.

The generators, by how much they help you validate

DeepEval (Synthesizer). Strong, controllable generation: it builds test cases from documents or from scratch, with knobs for evolution and complexity. The generation is good. What it does not hand you is the distribution-match check against your real traffic. You generate, then you validate the realism yourself. Worth reading alongside the synthetic-data-for-evaluation literature, for example the Self-Instruct work (Wang et al., arXiv:2212.10560), which is honest that generated instructions drift in diversity and difficulty unless you correct for it.

Promptfoo. Dataset and test-case generation wired into a CI-first tool, so the generated cases drop straight into a gate. Convenient for getting volume into a pipeline fast. The realism question is still yours: it will generate and run, but it does not compare the generated set's distribution to production for you.

Giskard. Comes at it from the risk angle, generating adversarial and edge cases to surface failures rather than to mirror average traffic. That is a different and useful goal, finding what breaks, but do not confuse a stress set with a representative set. An eval set built only from Giskard-style probes will over-represent the hard tail, which is great for hardening and misleading for estimating real-world pass rate.

Ragas. For RAG specifically, it generates question-answer test sets from your documents, including multi-hop questions. Good fit if your system is retrieval-shaped. The generated questions still need the same coverage check: documents you own are not the same distribution as questions users actually ask.

Future AGI. The thing it does differently is integration, not the generator itself. It is an end-to-end open-source platform, and synthetic data generation lives inside the same Datasets and evaluation surface that runs your evals and holds your traces. The generated set, the eval that scores it, and the production traces you would validate it against are in one place rather than three. The repo is github.com/future-agi/future-agi. Be clear on what that does and does not buy you. It does not auto-prove your synthetic set matches production any more than the others do; that check is still methodology you run. What it removes is the stitching, because comparing synthetic-set behavior to real-trace behavior is a lot easier when both already live in the same system than when you are exporting CSVs between a generator, an eval library, and a tracing tool. On raw generation controllability, DeepEval's Synthesizer is at least as configurable.

The honest summary across all five: every one of them generates, and not one of them validates realism as the default first step. The validation is the work, and it is on you regardless of which generator you pick.

The procedure I actually run

Tool aside, this is the sequence, and steps 1 and 4 are the ones teams skip.

1. Pull a real sample. A few hundred genuine production inputs, with their outcomes if you have them.
2. Generate the synthetic set with whichever tool fits your shape.
3. Embed both real and synthetic inputs, compare the distributions. If the synthetic set clusters somewhere your real traffic does not, or misses a cluster real traffic has, fix the generation prompts and regenerate.
4. Hold out a labeled real slice. Score the model on it and on the synthetic set. If the two pass rates differ by more than a few points, the synthetic set is miscalibrated and its pass rate is not a proxy for anything. Do not trust it until they converge.
5. Only then use the synthetic set for volume, and keep the real slice as the anchor you re-check against.
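
Step 3 can be sketched in a few lines, assuming you already have embedding vectors for both sets from whatever embedding model you use. The function names, the toy 2-D vectors, and the 0.8 similarity threshold are hypothetical; a nearest-neighbor check is one simple way to compare the two sets, not the only one.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def uncovered(source, target, threshold):
    """Indices of source vectors whose best match in target is below threshold."""
    gaps = []
    for i, vector in enumerate(source):
        best = max(cosine_similarity(vector, other) for other in target)
        if best < threshold:
            gaps.append(i)
    return gaps


# Toy 2-D "embeddings", for illustration only.
real_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
synthetic_vectors = [[1.0, 0.05], [0.95, 0.1]]

# Real inputs that no synthetic case resembles: a cluster the generator missed.
print(uncovered(real_vectors, synthetic_vectors, threshold=0.8))  # [2]

# Synthetic cases that resemble no real input: a cluster that should not exist.
print(uncovered(synthetic_vectors, real_vectors, threshold=0.8))  # []
```

Running the check in both directions matters, because the two failure modes in step 3 are different: a real cluster the synthetic set misses, and a synthetic cluster real traffic never produces.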

The generator changes how pleasant steps 2 and 3 are. It does not change whether you have to do 1, 4, and 5.

FAQ

Why not just use real data and skip synthetic entirely?
Because real data is often scarce, imbalanced, or sensitive, and you cannot get enough of the rare cases that matter. Synthetic data is a reasonable way to fill those gaps. The point is not to avoid it. The point is to validate it before you trust a number it produces.

How much real data do I need to validate the synthetic set?
Enough to estimate a distribution and a pass rate with a usable confidence interval, which is usually a few hundred examples, not tens of thousands. The validation slice is smaller than the synthetic set it is checking.
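
To see what "usable" means at a given sample size, here is a small sketch that computes a Wilson score interval for a pass rate. The counts are made up, and the Wilson interval is my choice for the illustration; any standard binomial interval would make the same point.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    )
    return center - half_width, center + half_width


# Hypothetical slices with the same observed pass rate of about 85%.
for passes, total in [(26, 30), (255, 300), (2550, 3000)]:
    low, high = wilson_interval(passes, total)
    print(f"n={total}: {low:.3f} to {high:.3f}")
```

At 300 examples the interval is roughly four points either side of the estimate, which is tight enough to notice a large gap against the synthetic pass rate. At 30 examples it is too wide to tell you much.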

What is the single most common failure?
Difficulty miscalibration. Generated cases skew easy, because models write clean, unambiguous inputs and real users do not. The pass rate looks great and means nothing. The held-out real slice is what catches this.
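
A rough sketch of that held-out check, assuming you have pass counts for the same model on both sets. The function names, the counts, and the 3-point tolerance are illustrative assumptions; the interval uses a plain normal approximation for the difference of two proportions.

```python
import math


def pass_rate_gap(real_passes, real_total, synth_passes, synth_total, z=1.96):
    """Synthetic pass rate minus real pass rate, with an approximate interval."""
    p_real = real_passes / real_total
    p_synth = synth_passes / synth_total
    gap = p_synth - p_real
    std_err = math.sqrt(
        p_real * (1 - p_real) / real_total
        + p_synth * (1 - p_synth) / synth_total
    )
    return gap, gap - z * std_err, gap + z * std_err


def looks_miscalibrated(real_passes, real_total, synth_passes, synth_total,
                        tolerance=0.03):
    """True when the gap exceeds the tolerance and the interval excludes zero."""
    gap, low, high = pass_rate_gap(real_passes, real_total,
                                   synth_passes, synth_total)
    return abs(gap) > tolerance and (low > 0 or high < 0)


# Hypothetical counts: 78% on the real slice, 94% on the synthetic set.
gap, low, high = pass_rate_gap(234, 300, 2820, 3000)
print(f"gap={gap:.3f}, interval {low:.3f} to {high:.3f}")
print(looks_miscalibrated(234, 300, 2820, 3000))  # True
```

A positive gap means the synthetic set is easier than real traffic, which is the skew described above. The tolerance is whatever "a few points" means for your system.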

Does generating adversarial cases count as a synthetic eval set?
It is a stress set, not a representative one. Use it to harden the system, not to estimate real-world pass rate. Keep the two sets and the two questions separate.

Open question

Distribution-match has a chicken-and-egg problem on genuinely new features, where you have little or no real traffic yet, so there is nothing to validate the synthetic set against. You are forced to trust generated data precisely when you can least check it. I do not have a clean answer here. The best I have is to treat the synthetic pass rate on a brand-new feature as a smoke test rather than a measurement, and to re-validate aggressively the moment real traffic arrives. If you have a principled way to bound how wrong a synthetic set can be before you have any real data to compare against, I would genuinely like to see it.

Top comments (2)

Tae Kim • Jul 1

The distribution-mismatch problem you describe gets especially sharp in RAG evals: when you generate synthetic questions from your own documents, the generator naturally asks things the documents answer clearly, systematically skipping the ambiguous multi-hop queries and edge cases that cause real user failures. Running our news-corpus RAG pipeline, we found that embedding-space coverage checks (your step 3) caught topic cluster gaps but missed difficulty miscalibration entirely — pass rates converged on the distribution but the synthetic questions were consistently easier than user queries on the same topics. The anchor that actually worked was your step 4: holding out a labeled real slice and treating divergence in pass rates as a signal to reject the synthetic set, not a number to average away. For your open question on brand-new features, the closest thing I've found is treating the first week of real traffic as a mandatory calibration window, accepting that the initial synthetic pass rate is a smoke test and scheduling an explicit re-validation once you have even 50-100 real examples.

Alice • Jun 29

This matches something I keep hitting: synthetic eval data inherits the generator's blind spots. If a model writes test cases for a system built on the same kind of model, the cases cluster where the model is already confident — so you over-sample the inputs you already handle and systematically under-generate the ones that break you. The set gets bigger and less representative at the same time, which is exactly the 'pass rate up, incidents up' you saw.

A few things that have helped me:

- Anchor synthetic to real distribution. Keep a held-out set of actual production inputs (especially past incidents) as ground truth, and check whether synthetic pass/fail correlates with real pass/fail. A case that passes synthetic but fails on real traffic means the synthetic set is lying to you. That correlation is the only thing that makes a generated set trustworthy.
- Mine failures, don't generate successes. The highest-value cases come from real incidents and adversarial edges, not from a model imagining 'typical' traffic. Every incident should become a permanent eval case, so the set can only get harder over time, never easier.
- Treat pass rate as a smoke detector, not a target. The moment it becomes the goal, generation just optimizes the proxy (Goodhart). The real signal was the one you ended up trusting: production reality.

Generators raise the ceiling on volume; representativeness is what actually earns the trust, and nothing automatic gives you that yet.