Meta FAIR just published a result that's hard to ignore: a 4B-parameter model, trained on data generated by Autodata, outperformed their own 397B model on PRBench-Legal - without any architectural changes.
The only variable: how the training data was created.
The Problem With Current Synthetic Data Pipelines
Most synthetic data workflows follow the same pattern: prompt a model, collect outputs, filter, done. The problem is that data quality is essentially uncontrolled.
Two failure modes keep showing up:
- Too easy - the model you're trying to train already solves it. No learning signal.
- Too hard - every rollout scores near zero. GRPO has no gradient to work with.
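Both failure modes look the same from the optimizer's side. Here is a minimal sketch, assuming the commonly described GRPO advantage of reward minus group mean, divided by group standard deviation. The function name `grpo_advantages`, the `eps` value, and the use of population standard deviation are illustrative assumptions; real implementations differ in details.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Illustrative only: eps and the population std are assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Too easy: every rollout gets full marks, so every advantage is zero.
too_easy = grpo_advantages([1.0, 1.0, 1.0, 1.0])

# Too hard: every rollout scores zero, so every advantage is zero again.
too_hard = grpo_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed outcomes: successes get positive advantages, failures negative.
useful = grpo_advantages([0.0, 1.0, 0.0, 1.0])
```

When every rollout in a group gets the same reward, all advantages are zero and the policy update carries no signal, whether the question was trivially easy or impossibly hard.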
Autodata reframes the question: instead of generating data and then evaluating it, why not let model behavior itself define what good data looks like?
How the Pipeline Works
Autodata runs an orchestrator agent that coordinates four LLM subagents:
- Challenger - generates questions and rubrics from source material (papers, legal docs, math problems)
- Weak solver - a small model that should struggle if the data is good
- Strong solver - a large model that should succeed, validating that the question is actually answerable
- Judge - scores both solvers against the rubric and sends structured feedback back to the orchestrator
An example is accepted only when three conditions hold at once: the weak solver scores low, the strong solver scores high, and the gap between them is large enough. Otherwise, the orchestrator sends specific feedback to the Challenger, which generates a new question from an entirely different angle rather than rephrasing the old one.
It takes an average of 6.59 iterations to produce a single accepted question.
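The acceptance rule and retry loop described above can be sketched roughly as follows. This is not the paper's code: the threshold values, the `Thresholds` class, and the callables `challenger`, `weak_solver`, `strong_solver`, and `judge` are hypothetical stand-ins for the LLM subagents, and scores are assumed to lie in [0, 1].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; the article does not state the actual cutoffs.
    weak_max: float = 0.3    # weak solver must score at or below this
    strong_min: float = 0.7  # strong solver must score at or above this
    min_gap: float = 0.4     # required strong-minus-weak margin

def accept(weak_score, strong_score, t=Thresholds()):
    """All three conditions must hold at the same time."""
    return (weak_score <= t.weak_max
            and strong_score >= t.strong_min
            and strong_score - weak_score >= t.min_gap)

def generate_example(challenger, weak_solver, strong_solver, judge,
                     source, max_iters=10):
    """Loop until an example passes the gate or the budget runs out."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        # The challenger sees the previous round's feedback, if any.
        question, rubric = challenger(source, feedback)
        weak = judge(weak_solver(question), rubric)
        strong = judge(strong_solver(question), rubric)
        if accept(weak, strong):
            return question, rubric, attempt
        # Structured feedback for the next attempt.
        feedback = {"weak": weak, "strong": strong}
    return None  # no acceptable example within the budget
```

In this sketch, the `attempt` value returned for each accepted example is the quantity that would average to the reported 6.59 iterations.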
The Results
On both PRBench-Legal and the harder PRBench-Legal-Hard subset, graded independently by both GPT-5 and Kimi-K2.6, the 4B model trained on Autodata scored highest in every column of the results - outperforming both the CoT-trained 4B model and the 397B baseline without RL.
The same pattern holds across CS research tasks and scientific reasoning: agentic data leads from the start, and the gap widens through training.
What's Actually Going On Here
The 4B > 397B result isn't the most interesting part. The more important question is why it happened.
On legal tasks, standard CoT Self-Instruct produced questions that were too hard - weak solver scores clustered near zero across almost every rollout. When every rollout fails the same way, GRPO has nothing to learn from. Autodata didn't make the questions easier. It reshaped the reward distribution, pushing the weak solver into a range with enough variance for gradient descent to do its job.
That's the difference between hard data and useful data.
The Real Takeaway
If you're building an SFT or RL training pipeline, it's worth asking: what are you actually using to measure data quality?
If the answer is a static rubric or a generic LLM-as-judge score, Autodata suggests a more important metric: target model behavior. Good data sits in the right difficulty zone for the model you're training; it is not simply data that scores well on a judge prompt.
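One way to act on this in an existing pipeline is to score candidate prompts by how the target model actually performs on them. The sketch below is an illustration, not Autodata's method: `select_examples`, the band limits `low` and `high`, and the `min_var` floor are all hypothetical, and scores are assumed to be rubric scores in [0, 1] from several rollouts of the model being trained.

```python
from statistics import mean, pvariance

def select_examples(rollout_scores, low=0.2, high=0.8, min_var=0.01):
    """Keep prompts whose target-model scores sit in a learnable band.

    rollout_scores maps each prompt to a list of scores from the target
    model's rollouts. All thresholds are hypothetical.
    """
    kept = []
    for prompt, scores in rollout_scores.items():
        in_band = low <= mean(scores) <= high
        has_spread = pvariance(scores) >= min_var
        if in_band and has_spread:
            kept.append(prompt)
    return kept

candidates = {
    "easy": [1.0, 1.0, 1.0, 1.0],    # already solved: no signal
    "hard": [0.0, 0.0, 0.0, 0.0],    # always fails: no signal
    "useful": [0.2, 0.8, 0.5, 0.6],  # mixed outcomes: learnable
}
selected = select_examples(candidates)
```

The filter never consults a judge's opinion of the question itself; it only looks at how the target model's scores are distributed.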
Full paper:


