A 4B Model Just Beat a 397B Baseline - By Changing How Training Data Was Made

#ai #machinelearning #llm #datascience

Meta FAIR just published a result that's hard to ignore: a 4B-parameter model trained on data generated by Autodata outperformed the 397B baseline on PRBench-Legal, with no architectural changes.

The only variable: how the training data was created.

The Problem With Current Synthetic Data Pipelines

Most synthetic data workflows follow the same pattern: prompt a model, collect outputs, filter, done. The problem is that data quality is essentially uncontrolled.

Two failure modes keep showing up:

Too easy - the model you're trying to train already solves it. No learning signal.
Too hard - every rollout scores near zero. With no spread in rewards across the group, GRPO's group-relative advantages collapse to zero and there is no gradient to work with.
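
To make both failure modes concrete, here is a minimal sketch of GRPO-style group-relative advantage normalization. The function name, the `eps` value, and the reward lists are illustrative assumptions, not code from the paper or from any specific GRPO implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style).

    `eps` is an assumed stabilizer that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward groups for a single prompt, four rollouts each.
too_easy = [1.0, 1.0, 1.0, 1.0]  # every rollout solves the prompt
too_hard = [0.0, 0.0, 0.0, 0.0]  # every rollout fails
useful = [0.0, 1.0, 0.0, 1.0]    # mixed outcomes

easy_adv = group_relative_advantages(too_easy)
hard_adv = group_relative_advantages(too_hard)
useful_adv = group_relative_advantages(useful)

print(easy_adv)    # all zeros: nothing to reinforce
print(hard_adv)    # all zeros: nothing to reinforce
print(useful_adv)  # negative for failures, positive for successes
```

Only the mixed group produces non-zero advantages, so only that prompt contributes to the policy update.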

Autodata reframes the question: instead of generating data and then evaluating it, why not let model behavior itself define what good data looks like?

How the Pipeline Works

Autodata runs an orchestrator agent that coordinates four LLM subagents:

Challenger - generates questions and rubrics from source material (papers, legal docs, math problems)
Weak solver - a small model that should struggle if the data is good
Strong solver - a large model that should succeed, validating that the question is actually answerable
Judge - scores both solvers against the rubric and sends structured feedback back to the orchestrator

An example is only accepted when all three conditions hold simultaneously: the weak solver scores low, the strong solver scores high, and the gap between them is large enough. Otherwise, the orchestrator sends specific feedback to the Challenger, which generates a new question from an entirely different angle rather than rephrasing the old one.
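
The accept-or-retry loop described above can be sketched roughly as follows. Everything here is a hypothetical stand-in: the thresholds (`weak_max`, `strong_min`, `min_gap`), the `Verdict` type, and the stubbed challenger and judge are assumptions for illustration, since the article does not give the actual values or interfaces.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    weak_score: float    # judge's score for the weak solver
    strong_score: float  # judge's score for the strong solver
    feedback: str        # structured feedback for the Challenger

def is_accepted(v, weak_max=0.3, strong_min=0.7, min_gap=0.4):
    """All three conditions must hold at once (thresholds are assumed)."""
    return (
        v.weak_score <= weak_max
        and v.strong_score >= strong_min
        and (v.strong_score - v.weak_score) >= min_gap
    )

def generate_example(challenger, evaluate, max_iters=10):
    """Ask for new questions until one passes, feeding back the judge's notes."""
    feedback = None
    for iteration in range(1, max_iters + 1):
        question = challenger(feedback)
        verdict = evaluate(question)
        if is_accepted(verdict):
            return question, iteration
        feedback = verdict.feedback  # steer the next attempt
    return None  # budget exhausted without an accepted question

# Stubbed agents so the sketch runs without any model calls.
drafts = iter(["q-too-hard", "q-too-easy", "q-good"])
scores = {
    "q-too-hard": Verdict(0.0, 0.1, "strong solver failed; make it answerable"),
    "q-too-easy": Verdict(0.9, 0.95, "weak solver succeeded; probe deeper"),
    "q-good": Verdict(0.2, 0.9, "ok"),
}

result = generate_example(lambda feedback: next(drafts), scores.__getitem__)
print(result)  # ('q-good', 3)
```

In a real pipeline the two callables would wrap LLM calls; the gate itself is just a three-way conjunction over the judge's scores.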

It takes an average of 6.59 iterations to produce a single accepted question.

The Results

On both PRBench-Legal and the harder PRBench-Legal-Hard subset, graded independently by GPT-5 and Kimi-K2.6, the 4B model trained on Autodata came out on top in every comparison - outperforming the CoT-trained 4B and the 397B baseline without RL.

The same pattern holds across CS research tasks and scientific reasoning: agentic data leads from the start, and the gap widens through training.

What's Actually Going On Here

The 4B > 397B result isn't the most interesting part. The more important question is why it happened.

On legal tasks, standard CoT Self-Instruct produced questions that were too hard - weak solver scores clustered near zero across almost every rollout. When every rollout fails the same way, GRPO has nothing to learn from. Autodata didn't make the questions easier. It reshaped the reward distribution, pushing the weak solver into a range with enough variance for gradient descent to do its job.

That's the difference between hard data and useful data.

The Real Takeaway

If you're building an SFT or RL training pipeline, it's worth asking: what are you actually using to measure data quality?

If the answer is a static rubric or a generic LLM-as-judge score, Autodata suggests the more important metric is target model behavior: good data is data that sits in the right difficulty zone for the model you're training - not data that scores well on a judge prompt.
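
As a rough illustration of measuring data by target-model behavior, the sketch below keeps only prompts where the model being trained shows a spread of scores across rollouts. The `min_spread` threshold, the function name, and the score lists are assumptions made up for this example, not values from the paper.

```python
from statistics import pstdev

def in_learning_zone(rollout_scores, min_spread=0.1):
    """Keep a prompt only if the target model's scores actually vary.

    `min_spread` is an assumed cutoff on the population standard
    deviation of per-rollout scores; tune it for your own setup.
    """
    return pstdev(rollout_scores) >= min_spread

# Hypothetical per-prompt scores from four rollouts of the target model.
candidates = {
    "prompt_a": [0.0, 0.0, 0.0, 0.0],  # too hard: uniform failure
    "prompt_b": [0.2, 0.6, 0.4, 0.8],  # mixed: usable signal
    "prompt_c": [1.0, 1.0, 1.0, 1.0],  # too easy: uniform success
}

kept = [name for name, scores in candidates.items() if in_learning_zone(scores)]
print(kept)  # ['prompt_b']
```

A judge score alone would not separate these three prompts; the rollout statistics of the model you are training do.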

Full paper:

[2606.25996] Autodata: An agentic data scientist to create high quality synthetic data

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.