A 35-billion-parameter model called Agents-A1 matches trillion-parameter models on multi-step agent tasks, according to a new paper from Shanghai AI Lab. The key insight: instead of scaling parameter count, the researchers scaled the "horizon" — the length and variety of action sequences the model trains on — producing a small model that sustains plans across long sequences of tool use as well as giants do. The work is on arXiv, and its title captures the thesis: scaling the horizon, not the parameters.
Key facts
- What: Shanghai AI Lab argues you can reach giant-model performance on long tasks not by adding parameters, but by training on much longer chains of real work.
- When: 2026-06-30
- Primary source: arXiv 2606.30616
Agents-A1 has 35 billion total parameters — small by frontier standards — yet matches trillion-parameter models on agent tasks: long, multi-step jobs where the AI must use tools, take actions, observe results, and keep working toward a goal across many turns. A giant model has vast raw knowledge, but agent work depends less on knowing more facts and more on sustaining a plan across a long sequence of actions without losing the thread. So instead of scaling parameter count, the researchers scaled the horizon — the length and variety of the action sequences the model learns from.
Concretely, they built an infrastructure that connects external knowledge, actions, observations, and checks on whether each action worked, and used it to generate training examples that average around forty-five thousand words per task. The model learns from full, extended episodes of real problem-solving, not short snippets. Training on long trajectories teaches the specific skill agents need: carrying context and a goal across dozens of steps. It is the difference between studying finished essays and watching someone work through an entire project from start to finish.
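To make the shape of such a training example concrete, here is a minimal Python sketch of one long episode stored as a chain of knowledge, action, observation, and check. Every class, field, and function name is hypothetical, and the filter that keeps only long, fully verified episodes is an assumption made for illustration; the article does not say how the paper uses its checks.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One turn of an agent episode (all field names are hypothetical)."""
    knowledge: str     # external knowledge consulted before acting
    action: str        # the tool call or command the agent issued
    observation: str   # what the environment returned
    verified: bool     # result of the check on whether the action worked


@dataclass
class Trajectory:
    """A full episode for one task, kept whole rather than cut into snippets."""
    task: str
    steps: list[Step] = field(default_factory=list)

    def word_count(self) -> int:
        """Count whitespace-separated words across every text field of every step."""
        return sum(
            len(text.split())
            for step in self.steps
            for text in (step.knowledge, step.action, step.observation)
        )

    def fully_verified(self) -> bool:
        """True only when the episode is non-empty and every step passed its check."""
        return bool(self.steps) and all(step.verified for step in self.steps)


def select_training_examples(trajectories, min_words=0):
    """Keep long, fully verified episodes (an assumed filter, not the paper's)."""
    return [
        t for t in trajectories
        if t.fully_verified() and t.word_count() >= min_words
    ]
```

In this sketch, calling `select_training_examples(episodes, min_words=40_000)` would keep only episodes near the length the article describes; the threshold is an illustrative choice, not a figure from the paper.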
The training structure leans on distillation, an idea we cover in our lesson on distillation. Rather than making one model good at everything at once, the team first trained separate specialist teacher models, each expert in one domain, then distilled all of them into a single student model, routing the student to learn from whichever teacher was most relevant for a given kind of task. This lets one modestly sized model absorb the strengths of several specialists. Agents-A1 is also built as a mixture-of-experts model, so only part of it activates at any moment, keeping running costs down; our lesson on mixture of experts explains why that design is everywhere now.
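The routing idea can be sketched in a few lines of Python. The version below assumes the common form of distillation in which the student is pulled toward a teacher's softened output distribution with a KL-divergence loss, and it assumes routing is a plain lookup by task domain. The article confirms neither detail, and every name here (`routed_distillation_loss`, `teachers`, `student`) is hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def routed_distillation_loss(domain, prompt, student, teachers, temperature=2.0):
    """Loss that pulls the student toward the specialist teacher for this domain.

    `student` and each value in `teachers` are callables mapping a prompt to a
    list of logits (an assumed interface, chosen only for this sketch).
    """
    teacher = teachers[domain]  # routing: pick the teacher by task domain
    target = softmax(teacher(prompt), temperature)
    prediction = softmax(student(prompt), temperature)
    return kl_divergence(target, prediction)
```

When the student already agrees with the routed teacher the loss is zero, and it grows as their output distributions diverge, so minimizing it over tasks from many domains nudges one student toward several specialists in turn.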
The reported results are strong across a spread of demanding agent and science benchmarks — the paper claims leading or highly competitive numbers on tasks involving tool use, web browsing, and scientific reasoning, holding its own against trillion-parameter systems on the long-horizon work it was built for. If that holds up under independent testing, the implication is meaningful: the path to capable agents runs partly through better, longer training data rather than only through ever-larger and more expensive models — good news for anyone who cannot afford to train a trillion-parameter system.
The honest caveat is the standard one for a self-reported paper: these are the authors' own benchmark numbers, and benchmark performance and real-world reliability are not the same thing — a point our lesson on how AI gets benchmarked makes at length. Matching a giant model on a curated test set is impressive but does not guarantee matching it on the messy, open-ended tasks people actually throw at agents, where, as a wave of new benchmarks this week showed, even the best frontier models still struggle badly. There is also a selection effect: it is easier to reach parity on the exact kinds of tasks you designed your training data around. Still, the core argument is a healthy corrective to size-worship. Bigger is one way to get better, but it is not the only way — and for the specific challenge of agents that have to think across a long stretch of work, teaching a smaller model on longer examples may be the smarter bet.
Originally published on Ground Truth, where every claim is checked against the primary source.
Top comments (1)
This piece is essentially about a new class of agentic scaling claim: getting trillion-parameter-level performance from a much smaller 35B model by expanding the “agent horizon” rather than raw parameter count.
The core idea is that performance doesn’t only come from more weights, but from longer and richer action trajectories—multi-step reasoning loops where the model interacts with tools, environments, and verifiers over long contexts. In this setup, the system builds long “knowledge–action chains” (tens of thousands of tokens) and distills across multiple domain-specialized teacher agents to unify capabilities into one deployable model.
In practical terms, it tries to reproduce what huge trillion-parameter models achieve through scale by redistributing that power into:
- long-horizon planning
- tool use and feedback loops
- multi-agent distillation
- structured verification signals
The interesting tension here is efficiency vs reliability. These systems can look extremely powerful on benchmarks, but they also inherit classic agent risks: tool overuse, compounding errors across long trajectories, and hidden amplification of mistakes over time.
So the “punches like a trillion-parameter model” claim is less about raw intelligence and more about system-level emergence through orchestration.