We've spent two years obsessing over model benchmarks. Meanwhile, a quieter shift has been happening: the realization that throwing more parameters at a problem isn't the same as building systems that can actually work together.
Sakana's Conductor is the clearest signal yet. A 7B model trained with reinforcement learning not to solve tasks directly, but to decide which agent should solve them. It reached 83.9% on LiveCodeBench and 87.5% on GPQA-Diamond—not by being smarter than the frontier models it orchestrates, but by being better at dispatching them.
The implications are uncomfortable for anyone who's built their architecture around single-model dominance. When a 7B parameter orchestrator can outperform individual 70B+ workers by routing queries intelligently, the economics of inference change completely.
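The dispatch pattern behind this is simple to sketch. The snippet below is a minimal illustration of the routing idea, not Sakana's actual system: every name is hypothetical, and a keyword heuristic stands in for where an RL-trained conductor would emit its routing decision.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Worker:
    name: str
    cost_per_call: float          # assumed relative inference cost
    handle: Callable[[str], str]  # the worker's solve function

# Hypothetical workers -- stand-ins for a small model and a frontier model.
def cheap_coder(task: str) -> str:
    return f"[cheap_coder] draft for: {task}"

def frontier_reasoner(task: str) -> str:
    return f"[frontier_reasoner] deep answer for: {task}"

WORKERS = [
    Worker("cheap_coder", cost_per_call=1.0, handle=cheap_coder),
    Worker("frontier_reasoner", cost_per_call=20.0, handle=frontier_reasoner),
]

def conduct(task: str) -> tuple[str, str]:
    """Route a task to a worker. A real conductor would be a learned
    policy; here a keyword check stands in for that decision."""
    hard = any(k in task.lower() for k in ("prove", "multi-step", "research"))
    worker = WORKERS[1] if hard else WORKERS[0]
    return worker.name, worker.handle(task)

name, answer = conduct("write a quicksort in Python")
print(name)  # the easy task goes to the cheap worker
```

The economics point falls out of the structure: if most queries route to the cheap worker and only hard ones hit the expensive one, average cost per query drops without the orchestrator itself needing frontier-scale capability.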
Google's TPU split into training (8t) and inference (8i) variants reinforces this trajectory. When you start optimizing silicon specifically for inference workloads, you're acknowledging that the action has shifted from training massive models to deploying them efficiently at scale.
The question isn't whether multi-agent systems will dominate. It's whether your stack is built to route between them efficiently.
The orchestration layer is no longer theoretical. It's here, it's 7B parameters, and it's beating the frontier models at their own game.