Every big AI story this year points the same way: the edge is moving from the model to the layer around it. Sakana AI's newly launched Fugu (and the heavier Fugu Ultra) is the most literal version of that idea — a system that beats frontier models by conducting them, without training a frontier model of its own.
What it actually is
Fugu isn't a bigger LLM. It's a small (~7B) model trained to route: take a task, decide which strong model in a pool should handle each part — Gemini 3.1 Pro, Claude Opus 4.8, GPT-5.5 — dispatch the work, and synthesize one answer. It can call itself recursively on long tasks (run, read its prior output, revise). Two tiers ship behind one OpenAI-compatible API: regular Fugu for everyday speed/quality, Fugu Ultra for hard multi-step work with a wider expert pool.
The claimed results — and the asterisks
Sakana reports Fugu beating Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 on 10 of 11 benchmarks:
- SWE-Bench Pro: Fugu Ultra 73.7 vs Opus 4.8 69.2, GPT-5.5 58.6, Gemini 3.1 Pro 54.2.
- Humanity's Last Exam: 50.0, edging Opus 4.8 (49.8).
- GPQA-D: 95.5, top of the field.
- Only loss: MRCRv2 (GPT-5.5 94.8 vs Fugu Ultra 93.6).
Two caveats matter as much as the numbers: results are vendor-reported and not independently verified, and Anthropic's strongest models (Fable 5, Mythos) aren't in the pool because they aren't public. So Fugu matches the frontier by orchestrating what it can reach — not the absolute best models. Read the leaderboard as a claim.
The real insight: resilience as a feature
The sharpest part of Sakana's pitch isn't benchmarks — it's framing orchestration as insurance. Routing across providers means that if one model is restricted, rate-limited, repriced, or pulled, Fugu reroutes to the rest of the pool. That's a procurement argument, and it lands: single-vendor dependence is a real operational risk. A routing layer turns "our AI went down because our provider did" into "it failed over."
The takeaway
Whether or not the exact numbers hold, Fugu makes the year's quiet thesis loud: orchestration is becoming a frontier capability in its own right. Frontier-grade outcomes no longer require owning a frontier model — just the engineering to route, govern, and synthesize across the ones that exist. The tradeoffs are real too: added latency/cost, and broader data exposure across every provider you call (the mirror image of self-hosting an open-weight model for sovereignty). Most serious platforms end up doing both, deliberately.
Full version, with the real-estate / PropTech angle, on the VSBD blog. Source: Sakana AI — Fugu.
Top comments (0)