Hamza

Posted on Jun 26 • Originally published at tekmag.thsite.top

Sakana Fugu: The Multi-Agent Orchestration System That Commands GPT, Claude, and Gemini Through One API

#sakanai #fugu #aiagents #opensource

Sakana AI ’s Fugu isn’t a traditional large language model — it’s an “orchestration model” that dynamically routes every query across a pool of frontier LLMs (Opus 4.8, Gemini 3.1 Pro, GPT 5.5, and itself) and synthesises the result through a single OpenAI-compatible API. Launched days after US export controls cut access to Anthropic’s top models, Fugu represents both a technical paradigm shift and a geopolitical statement from Japan’s most ambitious AI lab.

What Is Sakana Fugu?

Fugu is a multi-agent orchestration system released on June 22, 2026 by Sakana AI, the Tokyo-based lab founded by Llion Jones (co-author of “Attention Is All You Need”) and David Ha (ex-Google, World Models paper). Instead of training a single ever-larger model, Sakana trained a ~7B-parameter language model to act as an intelligent dispatcher: it decides how to handle each query, delegates subtasks to the best-suited models, verifies the results, and synthesises a final answer.

Two tiers are available:

Fugu — Balanced for low-latency, dynamic per-query pricing based on the highest-tier model involved
Fugu Ultra — Performance-max mode at $5/M input tokens and $30/M output tokens, with subscription tiers at $20, $100, or $200 per month

The API is fully OpenAI-compatible (/v1/chat/completions and /v1/responses endpoints), meaning any existing tool that works with OpenAI can point at Fugu with a single URL swap. The model ID is fugu-ultra-20260615.

This places Fugu in a growing ecosystem of agentic AI systems reshaping the development landscape — but with a fundamentally different philosophy than the single-model approach most developers are used to.

How Fugu’s Orchestration Architecture Works

Unlike a typical LLM that processes every query through the same weights, Fugu employs a multi-stage pipeline. When you send a query, the orchestrator model:

Routes — Analyses the task type and determines which models should be involved
Delegates — Assigns roles dynamically: Thinker (plan), Worker (execute), Verifier (check)
Verifies — Cross-checks outputs for consistency and correctness
Synthesises — Combines the best results into a coherent final response

Critically, Fugu can also call instances of itself to decompose hard subtasks recursively, creating a self-scaling reasoning tree.

The Conductor and TRINITY Papers

Fugu is backed by two peer-reviewed papers accepted at ICLR 2026:

TRINITY (arXiv:2512.04695) — An evolved lightweight coordinator (~0.6B parameters) that uses CMA-ES to assign agent roles efficiently
Conductor (arXiv:2512.04388) — A ~7B RL-trained model that discovers natural-language coordination strategies — essentially learning how to “talk to” other models

The official Fugu technical report (arXiv:2606.21228) describes how these two components combine into a production system.

The Agent Pool

Fugu’s agent pool includes (according to Sakana): Opus 4.8, Gemini 3.1 Pro, GPT 5.5, plus instances of itself. The exact per-query model selection is not disclosed — users see only the final answer. However, the system lets you opt out of specific providers for compliance reasons.

Important caveat: While Sakana pitches Fugu as an answer to export-control vulnerability, the underlying agent pool is still entirely US-controlled models (OpenAI, Anthropic, Google DeepMind). The orchestration itself is open-source on GitHub (558 ★), but the frontier models it routes to are not.

Featured image: Sakana Fugu functional diagram — official Sakana AI architecture illustration. Source: Sakana AI via VentureBeat. Used for editorial purposes.

Video: Sakana Fugu Multi-Agent System — A comprehensive deep dive into Fugu’s architecture, benchmarks, and real-world demos. (Source: Nikki, YouTube)

Benchmark Performance: Fugu vs. Frontier Models

Sakana published an extensive benchmark comparison. These scores are vendor-reported and have not been independently replicated — a point worth keeping in mind given Sakana’s track record of contested claims (the AI CUDA Engineer evaluation loophole, AI Scientist’s 42% failure rate).

Benchmark	Fugu	Fugu Ultra	Opus 4.8	Gemini 3.1 Pro	GPT 5.5
SWE-Bench Pro	59.0	73.7	69.2	54.2	58.6
LiveCodeBench	92.9	93.2	87.8	88.5	85.3
GPQA-Diamond	95.5	95.5	92.0	94.3	93.6
Humanity’s Last Exam	47.2	50.0	49.8	44.4	41.4
TerminalBench 2.1	80.2	82.1	74.6	70.3	78.2

Fugu Ultra leads on 4 out of 7 coding and reasoning benchmarks. GPQA-Diamond is a three-way tie at the top. Only GPT 5.5 wins on multi-round conversational recall (MRCRv2, not shown). All scores are vendor-reported.

Pricing: Is the Orchestration Tax Worth It?

Fugu’s pricing is where things get complicated. The $20/month basic plan reportedly exhausts its credits on a single complex prompt — VentureBeat’s Carl Franzen noted that the $200 plan gives fewer than three hours of active use per week. The orchestration tokens (internal model-to-model traffic) are billed separately on Fugu Ultra and can significantly inflate the effective cost.

However, for specific tasks, the math works. One developer reported building a Crossy Road clone in 22 minutes at $7.32 using Fugu, versus 79 minutes and $37.85 on Opus 4.8 alone — suggesting Fugu is more cost-effective when its orchestration genuinely adds value (complex multi-step coding). For simple Q&A, you’re paying a premium for coordination overhead you don’t need.

It’s worth noting that Fugu is blocked in the EU/EEA pending GDPR alignment, and it’s currently offering a free second month for new sign-ups before July 31, 2026.

Hands-On Reception: What Early Users Say

Beta access (~500 users) has produced distinctly mixed impressions — a pattern familiar to anyone who follows Sakana’s releases.

Where Fugu Excels

Code review: One developer reported it “surfaced more than twenty issues where other models flag ~3”
Long-running multi-step tasks: The self-scaling reasoning tree shines on problems that require sustained context over many steps
Persona stability: Maintains role-play and instruction adherence across very long sessions
Autonomous research: Fugu Ultra ran 123 ML training experiments autonomously (~14 hours on one H100)

Where It Falls Short

Latency: Routinely described as “incredibly slow” (Ethan Mollick)
Frontend tasks: A ThreeJS task was “notably worse than GPT 5.5,” requiring 7-8 fix rounds
Consistency: “A bit jagged in its abilities” (Hamel Husain) — outstanding at code review, weak at frontend rendering
Perceived value: Hacker News commenters called it “a black box in front of other black boxes for slower service and more money”

This uneven profile mirrors what we’ve seen with other agentic coding models like Ornith-1.0 — orchestration systems tend to excel on well-scoped engineering problems while struggling with open-ended creative or visual tasks.

The Geopolitical Angle

Fugu’s launch on June 22 came just 10 days after US export controls (announced June 12) cut public access to Anthropic’s Claude Fable 5 and Mythos 5 — models that reportedly score ~86.0 on SWE-Bench Pro, well above Fugu Ultra’s 73.7. The timing is almost certainly deliberate.

David Ha framed the strategic case directly: “ Relying on a single company’s model for national infrastructure is a massive risk.” For developers and enterprises worried about vendor lock-in, Fugu offers an API abstraction layer that can swap model providers without code changes. Whether this constitutes true “AI sovereignty” is debatable — the underlying models remain US-controlled — but as a technical hedge against access restrictions, the architecture is compelling.

For a deeper look at how AI model geopolitics is reshaping the industry, check our coverage of the Anthropic-Alibaba distillation attack and the broader implications for cross-border AI competition.

Video: SAKANA FUGU: The AI Orchestrator — Why orchestration models represent a fundamental shift in how we deploy AI, and the sovereignty argument behind Fugu. (Source: Ops and Odds, YouTube)

FAQ

Is Fugu really better than Claude Opus 4.8 or GPT 5.5?

On vendor-reported benchmarks, Fugu Ultra leads on several coding and reasoning tests. But real-world reviews suggest it’s highly task-dependent — excellent at code review and multi-step research, worse at frontend tasks and simple Q&A. The latency and cost premiums mean it’s not a drop-in replacement for every use case.

How much does Fugu actually cost?

Fugu Ultra costs $5/M input tokens and $30/M output tokens, plus internal orchestration tokens billed separately. Subscriptions start at $20/month (basic) through $200/month (pro). The free second-month promotion runs until July 31, 2026.

Can I use Fugu from within the EU?

No — Fugu is currently blocked in the EU/EEA pending GDPR compliance alignment.

Is Fugu open source?

Partially. The orchestration installer and wrapper are on GitHub (558 stars), but the orchestrator model weights and the frontier models it routes to are proprietary.

What models does Fugu route to?

The claimed pool includes Opus 4.8, Gemini 3.1 Pro, GPT 5.5, and instances of itself. Per-query selection is not disclosed, but you can opt out of specific providers.

Should You Try Fugu?

Fugu is a genuinely new architectural paradigm — the “orchestration model” as a product category, not just an academic experiment. For developers working on complex multi-step coding tasks, research automation, or anything that benefits from sustained reasoning across multiple perspectives, the free second month is a low-risk way to see if the orchestration tax pays off for your specific workload.

For simple chat, single-turn coding, or latency-sensitive applications, you’re likely better served by a standard frontier model directly — at least until Fugu’s speed improves and the pricing settles into something more predictable.

If you’re exploring the broader shift toward AI coding agents and how to instruct them effectively, Fugu is worth watching as a bellwether for where the industry is heading — even if the current implementation is, in the words of one beta tester, “a jagged diamond that cuts well in some directions and barely at all in others.”

Originally published on TekMag

DEV Community