Sakana’s Fugu benchmark claims are vendor-reported, not independently verified scorecards. Per Sakana’s release page, Fugu Ultra “stands shoulder-to-shoulder with leading models like Fable 5 and Mythos Preview” on engineering, scientific, and reasoning tasks, while Fugu “consistently outperforms” Gemini 3.1 Pro, Opus 4.8, and GPT 5.5 on a named set of applications. The key implementation detail: Fugu is an orchestrator that calls other vendors’ frontier models, so its results are not single-model wins in the same way Fable 5’s are.
What Fugu is and why it changes benchmark interpretation
Fugu is not a single foundation model. It is a multi-agent orchestration system exposed as one model behind an OpenAI-compatible API.
Sakana describes Fugu as a trained language model specialized in:
- Delegation
- Agent communication
- Work synthesis
- Dynamic coordination across multiple LLMs
- Recursive use of Fugu itself
That matters when reading benchmark claims.
For a normal model benchmark, the score usually reflects one model’s weights doing the work. For Fugu, a result may involve Fugu calling Opus 4.8, GPT 5.5, Gemini 3.1 Pro, or other models, then synthesizing their outputs.
So if Fugu “beats Opus 4.8” on a task, that does not necessarily mean a single Sakana model out-reasoned Opus. It may mean an orchestration system used Opus plus other models more effectively.
If you want the architecture context, see this explainer on what Sakana Fugu is.
Read the parity claim carefully
Sakana’s first claim is a parity claim:
Fugu Ultra “stands shoulder-to-shoulder with leading models like Fable 5 and Mythos Preview.”
That is not the same as saying Fugu Ultra beats those models. It positions Fugu Ultra as a peer across engineering, scientific, and reasoning benchmarks.
Two implementation-relevant caveats:
- The comparison names Mythos Preview, not the current generally available Mythos 5. Mythos Preview and the shipped Mythos line are different artifacts. See the Mythos-class model explainer for that distinction.
- No public benchmark table is available to rerun. Sakana has not published a per-task score grid, methodology, or third-party reproduction for this claim.
Treat “shoulder-to-shoulder” as Sakana’s framing of its internal results, not as an independently reproducible measurement.
Separate the stronger application-level claim
Sakana also claims that Fugu “consistently outperforms” these configured competitors:
- Gemini 3.1 Pro (high)
- Opus 4.8 (max)
- GPT 5.5 (xhigh)
The named applications are:
- AutoResearch
- Rubik’s Cube
- Mechanical Design
- Japanese Handwriting Analysis
- One-Shot Chess
- Financial Time Series Prediction
This is not a standard academic benchmark suite. It is application-level evaluation.
That distinction matters for developers. End-to-end applications are exactly where an orchestrator can outperform a single model because it can:
- Split a task into sub-problems.
- Route each sub-problem to a suitable model.
- Compare or verify intermediate outputs.
- Synthesize a final answer.
That can be genuinely useful in production workflows. But it is still a model-of-models result, not a single-model result.
Do not summarize the claim as “Fugu beats Fable 5.” Sakana did not claim that. The parity claim and outperform claim target different rivals.
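To make the split, route, verify, synthesize pattern concrete, here is a minimal sketch of a generic orchestrator loop. It is not Sakana's implementation: the `orchestrate` function, its parameters, and the toy "models" are all hypothetical stand-ins, chosen so the example runs without any API access.

```python
from typing import Callable, Dict, List

# A "model" here is any function from prompt text to answer text (an assumption
# made for illustration; real models would be API calls).
Model = Callable[[str], str]


def orchestrate(
    task: str,
    split: Callable[[str], List[str]],
    route: Callable[[str], str],
    models: Dict[str, Model],
    verify: Callable[[str, str], bool],
    synthesize: Callable[[List[str]], str],
) -> str:
    """Split a task, route each part to a model, verify, then synthesize."""
    accepted: List[str] = []
    for sub in split(task):
        answer = models[route(sub)](sub)  # route the sub-problem to one model
        if verify(sub, answer):           # keep only answers that pass the check
            accepted.append(answer)
    return synthesize(accepted)


# Toy stand-ins so the sketch runs offline.
toy_models: Dict[str, Model] = {
    "math": lambda p: str(sum(int(x) for x in p.split("+"))),
    "text": lambda p: p.upper(),
}

result = orchestrate(
    task="1+2; hello",
    split=lambda t: [part.strip() for part in t.split(";")],
    route=lambda sub: "math" if "+" in sub else "text",
    models=toy_models,
    verify=lambda sub, ans: bool(ans),
    synthesize=lambda answers: " | ".join(answers),
)
print(result)  # 3 | HELLO
```

The final answer depends on two different "models", which is why a score for the whole pipeline cannot be attributed to any single one of them.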
Why you cannot independently verify the numbers yet
No independent replication yet. Every Fugu benchmark figure discussed here is vendor-reported, measured on Sakana’s setup, with competitor configurations Sakana chose. As of 2026-06-22, no third party has rerun these tasks, no per-task score grid has been published, and no evaluation harness has been released.
For a single model benchmark, reproduction requires:
- The model
- The test set
- The scoring method
- The inference configuration
For Fugu, reproduction also requires:
- Fugu access
- Access to every underlying model it routes to
- Matching model versions
- Matching effort settings
- Matching orchestration behavior
- Matching dynamic agent topology
Because Fugu can adapt its internal team per task, two runs of the same prompt may not use the same underlying model mix.
That adaptivity can be valuable for users, but it makes clean benchmark reproduction harder.
So be skeptical of secondary “Fugu scored X” claims unless they include a reproducible harness. This Fugu Ultra vs Fable 5 vs Mythos comparison stays qualitative for the same reason.
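While the vendor numbers cannot be rerun, the run-to-run variation described above can be measured on your own prompts. The sketch below assumes you have already collected several records for the same prompt, each with `latency_ms` and `content` fields; the `summarize_repeats` helper and the sample values are hypothetical.

```python
import statistics
from typing import Dict, List


def summarize_repeats(records: List[dict]) -> Dict[str, float]:
    """Summarize repeated runs of one prompt: latency spread and output diversity."""
    if len(records) < 2:
        raise ValueError("need at least two runs to measure spread")
    latencies = [r["latency_ms"] for r in records]
    return {
        "runs": len(records),
        "mean_ms": statistics.mean(latencies),
        "stdev_ms": statistics.stdev(latencies),  # sample standard deviation
        "distinct_outputs": len({r["content"] for r in records}),
    }


# Made-up records for illustration only; these are not measured Fugu results.
sample_runs = [
    {"latency_ms": 1000, "content": "A"},
    {"latency_ms": 1200, "content": "A"},
    {"latency_ms": 1400, "content": "B"},
]
spread = summarize_repeats(sample_runs)
print(spread)
```

A large standard deviation or many distinct outputs for an identical prompt is a hint, not proof, that the internal routing differed between runs.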
Research lineage: Trinity and Conductor
Sakana’s product story is connected to real research, but the papers should not be read as product spec sheets.
Two ICLR 2026 papers are relevant:
- Trinity, “An Evolved LLM Coordinator” arXiv:2512.04695
Trinity is a sub-20,000-parameter coordinator optimized by derivative-free evolution. It uses Thinker, Worker, and Verifier roles. It is tiny and evolved, not trained by gradient descent.
- Conductor, “Learning to Orchestrate Agents in Natural Language” arXiv:2512.04388
Conductor is a 7B model trained with reinforcement learning to learn communication structure between agents. The paper claims it beats Mixture-of-Agents at lower cost.
Do not conflate them:
| System | Method | Size | What to infer |
|---|---|---|---|
| Trinity | Derivative-free evolution | Sub-20K parameters | Research lineage |
| Conductor | Reinforcement learning | 7B parameters | Research lineage |
| Fugu | Productized orchestration system | Not published | Do not infer exact specs |
The official Fugu release does not publish a product parameter count. Mapping the 7B Conductor number directly onto Fugu is third-party inference.
What is established vs. unconfirmed
| Item | What Sakana / sources say | Confidence |
|---|---|---|
| System type | Multi-agent orchestrator behind one model | Stated on release page |
| Variants | Fugu (balanced, low latency) and Fugu Ultra (max quality) | Stated on release page |
| Old beta name | Small variant was called “Fugu Mini” in beta and press | Historical |
| API surface | One OpenAI-compatible endpoint, both variants | Stated on release page |
| Underlying models | Calls multiple frontier LLMs, recursively including itself | Stated on release page |
| Product parameter count | Not published; 7B / Conductor specifics are third-party inference | Unconfirmed |
| Benchmark methodology | Vendor-reported, Sakana’s own setup, no harness released | Not independently verified |
Naming note: the small variant was called “Fugu Mini” during the roughly 500-user beta that opened around April 24-25, 2026. The release page uses “Fugu” and “Fugu Ultra,” so use the current names in your code and docs.
How to test Fugu yourself
You cannot verify Sakana’s internal benchmark claims yet, but you can run your own evaluation.
Because Fugu uses an OpenAI-compatible chat-completions interface, you can reuse an existing OpenAI client and change the base URL.
As of 2026-06-22, the base URL is not published on a public page. Copy it from your Sakana console at console.sakana.ai. Do not trust invented hosts from secondary posts.
Here is the basic pattern:
```python
from openai import OpenAI

# Copy the real base URL from console.sakana.ai after you sign in.
client = OpenAI(
    api_key="YOUR_FUGU_API_KEY",
    base_url="<YOUR_FUGU_BASE_URL_FROM_CONSOLE>",
)

resp = client.chat.completions.create(
    model="fugu-ultra",  # use "fugu" for the balanced variant; verify exact IDs in console
    messages=[
        {"role": "system", "content": "You are a precise code reviewer."},
        {
            "role": "user",
            "content": "Review this function for security issues:\n<paste code>",
        },
    ],
)

print(resp.choices[0].message.content)
```
Reported model IDs include:
- fugu
- fugu-ultra
However, confirm the exact IDs in the console before hardcoding them. Some providers use dated model IDs or aliases.
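One way to double-check IDs programmatically is to ask the endpoint for its model list. This is a sketch under a loud assumption: that the Fugu endpoint implements the OpenAI-style model listing, which the release page does not confirm. The `fugu_model_ids` helper is hypothetical, and the stand-in client exists only so the example runs offline; with the real SDK you would pass your `OpenAI` client, whose `client.models.list()` yields objects with an `id` attribute.

```python
from types import SimpleNamespace
from typing import List


def fugu_model_ids(client, prefix: str = "fugu") -> List[str]:
    """Return sorted model IDs from the endpoint's model list that start with `prefix`."""
    return sorted(m.id for m in client.models.list() if m.id.startswith(prefix))


# Stand-in client so the sketch runs without network access.
# Assumption: the real endpoint may not expose a model list at all.
fake_client = SimpleNamespace(
    models=SimpleNamespace(
        list=lambda: [
            SimpleNamespace(id="fugu"),
            SimpleNamespace(id="fugu-ultra"),
            SimpleNamespace(id="some-other-model"),
        ]
    )
)

ids = fugu_model_ids(fake_client)
print(ids)  # ['fugu', 'fugu-ultra']
```

If the call fails against the real endpoint, fall back to the IDs shown in the console.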
Build a useful evaluation harness
For Fugu, latency and cost can vary more than with a single model because the orchestrator may assemble different internal teams per request.
At minimum, log:
- Prompt
- Model ID
- Response text
- HTTP status code
- Latency
- Token usage
- Cost, if available
- Timestamp
- Run ID
- Any provider-specific request metadata
A simple JSONL logging pattern is enough to start:
```python
import json
import time

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_FUGU_API_KEY",
    base_url="<YOUR_FUGU_BASE_URL_FROM_CONSOLE>",
)

test_cases = [
    {
        "id": "code-review-001",
        "messages": [
            {"role": "system", "content": "You are a precise code reviewer."},
            {"role": "user", "content": "Review this code for security issues:\n<paste code>"},
        ],
    },
    {
        "id": "api-design-001",
        "messages": [
            {"role": "system", "content": "You are an API design reviewer."},
            {"role": "user", "content": "Review this REST API design:\n<paste spec>"},
        ],
    },
]

# Append one JSON object per line so repeated runs accumulate in the same file.
with open("fugu-eval-results.jsonl", "a") as f:
    for case in test_cases:
        start = time.perf_counter()  # monotonic clock, suited to measuring intervals
        resp = client.chat.completions.create(
            model="fugu-ultra",
            messages=case["messages"],
        )
        elapsed_ms = round((time.perf_counter() - start) * 1000)

        record = {
            "case_id": case["id"],
            "model": "fugu-ultra",
            "latency_ms": elapsed_ms,
            "content": resp.choices[0].message.content,
            "usage": resp.usage.model_dump() if resp.usage else None,
        }
        f.write(json.dumps(record) + "\n")
```
Then run the same test cases against the single models you already use. Your goal is not to reproduce Sakana’s AutoResearch or one-shot chess results. Your goal is to measure performance on your own tasks.
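Once records for several models sit in JSONL files, a small summary makes the comparison easier to read. The sketch below assumes records shaped like the ones logged above (`model`, `latency_ms`, and an OpenAI-style `usage` dict with `total_tokens`); the `summarize_by_model` helper, the `baseline-model` name, and the sample numbers are hypothetical.

```python
import json
import statistics
from collections import defaultdict
from typing import Dict, Iterable


def summarize_by_model(lines: Iterable[str]) -> Dict[str, dict]:
    """Aggregate JSONL records into per-model run count, median latency, and tokens."""
    latencies = defaultdict(list)
    tokens = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        latencies[rec["model"]].append(rec["latency_ms"])
        usage = rec.get("usage") or {}  # usage may be logged as null
        tokens[rec["model"]] += usage.get("total_tokens", 0)
    return {
        model: {
            "runs": len(vals),
            "median_latency_ms": statistics.median(vals),
            "total_tokens": tokens[model],
        }
        for model, vals in latencies.items()
    }


# Made-up sample lines for illustration only; not measured results.
sample_lines = [
    json.dumps({"model": "fugu-ultra", "latency_ms": 900, "usage": {"total_tokens": 50}}),
    json.dumps({"model": "fugu-ultra", "latency_ms": 1100, "usage": None}),
    json.dumps({"model": "baseline-model", "latency_ms": 400, "usage": {"total_tokens": 30}}),
]
summary = summarize_by_model(sample_lines)
print(summary)
```

To use it on real data, pass an open file handle, for example `summarize_by_model(open("fugu-eval-results.jsonl"))`. Latency and token totals say nothing about answer quality, so pair them with your own grading of the outputs.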
How this fits your Apidog workflow
You do not need a special benchmarking platform to pressure-test vendor claims. You need a repeatable way to send identical requests to several endpoints and compare the outputs.
Apidog can help you turn this into a practical workflow:
- Register the Fugu endpoint as an OpenAI-compatible API.
- Save your real evaluation prompts as requests.
- Add competing endpoints, such as Fable 5 or Opus, to the same environment.
- Send identical inputs to each model.
- Capture outputs, status codes, latency, and token usage.
- Rerun the same scenario when model versions change.
That gives you a more useful comparison than a parity claim with no public methodology.
For Fugu specifically, add assertions around:
- Response time
- Token count
- Status code
- Required output structure
- Presence or absence of specific fields
- JSON schema validity, if your application requires structured output
This helps surface cost or latency drift caused by adaptive routing.
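If you also want these checks outside Apidog, the same assertions can be sketched in plain Python. Everything here is illustrative: the `check_response` helper, its default thresholds, and the required field names are hypothetical and should be replaced with the budgets and schema your application actually needs.

```python
import json
from typing import List


def check_response(
    status_code: int,
    latency_ms: float,
    total_tokens: int,
    body_text: str,
    required_fields: List[str],
    max_latency_ms: float = 30_000,  # hypothetical budget
    max_tokens: int = 8_000,         # hypothetical budget
) -> List[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    failures: List[str] = []
    if status_code != 200:
        failures.append(f"unexpected status code: {status_code}")
    if latency_ms > max_latency_ms:
        failures.append(f"latency {latency_ms} ms exceeds {max_latency_ms} ms")
    if total_tokens > max_tokens:
        failures.append(f"token count {total_tokens} exceeds {max_tokens}")
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        failures.append("output is not valid JSON")
        return failures
    if not isinstance(payload, dict):
        failures.append("output is not a JSON object")
        return failures
    for field in required_fields:
        if field not in payload:
            failures.append(f"missing field: {field}")
    return failures


ok = check_response(200, 1500, 900, '{"severity": "high", "issues": []}', ["severity", "issues"])
bad = check_response(500, 45_000, 900, "not json", ["severity"])
print(ok)
print(bad)
```

Running the same checks on every evaluation pass turns latency or cost drift into a visible failure rather than a surprise on the invoice.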
Practical checklist before adopting Fugu
Use this checklist before putting Fugu into a production workflow:
- [ ] Confirm the exact base URL from console.sakana.ai.
- [ ] Confirm the current model IDs in the console.
- [ ] Run your own prompt set, not only vendor benchmark tasks.
- [ ] Compare against the single models you already use.
- [ ] Log latency and token usage per request.
- [ ] Test repeated runs of the same prompt to observe variance.
- [ ] Validate structured outputs with schemas.
- [ ] Track cost over multiple runs.
- [ ] Avoid describing Fugu results as single-model wins.
- [ ] Re-run your evaluation when model aliases or versions change.
Frequently Asked Questions
Does Fugu beat Fable 5 on benchmarks?
No. Sakana’s claim is parity: Fugu Ultra “stands shoulder-to-shoulder with” Fable 5 and Mythos Preview. The separate “outperforms” claim targets Gemini 3.1 Pro, Opus 4.8, and GPT 5.5 on specific applications, not Fable 5.
For the single-model side of that comparison, see the Claude Fable 5 benchmarks.
Are the Fugu benchmark numbers independently verified?
No. As of 2026-06-22, every figure is vendor-reported on Sakana’s own setup, with competitor effort settings Sakana chose. No third party has rerun the tasks, and no evaluation harness has been published.
Treat the claims as claims until someone outside Sakana reproduces them.
Why does it matter that Fugu is an orchestrator?
Because Fugu calls other vendors’ frontier models, recursively including itself. A “beats Opus 4.8” result may come from Fugu calling Opus and synthesizing that output with other model outputs.
That is a model-of-models win, not a single-model win.
Fable 5 and the Mythos line are single Anthropic models, so direct head-to-head comparisons are apples-to-oranges.
Which Mythos did Sakana compare against?
Sakana referenced the older Mythos Preview from April, the frontier model that Anthropic described as too dangerous to release. It did not reference the current Mythos 5.
Some secondary write-ups name the wrong version. The Mythos-class explainer covers the difference between Preview and the shipped line.
What is the difference between Trinity and Conductor?
They are two separate ICLR 2026 papers:
- Trinity (arXiv:2512.04695) is a sub-20,000-parameter coordinator optimized by evolution.
- Conductor (arXiv:2512.04388) is a 7B model trained with reinforcement learning.
Different methods, different sizes. Neither paper should be treated as the shipped Fugu product spec sheet.
How can I test Fugu’s performance myself?
Point an OpenAI-compatible client at the Fugu base URL from console.sakana.ai, send your own tasks, and measure quality, latency, and cost.
You can also register the endpoint in Apidog to compare Fugu against the single models you already use with identical prompts and captured metrics.
