Hassann

Posted on Jun 22 • Originally published at apidog.com

Sakana Fugu Benchmarks: What "Shoulder-to-Shoulder With Fable 5" Actually Means

Sakana’s Fugu benchmark claims are vendor-reported, not independently verified scorecards. Per Sakana’s release page, Fugu Ultra “stands shoulder-to-shoulder with leading models like Fable 5 and Mythos Preview” on engineering, scientific, and reasoning tasks, while Fugu “consistently outperforms” Gemini 3.1 Pro, Opus 4.8, and GPT 5.5 on a named set of applications. The key implementation detail: Fugu is an orchestrator that calls other vendors’ frontier models, so its results are not single-model wins in the same way Fable 5’s are.


What Fugu is and why it changes benchmark interpretation

Fugu is not a single foundation model. It is a multi-agent orchestration system exposed as one model behind an OpenAI-compatible API.

Sakana describes Fugu as a trained language model specialized in:

Delegation
Agent communication
Work synthesis
Dynamic coordination across multiple LLMs
Recursive use of Fugu itself

That matters when reading benchmark claims.

For a normal model benchmark, the score usually reflects one model’s weights doing the work. For Fugu, a result may involve Fugu calling Opus 4.8, GPT 5.5, Gemini 3.1 Pro, or other models, then synthesizing their outputs.

So if Fugu “beats Opus 4.8” on a task, that does not necessarily mean a single Sakana model out-reasoned Opus. It may mean an orchestration system used Opus plus other models more effectively.

If you want the architecture context, see this explainer on what Sakana Fugu is.

Read the parity claim carefully

Sakana’s first claim is a parity claim:

Fugu Ultra “stands shoulder-to-shoulder with leading models like Fable 5 and Mythos Preview.”

That is not the same as saying Fugu Ultra beats those models. It positions Fugu Ultra as a peer across engineering, scientific, and reasoning benchmarks.

Two implementation-relevant caveats:

The comparison names Mythos Preview, not the current generally available Mythos 5. Mythos Preview and the shipped Mythos line are different artifacts. See the Mythos-class model explainer for that distinction.
No public benchmark table is available to rerun. Sakana has not published a per-task score grid, methodology, or third-party reproduction for this claim.

Treat “shoulder-to-shoulder” as Sakana’s framing of its internal results, not as an independently reproducible measurement.

Separate the stronger application-level claim

Sakana also claims that Fugu “consistently outperforms” these configured competitors:

Gemini 3.1 Pro (high)
Opus 4.8 (max)
GPT 5.5 (xhigh)

The named applications are:

AutoResearch
Rubik’s Cube
Mechanical Design
Japanese Handwriting Analysis
One-Shot Chess
Financial Time Series Prediction

This is not a standard academic benchmark suite. It is an application-level evaluation on tasks Sakana selected.

That distinction matters for developers. End-to-end applications are exactly where an orchestrator can outperform a single model because it can:

Split a task into sub-problems.
Route each sub-problem to a suitable model.
Compare or verify intermediate outputs.
Synthesize a final answer.

That can be genuinely useful in production workflows. But it is still a model-of-models result, not a single-model result.
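To make the four steps above concrete, here is a minimal sketch of the split, route, verify, synthesize loop. It is not Sakana's implementation, which is not public. Every name in it (`orchestrate`, `toy_models`, `toy_route`, `toy_verify`) is hypothetical, and the "models" are plain Python functions standing in for real LLM calls.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a "model" is any function from a prompt to an answer.
Model = Callable[[str], str]


def orchestrate(
    task: str,
    models: Dict[str, Model],
    route: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> str:
    """Split a task, route each piece to a model, verify it, and synthesize."""
    # 1. Split: a naive split on ";". A real orchestrator would plan this step.
    subtasks: List[str] = [p.strip() for p in task.split(";") if p.strip()]

    parts: List[str] = []
    for sub in subtasks:
        # 2. Route: choose a model name for this sub-problem.
        name = route(sub)
        answer = models[name](sub)
        # 3. Verify: label the intermediate output instead of trusting it blindly.
        status = "ok" if verify(sub, answer) else "unverified"
        parts.append(f"[{name}/{status}] {answer}")

    # 4. Synthesize: here, simply join the labeled answers.
    return "\n".join(parts)


# Toy models that return canned answers (assumptions for illustration only).
toy_models: Dict[str, Model] = {
    "math-model": lambda prompt: "4",
    "code-model": lambda prompt: "def add(a, b): return a + b",
}


def toy_route(subtask: str) -> str:
    return "code-model" if "function" in subtask else "math-model"


def toy_verify(subtask: str, answer: str) -> bool:
    return len(answer) > 0


result = orchestrate(
    "compute 2 + 2; write an add function", toy_models, toy_route, toy_verify
)
print(result)
```

Even in this toy version, the final string is produced by two different "models" plus the glue code. That is the sense in which an orchestrator result is a model-of-models result.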

Do not summarize the claim as “Fugu beats Fable 5.” Sakana did not claim that. The parity claim and outperform claim target different rivals.

Why you cannot independently verify the numbers yet

No independent replication yet. Every Fugu benchmark figure discussed here is vendor-reported, measured on Sakana’s setup, with competitor configurations Sakana chose. As of 2026-06-22, no third party has rerun these tasks, no per-task score grid has been published, and no evaluation harness has been released.

For a single-model benchmark, reproduction requires:

The model
The test set
The scoring method
The inference configuration

For Fugu, reproduction also requires:

Fugu access
Access to every underlying model it routes to
Matching model versions
Matching effort settings
Matching orchestration behavior
Matching dynamic agent topology

Because Fugu can adapt its internal team per task, two runs of the same prompt may not use the same underlying model mix.

That adaptivity can be valuable for users, but it makes clean benchmark reproduction harder.
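You cannot pin Fugu's internal routing from the outside, but you can record everything you do control so that your own runs stay comparable over time. The sketch below is one possible way to do that. `RunManifest`, its field names, the `rubric-v1` label, and the `temperature` setting are all assumptions for illustration, not part of any Sakana API. The `routed_models` field stays `None` unless the provider exposes that information.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RunManifest:
    """The inputs you control for one evaluation run (hypothetical schema)."""

    model_id: str
    test_set_sha256: str
    scoring_method: str
    inference_config: Dict[str, object] = field(default_factory=dict)
    # Unknown for an adaptive orchestrator unless the provider reports it.
    routed_models: Optional[Tuple[str, ...]] = None

    def fingerprint(self) -> str:
        """Short, stable hash of the manifest, for spotting config drift."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def sha256_of_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


prompts = "Review this code for security issues\nReview this REST API design\n"
test_set_hash = sha256_of_text(prompts)

first_run = RunManifest("fugu-ultra", test_set_hash, "rubric-v1", {"temperature": 0})
rerun = RunManifest("fugu-ultra", test_set_hash, "rubric-v1", {"temperature": 0})
changed = RunManifest("fugu-ultra", test_set_hash, "rubric-v1", {"temperature": 0.7})

print(first_run.fingerprint() == rerun.fingerprint())    # same inputs
print(first_run.fingerprint() == changed.fingerprint())  # config changed
```

Two runs with the same fingerprint can still differ in which underlying models Fugu called. The fingerprint only tells you that your side of the experiment did not change.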

So be skeptical of secondary “Fugu scored X” claims unless they include a reproducible harness. This Fugu Ultra vs Fable 5 vs Mythos comparison stays qualitative for the same reason.

Research lineage: Trinity and Conductor

Sakana’s product story is connected to real research, but the papers should not be read as product spec sheets.

Two ICLR 2026 papers are relevant:

Trinity, “An Evolved LLM Coordinator” arXiv:2512.04695

Trinity is a sub-20,000-parameter coordinator optimized by derivative-free evolution. It uses Thinker, Worker, and Verifier roles. It is tiny and evolved, not trained by gradient descent.

Conductor, “Learning to Orchestrate Agents in Natural Language” arXiv:2512.04388

Conductor is a 7B model trained with reinforcement learning to decide how agents should communicate with each other. The paper claims it beats Mixture-of-Agents at lower cost.

Do not conflate them:

System	Method	Size	What to infer
Trinity	Derivative-free evolution	Sub-20K parameters	Research lineage
Conductor	Reinforcement learning	7B parameters	Research lineage
Fugu	Productized orchestration system	Not published	Do not infer exact specs

The official Fugu release does not publish a product parameter count. Mapping the 7B Conductor number directly onto Fugu is third-party inference.

What is established vs. unconfirmed

Item	What Sakana / sources say	Confidence
System type	Multi-agent orchestrator behind one model	Stated on release page
Variants	Fugu (balanced, low latency) and Fugu Ultra (max quality)	Stated on release page
Old beta name	Small variant was called “Fugu Mini” in beta and press	Historical
API surface	One OpenAI-compatible endpoint, both variants	Stated on release page
Underlying models	Calls multiple frontier LLMs, recursively including itself	Stated on release page
Product parameter count	Not published; 7B / Conductor specifics are third-party inference	Unconfirmed
Benchmark methodology	Vendor-reported, Sakana’s own setup, no harness released	Unverified

Naming note: the small variant was called “Fugu Mini” during the roughly 500-user beta that opened around April 24-25, 2026. The release page uses “Fugu” and “Fugu Ultra,” so use the current names in your code and docs.

How to test Fugu yourself

You cannot verify Sakana’s internal benchmark claims yet, but you can run your own evaluation.

Because Fugu uses an OpenAI-compatible chat-completions interface, you can reuse an existing OpenAI client and change the base URL.

As of 2026-06-22, the base URL is not published on any public page. Copy it from your Sakana console at console.sakana.ai. Do not trust invented hosts from secondary posts.

Here is the basic pattern:

from openai import OpenAI

# Copy the real base URL from console.sakana.ai after you sign in.
client = OpenAI(
    api_key="YOUR_FUGU_API_KEY",
    base_url="<YOUR_FUGU_BASE_URL_FROM_CONSOLE>",
)

resp = client.chat.completions.create(
    model="fugu-ultra",  # use "fugu" for the balanced variant; verify exact IDs in console
    messages=[
        {"role": "system", "content": "You are a precise code reviewer."},
        {
            "role": "user",
            "content": "Review this function for security issues:\n<paste code>",
        },
    ],
)

print(resp.choices[0].message.content)

Reported model IDs include:

fugu
fugu-ultra

However, confirm the exact IDs in the console before hardcoding them. Some providers use dated model IDs or aliases.
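If the endpoint also implements the OpenAI-style model listing route, you can check IDs programmatically instead of by eye. That is an assumption: nothing in Sakana's release confirms the route exists, so treat the console as the source of truth. The helper below, `check_model_id`, is a hypothetical name, and the dated ID in the example is made up to show what an alias might look like.

```python
from typing import Iterable, List, Tuple


def check_model_id(available: Iterable[str], wanted: str) -> Tuple[bool, List[str]]:
    """Return (exact_match, candidates) for a model ID you plan to hardcode.

    candidates lists every available ID that starts with `wanted`, so a human
    can spot dated variants or aliases when there is no exact match.
    """
    ids = sorted(set(available))
    candidates = [model_id for model_id in ids if model_id.startswith(wanted)]
    return wanted in ids, candidates


# With the OpenAI Python client, and only if the endpoint supports listing:
#   available = [m.id for m in client.models.list()]
# Here the list is hardcoded, and "fugu-ultra-20260601" is an invented example.
available = ["fugu", "fugu-ultra-20260601"]

exact, candidates = check_model_id(available, "fugu-ultra")
print(exact, candidates)
```

When `exact` is false and `candidates` is not empty, do not pick one automatically. Confirm the right ID in the console first.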

Build a useful evaluation harness

For Fugu, latency and cost can vary more than with a single model because the orchestrator may assemble different internal teams per request.

At minimum, log:

Prompt
Model ID
Response text
HTTP status code
Latency
Token usage
Cost, if available
Timestamp
Run ID
Any provider-specific request metadata

A simple JSONL logging pattern is enough to start:

import json
import time
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_FUGU_API_KEY",
    base_url="<YOUR_FUGU_BASE_URL_FROM_CONSOLE>",
)

test_cases = [
    {
        "id": "code-review-001",
        "messages": [
            {"role": "system", "content": "You are a precise code reviewer."},
            {"role": "user", "content": "Review this code for security issues:\n<paste code>"},
        ],
    },
    {
        "id": "api-design-001",
        "messages": [
            {"role": "system", "content": "You are an API design reviewer."},
            {"role": "user", "content": "Review this REST API design:\n<paste spec>"},
        ],
    },
]

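with open("fugu-eval-results.jsonl", "a", encoding="utf-8") as f:
    for case in test_cases:
        # perf_counter() is monotonic, so it is safer than time.time() for timing.
        start = time.perf_counter()

        resp = client.chat.completions.create(
            model="fugu-ultra",
            messages=case["messages"],
        )

        elapsed_ms = round((time.perf_counter() - start) * 1000)

        record = {
            "case_id": case["id"],
            "model": "fugu-ultra",
            "timestamp": time.time(),  # wall-clock time of the run, in Unix seconds
            "latency_ms": elapsed_ms,
            "content": resp.choices[0].message.content,
            "usage": resp.usage.model_dump() if resp.usage else None,
        }

        # One JSON object per line (JSONL), so results are easy to append and parse.
        f.write(json.dumps(record) + "\n")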

Then run the same test cases against the single models you already use. Your goal is not to reproduce Sakana’s AutoResearch or one-shot chess results. Your goal is to measure performance on your own tasks.
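Once you have records for more than one model, a per-model summary makes the comparison readable. The sketch below assumes each record has the `model`, `latency_ms`, and `usage` fields used in the logging pattern, and that `usage` follows the OpenAI-style shape with a `total_tokens` key. `summarize`, `load_jsonl`, and the `single-model-baseline` name are hypothetical, and the sample numbers are invented.

```python
import json
import statistics
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize(records: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group records by model and report run count, latency, and token totals."""
    by_model: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        by_model[record["model"]].append(record)

    summary: Dict[str, Dict[str, Any]] = {}
    for model, rows in by_model.items():
        latencies = [row["latency_ms"] for row in rows]
        # usage can be None when the provider omits it, so fall back to {}.
        tokens = sum((row.get("usage") or {}).get("total_tokens", 0) for row in rows)
        summary[model] = {
            "runs": len(rows),
            "median_latency_ms": statistics.median(latencies),
            # Population stdev: a rough signal of run-to-run variance.
            "latency_stdev_ms": statistics.pstdev(latencies),
            "total_tokens": tokens,
        }
    return summary


# Invented sample records; in practice use load_jsonl("fugu-eval-results.jsonl").
sample = [
    {"model": "fugu-ultra", "latency_ms": 1000, "usage": {"total_tokens": 100}},
    {"model": "fugu-ultra", "latency_ms": 3000, "usage": {"total_tokens": 300}},
    {"model": "single-model-baseline", "latency_ms": 500, "usage": None},
]
summary = summarize(sample)
print(json.dumps(summary, indent=2))
```

A high latency standard deviation for the same prompt is worth investigating for Fugu in particular, since the orchestrator may assemble a different internal team on each request.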

How this fits your Apidog workflow

You do not need a special benchmarking platform to pressure-test vendor claims. You need a repeatable way to send identical requests to several endpoints and compare the outputs.

Apidog can help you turn this into a practical workflow:

Register the Fugu endpoint as an OpenAI-compatible API.
Save your real evaluation prompts as requests.
Add competing endpoints, such as Fable 5 or Opus, to the same environment.
Send identical inputs to each model.
Capture outputs, status codes, latency, and token usage.
Rerun the same scenario when model versions change.

That gives you a more useful comparison than a parity claim with no public methodology.

For Fugu specifically, add assertions around:

Response time
Token count
Status code
Required output structure
Presence or absence of specific fields
JSON schema validity, if your application requires structured output

This helps surface cost or latency drift caused by adaptive routing.
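If you also want these checks in a script, the same assertions are easy to express in Python. This is a sketch under assumptions: `check_record` is a hypothetical helper, the record fields (`status_code`, `latency_ms`, `usage`, `content`) are names chosen for this example, and the thresholds are placeholders you should set from your own baseline runs.

```python
import json
from typing import Any, Dict, List, Sequence


def check_record(
    record: Dict[str, Any],
    max_latency_ms: int,
    max_total_tokens: int,
    required_keys: Sequence[str],
) -> List[str]:
    """Return a list of human-readable failures; an empty list means pass."""
    failures: List[str] = []

    if record.get("status_code") != 200:
        failures.append(f"status_code={record.get('status_code')}")

    if record.get("latency_ms", 0) > max_latency_ms:
        failures.append(f"latency {record['latency_ms']}ms > {max_latency_ms}ms")

    usage = record.get("usage") or {}
    if usage.get("total_tokens", 0) > max_total_tokens:
        failures.append(f"tokens {usage['total_tokens']} > {max_total_tokens}")

    # Structured-output check: content must be a JSON object with required keys.
    try:
        body = json.loads(record.get("content") or "")
    except json.JSONDecodeError:
        failures.append("content is not valid JSON")
        return failures
    if not isinstance(body, dict):
        failures.append("content is not a JSON object")
        return failures
    missing = [key for key in required_keys if key not in body]
    if missing:
        failures.append(f"missing keys: {missing}")

    return failures


good = {
    "status_code": 200,
    "latency_ms": 1200,
    "usage": {"total_tokens": 900},
    "content": '{"severity": "high", "findings": []}',
}
slow = {"status_code": 200, "latency_ms": 9000, "usage": None, "content": "not json"}

print(check_record(good, 5000, 2000, ["severity", "findings"]))
print(check_record(slow, 5000, 2000, ["severity", "findings"]))
```

Returning a list of failures instead of raising on the first one lets you log every problem for a run, which is more useful when you are tracking drift across many requests.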

Practical checklist before adopting Fugu

Use this checklist before putting Fugu into a production workflow:

[ ] Confirm the exact base URL from console.sakana.ai.
[ ] Confirm the current model IDs in the console.
[ ] Run your own prompt set, not only vendor benchmark tasks.
[ ] Compare against the single models you already use.
[ ] Log latency and token usage per request.
[ ] Test repeated runs of the same prompt to observe variance.
[ ] Validate structured outputs with schemas.
[ ] Track cost over multiple runs.
[ ] Avoid describing Fugu results as single-model wins.
[ ] Re-run your evaluation when model aliases or versions change.

Frequently Asked Questions

Does Fugu beat Fable 5 on benchmarks?

No. Sakana’s claim is parity: Fugu Ultra “stands shoulder-to-shoulder with” Fable 5 and Mythos Preview. The separate “outperforms” claim targets Gemini 3.1 Pro, Opus 4.8, and GPT 5.5 on specific applications, not Fable 5.

For the single-model side of that comparison, see the Claude Fable 5 benchmarks.

Are the Fugu benchmark numbers independently verified?

No. As of 2026-06-22, every figure is vendor-reported on Sakana’s own setup, with competitor effort settings Sakana chose. No third party has rerun the tasks, and no evaluation harness has been published.

Treat the claims as claims until someone outside Sakana reproduces them.

Why does it matter that Fugu is an orchestrator?

Because Fugu calls other vendors’ frontier models, recursively including itself. A “beats Opus 4.8” result may come from Fugu calling Opus and synthesizing that output with other model outputs.

That is a model-of-models win, not a single-model win.

Fable 5 and the Mythos line are single Anthropic models, so direct head-to-head comparisons are apples-to-oranges.

Which Mythos did Sakana compare against?

Sakana referenced Mythos Preview, the older frontier model from April that Anthropic described as too dangerous to release, not the current Mythos 5.

Some secondary write-ups name the wrong version. The Mythos-class explainer covers the difference between Preview and the shipped line.

What is the difference between Trinity and Conductor?

They are two separate ICLR 2026 papers:

Trinity (arXiv:2512.04695) is a sub-20,000-parameter coordinator optimized by evolution.
Conductor (arXiv:2512.04388) is a 7B model trained with reinforcement learning.

Different methods, different sizes. Neither paper should be treated as the shipped Fugu product spec sheet.

How can I test Fugu’s performance myself?

Point an OpenAI-compatible client at the Fugu base URL from console.sakana.ai, send your own tasks, and measure quality, latency, and cost.

You can also register the endpoint in Apidog to compare Fugu against the single models you already use with identical prompts and captured metrics.
