How Japan Just Beat Claude Mythos

#ai #llm #productivity #agents

A founder I know forwarded me the Sakana launch tweet with one line on top: "should we switch?"
Look, the tweet worked on him. The pitch was that Fugu Ultra stands shoulder-to-shoulder with Fable 5 and Mythos preview, no export controls attached, and that was all it took. My honest first reaction was a small stomach drop, because if it were true it'd reset half my stack.
Then I read the actual post. Fugu isn't a model. It's an orchestration layer wearing a model's coat.
And the part that should bother you isn't the marketing. It's that a lot of senior people saw one benchmark beat Fable and stopped reading right there.
I've had three of these models in production this year, so I read launch posts like this one with a particular kind of suspicion. This one earned it.

Why this matters before you forward it to your CTO

The export-controls line isn't a benchmark claim. It's a procurement claim. Fable and Mythos access got suspended under a directive, so a vendor shows up saying you can have that frontier output anyway, through them, no paperwork.
And honestly, if legal just killed your Mythos access mid-project, that hurts. A wrapper that routes around it sounds like oxygen.
But you're not buying a model. You're buying a middleman who rents the same models you could rent yourself, then charges you to pick between them. Somebody should say that out loud before the contract gets signed.

The number everyone misread

The standard instinct is that a higher LiveCodeBench score means a smarter system, and usually that instinct is fair. Most of us live and die by SWE-bench and Terminal-Bench numbers when we pick a coding model.
That breaks the moment the thing on the leaderboard isn't a model.
Fugu Ultra scored 93.2 on LiveCodeBench. Fable scored 89.3. But Fable is inside Fugu Ultra. So is GPT. We don't actually know the full roster underneath, and that right there is part of the problem.
You didn't watch Japan beat Anthropic. You watched Anthropic's model, plus a few other models, plus a router, beat Anthropic's model running alone.
A collection of models beating one of its own members isn't a discovery. We've known ensembles do that since boosting. Four of you run a relay and post a faster time than Usain Bolt covering the whole distance alone, and somehow that's "we beat Bolt."
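The ensemble point can be made with arithmetic. Here's a minimal sketch using plain majority voting (a simpler cousin of boosting), assuming three voters that are each right 80% of the time and fail independently. Both numbers are illustrative, not anything measured on Fugu or LiveCodeBench.

```python
from itertools import product

def majority_vote_accuracy(accuracies):
    """Probability that a strict majority of independent voters is correct."""
    n = len(accuracies)
    total = 0.0
    # Enumerate every right/wrong pattern across the voters.
    for outcome in product([True, False], repeat=n):
        prob = 1.0
        for is_right, acc in zip(outcome, accuracies):
            prob *= acc if is_right else (1.0 - acc)
        # Count the pattern only if more than half the voters are right.
        if sum(outcome) > n / 2:
            total += prob
    return total

# Hypothetical: three models, each 80% accurate, errors uncorrelated.
ensemble = majority_vote_accuracy([0.8, 0.8, 0.8])
print(f"best single member: 0.800, majority vote: {ensemble:.3f}")
```

Real models make correlated mistakes, so the lift in practice is smaller than this toy case suggests. The direction is the same, though: a bundle outscoring one of its members is the expected result, not a headline.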

What they actually built is a router, not a model

The framing is junk. The engineering underneath is actually interesting.
OpenRouter already shipped this shape with their Fusion idea, where one prompt fans out to several models at once and a judge model reads all the answers and stitches them into a single reply. Same path on every request, nothing learned about when to do what.
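The fan-out-and-judge shape is simple enough to sketch. This is not OpenRouter's implementation; the three model functions and the judge are stand-ins I made up so the control flow is visible.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for real provider calls; names and replies are made up.
def model_a(prompt):
    return f"A's answer to: {prompt}"

def model_b(prompt):
    return f"B's answer to: {prompt}"

def model_c(prompt):
    return f"C's answer to: {prompt}"

def judge(prompt, candidates):
    """Stand-in for a judge model. It just picks the longest draft;
    a real judge would be another LLM call that merges the drafts."""
    return max(candidates, key=len)

def fan_out_and_judge(prompt, models):
    # Same path on every request: call every model, then judge.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        candidates = list(pool.map(lambda m: m(prompt), models))
    return judge(prompt, candidates), candidates

answer, drafts = fan_out_and_judge(
    "reverse a linked list", [model_a, model_b, model_c]
)
print(len(drafts), "drafts ->", answer)
```

Every request pays for every model plus the judge, whether the prompt needed it or not.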
Sakana's twist is that the orchestrator is itself a small trained LLM. It doesn't fire every model on every request. It's trained to decide which models to call, when to delegate, and how to stitch the result. They ship two tiers. Fugu handles low-latency work; Fugu Ultra pulls in a deeper pool of agents for the hard multi-step stuff.
So the thing you call is a learned dispatcher, right? The raw intelligence still comes from the frontier models underneath: Opus, GPT, Fable, take your pick. Sakana owns the routing and the synthesis, and that's the whole product.
There's a name for this. It's the mixture-of-agents pattern, productized behind one endpoint. Useful, but nowhere near frontier.
That's why the leaderboard comparison doesn't survive a second look. The number proves the bundle is good. It says nothing about whether Sakana built the good part, and for a buying decision that's the whole game.
It's closer to an AI harness than a model, honestly. Claude Code is a harness too, and nobody puts Claude Code on a model leaderboard.
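To make the dispatcher idea from this section concrete, here's a toy version. Everything in it is hypothetical: the pool names are placeholders rather than Sakana's roster, and the routing rule is a crude word-count heuristic standing in for what the post describes as a small trained LLM.

```python
from dataclasses import dataclass, field

# Hypothetical model pool; names are placeholders, not Sakana's roster.
POOL = {
    "fast-small": lambda p: f"[fast-small] {p}",
    "frontier-a": lambda p: f"[frontier-a] {p}",
    "frontier-b": lambda p: f"[frontier-b] {p}",
}

@dataclass
class Trace:
    """Records which models ran for one request."""
    called: list = field(default_factory=list)

def route(prompt):
    """Stand-in for the trained orchestrator. A real one would be a learned
    policy; this uses prompt length purely for illustration."""
    if len(prompt.split()) < 12:
        return ["fast-small"]
    return ["frontier-a", "frontier-b"]

def dispatch(prompt):
    trace = Trace()
    drafts = []
    for name in route(prompt):
        trace.called.append(name)
        drafts.append(POOL[name](prompt))
    # Synthesis stand-in: join the drafts; a real system would merge them with a model.
    return " | ".join(drafts), trace

_, easy = dispatch("rename this variable")
_, hard = dispatch(
    "refactor the billing module to support multi-currency invoices "
    "with proration and tax rules"
)
print(easy.called, hard.called)
```

The `Trace` is the part worth noticing. When you own the dispatcher, you can record which models ran on each request and why.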

Same call, four invoices

The API is the seductive part. You hit one endpoint, you get one answer.

```python
# Looks like any other model call. That's the whole pitch.
resp = client.messages.create(
    model="fugu-ultra",  # not a model, a dispatcher
    messages=[{"role": "user", "content": prompt}],
)
# Under the hood this may have called Opus + GPT + Fable, run a judge pass,
# and billed you for every token of all of them.
# You can't see which. The selection logic is closed source.
```

That last comment is the real problem. You don't get to see the routing. You can't inspect which models ran, why, or how the final answer got assembled. It's a black box sitting on top of other black boxes.
Now the bill, in dollars. Somebody on Hacker News put it better than any analyst:
"You already pay $200 each to Anthropic, OpenAI, Cursor, Google. It doesn't round up nicely, so you end up paying another $200 a month to Sakana just to coordinate it."
Call it another $200. The exact figure doesn't matter, the shape does. You're already paying every provider underneath, and now you're paying a margin on top to have something choose between them.
Do the math on a single hard request and it's worse than one tax.

```
One hard request, Fugu Ultra fans out to ~3 models + a synthesis pass.

Solo call (what you do today):
  ~20K input + ~4K output, billed once, to one provider.

Fugu Ultra, same request:
  ~20K input x 3 models       -> ~60K input across 3 providers
  ~4K output x 3 models       -> ~12K output
  synthesis pass reads it all -> ~32K input + ~4K output

~4x to 5x the tokens, in and out, for ONE answer
(~92K in + ~16K out vs ~20K in + ~4K out),
then Sakana's margin on top.

At frontier prices, that's not a rounding error.
```
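Here's the same tally as a few lines of Python, so you can swap in your own prompt sizes. The token counts are the rough estimates from the block above. The assumption that the synthesis pass re-reads the prompt plus every draft is mine, inferred from the ~32K figure.

```python
# Token counts from the estimate above; all are rough assumptions.
PROMPT_IN, ANSWER_OUT, FAN_OUT = 20_000, 4_000, 3

solo_in, solo_out = PROMPT_IN, ANSWER_OUT

# Each fanned-out model reads the prompt and writes a draft.
draft_in = PROMPT_IN * FAN_OUT
draft_out = ANSWER_OUT * FAN_OUT

# The synthesis pass reads the prompt plus every draft, then writes the answer.
synth_in = PROMPT_IN + draft_out
synth_out = ANSWER_OUT

fugu_in = draft_in + synth_in
fugu_out = draft_out + synth_out

print(f"input:  {fugu_in:,} vs {solo_in:,}  ({fugu_in / solo_in:.1f}x)")
print(f"output: {fugu_out:,} vs {solo_out:,}  ({fugu_out / solo_out:.1f}x)")
```

Change `FAN_OUT` to see how quickly the multiple grows with a deeper pool. Dollar figures are left out on purpose, since per-token prices differ by provider and change often.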

For a one-off hard problem where being right beats the bill, sure, maybe it's worth it. For your day-to-day coding loop, you're lighting money on fire to get an answer your existing frontier model would've handed you anyway.

The aftermath

To be fair, the team isn't a bunch of clowns. David Ha, co-founder and CEO, made managing director at Goldman running rates trading in Japan before he left for Google Brain, where he co-authored the World Models paper that a lot of us actually read. The beta ran with close to 500 early users building real things, and the demos aren't faked: small UIs, chess, 3D-cube solving, some ML work.
But the reaction split for a reason. The same crowd that respects the founder is also side-eyeing a lab that calls itself frontier while mostly selling B2B AI apps to Japanese businesses, and whose recruiting people describe as abrasive. Ha is clearly driven. But this launch just doesn't feel thought through.

There's no moat here

Plenty could still go wrong here, and I don't do happy endings.
First, defensibility. If the routing is a genuinely novel trick that reliably squeezes more out of the same models, then every frontier lab will ship its own version inside a week. Anthropic and OpenAI already hold all the pieces. Why would they hand the coordination margin to a third party sitting on top of their own models? They wouldn't. They'd absorb it.
Then there's variance. A chunk of that "feels smarter" could be retry behavior and lucky sampling, not real lift. Run it across many benchmarks instead of one cherry-picked bench and the gap might shrink to noise. They showed one bench, which is the tell.
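One way to sanity-check whether a gap like 93.2 vs 89.3 clears sampling noise is a two-proportion z-score. This is a rough sketch. The pass rates are the ones quoted earlier, but the problem counts are made up, since the number of problems behind the published scores isn't stated here.

```python
import math

def score_gap_z(p1, p2, n):
    """z-score for the gap between two pass rates, each measured on n problems.
    Treats the two runs as independent samples, which is a simplification."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return (p1 - p2) / se

# Scores from the launch comparison; the problem counts are hypothetical.
for n in (100, 400, 1000):
    z = score_gap_z(0.932, 0.893, n)
    print(f"n={n}: z={z:.2f}")
```

A z-score near 2 is the usual rough threshold for "probably not noise", so whether this gap means anything depends heavily on how many problems sit behind it. A proper comparison would also pair results per problem and repeat runs with different seeds, which is how you'd catch retry luck.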
And the black box. When Fugu hands you a wrong answer, you can't tell which underlying model failed or why the router picked it. Good luck debugging that during an incident. You've outsourced the exact layer you need to see into, and that opacity bites you in production every single time.
So no, Japan didn't beat Mythos. A clever dispatcher rented Mythos by the token and stacked a few friends on top. If you're already on a frontier model, don't jump ship. If you're coming from something three or four months old, you'll feel a lift, but that lift is the frontier models underneath, and you can rent those directly without the extra invoice.
They're selling it as the last API key you'll ever need. What it actually is: four bills to coordinate them all.
