김이더

Posted on Jun 10

The Strongest Model Yet Beat Pokémon With Vision Alone — Breaking Down Fable 5

#ai #claude #fable5 #gamedev

More posts at radarlog.kr. Original announcement on Anthropic.

Claude just beat Pokémon FireRed from start to finish.

No maps. No navigation aids. No game-state feed. Just raw screen pixels.

Here's why that's a big deal: older Claude models flailed at Pokémon even when you handed them a harness stuffed with helper tools. This new one finished the game on a bare, vision-only harness.

The model is called Fable 5. Anthropic shipped it on June 9.

And there's a catch. If you're on a paid plan you get it for free right now — but only through June 22. From the 23rd you pay separately. More on that trap later.

What Mythos Is — The Tier They Kept Locked

Anthropic's lineup was Opus, Sonnet, Haiku. Big, medium, fast.

But there's another tier sitting above Opus: Mythos.

The problem was that Mythos was so strong Anthropic deliberately kept it off the public shelf. When they revealed Mythos Preview back in April, they didn't hand it to regular users — they routed it through a cybersecurity program called Project Glasswing, to a small vetted set of organizations.

The reason is blunt. This tier is freakishly good at finding and exploiting software vulnerabilities. In a cyber defender's hands it hardens defenses. In an attacker's hands it's a weapon.

So Fable 5 is Mythos with guardrails bolted on.

The naming is a nice touch. Fable comes from the Latin fabula — "that which is told" — which shares a root with the Greek mythos. Same model, split into two names by whether the safety classifiers are on. The unguarded original is Mythos 5, still restricted to a vetted few.

Game-dev analogy: if Mythos 5 is the dev build with all cheats on, Fable 5 is that same build shipped to the public with the dangerous console commands disabled. Same engine. Different doors locked.

Concretely, when a query touches cybersecurity, biology, chemistry, or distillation, Fable 5 doesn't answer directly; it hands off to Opus 4.8. You don't get a flat refusal: a slightly lower (but still very strong) model catches the request, and you're told it happened.

How often does that fallback fire? Anthropic says under 5% of sessions. Flip that around: in 95%+ of sessions nothing gets routed away, and in those you're effectively running the same thing as Mythos 5.

They stress-tested the guardrails with an external bug bounty — over 1,000 hours of attempts and no universal jailbreak found. (Though the UK AISI reportedly made some progress in a short window, to be fair.)

Pricing — $10/$50, Looks Steep But There's Context

Down to business. The number you actually care about.

Fable 5 and Mythos 5 both cost $10 per million input tokens and $50 per million output tokens.

Lined up against the rest:

Model                Input / Output (USD per million tokens)
Fable 5 / Mythos 5   $10 / $50
Opus 4.8             $5  / $25   (Fast Mode is $10 / $50)
Sonnet 4.6           $3  / $15
Haiku 4.5            $1  / $5

On paper it's the most expensive publicly available model ever. But two things matter alongside that.

First, it's the same rate as Opus 4.8 Fast Mode. Fast Mode runs Opus at ~2.5x speed for $10/$50 — so for that same money you're getting Fable, which sits a tier above Opus. Same price, stronger engine.

Second, it's less than half the price of Mythos Preview. They raised the tier and halved the cost at the same time.

And token rates aren't the whole story. Prompt caching shaves up to 90% off cached input, and the batch API is a flat 50% off. Any workflow with heavy repeated context lands well below the sticker price.
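To make that concrete, here's a rough cost estimator. Treat it as a sketch under assumptions, not a billing tool: the rates come from the table in this post, the model keys are labels I made up (not real API model IDs), cached input is billed at the best-case 90% discount, cache-write charges are ignored, and I'm assuming the batch discount applies to the whole bill and stacks with caching, which the post doesn't confirm.

```python
# USD per million tokens (input, output), copied from the table in this post.
# The keys are hypothetical labels, not official API model IDs.
PRICES = {
    "fable-5": (10.0, 50.0),
    "opus-4.8": (5.0, 25.0),
    "sonnet-4.6": (3.0, 15.0),
    "haiku-4.5": (1.0, 5.0),
}

CACHE_DISCOUNT = 0.90  # best case: "up to 90%" off cached input
BATCH_DISCOUNT = 0.50  # flat 50% off through the batch API


def estimate_cost(model, input_tokens, output_tokens,
                  cached_fraction=0.0, batch=False):
    """Rough USD cost of one workload.

    cached_fraction is the share of input tokens served from the prompt cache.
    """
    in_rate, out_rate = PRICES[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = (
        fresh * in_rate
        + cached * in_rate * (1 - CACHE_DISCOUNT)
        + output_tokens * out_rate
    ) / 1_000_000
    if batch:
        # Assumption: the batch discount covers the whole bill and stacks
        # with the caching discount.
        total *= 1 - BATCH_DISCOUNT
    return total
```

With 10 million input tokens and 500,000 output tokens on Fable 5, the sticker price works out to $125. If 90% of that input is cached at the best-case discount, the same job comes to about $44.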

Anyone who's run servers knows this: judging "expensive" by hourly instance rate is a trap. The real cost is how many hours the job takes. Fable is built to compress exactly those hours.
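The same logic applies to tokens. A minimal sketch of the break-even math, using only the output rates from the price table; the token counts in the usage note are hypothetical, and input cost is left out to keep it simple.

```python
def breakeven_token_ratio(strong_out_rate, cheap_out_rate):
    """Share of the cheaper model's output tokens the pricier model
    can spend before it becomes the more expensive choice."""
    return cheap_out_rate / strong_out_rate


def job_cost(out_rate_per_mtok, output_tokens):
    """Output-token cost of a job in USD (input tokens ignored)."""
    return out_rate_per_mtok * output_tokens / 1_000_000
```

At $50 versus $25 per million output tokens the ratio is 0.5: Fable 5 comes out cheaper than Opus 4.8 whenever it finishes the job in fewer than half the output tokens. For example, a hypothetical job that costs Opus 3 million output tokens ($75) and Fable 1 million ($50) is cheaper on the "expensive" model.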

Performance — A 50-Million-Line Codebase Migrated in a Day

One demo settles the performance question.

In early testing, Stripe ran a codebase-wide migration on a 50-million-line Ruby codebase. A human team doing it by hand would've spent over two months. Fable 5 did it in a day.

Human team:  2+ months
Fable 5:     1 day

The thing that makes this possible is "long-horizon" — the ability to work for days without losing the thread. It plans in stages, delegates to sub-agents, and tests and verifies its own output.

That Planner/Generator/Evaluator harness you sweat over building by hand? The model ships with a chunk of it baked in. It holds together for days without you orchestrating from the outside.
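For anyone who hasn't built one, this is roughly the shape of a hand-rolled Planner/Generator/Evaluator loop. It's a toy sketch, not Anthropic's design: `run_harness`, its three callables, and `max_attempts` are all names I made up, and in a real harness each callable would wrap a model call.

```python
from typing import Callable, List


def run_harness(goal: str,
                plan: Callable[[str], List[str]],
                generate: Callable[[str], str],
                evaluate: Callable[[str, str], bool],
                max_attempts: int = 3) -> List[str]:
    """Split a goal into steps, then generate and verify each step,
    retrying a step until it passes or attempts run out."""
    results = []
    for step in plan(goal):
        for _attempt in range(max_attempts):
            output = generate(step)
            if evaluate(step, output):
                results.append(output)
                break
        else:
            # No attempt passed the evaluator.
            raise RuntimeError(
                f"step failed after {max_attempts} attempts: {step}"
            )
    return results
```

The orchestration lives outside the model here: you decide the steps, the retries, and the checks. The claim in this post is that Fable 5 does a good part of that internally.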

The benchmarks back it up. Cognition's FrontierCode eval doesn't just check whether a model solves a coding task — it checks whether it solves it while meeting production-grade codebase standards. Fable 5 tops frontier models there, and it does so at medium effort. First place without even going full power.

It's not just coding. It posted the top score on Hebbia's senior-level finance reasoning benchmark, and one partner called it the first model to break 90% on their analytics benchmark — a 10-point jump over Opus.

Token efficiency improved too. A physics research partner said Fable reached in 36 hours, with a third of the reasoning tokens, a result that took GPT-5.5 four days.

Proven Through Games — This Is the Fun Part

As a game programmer, what interests me most is that they validated this model through games. It's far more intuitive than a benchmark number.

Pokémon FireRed. The one from the intro. No map, no pathfinding, pure screen pixels, start to finish. Instead of claiming vision is state-of-the-art with a score, they showed it by pushing a whole game to the credits. There's also a demo of it rebuilding a web app's entire source code from screenshots alone.

Slay the Spire. A roguelike deckbuilder: every run is fresh and you learn across attempts. The interesting bit here is the memory experiment. When they gave it persistent file-based memory, the performance boost was three times the boost the same memory gave Opus 4.8. It also reached the game's final act three times as often.

Why does that matter? Bolt the same memory system onto two models and they use it differently. Handing over a save file is pointless unless the model has the head to read it and apply it next run. Fable's ability to exploit memory is on another level.
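The post doesn't describe the actual memory harness, so here's a minimal sketch of what "persistent file-based memory" can look like. Everything in it is my assumption: the JSON file, the function names, and the prompt format.

```python
import json
from pathlib import Path


def load_notes(path: Path) -> list:
    """Return notes saved by earlier runs, or an empty list on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return []


def save_note(path: Path, note: str) -> None:
    """Append one lesson learned so the next run can read it."""
    notes = load_notes(path)
    notes.append(note)
    path.write_text(json.dumps(notes, indent=2), encoding="utf-8")


def build_prompt(path: Path, task: str) -> str:
    """Prepend earlier lessons to the task text for the next run."""
    notes = load_notes(path)
    if not notes:
        return task
    lessons = "\n".join(f"- {n}" for n in notes)
    return f"Lessons from earlier runs:\n{lessons}\n\n{task}"
```

The plumbing is trivial, which is the point: the file is the same for every model. What differs is whether the model writes notes worth keeping and actually acts on them in the next run.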

Factorio. The factory-automation game engineers obsess over. Fable strategized and built an automated factory on its own, no human input. Anyone who's played it knows that's not a "clear the level" task; it's systems design plus long-term optimization. The long-horizon stamina, days of work without dropping the thread, shows up right inside the game.

There's more. Deriving a solar-system simulation from physics first principles to predict eclipses. Building a browser CAD editor and then designing a 3D-printable model inside it. Writing a fluid simulation synced to an EDM beat — using code, having never actually heard music.

Games and simulations don't lie. With a benchmark you can at least suspect overfitting; finishing a game either works or it doesn't. I think that's exactly why Anthropic put the game demos front and center.

So What Should a Subscriber Do — June 22 Is the Line

The practical part. Back to that trap.

If you're on Pro, Max, Team, or a seat-based Enterprise plan, Fable 5 is included in your plan limits at no extra cost from June 9 through June 22.

On June 23, Anthropic pulls Fable from those plans. After that, using it requires usage credits, billed at API rates ($10/$50).

Jun 9 – Jun 22   included, free
Jun 23 –         removed from plans, drawn from credits
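If you want that schedule in a script (say, to warn your team before the cutoff), a small sketch could look like this. The function name and return strings are hypothetical, and it compares month and day only, so it assumes the date falls in the launch year, which the post doesn't state.

```python
from datetime import date

INCLUDED_START = (6, 9)   # June 9: Fable 5 included in plan limits
INCLUDED_END = (6, 22)    # June 22: last included day


def fable_billing(day: date) -> str:
    """Classify a date against the rollout schedule described above.

    Assumes `day` is in the launch year; the year itself is ignored.
    """
    key = (day.month, day.day)
    if key < INCLUDED_START:
        return "not yet available"
    if key <= INCLUDED_END:
        return "included in plan"
    return "usage credits at API rates"
```

Note that June 22 itself still counts as included; the switch to credits happens on the 23rd.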

The reason, as always, is capacity. Demand is hard to predict, so they're rolling out conservatively in stages and say they'll restore Fable as a standard plan feature once there's headroom. No date on that, though.

Some of the community is calling it a bait-and-switch — give it for two weeks, then yank it. Opinions split. Either way, the user move is obvious.

If you're going to try it, do it by June 22.

Throw a task at it you've never been able to tackle. A days-long migration, a gnarly multi-agent workflow, a big refactor. The long-horizon work where Opus used to gas out mid-run — you get two free weeks to test exactly that.

One last note. Business traffic gets a new 30-day data retention policy: data sent to Mythos-class models is retained for 30 days for safety purposes (not used for training). If you're running sensitive code, factor that in.

Same price, a stronger engine, free for two weeks.
Not trying it is the actual loss.
