Jason Agostoni

Posted on Jun 15 • Originally published at jason.agostoni.net

Can Fable 5 Finish Off the Other Frontiers?

#ai #development #claude #coding

Can Anthropic's Fable 5 justify its staggering cost and live up to the massive hype to unseat the top specialized models? I ran Ship-Bench against the model to find out, stacking it up directly against the best overall performances so far across previous benchmarks.

Hypothesis: Given the premium market rate and the recent headlines regarding its capabilities, I expected Fable 5 to perform exceptionally well. Pitting it against a composite "Best-in-Class" lineup, where models like Sonnet 4.6 and DeepSeek v4 Pro are cherry-picked for their strongest roles, seemed like the only fair thing to do.

Key Insights

Fable 5 is the new benchmark king: It finished with a perfect 100 in architecture, an overall average of 96.49, and 5/5 passes, decisively beating DeepSeek v4 Pro's previous top average of 94.18.
The early-stage roles were highly competitive, but the biggest late-stage separation occurred in the Reviewer role, where Fable 5 set a new high bar (89.29) in a phase where all other models seem to struggle against the rubric.
Testing the limits of a multi-turn SDLC: Ship-Bench is designed to test a closer-to-reality process over an extended chain of handoffs. Fable's reputation rests on extreme reasoning capability and on one-shotting ideas, so seeing how consistently it held up across a multi-step workflow was a core focus of this run.
Cost in this run was exorbitant at nearly $180, driven by massive cache token volumes during the implementation phase. The practical answer to its viability depends heavily on whether near-flawless reliability justifies the extreme API spend.

Setup

All runs used the same machine, the same benchmark process, and the same underlying task. The harness differed slightly to accommodate the different models, and that is worth documenting up front because tooling can shape workflow, context handling, and operator experience even when the benchmark target stays the same.

Environment

Item	Value
Machine	Windows 11
Runtime	Node v24
Ship-Bench repo	ship-bench v1 (commit `0e7cc28`)
Benchmark task	Simplified knowledge base app

Run configuration

Item	Fable 5 run	Composite Best-in-Class
Harness	Claude Code v2.1.177	Various
Model	Anthropic Fable 5	Sonnet 4.6 / DeepSeek v4 Pro / Gemini 3.5 Flash
Backend	Anthropic subscription	Various
Run repo	`evals_jun10_fable`	Various

Judge configuration

Item	Value
Judge harness	Claude Code v2.1.177
Judge model	Opus 4.8 medium
Evaluation mode	LLM judge plus human review

Ship-Bench Context

Ship-Bench evaluates models across five SDLC roles: Architect, UX Designer, Planner, Developer, and Reviewer. Each phase is scored independently and produces artifacts that feed the next stage, which makes the benchmark useful for measuring not only isolated output quality but also handoff quality across a realistic workflow.

This run used the standard simplified knowledge base app task. That task is intentionally large enough to expose differences in architecture, planning, implementation, and review without becoming too open-ended to compare across runs.

Overall Results

Metric	Fable 5	Composite Best-in-Class
Architect	100	98.00 (Sonnet 4.6)
UX Designer	98.57	98.60 (DeepSeek v4 Pro)
Planner	97.20	99.00 (Gemini 3.5 Flash)
Developer	97.37	98.75 (DeepSeek v4 Pro)
Reviewer	89.29	85.00 (DeepSeek v4 Pro)
Average score	96.49	95.87
Passes	5/5	5/5

Fable 5 averaged higher across the entire workflow than an aggregate of the best single-role performances we've seen to date. While it narrowly lost the UX, Planner, and Developer rounds to specialized heavyweights, its sheer dominance in Architecture and Review elevated its total package.
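As a quick sanity check, the averages in the table can be reproduced from the per-role scores. The sketch below assumes the average is a plain unweighted mean across the five roles; the variable names are illustrative and not part of Ship-Bench.

```python
# Per-role scores copied from the Overall Results table.
ROLES = ["Architect", "UX Designer", "Planner", "Developer", "Reviewer"]
fable = [100.00, 98.57, 97.20, 97.37, 89.29]
composite = [98.00, 98.60, 99.00, 98.75, 85.00]


def mean(scores):
    """Unweighted average across the five roles (an assumption about the scoring)."""
    return sum(scores) / len(scores)


fable_avg = round(mean(fable), 2)
composite_avg = round(mean(composite), 2)

# A positive delta means Fable 5 scored higher than the composite in that role.
deltas = {role: round(f - c, 2) for role, f, c in zip(ROLES, fable, composite)}

print(f"Fable 5 average:   {fable_avg}")
print(f"Composite average: {composite_avg}")
for role, delta in deltas.items():
    print(f"{role:12s} {delta:+.2f}")
```

The per-role deltas make the pattern easy to see: the three losses are each under two points, while the Reviewer gap is more than four points in Fable's favor.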

Architect

The architect stage tests whether the model can turn the product brief into a concrete technical plan with clear decisions and minimal unresolved ambiguity.

Metric	Fable 5	Best-in-Class (Sonnet 4.6)
Score	100	98.00
Pass	Yes	Yes
Output	docs/architecture.md	docs/architecture.md
Eval	Architect Eval	Architect Eval

LLM judge summary: Fable 5 delivered a flawless performance. Every framework dependency was explicitly pinned to current stable versions, and the local-first, zero-service SQLite design (with WAL and busy_timeout for scale) perfectly matched the brief.
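For readers unfamiliar with the two SQLite settings the judge called out, here is a minimal sketch of enabling them. It is written in Python for brevity, not in the benchmark app's Node stack, and it is not taken from Fable's architecture document; the file path and the 5-second timeout are assumed values.

```python
import os
import sqlite3
import tempfile

# Hypothetical database path. WAL needs a file-backed database, not ":memory:".
db_path = os.path.join(tempfile.mkdtemp(), "kb.sqlite3")

conn = sqlite3.connect(db_path)

# Write-ahead logging lets readers keep reading while one writer appends to the log.
journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Wait up to 5000 ms (assumed value) for a lock instead of failing immediately.
conn.execute("PRAGMA busy_timeout = 5000")
busy_timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]

print(journal_mode, busy_timeout_ms)
conn.close()
```

Together these two pragmas are a common way to make a single-file SQLite database tolerate concurrent readers and occasional write contention without adding a separate database service.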

Human notes: Fable 5 was the first to achieve a perfect 100/100 in this phase. This demonstrates Fable's excellence in thinking through the design of a software application, which showed in the relatively straightforward development process that followed. Comparing Fable's and Sonnet's output reveals some similarities, but Fable was more accurate with the dependency versions and added more depth in each area. The conclusion here is that this level of upfront planning is likely what allows Fable to succeed in one-shotting application builds.

Practical takeaway: Fable 5 sets the new gold standard for technical specification, leaving zero ambiguity for the downstream phases.

UX Designer

The UX stage evaluates whether the design direction is specific enough to guide implementation, including flows, states, layout decisions, and interaction details.

Metric	Fable 5	Best-in-Class (DeepSeek v4 Pro)
Score	98.57	98.60
Pass	Yes	Yes
Output	docs/design-spec.md	docs/design-spec.md
Eval	UX Eval	UX Eval

LLM judge summary: The spec featured exhaustive state coverage (including a11y focus rings and ARIA combobox specs verified to AA contrast) and deterministic handoff instructions. It explicitly categorized mobile design as a "graceful, untested" layer, which capped its responsive score slightly.

Human notes: I disagree with the LLM judge here. I feel Fable performed much better than even DeepSeek's previous high mark. Fable provided significantly more depth in each area and even added a nice navigation map. The real question is whether this level of depth is worth the additional token costs, as Fable's cache reads were very high as a result of producing such a detailed spec.

Practical takeaway: Fable 5 delivers incredibly deep and implementable UX specs, but you pay a steep price in context tokens for that level of exhaustive detail.

Planner

The planner stage tests whether the model can convert the prior artifacts into an executable delivery sequence with sensible task sizing and dependency order.

Metric	Fable 5	Best-in-Class (Gemini 3.5 Flash)
Score	97.20	99.00
Pass	Yes	Yes
Output	docs/backlog.md	docs/backlog.md
Eval	Planner Eval	Planner Eval

LLM judge summary: Fable 5 delivered 6 clean, demonstrable iterations that perfectly mapped features to chunks. The only minor deviation was stretching the plan to 6 iterations instead of the nominal 3–5 band, but it maintained a rigorous MVP focus.

Human notes: Fable favored a more horizontal approach, which may work for a small app like this, but for larger enterprise apps, I prefer vertical slices. Gemini 3.5 got closer to vertical slices, though both models still saved the E2E tests for last and relied on unit tests and cURL command checks during the dev iterations. This is a case where Gemini 3.5 Flash was able to do more with less: less content in the backlog plan, fewer iterations, and more vertically oriented work.

Practical takeaway: Fable 5 provides incredibly detailed planning for a coding agent to follow, but it could have condensed its setup into just one "foundational" iteration rather than two or three.

Developer

The developer stage measures whether the model can implement the assigned backlog into a working MVP while staying aligned to the prior artifacts.

Metric	Fable 5	Best-in-Class (DeepSeek v4 Pro)
Score	97.37	98.75
Pass	Yes	Yes
Output	evals_jun10_fable	evals_may2026_deepseek-v4-pro
Eval	Developer Eval	Developer Eval

LLM judge summary: The developer successfully shipped a working MVP with 59/59 unit tests and 8/8 E2E tests passing natively. The code was cleanly typed and strictly layered. It lost minor points for lacking a configured test coverage tool and a minor unrecoverable error boundary for multi-byte payloads exceeding 100KB.
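The judge summary does not describe the mechanism behind the multi-byte defect, so the following is only an illustration of why this class of edge case is easy to miss: character count and byte count diverge once a body contains multi-byte UTF-8 text. The exact limit value and the helper names are assumptions, not details from the generated app.

```python
# Assumed limit: the write-up mentions 100KB; 100 * 1024 bytes is a guess at the exact value.
MAX_BODY_BYTES = 100 * 1024


def body_size_bytes(body: str) -> int:
    """Size of the body as it would be encoded in UTF-8."""
    return len(body.encode("utf-8"))


def within_limit(body: str) -> bool:
    """True when the UTF-8 encoded body fits inside the assumed limit."""
    return body_size_bytes(body) <= MAX_BODY_BYTES


ascii_body = "a" * 60_000       # 1 byte per character in UTF-8
multibyte_body = "記" * 60_000  # 3 bytes per character in UTF-8

for label, body in [("ascii", ascii_body), ("multi-byte", multibyte_body)]:
    print(label, len(body), body_size_bytes(body), within_limit(body))
```

Both bodies have the same character count, but only the ASCII one fits the limit. A check based on string length on one side and byte length on the other can therefore disagree for exactly this kind of payload.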

Human notes: Honestly, both models generated very similar code, but DeepSeek added some extra flair, such as using Zod on the APIs. I also liked DeepSeek's UI organization a little better, and its final application simply looked nicer.

Practical takeaway: In a structured, agentic development approach, Fable is overkill and far too expensive when a mid-tier model like Sonnet or Gemini 3.5 Flash, or a flagship like DeepSeek, would be much more cost-efficient. Plan ahead with Fable, but execute with a cheaper model.

Reviewer

The reviewer stage closes the loop by checking whether the built MVP actually satisfies the brief, the specs, and the implementation plan.

Metric	Fable 5	Best-in-Class (DeepSeek v4 Pro)
Score	89.29	85.00
Pass	Yes	Yes
Output	docs/qa-report.md	docs/qa-report.md
Eval	Reviewer Eval	Reviewer Eval

LLM judge summary: Fable 5 performed a highly accurate QA audit. It actively verified MVP flows, effectively cataloged states and edge cases, and even accurately reproduced the specific multi-byte body limit defect from the Developer stage. It missed perfection only by failing to measure application performance latencies against the 100-concurrent-user requirement.

Human notes: Despite Fable getting a higher score, I felt DeepSeek's output was far more detailed and complete, though Fable offered a little more narrative in its report.

Practical takeaway: Fable performed about as well as any of the other models, and would likely begin to excel on larger and more complex codebases.

Token and Cost Analysis

The quality difference matters, but the economics still matter, so cost deserves its own section.

Fable 5 was priced on Anthropic's Claude API at $10 per million input tokens and $50 per million output tokens. For context operations, it applies Anthropic's caching multiplier: roughly $1.00/M for cache reads and $12.50/M for cache writes.

Primary cost view

Metric	Fable 5
Input tokens	223,500
Output tokens	29,254
Cache read tokens	95,082,600
Cache write tokens	6,471,100
Estimated total cost	$179.67

Fable 5 is astonishingly expensive for a single Ship-Bench run. While the raw input and output generation were minimal, its deeply recursive workflow in the Developer iterations aggressively utilized context caching (consuming over 82 million cache read tokens to implement the build alone).
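The estimated total can be reproduced from the token counts and the per-million rates quoted above. This is a back-of-the-envelope sketch that treats the quoted rates as exact; the dictionary names are illustrative.

```python
# USD per million tokens, as quoted in the pricing paragraph.
RATES_PER_MILLION = {
    "input": 10.00,
    "output": 50.00,
    "cache_read": 1.00,
    "cache_write": 12.50,
}

# Token counts from the primary cost view table.
TOKENS = {
    "input": 223_500,
    "output": 29_254,
    "cache_read": 95_082_600,
    "cache_write": 6_471_100,
}

# Cost per category in USD.
costs = {kind: TOKENS[kind] * RATES_PER_MILLION[kind] / 1_000_000 for kind in TOKENS}
total = sum(costs.values())

# Fraction of the bill that comes from cache reads and writes combined.
cache_share = (costs["cache_read"] + costs["cache_write"]) / total

for kind, usd in costs.items():
    print(f"{kind:12s} ${usd:8.2f}")
print(f"{'total':12s} ${total:8.2f}")
print(f"cache share  {cache_share:.1%}")
```

Under these rates, cache reads and writes account for roughly 98% of the bill, while plain input and output tokens together come to under $4.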

App Comparison

The benchmark scores matter most, but the shipped app is still the most tangible output. Screenshots are a useful complement here because they show polish, coherence, and UX flow in a way score tables cannot fully capture.

Screenshots

View	Fable 5 app
Home page / Search	articles.png
Article detail	article.png
Article editor	edit.png

Subjective UX review

Fable 5 UI

Gemini 3.5 Flash UI

The UI Fable delivered was a brutalist, functionality-first design that was clean and complete. Gemini 3.5 Flash added a little more flair than Fable, but this benchmark specifically requests a calm, readable, information-first experience, and Fable created exactly that. This exercise by no means offered Fable a challenge, and others have seen it create some really fantastic experiences when pushed.

Interpretation

Fable performed quite well, as expected, but seemed somewhat out of place for a structured agentic process compared to its legendary one-shotting capabilities. The incredible detail it put into the architecture, design, and planning phases would likely matter tremendously for much more complex enterprise applications where deep reasoning really makes a difference.

For the iterative development phase, however, it is simply too expensive to justify compared to other models. The ideal workflow here seems to be using a model like Fable to deeply "pre-think" and specify the architecture and plans for a coding agent, and then executing the actual iterative loops with a cheaper, highly capable mid-tier model.

Verdict

Fable 5 is highly capable, there is no doubt about that, but for this benchmark's end-to-end SDLC loop, it is excessively expensive and likely overkill. I would absolutely add it as an option in your agentic process for a limited set of high-reasoning tasks, should the model be made broadly available again someday.

What's Next?

What should the next Ship-Bench matchup test?

Are there two models or tools you want compared head to head?
Are you more interested in raw quality, cost efficiency, or open-vs-closed performance?
Do you want to know which setup is best for end-to-end autonomous runs, or which one is good enough for specific roles?

If there is a comparison or question you want to see tested, let me know. That feedback helps shape the next run.
