Can Anthropic's Fable 5 justify its staggering cost and live up to the massive hype to unseat the top specialized models? I ran Ship-Bench against the model to find out, stacking it up directly against the best overall performances so far across previous benchmarks.
Hypothesis: Given the premium market rate and the recent headlines regarding its capabilities, I expected Fable 5 to perform exceptionally well. Pitting it against a composite "Best-in-Class" lineup, where models like Sonnet 4.6 and DeepSeek v4 Pro are cherry-picked for their strongest roles, seemed like the only fair thing to do.
Key Insights
Fable 5 is the new benchmark king: It finished with a perfect 100 in architecture, an overall average of 96.49, and 5/5 passes, decisively beating DeepSeek v4 Pro's previous top average of 94.18.
The early-stage roles were highly competitive, but the biggest late-stage separation occurred in the Reviewer role, where Fable 5 set a new high bar (89.29) in a phase where all other models seem to struggle against the rubric.
Testing the limits of a multi-turn SDLC: Ship-Bench is designed to test a closer-to-reality process over an extended chain of handoffs. Fable's legendary strength is usually extreme reasoning capability and one-shotting ideas, so how well its consistency held up across a multi-step workflow was a core focus of this run.
Cost in this run was exorbitant at nearly $180, driven by massive cache token volumes during the implementation phase. The practical answer to its viability depends heavily on whether near-flawless reliability justifies the extreme API spend.
Setup
All runs used the same machine, the same benchmark process, and the same underlying task. The harness differed slightly to accommodate the different models, and that is worth documenting up front because tooling can shape workflow, context handling, and operator experience even when the benchmark target stays the same.
Environment
| Item | Value |
|---|---|
| Machine | Windows 11 |
| Runtime | Node v24 |
| Ship-Bench repo | ship-bench v1 (commit 0e7cc28) |
| Benchmark task | Simplified knowledge base app |
Run configuration
| Item | Fable 5 run | Composite Best-in-Class |
|---|---|---|
| Harness | Claude Code v2.1.177 | Various |
| Model | Anthropic Fable 5 | Sonnet 4.6 / DeepSeek v4 Pro / Gemini 3.5 Flash |
| Backend | Anthropic subscription | Various |
| Run repo | evals_jun10_fable | Various |
Judge configuration
| Item | Value |
|---|---|
| Judge harness | Claude Code v2.1.177 |
| Judge model | Opus 4.8 medium |
| Evaluation mode | LLM judge plus human review |
Ship-Bench Context
Ship-Bench evaluates models across five SDLC roles: Architect, UX Designer, Planner, Developer, and Reviewer. Each phase is scored independently and produces artifacts that feed the next stage, which makes the benchmark useful for measuring not only isolated output quality but also handoff quality across a realistic workflow.
This run used the standard simplified knowledge base app task. That task is intentionally large enough to expose differences in architecture, planning, implementation, and review without becoming too open-ended to compare across runs.
Overall Results
| Metric | Fable 5 | Composite Best-in-Class |
|---|---|---|
| Architect | 100 | 98.00 (Sonnet 4.6) |
| UX Designer | 98.57 | 98.60 (DeepSeek v4 Pro) |
| Planner | 97.20 | 99.00 (Gemini 3.5 Flash) |
| Developer | 97.37 | 98.75 (DeepSeek v4 Pro) |
| Reviewer | 89.29 | 85.00 (DeepSeek v4 Pro) |
| Average score | 96.49 | 95.87 |
| Passes | 5/5 | 5/5 |
Fable 5 averaged higher across the entire workflow than an aggregate of the best single-role performances we've seen to date. While it narrowly lost the UX, Planner, and Developer rounds to specialized heavyweights, its sheer dominance in Architecture and Review elevated its total package.
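The averages and per-role gaps in the table can be re-derived from the five role scores. The following is a minimal sketch of that arithmetic; the variable and function names are my own and are not part of Ship-Bench.

```python
# Role scores copied from the Overall Results table above.
fable = {
    "Architect": 100.00,
    "UX Designer": 98.57,
    "Planner": 97.20,
    "Developer": 97.37,
    "Reviewer": 89.29,
}
composite = {
    "Architect": 98.00,    # Sonnet 4.6
    "UX Designer": 98.60,  # DeepSeek v4 Pro
    "Planner": 99.00,      # Gemini 3.5 Flash
    "Developer": 98.75,    # DeepSeek v4 Pro
    "Reviewer": 85.00,     # DeepSeek v4 Pro
}


def average(scores):
    """Unweighted mean of the role scores, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


fable_avg = average(fable)
composite_avg = average(composite)

# Positive delta means Fable 5 scored higher in that role.
deltas = {role: round(fable[role] - composite[role], 2) for role in fable}

print(f"Fable 5 average:   {fable_avg}")
print(f"Composite average: {composite_avg}")
for role, delta in deltas.items():
    print(f"{role:<12} {delta:+.2f}")
```

The deltas make the shape of the result easy to see: three small losses in the middle roles are outweighed by the gains in Architect and Reviewer.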
Architect
The architect stage tests whether the model can turn the product brief into a concrete technical plan with clear decisions and minimal unresolved ambiguity.
| Metric | Fable 5 | Best-in-Class (Sonnet 4.6) |
|---|---|---|
| Score | 100 | 98.00 |
| Pass | Yes | Yes |
| Output | docs/architecture.md | docs/architecture.md |
| Eval | Architect Eval | Architect Eval |
LLM judge summary: Fable 5 delivered a flawless performance. Every framework dependency was explicitly pinned to current stable versions, and the local-first, zero-service SQLite design (with WAL and busy_timeout for scale) perfectly matched the brief.
Human notes: Fable 5 was the first to achieve a perfect 100/100 in this phase. This demonstrates Fable's excellence in thinking through the design of a software application, which showed in the relatively straightforward development process that followed. Comparing Fable's and Sonnet's output reveals some similarities, but Fable was more accurate with the dependency versions and added more depth in each area. The conclusion here is that this level of upfront planning is likely what allows Fable to succeed in one-shotting application builds.
Practical takeaway: Fable 5 sets the new gold standard for technical specification, leaving zero ambiguity for the downstream phases.
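For readers unfamiliar with the WAL and busy_timeout settings the judge summary mentions, here is a minimal sketch of what enabling them looks like. It uses Python's standard `sqlite3` module purely for illustration; the benchmark app runs on Node, and the file name, helper name, and 5000 ms timeout are my assumptions, not values from Fable's architecture document.

```python
import os
import sqlite3
import tempfile


def open_db(path, busy_timeout_ms=5000):
    """Open a SQLite database with WAL journaling and a busy timeout.

    The 5000 ms default is an illustrative assumption.
    """
    conn = sqlite3.connect(path)
    # WAL lets readers keep reading while a single writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # busy_timeout makes a blocked connection wait instead of failing at once.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    return conn, mode


# WAL needs a file-backed database, so use a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "kb.sqlite")
conn, journal_mode = open_db(db_path)
timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]
print(journal_mode, timeout_ms)
conn.close()
```

Both pragmas are per-connection concerns in practice: `journal_mode=WAL` persists in the database file, while `busy_timeout` has to be set again on every new connection.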
UX Designer
The UX stage evaluates whether the design direction is specific enough to guide implementation, including flows, states, layout decisions, and interaction details.
| Metric | Fable 5 | Best-in-Class (DeepSeek v4 Pro) |
|---|---|---|
| Score | 98.57 | 98.60 |
| Pass | Yes | Yes |
| Output | docs/design-spec.md | docs/design-spec.md |
| Eval | UX Eval | UX Eval |
LLM judge summary: The spec featured exhaustive state coverage (including a11y focus rings and ARIA combobox specs verified to AA contrast) and deterministic handoff instructions. It explicitly categorized mobile design as a "graceful, untested" layer, which capped its responsive score slightly.
Human notes: I disagree with the LLM judge here. I feel Fable's spec was much stronger than even DeepSeek's previous high-scoring one. Fable provided significantly more depth in each area and even added a nice navigation map. The real question is whether this level of depth is worth the additional token costs, as Fable's cache reads were very high as a result of producing such a detailed spec.
Practical takeaway: Fable 5 delivers incredibly deep and implementable UX specs, but you pay a steep price in context tokens for that level of exhaustive detail.
Planner
The planner stage tests whether the model can convert the prior artifacts into an executable delivery sequence with sensible task sizing and dependency order.
| Metric | Fable 5 | Best-in-Class (Gemini 3.5 Flash) |
|---|---|---|
| Score | 97.20 | 99.00 |
| Pass | Yes | Yes |
| Output | docs/backlog.md | docs/backlog.md |
| Eval | Planner Eval | Planner Eval |
LLM judge summary: Fable 5 delivered 6 clean, demonstrable iterations that perfectly mapped features to chunks. The only minor deviation was stretching the plan to 6 iterations instead of the nominal 3–5 band, but it maintained a rigorous MVP focus.
Human notes: Fable favored a more horizontal approach, which may work for a small app like this, but for larger enterprise apps, I prefer vertical slices. Gemini 3.5 got closer to vertical slices, though both models still saved the E2E tests for last and relied on unit tests and cURL command checks during the dev iterations. This is a case where Gemini 3.5 Flash was able to do more with less: less content in the backlog plan, fewer iterations, and more vertically oriented work.
Practical takeaway: Fable 5 provides incredibly detailed planning for a coding agent to follow, but it could have condensed its setup into just one "foundational" iteration rather than two or three.
Developer
The developer stage measures whether the model can implement the assigned backlog into a working MVP while staying aligned to the prior artifacts.
| Metric | Fable 5 | Best-in-Class (DeepSeek v4 Pro) |
|---|---|---|
| Score | 97.37 | 98.75 |
| Pass | Yes | Yes |
| Output | evals_jun10_fable | evals_may2026_deepseek-v4-pro |
| Eval | Developer Eval | Developer Eval |
LLM judge summary: The developer successfully shipped a working MVP with 59/59 unit tests and 8/8 E2E tests passing natively. The code was cleanly typed and strictly layered. It lost minor points for lacking a configured test coverage tool and a minor unrecoverable error boundary for multi-byte payloads exceeding 100KB.
Human notes: Honestly, both models generated very similar code, but DeepSeek added some extra flair, such as using Zod on the APIs. I also preferred DeepSeek's UI organization a little better, and its final application simply looked nicer.
Practical takeaway: In a structured, agentic development approach, Fable is brutal overkill and far too expensive. A mid-tier model like Sonnet or Gemini 3.5 Flash, or a flagship like DeepSeek, would be much more cost-efficient. Plan ahead with Fable, but execute with a smaller model.
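The judge summary does not spell out the mechanics of the multi-byte payload issue, so the following is only a sketch of one common way such defects arise: a limit meant in bytes gets checked against a character count. The 100 * 1024 reading of "100KB" and all names here are assumptions for illustration, not taken from the generated app.

```python
MAX_BODY_BYTES = 100 * 1024  # assumed interpretation of the "100KB" limit


def body_size_bytes(body: str) -> int:
    """Size of the body as it would travel over the wire in UTF-8."""
    return len(body.encode("utf-8"))


def within_limit(body: str, limit: int = MAX_BODY_BYTES) -> bool:
    """Compare the encoded byte length, not the character count."""
    return body_size_bytes(body) <= limit


ascii_body = "a" * 60_000      # 1 byte per character in UTF-8
multibyte_body = "é" * 60_000  # 2 bytes per character in UTF-8

for name, body in [("ascii", ascii_body), ("multibyte", multibyte_body)]:
    print(name, len(body), body_size_bytes(body), within_limit(body))
```

Both bodies have the same character count, but only the ASCII one fits under the byte limit, which is why tests that only use ASCII fixtures tend to miss this class of bug.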
Reviewer
The reviewer stage closes the loop by checking whether the built MVP actually satisfies the brief, the specs, and the implementation plan.
| Metric | Fable 5 | Best-in-Class (DeepSeek v4 Pro) |
|---|---|---|
| Score | 89.29 | 85.00 |
| Pass | Yes | Yes |
| Output | docs/qa-report.md | docs/qa-report.md |
| Eval | Reviewer Eval | Reviewer Eval |
LLM judge summary: Fable 5 performed a highly accurate QA audit. It actively verified MVP flows, effectively cataloged states and edge cases, and even accurately reproduced the specific multi-byte body limit defect from the Developer stage. It missed perfection only by failing to measure application performance latencies against the 100-concurrent-user requirement.
Human notes: Despite Fable getting a higher score, I felt DeepSeek's output was far more detailed and complete, though Fable offered a little more narrative to its report.
Practical takeaway: Fable performed about as well as any of the other models, and would likely begin to excel on larger and more complex code bases.
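The one gap the judge flagged was the missing latency measurement against the 100-concurrent-user requirement. A minimal sketch of how a reviewer might collect such numbers is below. It is a hypothetical harness, not what any model produced: `fake_request` stands in for a real HTTP call to the app, and the choice of the 95th percentile is my assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def timed_call(request_fn):
    """Run one request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    request_fn()
    return time.perf_counter() - start


def measure_latencies(request_fn, concurrent_users=100):
    """Fire one request per simulated user, all submitted at once."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        futures = [pool.submit(timed_call, request_fn)
                   for _ in range(concurrent_users)]
        return [f.result() for f in futures]


def p95(latencies):
    """95th percentile: the 95th of the 99 cut points from quantiles()."""
    return quantiles(latencies, n=100)[94]


def fake_request():
    """Placeholder for a real call to the app under test (assumption)."""
    time.sleep(0.01)


latencies = measure_latencies(fake_request)
print(f"requests: {len(latencies)}, p95: {p95(latencies) * 1000:.1f} ms")
```

Swapping `fake_request` for a function that hits a real endpoint would turn this into a rough smoke test, though a dedicated load-testing tool would give more trustworthy numbers.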
Token and Cost Analysis
Quality matters, but so do the economics, so cost deserves its own section.
Fable 5 was priced on Anthropic's Claude API at $10 per million input tokens and $50 per million output tokens. Cached context is billed at Anthropic's caching multipliers: roughly $1.00/M for cache reads (0.1x the input rate) and $12.50/M for cache writes (1.25x the input rate).
Primary cost view
| Metric | Fable 5 |
|---|---|
| Input tokens | 223,500 |
| Output tokens | 29,254 |
| Cache read tokens | 95,082,600 |
| Cache write tokens | 6,471,100 |
| Estimated total cost | $179.67 |
Fable 5 is astonishingly expensive for a single Ship-Bench run. While the raw input and output generation were minimal, its deeply recursive workflow in the Developer iterations aggressively utilized context caching (consuming over 82 million cache read tokens to implement the build alone).
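The estimated total can be reproduced from the token counts and the per-million rates quoted above. This is a minimal sketch of that arithmetic; the dictionary and function names are mine, and the rates are the approximate ones stated in this post rather than an authoritative price list.

```python
# Rates in USD per million tokens, as quoted in this section.
PRICE_PER_MILLION = {
    "input": 10.00,
    "output": 50.00,
    "cache_read": 1.00,
    "cache_write": 12.50,
}

# Token counts from the primary cost view table.
TOKENS = {
    "input": 223_500,
    "output": 29_254,
    "cache_read": 95_082_600,
    "cache_write": 6_471_100,
}


def cost_breakdown(tokens, prices):
    """Cost in USD for each token category."""
    return {kind: tokens[kind] / 1_000_000 * prices[kind] for kind in tokens}


breakdown = cost_breakdown(TOKENS, PRICE_PER_MILLION)
total = sum(breakdown.values())
cache_share = (breakdown["cache_read"] + breakdown["cache_write"]) / total

for kind, usd in breakdown.items():
    print(f"{kind:<12} ${usd:>8.2f}")
print(f"{'total':<12} ${total:>8.2f}")
print(f"cache share of total: {cache_share:.1%}")
```

The breakdown shows why the bill is so high: plain input and output together come to under $4, while cache reads and writes account for almost all of the rest.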
App Comparison
The benchmark scores matter most, but the shipped app is still the most tangible output. Screenshots are a useful complement here because they show polish, coherence, and UX flow in a way score tables cannot fully capture.
Screenshots
| View | Fable 5 app |
|---|---|
| Home page / Search | articles.png |
| Article detail | article.png |
| Article editor | edit.png |
Subjective UX review
Fable 5 UI
Gemini 3.5 Flash UI
The UI Fable delivered was a brutalist, functionality-first design that was clean and complete. Gemini 3.5 Flash added a little more flair than Fable, but this benchmark specifically requests a calm, readable, information-first experience, and Fable created exactly that. This exercise by no means offered Fable a challenge, and others have seen it create some really fantastic experiences when pushed.
Interpretation
Fable performed quite well, as expected, but seemed somewhat out of place for a structured agentic process compared to its legendary one-shotting capabilities. The incredible detail it put into the architecture, design, and planning phases would likely matter tremendously for much more complex enterprise applications where deep reasoning really makes a difference.
For the iterative development phase, however, it is simply way too expensive compared to other models to justify. The ideal workflow here seems to be using a model like Fable to deeply "pre-think" and specify the architecture and plans for a coding agent, and then executing the actual iterative loops with a cheaper, highly capable mid-tier model.
Verdict
Fable 5 is highly capable, there is no doubt about that, but for this benchmark's end-to-end SDLC loop, it is excessively expensive and likely overkill. I would absolutely add it as an option in your agentic process for a limited set of reasoning-heavy tasks, should the model be made broadly available again someday.
What's Next?
What should the next Ship-Bench matchup test?
Are there two models or tools you want compared head to head?
Are you more interested in raw quality, cost efficiency, or open-vs-closed performance?
Do you want to know which setup is best for end-to-end autonomous runs, or which one is good enough for specific roles?
If there is a comparison or question you want to see tested, let me know. That feedback helps shape the next run.

