Hassann

Posted on Jun 10 • Originally published at apidog.com

Claude Fable 5 Benchmarks: What the Numbers Say

When Anthropic launched Claude Fable 5 on June 9, 2026, it described the model as state-of-the-art on nearly every benchmark it tested. One caveat for anyone looking for hard numbers: Anthropic’s announcement emphasized benchmark placements over full scoreboards, and several headline charts were published as images rather than copy-pasteable tables. This article focuses on what those placements mean, where Fable 5 appears strongest, and how to run a small benchmark on your own prompts. For a broader frontier-model comparison, see our breakdown of Opus 4.8 against GPT-5.5 and Gemini 3.5.

Fable 5 ships at $10 per million input tokens and $50 per million output tokens under the model ID claude-fable-5. It sits above Opus 4.8 in both capability and price, and Anthropic positions it as its strongest publicly available Claude model for software engineering, knowledge work, vision, and scientific research.

TL;DR

Claude Fable 5 ranks first among frontier models on FrontierCode and FrontierBench from Cognition, is state-of-the-art on CursorBench, and posts the highest score on Hebbia’s Finance Benchmark. The strongest pattern is long-horizon autonomous work: coding agents, multi-step reasoning, document-heavy workflows, and tasks that require the model to stay consistent over long contexts.

Because Anthropic reported many results as placements instead of full public score tables, treat the rankings as directional. Validate Fable 5 against your own prompts before switching production workloads.

What the headline benchmark claim means

Anthropic’s headline claim is that Claude Fable 5 is state-of-the-art on nearly all benchmarks it ran across:

Software engineering
Knowledge work
Vision
Scientific research
Long-horizon reasoning

Read that carefully. “State-of-the-art on nearly all benchmarks” means Fable 5 either tops the leaderboard or lands in the top tier on most reported evals. It does not mean:

Fable 5 wins every test by a wide margin
Every result has been independently reproduced
Every benchmark has a public numeric table
The model is automatically the best choice for your workload

The useful signal is consistency. A model that performs well across coding, finance, document reasoning, vision, and science is harder to dismiss as benchmark-specific tuning. If you are deciding whether Fable 5 is worth the higher price, this breadth is the main point to evaluate. For a model primer, see what Claude Fable 5 is.

The second important pattern is long-horizon work. Anthropic says Fable 5 “stays focused across millions of tokens in long-running tasks” and works autonomously longer than previous Claude models. Several highlighted benchmarks reward sustained execution, not just one-shot answers.

Coding benchmarks: FrontierCode and CursorBench

Coding is where Fable 5’s benchmark story is most concrete.

FrontierCode

On FrontierCode, a coding eval from Cognition, Anthropic reports that Fable 5 is the highest-scoring frontier model. It also holds that lead at medium effort.

That matters because effort settings affect cost and latency. Some models improve only when you spend more inference compute. A lead that holds at medium effort is more relevant to day-to-day developer workflows than one that appears only at maximum effort.

CursorBench

On CursorBench, Anthropic describes Fable 5 as state-of-the-art and says it “opened up a class of long-horizon problems that were out of reach” for prior models.

CursorBench is closer to real agentic coding than isolated function completion. It stresses workflows such as:

Editing multiple files
Maintaining project-level context
Planning several steps ahead
Running and responding to tests
Avoiding drift across long sessions

The practical takeaway: Fable 5 is not just positioned as a snippet-completion model. Its strongest coding case is sustained engineering work where an agent plans, edits, tests, and iterates across a codebase.

Knowledge and finance: Hebbia Finance Benchmark

Outside coding, the clearest knowledge-work result comes from the Finance Benchmark built by Hebbia, a company focused on AI for document-heavy financial and legal workflows.

Anthropic reports that Fable 5 posts the highest score on this benchmark, with gains concentrated in:

Document reasoning
Charts
Tables

That combination is important for developers building extraction or analysis systems. Financial workflows rarely involve clean text only. They often require the model to:

Read long filings
Trace numbers across pages
Compare charts with surrounding text
Extract values from dense tables
Avoid mixing up rows, columns, units, or time periods

This is also partly a vision result. Many charts and tables appear as images or mixed-layout PDF content, not structured JSON. A strong Finance Benchmark placement suggests Fable 5 may be useful for document pipelines where layout and visual reasoning matter.

Good candidate workloads include:

Financial report analysis
Contract review
Statement extraction
PDF question answering
Table and chart interpretation
Compliance-document summarization

Still, do not rely on the public benchmark alone. Test on your own documents, especially if the output affects money, legal review, or compliance.

Long-horizon reasoning: FrontierBench

The second Cognition eval, FrontierBench, is where Anthropic connects Fable 5’s benchmark performance to autonomous reasoning.

Anthropic reports Fable 5 as the highest-scoring model on FrontierBench and points to long-horizon reasoning as the key driver.

Long-horizon reasoning means the model can keep a goal and plan coherent across:

Many turns
Many tool calls
Large context windows
Intermediate failures
Self-generated notes or artifacts
Long-running tasks

This is different from answering a single hard question. A long-horizon task gives the model many chances to lose track, repeat itself, contradict earlier decisions, or optimize for the wrong objective.

This result is also harder to verify externally because long-horizon eval methodology is still evolving. Scoring must define what counts as progress, how partial success is measured, and how to prevent models from gaming the task.

Treat FrontierBench as a strong directional signal: Fable 5 is designed for autonomous agents that need to operate for long periods without falling apart.

Real-world signals beyond benchmarks

Benchmarks are useful, but deployment examples can be more informative because they show model behavior inside actual workflows.

Anthropic highlighted two examples.

Stripe codebase migration

Anthropic reports that Fable 5 migrated a 50-million-line Ruby codebase for Stripe in a single day, work the team estimated would have taken two months or more.

The key signal is not that the model solved a clever puzzle. A large migration is repetitive, context-heavy, and consistency-sensitive. Small mistakes can break builds or create subtle behavior changes.

For developers, this points to use cases like:

Large refactors
Framework upgrades
API migrations
Test generation across many files
Repetitive codebase cleanup
Multi-repository maintenance

Slay the Spire memory test

Anthropic also reported results from a Slay the Spire test designed to evaluate persistent memory. With file memory enabled, Fable 5 showed a 3x improvement over Opus 4.8.

The mechanism matters: the model could write notes to files and read them back across runs. That let it accumulate strategy instead of starting fresh every session.

For agent builders, the takeaway is straightforward: Fable 5 may benefit significantly from durable memory and tool access. If you are building long-running agents, test the model with the same memory, files, tools, and state management your production system will use.

How to interpret the benchmark results

Use the reported placements, but keep these caveats in mind.

1. Some benchmark owners are launch partners

FrontierCode and FrontierBench come from Cognition. The Finance Benchmark comes from Hebbia. These are credible organizations, but they are also part of the launch narrative.

That does not make the results invalid. It means you should wait for independent reproduction before treating the rankings as settled.

For comparison context, see our look at MiniMax M3 versus Opus 4.7 versus GPT-5.5.

2. Effort settings affect cost and quality

Anthropic reports that the FrontierCode lead holds at medium effort, which is encouraging. But effort is still a major variable.

When comparing models, check:

Effort level
Number of attempts
Temperature
Tool access
Context length
Whether retries were allowed
Whether the result is pass@1 or best-of-n

A score without those details is incomplete.
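One way to keep these details attached to every score is to record them alongside the result. The sketch below is a minimal illustration in Python; the `RunConditions` class and its field names are hypothetical and do not come from any benchmark’s published schema, and the sample values are invented.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConditions:
    """Hypothetical record of the conditions behind one reported score."""
    effort: str            # e.g. "medium" or "max"
    attempts: int          # samples drawn per task
    temperature: float
    tool_access: bool
    context_tokens: int    # context length made available to the model
    retries_allowed: bool
    metric: str            # e.g. "pass@1" or "best-of-n"


def mismatched_fields(a: RunConditions, b: RunConditions) -> list[str]:
    """Return the names of the fields that differ between two runs."""
    left, right = asdict(a), asdict(b)
    return [name for name in left if left[name] != right[name]]


# Illustrative values only; they are not taken from any published eval.
run_a = RunConditions("medium", 1, 0.0, True, 200_000, False, "pass@1")
run_b = RunConditions("max", 5, 0.0, True, 200_000, True, "best-of-n")

print(mismatched_fields(run_a, run_b))
# ['effort', 'attempts', 'retries_allowed', 'metric']
```

A non-empty list means the two scores were produced under different conditions and should not be compared directly.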

3. Public numeric scores are limited

Anthropic’s announcement emphasized placements, and some charts were published as images. Secondary sources may cite numbers, but if they are not traceable to a primary leaderboard, do not base a production decision on them.

Prefer primary sources from Anthropic, Cognition, Hebbia, or independent benchmark maintainers when available.

4. Rank is not margin

“Highest-scoring” tells you placement, not distance from the next model. A model can lead by a tiny margin or a large one. Those imply different upgrade decisions, especially at Fable 5’s pricing of $10 per million input tokens and $50 per million output tokens.

The conclusion: Fable 5’s reported benchmark profile is strong, but you still need workload-specific validation. Confirm current model IDs, pricing, and limits in Anthropic’s models overview before implementation.

Run your own benchmark with Apidog

The most useful benchmark is one built from your own prompts and your own definition of quality.

You do not need a research harness. A lightweight eval can compare Fable 5 against Opus 4.8 using:

Output quality
Latency
Token usage
Cost per successful answer

You can do this with Apidog, an API platform for designing, testing, and documenting requests.

The workflow:

1. Create one reusable Claude API request.
2. Run it against claude-fable-5.
3. Duplicate the request.
4. Change only the model field to claude-opus-4-8.
5. Compare output, latency, and token usage.

Step 1: Create a Claude Messages API request

Create a POST request in Apidog:

POST https://api.anthropic.com/v1/messages
x-api-key: {{ANTHROPIC_API_KEY}}
anthropic-version: 2023-06-01
content-type: application/json

Use an environment variable for your API key:

ANTHROPIC_API_KEY=your_api_key_here
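If you later want to script the same request outside Apidog, the headers above map directly onto a standard HTTP call. The following is a minimal sketch using only Python’s standard library; `build_request` is a hypothetical helper name, and the network call is left as a comment so nothing is sent until you choose to send it.

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request with the headers from Step 1."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


# To send it, read the key from the environment rather than hard-coding it:
#   import os
#   req = build_request(payload, os.environ["ANTHROPIC_API_KEY"])
#   with urllib.request.urlopen(req) as resp:
#       body = json.load(resp)
```

Keeping construction separate from sending makes it possible to inspect the exact request before any tokens are spent.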

Step 2: Add a fixed benchmark prompt

Use a prompt that resembles your real workload. For a coding-agent test, start with a migration-style instruction:

{
  "model": "claude-fable-5",
  "max_tokens": 2048,
  "messages": [
    {
      "role": "user",
      "content": "Refactor this Ruby method to use keyword arguments and add RSpec tests. Return only the updated code:\n\ndef charge(amount, currency, customer_id, idempotency_key)\n  # ...\nend"
    }
  ]
}

Run the request once against claude-fable-5.

Step 3: Duplicate the request for Opus 4.8

Duplicate the same request and change only this field:

"model": "claude-opus-4-8"

Keep the prompt, max_tokens, headers, and other settings identical. That way, differences are caused by the model, not by the test setup.
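The same one-field change can be expressed in code, which makes it easy to confirm that nothing else drifted between variants. This is a sketch in Python; `make_variant` and `changed_keys` are hypothetical helper names, the prompt is a placeholder, and the model IDs are the ones quoted in this article.

```python
import copy

BASE_PAYLOAD = {
    "model": "claude-fable-5",
    "max_tokens": 2048,
    "messages": [
        # Placeholder: substitute the fixed benchmark prompt from Step 2.
        {"role": "user", "content": "<your fixed benchmark prompt>"}
    ],
}


def make_variant(base: dict, model: str) -> dict:
    """Copy the payload and change only the model field."""
    variant = copy.deepcopy(base)  # deep copy so the base is never mutated
    variant["model"] = model
    return variant


def changed_keys(a: dict, b: dict) -> list[str]:
    """List the top-level keys whose values differ between two payloads."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


opus_payload = make_variant(BASE_PAYLOAD, "claude-opus-4-8")
print(changed_keys(BASE_PAYLOAD, opus_payload))  # ['model']
```

If `changed_keys` ever returns more than `['model']`, the two runs are no longer a like-for-like comparison.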

Step 4: Score the responses

Score each response against a simple rubric before you check which model produced it, so that knowing the source does not bias the score.

For a coding task, score each answer on:

Criterion	What to check
Correctness	Does the refactor preserve behavior?
Test quality	Do tests cover useful edge cases?
Completeness	Did the model return all requested code?
Minimality	Did it avoid unnecessary rewrites?
Maintainability	Is the result readable and idiomatic?

For document tasks, use criteria such as:

Criterion	What to check
Extraction accuracy	Are numbers, dates, and entities correct?
Citation quality	Does the model point to the right source text?
Table reasoning	Did it read rows and columns correctly?
Hallucination rate	Did it invent unsupported facts?
Format compliance	Did it return the requested schema?
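To keep scoring blind, it can help to hide the model names behind neutral labels until every rubric score is recorded. The sketch below shows one way to do that in Python; the function names, the 1-to-5 scale, and the equal weighting of criteria are assumptions made for illustration, not a prescribed method.

```python
import random
from statistics import mean
from typing import Optional


def blind_labels(responses: dict[str, str], seed: Optional[int] = None):
    """Shuffle responses and hide model names behind labels such as 'A', 'B'.

    Returns (labelled, key): `labelled` maps label -> response text for the
    reviewer, and `key` maps label -> model name, to be opened after scoring.
    """
    models = list(responses)
    random.Random(seed).shuffle(models)
    labels = [chr(ord("A") + i) for i in range(len(models))]
    labelled = {label: responses[m] for label, m in zip(labels, models)}
    key = dict(zip(labels, models))
    return labelled, key


def rubric_score(scores: dict[str, int]) -> float:
    """Average the per-criterion scores (assumed 1-5, equally weighted)."""
    return mean(scores.values())


labelled, key = blind_labels(
    {"claude-fable-5": "<response text>", "claude-opus-4-8": "<response text>"}
)
# Example scores for one response; the numbers are invented.
example = {"Correctness": 5, "Test quality": 4, "Completeness": 5,
           "Minimality": 3, "Maintainability": 4}
print(rubric_score(example))  # 4.2
```

The reviewer sees only `labelled`; `key` is consulted after all scores are written down.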

Step 5: Compare latency

Apidog shows response timing for each request. Track this because the best model on quality may not be the best model for an interactive app.

For each prompt, record:

model,response_time_ms,quality_score
claude-fable-5,____,____
claude-opus-4-8,____,____
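If you collect these rows across many prompts, a few lines of code can keep the file layout consistent. A minimal sketch using Python’s `csv` module follows; the `to_csv` helper is a hypothetical name, and the zeros are placeholders rather than measurements.

```python
import csv
import io

FIELDS = ["model", "response_time_ms", "quality_score"]


def to_csv(rows: list[dict]) -> str:
    """Serialize result rows using the column layout shown above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values, not real measurements.
sample = [
    {"model": "claude-fable-5", "response_time_ms": 0, "quality_score": 0},
    {"model": "claude-opus-4-8", "response_time_ms": 0, "quality_score": 0},
]
print(to_csv(sample))
```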

Step 6: Calculate token cost

Claude responses include a usage block similar to:

{
  "usage": {
    "input_tokens": 250,
    "output_tokens": 900
  }
}

For Fable 5, using the published rates of $10 per million input tokens and $50 per million output tokens:

input cost  = input_tokens  / 1,000,000 * 10
output cost = output_tokens / 1,000,000 * 50
total cost  = input cost + output cost

Example:

input_tokens  = 250
output_tokens = 900

input cost  = 250 / 1,000,000 * 10  = $0.0025
output cost = 900 / 1,000,000 * 50  = $0.0450
total cost  = $0.0475

Run the same calculation for Opus 4.8 using its rates of $5 per million input tokens and $25 per million output tokens.
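The arithmetic above is easy to wrap in a helper so it can be applied to every response’s usage block. The sketch below is in Python; `RATES_PER_MILLION` and `cost_usd` are hypothetical names, and the rates are the ones quoted in this article, so confirm them against current pricing before relying on the output.

```python
# USD per million tokens, as quoted in this article (verify before use).
RATES_PER_MILLION = {
    "claude-fable-5": {"input": 10.0, "output": 50.0},
    "claude-opus-4-8": {"input": 5.0, "output": 25.0},
}


def cost_usd(model: str, usage: dict) -> float:
    """Compute the cost of one response from its usage block."""
    rates = RATES_PER_MILLION[model]
    input_cost = usage["input_tokens"] / 1_000_000 * rates["input"]
    output_cost = usage["output_tokens"] / 1_000_000 * rates["output"]
    return input_cost + output_cost


usage = {"input_tokens": 250, "output_tokens": 900}
for model in RATES_PER_MILLION:
    print(f"{model}: ${cost_usd(model, usage):.5f}")
# claude-fable-5: $0.04750
# claude-opus-4-8: $0.02375
```

The example reuses one set of token counts for both models purely to show the rate difference. In practice each response reports its own usage, and output length can differ between models.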

Step 7: Test more than one prompt

Do not decide from a single example. Build a small prompt set that reflects your production workload.

A useful starter set:

1. One easy task
2. One average task
3. One difficult task
4. One long-context task
5. One task with ambiguous requirements
6. One task requiring strict JSON output
7. One task with charts, tables, or PDFs if relevant
8. One task requiring multi-step reasoning
9. One task involving code edits across files
10. One failure-mode prompt from your current system

After 5–10 prompts, you will have a practical answer to the real question: does Fable 5 produce better results on your tasks at a price and latency you can accept?
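Once every prompt has a pass/fail judgement and a cost, the comparison can be reduced to cost per successful answer. The following Python sketch shows the calculation; the `cost_per_success` helper, the record layout, and the sample figures are all illustrative assumptions rather than measured results.

```python
from typing import Optional


def cost_per_success(runs: list[dict]) -> Optional[float]:
    """Total spend divided by the number of runs judged successful.

    Failed runs still cost money, so their cost stays in the numerator.
    Returns None when nothing passed, since the ratio is undefined.
    """
    total_cost = sum(run["cost_usd"] for run in runs)
    successes = sum(1 for run in runs if run["passed"])
    return total_cost / successes if successes else None


# Made-up records for illustration only.
runs = [
    {"prompt": "easy", "cost_usd": 0.04, "passed": True},
    {"prompt": "hard", "cost_usd": 0.06, "passed": False},
    {"prompt": "long-context", "cost_usd": 0.10, "passed": True},
]
print(cost_per_success(runs))
```

Computing this figure per model puts price and quality on one axis: a more expensive model can still come out cheaper per success if it fails less often.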

You can download Apidog and set up this comparison in a few minutes. For cost details, see our Fable 5 pricing guide.
