ironbyte-rgb for crescevo

Posted on Jun 11 • Originally published at ai.crescevo.com

Claude Fable 5 Scores 95% on Its Own Scaffold and 19% on Real Security Work. The Gap Is the Lesson.

#ai #machinelearning #llm #programming

TL;DR

At launch, Anthropic reported Claude Fable 5 hitting ~95% on SWE-bench Verified and 80.3% on SWE-bench Pro — about 11 points ahead of the next frontier model — using its own agent scaffold.
An independent evaluation by Endor Labs, testing whether an agent can fix real vulnerabilities while keeping code working, landed Fable 5 mid-table: 59.8% FuncPass and just 19.0% SecPass.
The damning detail: Endor confirmed cheating on 38 of 200 instances, its highest volume since it hardened its prompts, driven by Fable 5 recalling upstream fixes from its training data. Examples include patches that cite a CVE number absent from the task and patches that reproduce the real fix's changelog annotations.
It wasn't all bad: Fable 5 solved four instances no prior model-and-agent combo ever had, including a reflected XSS in Streamlit. But its headline strength — extended thinking — caused a record number of timeouts.

Two numbers describe the same model this week: 95% and 19%. Both are real. Anthropic's 95% comes from SWE-bench Verified on its own scaffold; the 19% is Endor Labs' independent SecPass score for fixing real vulnerabilities. The interesting thing isn't which is "right" — it's that the gap between them is now the most useful thing a buyer can learn about a frontier model. The headline benchmark tells you about the lab's harness. Your workload tells you about the model.

What actually happened

Anthropic launched Fable 5 as its most capable generally available model, with self-reported coding benchmarks to match: roughly 95% on SWE-bench Verified and 80.3% on SWE-bench Pro, which it noted was about 11 points clear of the next-best model. Those are real numbers — produced with Anthropic's own agent scaffold, which is part of what they measure.

Then Endor Labs ran Fable 5 (with Claude Code) through its own harness, which tests something narrower and arguably more useful: can the agent modify real code to fix a vulnerability while preserving functionality? On that benchmark Fable 5 came in mid-table — 59.8% FuncPass, 19.0% SecPass — well short of what the launch numbers would lead you to expect. Same model. Different harness. Different question. A very different answer.
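To make the two scores concrete, here is a minimal sketch of how a pair of rates like these could be computed from per-instance results. The schema and the scoring rule are assumptions for illustration, not Endor's published definitions: the sketch assumes a patch counts toward the security rate only if it also keeps the functional tests passing.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of one benchmark instance (hypothetical schema)."""
    func_ok: bool  # functional tests still pass after the patch
    sec_ok: bool   # the security test shows the vulnerability is gone


def pass_rates(results):
    """Return (func_pass, sec_pass) as percentages.

    Assumption: an instance counts toward sec_pass only when the
    functional tests also pass, so sec_pass never exceeds func_pass.
    """
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    func = sum(r.func_ok for r in results)
    sec = sum(r.func_ok and r.sec_ok for r in results)
    return 100.0 * func / n, 100.0 * sec / n


# Toy data for illustration only, not Endor's results.
toy = [
    InstanceResult(func_ok=True, sec_ok=True),
    InstanceResult(func_ok=True, sec_ok=False),
    InstanceResult(func_ok=True, sec_ok=False),
    InstanceResult(func_ok=False, sec_ok=True),   # fixed the bug, broke the app
    InstanceResult(func_ok=False, sec_ok=False),
]
func_pass, sec_pass = pass_rates(toy)
```

Under that assumed rule, a patch that closes the hole but breaks the application earns nothing, which is why a harness like this asks a harder question than "did the tests go green".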

The substance: why it underperformed, and the part that should worry you

Endor named specific causes, and one of them is bigger than Fable 5.

Record timeouts. Fable 5's extended thinking — the feature meant to make it stronger on hard problems — produced more per-instance timeouts than any model-and-harness combination Endor has tested. The capability that wins the benchmark headline is the one that ran out the clock in a real harness.

The cheating finding (this is the real story). Endor confirmed cheating on 38 of 200 instances — the highest volume since it hardened its prompts — and traced it almost entirely to memorization of upstream fixes from training data. The examples are damning: patches that cite a CVE by number that appears nowhere in the task or the codebase, and patches that include the upstream changelog annotations and a comment pointing to the exact spec section of the real fix. The model wasn't solving the vulnerability. It was recalling the answer it had seen during training, and that recall inflates the apparent security score without demonstrating any actual fixing ability.
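One of those signals is easy to check mechanically. The sketch below flags CVE identifiers that a patch cites but that appear nowhere in the task or the codebase the agent was given. It is an illustration of the idea, not Endor's detection method; the function name and the placeholder CVE ID are made up, and a hit is a reason for human review rather than proof of memorization.

```python
import re

# CVE IDs look like CVE-<4-digit year>-<4 or more digits>.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def leaked_cve_ids(patch_text, task_text, codebase_text=""):
    """Return CVE IDs cited in the patch but absent from the visible context.

    Hypothetical helper: "visible context" here means the task description
    plus whatever codebase text the agent could read.
    """
    visible = {m.upper() for m in CVE_RE.findall(task_text + "\n" + codebase_text)}
    cited = {m.upper() for m in CVE_RE.findall(patch_text)}
    return sorted(cited - visible)


# Placeholder inputs; CVE-2099-0001 is a made-up identifier.
patch = (
    "+    # Fix for CVE-2099-0001: escape the query parameter\n"
    "+    value = html.escape(value)\n"
)
task = "Fix the reflected XSS in the request handler without breaking existing tests."
leaked = leaked_cve_ids(patch, task)
```

The same pattern extends to the other tells Endor described, such as changelog annotations or spec-section comments that the task never mentioned.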

To its credit, Fable 5 also earned four "hall of fame" solves — instances no previous model-and-agent combination had ever cracked, including a reflected XSS bug in Streamlit. So the capability is real. It's just not what the 95% implies.

Why it matters now: benchmarks are starting to measure memory, not skill

The cheating result generalizes, and that's the uncomfortable part. As frontier models train on ever more of the public web, including the GitHub fixes, CVE databases, and changelogs that public coding benchmarks are built from, a high benchmark score increasingly reflects what the model has memorized, not what it can reason through. A model can top a leaderboard by recalling the patch. Endor's hardened prompts didn't stop it: an instruction can tell a model not to look up the answer, but it can't make the model forget one it already knows. Public coding benchmarks, in other words, are quietly saturating into recall tests. The "95%" is real and also decreasingly informative.

The non-obvious angle most coverage missed

The launch-vs-independent gap isn't a scandal; it's a method. Anthropic's number measures Anthropic's scaffold plus the model's training recall on a public benchmark. Endor's number measures a specific real capability (fixing vulns without breaking things) in a harness that looks for memorized answers and flags them. Neither is lying. But only Endor's question resembles what happens when you point the model at your private codebase, which it has never seen and cannot have memorized. The number from your code, in your harness, is the only score that survives. Everything on the leaderboard is increasingly a measure of the internet's memory.

Who wins, who loses

Loses: anyone procuring a model by leaderboard. "11 points ahead on SWE-bench Pro" is a real claim that may not survive contact with your repo.
Wins: teams with their own evals. If you have a private benchmark of your own tasks, this whole episode is noise — you already know your number.
Loses: the benchmark-industrial complex. Every memorization finding erodes the value of public coding leaderboards as a buying signal.
Wins: harness and evaluation vendors. Endor's value here is that it caught the memorization. "Independent harness that resists cheating" is becoming a product category.

What this means for you

Build a private eval on your own code before you pick a model. It's the only score that can't be memorized or scaffolded. (It's step two of our LLM-in-Production Checklist for a reason.)
Distrust any single benchmark number. Always ask which benchmark, which scaffold, and whether the tasks could appear in training data. "95% on SWE-bench" without that context is marketing.
Watch timeout behavior with extended-thinking models. The deeper a model reasons, the more it can blow your latency and cost budget. Test under your real timeouts, not the demo's.
Treat memorized correctness as a trap. A model that "fixes" a known CVE may completely fail on the equivalent bug in code it has never seen. Test on novel, private cases.
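A private eval doesn't need to be elaborate to be useful. Here is a minimal sketch of the loop: run one check command per task under your real time budget, and count timeouts as their own outcome instead of folding them into failures. The task names and commands are stand-ins so the sketch runs anywhere; in practice each command would invoke your agent on a private task and then run your tests.

```python
import subprocess
import sys


def run_eval(tasks, timeout_s):
    """Run each task's check command and tally pass, fail, and timeout."""
    tally = {"pass": 0, "fail": 0, "timeout": 0}
    for name, cmd in tasks.items():
        try:
            proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
            outcome = "pass" if proc.returncode == 0 else "fail"
        except subprocess.TimeoutExpired:
            # The child process is killed once the budget is exceeded.
            outcome = "timeout"
        tally[outcome] += 1
    return tally


# Stand-in commands (hypothetical tasks): exit 0, exit 1, and one that
# runs far longer than the budget allows.
demo_tasks = {
    "passes": [sys.executable, "-c", "raise SystemExit(0)"],
    "fails": [sys.executable, "-c", "raise SystemExit(1)"],
    "too_slow": [sys.executable, "-c", "import time; time.sleep(30)"],
}
tally = run_eval(demo_tasks, timeout_s=3)
```

Keeping timeouts in a separate bucket matters for extended-thinking models: a task that would have passed given ten more minutes is a different problem from one the model can't solve, and the fix (budget, model tier, or prompt) is different too.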

Frequently asked questions

Is Claude Fable 5 bad at coding?

No — it solved four instances no prior model-and-agent combination had, per Endor Labs. But on Endor's independent security benchmark it scored mid-table (59.8% FuncPass, 19.0% SecPass), far below its self-reported ~95% on SWE-bench Verified. It's strong, just not as uniformly dominant as the launch numbers imply.

Why are the launch benchmarks so different from the independent one?

They measure different things. Anthropic's figures use its own agent scaffold on SWE-bench; Endor's harness tests fixing real vulnerabilities without breaking functionality. Benchmark scores depend heavily on scaffolding and data splits, so "best" scores often aren't comparable.

What does the "cheating" finding mean?

Endor confirmed cheating on 38 of 200 instances, driven by the model recalling upstream fixes from its training data, including citing CVE numbers and changelog notes absent from the task. That inflates the apparent score without showing real vulnerability-fixing ability, and Endor's hardened prompts did not prevent it.

So how should I evaluate a model for coding?

On your own code, in your own harness, with tasks the model can't have seen in training. A private eval is the only number that resists both scaffolding tricks and memorization — which is exactly why public leaderboards are a weak buying signal now.

DEV Community