When Anthropic shipped Claude Fable, the obvious question was: does the new tier beat everything else on hard engineering work?
We didn't want a benchmark score or a vibe check. We wanted a principal-engineer audit of a real production monorepo — with file:line evidence, severity labels, and an execution plan.
So we ran a controlled experiment:
- One prompt (4 phases: repo map → audit → strategy → task plan)
- One target: the LangChain Python monorepo
- Five Claude models: Opus 4.8, Fable 5, Sonnet 5, Sonnet 4.6, Haiku 4.5
- Same setup for every run — project, work directory, WORK DIRECTORY task mode
We ran it through CTRL NODE (Bridge on a real machine, one agent per model tier). Not five browser tabs — a workflow you'd actually use on a team.
What we asked every model to deliver
You are a world-class, principal-engineer-level software engineer and technical audit expert.
Perform an in-depth analysis of this code repository, provide an honest audit report,
and offer a prioritized, actionable improvement plan.
Follow four phases in order: Discovery → Audit → Strategy → Task Plan.
All judgments must cite real file paths and line numbers. Do not guess.
Each run had to produce:
- audit-report-<model>.md: full Markdown report
- audit-report-<model>.html: interactive dashboard (Overview, Map, Audit, Strategy, Tasks)
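Because the prompt demands real file paths and line numbers, a report's citations can be spot-checked mechanically before anyone reads the prose. The sketch below is a minimal illustration, not part of our tooling: it assumes citations look like `relative/path.ext:LINE`, and the function name `check_citations` is hypothetical. It only confirms that the file exists and is long enough, not that the cited line says what the report claims.

```python
import re
from pathlib import Path

# Assumed citation shape: a relative path with a file extension, a colon,
# and a line number, e.g. "pkg/mod.py:42". Real reports may use other formats.
CITATION = re.compile(r"(?P<path>[\w./-]+\.\w+):(?P<line>\d+)")


def check_citations(report_text: str, repo_root: Path) -> list[tuple[str, int, str]]:
    """Return (path, line, problem) for every citation that does not resolve."""
    problems = []
    for match in CITATION.finditer(report_text):
        rel_path = match["path"]
        line_no = int(match["line"])
        target = repo_root / rel_path
        if not target.is_file():
            problems.append((rel_path, line_no, "file not found"))
            continue
        # Count lines without loading the whole file into memory.
        with target.open(encoding="utf-8", errors="replace") as fh:
            line_count = sum(1 for _ in fh)
        if line_no < 1 or line_no > line_count:
            problems.append((rel_path, line_no, f"file has only {line_count} lines"))
    return problems
```

Running it over each audit-report-<model>.md against the same checkout gives a quick list of citations that need a human look first.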
The headline result
There is no single winner.
Five reports, five roles. If you only pay for the most expensive tier thinking it "does everything better," you miss findings.
| Model | Grade | Best at | Weak at |
|---|---|---|---|
| Opus 4.8 | A− | Threat modeling (TOCTOU, agent shell defaults) | CI lockfile, default load(), README gaps |
| Fable 5 | A− | Strategy, milestones, quick wins, eng debt | Agent-specific threats, SSRF adoption map |
| Sonnet 5 | B+ | SSRF infra vs adoption, repo hygiene | Lockfile CI, README, SECURITY.md |
| Sonnet 4.6 | B+ | Ops: lockfile CI, load() default, docs | Newer SSRF adoption analysis |
| Haiku 4.5 | A* | Fast LOC map, callback cycles | *Inflated grade; wrong CI lockfile claim |
*Haiku's A looked confident on paper. Cross-checking against Sonnet 4.6 exposed a factual error on lockfile validation in CI.
Opus and Fable tied on grade — but not on role. Opus sees design-level threats. Fable turns findings into a shippable backlog (M0–M3, effort/risk, explicit non-goals).
Who saw what (selected)
| Finding | Opus 4.8 | Fable 5 | Sonnet 5 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|---|---|
| TOCTOU / DNS rebinding | ✓ | — | — | — | — |
| SSRF transport ~2 call sites | — | — | ✓ | — | — |
| Default load() unsafe | — | ✓ | — | ✓ | — |
| Plan M0–M3 + non-goals | — | ✓ | — | — | — |
| Lockfile CI commented | — | — | — | ✓ | ✗ wrong |
Fable did not surface several issues other models caught (TOCTOU, shell host defaults, SSRF gaps in graph_mermaid.py, commented lockfile CI). That gap is the point: Fable is not a replacement for a multi-model pipeline.
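The coverage gap is easy to make explicit. The sketch below transcribes the five rows of the table into a mapping from finding to the models that surfaced it correctly, then lists what each model found alone and what a given model missed. The dictionary layout and function names are illustrative assumptions, and only the table rows are encoded, so findings mentioned in the text alone (shell host defaults, graph_mermaid.py) are not included.

```python
# Transcribed from the "Who saw what" table: finding -> models that surfaced it.
SEEN_BY = {
    "TOCTOU / DNS rebinding": {"Opus 4.8"},
    "SSRF transport ~2 call sites": {"Sonnet 5"},
    "Default load() unsafe": {"Fable 5", "Sonnet 4.6"},
    "Plan M0–M3 + non-goals": {"Fable 5"},
    # Haiku 4.5 also reported on this, but incorrectly, so it is not counted.
    "Lockfile CI commented": {"Sonnet 4.6"},
}


def exclusive_findings(seen_by: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each model to the findings that only that model surfaced."""
    result: dict[str, list[str]] = {}
    for finding, models in seen_by.items():
        if len(models) == 1:
            (only_model,) = models
            result.setdefault(only_model, []).append(finding)
    return result


def missed_by(model: str, seen_by: dict[str, set[str]]) -> list[str]:
    """List the findings a given model did not surface."""
    return [finding for finding, models in seen_by.items() if model not in models]


if __name__ == "__main__":
    for model, findings in exclusive_findings(SEEN_BY).items():
        print(f"{model} alone found: {', '.join(findings)}")
    print("Fable 5 missed:", ", ".join(missed_by("Fable 5", SEEN_BY)))
```

Four of the five rows have exactly one model behind them, which is the practical argument for running more than one tier.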
The pipeline we'd actually use
Haiku → fast map & architecture hotspots
Sonnet 5 → primary audit + security adoption gaps
Sonnet 4.6 → CI, docs, onboarding landmines
Opus → threat review for agent-facing surfaces
Fable → merge into one prioritized backlog
Human → verify _lint.yml, load.py, README in your checkout
Model choice is a workflow decision, not a vanity tier pick.
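As a rough sketch, the same pipeline can be written down as ordered data so that each stage receives the notes produced before it. Everything here is hypothetical scaffolding: `Stage`, `run_pipeline`, and the `run_stage` callback are illustrative names, not CTRL NODE or Anthropic APIs, and the callback is where an actual agent call (or a human checklist) would go.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    model: str  # who does the work (a model tier or a human)
    focus: str  # what this stage is responsible for


# Stage order and wording taken from the pipeline listed above.
PIPELINE = [
    Stage("Haiku", "fast map & architecture hotspots"),
    Stage("Sonnet 5", "primary audit + security adoption gaps"),
    Stage("Sonnet 4.6", "CI, docs, onboarding landmines"),
    Stage("Opus", "threat review for agent-facing surfaces"),
    Stage("Fable", "merge into one prioritized backlog"),
    Stage("Human", "verify _lint.yml, load.py, README in your checkout"),
]

# Hypothetical callback type: takes a stage plus all earlier (model, output) notes.
StageRunner = Callable[[Stage, Sequence[tuple[str, str]]], str]


def run_pipeline(stages: Sequence[Stage], run_stage: StageRunner) -> list[tuple[str, str]]:
    """Run stages in order, handing each one the notes from all earlier stages."""
    notes: list[tuple[str, str]] = []
    for stage in stages:
        output = run_stage(stage, tuple(notes))
        notes.append((stage.model, output))
    return notes


if __name__ == "__main__":
    # Stub runner that only records how much prior context each stage received.
    def stub(stage: Stage, prior: Sequence[tuple[str, str]]) -> str:
        return f"{stage.focus} (given {len(prior)} earlier notes)"

    for model, output in run_pipeline(PIPELINE, stub):
        print(f"{model}: {output}")
```

Keeping the order in data rather than in someone's head makes it simple to swap a tier or drop a stage and compare the resulting backlog.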
Takeaways for builders
- A high grade ≠ a better report. Two models at A−; Haiku at A, with a factual miss.
- Security has layers: Opus (TOCTOU), Sonnet 5 (SSRF adoption), Sonnet 4.6 + Fable (unsafe default load()).
- Sonnet evolved: 5 and 4.6 complement each other; neither replaces the other.
- WORK DIRECTORY mode matters: an output-only sandbox wouldn't have produced citations across CI, core, and partner packages.
Full write-up + all artifacts
This post is the teaser. Everything else lives on our site, including:
- Step-by-step CTRL NODE setup (project, agents, tasks)
- What Fable returned in detail (executive summary, exclusive findings)
- Full comparison report + 14-slide deck
- All five audit reports (Markdown + interactive HTML dashboards)
- The complete audit prompt to rerun on your own repo
If you try this on your stack — or disagree with a grade — we'd love to hear what surprised you.