CTRLNODE.AI

Originally published at ctrlnode.ai

We Gave Five Claude Models the Same Repo Audit. Fable Didn't Win — and That's the Point.

When Anthropic shipped Claude Fable, the obvious question was: does the new tier beat everything else on hard engineering work?

We didn't want a benchmark score or a vibe check. We wanted a principal-engineer audit of a real production monorepo — with file:line evidence, severity labels, and an execution plan.

So we ran a controlled experiment:

  • One prompt (4 phases: repo map → audit → strategy → task plan)
  • One target: the LangChain Python monorepo
  • Five Claude models: Opus 4.8, Fable 5, Sonnet 5, Sonnet 4.6, Haiku 4.5
  • Same setup for every run — project, work directory, WORK DIRECTORY task mode

We ran it through CTRL NODE (Bridge on a real machine, one agent per model tier). Not five browser tabs — a workflow you'd actually use on a team.


What we asked every model to deliver

You are a world-class, principal-engineer-level software engineer and technical audit expert.
Perform an in-depth analysis of this code repository, provide an honest audit report,
and offer a prioritized, actionable improvement plan.

Follow four phases in order: Discovery → Audit → Strategy → Task Plan.
All judgments must cite real file paths and line numbers. Do not guess.

Each run had to produce:

  • audit-report-<model>.md — full Markdown report
  • audit-report-<model>.html — interactive dashboard (Overview, Map, Audit, Strategy, Tasks)
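The "cite real file paths and line numbers" rule is something you can spot-check mechanically before reading a report closely. Below is a minimal sketch, assuming citations appear as `relative/path.ext:LINE` inside the Markdown report; that citation shape, the regex, and the function name `check_citations` are our assumptions for illustration, not part of the experiment's tooling.

```python
import re
from pathlib import Path

# Assumed citation shape: a relative path with an extension, a colon, a line number,
# e.g. "libs/core/example.py:42". Real reports may format citations differently.
CITATION_RE = re.compile(r"(?P<path>[\w./-]+\.\w+):(?P<line>\d+)")


def check_citations(report_text: str, repo_root: Path) -> list[tuple[str, int, str]]:
    """Return (path, line, problem) for every citation that does not resolve."""
    problems = []
    for match in CITATION_RE.finditer(report_text):
        rel_path, line_no = match["path"], int(match["line"])
        target = repo_root / rel_path
        if not target.is_file():
            problems.append((rel_path, line_no, "file not found"))
            continue
        # Count lines without loading the whole file into memory.
        with target.open(encoding="utf-8", errors="replace") as fh:
            line_count = sum(1 for _ in fh)
        if not 1 <= line_no <= line_count:
            problems.append(
                (rel_path, line_no, f"line out of range (file has {line_count} lines)")
            )
    return problems
```

A check like this only proves that a cited location exists; whether the finding at that location is correct still needs a human read.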

The headline result

There is no single winner.

Five reports, five roles. If you pay only for the most expensive tier on the assumption that it "does everything better," you miss findings that other tiers caught.

| Model | Grade | Best at | Weak at |
| --- | --- | --- | --- |
| Opus 4.8 | A− | Threat modeling (TOCTOU, agent shell defaults) | CI lockfile, default load(), README gaps |
| Fable 5 | A− | Strategy, milestones, quick wins, eng debt | Agent-specific threats, SSRF adoption map |
| Sonnet 5 | B+ | SSRF infra vs adoption, repo hygiene | Lockfile CI, README, SECURITY.md |
| Sonnet 4.6 | B+ | Ops: lockfile CI, load() default, docs | Newer SSRF adoption analysis |
| Haiku 4.5 | A* | Fast LOC map, callback cycles | *Inflated grade; wrong CI lockfile claim |

*Haiku's A looked confident on paper. Cross-checking against Sonnet 4.6 exposed a factual error on lockfile validation in CI.
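A disagreement like this one is cheap to settle against the checkout itself. Here is a rough sketch that lists commented-out lines in a workflow file mentioning a keyword. The sample YAML is hypothetical and not copied from the LangChain repo, and the function name is ours.

```python
def commented_lines_mentioning(workflow_text: str, keyword: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for YAML comment lines that mention `keyword`."""
    hits = []
    for number, raw in enumerate(workflow_text.splitlines(), start=1):
        stripped = raw.strip()
        if stripped.startswith("#") and keyword.lower() in stripped.lower():
            hits.append((number, stripped))
    return hits


# Hypothetical workflow fragment for illustration only.
SAMPLE = """\
jobs:
  lint:
    steps:
      - uses: actions/checkout@v4
      # - name: Check lockfile
      #   run: uv lock --check
      - name: Lint
        run: make lint
"""

for line_number, text in commented_lines_mentioning(SAMPLE, "lock"):
    print(line_number, text)
```

Run against the real workflow file in your own checkout, a non-empty result is a prompt to open the file and look, not a verdict by itself.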

Opus and Fable tied on grade — but not on role. Opus sees design-level threats. Fable turns findings into a shippable backlog (M0–M3, effort/risk, explicit non-goals).


Who saw what (selected)

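| Finding | Op | Fb | S5 | S4.6 | Hk |
| --- | --- | --- | --- | --- | --- |
| TOCTOU / DNS rebinding | ✓ | | | | |
| SSRF transport ~2 call sites | | | ✓ | | |
| Default load() unsafe | | ✓ | | ✓ | |
| Plan M0–M3 + non-goals | | ✓ | | | |
| Lockfile CI commented | | | | ✓ | ✗ wrong |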

Fable did not surface several issues other models caught (TOCTOU, shell host defaults, SSRF gaps in graph_mermaid.py, commented lockfile CI). That gap is the point: Fable is not a replacement for a multi-model pipeline.
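To make the coverage argument concrete, here is a small sketch that asks which findings a chosen set of models would leave uncovered. The mapping includes only the attributions this post states outright, so a model's absence from a set means the post does not credit it, not that it certainly missed the issue. The function name `missed_by` is ours.

```python
# Findings mapped to the models this post explicitly credits with them.
# Partial by design: the full matrix lives in the complete write-up.
CREDITED = {
    "TOCTOU / DNS rebinding": {"Opus 4.8"},
    "SSRF transport ~2 call sites": {"Sonnet 5"},
    "Default load() unsafe": {"Sonnet 4.6", "Fable 5"},
    "Plan M0–M3 + non-goals": {"Fable 5"},
    "Lockfile CI commented": {"Sonnet 4.6"},
}


def missed_by(models: set[str], credited: dict[str, set[str]]) -> list[str]:
    """Findings that none of the given models is credited with."""
    return sorted(
        finding for finding, finders in credited.items() if not finders & models
    )


# Which credited findings would a Fable-only run leave on the table?
print(missed_by({"Fable 5"}, CREDITED))
```

Passing a larger set, for example all five tiers, shows how quickly the uncovered list shrinks as tiers are combined.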


The pipeline we'd actually use

Haiku        → fast map & architecture hotspots
Sonnet 5     → primary audit + security adoption gaps
Sonnet 4.6   → CI, docs, onboarding landmines
Opus         → threat review for agent-facing surfaces
Fable        → merge into one prioritized backlog
Human        → verify _lint.yml, load.py, README in your checkout

Model choice is a workflow decision, not a vanity tier pick.
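One way to keep that workflow decision explicit is to write the stages down as data. The sketch below assumes each stage receives the outputs of all earlier stages, which is how we read the final "merge" step. `run_agent` is a hypothetical callable standing in for however you dispatch a task to a model; it is not a CTRL NODE API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    model: str
    role: str


# Stage order and roles taken from the pipeline listed above.
PIPELINE = [
    Stage("Haiku", "fast map & architecture hotspots"),
    Stage("Sonnet 5", "primary audit + security adoption gaps"),
    Stage("Sonnet 4.6", "CI, docs, onboarding landmines"),
    Stage("Opus", "threat review for agent-facing surfaces"),
    Stage("Fable", "merge into one prioritized backlog"),
]


def run_pipeline(
    stages: list[Stage],
    run_agent: Callable[[Stage, list[str]], str],
) -> list[str]:
    """Run stages in order, handing each one a copy of all earlier outputs."""
    outputs: list[str] = []
    for stage in stages:
        outputs.append(run_agent(stage, list(outputs)))
    return outputs


def fake_agent(stage: Stage, prior: list[str]) -> str:
    # Hypothetical stand-in: a real implementation would call the model here.
    return f"{stage.model} ({stage.role}) saw {len(prior)} earlier report(s)"


for line in run_pipeline(PIPELINE, fake_agent):
    print(line)
```

The human verification step is deliberately left out of the list: it happens after the pipeline returns, against your own checkout.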


Takeaways for builders

  1. A high grade ≠ a better report. Two models at A−; Haiku at A — with a factual miss.
  2. Security has layers — Opus (TOCTOU), Sonnet 5 (SSRF adoption), Sonnet 4.6 + Fable (unsafe default load()).
  3. Sonnet evolved — 5 and 4.6 complement each other; neither replaces the other.
  4. WORK DIRECTORY mode matters — an output-only sandbox wouldn't have produced citations across CI, core, and partner packages.

Full write-up + all artifacts

This post is the teaser. Everything else lives on our site:

👉 Read the full experiment

Including:

  • Step-by-step CTRL NODE setup (project, agents, tasks)
  • What Fable returned in detail (executive summary, exclusive findings)
  • Full comparison report + 14-slide deck
  • All five audit reports (Markdown + interactive HTML dashboards)
  • The complete audit prompt to rerun on your own repo

If you try this on your stack — or disagree with a grade — we'd love to hear what surprised you.

