DEV Community: Zohar Babin

AI assistants lie about citations. Here's how to catch them.

Zohar Babin — Sat, 13 Jun 2026 04:26:52 +0000

In 2023, a New York lawyer submitted a brief citing six cases that didn't exist. ChatGPT had hallucinated them — complete with plausible docket numbers, judges, and holdings. The lawyer was fined $5,000.

In 2024, a Nature study found that roughly 1 in 6 AI-generated citations in scientific writing referred to nonexistent or misrepresented papers.

In June 2026, the Ninth Circuit sanctioned an attorney for citing fabricated AI-generated precedents, calling it "an affront to the judicial system."

These aren't edge cases. They're the normal behavior of a language model doing its job — generating plausible text — applied to a domain where plausibility isn't enough.

The core problem

Language models are trained to produce fluent, contextually appropriate text. A citation is just a specific format: author names, title, journal, year, DOI. The model has seen millions of them. It can generate one that looks exactly right.

The model doesn't know whether the paper exists. It has no mechanism to check. It's pattern-matching, not remembering.

When you ask an AI assistant to "find me papers on X" or "what does the literature say about Y," you're asking a pattern-matcher to retrieve facts. It will produce citations. Some will be real. Some will be fabricated. From the output alone, you can't tell which.

What verification actually requires

To know a citation is real, you need to check it against an external, authoritative source:

DOI resolution: Does this DOI exist? Does it resolve to the paper being cited?
Retraction status: Has this paper been retracted, corrected, or flagged?
Content match: Does the paper actually say what the citation claims it says?
Link liveness: Is the URL still live, or has it 404'd?

None of this can happen inside the model. It has to happen at runtime, against live external systems.

This is the architecture that makes verification possible: an AI assistant calling external tools at inference time — not relying on training data.

How web-researcher-mcp catches citation hallucinations

web-researcher-mcp is an open-source MCP server (MIT, single Go binary) that gives AI assistants like Claude and Cursor a way to verify citations at the tool-call layer.

Here's what happens when an AI assistant calls verify_citation:

{
  "tool": "verify_citation",
  "input": {
    "doi": "10.1038/s41586-021-03819-2",
    "claimed_title": "Highly accurate protein structure prediction with AlphaFold",
    "claimed_authors": ["Jumper", "Evans"],
    "claimed_year": 2021
  }
}

The tool:

Resolves the DOI against Crossref — the authoritative DOI registry
Checks if the resolved record matches the claimed title, authors, and year
Queries Crossref Retraction Watch for retraction/correction status
Returns a structured result: verified, title_mismatch, not_found, retracted, or unchecked

A not_found means the DOI doesn't exist in Crossref — strong evidence of fabrication. A title_mismatch means the DOI resolves to a real paper, but not the one being cited — the model hallucinated the DOI for a real title, or swapped metadata.
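To make the status logic concrete, here is a minimal Python sketch of the same idea. It is not the tool's code (the tool is a Go binary): it assumes the public Crossref REST endpoint `https://api.crossref.org/works/{doi}`, compares titles only (author and year checks are left out for brevity), and takes the retraction flag as an input rather than looking it up. The function names and the order of the checks are my own assumptions.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request


def normalize(title):
    """Lowercase and collapse punctuation so titles compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def fetch_record(doi):
    """Return the Crossref record for a DOI, or None if Crossref has no such DOI."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["message"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # DOI is absent from Crossref
        raise


def classify(record, claimed_title, retracted=False):
    """Map a resolved record (or None) to one of the statuses described above."""
    if record is None:
        return "not_found"
    if retracted:  # assumed to come from a separate retraction lookup
        return "retracted"
    titles = record.get("title") or [""]  # Crossref returns titles as a list
    if normalize(titles[0]) != normalize(claimed_title):
        return "title_mismatch"
    return "verified"
```

A caller would run `classify(fetch_record(doi), claimed_title)`. Keeping the network call separate from the comparison makes the comparison easy to test offline.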

Auditing a full bibliography

Individual citation checking is useful. Full bibliography auditing is where it gets powerful.

audit_bibliography takes a complete reference list (BibTeX, RIS, CSL-JSON, or a plain list) and runs every entry through the verification pipeline:

Results for 12 references:
  ✓ verified        8  (DOI resolves, metadata matches, not retracted)
  ✗ not_found       2  (DOI absent from Crossref — possible fabrication)
  ⚠ title_mismatch  1  (DOI resolves to different paper)
  ~ unchecked       1  (book chapter, no DOI — can't verify)

The two not_found entries are the likely hallucinations. The title_mismatch is a DOI that got swapped: the paper exists, but it's not the one being cited.
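The audit loop itself is simple to picture. The sketch below is an illustration in Python, not the tool's implementation: `audit` and `needs_review` are hypothetical names, and `verify` stands in for whatever per-citation check you plug in.

```python
STATUSES = ("verified", "not_found", "title_mismatch", "retracted", "unchecked")


def audit(entries, verify):
    """Run verify(entry) on every reference and group the entries by status."""
    by_status = {status: [] for status in STATUSES}
    for entry in entries:
        by_status[verify(entry)].append(entry)
    return by_status


def needs_review(by_status):
    """Everything that did not come back as verified, worst signals first."""
    return [entry
            for status in STATUSES if status != "verified"
            for entry in by_status[status]]
```

The useful output is the `needs_review` list: those are the references a human should open before the bibliography ships.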

This is what distinguishes citation verification from citation search. Search finds you papers. Verification tells you whether the papers the AI found actually exist and say what they're claimed to say.

Dead links and retraction detection

Two more signals the tool surfaces:

Retraction status via Crossref Retraction Watch. When you call scrape_page on a PDF or academic URL, the tool automatically checks the detected DOI against the retraction database. A retracted paper that the AI cited as evidence is worse than a fabricated one — it's a real paper that the scientific community has repudiated.

Dead link archiving via Wayback Machine. When a cited URL 404s, archive_source can retrieve the archived version and return the Wayback URL. This is especially common in legal and policy citations where government pages move or are taken down.
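For a sense of what a dead-link lookup involves, here is a small Python sketch against the Internet Archive's public availability endpoint. I'm not claiming this is how archive_source works internally; the function names are hypothetical and only the endpoint and its response shape come from the Wayback Machine API.

```python
import json
import urllib.parse
import urllib.request


def availability_url(page_url):
    """Build the Wayback availability query for a (possibly dead) URL."""
    query = urllib.parse.urlencode({"url": page_url})
    return "https://archive.org/wayback/available?" + query


def closest_snapshot(payload):
    """Pull the archived URL out of an availability response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archive(page_url):
    """Network call: ask the Wayback Machine for the closest snapshot."""
    with urllib.request.urlopen(availability_url(page_url), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

An empty `archived_snapshots` object means no capture exists, in which case the citation stays dead and should be flagged rather than silently dropped.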

The MCP architecture that makes this possible

MCP (Model Context Protocol) is an Anthropic-backed open standard that lets AI clients call external tools at inference time. It's roughly: the model decides to call a tool, the MCP server executes it, the result comes back as context.

This architecture is exactly what citation verification requires. The model can't verify a citation from memory — but it can call verify_citation, get a structured result, and incorporate that into its response.
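On the wire, that round trip is a JSON-RPC 2.0 exchange. The Python sketch below builds a `tools/call` request and reads the text content of a reply, following the message shapes in the MCP specification. The helper names are mine, and what verify_citation actually puts in the text content is an assumption.

```python
import json


def tool_call(request_id, name, arguments):
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def read_result(raw_message):
    """Return (is_error, text) from a tools/call response."""
    message = json.loads(raw_message)
    if "error" in message:  # protocol-level failure, e.g. unknown tool
        raise RuntimeError(message["error"].get("message", "tool call failed"))
    result = message["result"]
    texts = [part["text"] for part in result.get("content", [])
             if part.get("type") == "text"]
    return result.get("isError", False), "\n".join(texts)
```

The client sends the request over the server's transport (stdio for a local binary), and whatever comes back in `content` is what the model sees as context.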

Any MCP-compatible client works: Claude Desktop, Cursor, Windsurf, VS Code with Continue, and others. Install is one command:

# macOS/Linux
curl -fsSL https://raw.githubusercontent.com/zoharbabin/web-researcher-mcp/main/install.sh | bash

# Python users
uvx web-researcher-mcp

Zero config — DuckDuckGo works out of the box. Add API keys for Google, Brave, or other providers to extend coverage.

When this matters most

Research and academia: Before citing a paper, verify_citation checks that it exists, isn't retracted, and matches the claimed metadata. audit_bibliography sweeps a full reference list before submission.

Legal work: AI-assisted legal research is now common. Citation hallucination in legal briefs has led to sanctions in multiple jurisdictions. Verify every case cite before it goes in a brief.

Journalism and fact-checking: Dead links, retracted studies, and papers that don't say what they're claimed to say. The tool catches all three.

Any workflow where AI summarizes research: The model will find real papers and fabricated ones with equal confidence. The only way to tell them apart is to check.

What this doesn't do

The tool verifies whether citations are real and whether they match claimed metadata. It doesn't:

Read and summarize the full paper for you (though scrape_page can fetch and extract content from open-access PDFs)
Judge whether a paper's methodology is sound
Replace domain expertise

The goal is narrow: catch the most common and most dangerous failure mode — a citation that doesn't correspond to a real, non-retracted paper saying what it's claimed to say.

Try it

The project is open source on GitHub, MIT licensed, and launching today on Product Hunt.

35+ tools covering web search (Google/Brave/DuckDuckGo/Tavily/Exa), academic search (PubMed, academic indexes), patent search, citation verification, retraction detection, dead-link archiving, bibliography audit, and more.

If you work with AI-generated research, the citation layer is the one place where "good enough" isn't good enough. A hallucinated citation in a legal brief, a grant proposal, or a published paper can do real damage. The tool exists to close that gap.

Previously: From Node.js to Go: Rebuilding an MCP Server for Production — the engineering story behind the rewrite.

Claude Fable 5: When to Reach for the Frontier — and When Not To

Zohar Babin — Wed, 10 Jun 2026 17:46:03 +0000

Lessons and best practices for routing work between Fable 5 and the cheaper Claude tiers

Anthropic released Claude Fable 5 on June 9, 2026 — the first "Mythos-class" model made generally available, a new capability tier above Opus. It is state-of-the-art on nearly every tested benchmark, and Anthropic's framing is unusually specific: "the longer and more complex the task, the larger Fable 5's lead over our other models."

That sentence is the whole routing strategy.

We audited a working portfolio of ~20 active projects — agentic platforms, SaaS products, SDKs, data pipelines, documentation suites, deployment runbooks, and high-stakes financial analysis — against what Fable 5 actually changes, and a clear pattern emerged: most day-to-day work does not benefit from Fable 5, and the work that does benefits enormously.

If you only take one thing away, take this: Sonnet 4.6 by default. Fable 5 for the long, the hard, and the expensive-to-get-wrong. Haiku 4.5 for the fan-out. Opus 4.8 when a safety classifier would ruin your day. The rest of this article is the reasoning, the sharp edges, and the cost math.

What Fable 5 actually is (and isn't)

Three facts shape every decision below.

1. It's a tier, not a feature set. Fable 5's API surface matches Opus 4.8's almost exactly: same 1M context, same 128K output, same adaptive-thinking-only design, with one new sharp edge covered below. There is no new capability checkbox. What you're buying is how far the model can go unattended. Stripe reported completing, in a day, a codebase-wide migration across 50 million lines of Ruby that would have taken a team over two months. Early testers quoted in Anthropic's announcement describe long-horizon problems "that were out of reach for earlier models."

2. The sticker says 2× Opus. Your bill will say more. $10 per million input tokens, $50 per million output — exactly double Opus 4.8's $5/$25, with no long-context surcharge. But the rate card understates the real multiplier, because a reasoning-heavy model generates more tokens per task: it thinks longer, plans more, and verifies more, and you pay for every one of those thinking tokens even though you never see them by default (more on that below). Simon Willison, a veteran developer (co-creator of Django) whose independent AI reviews are widely read in the field, consumed $110 of tokens in a single ~5.5-hour day of hands-on testing, and one Max-plan subscriber's launch-week Reddit post described a usage meter ticking up "2% a minute." The model is also slower per response. None of this is a scandal — but all of it matters for routing. (Subscription users, note the calendar: Fable 5 is included free on Pro/Max/Team plans only through June 22, 2026; after that it draws prepaid usage credits at API rates.)

3. It ships with safety classifiers — and what happens when one trips depends on where you run it. Fable 5 is the same underlying model as Claude Mythos 5 (restricted to vetted cyber-defense partners and, soon, select biology researchers), made publicly releasable by classifiers covering cybersecurity, biology/chemistry, and distillation attempts. On Anthropic's own surfaces, a tripped classifier falls back to Opus 4.8 automatically and tells you. On the raw API, the request returns stop_reason: "refusal" and your harness decides what happens next: server-side fallback is an opt-in beta, and isn't available on Bedrock, Vertex, or Foundry at all. Anthropic says trips happen in under 5% of sessions, but the classifiers are deliberately tuned conservative, so benign security and biology work can trip them. Plan for this (see "Security research" and "Handle the refusal" below).

When Fable 5 is the right tool

Across a varied real-world portfolio, four workload shapes stood out as clear Fable territory.

1. Long-horizon autonomous engineering

Multi-hour to multi-day agentic sessions: large migrations, cross-cutting refactors, greenfield builds from a finished spec, overnight runs expected to complete without mid-course human correction. This is the headline capability. If your sessions routinely run 4–8 hours with a real CI gauntlet at the end, the model that finishes correctly the first time is cheaper than the model that needs two retries — even at 2× the token price. And higher effort up front often reduces total cost on this class of work, because turn count drops (more on this in the cost section).

A special case worth calling out: greenfield implementation from a completed blueprint. If you've invested in a detailed design doc, handing the entire spec to Fable 5 in a single opening prompt and letting it run at high effort is the single best-fit usage pattern the model has. And note that this category isn't just for "agent teams" — if you have a finished spec and an ambitious build, you're in it whether you run a fleet of CI-integrated agents or one Claude Code session overnight.

2. High-stakes analysis where error is expensive

Multi-document synthesis with real money or legal exposure attached: M&A due diligence, multi-jurisdictional tax positions, regulatory filings, senior-level financial reasoning (Fable 5 currently tops Hebbia's Finance Benchmark, a test of senior-analyst-level reasoning built by the AI document-analysis firm used by major banks and funds). When a subtle reasoning error costs five or six figures, token price is noise. Fable 5 is also a notably stronger thought partner — more willing to push back, kill its own incorrect beliefs, and reason from first principles. That's exactly the temperament you want reviewing a position you're already invested in.

3. Judge and synthesis roles in multi-agent systems

You rarely need the frontier model everywhere in an agent pipeline. The pattern that works: cheap models fan out, the expensive model decides. Domain-specialist agents, extractors, and verifiers run on Sonnet or Haiku; the judge, executive-synthesis, and cross-domain reasoning agents run on the top tier. Quality of the final output lives disproportionately in the synthesis step.

4. Vision-heavy document and design work

Fable 5 is the new state of the art on vision: extracting precise numbers from dense scientific figures, interpreting charts and tables nested in PDFs, rebuilding application source from screenshots, and — notably for coders — visually critiquing its own output against design goals. Document-heavy finance, legal, and analytics pipelines that previously needed heavy scaffolding need less of it now.

When a lower model serves you better

This list matters more than the previous one, because it covers most of what most teams do all day.

Runbook execution. If the intelligence lives in the runbook — deploy scripts, version-bump checklists, bundle-upload-verify sequences — the model is just following documented steps. That's Sonnet work, and the verification steps (curl the endpoint, grep for the version string) are Haiku work. Paying $50 per million output tokens to execute a procedure your docs already encode was the most common waste pattern in our portfolio.

Scoped maintenance on documented codebases. Bug fixes, small features, and quirk workarounds in mature projects with good CLAUDE.md files and accumulated memory. One real data point: a three-day plugin-maintenance sprint on an Opus-class model cost roughly $700 in compute. At 2× pricing, the same token volume on Fable would have run ~$1,400 — likely somewhat less in practice, given Fable's lower cache threshold and tendency to finish in fewer turns — with no meaningful quality difference, because the constraints were already written down. Sonnet 4.6 handles this class well.

Templated and pattern-following work. Documentation guides that follow an established template with dozens of worked examples, configuration changes, landing pages built from a skill, i18n updates. The pattern is the intelligence.

Anything latency-sensitive or high-volume. Fable 5 is slow. Classification, extraction, chat-style products, interactive coding sessions, and high-throughput pipelines belong on Sonnet or Haiku. This hasn't changed.

Security research and recon-shaped work — for a different reason. Legitimate penetration-test reporting, OSINT footprinting, security audits, and vulnerability triage pattern-match the offensive-cyber classifier. The work is benign; the classifier is conservative by design — launch-week reports of false positives ranged from code reviews to a question about Hermitian matrices. On a managed surface you still get an answer (from Opus 4.8), but if your harness pins Fable 5 on the raw API with no refusal handling, a tripped classifier stops the run. For security-flavored sessions, either run Opus 4.8 directly or make sure your integration handles the refusal gracefully. The same applies to life-sciences work until Anthropic's trusted-access program for biology opens up. (If "but isn't security exactly what Mythos 5 is for?" just crossed your mind — yes, and the next section is for you.)

And if your team's workload is mostly in this section — maintenance, support, content operations, runbook automation — the takeaway is simple: you can skip the frontier tier entirely for now. Route to Sonnet 4.6 by default and you'll capture nearly all the value at a fraction of the cost.

A useful rule of thumb: ask "would a competent person with my docs in front of them find this task hard, or just laborious?" Laborious goes to Sonnet. Hard goes to Fable — and so does laborious-at-a-scale where staying coherent unattended for hours is the hard part, which is why a 50-million-line migration is Fable territory even though no single edit in it is difficult.

Mythos 5 vs. Fable 5: same model, different gates

The naming invites confusion, so let's clear it up: Mythos 5 and Fable 5 are the same underlying model. Anthropic says so explicitly: "the safeguards are what distinguish the two models... and are why we've given them different names." Mythos 5 has safeguards lifted for its vetted audience — cyber today, bio/chem for the upcoming biology cohort — while Fable 5 keeps all the gates and routes gated queries to Opus 4.8. Same weights, same $10/$50 pricing, same 30-day retention. By Anthropic's own data, more than 95% of Fable sessions involve no fallback at all — and in those sessions, "Fable 5's performance is effectively the same as that of Mythos 5." You are not missing out on a smarter model; you're missing out on the gated domains, and even Mythos partners only get the gates their program unlocks.

So who actually gets the ungated version? Today, almost nobody. Mythos 5 is restricted to Project Glasswing partners — a US-government-coordinated cyber-defense program whose roster reads like an infrastructure who's-who (AWS, Apple, Cisco, CrowdStrike, Google, JPMorganChase, Microsoft, NVIDIA, Palo Alto Networks, plus ~150 more critical-infrastructure organizations). Mythos 5 supersedes Claude Mythos Preview, the April 2026 gated research release that found over ten thousand high- or critical-severity vulnerabilities — at $25/$125 per million tokens, more than double what Mythos 5 costs now. A trusted-access program for biology is opening "in the coming weeks" — note it grants Fable 5 with the bio/chem gates removed (cyber gates intact), not Mythos itself — and a broader systematic application path for cybersecurity organizations is promised but not dated.

This is why our routing table sends security work to Opus 4.8, not Mythos 5. For a security team without Glasswing access — which is to say, nearly every security team — the realistic ladder looks like this:

1. Opus 4.8 is the working default. It's the model Fable's classifier hands your cyber queries to anyway, so pinning it directly just skips the detour and the split-billing.
2. Opus 4.8 + the Cyber Verification Program is the self-serve upgrade. The CVP is a free, application-based program (decisions target two business days) that lifts the dual-use cyber blocks (vulnerability exploitation, offensive tooling for defensive purposes) on Opus models for verified organizations. Prohibited-use blocks (ransomware, mass exfiltration) stay regardless; it isn't offered on Bedrock or Vertex; zero-data-retention orgs aren't currently eligible; and note carefully: it's an Opus program. It does not raise Fable 5's thresholds and it does not grant Mythos access.
3. Fable 5 for everything outside the gated domains: a security company's product engineering, data pipelines, and long-horizon builds are still Fable territory like anyone else's.
4. Mythos 5 only if you're a Glasswing partner, or eventually through the broader cyber application path Anthropic has promised but not dated. If your organization defends critical infrastructure at scale, that's worth watching for; if not, items 1–3 are the menu.

The summary for the impatient: Mythos 5 isn't the model you're missing — it's the permission slip you're missing, and for 95%+ of work the permission slip changes nothing.

How to use Fable 5 effectively

API configuration: know the sharp edges

Fable 5 inherits Opus 4.7/4.8's request surface, but several details differ in ways that matter:

Adaptive thinking only. thinking: {type: "adaptive"}. Fixed budget_tokens returns a 400, as do temperature, top_p, and top_k.
No explicit disabled. Unique to Fable 5: an explicit thinking: {type: "disabled"} returns a 400 (it's accepted on Opus 4.8). Omit the thinking parameter instead, or run adaptive — either way, the model thinks, and you pay for it (see "The bill" below).
Don't ask it to transcribe its reasoning. Prompting Fable 5 to echo its internal chain of thought as response text can trip the reasoning_extraction classifier and refuse the turn. If you need reasoning visibility, set thinking: {type: "adaptive", display: "summarized"} and read the structured thinking blocks instead.
Set generous max_tokens and stream. At xhigh/max effort, give the model ≥64K output room; stream anything large.
Handle the refusal. Check stop_reason: "refusal" and the structured stop_details (category: cyber, bio, reasoning_extraction, or null — see Anthropic's safeguards guide). On Anthropic's own surfaces the fallback to Opus 4.8 is automatic; on the API, server-side fallback is an opt-in beta (and unavailable on Bedrock, Vertex, and Foundry), so decide deliberately what your harness does when a refusal fires. The billing side of this is covered below.
Data retention: Mythos-class traffic carries a mandatory 30-day retention policy (safety-only, not used for training, with logged human access). On some cloud platforms you must explicitly opt in before the model is invocable. Factor this into compliance review before adopting, not after.
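Pulled together, the sharp edges above fit in two small functions. This is a sketch on plain dictionaries rather than any SDK: the model ids are placeholders I made up, and the mapping from refusal category to next action is one possible harness policy, not something Anthropic prescribes.

```python
FABLE = "claude-fable-5"   # placeholder id, check your platform's model list
OPUS = "claude-opus-4-8"   # placeholder id

SAMPLING_KEYS = ("temperature", "top_p", "top_k")
EFFORTS = ("low", "medium", "high", "xhigh", "max")


def build_request(messages, effort="high", max_tokens=64_000, **extra):
    """Assemble a request body that avoids the parameters that return a 400."""
    rejected = sorted(key for key in extra if key in SAMPLING_KEYS)
    if rejected:
        raise ValueError(f"not accepted on Fable 5: {rejected}")
    if effort not in EFFORTS:
        raise ValueError(f"unknown effort level: {effort}")
    body = {
        "model": FABLE,
        "max_tokens": max_tokens,
        "messages": messages,
        # adaptive is the only accepted mode; summarized display exposes reasoning
        "thinking": {"type": "adaptive", "display": "summarized"},
        "output_config": {"effort": effort},
    }
    body.update(extra)
    return body


def next_step(response):
    """Decide what the harness does with a response (hypothetical policy)."""
    if response.get("stop_reason") != "refusal":
        return ("accept", None)
    category = (response.get("stop_details") or {}).get("category")
    if category in ("cyber", "bio"):
        return ("retry", OPUS)       # rerun the turn on the fallback model
    if category == "reasoning_extraction":
        return ("rephrase", FABLE)   # stop asking for transcribed reasoning
    return ("escalate", None)        # null category: hand to a human
```

The point of writing `next_step` down is that the decision gets made once, deliberately, instead of surfacing as a stalled overnight run.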

The bill: what you're actually paying for

Fable 5's pricing has a sticker and a story. The sticker is simple: 2× Opus. The story is everything that multiplies on top of it — and almost all of it is under your control. Five levers, roughly in order of impact:

1. Effort is the biggest dial on the dashboard. output_config: {effort: ...} with low | medium | high | xhigh | max governs all output — thinking, tool calls, and text. In Willison's single-prompt sweep, max produced roughly 7.5× the output tokens of low on an identical prompt, and high actually used fewer tokens than medium on one run. The relationship between effort and total cost isn't monotonic, because higher effort often finishes in fewer turns. So: start at high (the default), sweep on your own evals, and reserve xhigh/max for hard, latency-insensitive problems. Several of the scariest launch-week burn reports turned out to involve max effort, 1M contexts, and multi-agent workflows all at once — three multipliers, none of them mandatory.

2. You pay for thinking you never see. Adaptive thinking can't be turned off, and you're billed for the model's full internal reasoning, not the (by default, empty) thinking text in the response. If your billed output tokens dwarf your visible output, that's not a bug — check usage.output_tokens_details.thinking_tokens to see the reasoning share. Budget for it; don't fight it.

3. Caching is unusually kind to Fable. Cache reads cost 10% of the input price, standard multipliers apply on writes, and Fable 5's minimum cacheable prefix is 2,048 tokens versus 4,096 on Opus 4.6–4.8 — so mid-sized prompts cache on Fable that silently wouldn't on Opus. At $10/MTok input, verify your cache hits (usage.cache_read_input_tokens) before optimizing anything else.

4. Batch anything that can wait. The Batches API supports Fable 5 at the standard 50% discount: $5/$25, which is interactive-Opus pricing for frontier-quality output. Overnight evals, bulk document analysis, scheduled report generation — if nobody is watching the response stream, it belongs in a batch.

5. Classifier trips have a billing story too. A request refused on input costs $0. With server-side fallback enabled, the fallback's input is billed at cache-read rates. But a mid-stream block split-bills: Fable rates for everything generated before the block, Opus rates after (the fallback billing guide covers the full mechanics). One launch-week report described burning 200K Fable-priced tokens on a code review before the classifier handed the session to Opus. For security- and biology-adjacent work, it's cheaper to start on Opus 4.8 than to pay frontier rates for a run the classifier was always going to interrupt.
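The rate-card arithmetic behind these levers is easy to script. The sketch below uses only the prices quoted in this article ($10 input, $50 output, cache reads at 10% of input, batch at 50% off). It ignores cache-write multipliers and assumes the batch discount applies to the whole request, so treat the output as an estimate.

```python
INPUT_PER_MTOK = 10       # USD per million uncached input tokens
OUTPUT_PER_MTOK = 50      # USD per million output tokens (thinking included)
CACHE_READ_PER_MTOK = 1   # 10% of the input price
BATCH_FACTOR = 0.5        # assumed to apply to the whole request


def cost_usd(input_tokens, output_tokens, cache_read_tokens=0, batch=False):
    """Estimate the cost of one request from its token counts."""
    micro_usd = (input_tokens * INPUT_PER_MTOK
                 + cache_read_tokens * CACHE_READ_PER_MTOK
                 + output_tokens * OUTPUT_PER_MTOK)
    total = micro_usd / 1_000_000
    return total * BATCH_FACTOR if batch else total


def thinking_share(usage):
    """Fraction of billed output that was internal reasoning."""
    output = usage.get("output_tokens", 0)
    thinking = usage.get("output_tokens_details", {}).get("thinking_tokens", 0)
    return thinking / output if output else 0.0
```

Under these assumptions a run with 2M input and 400K output tokens comes to $40, drops to $26.50 if 1.5M of that input is served from cache, and to $20 if the uncached version goes through a batch.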

Then the guard rails. max_tokens is the only hard cap (thinking and text combined). Task budgets (beta, minimum 20K tokens) give the model a live countdown for an entire agentic loop and it self-moderates — the right pacing tool for long autonomous runs, though it's advisory, not a wall. One caution: set the budget once and let the server count down; mutating it client-side between turns invalidates your prompt cache. Finally, track spend per project. Usage-analytics tooling that breaks cost down by project and agent (Claude Code's /usage and spend limits, or the Console for API orgs) turns the routing decisions in this article from theory into a weekly habit. You'll likely find, as we did, that a handful of long-horizon projects justify the frontier tier and everything else doesn't.

Prompting: front-load the specification

The highest-leverage practice: give the full task specification in one well-specified opening turn, then let the model run. Fable 5's long-horizon coherence comes from planning against a clear goal. Ambiguous asks drip-fed across many turns waste exactly the capability you're paying double for. Concretely:

State what "done" looks like in checkable terms — not "a good report" but "a CSV with a numeric price column per SKU, validated against the source totals."
Include constraints, non-goals, and the verification method up front.
For managed agent platforms, encode "done" as a gradeable rubric/outcome so the harness iterates and grades automatically.
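A checkable definition of done can be as small as a function that returns a list of problems. Here is a Python sketch for the CSV example above. The column names `sku` and `price`, the tolerance, and the function name are assumptions made for illustration.

```python
import csv
import io


def check_done(csv_text, expected_total, tolerance=0.005):
    """Return a list of problems; an empty list means the deliverable passes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return ["no data rows"]
    problems = []
    seen = set()
    total = 0.0
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        sku = (row.get("sku") or "").strip()
        if not sku:
            problems.append(f"line {line_no}: missing sku")
        elif sku in seen:
            problems.append(f"line {line_no}: duplicate sku {sku}")
        seen.add(sku)
        try:
            total += float(row.get("price") or "")
        except ValueError:
            problems.append(f"line {line_no}: price is not numeric")
    if abs(total - expected_total) > tolerance:
        problems.append(f"total {total:.2f} does not match source {expected_total:.2f}")
    return problems
```

Handing the model a check like this in the opening prompt gives it something to run against its own output before it declares the task finished.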

Two behavioral notes carried over from recent Opus generations still apply: the model follows instructions literally (soften any "CRITICAL: YOU MUST" scaffolding written for older, more reluctant models), and it's conservative about reaching for optional capabilities — if you want it using subagents, file-based memory, or custom tools, say when to use them, in the system prompt and in each tool's own description.

Harness and tooling: the model is half the system

Tier your subagents. Even with Fable 5 in the main loop, search/explore/verify subagents should run on Haiku or Sonnet. On tasks where subagent fan-out dominates spend, this approaches a 10× cost difference with near-identical results. (And know your workflow multipliers: multi-agent team sessions run roughly 7× the tokens of a solo session.)
Give it file-based memory on long tasks. Anthropic's own testing showed persistent memory improved Fable 5's long-game performance roughly three times more than it improved Opus 4.8's — this model is distinctly better at writing and using its own notes. Maintain per-project memory files (one lesson per file, corrections and confirmed approaches alike) and the model compounds.
Invest in CLAUDE.md / project instructions. Documented constraints, build commands, and verification procedures are what let a long autonomous run self-correct instead of drifting. Counterintuitively, the better your docs, the less often you need Fable — and the better Fable performs when you do.
Use skills for repeatable procedures. Anything you do more than twice — deploy flows, report formats, review checklists — belongs in a skill the model loads on demand, not in tokens you re-explain per session.
Verify with fresh-context agents, not self-critique. For audits, reviews, and migrations, structure the work as decompose → execute → adversarially verify → synthesize, with verifiers spawned in clean contexts. Fable 5 at the highest effort already reflects on and validates its own work, but independent verification still beats self-review for anything you'll ship.

A simple routing default

Workload	Model
Most work (the default): scoped features, maintenance, templated docs, runbook execution, interactive coding	Sonnet 4.6 (escalate to Opus 4.8 when it stalls)
Multi-day autonomous builds, large migrations, high-stakes synthesis, judge/synthesis agents, frontier vision tasks	Fable 5
Subagent fan-out, verification steps, classification, extraction, high-volume pipelines	Haiku 4.5 (Sonnet 4.6 for heavier subagents)
Security- or biology-flavored sessions where a classifier trip would break your harness	Opus 4.8 directly (+ CVP verification for dual-use cyber work — see the Mythos section)
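If you'd rather encode the table than remember it, a first cut might look like the Python sketch below. The task attributes and the precedence between rows (gated domains first, then fan-out, then frontier work) are my own reading of the table, not an official rule.

```python
from dataclasses import dataclass


@dataclass
class Task:
    gated_domain: bool = False   # security- or biology-flavored session
    fan_out: bool = False        # subagent, verification, or high-volume step
    long_horizon: bool = False   # multi-hour or multi-day autonomous run
    high_stakes: bool = False    # errors cost real money or legal exposure


def route(task):
    """Pick a model tier for a task, following the routing table above."""
    if task.gated_domain:
        return "Opus 4.8"
    if task.fan_out:
        return "Haiku 4.5"
    if task.long_horizon or task.high_stakes:
        return "Fable 5"
    return "Sonnet 4.6"
```

Checking the gated domains first is deliberate: a long-horizon security audit would otherwise land on Fable 5 and risk a mid-run classifier trip.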

The bottom line

Fable 5 doesn't make Sonnet or Opus obsolete — it makes a previously impossible class of work possible, at a price and latency that punish using it for everything else. Think of it as hiring a brilliant specialist who bills by the hour and thinks out loud on the clock: transformative on the right problem, ruinous as a receptionist.

The teams that get the most from it will be the ones with the discipline to route: specs and stakes go up to the frontier; procedures and patterns stay on the efficient tiers; verification runs cheap everywhere. The model rewards exactly the engineering hygiene — written specs, documented constraints, checkable definitions of done, persistent memory — that made your systems better before it existed.

Sources

Anthropic, "Claude Fable 5 and Claude Mythos 5" — official announcement (June 9, 2026): capabilities, classifiers, fallback behavior, pricing, availability, data retention.
Anthropic, Project Glasswing and Real-time cyber safeguards on Claude — Mythos access, partner roster, Mythos Preview pricing, and the Cyber Verification Program's scope and application process.
Anthropic platform documentation — pricing, effort parameter, adaptive thinking, task budgets, prompt caching, and the Fable 5 fallback billing guide (refusal/fallback billing mechanics, batch availability).
AWS News Blog, "Anthropic Claude Fable 5 on AWS" — Bedrock/Claude Platform availability, data-retention opt-in, fallback pricing mechanics.
Simon Willison, "Initial impressions of Claude Fable 5" — independent hands-on review: real-world coding sessions, cost data, effort-level token sweep.
Community launch-week reports — r/ClaudeAI thread (144 comments of burn-rate anecdotes and the emerging plan-with-Fable, execute-with-Opus pattern) and Hacker News discussion. Treated as anecdote, not data.
Our portfolio audit (June 2026, including launch-week usage) — routing analysis across ~20 active projects spanning agentic platforms, SDKs, data pipelines, and analysis workloads.

From Node.js to Go: Rebuilding an MCP Server for Production

Zohar Babin — Tue, 19 May 2026 20:48:20 +0000

This is the story of why I rebuilt google-researcher-mcp (Node.js/TypeScript) from scratch as web-researcher-mcp (Go), and what I learned along the way.

The Starting Point

The original project — google-researcher-mcp — was a TypeScript/Node.js MCP server distributed via npm. It had real traction: 36 GitHub stars, 6,500+ npm downloads, 860+ tests, and active users. But five critical issues kept surfacing that couldn't be solved within the existing architecture.

Why Rewrite in Go (Not Refactor)

Orphan Processes (Issue #108)

npx spawns deeply nested process trees. When the parent MCP client (Claude Desktop, Cursor) crashes or closes unexpectedly, the Node.js process doesn't receive a signal — it keeps running, consuming memory and holding file locks.

My collaborators and I spent three versions (v6.2.0 through v6.4.0) building increasingly complex orphan detection: a Worker thread watchdog with CPU spin detection, three-layer parent-alive checks, and graceful degradation. It was all band-aids on a fundamental runtime limitation.

Go fix: A single static binary. No runtime process tree. EOF on stdin = immediate exit. The entire problem category disappeared.

Google Discontinuing "Entire Web" Search (Issue #107)

Google announced it would be discontinuing support for Programmable Search Engines configured to search the "entire web." The project was named google-researcher-mcp, and its dependency on a single search provider was a foundational risk.

Go fix: Interface-driven search.Provider with multiple implementations, plus a Router that provides multi-provider routing with automatic failover via per-provider circuit breakers.

Alternative Search Engines (Issue #55)

Users wanted Brave, Bing (go figure), and other providers. But the TypeScript codebase was too tightly coupled to Google's API response format — the shared directory (41 files) made every change risky and far-reaching.

Go fix: A clean Provider interface — each adapter normalizes provider-specific responses to common types (SearchResult, ImageResult, NewsResult). Adding a new provider is one file implementing one interface.

Redis Caching (Issue #72)

The in-memory cache was lost on every process restart — which happened frequently with npx-launched servers. The complex persistence manager offered four strategies (Periodic, WriteThrough, OnShutdown, Hybrid), but none reliably survived the volatile process lifecycle.

Go fix: A cache.Cache interface with a hybrid implementation: memory LRU + AES-encrypted disk + optional Redis. Simple, testable, and it never loses data because the disk layer persists across restarts.

Monolithic Architecture (Issue #40)

The project had 100+ source files but a tightly coupled shared/ directory with 41 files. Adding a single tool required touching 4+ documentation sections, and the import graph made refactoring perilous.

Go fix: One package per concern. Tool handlers are self-contained files. Adding a tool means writing one file and one line in the registry.

What Changed Architecturally

Aspect	Node.js (old)	Go (new)
Distribution	npm/npx (runtime required)	Single static binary
Memory	430MB idle (80MB after optimization)	~25MB baseline
Startup	2-4 seconds (lazy imports)	<100ms
Process lifecycle	Worker thread watchdog	EOF detection, no orphans
Search providers	Google only	Multiple providers + fallback routing
Concurrency	Event loop + async/await	Goroutines + semaphores
Type safety	TypeScript + Zod	Go type system + struct tags
Testing	860+ Jest tests	Table-driven tests + race detector
Scraping	Playwright (heavy)	4-tier pipeline (lightweight first)

Key Lessons Learned

1. Don't Fight Your Runtime

Node.js process management is fundamentally fragile for long-lived servers launched via npx. The runtime doesn't support robust parent-death detection, and the nested process tree (npx → node → worker) makes signal propagation unreliable. We spent three versions building increasingly complex orphan detection. Go's single binary eliminated the entire category of problems.

Takeaway: If you're spending significant engineering effort working around your runtime's limitations, that's a signal to evaluate whether the runtime fits the problem.

Side note: while looking for a better runtime, I evaluated both Go and Rust (isn't Rust awesome!?). Go won primarily because its lightweight goroutines excel at I/O-bound operations and the mcp-go SDK is superbly maintained.

2. Interface-Driven Design Enables Fearless Extension

Adding Brave Search in the Go version was one file implementing one interface — about 200 lines including tests. In the Node.js version, the equivalent change would have touched 6+ files due to tightly coupled imports in the shared directory.

Takeaway: When you know extension is likely (new providers, new tools), invest in clean interfaces upfront. The interface is the specification; implementations are interchangeable.

3. Memory Matters for MCP Servers

MCP servers run alongside AI assistants on developer machines. They're always-on background processes. A 430MB idle memory footprint was unacceptable — users would notice and uninstall. Go's ~25MB baseline lets the server stay resident without impact.

Takeaway: For developer tools that run continuously, memory efficiency is a feature, not an optimization. Choose your runtime accordingly.

4. Caching Architecture Should Be Boring

The old project had four persistence strategies with complex heuristics for when to flush. The new one has: memory LRU + optional encrypted disk + optional Redis. Each layer is simple and independently testable. No heuristics, no race conditions, no data loss.

Takeaway: Boring infrastructure is reliable infrastructure. If your caching layer needs its own debugging session, it's too complex.

5. Documentation Should Be Drift-Resistant

The old project required updating four separate documentation files per new tool. Inevitably, docs drifted from reality. The new project's test suite programmatically validates documentation claims — tool descriptions must mention alternatives, output schemas must match actual responses, and annotations must be consistent.

Takeaway: If documentation can be wrong without a test failing, it will eventually be wrong.

What We Kept

The rewrite preserved the user-facing contract:

Same tools with identical semantics and parameter names
Same MCP protocol compatibility (Claude Desktop, Cursor, VS Code, any MCP client)
Same environment variables (drop-in replacement for existing configs)
Same search lenses (curated domain lists, identical JSON format)

What improved (without breaking backwards compatibility):

OAuth 2.1 authentication for multi-client deployments
Multi-tenancy with per-tenant session isolation
Per-provider circuit breakers with automatic fallback
Prometheus metrics for observability
Structured audit logging for compliance

Results

Since launching the Go version:

Zero orphan process reports (vs. recurring issue in Node.js version)
Multiple search providers with automatic failover (vs. single provider)
4-tier scraping pipeline that tries lightweight methods first (vs. Playwright-only)
Sub-100ms cold startup (vs. 2-4 seconds)
Production-ready: rate limiting, circuit breakers, session isolation, audit trail

Should You Rewrite?

Probably not. Most rewrites fail because they're motivated by developer preference ("I want to use a new language") rather than architectural necessity. Ours succeeded because:

The problems were architectural, not implementational — no amount of refactoring within Node.js would fix process orphaning
The user-facing contract was well-defined — MCP provides a clean protocol boundary
The scope was bounded — we knew exactly what the server needed to do
We had comprehensive tests on the old version to validate behavioral equivalence

If your problems are solvable within your current architecture, refactor. If they're fundamentally incompatible with your runtime or architecture, consider a rewrite — but only with clear success criteria and a well-defined boundary.

This article covers the migration from google-researcher-mcp to web-researcher-mcp. The new project is open source under MIT and works with any MCP-compatible AI assistant.

Building a 13-Agent AI System for M&A Due Diligence — Architecture Deep Dive

Zohar Babin — Sun, 17 May 2026 19:05:46 +0000

The Problem Nobody Was Solving

As a corp dev lead, I spent weeks doing the same thing after every deal: assembling the cross-domain picture from siloed advisor reports.

Legal would flag a termination clause. Finance would flag revenue concentration. Same entity. Nobody connected the dots.

This happens because due diligence is split into parallel workstreams — legal, financial, commercial, tax, regulatory — each run by separate teams with separate deliverables. The cross-referencing happens in someone's head, over coffee, two days before the IC memo is due.

The numbers back this up:

31% of M&A failures trace back to DD shortcomings (HBR, McKinsey, KPMG research)
DD timelines keep compressing — six weeks becomes three, same scope
Corp dev teams screen 200-1,000+ companies/year but close 1-3%

I built Due Diligence Agents to fix this.

What It Does

13 AI agents analyze every document in an M&A data room across 9 specialist domains — Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, and ESG — then cross-reference findings automatically and trace each one to the exact page, section, and quote.

pip install dd-agents
dd-agents auto-config "Buyer" "Target" --data-room ./your_data_room
dd-agents run deal-config.json

Output: an interactive HTML report, a 14-sheet Excel workbook, and per-subject JSON findings. See a sample report from synthetic data.

The Architecture

The system has four layers:

Layer 1: 38-Step Async Pipeline

The orchestrator (engine.py) is a state machine with 38 async steps grouped into phases:

Setup (steps 1-5): Load config, validate data room, resolve entities
Discovery (steps 6-13): Extract documents, build inventory, classify files, compute precedence
Analysis (steps 14-17): Build specialist prompts, route documents, spawn agents in parallel, check coverage
Cross-Domain (steps 18-20): Symbolic trigger evaluation, targeted respawn, merge
Quality (steps 21-26): Judge review, merge findings, validate, deduplicate
Reporting (steps 27-38): Generate HTML, Excel, JSON, knowledge base

Every step supports checkpoint/resume. If the pipeline crashes at step 23, it restarts from step 23 — not from scratch. Steps are typed, and the state object serializes cleanly to JSON.
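
A minimal sketch of the checkpoint/resume idea, not the actual engine.py. The three step functions, the next_step key, and the fail_at parameter are hypothetical and exist only for illustration. State is written to JSON after every step, so a rerun picks up at the first step that did not complete.

```python
import asyncio
import json
from pathlib import Path
from typing import Optional

# Hypothetical stand-ins for real pipeline steps.
async def load_config(state: dict) -> None:
    state["config_loaded"] = True

async def build_inventory(state: dict) -> None:
    state["documents"] = ["msa.pdf", "nda.pdf"]

async def count_documents(state: dict) -> None:
    state["document_count"] = len(state["documents"])

STEPS = [load_config, build_inventory, count_documents]

async def run_pipeline(checkpoint: Path, fail_at: Optional[int] = None) -> dict:
    """Run steps in order, saving JSON state after each completed step."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())  # resume from last save
    else:
        state = {"next_step": 0}
    for index in range(state["next_step"], len(STEPS)):
        if fail_at == index:  # simulated crash, for demonstration only
            raise RuntimeError(f"simulated crash at step {index}")
        await STEPS[index](state)
        state["next_step"] = index + 1
        checkpoint.write_text(json.dumps(state))
    return state
```

Calling asyncio.run(run_pipeline(path)) a second time after a failure skips every step that already completed, because the saved next_step marks where to restart.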

Layer 2: 13 Agents

9 specialists + 4 meta-agents, each spawned via Anthropic's claude-agent-sdk:

Specialists: Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG. Each gets domain-specific prompts, the relevant documents, and a set of tools (file read, search, finding write).

Meta-agents:

Judge: Reviews specialist findings for quality, consistency, and missed coverage
Executive Synthesis: Produces the deal-level summary with go/no-go signals
Red Flag Scanner: Pattern-matches across all findings for deal-killers
Acquirer Intelligence: Tailors findings to the buyer's strategic context

Specialists run in parallel (batched by resource constraints). Meta-agents run sequentially after all specialists complete.
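
The scheduling pattern can be sketched with plain asyncio. This is an illustration under assumptions, not the project's code: run_agent is a hypothetical placeholder for the real agent call, and max_parallel stands in for whatever resource constraint drives the batching.

```python
import asyncio

SPECIALISTS = ["Legal", "Finance", "Commercial", "ProductTech", "Cybersecurity",
               "HR", "Tax", "Regulatory", "ESG"]
META_AGENTS = ["Judge", "Executive Synthesis", "Red Flag Scanner",
               "Acquirer Intelligence"]

async def run_agent(name: str, limiter: asyncio.Semaphore, log: list) -> None:
    # The semaphore caps how many agents are in flight at once.
    async with limiter:
        await asyncio.sleep(0)  # placeholder for the real agent call
        log.append(name)

async def run_all(max_parallel: int = 3) -> list:
    log: list = []
    limiter = asyncio.Semaphore(max_parallel)
    # Specialists: all scheduled together, throttled by the semaphore.
    await asyncio.gather(*(run_agent(n, limiter, log) for n in SPECIALISTS))
    # Meta-agents: strictly one after another, after every specialist is done.
    for name in META_AGENTS:
        await run_agent(name, limiter, log)
    return log
```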

Layer 3: Neurosymbolic Cross-Domain Analysis

This is the part that solved my original problem.

After specialists produce their findings (pass 1), a deterministic rule engine scans them for cross-domain dependencies. No LLM calls — just Python pattern matching.

# Example: Finance finds revenue recognition issue
# → Rule fires → Legal agent re-examines specific contracts
# for enforceability, clawback clauses, delivery milestones

Seven built-in trigger rules cover the most common M&A cross-domain dependencies:

Source → Target	When It Fires
Finance → Legal	Revenue recognition finding needs contract enforceability check
Legal → Finance	Change-of-control clause needs financial exposure quantification
Legal → Finance	Termination-for-convenience needs revenue-at-risk calculation
Legal → ProductTech	IP ownership dispute needs technical dependency assessment
ProductTech → Legal	Data privacy finding needs DPA/GDPR compliance review
Commercial → Finance	SLA risk with >10% service credits needs financial quantification
Finance → Commercial	Pricing discrepancy needs commercial rate card validation

When a rule fires, it creates a CrossDomainTrigger with the specific contracts to re-examine and instructions for the target agent. The target agent runs a targeted pass-2 review — only on the cited contracts, not the full data room. This keeps costs bounded.

Budget-capped, priority-ordered. If no triggers fire, zero additional cost.
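
A rough sketch of how such a rule engine could look. Only the CrossDomainTrigger name comes from the description above; the field names, the category strings, and the three sample rules are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    agent: str
    category: str   # hypothetical category label
    contract: str
    priority: int   # 0 = most severe

@dataclass(frozen=True)
class CrossDomainTrigger:
    source: str
    target: str
    contract: str
    instruction: str
    priority: int

# (source agent, finding category) -> (target agent, instruction)
RULES = {
    ("Finance", "revenue_recognition"): ("Legal", "Check contract enforceability"),
    ("Legal", "change_of_control"): ("Finance", "Quantify financial exposure"),
    ("Legal", "termination_for_convenience"): ("Finance", "Calculate revenue at risk"),
}

def evaluate_triggers(findings: list, budget: int) -> list:
    """Deterministic matching: no LLM calls, just a dictionary lookup."""
    triggers = []
    for f in findings:
        rule = RULES.get((f.agent, f.category))
        if rule is not None:
            target, instruction = rule
            triggers.append(
                CrossDomainTrigger(f.agent, target, f.contract, instruction, f.priority)
            )
    triggers.sort(key=lambda t: t.priority)  # most severe first
    return triggers[:budget]                 # cap the pass-2 work
```

With an empty findings list, or findings that match no rule, the function returns an empty list and no second pass runs.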

The design is inspired by the FAOS Platform — asymmetric coupling where symbolic rules constrain the LLM's scope while the LLM provides judgment. Symbolic decides when intelligence is needed; the LLM provides what to do about it.

Layer 4: 5 Blocking Quality Gates

Every finding goes through validation before it reaches the report:

Coverage gate: Did the agent analyze every assigned document?
Schema validation: Does every finding have the required fields (severity, citations, category)?
Citation verification: Can we trace the finding back to a specific page and quote?
Semantic dedup: Are two agents saying the same thing about the same document? (rapidfuzz token_sort_ratio ≥ 80)
Numerical audit: Do financial figures in findings match what's in the source documents?

Fail-closed. If validation fails, the pipeline stops — it doesn't silently produce bad output.
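
Fail-closed is easy to express in code: a gate raises instead of logging. The sketch below covers only the coverage and schema gates, and the GateFailure exception and function names are hypothetical.

```python
REQUIRED_FIELDS = ("severity", "citations", "category")

class GateFailure(Exception):
    """Raised to stop the pipeline when a gate does not pass."""

def coverage_gate(assigned: list, analyzed: list) -> None:
    missing = sorted(set(assigned) - set(analyzed))
    if missing:
        raise GateFailure(f"coverage: unanalyzed documents {missing}")

def schema_gate(findings: list) -> None:
    for index, finding in enumerate(findings):
        # A field counts as absent if it is missing or empty.
        absent = [name for name in REQUIRED_FIELDS if not finding.get(name)]
        if absent:
            raise GateFailure(f"schema: finding {index} is missing {absent}")

def run_gates(assigned: list, analyzed: list, findings: list) -> list:
    """Return findings only if every gate passes; otherwise raise."""
    coverage_gate(assigned, analyzed)
    schema_gate(findings)
    return findings
```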

The Chat Mode (My Favorite Feature)

After the pipeline runs, you can interrogate the results:

dd-agents chat --report _dd/forensic-dd/runs/latest

The chat agent has 14 MCP tools: citation verification against source PDFs, cross-contract search, entity resolution, and sandboxed document generation. Ask "build me a board summary of all P0 findings with revenue impact" and it writes a Python script, executes it in a sandbox, and hands you the .xlsx file.

15 Things I Learned Building This

These lessons apply to any system doing cross-document analysis at scale — not just M&A.

1. Extraction is harder than analysis. By a lot.

Everyone focuses on the LLM prompts. But 80% of the real engineering is getting clean text out of messy documents. Our extraction pipeline has 4 tiers: pymupdf → pdftotext → OCR (Tesseract → GLM-OCR) → Claude vision as last resort. Each tier has 6 quality gates (min chars, printable ratio, density, readability, watermark detection, corruption check). Confidence scales with method quality — pymupdf gets 0.9 base, OCR gets 0.65.
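
The tiered fallback can be sketched as a loop over extractors, each followed by quality gates. This is an illustration only: it implements two of the six gates, the thresholds are assumed values, and the extractors are passed in as plain callables rather than real pymupdf or Tesseract calls.

```python
def printable_ratio(text: str) -> float:
    """Fraction of characters that are printable (newlines and tabs count)."""
    if not text:
        return 0.0
    good = sum(ch.isprintable() or ch in "\n\t" for ch in text)
    return good / len(text)

def passes_gates(text: str, min_chars: int = 50, min_printable: float = 0.9) -> bool:
    # Thresholds are assumptions for this sketch.
    return len(text.strip()) >= min_chars and printable_ratio(text) >= min_printable

def extract(path: str, tiers: list) -> tuple:
    """Try each (name, extractor, confidence) tier until one passes the gates."""
    for name, extractor, confidence in tiers:
        text = extractor(path)
        if passes_gates(text):
            return name, confidence, text
    raise ValueError(f"no extraction tier produced usable text for {path}")
```

Each tier carries its own base confidence (0.9 for pymupdf and 0.65 for OCR, per the figures above), so downstream steps know how much to trust the text they receive.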

2. Entity resolution is your invisible foundation

"IBM", "International Business Machines", and "Red Hat" — are these the same entity? We use a 6-stage cascade: exact match → normalized (strip legal suffixes) → alias expansion → fuzzy match (rapidfuzz) → TF-IDF cosine similarity → learned matches from prior runs. Names ≤5 characters are blocked from fuzzy matching — without this, "Inc." matches random entities.

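The first four stages of such a cascade can be sketched with the standard library. Assumptions to note: difflib.SequenceMatcher stands in for rapidfuzz here, the 0.85 threshold and the suffix list are invented for the example, and the TF-IDF and learned-match stages are omitted.

```python
import re
from difflib import SequenceMatcher

LEGAL_SUFFIXES = re.compile(r"\b(inc|corp|corporation|llc|ltd|gmbh)\.?$")

def normalize(name: str) -> str:
    cleaned = LEGAL_SUFFIXES.sub("", name.strip().lower()).strip(" ,.")
    return re.sub(r"\s+", " ", cleaned)

def resolve(name: str, canonical: list, aliases: dict, threshold: float = 0.85) -> tuple:
    """Return (canonical name or None, stage that matched)."""
    if name in canonical:
        return name, "exact"
    norm = normalize(name)
    by_norm = {normalize(c): c for c in canonical}
    if norm in by_norm:
        return by_norm[norm], "normalized"
    if norm in aliases:
        return aliases[norm], "alias"
    # Short names are blocked from fuzzy matching to avoid spurious hits.
    if len(norm) > 5 and by_norm:
        def score(candidate: str) -> float:
            return SequenceMatcher(None, norm, candidate).ratio()
        best = max(by_norm, key=score)
        if score(best) >= threshold:
            return by_norm[best], "fuzzy"
    return None, "unresolved"
```

The order matters: cheap, high-precision stages run first, and fuzzy matching only sees names that every earlier stage failed to place.
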
3. Don't dump everything into one context. Map-merge-resolve.

A 200-page master agreement might have the deal-killer on page 147. You can't skip large files. But dumping them into one context drops accuracy from 95% to 74% (Addleshaw Goddard, 510 contracts). Instead: chunk at page boundaries (150K chars, 15% overlap), analyze each chunk independently, merge with priority logic (YES beats NO, specific beats generic), and only invoke LLM arbitration when chunks disagree. The 21-point accuracy gain is entirely engineering — no model change.
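
A sketch of the chunk-then-merge step, under assumptions: pages arrive as a list of strings, chunks are returned as (start, end) page index ranges, and the per-chunk answer labels are hypothetical (the NOT_FOUND label is borrowed from the hallucination defenses described in lesson 4). LLM arbitration and the specific-beats-generic rule are left out.

```python
def chunk_pages(pages: list, max_chars: int = 150_000, overlap: float = 0.15) -> list:
    """Split pages into (start, end) ranges that never cut a page in half."""
    chunks, start = [], 0
    while start < len(pages):
        end, size = start, 0
        # Always take at least one page, then keep adding pages while they fit.
        while end < len(pages) and (end == start or size + len(pages[end]) <= max_chars):
            size += len(pages[end])
            end += 1
        chunks.append((start, end))
        if end == len(pages):
            break
        # Step back over trailing pages worth up to `overlap` of the budget,
        # so the next chunk repeats them. `back` always stays above `start`.
        back, carried = end, 0
        while back - 1 > start and carried + len(pages[back - 1]) <= overlap * max_chars:
            back -= 1
            carried += len(pages[back])
        start = back
    return chunks

ANSWER_PRIORITY = ("YES", "NO", "NOT_FOUND")  # YES beats NO beats NOT_FOUND

def merge_answers(answers: list) -> str:
    """Pick the highest-priority answer reported by any chunk."""
    for label in ANSWER_PRIORITY:
        if label in answers:
            return label
    return "NOT_FOUND"
```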

4. Hallucination is an engineering problem, not a model problem

No single defense works. We use 5 layers: (1) Pydantic schema validation on every response, (2) mandatory citation with file_path/page/exact_quote verified against source, (3) explicit "NOT_FOUND" escape valve — without this, models fabricate clauses rather than admit ignorance, (4) adversarial Judge review with accusatory framing ("this finding appears fabricated — prove it with a direct quote"), (5) 6-layer deterministic numerical audit.

The third defense, the NOT_FOUND escape valve, changed everything. When you tell the model "if you can't find this clause, say NOT_FOUND," hallucination drops dramatically.

5. Know when to stop using LLMs

We had an LLM agent doing validation and report synthesis. We replaced it with deterministic Python. Quality went up, cost went down. The rule: use LLMs for analysis and synthesis; use Python for validation, dedup, and audit. If you can write the logic as deterministic code, do it.

6. Self-verification works — but only with accusatory framing

After agents produce findings, a follow-up pass challenges them on high-severity claims. Polite prompts ("please review your finding") have near-zero effect — models confirm their own output. Accusatory prompts ("this finding appears fabricated," "the cited clause doesn't exist") force re-examination and produce a 9.2% accuracy improvement.

7. Cross-agent dedup is different than you think

When 4 agents analyze the same document, they find the same issue but describe it differently. Three rules: (1) never dedup within the same agent — two similar findings from Legal are intentionally distinct, (2) only dedup across agents on the same document — similar findings on different documents are different findings, (3) keep contributing agent metadata so you know which domains flagged it.
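
The three rules translate into a short merge loop. In this sketch the similarity function is a standard-library approximation of rapidfuzz's token_sort_ratio (sorted tokens compared with difflib), so scores will not match rapidfuzz exactly, and the finding dictionaries use hypothetical keys.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a: str, b: str) -> float:
    """Approximate token-sort ratio on a 0-100 scale."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return 100 * SequenceMatcher(None, sa, sb).ratio()

def dedup(findings: list, threshold: float = 80) -> list:
    kept = []
    for f in findings:
        for k in kept:
            if (
                k["document"] == f["document"]       # rule 2: same document only
                and f["agent"] not in k["agents"]    # rule 1: never within one agent
                and token_sort_similarity(k["text"], f["text"]) >= threshold
            ):
                k["agents"].append(f["agent"])       # rule 3: keep contributors
                break
        else:
            kept.append(
                {"document": f["document"], "text": f["text"], "agents": [f["agent"]]}
            )
    return kept
```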

8. Context window engineering is a first-class discipline

It's not just about fitting data in — it's about where things go. Critical instructions go at the start (highest recall zone). Document content goes in the middle (lowest recall — ~40% worse). Constraints and format rules go at the end (second-highest recall). We budget 40% of the context window for tool calls and reasoning.

9. Quality gates must be blocking, not advisory

If validation just logs a warning, nobody reads it. If it halts the pipeline, quality is non-negotiable. The same goes for agent guardrails: turn limits (soft limit at 200 turns, force-kill at 3x that), path guards (agents can only write under _dd/), and bash guards (24 blocked patterns, including rm -rf, sudo, and pipe-to-shell). Better to produce nothing than unreliable output.
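
The path and bash guards can be sketched in a few lines. The three regular expressions below are illustrative examples, not the project's actual 24 patterns, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real blocklist would be longer.
BLOCKED_PATTERNS = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bsudo\b"),
    re.compile(r"\|\s*(ba)?sh\b"),  # pipe-to-shell
]

def is_write_allowed(target: str, project_root: str) -> bool:
    """Allow writes only to paths that resolve under <project_root>/_dd."""
    allowed = (Path(project_root) / "_dd").resolve()
    resolved = (Path(project_root) / target).resolve()  # collapses ".." segments
    try:
        resolved.relative_to(allowed)
        return True
    except ValueError:
        return False

def is_command_allowed(command: str) -> bool:
    return not any(pattern.search(command) for pattern in BLOCKED_PATTERNS)
```

Resolving the path before checking it is the important part: a naive string prefix check would accept _dd/../secrets.txt.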

10. Every claim must be traceable to source

Citation verification uses 4 scopes: exact page match → adjacent pages ±1 → full document fuzzy match (80%+) → cross-file search. That last one matters — if the quote isn't in the cited file, we search all files for that entity. Auto-corrects file misattribution.
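
The four scopes form a cascade that widens only when the narrower scope fails. The sketch below makes several simplifying assumptions: documents are a dict of file name to page texts, page numbers are 0-indexed, and the fuzzy scope uses a crude token-overlap score as a stand-in for a real fuzzy matcher.

```python
def _norm(text: str) -> str:
    return " ".join(text.lower().split())

def token_overlap(quote: str, text: str) -> float:
    """Share of quote tokens that appear anywhere in the text."""
    tokens = _norm(quote).split()
    words = set(_norm(text).split())
    return sum(t in words for t in tokens) / len(tokens) if tokens else 0.0

def verify_citation(docs: dict, file: str, page: int, quote: str) -> tuple:
    """Return (scope, file, page) for the first scope that finds the quote."""
    pages = docs.get(file, [])
    q = _norm(quote)
    if 0 <= page < len(pages) and q in _norm(pages[page]):
        return "exact_page", file, page
    for p in (page - 1, page + 1):
        if 0 <= p < len(pages) and q in _norm(pages[p]):
            return "adjacent_page", file, p
    for p, text in enumerate(pages):
        if token_overlap(quote, text) >= 0.8:
            return "document_fuzzy", file, p
    for other, other_pages in docs.items():
        if other == file:
            continue
        for p, text in enumerate(other_pages):
            if q in _norm(text):
                return "cross_file", other, p  # corrects file misattribution
    return "unverified", None, None
```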

11. Most of what AI finds is noise

Run 9 agents across hundreds of documents and you'll get thousands of findings. We use a 3-stage classification: noise filter (15 patterns for extraction artifacts), data quality filter (14 patterns for "data unavailable" gaps), then material findings. Plus 5 severity recalibration rules — e.g., a change-of-control clause that only applies to competitors gets downgraded from P0 to P3 automatically.

12. Same clause, different deal, different severity

An anti-assignment clause is P0 in an asset purchase (blocks contract transfer) but P3 in a stock purchase (entity doesn't change). Deal-type context must flow through the entire pipeline: prompt-time rules, post-hoc deterministic adjustments, and executive judgment overrides — with full audit trail.
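
The post-hoc deterministic adjustment could look something like this. The lookup table holds only the anti-assignment example from the paragraph above; the key names and the audit record shape are assumptions.

```python
# (clause type, deal type) -> severity; only the example above is encoded.
SEVERITY_BY_DEAL_TYPE = {
    ("anti_assignment", "asset_purchase"): "P0",
    ("anti_assignment", "stock_purchase"): "P3",
}

def adjust_severity(finding: dict, deal_type: str, audit: list) -> dict:
    """Return a copy of the finding with deal-aware severity, logging any change."""
    new_severity = SEVERITY_BY_DEAL_TYPE.get((finding["clause"], deal_type))
    if new_severity is None or new_severity == finding["severity"]:
        return finding
    audit.append({
        "clause": finding["clause"],
        "deal_type": deal_type,
        "from": finding["severity"],
        "to": new_severity,
    })
    return {**finding, "severity": new_severity}
```

Every change lands in the audit list, so a reviewer can see why a finding's severity differs from what the agent originally reported.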

13. Every API call is a deal cost

Three model profiles: economy (Haiku for extraction), standard (Sonnet for analysis), premium (Opus for synthesis). Per-agent cost tracking. Hard budget limits that halt the pipeline. Right model for right task.
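
Per-agent tracking with a hard limit fits in one small class. This is a sketch with hypothetical names; the profile map only mirrors the three tiers listed above, and the dollar amounts in any usage would come from real API billing data.

```python
from dataclasses import dataclass, field

# Profile -> model family, as described above.
MODEL_PROFILES = {"economy": "haiku", "standard": "sonnet", "premium": "opus"}

class BudgetExceeded(RuntimeError):
    """Raised to halt the pipeline when the next call would exceed the budget."""

@dataclass
class CostTracker:
    limit_usd: float
    spent_by_agent: dict = field(default_factory=dict)

    @property
    def total(self) -> float:
        return sum(self.spent_by_agent.values())

    def record(self, agent: str, cost_usd: float) -> None:
        # Check before recording, so the limit is never crossed.
        if self.total + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"{agent} would push spend to {self.total + cost_usd:.2f} "
                f"(limit {self.limit_usd:.2f})"
            )
        self.spent_by_agent[agent] = self.spent_by_agent.get(agent, 0.0) + cost_usd
```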

14. Pydantic v2 everywhere

137+ models with model_json_schema() for structured outputs. Strict mypy across 199 source files. The type system catches real bugs — a finding with evidence instead of citations gets blocked by the schema guard hook before it's written to disk.

15. Make every run smarter than the last

Inspired by Karpathy's "LLM Wiki" pattern: a persistent knowledge base compounds across runs. Finding lineage via SHA-256 fingerprinting tracks findings even when wording changes. A NetworkX knowledge graph with 11 typed edge types captures entity relationships, contradictions, and clause interactions. Run 2 knows what Run 1 found — and catches what changed.
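
One way a fingerprint can survive rewording is to hash only the fields that identify a finding and leave the description out. Which fields the project actually hashes is not stated, so the four keys below are an assumption chosen for illustration.

```python
import hashlib
import json

# Assumed identity fields; the free-text description is deliberately excluded.
STABLE_FIELDS = ("subject", "document", "category", "clause_ref")

def finding_fingerprint(finding: dict) -> str:
    """SHA-256 over normalized identity fields, stable across rewording."""
    stable = {key: str(finding[key]).strip().lower() for key in STABLE_FIELDS}
    payload = json.dumps(stable, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs that describe the same clause in different words produce the same fingerprint, so the second run can be compared against the first finding by finding.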

Try It

pip install dd-agents
dd-agents auto-config "Buyer" "Target" --data-room ./your_data_room
dd-agents run deal-config.json --dry-run  # Preview without API calls

Sample report (synthetic data, no install needed)

GitHub — Apache 2.0, 3,714 tests, strict mypy.

Built on Anthropic's Claude Agent SDK. Looking for feedback — especially from anyone who's dealt with data room analysis and can tell me whether the report structure maps to how DD findings are actually consumed.