<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zohar Babin</title>
    <description>The latest articles on DEV Community by Zohar Babin (@zoharbabin).</description>
    <link>https://dev.to/zoharbabin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936734%2F51a9e2c8-9c93-45ba-91ae-2449d592c478.png</url>
      <title>DEV Community: Zohar Babin</title>
      <link>https://dev.to/zoharbabin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zoharbabin"/>
    <language>en</language>
    <item>
      <title>AI assistants lie about citations. Here's how to catch them.</title>
      <dc:creator>Zohar Babin</dc:creator>
      <pubDate>Sat, 13 Jun 2026 04:26:52 +0000</pubDate>
      <link>https://dev.to/zoharbabin/ai-assistants-lie-about-citations-heres-how-to-catch-them-5608</link>
      <guid>https://dev.to/zoharbabin/ai-assistants-lie-about-citations-heres-how-to-catch-them-5608</guid>
      <description>&lt;p&gt;In 2023, a New York lawyer submitted a brief citing six cases that didn't exist. ChatGPT had hallucinated them — complete with plausible docket numbers, judges, and holdings. The lawyer was fined $5,000.&lt;/p&gt;

&lt;p&gt;In 2024, a Nature study found that roughly 1 in 6 AI-generated citations in scientific writing referred to nonexistent or misrepresented papers.&lt;/p&gt;

&lt;p&gt;In June 2026, the Ninth Circuit sanctioned an attorney for citing fabricated AI-generated precedents, calling it "an affront to the judicial system."&lt;/p&gt;

&lt;p&gt;These aren't edge cases. They're the normal behavior of a language model doing its job — generating plausible text — applied to a domain where plausibility isn't enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core problem
&lt;/h2&gt;

&lt;p&gt;Language models are trained to produce fluent, contextually appropriate text. A citation is just a specific format: author names, title, journal, year, DOI. The model has seen millions of them. It can generate one that looks exactly right.&lt;/p&gt;

&lt;p&gt;The model doesn't know whether the paper exists. It has no mechanism to check. It's pattern-matching, not remembering.&lt;/p&gt;

&lt;p&gt;When you ask an AI assistant to "find me papers on X" or "what does the literature say about Y," you're asking a pattern-matcher to retrieve facts. It will produce citations. Some will be real. Some will be fabricated. From the output alone, you can't tell which.&lt;/p&gt;

&lt;h2&gt;
  
  
  What verification actually requires
&lt;/h2&gt;

&lt;p&gt;To know a citation is real, you need to check it against an external, authoritative source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DOI resolution&lt;/strong&gt;: Does this DOI exist? Does it resolve to the paper being cited?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retraction status&lt;/strong&gt;: Has this paper been retracted, corrected, or flagged?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content match&lt;/strong&gt;: Does the paper actually say what the citation claims it says?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Link liveness&lt;/strong&gt;: Is the URL still live, or has it 404'd?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this can happen inside the model. It has to happen at runtime, against live external systems.&lt;/p&gt;

&lt;p&gt;This is the architecture that makes verification possible: an AI assistant calling external tools at inference time — not relying on training data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How web-researcher-mcp catches citation hallucinations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/zoharbabin/web-researcher-mcp" rel="noopener noreferrer"&gt;web-researcher-mcp&lt;/a&gt; is an open-source MCP server (MIT, single Go binary) that gives AI assistants like Claude and Cursor a verification layer at the tool-call layer.&lt;/p&gt;

&lt;p&gt;Here's what happens when an AI assistant calls &lt;code&gt;verify_citation&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verify_citation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"doi"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10.1038/s41586-021-03819-2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"claimed_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Highly accurate protein structure prediction with AlphaFold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"claimed_authors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Jumper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Evans"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"claimed_year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resolves the DOI against Crossref — the authoritative DOI registry&lt;/li&gt;
&lt;li&gt;Checks if the resolved record matches the claimed title, authors, and year&lt;/li&gt;
&lt;li&gt;Queries Crossref Retraction Watch for retraction/correction status&lt;/li&gt;
&lt;li&gt;Returns a structured result: &lt;code&gt;verified&lt;/code&gt;, &lt;code&gt;title_mismatch&lt;/code&gt;, &lt;code&gt;not_found&lt;/code&gt;, &lt;code&gt;retracted&lt;/code&gt;, or &lt;code&gt;unchecked&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A &lt;code&gt;not_found&lt;/code&gt; means the DOI doesn't exist in Crossref — strong evidence of fabrication. A &lt;code&gt;title_mismatch&lt;/code&gt; means the DOI resolves to a real paper, but not the one being cited — the model hallucinated the DOI for a real title, or swapped metadata.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auditing a full bibliography
&lt;/h2&gt;

&lt;p&gt;Individual citation checking is useful. Full bibliography auditing is where it gets powerful.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;audit_bibliography&lt;/code&gt; takes a complete reference list (BibTeX, RIS, CSL-JSON, or a plain list) and runs every entry through the verification pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Results for 12 references:
  ✓ verified        8  (DOI resolves, metadata matches, not retracted)
  ✗ not_found       2  (DOI absent from Crossref — possible fabrication)
  ⚠ title_mismatch  1  (DOI resolves to different paper)
  ~ unchecked       1  (book chapter, no DOI — can't verify)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two &lt;code&gt;not_found&lt;/code&gt; entries are the hallucinations. The &lt;code&gt;title_mismatch&lt;/code&gt; is a DOI that got swapped — the paper exists, but it's not the one being cited.&lt;/p&gt;

&lt;p&gt;This is what distinguishes citation verification from citation search. Search finds you papers. Verification tells you whether the papers the AI found actually exist and say what they're claimed to say.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead links and retraction detection
&lt;/h2&gt;

&lt;p&gt;Two more signals the tool surfaces:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retraction status&lt;/strong&gt; via Crossref Retraction Watch. When you call &lt;code&gt;scrape_page&lt;/code&gt; on a PDF or academic URL, the tool automatically checks the detected DOI against the retraction database. A retracted paper that the AI cited as evidence is worse than a fabricated one — it's a real paper that the scientific community has repudiated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead link archiving&lt;/strong&gt; via Wayback Machine. When a cited URL 404s, &lt;code&gt;archive_source&lt;/code&gt; can retrieve the archived version and return the Wayback URL. This is especially common in legal and policy citations where government pages move or are taken down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MCP architecture that makes this possible
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is an Anthropic-backed open standard that lets AI clients call external tools at inference time. It's roughly: the model decides to call a tool, the MCP server executes it, the result comes back as context.&lt;/p&gt;

&lt;p&gt;This architecture is exactly what citation verification requires. The model can't verify a citation from memory — but it can call &lt;code&gt;verify_citation&lt;/code&gt;, get a structured result, and incorporate that into its response.&lt;/p&gt;

&lt;p&gt;Any MCP-compatible client works: Claude Desktop, Cursor, Windsurf, VS Code with Continue, and others. Install is one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS/Linux&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/zoharbabin/web-researcher-mcp/main/install.sh | bash

&lt;span class="c"&gt;# Python users&lt;/span&gt;
uvx web-researcher-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero config — DuckDuckGo works out of the box. Add API keys for Google, Brave, or other providers to extend coverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this matters most
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Research and academia&lt;/strong&gt;: Before citing a paper, &lt;code&gt;verify_citation&lt;/code&gt; checks that it exists, isn't retracted, and matches the claimed metadata. &lt;code&gt;audit_bibliography&lt;/code&gt; sweeps a full reference list before submission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal work&lt;/strong&gt;: AI-assisted legal research is now common. Citation hallucination in legal briefs has led to sanctions in multiple jurisdictions. Verify every case cite before it goes in a brief.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Journalism and fact-checking&lt;/strong&gt;: Dead links, retracted studies, and papers that don't say what they're claimed to say. The tool catches all three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Any workflow where AI summarizes research&lt;/strong&gt;: The model will find real papers and fabricated ones with equal confidence. The only way to tell them apart is to check.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this doesn't do
&lt;/h2&gt;

&lt;p&gt;The tool verifies whether citations are real and whether they match claimed metadata. It doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read and summarize the full paper for you (though &lt;code&gt;scrape_page&lt;/code&gt; can fetch and extract content from open-access PDFs)&lt;/li&gt;
&lt;li&gt;Judge whether a paper's methodology is sound&lt;/li&gt;
&lt;li&gt;Replace domain expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is narrow: catch the most common and most dangerous failure mode — a citation that doesn't correspond to a real, non-retracted paper saying what it's claimed to say.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The project is &lt;a href="https://github.com/zoharbabin/web-researcher-mcp" rel="noopener noreferrer"&gt;open source on GitHub&lt;/a&gt;, MIT licensed, and launching today on &lt;a href="https://www.producthunt.com/posts/web-researcher-mcp" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;35+ tools covering web search (Google/Brave/DuckDuckGo/Tavily/Exa), academic search (PubMed, academic indexes), patent search, citation verification, retraction detection, dead-link archiving, bibliography audit, and more.&lt;/p&gt;

&lt;p&gt;If you work with AI-generated research, the citation layer is the one place where "good enough" isn't good enough. A hallucinated citation in a legal brief, a grant proposal, or a published paper can do real damage. The tool exists to close that gap.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://dev.to/zoharbabin/from-nodejs-to-go-rebuilding-an-mcp-server-for-production-oil"&gt;From Node.js to Go: Rebuilding an MCP Server for Production&lt;/a&gt; — the engineering story behind the rewrite.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>go</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Claude Fable 5: When to Reach for the Frontier — and When Not To</title>
      <dc:creator>Zohar Babin</dc:creator>
      <pubDate>Wed, 10 Jun 2026 17:46:03 +0000</pubDate>
      <link>https://dev.to/zoharbabin/claude-fable-5-when-to-reach-for-the-frontier-and-when-not-to-243d</link>
      <guid>https://dev.to/zoharbabin/claude-fable-5-when-to-reach-for-the-frontier-and-when-not-to-243d</guid>
      <description>&lt;p&gt;&lt;em&gt;Lessons and best practices for routing work between Fable 5 and the cheaper Claude tiers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;released Claude Fable 5 on June 9, 2026&lt;/a&gt; — the first "Mythos-class" model made generally available, a new capability tier above Opus. It is state-of-the-art on nearly every tested benchmark, and Anthropic's framing is unusually specific: &lt;strong&gt;"the longer and more complex the task, the larger Fable 5's lead over our other models."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That sentence is the whole routing strategy.&lt;/p&gt;

&lt;p&gt;We audited a working portfolio of ~20 active projects — agentic platforms, SaaS products, SDKs, data pipelines, documentation suites, deployment runbooks, and high-stakes financial analysis — against what Fable 5 actually changes, and a clear pattern emerged: &lt;strong&gt;most day-to-day work does not benefit from Fable 5, and the work that does benefits enormously.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you only take one thing away, take this: &lt;strong&gt;Sonnet 4.6 by default. Fable 5 for the long, the hard, and the expensive-to-get-wrong. Haiku 4.5 for the fan-out. Opus 4.8 when a safety classifier would ruin your day.&lt;/strong&gt; The rest of this article is the reasoning, the sharp edges, and the cost math.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Fable 5 actually is (and isn't)
&lt;/h2&gt;

&lt;p&gt;Three facts shape every decision below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. It's a tier, not a feature set.&lt;/strong&gt; Fable 5's API surface matches Opus 4.8's almost exactly: same 1M context, same 128K output, same adaptive-thinking-only design, with one new sharp edge covered below. There is no new capability checkbox. What you're buying is &lt;em&gt;how far the model can go unattended&lt;/em&gt;. Stripe &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;reported completing, in a day&lt;/a&gt;, a codebase-wide migration across 50 million lines of Ruby that would have taken a team over two months. Early testers quoted in Anthropic's announcement describe long-horizon problems "that were out of reach for earlier models."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The sticker says 2× Opus. Your bill will say more.&lt;/strong&gt; &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;$10 per million input tokens, $50 per million output&lt;/a&gt; — exactly double Opus 4.8's $5/$25, with no long-context surcharge. But the rate card understates the real multiplier, because a reasoning-heavy model generates more tokens per task: it thinks longer, plans more, and verifies more, and you pay for every one of those thinking tokens even though you never see them by default (more on that below). &lt;a href="https://simonwillison.net/" rel="noopener noreferrer"&gt;Simon Willison&lt;/a&gt;, a veteran developer (co-creator of Django) whose independent AI reviews are widely read in the field, &lt;a href="https://simonwillison.net/2026/Jun/9/claude-fable-5/" rel="noopener noreferrer"&gt;consumed $110 of tokens in a single ~5.5-hour day&lt;/a&gt; of hands-on testing, and one Max-plan subscriber's &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1u1cvkc/fable_5_is_insanely_good_but_watch_your_usage_i/" rel="noopener noreferrer"&gt;launch-week Reddit post&lt;/a&gt; described a usage meter ticking up "2% a minute." The model is also &lt;em&gt;slower&lt;/em&gt; per response. None of this is a scandal — but all of it matters for routing. (Subscription users, note the calendar: Fable 5 is &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;included free on Pro/Max/Team plans only through June 22, 2026&lt;/a&gt;; after that it draws prepaid usage credits at API rates.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. It ships with safety classifiers — and what happens when one trips depends on where you run it.&lt;/strong&gt; Fable 5 is the same underlying model as Claude Mythos 5 (restricted to vetted cyber-defense partners and, soon, select biology researchers), made publicly releasable by classifiers covering cybersecurity, biology/chemistry, and distillation attempts. On Anthropic's own surfaces, a tripped classifier falls back to Opus 4.8 automatically and tells you. On the raw API, the request returns &lt;code&gt;stop_reason: "refusal"&lt;/code&gt; and your harness decides what happens next: &lt;a href="https://platform.claude.com/cookbook/fable-5-fallback-billing-guide" rel="noopener noreferrer"&gt;server-side fallback is an opt-in beta&lt;/a&gt;, and isn't available on Bedrock, Vertex, or Foundry at all. Anthropic says trips happen in &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;under 5% of sessions&lt;/a&gt;, but the classifiers are deliberately tuned conservative, so &lt;em&gt;benign&lt;/em&gt; security and biology work can trip them. Plan for this (see "Security research" and "Handle the refusal" below).&lt;/p&gt;




&lt;h2&gt;
  
  
  When Fable 5 is the right tool
&lt;/h2&gt;

&lt;p&gt;Across a varied real-world portfolio, four workload shapes stood out as clear Fable territory.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Long-horizon autonomous engineering
&lt;/h3&gt;

&lt;p&gt;Multi-hour to multi-day agentic sessions: large migrations, cross-cutting refactors, greenfield builds from a finished spec, overnight runs expected to complete &lt;em&gt;without mid-course human correction&lt;/em&gt;. This is the headline capability. If your sessions routinely run 4–8 hours with a real CI gauntlet at the end, the model that finishes correctly the first time is cheaper than the model that needs two retries — even at 2× the token price. And higher effort up front often &lt;em&gt;reduces&lt;/em&gt; total cost on this class of work, because turn count drops (more on this in the cost section).&lt;/p&gt;

&lt;p&gt;A special case worth calling out: &lt;strong&gt;greenfield implementation from a completed blueprint.&lt;/strong&gt; If you've invested in a detailed design doc, handing the entire spec to Fable 5 in a single opening prompt and letting it run at high effort is the single best-fit usage pattern the model has. And note that this category isn't just for "agent teams" — if you have a finished spec and an ambitious build, you're in it whether you run a fleet of CI-integrated agents or one Claude Code session overnight.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. High-stakes analysis where error is expensive
&lt;/h3&gt;

&lt;p&gt;Multi-document synthesis with real money or legal exposure attached: M&amp;amp;A due diligence, multi-jurisdictional tax positions, regulatory filings, senior-level financial reasoning (Fable 5 currently &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;tops Hebbia's Finance Benchmark&lt;/a&gt;, a test of senior-analyst-level reasoning built by the AI document-analysis firm used by major banks and funds). When a subtle reasoning error costs five or six figures, token price is noise. Fable 5 is also a notably stronger &lt;em&gt;thought partner&lt;/em&gt; — more willing to push back, kill its own incorrect beliefs, and reason from first principles. That's exactly the temperament you want reviewing a position you're already invested in.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Judge and synthesis roles in multi-agent systems
&lt;/h3&gt;

&lt;p&gt;You rarely need the frontier model everywhere in an agent pipeline. The pattern that works: &lt;strong&gt;cheap models fan out, the expensive model decides.&lt;/strong&gt; Domain-specialist agents, extractors, and verifiers run on Sonnet or Haiku; the judge, executive-synthesis, and cross-domain reasoning agents run on the top tier. Quality of the final output lives disproportionately in the synthesis step.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Vision-heavy document and design work
&lt;/h3&gt;

&lt;p&gt;Fable 5 is the new state of the art on vision: extracting precise numbers from dense scientific figures, interpreting charts and tables nested in PDFs, rebuilding application source from screenshots, and — notably for coders — &lt;em&gt;visually critiquing its own output against design goals&lt;/em&gt;. Document-heavy finance, legal, and analytics pipelines that previously needed heavy scaffolding need less of it now.&lt;/p&gt;




&lt;h2&gt;
  
  
  When a lower model serves you better
&lt;/h2&gt;

&lt;p&gt;This list matters more than the previous one, because it covers most of what most teams do all day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runbook execution.&lt;/strong&gt; If the intelligence lives in the runbook — deploy scripts, version-bump checklists, bundle-upload-verify sequences — the model is just following documented steps. That's Sonnet work, and the verification steps (curl the endpoint, grep for the version string) are Haiku work. Paying $50 per million output tokens to execute a procedure your docs already encode was the most common waste pattern in our portfolio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoped maintenance on documented codebases.&lt;/strong&gt; Bug fixes, small features, and quirk workarounds in mature projects with good CLAUDE.md files and accumulated memory. One real data point: a three-day plugin-maintenance sprint on an Opus-class model cost roughly $700 in compute. At 2× pricing, the same token volume on Fable would have run ~$1,400 — likely somewhat less in practice, given Fable's lower cache threshold and tendency to finish in fewer turns — with no meaningful quality difference, because the constraints were already written down. Sonnet 4.6 handles this class well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Templated and pattern-following work.&lt;/strong&gt; Documentation guides that follow an established template with dozens of worked examples, configuration changes, landing pages built from a skill, i18n updates. The pattern &lt;em&gt;is&lt;/em&gt; the intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anything latency-sensitive or high-volume.&lt;/strong&gt; Fable 5 is slow. Classification, extraction, chat-style products, interactive coding sessions, and high-throughput pipelines belong on Sonnet or Haiku. This hasn't changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security research and recon-shaped work — for a different reason.&lt;/strong&gt; Legitimate penetration-test reporting, OSINT footprinting, security audits, and vulnerability triage pattern-match the offensive-cyber classifier. The work is benign; the classifier is conservative by design — launch-week reports of false positives ranged from code reviews to &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1u1cvkc/fable_5_is_insanely_good_but_watch_your_usage_i/" rel="noopener noreferrer"&gt;a question about Hermitian matrices&lt;/a&gt;. On a managed surface you still get an answer (from Opus 4.8), but if your harness pins Fable 5 on the raw API with no refusal handling, a tripped classifier stops the run. For security-flavored sessions, either run Opus 4.8 directly or make sure your integration handles the refusal gracefully. The same applies to life-sciences work until Anthropic's trusted-access program for biology opens up. (If "but isn't security exactly what Mythos 5 is for?" just crossed your mind — yes, and the next section is for you.)&lt;/p&gt;

&lt;p&gt;And if your team's workload is mostly in this section — maintenance, support, content operations, runbook automation — the takeaway is simple: you can skip the frontier tier entirely for now. Route to Sonnet 4.6 by default and you'll capture nearly all the value at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A useful rule of thumb:&lt;/strong&gt; ask "would a competent person with my docs in front of them find this task &lt;em&gt;hard&lt;/em&gt;, or just &lt;em&gt;laborious&lt;/em&gt;?" Laborious goes to Sonnet. Hard goes to Fable — and so does laborious-at-a-scale where staying coherent unattended for hours &lt;em&gt;is&lt;/em&gt; the hard part, which is why a 50-million-line migration is Fable territory even though no single edit in it is difficult.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mythos 5 vs. Fable 5: same model, different gates
&lt;/h2&gt;

&lt;p&gt;The naming invites confusion, so let's clear it up: &lt;strong&gt;Mythos 5 and Fable 5 are the same underlying model.&lt;/strong&gt; Anthropic says so &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;explicitly&lt;/a&gt;: "the safeguards are what distinguish the two models... and are why we've given them different names." Mythos 5 has safeguards lifted for its vetted audience — cyber today, bio/chem for the upcoming biology cohort — while Fable 5 keeps all the gates and routes gated queries to Opus 4.8. Same weights, same $10/$50 pricing, same 30-day retention. By Anthropic's own data, more than 95% of Fable sessions involve no fallback at all — and in those sessions, "Fable 5's performance is effectively the same as that of Mythos 5." You are not missing out on a smarter model; you're missing out on the gated domains, and even Mythos partners only get the gates their program unlocks.&lt;/p&gt;

&lt;p&gt;So who actually gets the ungated version? Today, almost nobody. Mythos 5 is &lt;a href="https://www.anthropic.com/project/glasswing" rel="noopener noreferrer"&gt;restricted to Project Glasswing partners&lt;/a&gt; — a US-government-coordinated cyber-defense program whose roster reads like an infrastructure who's-who (AWS, Apple, Cisco, CrowdStrike, Google, JPMorganChase, Microsoft, NVIDIA, Palo Alto Networks, plus ~150 more critical-infrastructure organizations). Mythos 5 supersedes Claude Mythos Preview, the April 2026 gated research release that found over ten thousand high- or critical-severity vulnerabilities — at $25/$125 per million tokens, more than double what Mythos 5 costs now. A &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;trusted-access program for biology&lt;/a&gt; is opening "in the coming weeks" — note it grants Fable 5 with the bio/chem gates removed (cyber gates intact), not Mythos itself — and a broader systematic application path for cybersecurity organizations is promised but not dated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why our routing table sends security work to Opus 4.8, not Mythos 5.&lt;/strong&gt; For a security team without Glasswing access — which is to say, nearly every security team — the realistic ladder looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Opus 4.8 is the working default.&lt;/strong&gt; It's the model Fable's classifier hands your cyber queries to anyway, so pinning it directly just skips the detour and the split-billing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opus 4.8 + the Cyber Verification Program&lt;/strong&gt; is the self-serve upgrade. The &lt;a href="https://support.claude.com/en/articles/14604842-real-time-cyber-safeguards-on-claude" rel="noopener noreferrer"&gt;CVP&lt;/a&gt; is a free, application-based program (decisions target two business days) that lifts the &lt;em&gt;dual-use&lt;/em&gt; cyber blocks — vulnerability exploitation, offensive tooling for defensive purposes — on Opus models for verified organizations. Prohibited-use blocks (ransomware, mass exfiltration) stay regardless; it isn't offered on Bedrock or Vertex; zero-data-retention orgs aren't currently eligible; and note carefully: it's an Opus program. It does not raise Fable 5's thresholds and it does not grant Mythos access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fable 5&lt;/strong&gt; for everything &lt;em&gt;outside&lt;/em&gt; the gated domains — a security company's product engineering, data pipelines, and long-horizon builds are still Fable territory like anyone else's.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mythos 5&lt;/strong&gt; only if you're a Glasswing partner — or, eventually, through the broader cyber application path Anthropic has promised but not dated. If your organization defends critical infrastructure at scale, that's worth watching for; if not, items 1–3 are the menu.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The summary for the impatient: Mythos 5 isn't the model you're missing — it's the &lt;em&gt;permission slip&lt;/em&gt; you're missing, and for 95%+ of work the permission slip changes nothing.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to use Fable 5 effectively
&lt;/h2&gt;

&lt;h3&gt;
  
  
  API configuration: know the sharp edges
&lt;/h3&gt;

&lt;p&gt;Fable 5 inherits Opus 4.7/4.8's request surface, but several details differ in ways that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive thinking only.&lt;/strong&gt; &lt;a href="https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking" rel="noopener noreferrer"&gt;&lt;code&gt;thinking: {type: "adaptive"}&lt;/code&gt;&lt;/a&gt;. Fixed &lt;code&gt;budget_tokens&lt;/code&gt; returns a 400, as do &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt;, and &lt;code&gt;top_k&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No explicit &lt;code&gt;disabled&lt;/code&gt;.&lt;/strong&gt; Unique to Fable 5: an explicit &lt;code&gt;thinking: {type: "disabled"}&lt;/code&gt; returns a 400 (it's accepted on Opus 4.8). Omit the &lt;code&gt;thinking&lt;/code&gt; parameter instead, or run adaptive — either way, the model thinks, and you pay for it (see "The bill" below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't ask it to transcribe its reasoning.&lt;/strong&gt; Prompting Fable 5 to echo its internal chain of thought as response text can trip the &lt;code&gt;reasoning_extraction&lt;/code&gt; classifier and refuse the turn. If you need reasoning visibility, set &lt;code&gt;thinking: {type: "adaptive", display: "summarized"}&lt;/code&gt; and read the structured thinking blocks instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set generous &lt;code&gt;max_tokens&lt;/code&gt; and stream.&lt;/strong&gt; At &lt;code&gt;xhigh&lt;/code&gt;/&lt;code&gt;max&lt;/code&gt; effort, give the model ≥64K output room; stream anything large.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle the refusal.&lt;/strong&gt; Check &lt;code&gt;stop_reason: "refusal"&lt;/code&gt; and the structured &lt;code&gt;stop_details&lt;/code&gt; (category: &lt;code&gt;cyber&lt;/code&gt;, &lt;code&gt;bio&lt;/code&gt;, &lt;code&gt;reasoning_extraction&lt;/code&gt;, or null — see &lt;a href="https://support.claude.com/en/articles/15363606" rel="noopener noreferrer"&gt;Anthropic's safeguards guide&lt;/a&gt;). On Anthropic's own surfaces the fallback to Opus 4.8 is automatic; on the API, &lt;a href="https://platform.claude.com/cookbook/fable-5-fallback-billing-guide" rel="noopener noreferrer"&gt;server-side fallback is an opt-in beta&lt;/a&gt; (and unavailable on Bedrock, Vertex, and Foundry), so decide deliberately what your harness does when a refusal fires. The billing side of this is covered below.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data retention:&lt;/strong&gt; Mythos-class traffic carries a &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;mandatory 30-day retention policy&lt;/a&gt; (safety-only, not used for training, with logged human access). On some cloud platforms you must &lt;a href="https://aws.amazon.com/blogs/aws/anthropic-claude-fable-5-on-aws-mythos-class-capabilities-with-built-in-safeguards-now-available/" rel="noopener noreferrer"&gt;explicitly opt in&lt;/a&gt; before the model is invocable. Factor this into compliance review &lt;em&gt;before&lt;/em&gt; adopting, not after.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The bill: what you're actually paying for
&lt;/h3&gt;

&lt;p&gt;Fable 5's pricing has a sticker and a story. The sticker is simple: 2× Opus. The story is everything that multiplies on top of it — and almost all of it is under your control. Five levers, roughly in order of impact:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Effort is the biggest dial on the dashboard.&lt;/strong&gt; &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort" rel="noopener noreferrer"&gt;&lt;code&gt;output_config: {effort: ...}&lt;/code&gt;&lt;/a&gt; with &lt;code&gt;low | medium | high | xhigh | max&lt;/code&gt; governs &lt;em&gt;all&lt;/em&gt; output — thinking, tool calls, and text. In &lt;a href="https://news.ycombinator.com/item?id=48464054" rel="noopener noreferrer"&gt;Willison's single-prompt sweep&lt;/a&gt;, &lt;code&gt;max&lt;/code&gt; produced roughly 7.5× the output tokens of &lt;code&gt;low&lt;/code&gt; on an identical prompt, and &lt;code&gt;high&lt;/code&gt; actually used fewer tokens than &lt;code&gt;medium&lt;/code&gt; on one run. The relationship between effort and total cost isn't monotonic, because higher effort often finishes in fewer turns. So: start at &lt;code&gt;high&lt;/code&gt; (the default), sweep on your own evals, and reserve &lt;code&gt;xhigh&lt;/code&gt;/&lt;code&gt;max&lt;/code&gt; for hard, latency-insensitive problems. Several of the scariest launch-week burn reports turned out to involve max effort, 1M contexts, and multi-agent workflows all at once — three multipliers, none of them mandatory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. You pay for thinking you never see.&lt;/strong&gt; Adaptive thinking can't be turned off, and &lt;a href="https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking" rel="noopener noreferrer"&gt;you're billed for the model's &lt;em&gt;full internal reasoning&lt;/em&gt;&lt;/a&gt;, not the (by default, empty) thinking text in the response. If your billed output tokens dwarf your visible output, that's not a bug — check &lt;code&gt;usage.output_tokens_details.thinking_tokens&lt;/code&gt; to see the reasoning share. Budget for it; don't fight it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Caching is unusually kind to Fable.&lt;/strong&gt; &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Cache reads cost 10% of the input price&lt;/a&gt;, standard multipliers apply on writes, and Fable 5's minimum cacheable prefix is 2,048 tokens versus 4,096 on Opus 4.6–4.8 — so mid-sized prompts cache on Fable that silently wouldn't on Opus. At $10/MTok input, verify your cache hits (&lt;code&gt;usage.cache_read_input_tokens&lt;/code&gt;) before optimizing anything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Batch anything that can wait.&lt;/strong&gt; The &lt;a href="https://platform.claude.com/docs/en/build-with-claude/batch-processing" rel="noopener noreferrer"&gt;Batches API&lt;/a&gt; supports Fable 5 at the standard 50% discount: $5/$25, which is interactive-Opus pricing for frontier-quality output. Overnight evals, bulk document analysis, scheduled report generation — if nobody is watching the response stream, it belongs in a batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Classifier trips have a billing story too.&lt;/strong&gt; A request refused on input costs $0. With server-side fallback enabled, the fallback's input is billed at cache-read rates. But a &lt;em&gt;mid-stream&lt;/em&gt; block split-bills: Fable rates for everything generated before the block, Opus rates after (the &lt;a href="https://platform.claude.com/cookbook/fable-5-fallback-billing-guide" rel="noopener noreferrer"&gt;fallback billing guide&lt;/a&gt; covers the full mechanics). One &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1u1cvkc/fable_5_is_insanely_good_but_watch_your_usage_i/" rel="noopener noreferrer"&gt;launch-week report&lt;/a&gt; described burning 200K Fable-priced tokens on a code review before the classifier handed the session to Opus. For security- and biology-adjacent work, it's cheaper to start on Opus 4.8 than to pay frontier rates for a run the classifier was always going to interrupt.&lt;/p&gt;

&lt;p&gt;Then the guard rails. &lt;code&gt;max_tokens&lt;/code&gt; is the only hard cap (thinking and text combined). &lt;strong&gt;&lt;a href="https://platform.claude.com/docs/en/build-with-claude/task-budgets" rel="noopener noreferrer"&gt;Task budgets&lt;/a&gt;&lt;/strong&gt; (beta, minimum 20K tokens) give the model a live countdown for an entire agentic loop and it self-moderates — the right pacing tool for long autonomous runs, though it's advisory, not a wall. One caution: set the budget once and let the server count down; mutating it client-side between turns invalidates your prompt cache. Finally, &lt;strong&gt;track spend per project.&lt;/strong&gt; Usage-analytics tooling that breaks cost down by project and agent (Claude Code's &lt;a href="https://code.claude.com/docs/en/costs" rel="noopener noreferrer"&gt;&lt;code&gt;/usage&lt;/code&gt; and spend limits&lt;/a&gt;, or the Console for API orgs) turns the routing decisions in this article from theory into a weekly habit. You'll likely find, as we did, that a handful of long-horizon projects justify the frontier tier and everything else doesn't.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompting: front-load the specification
&lt;/h3&gt;

&lt;p&gt;The highest-leverage practice: &lt;strong&gt;give the full task specification in one well-specified opening turn, then let the model run.&lt;/strong&gt; Fable 5's long-horizon coherence comes from planning against a clear goal. Ambiguous asks drip-fed across many turns waste exactly the capability you're paying double for. Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State what "done" looks like in checkable terms — not "a good report" but "a CSV with a numeric price column per SKU, validated against the source totals."&lt;/li&gt;
&lt;li&gt;Include constraints, non-goals, and the verification method up front.&lt;/li&gt;
&lt;li&gt;For managed agent platforms, encode "done" as a gradeable rubric/outcome so the harness iterates and grades automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two behavioral notes carried over from recent Opus generations still apply: the model follows instructions &lt;em&gt;literally&lt;/em&gt; (soften any "CRITICAL: YOU MUST" scaffolding written for older, more reluctant models), and it's conservative about reaching for optional capabilities — if you want it using subagents, file-based memory, or custom tools, say &lt;em&gt;when&lt;/em&gt; to use them, in the system prompt and in each tool's own description.&lt;/p&gt;

&lt;h3&gt;
  
  
  Harness and tooling: the model is half the system
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier your subagents.&lt;/strong&gt; Even with Fable 5 in the main loop, search/explore/verify subagents should run on Haiku or Sonnet. On tasks where subagent fan-out dominates spend, this approaches a 10× cost difference with near-identical results. (And know your workflow multipliers: &lt;a href="https://code.claude.com/docs/en/costs" rel="noopener noreferrer"&gt;multi-agent team sessions run roughly 7× the tokens&lt;/a&gt; of a solo session.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give it file-based memory on long tasks.&lt;/strong&gt; &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;Anthropic's own testing&lt;/a&gt; showed persistent memory improved Fable 5's long-game performance roughly &lt;em&gt;three times more&lt;/em&gt; than it improved Opus 4.8's — this model is distinctly better at writing and using its own notes. Maintain per-project memory files (one lesson per file, corrections and confirmed approaches alike) and the model compounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invest in CLAUDE.md / project instructions.&lt;/strong&gt; Documented constraints, build commands, and verification procedures are what let a long autonomous run self-correct instead of drifting. Counterintuitively, the better your docs, the &lt;em&gt;less&lt;/em&gt; often you need Fable — and the better Fable performs when you do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use skills for repeatable procedures.&lt;/strong&gt; Anything you do more than twice — deploy flows, report formats, review checklists — belongs in a skill the model loads on demand, not in tokens you re-explain per session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify with fresh-context agents, not self-critique.&lt;/strong&gt; For audits, reviews, and migrations, structure the work as decompose → execute → adversarially verify → synthesize, with verifiers spawned in clean contexts. Fable 5 at the highest effort already reflects on and validates its own work, but independent verification still beats self-review for anything you'll ship.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A simple routing default
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Most work (the default): scoped features, maintenance, templated docs, runbook execution, interactive coding&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Sonnet 4.6&lt;/strong&gt; (escalate to Opus 4.8 when it stalls)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-day autonomous builds, large migrations, high-stakes synthesis, judge/synthesis agents, frontier vision tasks&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fable 5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subagent fan-out, verification steps, classification, extraction, high-volume pipelines&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Haiku 4.5&lt;/strong&gt; (Sonnet 4.6 for heavier subagents)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security- or biology-flavored sessions where a classifier trip would break your harness&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Opus 4.8&lt;/strong&gt; directly (+ CVP verification for dual-use cyber work — see the Mythos section)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Fable 5 doesn't make Sonnet or Opus obsolete — it makes a &lt;em&gt;previously impossible class of work&lt;/em&gt; possible, at a price and latency that punish using it for everything else. Think of it as hiring a brilliant specialist who bills by the hour and thinks out loud on the clock: transformative on the right problem, ruinous as a receptionist.&lt;/p&gt;

&lt;p&gt;The teams that get the most from it will be the ones with the discipline to route: specs and stakes go up to the frontier; procedures and patterns stay on the efficient tiers; verification runs cheap everywhere. The model rewards exactly the engineering hygiene — written specs, documented constraints, checkable definitions of done, persistent memory — that made your systems better before it existed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic, &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;"Claude Fable 5 and Claude Mythos 5"&lt;/a&gt; — official announcement (June 9, 2026): capabilities, classifiers, fallback behavior, pricing, availability, data retention.&lt;/li&gt;
&lt;li&gt;Anthropic, &lt;a href="https://www.anthropic.com/project/glasswing" rel="noopener noreferrer"&gt;Project Glasswing&lt;/a&gt; and &lt;a href="https://support.claude.com/en/articles/14604842-real-time-cyber-safeguards-on-claude" rel="noopener noreferrer"&gt;Real-time cyber safeguards on Claude&lt;/a&gt; — Mythos access, partner roster, Mythos Preview pricing, and the Cyber Verification Program's scope and application process.&lt;/li&gt;
&lt;li&gt;Anthropic platform documentation — pricing, effort parameter, adaptive thinking, task budgets, prompt caching, and the &lt;a href="https://platform.claude.com/cookbook/fable-5-fallback-billing-guide" rel="noopener noreferrer"&gt;Fable 5 fallback billing guide&lt;/a&gt; (refusal/fallback billing mechanics, batch availability).&lt;/li&gt;
&lt;li&gt;AWS News Blog, &lt;a href="https://aws.amazon.com/blogs/aws/anthropic-claude-fable-5-on-aws-mythos-class-capabilities-with-built-in-safeguards-now-available/" rel="noopener noreferrer"&gt;"Anthropic Claude Fable 5 on AWS"&lt;/a&gt; — Bedrock/Claude Platform availability, data-retention opt-in, fallback pricing mechanics.&lt;/li&gt;
&lt;li&gt;Simon Willison, &lt;a href="https://simonwillison.net/2026/Jun/9/claude-fable-5/" rel="noopener noreferrer"&gt;"Initial impressions of Claude Fable 5"&lt;/a&gt; — independent hands-on review: real-world coding sessions, cost data, effort-level token sweep.&lt;/li&gt;
&lt;li&gt;Community launch-week reports — &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1u1cvkc/fable_5_is_insanely_good_but_watch_your_usage_i/" rel="noopener noreferrer"&gt;r/ClaudeAI thread&lt;/a&gt; (144 comments of burn-rate anecdotes and the emerging plan-with-Fable, execute-with-Opus pattern) and &lt;a href="https://news.ycombinator.com/item?id=48463982" rel="noopener noreferrer"&gt;Hacker News discussion&lt;/a&gt;. Treated as anecdote, not data.&lt;/li&gt;
&lt;li&gt;Our portfolio audit (June 2026, including launch-week usage) — routing analysis across ~20 active projects spanning agentic platforms, SDKs, data pipelines, and analysis workloads.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claude</category>
      <category>mythos</category>
      <category>fable</category>
      <category>ai</category>
    </item>
    <item>
      <title>From Node.js to Go: Rebuilding an MCP Server for Production</title>
      <dc:creator>Zohar Babin</dc:creator>
      <pubDate>Tue, 19 May 2026 20:48:20 +0000</pubDate>
      <link>https://dev.to/zoharbabin/from-nodejs-to-go-rebuilding-an-mcp-server-for-production-oil</link>
      <guid>https://dev.to/zoharbabin/from-nodejs-to-go-rebuilding-an-mcp-server-for-production-oil</guid>
      <description>&lt;p&gt;This is the story of why I rebuilt &lt;a href="https://github.com/zoharbabin/google-researcher-mcp" rel="noopener noreferrer"&gt;google-researcher-mcp&lt;/a&gt; (Node.js/TypeScript) from scratch as &lt;a href="https://github.com/zoharbabin/web-researcher-mcp" rel="noopener noreferrer"&gt;web-researcher-mcp&lt;/a&gt; (Go), and what the lessons learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Starting Point
&lt;/h2&gt;

&lt;p&gt;The original project — &lt;code&gt;google-researcher-mcp&lt;/code&gt; — was a TypeScript/Node.js MCP server distributed via npm. It had real traction: 36 GitHub stars, 6,500+ npm downloads, 860+ tests, and active users. But five critical issues kept surfacing that couldn't be solved within the existing architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rewrite in Go (Not Refactored)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Orphan Processes (Issue #108)
&lt;/h3&gt;

&lt;p&gt;npx spawns deeply nested process trees. When the parent MCP client (Claude Desktop, Cursor) crashes or closes unexpectedly, the Node.js process doesn't receive a signal — it keeps running, consuming memory and holding file locks.&lt;/p&gt;

&lt;p&gt;Myself and collaborators spent three versions (v6.2.0 through v6.4.0) building increasingly complex orphan detection: a Worker thread watchdog with CPU spin detection, three-layer parent-alive checks, and graceful degradation. It was all band-aids on a fundamental runtime limitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go fix&lt;/strong&gt;: A single static binary. No runtime process tree. EOF on stdin = immediate exit. The entire problem category disappeared.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Discontinuing "Entire Web" Search (Issue #107)
&lt;/h3&gt;

&lt;p&gt;Google announced it would be discontinuing support for Programmable Search Engines configured to search the "entire web." The project was named &lt;code&gt;google-researcher-mcp&lt;/code&gt; — the dependency on a single search provider was an foundational risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go fix&lt;/strong&gt;: Interface-driven &lt;code&gt;search.Provider&lt;/code&gt; with multiple implementations, plus a Router that provides multi-provider routing with automatic failover via per-provider circuit breakers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternative Search Engines (Issue #55)
&lt;/h3&gt;

&lt;p&gt;Users wanted Brave, Bing (go figure), and other providers. But the TypeScript codebase was too tightly coupled to Google's API response format — the shared directory (41 files) made every change risky and far-reaching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go fix&lt;/strong&gt;: A clean &lt;code&gt;Provider&lt;/code&gt; interface — each adapter normalizes provider-specific responses to common types (&lt;code&gt;SearchResult&lt;/code&gt;, &lt;code&gt;ImageResult&lt;/code&gt;, &lt;code&gt;NewsResult&lt;/code&gt;). Adding a new provider is one file implementing one interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Redis Caching (Issue #72)
&lt;/h3&gt;

&lt;p&gt;The in-memory cache was lost on every process restart — which happened frequently with npx-launched servers. The complex persistence manager offered four strategies (Periodic, WriteThrough, OnShutdown, Hybrid), but none reliably survived the volatile process lifecycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go fix&lt;/strong&gt;: A &lt;code&gt;cache.Cache&lt;/code&gt; interface with a hybrid implementation: memory LRU + AES-encrypted disk + optional Redis. Simple, testable, and it never loses data because the disk layer persists across restarts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monolithic Architecture (Issue #40)
&lt;/h3&gt;

&lt;p&gt;The project had 100+ source files but a tightly coupled &lt;code&gt;shared/&lt;/code&gt; directory with 41 files. Adding a single tool required touching 4+ documentation sections, and the import graph made refactoring perilous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go fix&lt;/strong&gt;: One package per concern. Tool handlers are self-contained files. Adding a tool means writing one file and one line in the registry.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed Architecturally
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Node.js (old)&lt;/th&gt;
&lt;th&gt;Go (new)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Distribution&lt;/td&gt;
&lt;td&gt;npm/npx (runtime required)&lt;/td&gt;
&lt;td&gt;Single static binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;430MB idle (80MB after optimization)&lt;/td&gt;
&lt;td&gt;~25MB baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup&lt;/td&gt;
&lt;td&gt;2-4 seconds (lazy imports)&lt;/td&gt;
&lt;td&gt;&amp;lt;100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process lifecycle&lt;/td&gt;
&lt;td&gt;Worker thread watchdog&lt;/td&gt;
&lt;td&gt;EOF detection, no orphans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search providers&lt;/td&gt;
&lt;td&gt;Google only&lt;/td&gt;
&lt;td&gt;Multiple providers + fallback routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency&lt;/td&gt;
&lt;td&gt;Event loop + async/await&lt;/td&gt;
&lt;td&gt;Goroutines + semaphores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type safety&lt;/td&gt;
&lt;td&gt;TypeScript + Zod&lt;/td&gt;
&lt;td&gt;Go type system + struct tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;860+ Jest tests&lt;/td&gt;
&lt;td&gt;Table-driven tests + race detector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scraping&lt;/td&gt;
&lt;td&gt;Playwright (heavy)&lt;/td&gt;
&lt;td&gt;4-tier pipeline (lightweight first)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Don't Fight Your Runtime
&lt;/h3&gt;

&lt;p&gt;Node.js process management is fundamentally fragile for long-lived servers launched via npx. The runtime doesn't support robust parent-death detection, and the nested process tree (npx → node → worker) makes signal propagation unreliable. We spent three versions building increasingly complex orphan detection. Go's single binary eliminated the entire category of problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: If you're spending significant engineering effort working around your runtime's limitations, that's a signal to evaluate whether the runtime fits the problem.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Side note: looking for a better runtime I looked into both Go and Rust (isn't Rust aweoms!?). Go won primarily for its lightweight goroutines exceling at I/O-bound operations, and the mcp-go SDK is superbly maintained. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Interface-Driven Design Enables Fearless Extension
&lt;/h3&gt;

&lt;p&gt;Adding Brave Search in the Go version was one file implementing one interface — about 200 lines including tests. In the Node.js version, the equivalent change would have touched 6+ files due to tightly coupled imports in the shared directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: When you know extension is likely (new providers, new tools), invest in clean interfaces upfront. The interface is the specification; implementations are interchangeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Memory Matters for MCP Servers
&lt;/h3&gt;

&lt;p&gt;MCP servers run alongside AI assistants on developer machines. They're always-on background processes. A 430MB idle memory footprint was unacceptable — users would notice and uninstall. Go's ~25MB baseline lets the server stay resident without impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: For developer tools that run continuously, memory efficiency is a feature, not an optimization. Choose your runtime accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Caching Architecture Should Be Boring
&lt;/h3&gt;

&lt;p&gt;The old project had four persistence strategies with complex heuristics for when to flush. The new one has: memory LRU + optional encrypted disk + optional Redis. Each layer is simple and independently testable. No heuristics, no race conditions, no data loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Boring infrastructure is reliable infrastructure. If your caching layer needs its own debugging session, it's too complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Documentation Should Be Drift-Resistant
&lt;/h3&gt;

&lt;p&gt;The old project required updating four separate documentation files per new tool. Inevitably, docs drifted from reality. The new project's test suite programmatically validates documentation claims — tool descriptions must mention alternatives, output schemas must match actual responses, and annotations must be consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: If documentation can be wrong without a test failing, it will eventually be wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Kept
&lt;/h2&gt;

&lt;p&gt;The rewrite preserved the user-facing contract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Same tools&lt;/strong&gt; with identical semantics and parameter names&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same MCP protocol&lt;/strong&gt; compatibility (Claude Desktop, Cursor, VS Code, any MCP client)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same environment variables&lt;/strong&gt; (drop-in replacement for existing configs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same search lenses&lt;/strong&gt; (curated domain lists, identical JSON format)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What improved (without breaking backwards compatibility):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth 2.1 authentication for multi-client deployments&lt;/li&gt;
&lt;li&gt;Multi-tenancy with per-tenant session isolation&lt;/li&gt;
&lt;li&gt;Per-provider circuit breakers with automatic fallback&lt;/li&gt;
&lt;li&gt;Prometheus metrics for observability&lt;/li&gt;
&lt;li&gt;Structured audit logging for compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Since launching the Go version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero orphan process reports (vs. recurring issue in Node.js version)&lt;/li&gt;
&lt;li&gt;Multiple search providers with automatic failover (vs. single provider)&lt;/li&gt;
&lt;li&gt;4-tier scraping pipeline that tries lightweight methods first (vs. Playwright-only)&lt;/li&gt;
&lt;li&gt;Sub-100ms cold startup (vs. 2-4 seconds)&lt;/li&gt;
&lt;li&gt;Production-ready: rate limiting, circuit breakers, session isolation, audit trail&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Should You Rewrite?
&lt;/h2&gt;

&lt;p&gt;Probably not. Most rewrites fail because they're motivated by developer preference ("I want to use a new language") rather than architectural necessity. Ours succeeded because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The problems were &lt;strong&gt;architectural&lt;/strong&gt;, not implementational — no amount of refactoring within Node.js would fix process orphaning&lt;/li&gt;
&lt;li&gt;The user-facing contract was &lt;strong&gt;well-defined&lt;/strong&gt; — MCP provides a clean protocol boundary&lt;/li&gt;
&lt;li&gt;The scope was &lt;strong&gt;bounded&lt;/strong&gt; — we knew exactly what the server needed to do&lt;/li&gt;
&lt;li&gt;We had &lt;strong&gt;comprehensive tests&lt;/strong&gt; on the old version to validate behavioral equivalence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your problems are solvable within your current architecture, refactor. If they're fundamentally incompatible with your runtime or architecture, consider a rewrite — but only with clear success criteria and a well-defined boundary.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article covers the migration from &lt;a href="https://github.com/zoharbabin/google-researcher-mcp" rel="noopener noreferrer"&gt;google-researcher-mcp&lt;/a&gt; to &lt;a href="https://github.com/zoharbabin/web-researcher-mcp" rel="noopener noreferrer"&gt;web-researcher-mcp&lt;/a&gt;. The new project is open source under MIT and works with any MCP-compatible AI assistant.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>go</category>
      <category>lessonslearned</category>
    </item>
    <item>
      <title>Building a 13-Agent AI System for M&amp;A Due Diligence — Architecture Deep Dive</title>
      <dc:creator>Zohar Babin</dc:creator>
      <pubDate>Sun, 17 May 2026 19:05:46 +0000</pubDate>
      <link>https://dev.to/zoharbabin/building-a-13-agent-ai-system-for-ma-due-diligence-architecture-deep-dive-20ah</link>
      <guid>https://dev.to/zoharbabin/building-a-13-agent-ai-system-for-ma-due-diligence-architecture-deep-dive-20ah</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Nobody Was Solving
&lt;/h2&gt;

&lt;p&gt;As a corp dev lead, I spent weeks doing the same thing after every deal: assembling the cross-domain picture from siloed advisor reports.&lt;/p&gt;

&lt;p&gt;Legal would flag a termination clause. Finance would flag revenue concentration. Same entity. Nobody connected the dots.&lt;/p&gt;

&lt;p&gt;This happens because due diligence is split into parallel workstreams — legal, financial, commercial, tax, regulatory — each run by separate teams with separate deliverables. The cross-referencing happens in someone's head, over coffee, two days before the IC memo is due.&lt;/p&gt;

&lt;p&gt;The numbers back this up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;31% of M&amp;amp;A failures trace back to DD shortcomings&lt;/strong&gt; (HBR, McKinsey, KPMG research)&lt;/li&gt;
&lt;li&gt;DD timelines keep compressing — six weeks becomes three, same scope&lt;/li&gt;
&lt;li&gt;Corp dev teams screen 200-1,000+ companies/year but close 1-3%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built &lt;a href="https://github.com/zoharbabin/due-diligence-agents" rel="noopener noreferrer"&gt;Due Diligence Agents&lt;/a&gt; to fix this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;13 AI agents analyze every document in an M&amp;amp;A data room across 9 specialist domains — Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, and ESG — then cross-reference findings automatically and trace each one to the exact page, section, and quote.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dd-agents
dd-agents auto-config &lt;span class="s2"&gt;"Buyer"&lt;/span&gt; &lt;span class="s2"&gt;"Target"&lt;/span&gt; &lt;span class="nt"&gt;--data-room&lt;/span&gt; ./your_data_room
dd-agents run deal-config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: an interactive HTML report, a 14-sheet Excel workbook, and per-subject JSON findings. &lt;a href="https://zoharbabin.github.io/due-diligence-agents/" rel="noopener noreferrer"&gt;See a sample report from synthetic data.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The system has four layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: 38-Step Async Pipeline
&lt;/h3&gt;

&lt;p&gt;The orchestrator (&lt;code&gt;engine.py&lt;/code&gt;) is a state machine with 38 async steps grouped into phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Setup&lt;/strong&gt; (steps 1-5): Load config, validate data room, resolve entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt; (steps 6-13): Extract documents, build inventory, classify files, compute precedence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis&lt;/strong&gt; (steps 14-17): Build specialist prompts, route documents, spawn agents in parallel, check coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Domain&lt;/strong&gt; (steps 18-20): Symbolic trigger evaluation, targeted respawn, merge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt; (steps 21-26): Judge review, merge findings, validate, deduplicate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting&lt;/strong&gt; (steps 27-38): Generate HTML, Excel, JSON, knowledge base&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step supports checkpoint/resume. If the pipeline crashes at step 23, it restarts from step 23 — not from scratch. Steps are typed, and the state object serializes cleanly to JSON.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: 13 Agents
&lt;/h3&gt;

&lt;p&gt;9 specialists + 4 meta-agents, each spawned via Anthropic's &lt;code&gt;claude-agent-sdk&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specialists&lt;/strong&gt;: Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG. Each gets domain-specific prompts, the relevant documents, and a set of tools (file read, search, finding write).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-agents&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Judge&lt;/strong&gt;: Reviews specialist findings for quality, consistency, and missed coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executive Synthesis&lt;/strong&gt;: Produces the deal-level summary with go/no-go signals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red Flag Scanner&lt;/strong&gt;: Pattern-matches across all findings for deal-killers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acquirer Intelligence&lt;/strong&gt;: Tailors findings to the buyer's strategic context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Specialists run in parallel (batched by resource constraints). Meta-agents run sequentially after all specialists complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Neurosymbolic Cross-Domain Analysis
&lt;/h3&gt;

&lt;p&gt;This is the part that solved my original problem.&lt;/p&gt;

&lt;p&gt;After specialists produce their findings (pass 1), a &lt;strong&gt;deterministic rule engine&lt;/strong&gt; scans them for cross-domain dependencies. No LLM calls — just Python pattern matching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Finance finds revenue recognition issue
# → Rule fires → Legal agent re-examines specific contracts
# for enforceability, clawback clauses, delivery milestones
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seven built-in trigger rules cover the most common M&amp;amp;A cross-domain dependencies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source → Target&lt;/th&gt;
&lt;th&gt;When It Fires&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Finance → Legal&lt;/td&gt;
&lt;td&gt;Revenue recognition finding needs contract enforceability check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal → Finance&lt;/td&gt;
&lt;td&gt;Change-of-control clause needs financial exposure quantification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal → Finance&lt;/td&gt;
&lt;td&gt;Termination-for-convenience needs revenue-at-risk calculation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal → ProductTech&lt;/td&gt;
&lt;td&gt;IP ownership dispute needs technical dependency assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ProductTech → Legal&lt;/td&gt;
&lt;td&gt;Data privacy finding needs DPA/GDPR compliance review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commercial → Finance&lt;/td&gt;
&lt;td&gt;SLA risk with &amp;gt;10% service credits needs financial quantification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance → Commercial&lt;/td&gt;
&lt;td&gt;Pricing discrepancy needs commercial rate card validation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When a rule fires, it creates a &lt;code&gt;CrossDomainTrigger&lt;/code&gt; with the specific contracts to re-examine and instructions for the target agent. The target agent runs a &lt;strong&gt;targeted pass-2 review&lt;/strong&gt; — only on the cited contracts, not the full data room. This keeps costs bounded.&lt;/p&gt;

&lt;p&gt;Budget-capped, priority-ordered. If no triggers fire, zero additional cost.&lt;/p&gt;

&lt;p&gt;The design is inspired by the &lt;a href="https://arxiv.org/abs/2604.00555" rel="noopener noreferrer"&gt;FAOS Platform&lt;/a&gt; — asymmetric coupling where symbolic rules constrain the LLM's scope while the LLM provides judgment. Symbolic decides &lt;em&gt;when&lt;/em&gt; intelligence is needed; the LLM provides &lt;em&gt;what&lt;/em&gt; to do about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: 5 Blocking Quality Gates
&lt;/h3&gt;

&lt;p&gt;Every finding goes through validation before it reaches the report:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Coverage gate&lt;/strong&gt;: Did the agent analyze every assigned document?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation&lt;/strong&gt;: Does every finding have the required fields (severity, citations, category)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citation verification&lt;/strong&gt;: Can we trace the finding back to a specific page and quote?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic dedup&lt;/strong&gt;: Are two agents saying the same thing about the same document? (rapidfuzz token_sort_ratio ≥ 80)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numerical audit&lt;/strong&gt;: Do financial figures in findings match what's in the source documents?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Fail-closed. If validation fails, the pipeline stops — it doesn't silently produce bad output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chat Mode (My Favorite Feature)
&lt;/h2&gt;

&lt;p&gt;After the pipeline runs, you can interrogate the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dd-agents chat &lt;span class="nt"&gt;--report&lt;/span&gt; _dd/forensic-dd/runs/latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chat agent has 14 MCP tools: citation verification against source PDFs, cross-contract search, entity resolution, and sandboxed document generation. Ask "build me a board summary of all P0 findings with revenue impact" and it writes a Python script, executes it in a sandbox, and hands you the &lt;code&gt;.xlsx&lt;/code&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  15 Things I Learned Building This
&lt;/h2&gt;

&lt;p&gt;These lessons apply to any system doing cross-document analysis at scale — not just M&amp;amp;A.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Extraction is harder than analysis. By a lot.
&lt;/h3&gt;

&lt;p&gt;Everyone focuses on the LLM prompts. But 80% of the real engineering is getting clean text out of messy documents. Our extraction pipeline has 4 tiers: pymupdf → pdftotext → OCR (Tesseract → GLM-OCR) → Claude vision as last resort. Each tier has 6 quality gates (min chars, printable ratio, density, readability, watermark detection, corruption check). Confidence scales with method quality — pymupdf gets 0.9 base, OCR gets 0.65.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Entity resolution is your invisible foundation
&lt;/h3&gt;

&lt;p&gt;"IBM", "International Business Machines", and "Red Hat" — are these the same entity? We use a 6-stage cascade: exact match → normalized (strip legal suffixes) → alias expansion → fuzzy match (rapidfuzz) → TF-IDF cosine similarity → learned matches from prior runs. Names ≤5 characters are blocked from fuzzy matching — without this, "Inc." matches random entities.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Don't dump everything into one context. Map-merge-resolve.
&lt;/h3&gt;

&lt;p&gt;A 200-page master agreement might have the deal-killer on page 147. You can't skip large files. But dumping them into one context drops accuracy from 95% to 74% (&lt;a href="https://www.addleshawgoddard.com/globalassets/insights/technology/llm/rag-report.pdf" rel="noopener noreferrer"&gt;Addleshaw Goddard, 510 contracts&lt;/a&gt;). Instead: chunk at page boundaries (150K chars, 15% overlap), analyze each chunk independently, merge with priority logic (YES beats NO, specific beats generic), and only invoke LLM arbitration when chunks disagree. The 21-point accuracy gain is entirely engineering — no model change.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Hallucination is an engineering problem, not a model problem
&lt;/h3&gt;

&lt;p&gt;No single defense works. We use 5 layers: (1) Pydantic schema validation on every response, (2) mandatory citation with file_path/page/exact_quote verified against source, (3) explicit "NOT_FOUND" escape valve — without this, models fabricate clauses rather than admit ignorance, (4) adversarial Judge review with accusatory framing ("this finding appears fabricated — prove it with a direct quote"), (5) 6-layer deterministic numerical audit.&lt;/p&gt;

&lt;p&gt;Layer 3 changed everything. When you tell the model "if you can't find this clause, say NOT_FOUND," hallucination drops dramatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Know when to stop using LLMs
&lt;/h3&gt;

&lt;p&gt;We had an LLM agent doing validation and report synthesis. We replaced it with deterministic Python. Quality went up, cost went down. The rule: use LLMs for analysis and synthesis; use Python for validation, dedup, and audit. If you can write the logic as deterministic code, do it.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Self-verification works — but only with accusatory framing
&lt;/h3&gt;

&lt;p&gt;After agents produce findings, a follow-up pass challenges them on high-severity claims. Polite prompts ("please review your finding") have near-zero effect — models confirm their own output. Accusatory prompts ("this finding appears fabricated," "the cited clause doesn't exist") force re-examination and produce a 9.2% accuracy improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Cross-agent dedup is different than you think
&lt;/h3&gt;

&lt;p&gt;When 4 agents analyze the same document, they find the same issue but describe it differently. Three rules: (1) never dedup within the same agent — two similar findings from Legal are intentionally distinct, (2) only dedup across agents on the same document — similar findings on different documents are different findings, (3) keep contributing agent metadata so you know which domains flagged it.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Context window engineering is a first-class discipline
&lt;/h3&gt;

&lt;p&gt;It's not just about fitting data in — it's about &lt;em&gt;where&lt;/em&gt; things go. Critical instructions go at the start (highest recall zone). Document content goes in the middle (lowest recall — ~40% worse). Constraints and format rules go at the end (second-highest recall). We budget 40% of the context window for tool calls and reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Quality gates must be blocking, not advisory
&lt;/h3&gt;

&lt;p&gt;If validation just logs a warning, nobody reads it. If it halts the pipeline, quality is non-negotiable. Same for agent guardrails: hard turn limits (soft at 200, force-kill at 3x), path guards (agents can only write under &lt;code&gt;_dd/&lt;/code&gt;), bash guards (24 blocked patterns — no &lt;code&gt;rm -rf&lt;/code&gt;, no &lt;code&gt;sudo&lt;/code&gt;, no pipe-to-shell). Better to produce nothing than unreliable output.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Every claim must be traceable to source
&lt;/h3&gt;

&lt;p&gt;Citation verification uses 4 scopes: exact page match → adjacent pages ±1 → full document fuzzy match (80%+) → cross-file search. That last one matters — if the quote isn't in the cited file, we search all files for that entity. Auto-corrects file misattribution.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Most of what AI finds is noise
&lt;/h3&gt;

&lt;p&gt;Run 9 agents across hundreds of documents and you'll get thousands of findings. We use a 3-stage classification: noise filter (15 patterns for extraction artifacts), data quality filter (14 patterns for "data unavailable" gaps), then material findings. Plus 5 severity recalibration rules — e.g., a change-of-control clause that only applies to competitors gets downgraded from P0 to P3 automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  12. Same clause, different deal, different severity
&lt;/h3&gt;

&lt;p&gt;An anti-assignment clause is P0 in an asset purchase (blocks contract transfer) but P3 in a stock purchase (entity doesn't change). Deal-type context must flow through the entire pipeline: prompt-time rules, post-hoc deterministic adjustments, and executive judgment overrides — with full audit trail.&lt;/p&gt;

&lt;h3&gt;
  
  
  13. Every API call is a deal cost
&lt;/h3&gt;

&lt;p&gt;Three model profiles: economy (Haiku for extraction), standard (Sonnet for analysis), premium (Opus for synthesis). Per-agent cost tracking. Hard budget limits that halt the pipeline. Right model for right task.&lt;/p&gt;

&lt;h3&gt;
  
  
  14. Pydantic v2 everywhere
&lt;/h3&gt;

&lt;p&gt;137+ models with &lt;code&gt;model_json_schema()&lt;/code&gt; for structured outputs. Strict mypy across 199 source files. The type system catches real bugs — a finding with &lt;code&gt;evidence&lt;/code&gt; instead of &lt;code&gt;citations&lt;/code&gt; gets blocked by the schema guard hook before it's written to disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  15. Make every run smarter than the last
&lt;/h3&gt;

&lt;p&gt;Inspired by Karpathy's "LLM Wiki" pattern: a persistent knowledge base compounds across runs. Finding lineage via SHA-256 fingerprinting tracks findings even when wording changes. A NetworkX knowledge graph with 11 typed edge types captures entity relationships, contradictions, and clause interactions. Run 2 knows what Run 1 found — and catches what changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dd-agents
dd-agents auto-config &lt;span class="s2"&gt;"Buyer"&lt;/span&gt; &lt;span class="s2"&gt;"Target"&lt;/span&gt; &lt;span class="nt"&gt;--data-room&lt;/span&gt; ./your_data_room
dd-agents run deal-config.json &lt;span class="nt"&gt;--dry-run&lt;/span&gt;  &lt;span class="c"&gt;# Preview without API calls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://zoharbabin.github.io/due-diligence-agents/" rel="noopener noreferrer"&gt;Sample report&lt;/a&gt; (synthetic data, no install needed)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/zoharbabin/due-diligence-agents" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — Apache 2.0, 3,714 tests, strict mypy.&lt;/p&gt;

&lt;p&gt;Built on &lt;a href="https://github.com/anthropics/claude-agent-sdk-python" rel="noopener noreferrer"&gt;Anthropic's Claude Agent SDK&lt;/a&gt;. Looking for feedback — especially from anyone who's dealt with data room analysis and can tell me whether the report structure maps to how DD findings are actually consumed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>opensource</category>
      <category>sideprojects</category>
    </item>
  </channel>
</rss>
