<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zohar Babin</title>
    <description>The latest articles on DEV Community by Zohar Babin (@zoharbabin).</description>
    <link>https://dev.to/zoharbabin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936734%2F51a9e2c8-9c93-45ba-91ae-2449d592c478.png</url>
      <title>DEV Community: Zohar Babin</title>
      <link>https://dev.to/zoharbabin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zoharbabin"/>
    <language>en</language>
    <item>
      <title>Building a 13-Agent AI System for M&amp;A Due Diligence — Architecture Deep Dive</title>
      <dc:creator>Zohar Babin</dc:creator>
      <pubDate>Sun, 17 May 2026 19:05:46 +0000</pubDate>
      <link>https://dev.to/zoharbabin/building-a-13-agent-ai-system-for-ma-due-diligence-architecture-deep-dive-20ah</link>
      <guid>https://dev.to/zoharbabin/building-a-13-agent-ai-system-for-ma-due-diligence-architecture-deep-dive-20ah</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Nobody Was Solving
&lt;/h2&gt;

&lt;p&gt;As a corp dev lead, I spent weeks doing the same thing after every deal: assembling the cross-domain picture from siloed advisor reports.&lt;/p&gt;

&lt;p&gt;Legal would flag a termination clause. Finance would flag revenue concentration. Same entity. Nobody connected the dots.&lt;/p&gt;

&lt;p&gt;This happens because due diligence is split into parallel workstreams — legal, financial, commercial, tax, regulatory — each run by separate teams with separate deliverables. The cross-referencing happens in someone's head, over coffee, two days before the IC memo is due.&lt;/p&gt;

&lt;p&gt;The numbers back this up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;31% of M&amp;amp;A failures trace back to DD shortcomings&lt;/strong&gt; (HBR, McKinsey, KPMG research)&lt;/li&gt;
&lt;li&gt;DD timelines keep compressing — six weeks becomes three, same scope&lt;/li&gt;
&lt;li&gt;Corp dev teams screen 200-1,000+ companies/year but close 1-3%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built &lt;a href="https://github.com/zoharbabin/due-diligence-agents" rel="noopener noreferrer"&gt;Due Diligence Agents&lt;/a&gt; to fix this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;13 AI agents analyze every document in an M&amp;amp;A data room across 9 specialist domains (Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, and ESG), then cross-reference findings automatically and trace each one to the exact page, section, and quote.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dd-agents
dd-agents auto-config &lt;span class="s2"&gt;"Buyer"&lt;/span&gt; &lt;span class="s2"&gt;"Target"&lt;/span&gt; &lt;span class="nt"&gt;--data-room&lt;/span&gt; ./your_data_room
dd-agents run deal-config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: an interactive HTML report, a 14-sheet Excel workbook, and per-subject JSON findings. &lt;a href="https://zoharbabin.github.io/due-diligence-agents/" rel="noopener noreferrer"&gt;See a sample report from synthetic data.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The system has four layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: 38-Step Async Pipeline
&lt;/h3&gt;

&lt;p&gt;The orchestrator (&lt;code&gt;engine.py&lt;/code&gt;) is a state machine with 38 async steps grouped into phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Setup&lt;/strong&gt; (steps 1-5): Load config, validate data room, resolve entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt; (steps 6-13): Extract documents, build inventory, classify files, compute precedence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis&lt;/strong&gt; (steps 14-17): Build specialist prompts, route documents, spawn agents in parallel, check coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Domain&lt;/strong&gt; (steps 18-20): Symbolic trigger evaluation, targeted respawn, merge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt; (steps 21-26): Judge review, merge findings, validate, deduplicate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting&lt;/strong&gt; (steps 27-38): Generate HTML, Excel, JSON, knowledge base&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step supports checkpoint/resume. If the pipeline crashes at step 23, it restarts from step 23 — not from scratch. Steps are typed, and the state object serializes cleanly to JSON.&lt;/p&gt;
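&lt;p&gt;A minimal sketch of the checkpoint/resume idea, assuming a plain dict as the state object and a JSON file as the checkpoint. The three step functions, &lt;code&gt;run_pipeline&lt;/code&gt;, and the &lt;code&gt;fail_at&lt;/code&gt; parameter are hypothetical names used for illustration, not the actual &lt;code&gt;engine.py&lt;/code&gt; API.&lt;/p&gt;

```python
import asyncio
import json
import tempfile
from pathlib import Path

# Hypothetical steps: each takes the state dict and returns it updated.
async def load_config(state):
    state["config_loaded"] = True
    return state

async def build_inventory(state):
    state["documents"] = ["msa.pdf", "nda.pdf"]
    return state

async def write_report(state):
    state["report"] = f"{len(state['documents'])} documents analyzed"
    return state

STEPS = [load_config, build_inventory, write_report]

async def run_pipeline(checkpoint: Path, fail_at=None):
    # Resume from the saved state if a checkpoint exists, else start fresh.
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"next_step": 0}
    for index in range(state["next_step"], len(STEPS)):
        if index == fail_at:
            raise RuntimeError(f"simulated crash at step {index}")
        state = await STEPS[index](state)
        state["next_step"] = index + 1
        # Persist after every step so a crash loses at most one step of work.
        checkpoint.write_text(json.dumps(state))
    return state

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "state.json"
        try:
            asyncio.run(run_pipeline(checkpoint, fail_at=2))
        except RuntimeError:
            pass  # the first run dies before the last step
        saved = json.loads(checkpoint.read_text())
        final = asyncio.run(run_pipeline(checkpoint))
        return saved["next_step"], final["report"]
```

&lt;p&gt;Calling &lt;code&gt;demo()&lt;/code&gt; crashes the first run at the last step, then resumes from the saved index instead of re-running the earlier steps.&lt;/p&gt;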

&lt;h3&gt;
  
  
  Layer 2: 13 Agents
&lt;/h3&gt;

&lt;p&gt;9 specialists + 4 meta-agents, each spawned via Anthropic's &lt;code&gt;claude-agent-sdk&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specialists&lt;/strong&gt;: Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG. Each gets domain-specific prompts, the relevant documents, and a set of tools (file read, search, finding write).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-agents&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Judge&lt;/strong&gt;: Reviews specialist findings for quality, consistency, and missed coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executive Synthesis&lt;/strong&gt;: Produces the deal-level summary with go/no-go signals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red Flag Scanner&lt;/strong&gt;: Pattern-matches across all findings for deal-killers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acquirer Intelligence&lt;/strong&gt;: Tailors findings to the buyer's strategic context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Specialists run in parallel (batched by resource constraints). Meta-agents run sequentially after all specialists complete.&lt;/p&gt;
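&lt;p&gt;The scheduling above can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt;. This assumes a semaphore as the resource cap; &lt;code&gt;run_agent&lt;/code&gt; is a stand-in that only records that it ran, not a real agent session, and &lt;code&gt;max_parallel&lt;/code&gt; is an illustrative parameter.&lt;/p&gt;

```python
import asyncio

SPECIALISTS = ["Legal", "Finance", "Commercial", "ProductTech", "Cybersecurity",
               "HR", "Tax", "Regulatory", "ESG"]
META_AGENTS = ["Judge", "Executive Synthesis", "Red Flag Scanner",
               "Acquirer Intelligence"]

async def run_agent(name, log):
    # Stand-in for a real agent session; it only records completion order.
    await asyncio.sleep(0)
    log.append(name)
    return f"{name}: done"

async def run_all(max_parallel=3):
    log = []
    slots = asyncio.Semaphore(max_parallel)  # hypothetical resource cap

    async def limited(name):
        async with slots:
            return await run_agent(name, log)

    # Specialists run concurrently, at most max_parallel at a time.
    specialist_results = await asyncio.gather(*(limited(n) for n in SPECIALISTS))
    # Meta-agents start only after every specialist has finished, one by one.
    meta_results = [await run_agent(n, log) for n in META_AGENTS]
    return log, specialist_results + meta_results
```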

&lt;h3&gt;
  
  
  Layer 3: Neurosymbolic Cross-Domain Analysis
&lt;/h3&gt;

&lt;p&gt;This is the part that solved my original problem.&lt;/p&gt;

&lt;p&gt;After specialists produce their findings (pass 1), a &lt;strong&gt;deterministic rule engine&lt;/strong&gt; scans them for cross-domain dependencies. No LLM calls, just Python pattern matching.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Finance finds revenue recognition issue
# → Rule fires → Legal agent re-examines specific contracts
# for enforceability, clawback clauses, delivery milestones
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seven built-in trigger rules cover the most common M&amp;amp;A cross-domain dependencies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source → Target&lt;/th&gt;
&lt;th&gt;When It Fires&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Finance → Legal&lt;/td&gt;
&lt;td&gt;Revenue recognition finding needs contract enforceability check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal → Finance&lt;/td&gt;
&lt;td&gt;Change-of-control clause needs financial exposure quantification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal → Finance&lt;/td&gt;
&lt;td&gt;Termination-for-convenience needs revenue-at-risk calculation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal → ProductTech&lt;/td&gt;
&lt;td&gt;IP ownership dispute needs technical dependency assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ProductTech → Legal&lt;/td&gt;
&lt;td&gt;Data privacy finding needs DPA/GDPR compliance review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commercial → Finance&lt;/td&gt;
&lt;td&gt;SLA risk with &amp;gt;10% service credits needs financial quantification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance → Commercial&lt;/td&gt;
&lt;td&gt;Pricing discrepancy needs commercial rate card validation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When a rule fires, it creates a &lt;code&gt;CrossDomainTrigger&lt;/code&gt; with the specific contracts to re-examine and instructions for the target agent. The target agent runs a &lt;strong&gt;targeted pass-2 review&lt;/strong&gt; — only on the cited contracts, not the full data room. This keeps costs bounded.&lt;/p&gt;

&lt;p&gt;Budget-capped, priority-ordered. If no triggers fire, zero additional cost.&lt;/p&gt;
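&lt;p&gt;Here is a rough sketch of how such a rule engine could look. Only the &lt;code&gt;CrossDomainTrigger&lt;/code&gt; name comes from the project; its fields, the &lt;code&gt;Finding&lt;/code&gt; shape, the category strings, and &lt;code&gt;evaluate_triggers&lt;/code&gt; are assumptions made for illustration. Three of the seven rules from the table are shown.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Finding:
    agent: str
    category: str
    contracts: list
    severity: int  # 0 is most severe (P0)

@dataclass
class CrossDomainTrigger:
    source: str
    target: str
    contracts: list
    instruction: str
    priority: int

# (source agent, finding category) -> (target agent, instruction).
# Category names are illustrative.
RULES = {
    ("Finance", "revenue_recognition"): ("Legal", "Check contract enforceability"),
    ("Legal", "change_of_control"): ("Finance", "Quantify financial exposure"),
    ("Legal", "termination_for_convenience"): ("Finance", "Calculate revenue at risk"),
}

def evaluate_triggers(findings, max_triggers):
    triggers = []
    for f in findings:
        rule = RULES.get((f.agent, f.category))
        if rule is None:
            continue  # no rule matched: no pass-2 work, no extra cost
        target, instruction = rule
        triggers.append(
            CrossDomainTrigger(f.agent, target, f.contracts, instruction, f.severity)
        )
    # Most severe first, then cut off at the budget cap.
    triggers.sort(key=lambda t: t.priority)
    return triggers[:max_triggers]
```

&lt;p&gt;Because the lookup is a dictionary access, a run where nothing matches returns an empty list and never touches a model.&lt;/p&gt;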

&lt;p&gt;The design is inspired by the &lt;a href="https://arxiv.org/abs/2604.00555" rel="noopener noreferrer"&gt;FAOS Platform&lt;/a&gt; — asymmetric coupling where symbolic rules constrain the LLM's scope while the LLM provides judgment. Symbolic decides &lt;em&gt;when&lt;/em&gt; intelligence is needed; the LLM provides &lt;em&gt;what&lt;/em&gt; to do about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: 5 Blocking Quality Gates
&lt;/h3&gt;

&lt;p&gt;Every finding goes through validation before it reaches the report:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Coverage gate&lt;/strong&gt;: Did the agent analyze every assigned document?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation&lt;/strong&gt;: Does every finding have the required fields (severity, citations, category)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citation verification&lt;/strong&gt;: Can we trace the finding back to a specific page and quote?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic dedup&lt;/strong&gt;: Are two agents saying the same thing about the same document? (rapidfuzz token_sort_ratio ≥ 80)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numerical audit&lt;/strong&gt;: Do financial figures in findings match what's in the source documents?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Fail-closed. If validation fails, the pipeline stops — it doesn't silently produce bad output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chat Mode (My Favorite Feature)
&lt;/h2&gt;

&lt;p&gt;After the pipeline runs, you can interrogate the results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dd-agents chat &lt;span class="nt"&gt;--report&lt;/span&gt; _dd/forensic-dd/runs/latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chat agent has 14 MCP tools: citation verification against source PDFs, cross-contract search, entity resolution, and sandboxed document generation. Ask "build me a board summary of all P0 findings with revenue impact" and it writes a Python script, executes it in a sandbox, and hands you the &lt;code&gt;.xlsx&lt;/code&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  15 Things I Learned Building This
&lt;/h2&gt;

&lt;p&gt;These lessons apply to any system doing cross-document analysis at scale — not just M&amp;amp;A.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Extraction is harder than analysis. By a lot.
&lt;/h3&gt;

&lt;p&gt;Everyone focuses on the LLM prompts. But 80% of the real engineering is getting clean text out of messy documents. Our extraction pipeline has 4 tiers: pymupdf → pdftotext → OCR (Tesseract → GLM-OCR) → Claude vision as last resort. Each tier has 6 quality gates (min chars, printable ratio, density, readability, watermark detection, corruption check). Confidence scales with method quality — pymupdf gets 0.9 base, OCR gets 0.65.&lt;/p&gt;
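&lt;p&gt;The tier fallback can be sketched as below. It assumes each extractor is a plain callable and implements only two of the six gates (minimum length and printable ratio). The thresholds, &lt;code&gt;passes_gates&lt;/code&gt;, and &lt;code&gt;extract&lt;/code&gt; are hypothetical, and the ASCII-only printable check is deliberately crude.&lt;/p&gt;

```python
import string

def passes_gates(text, min_chars=20, min_printable=0.95):
    # Two of the six gates named above; thresholds are illustrative.
    if len(text) >= min_chars:
        printable = sum(ch in string.printable for ch in text) / len(text)
        return printable >= min_printable
    return False

def extract(path, tiers):
    """tiers: ordered (name, base_confidence, extractor) tuples.

    The first tier whose output passes the gates wins.
    """
    for name, confidence, extractor in tiers:
        try:
            text = extractor(path)
        except Exception:
            continue  # a failing tier falls through to the next one
        if passes_gates(text):
            return {"method": name, "confidence": confidence, "text": text}
    return None  # every tier failed; the caller decides what to do
```

&lt;p&gt;In use, the tiers list would hold the real extractors in order of preference, with the base confidence attached to each.&lt;/p&gt;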

&lt;h3&gt;
  
  
  2. Entity resolution is your invisible foundation
&lt;/h3&gt;

&lt;p&gt;"IBM", "International Business Machines", and "Red Hat" — are these the same entity? We use a 6-stage cascade: exact match → normalized (strip legal suffixes) → alias expansion → fuzzy match (rapidfuzz) → TF-IDF cosine similarity → learned matches from prior runs. Names ≤5 characters are blocked from fuzzy matching — without this, "Inc." matches random entities.&lt;/p&gt;
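&lt;p&gt;A simplified sketch of the first four stages of that cascade. It swaps in the standard library's &lt;code&gt;difflib&lt;/code&gt; for rapidfuzz so it runs without dependencies, and it leaves out the TF-IDF and learned-match stages. The suffix list, alias table, and 0.85 threshold are assumptions.&lt;/p&gt;

```python
import re
from difflib import SequenceMatcher

LEGAL_SUFFIXES = re.compile(r"\b(inc|corp|corporation|llc|ltd|gmbh)\b\.?", re.IGNORECASE)
ALIASES = {"ibm": "international business machines"}  # illustrative alias table

def normalize(name):
    stripped = LEGAL_SUFFIXES.sub("", name)
    return re.sub(r"[^a-z0-9 ]", "", stripped.lower()).strip()

def resolve(name, known, threshold=0.85):
    """Return (canonical_name, stage) or (None, 'unresolved')."""
    if name in known:
        return name, "exact"
    norm = ALIASES.get(normalize(name), normalize(name))
    by_norm = {normalize(k): k for k in known}
    if norm in by_norm:
        stage = "normalized" if norm == normalize(name) else "alias"
        return by_norm[norm], stage
    if len(norm) > 5:  # names of 5 characters or fewer never reach fuzzy matching
        best = max(by_norm, key=lambda k: SequenceMatcher(None, norm, k).ratio(),
                   default=None)
        if best is not None and SequenceMatcher(None, norm, best).ratio() >= threshold:
            return by_norm[best], "fuzzy"
    return None, "unresolved"
```

&lt;p&gt;Each stage returns as soon as it matches, so the cheap, precise checks always get the first chance and the fuzzy stage only sees names that nothing else could place.&lt;/p&gt;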

&lt;h3&gt;
  
  
  3. Don't dump everything into one context. Map-merge-resolve.
&lt;/h3&gt;

&lt;p&gt;A 200-page master agreement might have the deal-killer on page 147. You can't skip large files. But dumping them into one context drops accuracy from 95% to 74% (&lt;a href="https://www.addleshawgoddard.com/globalassets/insights/technology/llm/rag-report.pdf" rel="noopener noreferrer"&gt;Addleshaw Goddard, 510 contracts&lt;/a&gt;). Instead: chunk at page boundaries (150K chars, 15% overlap), analyze each chunk independently, merge with priority logic (YES beats NO, specific beats generic), and only invoke LLM arbitration when chunks disagree. The 21-point accuracy gain is entirely engineering — no model change.&lt;/p&gt;
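&lt;p&gt;A sketch of the map and merge steps, assuming pages arrive as a list of strings and each chunk yields one of three answers. &lt;code&gt;chunk_pages&lt;/code&gt;, &lt;code&gt;merge_answers&lt;/code&gt;, and the answer labels are hypothetical. Treating a YES/NO split as the only kind of disagreement is also an assumption.&lt;/p&gt;

```python
def chunk_pages(pages, max_chars=150_000, overlap=0.15):
    """Group whole pages into chunks, repeating trailing pages as overlap."""
    chunks, current, size = [], [], 0
    for number, text in enumerate(pages, start=1):
        if current and size + len(text) > max_chars:
            chunks.append(current)
            # Carry trailing pages forward until about `overlap` of the budget is reused.
            carried, carried_size = [], 0
            for page in reversed(current):
                if carried_size + len(page[1]) > max_chars * overlap:
                    break
                carried.insert(0, page)
                carried_size += len(page[1])
            current, size = carried, carried_size
        current.append((number, text))
        size += len(text)
    if current:
        chunks.append(current)
    return chunks

PRIORITY = {"YES": 2, "NO": 1, "NOT_FOUND": 0}

def merge_answers(answers):
    """YES beats NO beats NOT_FOUND; also report whether chunks disagreed."""
    best = max(answers, key=lambda a: PRIORITY[a])
    needs_arbitration = "YES" in answers and "NO" in answers
    return best, needs_arbitration
```

&lt;p&gt;Splitting only at page boundaries keeps page numbers intact, which matters later when citations are checked against a specific page.&lt;/p&gt;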

&lt;h3&gt;
  
  
  4. Hallucination is an engineering problem, not a model problem
&lt;/h3&gt;

&lt;p&gt;No single defense works. We use 5 layers: (1) Pydantic schema validation on every response, (2) mandatory citation with file_path/page/exact_quote verified against source, (3) explicit "NOT_FOUND" escape valve — without this, models fabricate clauses rather than admit ignorance, (4) adversarial Judge review with accusatory framing ("this finding appears fabricated — prove it with a direct quote"), (5) 6-layer deterministic numerical audit.&lt;/p&gt;

&lt;p&gt;Defense (3), the NOT_FOUND escape valve, changed everything. When you tell the model "if you can't find this clause, say NOT_FOUND," hallucination drops dramatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Know when to stop using LLMs
&lt;/h3&gt;

&lt;p&gt;We had an LLM agent doing validation and report synthesis. We replaced it with deterministic Python. Quality went up, cost went down. The rule: use LLMs for analysis and synthesis; use Python for validation, dedup, and audit. If you can write the logic as deterministic code, do it.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Self-verification works — but only with accusatory framing
&lt;/h3&gt;

&lt;p&gt;After agents produce findings, a follow-up pass challenges them on high-severity claims. Polite prompts ("please review your finding") have near-zero effect — models confirm their own output. Accusatory prompts ("this finding appears fabricated," "the cited clause doesn't exist") force re-examination and produce a 9.2% accuracy improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Cross-agent dedup is different than you think
&lt;/h3&gt;

&lt;p&gt;When 4 agents analyze the same document, they find the same issue but describe it differently. Three rules: (1) never dedup within the same agent — two similar findings from Legal are intentionally distinct, (2) only dedup across agents on the same document — similar findings on different documents are different findings, (3) keep contributing agent metadata so you know which domains flagged it.&lt;/p&gt;
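&lt;p&gt;The three rules fit in a few lines. This sketch uses a standard-library stand-in for rapidfuzz's &lt;code&gt;token_sort_ratio&lt;/code&gt; so it runs without dependencies. The finding shape (dicts with &lt;code&gt;agent&lt;/code&gt;, &lt;code&gt;document&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;) and the function names are assumptions.&lt;/p&gt;

```python
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    # Stdlib stand-in for a token-sort ratio: sort tokens, then compare (0-100).
    def sort_tokens(s):
        return " ".join(sorted(s.lower().split()))
    return 100 * SequenceMatcher(None, sort_tokens(a), sort_tokens(b)).ratio()

def dedup(findings, threshold=80):
    """Merge similar findings from different agents on the same document."""
    merged = []
    for f in findings:
        for kept in merged:
            same_doc = kept["document"] == f["document"]          # rule 2
            other_agent = f["agent"] not in kept["agents"]        # rule 1
            similar = token_sort_similarity(kept["text"], f["text"]) >= threshold
            if same_doc and other_agent and similar:
                kept["agents"].append(f["agent"])                 # rule 3
                break
        else:
            merged.append({"document": f["document"], "text": f["text"],
                           "agents": [f["agent"]]})
    return merged
```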

&lt;h3&gt;
  
  
  8. Context window engineering is a first-class discipline
&lt;/h3&gt;

&lt;p&gt;It's not just about fitting data in — it's about &lt;em&gt;where&lt;/em&gt; things go. Critical instructions go at the start (highest recall zone). Document content goes in the middle (lowest recall — ~40% worse). Constraints and format rules go at the end (second-highest recall). We budget 40% of the context window for tool calls and reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Quality gates must be blocking, not advisory
&lt;/h3&gt;

&lt;p&gt;If validation just logs a warning, nobody reads it. If it halts the pipeline, quality is non-negotiable. Same for agent guardrails: turn limits (soft limit at 200 turns, force-kill at 3x that), path guards (agents can only write under &lt;code&gt;_dd/&lt;/code&gt;), bash guards (24 blocked patterns: no &lt;code&gt;rm -rf&lt;/code&gt;, no &lt;code&gt;sudo&lt;/code&gt;, no pipe-to-shell). Better to produce nothing than unreliable output.&lt;/p&gt;
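&lt;p&gt;A sketch of what the three guardrails could look like, assuming regex-based command checks and path resolution against an allowed root. Only three example patterns are shown, written for illustration rather than copied from the project. &lt;code&gt;Path.is_relative_to&lt;/code&gt; needs Python 3.9 or newer.&lt;/p&gt;

```python
import re
from pathlib import Path

# Three illustrative patterns; the real list of 24 is not reproduced here.
BLOCKED_PATTERNS = [
    re.compile(r"\brm\s+-[a-z]*r[a-z]*f|\brm\s+-[a-z]*f[a-z]*r"),  # rm -rf / rm -fr
    re.compile(r"\bsudo\b"),
    re.compile(r"\|\s*(ba|z)?sh\b"),  # pipe-to-shell
]

def bash_allowed(command):
    return not any(p.search(command) for p in BLOCKED_PATTERNS)

def write_allowed(path, root="_dd"):
    # Resolve first so "../" segments cannot escape the allowed directory.
    target = Path(path).resolve()
    return target.is_relative_to(Path(root).resolve())

def turn_status(turn, soft_limit=200, kill_multiplier=3):
    if turn >= soft_limit * kill_multiplier:
        return "kill"
    return "warn" if turn >= soft_limit else "ok"
```

&lt;p&gt;All three checks return a verdict the caller must act on, which is what makes them blocking.&lt;/p&gt;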

&lt;h3&gt;
  
  
  10. Every claim must be traceable to source
&lt;/h3&gt;

&lt;p&gt;Citation verification uses 4 scopes: exact page match → adjacent pages ±1 → full document fuzzy match (80%+) → cross-file search. That last one matters — if the quote isn't in the cited file, we search all files for that entity. Auto-corrects file misattribution.&lt;/p&gt;
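&lt;p&gt;The four scopes, sketched over an in-memory corpus. This assumes extracted text is stored as a mapping from file to page number to text, uses &lt;code&gt;difflib&lt;/code&gt; as a stand-in fuzzy matcher, and searches every other file in the last scope rather than filtering by entity. All names here are hypothetical.&lt;/p&gt;

```python
from difflib import SequenceMatcher

def best_window_ratio(quote, text):
    # Slide a quote-sized window over the text; slow but dependency-free.
    n = len(quote)
    if n == 0 or len(text) == 0:
        return 0.0
    starts = range(0, max(1, len(text) - n + 1))
    return max(SequenceMatcher(None, quote, text[i:i + n]).ratio() for i in starts)

def verify_citation(quote, file, page, corpus, fuzzy_threshold=0.8):
    """corpus: {file: {page_number: text}}. Returns (scope, file, page) or None."""
    pages = corpus.get(file, {})
    if quote in pages.get(page, ""):
        return "exact_page", file, page
    for near in (page - 1, page + 1):
        if quote in pages.get(near, ""):
            return "adjacent_page", file, near
    for number, text in pages.items():
        if best_window_ratio(quote, text) >= fuzzy_threshold:
            return "document_fuzzy", file, number
    for other, other_pages in corpus.items():
        for number, text in other_pages.items():
            if other != file and quote in text:
                return "cross_file", other, number  # corrects file misattribution
    return None
```

&lt;p&gt;Returning the scope along with the corrected location lets the report show how much trust each citation deserves.&lt;/p&gt;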

&lt;h3&gt;
  
  
  11. Most of what AI finds is noise
&lt;/h3&gt;

&lt;p&gt;Run 9 agents across hundreds of documents and you'll get thousands of findings. We use a 3-stage classification: noise filter (15 patterns for extraction artifacts), data quality filter (14 patterns for "data unavailable" gaps), then material findings. Plus 5 severity recalibration rules — e.g., a change-of-control clause that only applies to competitors gets downgraded from P0 to P3 automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  12. Same clause, different deal, different severity
&lt;/h3&gt;

&lt;p&gt;An anti-assignment clause is P0 in an asset purchase (blocks contract transfer) but P3 in a stock purchase (entity doesn't change). Deal-type context must flow through the entire pipeline: prompt-time rules, post-hoc deterministic adjustments, and executive judgment overrides — with full audit trail.&lt;/p&gt;
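&lt;p&gt;The post-hoc deterministic adjustment can be sketched as a lookup keyed on clause type and deal type. Only the two anti-assignment rows come from the text above. The table shape, field names, and &lt;code&gt;adjust_severity&lt;/code&gt; are assumptions.&lt;/p&gt;

```python
# (clause type, deal type) -> severity. Illustrative, not the project's rule set.
SEVERITY_BY_DEAL = {
    ("anti_assignment", "asset_purchase"): "P0",
    ("anti_assignment", "stock_purchase"): "P3",
}

def adjust_severity(finding, deal_type):
    """Return a copy of the finding with adjusted severity and an audit entry."""
    new = SEVERITY_BY_DEAL.get((finding["clause"], deal_type), finding["severity"])
    audit = list(finding.get("audit", []))
    if new != finding["severity"]:
        # Record every change so the override stays explainable later.
        audit.append({"from": finding["severity"], "to": new,
                      "reason": f"deal_type={deal_type}"})
    return {**finding, "severity": new, "audit": audit}
```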

&lt;h3&gt;
  
  
  13. Every API call is a deal cost
&lt;/h3&gt;

&lt;p&gt;Three model profiles: economy (Haiku for extraction), standard (Sonnet for analysis), premium (Opus for synthesis). Per-agent cost tracking. Hard budget limits that halt the pipeline. Right model for right task.&lt;/p&gt;

&lt;h3&gt;
  
  
  14. Pydantic v2 everywhere
&lt;/h3&gt;

&lt;p&gt;137+ models with &lt;code&gt;model_json_schema()&lt;/code&gt; for structured outputs. Strict mypy across 199 source files. The type system catches real bugs — a finding with &lt;code&gt;evidence&lt;/code&gt; instead of &lt;code&gt;citations&lt;/code&gt; gets blocked by the schema guard hook before it's written to disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  15. Make every run smarter than the last
&lt;/h3&gt;

&lt;p&gt;Inspired by Karpathy's "LLM Wiki" pattern: a persistent knowledge base compounds across runs. Finding lineage via SHA-256 fingerprinting tracks findings even when wording changes. A NetworkX knowledge graph with 11 edge types captures entity relationships, contradictions, and clause interactions. Run 2 knows what Run 1 found and catches what changed.&lt;/p&gt;
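&lt;p&gt;One way a fingerprint can survive rewording is to hash only the fields that stay stable between runs. The sketch below assumes those fields are entity, document, and category. That choice, and the &lt;code&gt;fingerprint&lt;/code&gt; and &lt;code&gt;diff_runs&lt;/code&gt; helpers, are illustrative guesses rather than the project's actual scheme.&lt;/p&gt;

```python
import hashlib

def fingerprint(finding):
    # Hash only fields assumed stable across runs, not the free-text description.
    parts = [finding["entity"].lower(), finding["document"], finding["category"]]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

def diff_runs(previous, current):
    """Compare two runs by fingerprint and report what moved."""
    old = {fingerprint(f): f for f in previous}
    new = {fingerprint(f): f for f in current}
    return {
        "added": [new[k] for k in new if k not in old],
        "resolved": [old[k] for k in old if k not in new],
        "changed": [new[k] for k in new
                    if k in old and new[k]["severity"] != old[k]["severity"]],
    }
```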

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dd-agents
dd-agents auto-config &lt;span class="s2"&gt;"Buyer"&lt;/span&gt; &lt;span class="s2"&gt;"Target"&lt;/span&gt; &lt;span class="nt"&gt;--data-room&lt;/span&gt; ./your_data_room
dd-agents run deal-config.json &lt;span class="nt"&gt;--dry-run&lt;/span&gt;  &lt;span class="c"&gt;# Preview without API calls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://zoharbabin.github.io/due-diligence-agents/" rel="noopener noreferrer"&gt;Sample report&lt;/a&gt; (synthetic data, no install needed)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/zoharbabin/due-diligence-agents" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — Apache 2.0, 3,714 tests, strict mypy.&lt;/p&gt;

&lt;p&gt;Built on &lt;a href="https://github.com/anthropics/claude-agent-sdk-python" rel="noopener noreferrer"&gt;Anthropic's Claude Agent SDK&lt;/a&gt;. Looking for feedback — especially from anyone who's dealt with data room analysis and can tell me whether the report structure maps to how DD findings are actually consumed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>opensource</category>
      <category>sideprojects</category>
    </item>
  </channel>
</rss>
