Zohar Babin

Posted on May 17

Building a 13-Agent AI System for M&A Due Diligence — Architecture Deep Dive

#ai #claude #opensource #sideprojects

The Problem Nobody Was Solving

As a corp dev lead, I spent weeks doing the same thing after every deal: assembling the cross-domain picture from siloed advisor reports.

Legal would flag a termination clause. Finance would flag revenue concentration. Same entity. Nobody connected the dots.

This happens because due diligence is split into parallel workstreams — legal, financial, commercial, tax, regulatory — each run by separate teams with separate deliverables. The cross-referencing happens in someone's head, over coffee, two days before the IC memo is due.

The numbers back this up:

31% of M&A failures trace back to due diligence (DD) shortcomings (HBR, McKinsey, KPMG research)
DD timelines keep compressing: six weeks becomes three, with the same scope
Corp dev teams screen 200-1,000+ companies a year but close only 1-3% of them

I built Due Diligence Agents to fix this.

What It Does

13 AI agents analyze every document in an M&A data room across 9 specialist domains — Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, and ESG — then cross-reference findings automatically and trace each one to the exact page, section, and quote.

pip install dd-agents
dd-agents auto-config "Buyer" "Target" --data-room ./your_data_room
dd-agents run deal-config.json

Output: an interactive HTML report, a 14-sheet Excel workbook, and per-subject JSON findings. See a sample report from synthetic data.

The Architecture

The system has four layers:

Layer 1: 38-Step Async Pipeline

The orchestrator (engine.py) is a state machine with 38 async steps grouped into phases:

Setup (steps 1-5): Load config, validate data room, resolve entities
Discovery (steps 6-13): Extract documents, build inventory, classify files, compute precedence
Analysis (steps 14-17): Build specialist prompts, route documents, spawn agents in parallel, check coverage
Cross-Domain (steps 18-20): Symbolic trigger evaluation, targeted respawn, merge
Quality (steps 21-26): Judge review, merge findings, validate, deduplicate
Reporting (steps 27-38): Generate HTML, Excel, JSON, knowledge base

Every step supports checkpoint/resume. If the pipeline crashes at step 23, it restarts from step 23 — not from scratch. Steps are typed, and the state object serializes cleanly to JSON.
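
To make the resume behavior concrete, here is a minimal sketch of a checkpointed step loop. It is not the actual engine.py: the two step functions, the state keys, and the checkpoint file are hypothetical stand-ins, and the real pipeline has 38 steps rather than two.

```python
import asyncio
import json
from pathlib import Path


async def load_config(state: dict) -> None:
    # Hypothetical step: the real steps do far more work.
    state["config_loaded"] = True


async def build_inventory(state: dict) -> None:
    # Hypothetical step: pretend we discovered two documents.
    state["documents"] = ["msa.pdf", "order_form.pdf"]


# Illustrative step list (an assumption, not the project's step registry).
STEPS = [load_config, build_inventory]


async def run_pipeline(checkpoint: Path) -> dict:
    # Resume from the saved state if a checkpoint exists, else start fresh.
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"next_step": 0}

    for index in range(state["next_step"], len(STEPS)):
        await STEPS[index](state)
        state["next_step"] = index + 1
        # Persist after every completed step, so a crash inside step N
        # leaves next_step pointing at N and the rerun starts there.
        checkpoint.write_text(json.dumps(state))
    return state
```

Keeping the state a plain JSON-serializable dict is what makes the checkpoint trivial to write and inspect by hand.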

Layer 2: 13 Agents

9 specialists + 4 meta-agents, each spawned via Anthropic's claude-agent-sdk:

Specialists: Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG. Each gets domain-specific prompts, the relevant documents, and a set of tools (file read, search, finding write).

Meta-agents:

Judge: Reviews specialist findings for quality, consistency, and missed coverage
Executive Synthesis: Produces the deal-level summary with go/no-go signals
Red Flag Scanner: Pattern-matches across all findings for deal-killers
Acquirer Intelligence: Tailors findings to the buyer's strategic context

Specialists run in parallel (batched by resource constraints). Meta-agents run sequentially after all specialists complete.
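
A rough sketch of that scheduling shape, using only asyncio. The run_agent placeholder and the max_parallel value are assumptions for illustration; the real agents are spawned through the SDK and the real batch size depends on resource constraints.

```python
import asyncio

SPECIALISTS = [
    "Legal", "Finance", "Commercial", "ProductTech", "Cybersecurity",
    "HR", "Tax", "Regulatory", "ESG",
]
META_AGENTS = [
    "Judge", "Executive Synthesis", "Red Flag Scanner", "Acquirer Intelligence",
]


async def run_agent(name: str) -> str:
    # Placeholder for a real agent invocation (hypothetical).
    await asyncio.sleep(0)
    return f"{name}: done"


async def run_all(max_parallel: int = 3) -> list[str]:
    limit = asyncio.Semaphore(max_parallel)

    async def bounded(name: str) -> str:
        # At most max_parallel specialists hold the semaphore at once.
        async with limit:
            return await run_agent(name)

    # Specialists run concurrently; gather returns results in input order.
    results = list(await asyncio.gather(*(bounded(n) for n in SPECIALISTS)))

    # Meta-agents run one at a time, only after every specialist is done.
    for name in META_AGENTS:
        results.append(await run_agent(name))
    return results
```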

Layer 3: Neurosymbolic Cross-Domain Analysis

This is the part that solved my original problem.

After specialists produce their findings (pass 1), a deterministic rule engine scans them for cross-domain dependencies. No LLM calls — just Python pattern matching.

# Example: Finance finds revenue recognition issue
# → Rule fires → Legal agent re-examines specific contracts
# for enforceability, clawback clauses, delivery milestones

Seven built-in trigger rules cover the most common M&A cross-domain dependencies:

Source → Target	When It Fires
Finance → Legal	Revenue recognition finding needs contract enforceability check
Legal → Finance	Change-of-control clause needs financial exposure quantification
Legal → Finance	Termination-for-convenience needs revenue-at-risk calculation
Legal → ProductTech	IP ownership dispute needs technical dependency assessment
ProductTech → Legal	Data privacy finding needs DPA/GDPR compliance review
Commercial → Finance	SLA risk with >10% service credits needs financial quantification
Finance → Commercial	Pricing discrepancy needs commercial rate card validation

When a rule fires, it creates a CrossDomainTrigger with the specific contracts to re-examine and instructions for the target agent. The target agent runs a targeted pass-2 review — only on the cited contracts, not the full data room. This keeps costs bounded.

Pass 2 is budget-capped, and triggers run in priority order. If no triggers fire, it adds zero cost.
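
Here is a minimal sketch of what a deterministic trigger engine like this could look like. Only the CrossDomainTrigger name comes from the project; the Finding and TriggerRule shapes, the category strings, and the max_triggers cap (a simple stand-in for the budget cap) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    agent: str
    category: str
    contracts: tuple[str, ...]


@dataclass(frozen=True)
class TriggerRule:
    source: str
    category: str
    target: str
    instruction: str
    priority: int  # lower number is evaluated first


@dataclass(frozen=True)
class CrossDomainTrigger:
    target: str
    contracts: tuple[str, ...]
    instruction: str


# Two of the seven rules, with hypothetical category labels.
RULES = [
    TriggerRule("Finance", "revenue_recognition", "Legal",
                "Check contract enforceability", 1),
    TriggerRule("Legal", "change_of_control", "Finance",
                "Quantify financial exposure", 2),
]


def evaluate(findings: list[Finding], max_triggers: int) -> list[CrossDomainTrigger]:
    fired = []
    # Pure pattern matching over pass-1 findings: no LLM call anywhere.
    for rule in sorted(RULES, key=lambda r: r.priority):
        for finding in findings:
            if finding.agent == rule.source and finding.category == rule.category:
                fired.append(
                    CrossDomainTrigger(rule.target, finding.contracts, rule.instruction)
                )
    # Keep only the highest-priority triggers that fit under the cap.
    return fired[:max_triggers]
```

Each trigger carries only the contracts cited by the source finding, which is what keeps the pass-2 review narrow.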

The design is inspired by the FAOS Platform — asymmetric coupling where symbolic rules constrain the LLM's scope while the LLM provides judgment. Symbolic decides when intelligence is needed; the LLM provides what to do about it.

Layer 4: 5 Blocking Quality Gates

Every finding goes through validation before it reaches the report:

Coverage gate: Did the agent analyze every assigned document?
Schema validation: Does every finding have the required fields (severity, citations, category)?
Citation verification: Can we trace the finding back to a specific page and quote?
Semantic dedup: Are two agents saying the same thing about the same document? (rapidfuzz token_sort_ratio ≥ 80)
Numerical audit: Do financial figures in findings match what's in the source documents?

Fail-closed. If validation fails, the pipeline stops — it doesn't silently produce bad output.
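
A small sketch of the fail-closed idea, covering two of the five gates. The GateFailure exception and the function names are hypothetical; the required fields are the ones listed above.

```python
class GateFailure(Exception):
    """Raised to halt the pipeline when a blocking gate fails."""


REQUIRED_FIELDS = ("severity", "citations", "category")


def schema_gate(findings: list[dict]) -> None:
    for index, finding in enumerate(findings):
        # A missing key and an empty value are both treated as failures.
        missing = [key for key in REQUIRED_FIELDS if not finding.get(key)]
        if missing:
            raise GateFailure(f"finding {index} is missing {missing}")


def coverage_gate(assigned: set[str], analyzed: set[str]) -> None:
    skipped = assigned - analyzed
    if skipped:
        raise GateFailure(f"unanalyzed documents: {sorted(skipped)}")
```

The gates raise instead of returning a status, so a caller cannot ignore a failure by accident.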

The Chat Mode (My Favorite Feature)

After the pipeline runs, you can interrogate the results:

dd-agents chat --report _dd/forensic-dd/runs/latest

The chat agent has 14 MCP tools: citation verification against source PDFs, cross-contract search, entity resolution, and sandboxed document generation. Ask "build me a board summary of all P0 findings with revenue impact" and it writes a Python script, executes it in a sandbox, and hands you the .xlsx file.

15 Things I Learned Building This

These lessons apply to any system doing cross-document analysis at scale — not just M&A.

1. Extraction is harder than analysis. By a lot.

Everyone focuses on the LLM prompts. But 80% of the real engineering is getting clean text out of messy documents. Our extraction pipeline has 4 tiers: pymupdf → pdftotext → OCR (Tesseract → GLM-OCR) → Claude vision as last resort. Each tier has 6 quality gates (min chars, printable ratio, density, readability, watermark detection, corruption check). Confidence scales with method quality — pymupdf gets 0.9 base, OCR gets 0.65.
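
The tiered fallback can be sketched like this. The extractors are passed in as plain callables, so nothing here depends on a specific PDF library; only two of the six gates are shown, and the thresholds are made-up values. The 0.9 and 0.65 confidences are the ones mentioned above.

```python
from typing import Callable, Optional

# (tier name, extractor, base confidence)
Tier = tuple[str, Callable[[str], str], float]


def passes_gates(text: str, min_chars: int = 200, min_printable: float = 0.95) -> bool:
    # Gate 1: enough text to be a real extraction (threshold is an assumption).
    if not text or len(text) < min_chars:
        return False
    # Gate 2: mostly printable characters, not binary debris.
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) >= min_printable


def extract(path: str, tiers: list[Tier]) -> Optional[tuple[str, str, float]]:
    for name, extractor, confidence in tiers:
        try:
            text = extractor(path)
        except Exception:
            # A failing tier is not fatal: fall through to the next one.
            continue
        if passes_gates(text):
            return name, text, confidence
    # Every tier failed its gates.
    return None
```

In use, the tiers list would be ordered cheapest-first, for example `[("pymupdf", read_with_pymupdf, 0.9), ("ocr", read_with_ocr, 0.65)]`, where both reader functions are hypothetical wrappers.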

2. Entity resolution is your invisible foundation

"IBM", "International Business Machines", and "Red Hat" — are these the same entity? We use a 6-stage cascade: exact match → normalized (strip legal suffixes) → alias expansion → fuzzy match (rapidfuzz) → TF-IDF cosine similarity → learned matches from prior runs. Names ≤5 characters are blocked from fuzzy matching — without this, "Inc." matches random entities.

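A sketch of the first four stages of such a cascade. To stay dependency-free it uses difflib from the standard library in place of rapidfuzz, so the scores are not comparable to the project's; the suffix list, the 0.85 cutoff, and the function names are assumptions. The TF-IDF and learned-match stages are left out.

```python
import difflib
import re
from typing import Optional

# Hypothetical, deliberately short list of legal suffixes.
LEGAL_SUFFIXES = re.compile(
    r"[\s,]+(inc|llc|ltd|corp|corporation|gmbh)\.?$", re.IGNORECASE
)


def normalize(name: str) -> str:
    return LEGAL_SUFFIXES.sub("", name.strip()).lower()


def resolve(
    name: str,
    canonical: list[str],
    aliases: dict[str, str],
    min_fuzzy_len: int = 6,
    cutoff: float = 0.85,
) -> Optional[str]:
    # Stage 1: exact match.
    if name in canonical:
        return name
    # Stage 2: match after stripping legal suffixes and case.
    by_norm = {normalize(c): c for c in canonical}
    norm = normalize(name)
    if norm in by_norm:
        return by_norm[norm]
    # Stage 3: alias expansion (keys are normalized names).
    if norm in aliases:
        return aliases[norm]
    # Guard: names of 5 characters or fewer never reach fuzzy matching.
    if len(norm) < min_fuzzy_len:
        return None
    # Stage 4: fuzzy match against normalized canonical names.
    close = difflib.get_close_matches(norm, list(by_norm), n=1, cutoff=cutoff)
    return by_norm[close[0]] if close else None
```

Ordering the stages from cheapest and most certain to most permissive means a fuzzy score is only consulted when nothing stricter has matched.
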
3. Don't dump everything into one context. Map-merge-resolve.

A 200-page master agreement might have the deal-killer on page 147. You can't skip large files. But dumping them into one context drops accuracy from 95% to 74% (Addleshaw Goddard, 510 contracts). Instead: chunk at page boundaries (150K chars, 15% overlap), analyze each chunk independently, merge with priority logic (YES beats NO, specific beats generic), and only invoke LLM arbitration when chunks disagree. The 21-point accuracy gain is entirely engineering — no model change.
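
One way to sketch the chunk and merge halves of this, under simplifying assumptions: pages are plain strings, overlap is carried as whole trailing pages, and the merge covers only the "YES beats NO" rule for a single yes/no question. LLM arbitration and the "specific beats generic" rule are not shown.

```python
def chunk_pages(
    pages: list[str], max_chars: int = 150_000, overlap: float = 0.15
) -> list[list[int]]:
    """Group page indices into chunks without ever splitting a page."""
    chunks = []
    start = 0
    while start < len(pages):
        size, end = 0, start
        # Always take at least one page, then keep adding whole pages that fit.
        while end < len(pages) and (
            end == start or size + len(pages[end]) <= max_chars
        ):
            size += len(pages[end])
            end += 1
        chunks.append(list(range(start, end)))
        if end >= len(pages):
            break
        # Step back over trailing pages worth up to `overlap` of the budget,
        # but never all the way to `start`, so every chunk makes progress.
        carried, back = 0, end
        while back - 1 > start and carried + len(pages[back - 1]) <= overlap * max_chars:
            back -= 1
            carried += len(pages[back])
        start = back
    return chunks


def merge_answers(answers: list[str]) -> str:
    # Priority merge: any YES wins, then any NO, otherwise NOT_FOUND.
    for verdict in ("YES", "NO"):
        if verdict in answers:
            return verdict
    return "NOT_FOUND"
```

The overlap exists so that a clause straddling a chunk boundary is seen whole by at least one chunk.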

4. Hallucination is an engineering problem, not a model problem

No single defense works. We use 5 layers: (1) Pydantic schema validation on every response, (2) mandatory citation with file_path/page/exact_quote verified against source, (3) explicit "NOT_FOUND" escape valve — without this, models fabricate clauses rather than admit ignorance, (4) adversarial Judge review with accusatory framing ("this finding appears fabricated — prove it with a direct quote"), (5) 6-layer deterministic numerical audit.

Defense (3), the NOT_FOUND escape valve, changed everything. When you tell the model "if you can't find this clause, say NOT_FOUND," hallucination drops dramatically.

5. Know when to stop using LLMs

We had an LLM agent doing validation and report synthesis. We replaced it with deterministic Python. Quality went up, cost went down. The rule: use LLMs for analysis and synthesis; use Python for validation, dedup, and audit. If you can write the logic as deterministic code, do it.

6. Self-verification works — but only with accusatory framing

After agents produce findings, a follow-up pass challenges them on high-severity claims. Polite prompts ("please review your finding") have near-zero effect — models confirm their own output. Accusatory prompts ("this finding appears fabricated," "the cited clause doesn't exist") force re-examination and produce a 9.2% accuracy improvement.

7. Cross-agent dedup is different than you think

When 4 agents analyze the same document, they find the same issue but describe it differently. Three rules: (1) never dedup within the same agent — two similar findings from Legal are intentionally distinct, (2) only dedup across agents on the same document — similar findings on different documents are different findings, (3) keep contributing agent metadata so you know which domains flagged it.
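
The three rules can be sketched in a few lines. The finding dict keys are hypothetical, and the token_sort_ratio below is a difflib-based stand-in that mimics the idea of rapidfuzz's function (sort tokens, then compare), not its exact scoring.

```python
from difflib import SequenceMatcher


def token_sort_ratio(a: str, b: str) -> float:
    # Stand-in for rapidfuzz: compare strings after sorting their tokens.
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return 100 * SequenceMatcher(None, sorted_a, sorted_b).ratio()


def dedup(findings: list[dict], threshold: float = 80) -> list[dict]:
    merged: list[dict] = []
    for finding in findings:
        for kept in merged:
            if (
                finding["document"] == kept["document"]      # rule 2: same document
                and finding["agent"] not in kept["agents"]   # rule 1: different agent
                and token_sort_ratio(finding["text"], kept["text"]) >= threshold
            ):
                # Rule 3: record every agent that flagged the issue.
                kept["agents"].append(finding["agent"])
                break
        else:
            merged.append(
                {
                    "document": finding["document"],
                    "text": finding["text"],
                    "agents": [finding["agent"]],
                }
            )
    return merged
```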

8. Context window engineering is a first-class discipline

It's not just about fitting data in — it's about where things go. Critical instructions go at the start (highest recall zone). Document content goes in the middle (lowest recall — ~40% worse). Constraints and format rules go at the end (second-highest recall). We budget 40% of the context window for tool calls and reasoning.

9. Quality gates must be blocking, not advisory

If validation just logs a warning, nobody reads it. If it halts the pipeline, quality is non-negotiable. The same goes for agent guardrails: hard turn limits (a soft limit at 200 turns, force-kill at 3x that), path guards (agents can only write under _dd/), and bash guards (24 blocked patterns, including rm -rf, sudo, and pipe-to-shell). Better to produce nothing than unreliable output.
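
As an illustration of the path and bash guards, here is a minimal sketch. The three regexes are my own guesses at what such patterns might look like, not the project's 24, and the function names are hypothetical. It needs Python 3.9+ for Path.is_relative_to.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumptions, not the real blocklist).
BLOCKED_PATTERNS = [
    re.compile(r"\brm\s+-\w*[rf]"),       # rm with a recursive or force flag
    re.compile(r"\bsudo\b"),              # privilege escalation
    re.compile(r"\|\s*(sh|bash)\b"),      # piping output into a shell
]


def command_allowed(command: str) -> bool:
    return not any(pattern.search(command) for pattern in BLOCKED_PATTERNS)


def write_allowed(target: str, root: str = "_dd") -> bool:
    # Resolve first so "../" segments cannot escape the allowed root.
    base = Path(root).resolve()
    return Path(target).resolve().is_relative_to(base)
```

Both checks return a boolean so the caller can turn a False into a hard stop rather than a logged warning.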

10. Every claim must be traceable to source

Citation verification uses 4 scopes: exact page match → adjacent pages ±1 → full document fuzzy match (80%+) → cross-file search. That last one matters — if the quote isn't in the cited file, we search all files for that entity. Auto-corrects file misattribution.
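
A simplified sketch of the four scopes. Assumptions: the corpus is a dict of file name to list of page texts, pages are 0-indexed, fuzzy matching uses difflib over sliding word windows rather than the project's matcher, and the cross-file scope searches every file instead of only the files for one entity.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def fuzzy_contains(quote: str, text: str, threshold: float = 0.8) -> bool:
    quote_words = normalize(quote).split()
    words = normalize(text).split()
    target = " ".join(quote_words)
    # Slide a window the size of the quote across the page text.
    for i in range(max(1, len(words) - len(quote_words) + 1)):
        window = " ".join(words[i:i + len(quote_words)])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False


def verify(
    quote: str, file: str, page: int, corpus: dict[str, list[str]]
) -> Optional[tuple[str, int, str]]:
    needle = normalize(quote)
    if not needle:
        return None
    pages = corpus.get(file, [])
    # Scope 1: exact match on the cited page.
    if 0 <= page < len(pages) and needle in normalize(pages[page]):
        return file, page, "exact_page"
    # Scope 2: exact match on the adjacent pages.
    for p in (page - 1, page + 1):
        if 0 <= p < len(pages) and needle in normalize(pages[p]):
            return file, p, "adjacent_page"
    # Scope 3: fuzzy match anywhere in the cited document.
    for p, text in enumerate(pages):
        if fuzzy_contains(quote, text):
            return file, p, "fuzzy_document"
    # Scope 4: exact match in any other file (corrects misattribution).
    for other, other_pages in corpus.items():
        if other == file:
            continue
        for p, text in enumerate(other_pages):
            if needle in normalize(text):
                return other, p, "cross_file"
    return None
```

Returning the scope alongside the location lets the report show how much the citation had to be corrected.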

11. Most of what AI finds is noise

Run 9 agents across hundreds of documents and you'll get thousands of findings. We use a 3-stage classification: noise filter (15 patterns for extraction artifacts), data quality filter (14 patterns for "data unavailable" gaps), then material findings. Plus 5 severity recalibration rules — e.g., a change-of-control clause that only applies to competitors gets downgraded from P0 to P3 automatically.

12. Same clause, different deal, different severity

An anti-assignment clause is P0 in an asset purchase (blocks contract transfer) but P3 in a stock purchase (entity doesn't change). Deal-type context must flow through the entire pipeline: prompt-time rules, post-hoc deterministic adjustments, and executive judgment overrides — with full audit trail.
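
The post-hoc deterministic adjustment could look roughly like this. The lookup table holds only the one example from the paragraph above, and the dict keys and audit format are hypothetical.

```python
# (clause type, deal type) -> severity. Only the example above is encoded.
SEVERITY_BY_DEAL_TYPE = {
    ("anti_assignment", "asset_purchase"): "P0",
    ("anti_assignment", "stock_purchase"): "P3",
}


def recalibrate(finding: dict, deal_type: str) -> dict:
    adjusted = SEVERITY_BY_DEAL_TYPE.get((finding["clause"], deal_type))
    if adjusted is None or adjusted == finding["severity"]:
        return finding
    # Record every change so the adjustment is auditable later.
    audit = finding.get("audit", []) + [
        f"{finding['severity']} -> {adjusted} ({deal_type})"
    ]
    return {**finding, "severity": adjusted, "audit": audit}
```

The function returns a new dict instead of mutating its input, so the original agent output stays available for the audit trail.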

13. Every API call is a deal cost

Three model profiles: economy (Haiku for extraction), standard (Sonnet for analysis), premium (Opus for synthesis). Per-agent cost tracking. Hard budget limits that halt the pipeline. Right model for right task.
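
A sketch of per-agent cost tracking with a hard stop. The model tier labels follow the profiles above, but the class, the exception, and the task-to-profile mapping are hypothetical, and the tier strings are labels rather than real model identifiers.

```python
class BudgetExceeded(Exception):
    """Raised to halt the pipeline once spend passes the limit."""


PROFILES = {"economy": "haiku", "standard": "sonnet", "premium": "opus"}
TASK_PROFILE = {
    "extraction": "economy",
    "analysis": "standard",
    "synthesis": "premium",
}


def model_for(task: str) -> str:
    return PROFILES[TASK_PROFILE[task]]


class CostTracker:
    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.by_agent: dict[str, float] = {}

    @property
    def total(self) -> float:
        return sum(self.by_agent.values())

    def record(self, agent: str, cost_usd: float) -> None:
        self.by_agent[agent] = self.by_agent.get(agent, 0.0) + cost_usd
        # Hard limit: raise rather than warn, so the run actually stops.
        if self.total > self.limit_usd:
            raise BudgetExceeded(
                f"spent {self.total:.2f} USD, limit is {self.limit_usd:.2f} USD"
            )
```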

14. Pydantic v2 everywhere

137+ models with model_json_schema() for structured outputs. Strict mypy across 199 source files. The type system catches real bugs: a finding that carries an evidence field instead of the required citations field gets blocked by the schema guard hook before it's written to disk.

15. Make every run smarter than the last

Inspired by Karpathy's "LLM Wiki" pattern: a persistent knowledge base compounds across runs. Finding lineage via SHA-256 fingerprinting tracks findings even when wording changes. A NetworkX knowledge graph with 11 typed edge types captures entity relationships, contradictions, and clause interactions. Run 2 knows what Run 1 found — and catches what changed.
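
One plausible way to get a fingerprint that survives rewording is to hash only the stable, structural fields of a finding and leave the free-text description out. The field choice below is my assumption, not the project's documented scheme.

```python
import hashlib
import json

# Hypothetical set of fields that should not change when wording does.
STABLE_FIELDS = ("entity", "document", "category", "clause_ref")


def fingerprint(finding: dict) -> str:
    stable = {key: finding.get(key) for key in STABLE_FIELDS}
    # sort_keys makes the serialization, and so the hash, deterministic.
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

With this scheme, two runs that describe the same clause in different words produce the same fingerprint, while the same wording on a different document does not.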

Try It

pip install dd-agents
dd-agents auto-config "Buyer" "Target" --data-room ./your_data_room
dd-agents run deal-config.json --dry-run  # Preview without API calls

Sample report (synthetic data, no install needed)

GitHub — Apache 2.0, 3,714 tests, strict mypy.

Built on Anthropic's Claude Agent SDK. Looking for feedback — especially from anyone who's dealt with data room analysis and can tell me whether the report structure maps to how DD findings are actually consumed.

DEV Community