A technical deep-dive into building production-grade AI for high-stakes domains: tool-mandatory verification, adversarial prompting, and zero-trust architecture for legal research.

[Figure: sample system output for a California personal injury case fact pattern]
The Problem Everyone’s Ignoring
You know what’s worse than an AI that doesn’t know the answer? An AI that confidently invents one.
In legal research, a hallucinated case citation isn’t just embarrassing — it’s malpractice. Ask GPT-4 about California construction defect law, and it’ll cheerfully cite Johnson v. CalTrans (2019) with a full legal holding. Sounds great. Except that case doesn’t exist.
When I started building what would become a production-grade legal research system, I thought the hard part would be the multi-agent orchestration. Turns out, the real engineering challenge was teaching five LLMs to say “I don’t know.”
This is the technical post-mortem of that journey.
The Architecture That Changed My Mind
I came in thinking I’d build a RAG system. I left with a zero-trust verification pipeline that treats the LLM’s parametric memory as hostile.
Here’s the mental model shift:
Before: LLM + Knowledge Base = Better Answers
After: LLM + External APIs + Adversarial Prompting = Verifiable Answers
The system architecture looks like this:
Client Intake Facts
↓
[Guardrails Layer] → PII redaction, scope validation
↓
[5-Agent Sequential Pipeline]
├── Legal Expert → Decomposes facts, identifies practice area
├── Statute Researcher → Searches California Codes (tool-mandatory)
├── Case Law Researcher → Verifies citations via CourtListener API
├── Damages Expert → Calculates economic exposure
└── Strategist → Synthesizes IRAC memorandum
↓
[Formatted Legal Memo] → One shot. No conversation. Just analysis.
The key insight: Each agent owns exactly one cognitive function. No delegation. No consensus. Just a relay chain where each agent’s output becomes the next agent’s context.
This isn’t a chatbot. It’s a single-shot research pipeline that takes raw client facts and produces a verified, IRAC-structured legal memorandum in 3–8 minutes.
Three Anti-Hallucination Techniques for Production LLM Systems
1. Tool-Mandatory Verification (The Nuclear Option)
The case law researcher agent has one job: verify citations. Here’s the persona engineering that made it work:
You are a strict legal librarian.
THE GOLDEN RULE: You NEVER cite a case unless you have just
found it in the 'Case Law Search' tool results.
Your internal memory is UNRELIABLE. If the tool returns
"No results," you MUST state "No direct case law found."
Do NOT invent case names. Do NOT invent citations.
If you cannot verify it with the tool, it does not exist.
Notice what’s happening here:
- Negates default behavior (“Your internal memory is UNRELIABLE”)
- Provides explicit fallback (“state ‘No direct case law found’”)
- Attacks the root cause (LLMs want to be helpful and will fabricate to seem knowledgeable)
The agent literally cannot cite a case unless CourtListener’s API returned it in the current execution context.
Result: In 200+ test queries, zero hallucinated citations. The agent will say “No case law found” before it invents.
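Outside the prompt, the same rule can be enforced mechanically in a post-processing step. A minimal sketch of the gating idea, with hypothetical names (the production system enforces this inside the agent loop, not via this helper):

```python
def filter_verified_citations(candidate_citations, tool_results):
    """Keep only citations that appear verbatim in the current tool
    results; anything else is treated as a hallucination.

    candidate_citations: case names the LLM wants to cite
    tool_results: case names actually returned by the search tool
    """
    verified = set(tool_results)
    kept, rejected = [], []
    for citation in candidate_citations:
        (kept if citation in verified else rejected).append(citation)
    return kept, rejected

# The model "remembers" Johnson v. CalTrans, but the tool never returned it
kept, rejected = filter_verified_citations(
    ["Caloroso v. Hathaway (2004)", "Johnson v. CalTrans (2019)"],
    ["Caloroso v. Hathaway (2004)", "Stathoulis v. City of Montebello (2008)"],
)
# kept == ["Caloroso v. Hathaway (2004)"]
# rejected == ["Johnson v. CalTrans (2019)"]
```

The point is that citation authority lives in the tool results of the current execution, never in the model's parametric memory.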
2. Adversarial Self-Check (The “Kill Switch” Protocol)
Most legal AI searches for statutes that support the client’s case. This system also searches for statutes that could destroy it.
The statute researcher runs a mandatory “Void Contract Discovery” protocol:
EXECUTE THIS SEARCH STRATEGY:
• Search 1 (The General Ban):
"[Practice Area] contract void against public policy California"
• Search 2 (The Specific Limit):
"[Practice Area] statutory limitations on liability California"
• Search 3 (The Code Check):
"California Civil Code 1668 [Practice Area]"
Why this matters: in California, contract clauses that violate public policy are void ab initio (void from the beginning). Discovering that Cal. Civ. Code § 1668 voids your indemnity clause is a finding you want before you spend $50K in litigation, not after.
The system actively looks for reasons the client might lose. That’s not a bug — it’s the feature attorneys actually pay for.
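The protocol itself is just a template expanded over the practice area. A sketch (the function name and exact template strings are mine, not the production prompt):

```python
def void_contract_queries(practice_area: str) -> list[str]:
    """Expand a practice area into the three adversarial searches
    of the Void Contract Discovery protocol."""
    return [
        # Search 1 (The General Ban)
        f"{practice_area} contract void against public policy California",
        # Search 2 (The Specific Limit)
        f"{practice_area} statutory limitations on liability California",
        # Search 3 (The Code Check)
        f"California Civil Code 1668 {practice_area}",
    ]

queries = void_contract_queries("construction defect")
# Three searches: the general ban, the specific limit, the code check
```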
3. Probabilistic Language Enforcement
The final memo agent has this instruction baked into its DNA:
NO ABSOLUTES: You are forbidden from using phrases like
"100% chance", "Guaranteed dismissal", "Zero liability", or "No exposure."
USE RANGES: Litigators deal in probabilities.
Use formats like "High probability (70-80%)" or "Moderate risk."
LLMs love confident, absolute statements. Attorneys get disbarred for relying on them.
The prompt engineering forces output like:
“Moderate-to-High Likelihood of Prevailing (65–75%), assuming the plaintiff can establish retained control. However, if the defendant successfully argues passive observation, liability exposure drops to 20–30%.”
That’s not hedging — that’s actually how legal risk analysis works.
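The same rule can be double-checked after generation, in the validation layer. A sketch assuming a hand-picked list of banned phrases (the list here is illustrative, not the production config):

```python
import re

# Illustrative deny-list; the real system's forbidden phrases may differ
FORBIDDEN_ABSOLUTES = [
    r"100%\s+chance",
    r"guaranteed\s+dismissal",
    r"zero\s+liability",
    r"no\s+exposure",
    r"definitely\s+(?:win|lose)",
]

def find_absolutes(memo: str) -> list[str]:
    """Return any forbidden absolute phrases found in the memo,
    so the pipeline can reject or regenerate the output."""
    hits = []
    for pattern in FORBIDDEN_ABSOLUTES:
        hits += re.findall(pattern, memo, flags=re.IGNORECASE)
    return hits
```

A memo that says “High probability (70-80%)” passes; one that promises “Guaranteed dismissal” gets flagged before it reaches the attorney.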
The Sequential Pipeline (Or: Why Order Matters)
The system uses CrewAI’s sequential process, not hierarchical delegation. Here’s why:
# agents/legal_crew.py
from crewai import Crew, Process

crew = Crew(
    agents=[expert, statutes, cases, damages, strategist],
    tasks=[analysis_task, statute_task, case_task, damages_task, strategy_task],
    process=Process.sequential,  # NOT hierarchical
    verbose=True,
)
Design Decision Rationale:
1. Deterministic Ordering: Legal analysis has a natural dependency graph. You cannot search for statutes before you know the practice area, and sequential execution enforces this.
2. No Circular Loops: Every agent has allow_delegation=False. In hierarchical mode, a manager agent could re-delegate to a worker who re-delegates back, creating infinite loops. In a billing-sensitive context (OpenRouter charges per token), this is unacceptable.
3. Debuggability: When a memo contains a bad citation, I can trace it to exactly one agent (the Case Researcher) and exactly one task. In hierarchical mode, the blame graph is ambiguous.
Context Chaining (The Key Mechanism)
Here’s how information flows through the pipeline:
# agents/legal_crew.py — Task Dependency Graph
analysis_task = Task(...)  # No context — runs first
statute_task = Task(
    ...,
    context=[analysis_task],  # Receives analysis output
)
case_task = Task(
    ...,
    context=[analysis_task],  # Receives analysis output
)
damages_task = Task(
    ...,
    context=[analysis_task],  # Receives analysis output
)
strategy_task = Task(
    ...,
    context=[analysis_task, statute_task, case_task, damages_task],
    # Receives ALL prior outputs — this is the synthesis point
)
What This Means at Runtime:
When statute_task executes, CrewAI automatically prepends the full text output of analysis_task into the statute agent's prompt. The agent sees something like:
Here is the context from the previous task:
[Full output of analysis_task]
Now execute: Find relevant California Codes...
The strategist agent receives four full task outputs concatenated into its context window. This is token-expensive (easily 8,000–15,000 tokens of context) but necessary for comprehensive memo generation.
The Execution Flow (Step by Step)
Here’s what happens when an attorney submits client facts:
[STEP 1] Legal Expert Agent
Input: Raw case facts
Output: Practice area, key facts, legal issues
Tools: search_general_tool
Tokens: ~3,000
Sample Output:
Practice Area: Personal Injury / Premises Liability
Key Facts:
- 1-inch sidewalk crack
- Plaintiff tripped and fell
- Property owner aware of defect for 6 months
Legal Issues:
- Duty of care
- Notice (actual vs. constructive)
- Trivial defect doctrine
Initial Assessment: Moderate claim strength
[STEP 2] Statute Researcher Agent
Input: Analysis from Step 1
Output: California Code sections with full text
Tools: search_statute_tool, search_general_tool
Tokens: ~4,000
Special Protocol: Executes the “Void Contract Discovery” search strategy automatically.
Sample Output:
RELEVANT STATUTES:
- Cal. Civ. Code § 1714: General duty of care
- Cal. Civ. Code § 846: Premises liability standards
VOIDING STATUTES DISCOVERED:
- None found in this practice area
[STEP 3] Case Law Researcher Agent
Input: Analysis from Step 1
Output: Verified case citations from CourtListener API
Tools: search_case_law_tool
Tokens: ~3,000
Constraint: Zero-trust verification. Will not cite unverified cases.
Sample Output:
VERIFIED PRECEDENT:
1. Caloroso v. Hathaway (2004) 122 Cal.App.4th 922
Holding: Trivial defect doctrine applies when the defect
is minor in nature and not likely to cause injury.
2. Stathoulis v. City of Montebello (2008) 164 Cal.App.4th 559
Holding: Property owner's actual knowledge of defect for
extended period establishes notice.
[STEP 4] Damages Expert Agent
Input: Analysis from Step 1
Output: Economic + non-economic damage calculations
Tools: None (pure reasoning)
Tokens: ~2,000
Sample Output:
ECONOMIC DAMAGES:
- Medical expenses: $15,000 - $25,000
- Lost wages: $8,000 - $12,000
- Total Economic: $23,000 - $37,000
NON-ECONOMIC DAMAGES (Pain & Suffering):
- Using 2-3x multiplier: $46,000 - $111,000
TOTAL EXPOSURE RANGE: $69,000 - $148,000
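The Step 4 arithmetic is simple enough to sanity-check in code. A sketch that reproduces the sample ranges above, assuming the 2-3x multiplier is applied to total economic damages (low multiplier on the low end, high on the high end):

```python
def damages_range(medical, wages, multiplier=(2, 3)):
    """Compute (low, high) total exposure from economic damage ranges
    plus a pain-and-suffering multiplier on economic damages."""
    econ_low = medical[0] + wages[0]          # 15k + 8k  = 23k
    econ_high = medical[1] + wages[1]         # 25k + 12k = 37k
    noneco_low = econ_low * multiplier[0]     # 2x low    = 46k
    noneco_high = econ_high * multiplier[1]   # 3x high   = 111k
    return econ_low + noneco_low, econ_high + noneco_high

low, high = damages_range((15_000, 25_000), (8_000, 12_000))
# (69_000, 148_000), matching the sample output above
```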
[STEP 5] Strategist Agent
Input: Outputs from ALL four prior agents
Output: Final IRAC-structured memorandum
Tools: None (pure synthesis)
Tokens: ~5,000
This agent receives the full context from all upstream research and synthesizes it into a formal legal memo following the IRAC framework:
- Issue: What legal question needs answering?
- Rule: What statutes and case law apply?
- Application: How does the law apply to these specific facts?
- Conclusion: What’s the probable outcome and recommended strategy?
The Anti-Hallucination System (Defense in Depth)
The anti-hallucination system operates at four independent layers:
Layer 1: Persona Constraints
"Your internal memory is UNRELIABLE"
"If the tool returns 'No results,' you MUST state 'No direct case law found'"
Layer 2: Tool-Mandatory Verification
# Case researcher MUST use search_case_law_tool
# Statute researcher MUST use search_statute_tool
# No tools = no citations
Layer 3: Negative Instructions
"Do NOT invent case names"
"Do NOT invent citations"
"You are FORBIDDEN from using phrases like '100% chance'"
Layer 4: Output Validation
# Post-processing layer
- PII redaction
- Disclaimer injection
- Citation format verification
Why all four layers?
- Layer 1 alone is insufficient because LLMs can ignore persona instructions when the query strongly triggers parametric memory.
- Layer 2 alone is insufficient because the model might generate citations in its “reasoning” step before calling the tool.
- Layer 3 alone is insufficient because negative instructions have diminishing returns.
- All four layers together create redundant barriers. If any single layer fails, the others catch the hallucination.
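As one concrete Layer 4 check, citation format verification can be a single regex pass that extracts reporter cites for cross-checking against the Layer 2 tool results. A deliberately narrow sketch (the pattern is mine, tuned only to the California official-reporter style shown in this post):

```python
import re

# Matches official-reporter cites like "(2004) 122 Cal.App.4th 922"
CAL_CITE = re.compile(r"\(\d{4}\) \d+ Cal\.(?:App\.)?\d(?:st|nd|rd|th) \d+")

def extract_citations(memo: str) -> list[str]:
    """Pull well-formed California reporter cites out of a memo so
    they can be cross-checked against the Layer 2 tool results."""
    return CAL_CITE.findall(memo)
```

Anything the regex extracts but the tool never returned is a Layer 2 failure that Layer 4 catches; anything malformed never extracts at all and gets flagged for review.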
Observed Failure Modes (and Mitigations)
| Failure Mode | Example | Mitigation |
| --- | --- | --- |
| Confident Fabrication | “In Johnson v. CalTrans (2019)…” (case doesn’t exist) | Layer 2: Tool-mandatory verification |
| Citation Drift | Finds Smith v. Jones (2015), cites it as (2018) | Layer 1: “Copy citation exactly as returned by tool” |
| Reasoning Leak | Mentions a case in its thought process, then cites it as if verified | Layer 3: “Do NOT invent case names” |
| Overconfident Assessment | “The client will definitely win” | Layer 3: Probability ranges + Layer 4: Disclaimer injection |
The Tech Stack (And Why Each Piece)
Core Framework:
- CrewAI → Multi-agent orchestration (chose over LangGraph for built-in task dependencies)
- LangChain → LLM abstraction (used internally by CrewAI)
- OpenRouter → LLM gateway (enables model switching without code changes)
Grounding Layer:
- CourtListener API → Case law verification (free, open-source, real citations)
- Tavily API → General legal search
- SerpAPI → Statute lookup via Google
Infrastructure:
- Gradio → UI (prototype-to-production speed is unmatched)
- Hugging Face → Deployment (supports long-running async tasks)
Why OpenRouter instead of direct OpenAI?
- Model flexibility → Switch from GPT to Claude to Grok with one env var
- Cost optimization → Access to free-tier models during development
- Rate limit pooling → Aggregates limits across providers
- No vendor lock-in → CrewAI thinks it’s OpenAI, but we can route anywhere
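Concretely, the routing is just an OpenAI-compatible base URL plus a model name in environment variables. A minimal sketch (the exact env var names a given CrewAI version reads can vary; treat these as illustrative):

```python
import os

# Point the OpenAI-compatible client at OpenRouter instead of OpenAI
os.environ["OPENAI_API_BASE"] = "https://openrouter.ai/api/v1"
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENROUTER_API_KEY", "sk-or-...")

# Swapping providers is one env var, not a code change
os.environ["OPENAI_MODEL_NAME"] = "anthropic/claude-3.5-sonnet"
```

CrewAI still believes it is talking to OpenAI; OpenRouter does the routing behind the compatible API.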
Deployment Challenges Nobody Warns You About
Challenge 1: Cold Starts on Free Tier Hosting
CrewAI agent initialization takes 5–15 seconds (loading LangChain chains, tool schemas, prompts). On Render’s free tier (512MB RAM), this is painful.
Solution: Lazy loading pattern.
legal_crew_instance = None  # Global singleton

def get_lazy_legal_crew():
    global legal_crew_instance
    if legal_crew_instance is None:
        print("⏳ Lazy Loading Agents (First Run)...")
        legal_crew_instance = LegalResearchCrew(agents)
    return legal_crew_instance
Challenge 2: Long-Running Blocking Calls
CrewAI’s crew.kickoff() is a blocking call that takes 3–8 minutes. Gradio’s HTTP connection times out at 60 seconds.
Solution: Threading + generator pattern.
import threading
import time

def research_case(client_facts):
    thread_data = {"output": None, "done": False}

    def background_task():
        result = legal_crew.kickoff(query=client_facts)
        thread_data["output"] = result["output"]
        thread_data["done"] = True

    t = threading.Thread(target=background_task)
    t.start()

    # Generator yields progress updates while the crew runs;
    # progress_markdown is the status panel content, defined elsewhere
    while not thread_data["done"]:
        yield "⏳ Researching...", progress_markdown
        time.sleep(1.5)

    yield thread_data["output"], "✅ Complete"
The UI stays alive by yielding progress updates every 1.5 seconds while the crew runs in the background.
Challenge 3: API Rate Limits
CourtListener’s free tier allows 5,000 requests/day. Each case search can trigger 3–5 API calls (because the agent uses a ReAct loop).
Solution: Query-level caching with MD5 hashing.
import hashlib

query_hash = hashlib.md5(query.encode()).hexdigest()
cache_key = f"research:{query_hash}"
if cached_result := get_from_cache(cache_key):  # TTL-backed cache helper
    return cached_result
result = crew.kickoff(query)
set_cache(cache_key, result, ttl=86400)  # 24hr cache
This reduced API calls by ~70% in testing.
The Metrics That Matter
After 6 months and 200+ test queries, here’s what the numbers actually show.
The 0% hallucination rate is the headline number.
The 3–8 minute turnaround is what makes the economics work.
The $0.045–$0.20 cost is what makes it scalable.
Quick Breakdown of the Results
- Hallucinated Citations: 0% (vs. an industry baseline of 15–30% with raw GPT-4)
- Time to Memo: 3–8 minutes (vs. 2–4 hours for a junior associate)
- Cost per Research: $0.045–$0.20 (vs. $150–$600 in billable time)
- Statute Coverage: 85% of queries (vs. ~60% with manual Westlaw searches)
- Token Usage: 15K–40K (N/A for traditional methods)
Why This Matters (Even If You’re Not Building Legal AI)
The patterns here generalize to any high-stakes LLM application:
Pattern 1: Tool-Mandatory Verification
Applies to: Medical diagnosis, financial analysis, engineering calculations
→ If the LLM can’t verify it with a tool, it doesn’t output it.
Pattern 2: Adversarial Self-Check
Applies to: Security audits, code review, risk assessment
→ The system actively searches for reasons its recommendation might fail.
Pattern 3: Sequential Task Chaining
Applies to: Any multi-step reasoning pipeline
→ Enforce dependency order. No agent performs another’s job.
Pattern 4: Defense-in-Depth Against Hallucinations
Applies to: Any production LLM system
→ Persona + Tools + Negative Instructions + Validation = Redundant safety.
The Part Where I’m Supposed to Sell You Something
I’m not selling you a SaaS product. This system is purpose-built for California law firms who need to:
- Triage intake calls (Is this case worth taking?)
- Train junior associates (Here’s how a senior would analyze this)
- Scale research capacity without hiring
But if you’re a hiring manager, recruiter, or senior engineer reading this and thinking:
“This person understands production LLM systems, not just POC demos…”
Then I’ve done my job.
Let’s Talk
I’m currently exploring staff-level AI/ML engineering roles (or senior++ IC track) where:
- The problem domain is technically hard (not another CRUD chatbot)
- The team values systematic thinking over move-fast-break-things
- There’s a real path to production (actual users, actual stakes)
What I bring:
- Obsessive attention to failure modes (hallucinations, rate limits, cold starts)
- Comfort with ambiguous requirements (attorneys don’t speak in user stories)
- Battle scars from deploying LLMs in high-stakes domains
If that’s interesting, let’s talk:
📧 Email: abrarmuhtasim400@gmail.com
💼 LinkedIn: [abrar muhtasim]
Or just drop a comment. I respond to everything.
P.S. — If you’re an attorney reading this and thinking “Wait, I need this,” shoot me a DM. The system is in limited beta and I’m onboarding firms selectively.
P.P.S. — If you’re an engineer building in the legal/compliance/healthcare space and dealing with hallucination hell, I’m happy to do a technical deep-dive call. Some of this stuff took me months to figure out; maybe I can save you some time.
Thanks for reading. If this was useful, the algorithm likes claps and shares. Your call. 👨⚖️🤖