A technical deep-dive into building production-grade AI for high-stakes domains: tool-mandatory verification, adversarial prompting, and zero-trust architecture for legal research.

[Figure: sample system output for a California personal injury case fact pattern]
The Problem Everyone’s Ignoring
You know what’s worse than an AI that doesn’t know the answer? An AI that confidently invents one.
In legal research, a hallucinated case citation isn’t just embarrassing — it’s malpractice. Ask GPT-4 about California construction defect law, and it’ll cheerfully cite Johnson v. CalTrans (2019) with a full legal holding. Sounds great. Except that case doesn’t exist.
When I started building what would become a production-grade legal research system, I thought the hard part would be the multi-agent orchestration. Turns out, the real engineering challenge was teaching five LLMs to say “I don’t know.”
This is the technical post-mortem of that journey.
The Architecture That Changed My Mind
I came in thinking I’d build a RAG system. I left with a zero-trust verification pipeline that treats the LLM’s parametric memory as hostile.
Here’s the mental model shift:
Before: LLM + Knowledge Base = Better Answers
After: LLM + External APIs + Adversarial Prompting = Verifiable Answers
The system architecture looks like this:
Client Intake Facts
↓
[Guardrails Layer] → PII redaction, scope validation
↓
[5-Agent Sequential Pipeline]
├── Legal Expert → Decomposes facts, identifies practice area
├── Statute Researcher → Searches California Codes (tool-mandatory)
├── Case Law Researcher → Verifies citations via CourtListener API
├── Damages Expert → Calculates economic exposure
└── Strategist → Synthesizes IRAC memorandum
↓
[Formatted Legal Memo] → One shot. No conversation. Just analysis.
The key insight: Each agent owns exactly one cognitive function. No delegation. No consensus. Just a relay chain where each agent’s output becomes the next agent’s context.
This isn’t a chatbot. It’s a single-shot research pipeline that takes raw client facts and produces a verified, IRAC-structured legal memorandum in 3–8 minutes.
Three Anti-Hallucination Techniques for Production LLM Systems
1. Tool-Mandatory Verification (The Nuclear Option)
The case law researcher agent has one job: verify citations. Here’s the persona engineering that made it work:
You are a strict legal librarian.
THE GOLDEN RULE: You NEVER cite a case unless you have just
found it in the 'Case Law Search' tool results.
Your internal memory is UNRELIABLE. If the tool returns
"No results," you MUST state "No direct case law found."
Do NOT invent case names. Do NOT invent citations.
If you cannot verify it with the tool, it does not exist.
Notice what’s happening here:
- Negates default behavior (“Your internal memory is UNRELIABLE”)
- Provides explicit fallback (“state ‘No direct case law found’”)
- Attacks the root cause (LLMs want to be helpful and will fabricate to seem knowledgeable)
The agent literally cannot cite a case unless CourtListener’s API returned it in the current execution context.
Result: In 200+ test queries, zero hallucinated citations. The agent will say “No case law found” before it invents.
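Outside the prompt, the same rule can be enforced mechanically in a post-processing step. A minimal sketch of the gating idea, with hypothetical names (the production system enforces this inside the agent loop, not via this helper):

```python
def filter_verified_citations(candidate_citations, tool_results):
    """Keep only citations that appear verbatim in the current tool
    results; anything else is treated as a hallucination.

    candidate_citations: case names the LLM wants to cite
    tool_results: case names actually returned by the search tool
    """
    verified = set(tool_results)
    kept, rejected = [], []
    for citation in candidate_citations:
        (kept if citation in verified else rejected).append(citation)
    return kept, rejected

# The model "remembers" Johnson v. CalTrans, but the tool never returned it
kept, rejected = filter_verified_citations(
    ["Caloroso v. Hathaway (2004)", "Johnson v. CalTrans (2019)"],
    ["Caloroso v. Hathaway (2004)", "Stathoulis v. City of Montebello (2008)"],
)
# kept == ["Caloroso v. Hathaway (2004)"]
# rejected == ["Johnson v. CalTrans (2019)"]
```

The point is that citation authority lives in the tool results of the current execution, never in the model's parametric memory.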
2. Adversarial Self-Check (The “Kill Switch” Protocol)
Most legal AI searches for statutes that support the client’s case. This system also searches for statutes that could destroy it.
The statute researcher runs a mandatory “Void Contract Discovery” protocol:
EXECUTE THIS SEARCH STRATEGY:
• Search 1 (The General Ban):
"[Practice Area] contract void against public policy California"
• Search 2 (The Specific Limit):
"[Practice Area] statutory limitations on liability California"
• Search 3 (The Code Check):
"California Civil Code 1668 [Practice Area]"
Why this matters: in California, contract clauses that violate public policy are void ab initio (void from the beginning). Discovering that Cal. Civ. Code § 1668 voids your indemnity clause is a finding you want before you spend $50K in litigation, not after.
The system actively looks for reasons the client might lose. That’s not a bug — it’s the feature attorneys actually pay for.
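The protocol itself is just a template expanded over the practice area. A sketch (the function name and exact template strings are mine, not the production prompt):

```python
def void_contract_queries(practice_area: str) -> list[str]:
    """Expand a practice area into the three adversarial searches
    of the Void Contract Discovery protocol."""
    return [
        # Search 1 (The General Ban)
        f"{practice_area} contract void against public policy California",
        # Search 2 (The Specific Limit)
        f"{practice_area} statutory limitations on liability California",
        # Search 3 (The Code Check)
        f"California Civil Code 1668 {practice_area}",
    ]

queries = void_contract_queries("construction defect")
# Three searches: the general ban, the specific limit, the code check
```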
3. Probabilistic Language Enforcement
The final memo agent has this instruction baked into its DNA:
NO ABSOLUTES: You are forbidden from using phrases like
"100% chance", "Guaranteed dismissal", "Zero liability", or "No exposure."
USE RANGES: Litigators deal in probabilities.
Use formats like "High probability (70-80%)" or "Moderate risk."
LLMs love confident, absolute statements. Attorneys get disbarred for relying on them.
The prompt engineering forces output like:
“Moderate-to-High Likelihood of Prevailing (65–75%), assuming the plaintiff can establish retained control. However, if the defendant successfully argues passive observation, liability exposure drops to 20–30%.”
That’s not hedging — that’s actually how legal risk analysis works.
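The same rule can be double-checked after generation, in the validation layer. A sketch assuming a hand-picked list of banned phrases (the list here is illustrative, not the production config):

```python
import re

# Illustrative deny-list; the real system's forbidden phrases may differ
FORBIDDEN_ABSOLUTES = [
    r"100%\s+chance",
    r"guaranteed\s+dismissal",
    r"zero\s+liability",
    r"no\s+exposure",
    r"definitely\s+(?:win|lose)",
]

def find_absolutes(memo: str) -> list[str]:
    """Return any forbidden absolute phrases found in the memo,
    so the pipeline can reject or regenerate the output."""
    hits = []
    for pattern in FORBIDDEN_ABSOLUTES:
        hits += re.findall(pattern, memo, flags=re.IGNORECASE)
    return hits
```

A memo that says “High probability (70-80%)” passes; one that promises “Guaranteed dismissal” gets flagged before it reaches the attorney.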
The Sequential Pipeline (Or: Why Order Matters)
The system uses CrewAI’s sequential process, not hierarchical delegation. Here’s why:
# agents/legal_crew.py
from crewai import Crew, Process

crew = Crew(
    agents=[expert, statutes, cases, damages, strategist],
    tasks=[analysis_task, statute_task, case_task, damages_task, strategy_task],
    process=Process.sequential,  # NOT hierarchical
    verbose=True,
)
Design Decision Rationale:
1. Deterministic Ordering: Legal analysis has a natural dependency graph. You cannot search for statutes before you know the practice area, and sequential execution enforces this.
2. No Circular Loops: Every agent has allow_delegation=False. In hierarchical mode, a manager agent could re-delegate to a worker who re-delegates back, creating infinite loops. In a billing-sensitive context (OpenRouter charges per token), this is unacceptable.
3. Debuggability: When a memo contains a bad citation, I can trace it to exactly one agent (the Case Researcher) and exactly one task. In hierarchical mode, the blame graph is ambiguous.
Context Chaining (The Key Mechanism)
Here’s how information flows through the pipeline:
# agents/legal_crew.py — Task Dependency Graph
analysis_task = Task(...)  # No context — runs first
statute_task = Task(
    ...,
    context=[analysis_task],  # Receives analysis output
)
case_task = Task(
    ...,
    context=[analysis_task],  # Receives analysis output
)
damages_task = Task(
    ...,
    context=[analysis_task],  # Receives analysis output
)
strategy_task = Task(
    ...,
    context=[analysis_task, statute_task, case_task, damages_task],
    # Receives ALL prior outputs — this is the synthesis point
)
What This Means at Runtime:
When statute_task executes, CrewAI automatically prepends the full text output of analysis_task into the statute agent's prompt. The agent sees something like:
Here is the context from the previous task:
[Full output of analysis_task]
Now execute: Find relevant California Codes...
The strategist agent receives four full task outputs concatenated into its context window. This is token-expensive (easily 8,000–15,000 tokens of context) but necessary for comprehensive memo generation.
The Execution Flow (Step by Step)
Here’s what happens when an attorney submits client facts:
[STEP 1] Legal Expert Agent
Input: Raw case facts
Output: Practice area, key facts, legal issues
Tools: search_general_tool
Tokens: ~3,000
Sample Output:
Practice Area: Personal Injury / Premises Liability
Key Facts:
- 1-inch sidewalk crack
- Plaintiff tripped and fell
- Property owner aware of defect for 6 months
Legal Issues:
- Duty of care
- Notice (actual vs. constructive)
- Trivial defect doctrine
Initial Assessment: Moderate claim strength
[STEP 2] Statute Researcher Agent
Input: Analysis from Step 1
Output: California Code sections with full text
Tools: search_statute_tool, search_general_tool
Tokens: ~4,000
Special Protocol: Executes the “Void Contract Discovery” search strategy automatically.
Sample Output:
RELEVANT STATUTES:
- Cal. Civ. Code § 1714: General duty of care
- Cal. Civ. Code § 846: Premises liability standards
VOIDING STATUTES DISCOVERED:
- None found in this practice area
[STEP 3] Case Law Researcher Agent
Input: Analysis from Step 1
Output: Verified case citations from CourtListener API
Tools: search_case_law_tool
Tokens: ~3,000
Constraint: Zero-trust verification. Will not cite unverified cases.
Sample Output:
VERIFIED PRECEDENT:
1. Caloroso v. Hathaway (2004) 122 Cal.App.4th 922
Holding: Trivial defect doctrine applies when the defect
is minor in nature and not likely to cause injury.
2. Stathoulis v. City of Montebello (2008) 164 Cal.App.4th 559
Holding: Property owner's actual knowledge of defect for
extended period establishes notice.
[STEP 4] Damages Expert Agent
Input: Analysis from Step 1
Output: Economic + non-economic damage calculations
Tools: None (pure reasoning)
Tokens: ~2,000
Sample Output:
ECONOMIC DAMAGES:
- Medical expenses: $15,000 - $25,000
- Lost wages: $8,000 - $12,000
- Total Economic: $23,000 - $37,000
NON-ECONOMIC DAMAGES (Pain & Suffering):
- Using 2-3x multiplier: $46,000 - $111,000
TOTAL EXPOSURE RANGE: $69,000 - $148,000
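The Step 4 arithmetic is simple enough to sanity-check in code. A sketch that reproduces the sample ranges above, assuming the 2-3x multiplier is applied to total economic damages (low multiplier on the low end, high on the high end):

```python
def damages_range(medical, wages, multiplier=(2, 3)):
    """Compute (low, high) total exposure from economic damage ranges
    plus a pain-and-suffering multiplier on economic damages."""
    econ_low = medical[0] + wages[0]          # 15k + 8k  = 23k
    econ_high = medical[1] + wages[1]         # 25k + 12k = 37k
    noneco_low = econ_low * multiplier[0]     # 2x low    = 46k
    noneco_high = econ_high * multiplier[1]   # 3x high   = 111k
    return econ_low + noneco_low, econ_high + noneco_high

low, high = damages_range((15_000, 25_000), (8_000, 12_000))
# (69_000, 148_000), matching the sample output above
```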
[STEP 5] Strategist Agent
Input: Outputs from ALL four prior agents
Output: Final IRAC-structured memorandum
Tools: None (pure synthesis)
Tokens: ~5,000
This agent receives the full context from all upstream research and synthesizes it into a formal legal memo following the IRAC framework:
- Issue: What legal question needs answering?
- Rule: What statutes and case law apply?
- Application: How does the law apply to these specific facts?
- Conclusion: What’s the probable outcome and recommended strategy?
The Anti-Hallucination System (Defense in Depth)
The anti-hallucination system operates at four independent layers:
Layer 1: Persona Constraints
"Your internal memory is UNRELIABLE"
"If the tool returns 'No results,' you MUST state 'No direct case law found'"
Layer 2: Tool-Mandatory Verification
# Case researcher MUST use search_case_law_tool
# Statute researcher MUST use search_statute_tool
# No tools = no citations
Layer 3: Negative Instructions
"Do NOT invent case names"
"Do NOT invent citations"
"You are FORBIDDEN from using phrases like '100% chance'"
Layer 4: Output Validation
# Post-processing layer
- PII redaction
- Disclaimer injection
- Citation format verification
Why all four layers?
- Layer 1 alone is insufficient because LLMs can ignore persona instructions when the query strongly triggers parametric memory.
- Layer 2 alone is insufficient because the model might generate citations in its “reasoning” step before calling the tool.
- Layer 3 alone is insufficient because negative instructions have diminishing returns.
- All four layers together create redundant barriers. If any single layer fails, the others catch the hallucination.
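As one concrete Layer 4 check, citation format verification can be a single regex pass that extracts reporter cites for cross-checking against the Layer 2 tool results. A deliberately narrow sketch (the pattern is mine, tuned only to the California official-reporter style shown in this post):

```python
import re

# Matches official-reporter cites like "(2004) 122 Cal.App.4th 922"
CAL_CITE = re.compile(r"\(\d{4}\) \d+ Cal\.(?:App\.)?\d(?:st|nd|rd|th) \d+")

def extract_citations(memo: str) -> list[str]:
    """Pull well-formed California reporter cites out of a memo so
    they can be cross-checked against the Layer 2 tool results."""
    return CAL_CITE.findall(memo)
```

Anything the regex extracts but the tool never returned is a Layer 2 failure that Layer 4 catches; anything malformed never extracts at all and gets flagged for review.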
Observed Failure Modes (and Mitigations)
| Failure Mode | Example | Mitigation |
| --- | --- | --- |
| Confident Fabrication | “In Johnson v. CalTrans (2019)…” (case doesn’t exist) | Layer 2: Tool-mandatory verification |
| Citation Drift | Finds Smith v. Jones (2015), cites it as (2018) | Layer 1: “Copy citation exactly as returned by tool” |
| Reasoning Leak | Mentions a case in its thought process, then cites it as if verified | Layer 3: “Do NOT invent case names” |
| Overconfident Assessment | “The client will definitely win” | Layer 3: Probability ranges + Layer 4: Disclaimer injection |
The Tech Stack (And Why Each Piece)
Core Framework:
- CrewAI → Multi-agent orchestration (chose over LangGraph for built-in task dependencies)
- LangChain → LLM abstraction (used internally by CrewAI)
- OpenRouter → LLM gateway (enables model switching without code changes)
Grounding Layer:
- CourtListener API → Case law verification (free, open-source, real citations)
- Tavily API → General legal search
- SerpAPI → Statute lookup via Google
Infrastructure:
- Gradio → UI (prototype-to-production speed is unmatched)
- Hugging Face → Deployment (supports long-running async tasks)
Why OpenRouter instead of direct OpenAI?
- Model flexibility → Switch from GPT to Claude to Grok with one env var
- Cost optimization → Access to free-tier models during development
- Rate limit pooling → Aggregates limits across providers
- No vendor lock-in → CrewAI thinks it’s OpenAI, but we can route anywhere
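Concretely, the routing is just an OpenAI-compatible base URL plus a model name in environment variables. A minimal sketch (the exact env var names a given CrewAI version reads can vary; treat these as illustrative):

```python
import os

# Point the OpenAI-compatible client at OpenRouter instead of OpenAI
os.environ["OPENAI_API_BASE"] = "https://openrouter.ai/api/v1"
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENROUTER_API_KEY", "sk-or-...")

# Swapping providers is one env var, not a code change
os.environ["OPENAI_MODEL_NAME"] = "anthropic/claude-3.5-sonnet"
```

CrewAI still believes it is talking to OpenAI; OpenRouter does the routing behind the compatible API.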
Deployment Challenges Nobody Warns You About
Challenge 1: Cold Starts on Free Tier Hosting
CrewAI agent initialization takes 5–15 seconds (loading LangChain chains, tool schemas, prompts). On Render’s free tier (512MB RAM), this is painful.
Solution: Lazy loading pattern.
legal_crew_instance = None  # Global singleton

def get_lazy_legal_crew():
    global legal_crew_instance
    if legal_crew_instance is None:
        print("⏳ Lazy Loading Agents (First Run)...")
        legal_crew_instance = LegalResearchCrew(agents)
    return legal_crew_instance
Challenge 2: Long-Running Blocking Calls
CrewAI’s crew.kickoff() is a blocking call that takes 3–8 minutes. Gradio’s HTTP connection times out at 60 seconds.
Solution: Threading + generator pattern.
import threading
import time

def research_case(client_facts):
    thread_data = {"output": None, "done": False}

    def background_task():
        result = legal_crew.kickoff(query=client_facts)
        thread_data["output"] = result["output"]
        thread_data["done"] = True

    t = threading.Thread(target=background_task)
    t.start()

    # Generator yields progress updates while the crew runs;
    # progress_markdown is the status panel content, defined elsewhere
    while not thread_data["done"]:
        yield "⏳ Researching...", progress_markdown
        time.sleep(1.5)

    yield thread_data["output"], "✅ Complete"
The UI stays alive by yielding progress updates every 1.5 seconds while the crew runs in the background.
Challenge 3: API Rate Limits
CourtListener’s free tier allows 5,000 requests/day. Each case search can trigger 3–5 API calls (because the agent uses a ReAct loop).
Solution: Query-level caching with MD5 hashing.
import hashlib

query_hash = hashlib.md5(query.encode()).hexdigest()
cache_key = f"research:{query_hash}"
if cached_result := get_from_cache(cache_key):  # TTL-backed cache helper
    return cached_result
result = crew.kickoff(query)
set_cache(cache_key, result, ttl=86400)  # 24hr cache
This reduced API calls by ~70% in testing.
The Metrics That Matter
After 6 months and 200+ test queries, here’s what the numbers actually show.
The 0% hallucination rate is the headline number.
The 3–8 minute turnaround is what makes the economics work.
The $0.045–$0.20 cost is what makes it scalable.
Quick Breakdown of the Results
- Hallucinated Citations: 0% (vs. an industry baseline of 15–30% with raw GPT-4)
- Time to Memo: 3–8 minutes (vs. 2–4 hours for a junior associate)
- Cost per Research: $0.045–$0.20 (vs. $150–$600 in billable time)
- Statute Coverage: 85% of queries (vs. ~60% with manual Westlaw searches)
- Token Usage: 15K–40K (N/A for traditional methods)
Why This Matters (Even If You’re Not Building Legal AI)
The patterns here generalize to any high-stakes LLM application:
Pattern 1: Tool-Mandatory Verification
Applies to: Medical diagnosis, financial analysis, engineering calculations
→ If the LLM can’t verify it with a tool, it doesn’t output it.
Pattern 2: Adversarial Self-Check
Applies to: Security audits, code review, risk assessment
→ The system actively searches for reasons its recommendation might fail.
Pattern 3: Sequential Task Chaining
Applies to: Any multi-step reasoning pipeline
→ Enforce dependency order. No agent performs another’s job.
Pattern 4: Defense-in-Depth Against Hallucinations
Applies to: Any production LLM system
→ Persona + Tools + Negative Instructions + Validation = Redundant safety.
The Part Where I’m Supposed to Sell You Something
I’m not selling you a SaaS product. This system is purpose-built for California law firms who need to:
- Triage intake calls (Is this case worth taking?)
- Train junior associates (Here’s how a senior would analyze this)
- Scale research capacity without hiring
But if you’re a hiring manager, recruiter, or senior engineer reading this and thinking:
“This person understands production LLM systems, not just POC demos…”
Then I’ve done my job.
Let’s Talk
I’m currently exploring staff-level AI/ML engineering roles (or senior++ IC track) where:
- The problem domain is technically hard (not another CRUD chatbot)
- The team values systematic thinking over move-fast-break-things
- There’s a real path to production (actual users, actual stakes)
What I bring:
- Obsessive attention to failure modes (hallucinations, rate limits, cold starts)
- Comfort with ambiguous requirements (attorneys don’t speak in user stories)
- Battle scars from deploying LLMs in high-stakes domains
If that’s interesting, let’s talk:
📧 Email: abrarmuhtasim400@gmail.com
💼 LinkedIn: [abrar muhtasim]
Or just drop a comment. I respond to everything.
P.S. — If you’re an attorney reading this and thinking “Wait, I need this,” shoot me a DM. The system is in limited beta and I’m onboarding firms selectively.
P.P.S. — If you’re an engineer building in the legal/compliance/healthcare space and dealing with hallucination hell, I’m happy to do a technical deep-dive call. Some of this stuff took me months to figure out; maybe I can save you some time.
Thanks for reading. If this was useful, the algorithm likes claps and shares. Your call. 👨⚖️🤖