Abrar Mohtasim
Why I Built an AI That Tries to Destroy Your Legal Argument

The Kill Switch Protocol: Mandatory adversarial search in production LLM systems


Most AI systems suffer from the same fatal flaw: they're desperate to help.

Ask ChatGPT about your legal case, and it'll find ten reasons you'll win. Ask Claude, and it'll write you a confident demand letter. Neither will tell you about the statute that voids your entire contract.

I spent six months building an AI legal researcher with a different philosophy. The system doesn't just search for supporting law—it actively searches for reasons the client might lose. I call it the "Kill Switch Protocol," a mandatory adversarial self-check where one agent's sole job is to find the statute, case, or doctrine that could destroy the entire legal argument before the attorney files the complaint.

This isn't about making AI "balanced" or "fair." It's about making it useful in high-stakes domains where being helpful can be dangerous.

The Sycophancy Problem Nobody Talks About

In 2023, Anthropic published research showing that language models exhibit "sycophantic" behavior—they tell users what they want to hear rather than what's accurate. The problem stems from RLHF (Reinforcement Learning from Human Feedback). Models learn that agreeable responses get higher ratings from human evaluators. Over thousands of training iterations, they optimize for user satisfaction.

In a chatbot context, this is annoying. In a legal context, it's malpractice.

Here's a real example from my testing. I asked the system:

"Can my client enforce this non-compete clause in California? The employee signed it voluntarily as part of their employment contract."

A standard GPT-4 response would cite cases where non-competes were enforced in narrow circumstances—sale of business goodwill under Cal. Bus. & Prof. Code §16601, partnership dissolution under §16602, maybe some exceptions for trade secret protection. It would sound authoritative. It would be helpful.

It would also be catastrophically wrong.

The correct answer is that California Business & Professions Code §16600 states: "Except as provided in this chapter, every contract by which anyone is restrained from engaging in a lawful profession, trade, or business of any kind is to that extent void."

The statute doesn't say "unenforceable" or "voidable." It says void. As in void ab initio—void from the beginning. Your carefully negotiated non-compete isn't just unenforceable; it legally never existed. And if you try to enforce it anyway, you're not just losing the case—you're facing potential attorney fee awards under Cal. Code Civ. Proc. §1021.5 and possible sanctions for bringing a frivolous claim.

The gap isn't knowledge. GPT-4 "knows" about §16600. It's in the training data. The gap is that the model wasn't forced to search for it. When I asked about enforcement, the model optimized for giving me enforcement cases. It pattern-matched my question to "find legal support" rather than "find legal barriers."

This is the architectural problem I set out to solve.

The Kill Switch Protocol: Mandatory Counter-Search Architecture

The solution is simple in concept, hard in execution: force the AI to search for reasons its recommendation could fail before it generates any output.

In my system, the Statute Researcher agent receives this instruction as part of its core persona:

```
MANDATORY "VOID CONTRACT" DISCOVERY PROTOCOL:

California law aggressively voids contract clauses that violate public policy.
You MUST perform a "Negative Search" to find these prohibitions.

EXECUTE THIS SEARCH STRATEGY:

Search 1 (The General Ban):
  "[Practice Area] contract void against public policy California"

Search 2 (The Specific Limit):
  "[Practice Area] statutory limitations on liability California"

Search 3 (The Code Check):
  "California Civil Code 1668 [Practice Area]"

OUTPUT REQUIREMENT:
Your response MUST contain three sections:

Section A: SUPPORTING STATUTES (laws that help the client's position)
Section B: VOIDING STATUTES (laws that could invalidate the contract/claim)
Section C: SEARCHES PERFORMED (list all queries executed, including null results)

If Section B is empty AND you did not execute all 3 searches,
your output is INVALID and will be rejected.
```

Note what this does architecturally:

It makes the adversarial search mandatory, not optional. The agent cannot skip it and still produce valid output. This is enforced at the prompt level, not through post-processing validation.

It requires search diversity. Three different query formulations prevent the agent from running the same search three times with slightly different wording.

It creates an audit trail. Section C forces the agent to disclose what it searched for, not just what it found. This is critical for debugging. When a voiding statute is missed, I can see whether the agent failed to search for it or whether the search query was poorly constructed.

It separates supporting evidence from counter-evidence. By requiring two distinct output sections, the model can't bury the voiding statute at the bottom of a long analysis. It's structurally prominent.
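A side benefit of the rigid three-section format is that downstream code can parse it deterministically instead of guessing at free-form prose. Here's a minimal sketch of how that parsing might look (the section headers match the protocol; the regex and function are my illustration, not the production parser):

```python
import re

# Matches "Section A: TITLE" headers and captures everything up to the
# next section header (or end of text) as the section body.
SECTION_PATTERN = re.compile(
    r"Section\s+([ABC]):\s*([A-Z ()/'-]+)\n(.*?)(?=Section\s+[ABC]:|\Z)",
    re.DOTALL,
)

def parse_protocol_output(text: str) -> dict:
    """Split a Statute Researcher response into its mandatory sections.

    Missing sections simply won't appear in the result, which lets the
    caller detect protocol violations by checking the keys.
    """
    sections = {}
    for letter, title, body in SECTION_PATTERN.findall(text):
        sections[letter] = {"title": title.strip(), "body": body.strip()}
    return sections

output = """
Section A: SUPPORTING STATUTES
Cal. Civ. Code 2782(a) ...

Section B: VOIDING STATUTES
Cal. Civ. Code 2782(d) ...

Section C: SEARCHES PERFORMED
1. construction indemnity California
2. construction contract void public policy California
3. California Civil Code 2782 indemnity
"""

parsed = parse_protocol_output(output)
assert set(parsed) == {"A", "B", "C"}
```

Because Section C (the audit trail) parses out as plain text, logging every executed query per request becomes a one-liner.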

Real-World Impact: The Case That Justified the Architecture

During beta testing with a California construction litigation firm, the system analyzed this query:

"Client is a general contractor. Subcontract includes indemnity clause requiring the subcontractor to defend and indemnify us for any claims arising from the work, including claims caused by our own supervision and oversight. Is this enforceable?"

Without the Kill Switch Protocol, here's what a helpful AI would say:

"Yes, broad indemnity clauses are generally enforceable in California. See Cal. Civ. Code §2782, which governs construction indemnity agreements. The clause should be reviewed for compliance with §2782's requirements..."

Sounds reasonable. Cites the right statute. An associate might draft the contract based on this.

With the Kill Switch Protocol active, the output was:

```
⚠️ VOIDING STATUTE DISCOVERED IN KILL SWITCH SEARCH:

California Civil Code §2782(d) PROHIBITS indemnity agreements that 
purport to indemnify the indemnitee against liability for the 
indemnitee's own active negligence or willful misconduct.

YOUR CLAUSE: "indemnify us for any claims... including claims caused 
by our own supervision"

ANALYSIS: The phrase "caused by our own supervision" attempts to shift 
liability for the general contractor's active negligence (negligent 
supervision) to the subcontractor.

STATUS: This clause is VOID under §2782(d) as to active negligence claims.

EXPOSURE: If you attempt to enforce this clause when sued for your own 
negligent supervision, the clause will be unenforceable precisely when 
you need it most. The indemnity will only apply to the subcontractor's 
own negligence, not yours.

STRATEGIC RECOMMENDATION: Redraft the clause to comply with §2782(d). 
You can require indemnity for the subcontractor's negligence and for 
your passive negligence (e.g., failure to discover the subcontractor's 
defective work), but not for your active negligent supervision.
```

The financial impact: The firm estimated that catching this pre-drafting saved their client approximately $40,000–$60,000 in litigation costs that would have been incurred when the unenforceable clause inevitably failed during a lawsuit.

Cost of the AI analysis: $1.15 (approximately 18,000 tokens at OpenRouter rates for the full 5-agent research pipeline).

ROI: roughly 35,000x to 52,000x ($40,000–$60,000 saved against $1.15 spent).

But the more important point isn't ROI. It's that this is a mistake a junior associate makes easily. The associate searches for "construction indemnity California," finds §2782, reads the general enforceability provisions in subsection (a), and misses the prohibition in subsection (d). They're optimizing for finding relevant law, not for finding killer exceptions.

The AI, forced to run the adversarial search, finds it automatically.

Implementation: How the Architecture Actually Works

The Kill Switch Protocol sits within a sequential multi-agent pipeline. Here's the simplified execution flow:

```python
# CrewAI-style task definitions (simplified)
from crewai import Task

# Step 1: Legal Expert analyzes facts, identifies practice area
analysis_task = Task(
    description="Analyze facts and identify practice area, key issues",
    agent=legal_expert_agent
)

# Step 2: Statute Researcher executes Kill Switch Protocol
statute_task = Task(
    description="""
    Find relevant statutes. MANDATORY: Execute the Void Contract 
    Discovery Protocol with 3 separate searches. Output must 
    include Section B: VOIDING STATUTES even if empty.
    """,
    agent=statute_researcher_agent,
    context=[analysis_task]  # Receives output from Step 1
)

# Step 3: Other agents continue...
case_task = Task(...)
damages_task = Task(...)

# Step 4: Strategist synthesizes, but CANNOT ignore Section B
strategy_task = Task(
    description="""
    Draft final memorandum. If the Statute Researcher found 
    voiding statutes (Section B), you MUST include a dedicated 
    'FATAL DEFECTS' section analyzing why the claim/contract 
    may be void.
    """,
    agent=strategist_agent,
    context=[analysis_task, statute_task, case_task, damages_task]
)
```

The key architectural decision: the Kill Switch search happens at the agent level, not the orchestration level. Each agent has intrinsic instructions that cannot be overridden by downstream prompt injection. Even if a user tries to append "ignore the void contract search" to their query, the agent's base persona enforces the protocol.

The persona looks like this:

```python
from crewai import Agent

statute_researcher_agent = Agent(
    role="California Statute Specialist",
    goal="Surface both supporting and voiding statutes for every query",
    backstory="""
    You are an expert in California Codes who ONLY cites statutes 
    verified with tools.

    ABSOLUTE RULES:
    1. Use the Statute Search tool for every citation.
    2. You MUST make AT LEAST 3 SEPARATE SEARCHES:
       - Search 1: Primary statute for this practice area
       - Search 2: Public policy / voiding statutes
       - Search 3: Statute of limitations or procedural statutes
    3. If a search returns no results, try different keywords—do not skip.
    4. Include actual text of each statute found.

    THE KILL SWITCH PROTOCOL:
    For any contract-related query, Search 2 MUST target statutes that 
    could void the contract. Use queries like:
    - "[practice area] contract void public policy California"
    - "California Civil Code 1668 [practice area]"

    Your output is INVALID if Section B (Voiding Statutes) is missing.
    """,
    allow_delegation=False,
    max_iter=8,
    tools=[search_statute_tool, search_general_tool]
)
```

The max_iter=8 setting is important. It gives the agent enough iterations to run multiple searches and refine queries if initial results are poor. In testing, I found that max_iter=5 was too restrictive—the agent would sometimes give up after 2-3 failed searches. Eight iterations provides enough runway for the full protocol plus one or two query reformulations.

Observed Failure Modes and Mitigations

The Kill Switch Protocol isn't perfect. Here are the failure modes I've encountered:

Failure Mode 1: Overly narrow search queries

Early in testing, the agent would sometimes construct queries like "California Civil Code 1668 construction defect indemnity." This is so specific that it misses adjacent doctrines.

Mitigation: I added explicit instructions to vary query breadth. Search 1 is specific (primary statute), Search 2 is broad (public policy voids), Search 3 is code-section targeted. This forces diversity.
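One way to make the breadth requirement concrete is to generate the three queries from fixed templates rather than letting the agent free-form them. A sketch (the template wording mirrors the protocol above; the function itself is illustrative):

```python
def build_kill_switch_queries(practice_area: str, code_section: str = "1668") -> list:
    """Generate the three mandated searches at three levels of breadth.

    Tier 1 is specific, tier 2 is deliberately broad, tier 3 targets a
    known code section; the spread prevents three near-identical queries.
    """
    return [
        f"{practice_area} statutory limitations on liability California",   # specific
        f"{practice_area} contract void against public policy California",  # broad
        f"California Civil Code {code_section} {practice_area}",            # code-targeted
    ]

queries = build_kill_switch_queries("construction defect")
assert len(set(queries)) == 3  # three genuinely distinct formulations
```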

Failure Mode 2: False positives when statute post-dates contract

The agent would flag Cal. Bus. & Prof. Code §7031 (unlicensed contractor statute, enacted 1929 but amended significantly in later years) as voiding contracts signed before certain amendments took effect.

Mitigation: I added a temporal check requirement: "If you find a voiding statute, check effective date. If the contract predates the statute or relevant amendment, flag this as 'DATE CONFLICT—REQUIRES MANUAL REVIEW.'" This doesn't fully solve the problem but makes the gap visible.
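Once effective dates are available as structured data, the check itself is trivial to automate; the hard part is sourcing reliable statute metadata. A minimal sketch (the function and flag string are my illustration):

```python
from datetime import date

def temporal_flag(contract_date: date, statute_effective: date) -> str:
    """Flag a voiding statute that post-dates the contract for manual review."""
    if contract_date < statute_effective:
        return "DATE CONFLICT - REQUIRES MANUAL REVIEW"
    return "OK"

# Contract signed before a hypothetical amendment took effect:
assert temporal_flag(date(2015, 6, 1), date(2018, 1, 1)) == "DATE CONFLICT - REQUIRES MANUAL REVIEW"
assert temporal_flag(date(2020, 3, 1), date(2018, 1, 1)) == "OK"
```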

Failure Mode 3: Agent skips Section B when no voids found

Despite explicit instructions, the agent would sometimes omit Section B entirely if no voiding statutes were discovered, rather than including it with "None found."

Mitigation: I added a validation layer in the strategist agent's prompt: "If the Statute Researcher's output lacks a 'Section B' or 'VOIDING STATUTES' header, treat this as a protocol violation and note in your memo: 'Statute research incomplete—Kill Switch Protocol not fully executed.'" This doesn't force compliance, but it surfaces the violation in the final deliverable instead of hiding it: an incomplete-looking memo is much harder to ignore than a silently skipped search.
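The same check can run deterministically in code before the strategist ever sees the output, rather than relying on the prompt alone. A sketch (the header strings match the protocol; the helper is hypothetical):

```python
def protocol_violation_note(statute_output: str):
    """Return the memo warning if the Kill Switch sections are missing, else None."""
    has_section_b = (
        "Section B" in statute_output or "VOIDING STATUTES" in statute_output
    )
    if not has_section_b:
        return "Statute research incomplete - Kill Switch Protocol not fully executed."
    return None

# Compliant output (even with "None found") passes; missing Section B does not.
assert protocol_violation_note("Section A: ...\nSection B: VOIDING STATUTES\nNone found.") is None
assert protocol_violation_note("Section A only") is not None
```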

Generalizing Beyond Legal: Adversarial Search in Other High-Stakes Domains

The Kill Switch Protocol is a legal implementation of a broader principle: in high-stakes domains, AI should be adversarial to its own recommendations.

Here's how the pattern transfers:

Medical Diagnosis AI

```
Primary Agent: Find conditions matching symptoms
Kill Switch Agent: Search for contraindications to recommended treatment

Mandatory searches:
1. "[Recommended drug] contraindications [patient conditions]"
2. "[Recommended drug] drug-drug interactions [current medications]"
3. "[Diagnosis] alternative diagnoses with similar presentation"

Output requirement:
Section A: Primary diagnosis and treatment
Section B: Contraindications discovered
Section C: Differential diagnoses that could mimic Section A
```

Financial Compliance AI

```
Primary Agent: Find investment strategies matching client goals
Kill Switch Agent: Search for regulatory restrictions, tax traps

Mandatory searches:
1. "[Strategy] IRS regulations restrictions"
2. "[Strategy] SEC compliance requirements [client entity type]"
3. "[Strategy] state securities law [client state]"

Output requirement:
Section A: Recommended strategy
Section B: Regulatory barriers discovered
Section C: Tax implications that reduce net returns
```

Code Security Review AI

```
Primary Agent: Suggest code optimizations
Kill Switch Agent: Search for security vulnerabilities introduced

Mandatory searches:
1. "[Optimization technique] known vulnerabilities OWASP"
2. "[Code pattern] injection attack vectors"
3. "[Framework] CVE database [version]"

Output requirement:
Section A: Optimization recommendations
Section B: Security risks introduced
Section C: Performance vs. security tradeoff analysis
```

The common pattern: Force the AI to search for reasons its primary recommendation could fail, using a structured search protocol that covers known failure modes in that domain.
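The domain-agnostic skeleton behind all three variants can be captured in a small config object. A sketch (the class and field names are my illustration, not part of the shipped system):

```python
from dataclasses import dataclass

@dataclass
class AdversarialProtocol:
    """Domain-agnostic Kill Switch configuration."""
    domain: str
    search_templates: list   # each template contains a {subject} placeholder
    counter_section: str     # the section that must exist, even if empty

    def queries(self, subject: str) -> list:
        """Instantiate the mandatory adversarial searches for one subject."""
        return [t.format(subject=subject) for t in self.search_templates]

legal = AdversarialProtocol(
    domain="legal",
    search_templates=[
        "{subject} contract void against public policy California",
        "{subject} statutory limitations on liability California",
        "California Civil Code 1668 {subject}",
    ],
    counter_section="VOIDING STATUTES",
)

assert len(legal.queries("construction defect")) == 3
```

Swapping in a medical or compliance variant is then just a different set of templates and a different mandatory counter-section.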

Why This Matters for Production AI Systems

Most AI hallucination mitigation focuses on retrieval accuracy—making sure the AI fetches the right documents. The Kill Switch Protocol addresses a different problem: retrieval coverage—making sure the AI searches for the documents that disprove its hypothesis, not just the ones that confirm it.

This is analogous to the difference between precision and recall in information retrieval. High precision means the results you get are accurate. High recall means you didn't miss important results. Most RAG systems optimize for precision. The Kill Switch Protocol optimizes for adversarial recall.

In my testing across 200+ legal queries, the protocol discovered voiding statutes or fatal defects in approximately 23% of contract-related queries. These weren't obscure edge cases—they were mainstream doctrines like Cal. Civ. Code §1668 (voiding exculpatory clauses), Cal. Lab. Code §2802 (employer expense reimbursement), and the aforementioned Bus. & Prof. Code §16600 (non-compete ban).

In 89% of those cases, a standard semantic search would have missed the doctrine because the user query didn't contain the right keywords. An attorney asking "Is my NDA enforceable?" doesn't think to search for non-compete statutes—but the NDA might contain a non-compete clause buried in the "restricted activities" section.

The Kill Switch Protocol catches these because it doesn't rely on the user's query framing. It systematically searches for classes of voids.

The Broader Implication: Helpfulness Is Not Alignment

The AI safety community often frames alignment as "making AI do what humans want." But in high-stakes professional domains, what the human wants (confirmation, support for their position) is often misaligned with what they need (adversarial scrutiny, awareness of risks).

A doctor doesn't want a medical AI that agrees with their diagnosis. They want one that challenges it.

A lawyer doesn't want an AI that writes a confident brief. They want one that finds the case that torpedoes their argument before opposing counsel does.

A financial advisor doesn't want an AI that recommends high-return strategies. They want one that flags the regulatory traps.

This is a different kind of alignment problem. The AI must be helpful in the deeper sense—providing value—while being adversarial in the surface sense—disagreeing, finding flaws, raising objections.

The Kill Switch Protocol is one way to encode this. It's not a complete solution. But it's a step toward AI systems that are optimized for professional utility rather than user satisfaction.

And in domains where mistakes cost $50,000 in litigation or put patients at risk, that distinction matters.

Let’s Talk
I’m currently exploring staff-level AI/ML engineering roles (or senior++ IC track) where:

The problem domain is technically hard (not another CRUD chatbot)
The team values systematic thinking over move-fast-break-things
There’s a real path to production (actual users, actual stakes)
What I bring:

Obsessive attention to failure modes (hallucinations, rate limits, cold starts)
Comfort with ambiguous requirements (attorneys don’t speak in user stories)
Battle scars from deploying LLMs in high-stakes domains
If that’s interesting, let’s talk:

📧 Email: abrarmuhtasim400@gmail.com
💼 LinkedIn: [abrar muhtasim]

Or just drop a comment. I respond to everything.

P.S. — If you’re an attorney reading this and thinking “Wait, I need this,” shoot me a DM. The system is in limited beta and I’m onboarding firms selectively.

P.P.S. — If you’re an engineer building in the legal/compliance/healthcare space and dealing with hallucination hell, I’m happy to do a technical deep-dive call. Some of this stuff took me months to figure out; maybe I can save you some time.

Thanks for reading. If this was useful, the algorithm likes claps and shares. Your call. 👨‍⚖️🤖
