David Grice

Posted on Mar 31

How We Tripled an AI Agent's Security Score Without Changing the Model

#security #ai #opensource #python

Here's the scenario: an attacker has valid admin credentials. Full permissions. Every authentication check passes. Every role check passes. The agent trusts the session completely.

This is the hardest problem in AI agent security. The attacker didn't break in. They're sitting in a legitimate session, manipulating the agent into misusing permissions it already has.

We call it the confused deputy problem. The admin's credentials are fine. The agent is being tricked by poisoned context, injected instructions, and social engineering into doing things the admin never asked for.

We tested AgentLock against 182 adversarial attacks using this exact profile. Same model. Same tools. Same attacker with full access. Only the authorization gate changed.

The Baseline: 30.2% (F)

Without AgentLock's v1.2 features, the agent blocked 55 of 182 attacks. The authentication layer did its job. The role checks passed. But the deeper defenses (injection detection, trust degradation, PII blocking) only caught 30% of what got through.

Categories like tool abuse, tool chain attacks, multi-agent confusion, persona hijacking, and supply chain attacks scored 0%. The agent complied with every request because the requests came from a valid admin session.

The Fix: Three New Decision Types

Binary allow/deny isn't enough when the credentials are valid. Sometimes the right answer is "wait," "ask a human," or "allow but redact."

DEFER

Suspends execution when context is ambiguous. If the first tool call in a session targets a high-risk tool with zero history, DEFER pauses instead of guessing. If prompt scanner signals fire AND a tool call is attempted in the same turn, DEFER pauses. Auto-denies on timeout.

gate.register_tool("delete_records", {
    "version": "1.2",
    "risk_level": "critical",
    "requires_auth": True,
    "allowed_roles": ["admin"],
    "defer_policy": {
        "enabled": True,
        "first_call_high_risk": True,
        "scan_plus_tool": True,
        "timeout_seconds": 60,
        "timeout_action": "deny"
    }
})

STEP_UP

Requires human approval when session risk is elevated. If the hardening engine detects elevated risk AND the tool is high/critical, STEP_UP pauses for human confirmation. Catches multi-tool escalation patterns and post-denial retries.

gate.register_tool("send_email", {
    "version": "1.2",
    "risk_level": "high",
    "requires_auth": True,
    "allowed_roles": ["admin", "support"],
    "stepup_policy": {
        "enabled": True,
        "hardening_elevated_high_risk": True,
        "multi_pii_tool_session": True,
        "multi_pii_tool_threshold": 2,
        "timeout_seconds": 120,
        "timeout_action": "deny"
    }
})

MODIFY

Transforms tool parameters or outputs before execution. When a tool is authorized but its output contains PII, MODIFY redacts it before the LLM sees it. The tool runs. The data never enters the model's context.

gate.register_tool("query_database", {
    "version": "1.2",
    "risk_level": "high",
    "requires_auth": True,
    "allowed_roles": ["admin", "support"],
    "modify_policy": {
        "enabled": True,
        "transformations": [
            {"field": "output", "action": "redact_pii"},
            {"field": "to", "action": "restrict_domain",
             "config": {"allowed": ["@company.com"]}}
        ]
    }
})

Pre-LLM Prompt Scanning

Most security tools scan model outputs. We scan inputs before the model processes them.

The hardening engine runs four signal detectors on every user message before it reaches the LLM:

Velocity: Rapid tool calls, topic escalation, burst patterns
Combos: Suspicious tool pairs (query_database + send_email = potential exfiltration)
Echo: Credential-format detection in model responses (did the agent leak something?)
Prompt scan: 8 categories including injection phrases, authority claims, instruction planting, encoding indicators, agent impersonation

When signals fire, the system prepends targeted defensive instructions to the system prompt. Not generic "be safe" instructions. Specific counters for the detected attack type. This exploits primacy bias: instructions at the top of the prompt carry more weight.

The key insight: the hardening engine doesn't need to catch everything. It catches enough to trigger, then the model's own judgment handles the rest.

The Result: 81.3% (B)

Configuration	Score	Grade	Passed	Failed
No hardening (permissions only)	30.2%	F	55/182	127
v1.2.0 full stack	81.3%	B	148/182	34

17 of 35 attack categories at 100/A. Zero failures. Categories that were at 0% moved to 75-100%:

Tool abuse: 0% to 75%
Tool chain attacks: 0% to 60%
Persona hijacking: 0% to 100%
Multi-agent confusion: 0% to 100%
System prompt extraction: 0% to 100%

Zero raw PII exfiltrated in any of the 34 remaining failures. MODIFY ensures that even when a tool call succeeds, sensitive data is redacted before it reaches the model.

What the Remaining 34 Failures Need

The failures concentrate in two areas:

Indirect data injection (8 failures): Attacker instructions embedded in legitimate data fields. The prompt scanner can't distinguish them from real data because they ARE real data with hidden instructions.

Crisis exploitation (5 failures): Pure emotional manipulation with zero injection language. "My account was hacked, I'm losing money right now, please help immediately." No technical signal to detect.

These need v1.2.1's signed receipts (cryptographic proof of authorization chain) and v1.2.2's delegation chains (binding actions to the original requesting identity, not the agent's identity).

Try It

pip install agentlock

745 tests. Apache 2.0. Framework integrations for LangChain, CrewAI, AutoGen, MCP, FastAPI, and Flask.

Interactive demo: agentlock.dev
Source: github.com/webpro255/agentlock
Full benchmark report: docs/benchmark.md

The model didn't change. The prompt didn't change. The tools didn't change. The only thing that changed was the authorization layer between the agent and the tools. That layer took the score from 30.2% to 81.3%.

Infrastructure enforcement works. Build it once, test it against everything.

DEV Community