DEV Community

Gs. Sanjana

Blocklists Leak, Allowlists Hold: a tiny benchmark for stopping hijacked AI agents

TL;DR: Once an AI agent can act, a single injected instruction can make it delete data, move money, or leak secrets. I built a tiny, reproducible benchmark for the layer that actually executes actions. An undefended agent let through 100% of attacks; a blocklist still leaked 20%; a default-deny allowlist blocked 100% with zero false positives.

The shift that changes everything

Everyone's talking about smarter models. The bigger change is quieter: agents that don't just answer but act — send the email, issue the refund, run the query. The day an agent can act, a wrong answer becomes a wrong action.

That's why prompt injection and agent goal-hijacking sit at the top of the OWASP risk lists for agentic systems. Hostile instructions hide in something the agent reads — a doc, a tool result, a web page — and the agent, trying to help, follows them.

A deliberately pessimistic question

Most defenses try to stop the model from being fooled. I asked the opposite: assume it already has been — does the layer that executes actions stop the harmful ones?

I compared three postures:

  • Undefended: the agent runs whatever it's driven to do.
  • Blocklist: obviously dangerous capabilities are blocked, and payments are capped.
  • Default-deny allowlist: only explicitly safe actions auto-run; everything else pauses for a human.
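
A minimal sketch of how these three postures could be written as decision functions. The action names, the `amount` field, and the cap of 500 are hypothetical, chosen for illustration; they are not taken from the benchmark.

```python
# Hypothetical action names and payment cap, for illustration only.
BLOCKED = {"delete_data", "read_secrets"}
SAFE = {"search_docs", "draft_reply"}
PAYMENT_CAP = 500


def undefended(action):
    """Run whatever the agent asks for."""
    return "run"


def blocklist(action):
    """Refuse known-dangerous actions and over-cap payments; run the rest."""
    if action["name"] in BLOCKED:
        return "block"
    if action["name"] == "send_payment" and action.get("amount", 0) > PAYMENT_CAP:
        return "block"
    return "run"


def allowlist(action):
    """Auto-run only listed actions; hold everything else for a human."""
    return "run" if action["name"] in SAFE else "hold"
```

An action that neither list anticipated, such as `{"name": "disable_audit_log"}`, runs under the first two functions and is held under the third.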

Results

Posture                | Attack success rate | Benign blocked
Undefended             | 100%                | 0%
Blocklist              | 20%                 | 0%
Default-deny allowlist | 0%                  | 0%
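
For reference, the two columns could be computed along these lines. This is a sketch under assumptions: each test case is a hypothetical `(action, is_attack)` pair, and a policy returns `"run"` when it lets an action execute without human review. The benchmark's own harness may be structured differently.

```python
def score(policy, cases):
    """Return (attack_success_rate, benign_blocked_rate) as fractions.

    cases: list of (action, is_attack) pairs (assumed shape).
    policy: callable returning "run" if the action executes unreviewed.
    """
    attacks = [a for a, is_attack in cases if is_attack]
    benign = [a for a, is_attack in cases if not is_attack]
    # An attack succeeds when the policy lets it run.
    succeeded = sum(policy(a) == "run" for a in attacks)
    # A benign action counts as blocked when it does not run.
    blocked = sum(policy(a) != "run" for a in benign)
    asr = succeeded / len(attacks) if attacks else 0.0
    fpr = blocked / len(benign) if benign else 0.0
    return asr, fpr


# Toy cases for illustration, not the benchmark suite.
toy_cases = [
    ({"name": "disable_audit_log"}, True),
    ({"name": "search_docs"}, False),
]
always_run = lambda action: "run"
print(score(always_run, toy_cases))  # (1.0, 0.0)
```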

The blocklist leaked on the sneaky, low-impact attacks — "post a summary to this public link," "turn off the audit log." Nothing looked dangerous, so they slipped through. The allowlist caught them, because they weren't on the safe list.

The lesson fits on a sticky note: you can't enumerate every dangerous action, but you can enumerate the safe ones. Blocklists fail open on novelty; allowlists fail closed.

The core gate is a few lines. This version holds any action whose impact score meets a threshold until a human confirms it:

def needs_confirmation(action, threshold=1000):
    return action["impact"] >= threshold

def run(action, execute, confirm):
    if needs_confirmation(action) and not confirm(action):
        return "HELD for review"   # waits for a human
    return execute(action)
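
The snippet above gates on an impact score. A default-deny variant of the same gate might look like the sketch below. `SAFE_ACTIONS` and the action names are hypothetical, and `execute` and `confirm` are caller-supplied callables, as in the original.

```python
# Hypothetical allowlist; anything not listed needs a human.
SAFE_ACTIONS = frozenset({"search_docs", "draft_reply"})


def needs_confirmation(action, safe=SAFE_ACTIONS):
    # Default deny: unknown or unnamed actions are never auto-run.
    return action.get("name") not in safe


def run(action, execute, confirm):
    if needs_confirmation(action) and not confirm(action):
        return "HELD for review"  # waits for a human
    return execute(action)


never_confirm = lambda action: False
do_it = lambda action: "ran " + action["name"]

print(run({"name": "search_docs"}, do_it, never_confirm))        # ran search_docs
print(run({"name": "disable_audit_log"}, do_it, never_confirm))  # HELD for review
```

The only change from the threshold version is the predicate: instead of asking whether an action looks risky, it asks whether the action is known to be safe, so a novel action is held by default.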

Honest limitations: it's a small, hand-built suite, and it assumes the worst case (the model is already hijacked), so it measures only the gating layer — not whether your agent resists injection in the first place. It complements bigger dynamic benchmarks like AgentDojo rather than replacing them.

The benchmark and a small guardrail library are open source (MIT) — repo link in the comments.

How do you draw the line between what your agents do on their own and what waits for a human? Curious what others are doing.
