I published a research paper this week. The number that surprised me most was not the one I expected.
I expected the 0%: under a restrictive pre-action authorization policy, a population of 879 adversarial attempts achieved zero successful unauthorized actions. That part worked as designed.
The number that stopped me was 74.6%.
That's how often social engineering succeeded against the model alone, with no authorization layer, across a live adversarial testbed with a $5,000 bounty for anyone who could make the agent do something it shouldn't. Roughly three successes out of every four attempts. In a controlled environment, with a known model, with real people trying.
TL;DR
- We published arXiv:2603.20953 this week: the first adversarial benchmark for AI agent pre-action authorization
- Social engineering against a model-only policy succeeded 74.6% of the time across 1,151 sessions
- Under a restrictive OAP policy: 0% success across 879 attempts, with a median enforcement time of 53 ms
- The gap is not an alignment problem. It's an authorization problem. They require different solutions.
- The Open Agent Passport (OAP) spec is Apache 2.0 and free to use today
Why we ran the testbed
The claim I kept making, the claim at the heart of APort, was this: AI agents don't need better models to be more secure. They need an authorization layer that sits between the agent and the action, one that enforces policy deterministically, regardless of what the model decides.
That's a testable claim. So I tested it.
We ran the APort Vault CTF at vault.aport.io for several months. Real attackers, real agents, real actions, real money on the table. Four thousand four hundred and thirty-seven authorization decisions across 1,151 sessions. The full dataset and methodology are in the paper (arXiv:2603.20953).
Here is what we found.
The model alone is not enough
Think about how a bank operated before digital authorization systems. A teller could be charming. A manager could vouch for a customer. But no individual judgment call could override the authorization system: the account limit, the signature requirement, the daily cap. The policy was enforced by infrastructure, not by goodwill.
Today's AI agents are tellers with no infrastructure behind them.
When the only thing standing between an attacker and an unauthorized action is the model's trained judgment, that judgment can be reframed. Not hacked. Reframed. The model follows a social engineering prompt that makes the action seem authorized, or contextually appropriate, or merely helpful. Seventy-four point six percent of the time, it worked.
This is not a knock on any specific model. It's a structural problem. A model trained to be helpful will, under the right framing, help with things it shouldn't. That's not a training failure. That's physics.
What pre-action authorization actually does
The Open Agent Passport (OAP) intercepts every tool call synchronously, before execution. It evaluates the call against a declarative policy, then issues a cryptographically signed decision: allow or deny.
That's it. No magic. No second model. No probabilistic guessing.
The policy looks like this:
```json
{
  "agent_id": "prod-assistant-01",
  "capabilities": ["read:files", "send:email"],
  "restrictions": [
    { "tool": "exec", "deny": true },
    { "tool": "send:email", "max_per_hour": 10 },
    { "tool": "file:write", "path_allowlist": ["/tmp/*"] }
  ]
}
```
When the agent tries to call exec because an attacker reframed a "help me debug this script" prompt, OAP denies it. Not because the model recognized the attack. Because the policy says no.
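To make the evaluation step concrete, here is a minimal sketch of how a pre-action check over a policy like the one above could work. The rule names (`deny`, `max_per_hour`, `path_allowlist`) mirror the example policy; the `evaluate` function and its signature are illustrative assumptions, not the OAP reference API.

```python
import fnmatch

# Illustrative policy, mirroring the example above.
POLICY = {
    "agent_id": "prod-assistant-01",
    "capabilities": ["read:files", "send:email"],
    "restrictions": [
        {"tool": "exec", "deny": True},
        {"tool": "send:email", "max_per_hour": 10},
        {"tool": "file:write", "path_allowlist": ["/tmp/*"]},
    ],
}

def evaluate(policy, tool, path=None, calls_this_hour=0):
    """Return 'allow' or 'deny' for a tool call, checked before execution."""
    for rule in policy["restrictions"]:
        if rule["tool"] != tool:
            continue
        if rule.get("deny"):
            return "deny"  # hard deny, regardless of how the request was framed
        if "max_per_hour" in rule and calls_this_hour >= rule["max_per_hour"]:
            return "deny"  # rate cap exceeded for this hour
        if "path_allowlist" in rule:
            allowed = path is not None and any(
                fnmatch.fnmatch(path, pat) for pat in rule["path_allowlist"]
            )
            if not allowed:
                return "deny"  # target path is outside the allowlist
    return "allow"
```

Under this sketch, `evaluate(POLICY, "exec")` denies no matter how the prompt was worded, and a write to `~/.ssh/authorized_keys` is denied because it falls outside the `/tmp/*` allowlist.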
The enforcement overhead: median 53 ms across 1,000 measured decisions. Not zero, but well within an acceptable range for any production system.
Under this policy, the comparable attacker population achieved a 0% success rate across 879 attempts. The policy held because it doesn't negotiate.
What the numbers actually look like
To make the comparison concrete:
| Condition | Sessions | Attempts | Success Rate |
|---|---|---|---|
| Model only (permissive policy) | 1,151 | ~1,150 | 74.6% |
| OAP restrictive policy | subset | 879 | 0% |
| Enforcement overhead | N/A | 1,000 measured | 53 ms median |
The same agent. The same attack patterns. The same real people with a financial incentive to break it. The only variable was whether a declarative policy was enforced before execution.
The three things that fail without this
The paper characterizes three structural failure modes. All three appeared in the testbed data.
Social engineering (74.6% baseline success rate): Attackers reframe legitimate-looking requests to get the agent to call tools it shouldn't. "Help me clean up these old SSH keys" becomes the agent writing to ~/.ssh/authorized_keys. The model sees a helpful request. The policy sees an unauthorized write.
Capability scope drift: Agents accumulate tool permissions over time, or inherit them from orchestrators without narrowing. A sub-agent spawned to "summarize documents" ends up with shell access because the parent passed down full permissions. We've written about this separately in I Logged 4,519 AI Agent Tool Calls. The testbed confirmed it: capability scope drift was present in every multi-agent session without explicit delegation controls.
Audit gap: Without a signed authorization record, post-hoc analysis of what happened and why is guesswork. Forty-two percent of the incidents in the testbed would have been invisible to standard logging. OAP's cryptographically signed receipt closes that gap at the decision level, not the action level.
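To show why a signed receipt closes the audit gap, here is a minimal sketch of a tamper-evident decision record, assuming an HMAC secret shared with the verifier. The field names and signing scheme are illustrative assumptions; the actual OAP receipt format is defined by the spec, not by this code.

```python
import hashlib
import hmac
import json
import time

SECRET = b"demo-signing-key"  # placeholder; use real key management in practice

def sign_receipt(agent_id, tool, decision, ts=None):
    """Produce a tamper-evident record of one authorization decision."""
    record = {
        "agent_id": agent_id,
        "tool": tool,
        "decision": decision,
        "ts": ts if ts is not None else int(time.time()),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["sig"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return record

def verify_receipt(record):
    """Recompute the signature; any edit to the record breaks verification."""
    body = {k: v for k, v in record.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["sig"])
```

The point of signing at the decision level is that a post-hoc auditor can prove what was authorized and when, even if the action's own logs were incomplete or altered.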
What this is NOT
I want to be precise here, because the paper is precise.
OAP is not a replacement for model alignment. You still want your model to be well-behaved by default. A good policy and a well-aligned model are better than either alone.
OAP is not a sandbox. Sandboxing contains the blast radius of something that already happened. Pre-action authorization prevents the thing from happening. These are complementary, not competing.
OAP is not a content filter. It doesn't read what the model says. It intercepts what the model tries to do. The distinction matters: a content filter that sees "please execute this script" can be bypassed by rephrasing. A policy that says exec is denied cannot.
The paper frames this clearly: alignment is probabilistic, training-time, and behavior-based. Authorization is deterministic, runtime, and action-based. Both are necessary. Neither substitutes for the other.
The bigger picture
I've spent years working on identity infrastructure, first in fintech, then in digital identity systems, now in AI. The pattern repeats.
In cross-border payments, the question was: how do you move money between parties who have no prior relationship, no shared ledger, no reason to trust each other? The answer was not to make banks more trustworthy. It was to build interoperable infrastructure that made trustworthiness verifiable. That's what Chimoney does for global payouts.
In AI agents, the question is the same: how do you run actions on behalf of users with real-world consequences, at scale, across systems that have no shared enforcement mechanism? The answer is not to make models more aligned. It's to build authorization infrastructure that makes authorization verifiable.
That's what OAP is. Not a guardrail as afterthought. Authorization as infrastructure.
The paper is called "Before the Tool Call" because that's exactly where the decision needs to live: before. Not after. Not probabilistically. Not by hoping the model gets it right. Before.
What I'd tell builders today
If you're running AI agents in production right now, three things:
Audit your tool permissions today. List every tool your agent can call. Then ask: does it actually need this? In my experience, the answer is "no" for at least a third of them. Narrowing scope is the cheapest guardrail available.
Add a `before_tool_call` hook. Every major framework has one: OpenClaw, LangChain, AutoGen. If you have nothing else, intercept calls before they execute and log them. You'll learn things.

Try OAP. The spec is Apache 2.0, the reference implementation is `npx @aporthq/aport-agent-guardrails`, and the 53 ms overhead is real. The CTF is still running at vault.aport.io if you want to test your own policy against the adversarial dataset.
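A `before_tool_call` hook can be sketched framework-agnostically. Hook names and signatures vary by framework, so the decorator-free shape below is an assumption for illustration, not any framework's actual API: intercept, log, and optionally deny before the tool runs.

```python
import time

DENIED_TOOLS = {"exec"}  # illustrative: the tools your policy hard-denies
AUDIT_LOG = []

def before_tool_call(tool_name, arguments):
    """Runs before every tool call; returns True to allow, False to deny."""
    entry = {
        "ts": int(time.time()),
        "tool": tool_name,
        "args": arguments,
        "decision": "deny" if tool_name in DENIED_TOOLS else "allow",
    }
    AUDIT_LOG.append(entry)  # even a plain append-only log teaches you a lot
    return entry["decision"] == "allow"

def guarded_call(tool_name, fn, **arguments):
    """Wrap any tool function so the hook always runs first."""
    if not before_tool_call(tool_name, arguments):
        raise PermissionError(f"policy denied tool: {tool_name}")
    return fn(**arguments)
```

Wiring every tool through something like `guarded_call` gives you both pieces at once: a deny decision made before execution, and a log entry for every attempt, allowed or not.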
The full paper is at arXiv:2603.20953. Peer feedback welcome.
Over to you
Have you ever watched an AI agent do something it was never supposed to do and realized your policy was the problem, not the model? I'll start: during an early CTF session, one of our test agents exfiltrated a test token during a "help me debug this connection" prompt. The model thought it was helping. The policy should have caught it. It didn't, because the policy didn't exist yet.
What's your story? And if you've added authorization controls to your agent stack, what's the first rule you wrote?