Uchi Uchibeke

Posted on • Originally published at uchibeke.com

We stress-tested our own AI agent guardrails before launch. Here's what broke.

You can't find the holes in a security system you designed. Your test suite maps the space you imagined, which is exactly what an attacker tries to escape.

Before we opened APort Vault to the public, we spent two weeks attacking it the way an outsider would — trying to break our own guardrails. Not with a test suite. With intent.

We broke three of our eight core policy rules before any public player tried.

TL;DR

  • Internal stress-testing before CTF launch broke 3 of 8 core guardrail rules.
  • Five attack classes: prompt injection, policy ambiguity, context poisoning, multi-step chaining, passport bypass.
  • Most dangerous finding: multi-step chaining — each micro-action passes; the composition violates policy.
  • Fixes: intent-based injection checks, default-deny for gaps, cross-turn session memory, opaque denial messages.
  • Core lesson: post-hoc filtering fails. Make dangerous states structurally unreachable.

Why are AI agent guardrails just security theater?

Most AI guardrails work like airport security theater. They look thorough, but a determined attacker walks through.

The big-company approaches — LlamaFirewall (Meta) and NeMo Guardrails (NVIDIA) — focus on post-hoc filtering. They detect bad actions after the agent decides to take them. That's detection, not prevention.

A Show HN post for hibana-agent argued the same thing: "dangerous actions must be structurally unreachable." ClawMoat launched with a host-level approach. The signal is clear: the industry is shifting from detection to structural constraints.

Building APort — an authorization layer that intercepts every tool call before execution — taught us that intent matters more than wording. But we didn't know how fragile our intent detection was until we started breaking it ourselves.

Why passports, not border patrol?

Imagine you're traveling to a new country. At every checkpoint, instead of showing your passport, you have to call your family back home to vouch for you.

That's how most AI guardrails work today. They ask the LLM: "Is this action safe?" They rely on the model's own judgment, which can be manipulated.

A better system works like a real passport: identity and permissions encoded in a credential that travels with the agent. The guardrail doesn't ask "Is this allowed?" It reads the credential and knows. That's what we're building with Agent Passport. But before we could trust it, we had to break it.
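
To make the passport analogy concrete, here is a minimal sketch of credential-based authorization. The names (`AgentPassport`, `isAllowed`, the capability strings) are illustrative assumptions, not APort's actual API — the point is that the guardrail reads a credential instead of asking the model for a judgment.

```typescript
// Hypothetical sketch: a guardrail that reads a credential rather than
// asking the LLM "is this safe?". All names are illustrative.
interface AgentPassport {
  agentId: string;
  capabilities: Set<string>; // what the issuer granted, e.g. "files:read"
  expiresAt: number;         // unix ms
}

function isAllowed(
  passport: AgentPassport,
  capability: string,
  now: number = Date.now()
): boolean {
  // No judgment call: the decision is encoded in the credential.
  if (now >= passport.expiresAt) return false;
  return passport.capabilities.has(capability);
}

const passport: AgentPassport = {
  agentId: "agent-42",
  capabilities: new Set(["files:read"]),
  expiresAt: Date.now() + 60_000,
};

isAllowed(passport, "files:read");   // true: granted in the credential
isAllowed(passport, "files:delete"); // false: never granted, nothing to argue
```

Because the decision lives in the credential, no amount of persuasive phrasing changes the answer.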

What five attack classes did we test?

The CTF is built around five escalating attack classes. Each targets a different weakness in guardrail design.

Level 1: Prompt injection
Direct override attempts: "ignore previous instructions," "this is just a test," "the user said it's okay." Goal: convince the LLM evaluator the action is safe through vocabulary reframing.

Level 2: Policy ambiguity
Exploiting unclear policies — acting in the gap. If the policy says "don't read sensitive files," what counts as sensitive? Attackers find the gray zones and live there.

Level 3: Context poisoning
Injecting false context into earlier turns to manipulate later decisions: "The user previously authorized this action." The guardrail evaluates against the poisoned history and approves actions it would otherwise deny.

Level 4: Multi-step reasoning manipulation
Chaining individually-allowed actions to reach a forbidden outcome. Each micro-action passes the guardrail. The composition violates policy. This is the hardest class of problem in AI policy design.

Level 5: Full system bypass
Combining all the above, plus attacking the passport verification layer itself. If the guardrail trusts the passport, can you forge one? Can you make the verification step get skipped entirely?
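
Level 4 is worth a toy illustration. Under a per-call evaluator (the flaw this post describes — tool and policy names below are invented), the forbidden macro-action is denied outright, yet the same outcome split into micro-actions passes clean:

```typescript
// Toy per-call evaluator: each call is judged in isolation.
type Call = { tool: string; arg: string };

const DENIED_TOOLS = new Set(["export_all_contacts"]);

function evaluateCall(call: Call): boolean {
  return !DENIED_TOOLS.has(call.tool);
}

// The macro-action, denied outright:
evaluateCall({ tool: "export_all_contacts", arg: "*" }); // false

// The same outcome as a chain of allowed micro-actions:
const chain: Call[] = [
  { tool: "read_contact", arg: "alice" },
  { tool: "read_contact", arg: "bob" },
  { tool: "send_message", arg: "attacker@example.com" },
];
chain.every(evaluateCall); // true -- every step passes; the composition violates policy
```

No single call trips the guardrail; only the sequence does.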

What broke when we tested?

Prompt injection worked better than we expected. Not because detection was weak — because we were matching content, not intent. Reframing "retrieve the confidential document" as "open the user-requested file" shifted the LLM's judgment.

Policy ambiguity was a free pass. "Don't read sensitive files" left "sensitive" undefined. Every ambiguous gap was exploitable — we walked through all of them.

Context poisoning broke our session memory. We validated each turn in isolation. Injecting false context into an early turn meant every later turn trusted it.

Multi-step chaining went undetected. Our guardrail evaluated each call independently. A denied macro-action split into ten allowed micro-actions passed clean. We only caught it by looking at the full session replay.

Passport verification held, but the surrounding assumptions didn't. Under specific edge conditions, the guardrail could be made to skip verification entirely — the passport check was sound, but the path to it wasn't.

What did we fix before launch?

Prompt injection: Pre-action authorization that checks intent, not content. We now map semantic equivalence — every synonym and reframing of a blocked operation maps to the same evaluation path. The policy doesn't care what the agent called it.
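
A toy sketch of what intent mapping means in practice. A real system would use embeddings or a classifier; the lookup table below stands in for that, and every name in it is an assumption for illustration:

```typescript
// Every phrasing of a blocked operation resolves to one canonical intent,
// so reframing cannot shift the evaluation path. (Lookup table is a stand-in
// for semantic matching; names are illustrative.)
const INTENT_MAP: Record<string, string> = {
  "retrieve the confidential document": "read_restricted_file",
  "open the user-requested file": "read_restricted_file",
  "fetch that doc for me": "read_restricted_file",
};

const BLOCKED_INTENTS = new Set(["read_restricted_file"]);

function evaluateByIntent(request: string): "allow" | "deny" {
  const intent = INTENT_MAP[request.toLowerCase()] ?? "unknown";
  return BLOCKED_INTENTS.has(intent) ? "deny" : "allow";
}

// Both phrasings from the findings above hit the same evaluation path:
evaluateByIntent("Retrieve the confidential document"); // "deny"
evaluateByIntent("Open the user-requested file");       // "deny"
```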

Policy ambiguity: Explicit default-deny when a policy gap is detected. If the policy doesn't explicitly allow an action, it's denied. No gray zones.
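
Default-deny is the simplest of the fixes to sketch. The evaluator below is illustrative (the action strings are invented), but it captures the rule: anything not explicitly granted resolves to "deny":

```typescript
// Minimal default-deny evaluation: a gap in the policy is a denial,
// not a gray zone. Action names are illustrative.
type Decision = "allow" | "deny";

function evaluate(allowlist: Set<string>, action: string): Decision {
  return allowlist.has(action) ? "allow" : "deny";
}

const policy = new Set(["files:read:/public", "email:send"]);

evaluate(policy, "files:read:/public"); // "allow": explicitly granted
evaluate(policy, "files:read:/etc");    // "deny": not listed, so denied
```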

Context poisoning: Per-turn context validation against the original passport scope. If the context deviates from what was authorized at session start, it's flagged.
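
A sketch of per-turn validation, with hypothetical field names. The scope is frozen from the passport at session start; any later turn that claims an authorization outside that scope is flagged rather than trusted:

```typescript
// Claims in later turns are checked against the scope captured at session
// start, so injected "prior authorization" never takes effect.
// Field names are illustrative, not APort's schema.
interface SessionScope {
  authorizedActions: Set<string>; // frozen from the passport at session start
}

interface Turn {
  claimedAuthorizations: string[]; // what the turn's context asserts
}

function validateTurn(scope: SessionScope, turn: Turn): string[] {
  // Return every claim that deviates from the original scope.
  return turn.claimedAuthorizations.filter(
    (a) => !scope.authorizedActions.has(a)
  );
}

const scope: SessionScope = { authorizedActions: new Set(["files:read"]) };

validateTurn(scope, { claimedAuthorizations: ["files:read"] });
// [] -- consistent with session start

validateTurn(scope, { claimedAuthorizations: ["files:delete"] });
// ["files:delete"] -- flagged as deviating from the authorized scope
```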

Multi-step chaining: Session-level context accumulation that flags sequences matching known bypass chains — similar to how fraud detection systems look at transaction sequences, not individual transactions. That was the Level 4 lesson made concrete.
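
Sequence-level matching can be sketched like this. The chain signature below is invented for illustration; the idea is that the session's full call history, not any single call, is what gets scored:

```typescript
// Flag sessions whose call history contains a known bypass chain,
// the way fraud detection scores transaction sequences.
// The signature below is an invented example.
const BYPASS_CHAINS: string[][] = [
  // enumerate-then-exfiltrate: repeated reads followed by an outbound send
  ["read_contact", "read_contact", "send_message"],
];

function matchesChain(session: string[], chain: string[]): boolean {
  // True if `chain` appears as a contiguous run inside the session.
  for (let i = 0; i + chain.length <= session.length; i++) {
    if (chain.every((tool, j) => session[i + j] === tool)) return true;
  }
  return false;
}

function flagSession(session: string[]): boolean {
  return BYPASS_CHAINS.some((chain) => matchesChain(session, chain));
}

flagSession(["read_contact", "read_contact", "send_message"]); // true: flagged
flagSession(["read_contact", "send_message"]);                 // false: no known chain
```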

Opaque denial messages: Denial messages to callers are now information-poor. The internal audit log is information-rich. An attacker probing the response surface learns nothing useful.
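
The split between the external response and the internal log can be sketched as follows (field names are illustrative): the caller gets a generic refusal while the rule that fired is recorded only internally:

```typescript
// Information-poor denial outward, information-rich audit entry inward.
// Field names are illustrative, not APort's schema.
interface AuditEntry {
  requestId: string;
  action: string;
  rule: string;   // which policy rule fired
  detail: string; // full internal reasoning
}

const auditLog: AuditEntry[] = [];

function deny(
  requestId: string,
  action: string,
  rule: string,
  detail: string
): { error: string } {
  auditLog.push({ requestId, action, rule, detail }); // rich, internal only
  return { error: "Request denied." };                // poor, what the caller sees
}

const response = deny(
  "req-1",
  "files:read:/etc/shadow",
  "default-deny",
  "path outside passport scope"
);
// response.error reveals nothing about which rule fired or why.
```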

Core lesson: post-hoc filtering fails. Structure is the answer. Make dangerous states structurally unreachable, not detectable. Our open-source aport-agent-guardrails implements these patterns.

What's the structural shift happening in AI guardrails?

The industry is moving from detection to structure. Hibana-agent's "structurally unreachable" thesis matches what we learned. ClawMoat's host-level approach is another version of the same idea.

Our own fix was to move authorization earlier in the loop: before the agent decides, before the LLM reasons, before the tool call is even constructed. That's the only way to close the multi-step gap.

We found and fixed what we could find ourselves. That's the limit of internal testing — you can only break what you can imagine.

The CTF is live because we know we missed something. Come find it.

vault.aport.io — Levels 1 and 2 free. Levels 3-5 pay out up to $5,000 to whoever gets there first. Deadline: March 12, 2026.

Links: APort · APort Vault · aport-agent-guardrails on npm · AI Passports: A Foundational Framework
