Since October I've been building APort: an authorization layer that intercepts every tool call an AI agent makes before it executes. The problem I kept running into was that internal tests always passed: my test suite mapped the space I imagined, which is exactly the space an adversarial input tries to escape.
So I built APort Vault: a public CTF where developers try to bypass the guardrails. Five levels, $6,500 in prizes via Chimoney. It's been live for about a week.
What the challenge is:
APort evaluates every AI agent tool call against a versioned policy before execution and returns allow or deny in ~40ms. The CTF puts you on the other side of that decision. You're not looking for SQL injection or memory leaks. You're looking for the places where framing, sequencing, or injected context shifts a DENY into an ALLOW.
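To make the decision concrete, here is a minimal sketch of what "evaluate a tool call against a versioned policy, return allow or deny" could look like. This is illustrative only; the policy fields, class names, and function signatures are my own assumptions, not APort's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)

@dataclass
class Decision:
    allow: bool
    reason: str

# Hypothetical versioned policy; APort's real schema is not public here.
POLICY = {
    "version": "v1",
    "allowed_tools": {"search", "read_file"},
    "max_amount": 100,
}

def evaluate(call: ToolCall, policy: dict = POLICY) -> Decision:
    """Check one tool call against the policy before it executes."""
    if call.tool not in policy["allowed_tools"]:
        return Decision(False, f"tool '{call.tool}' not in allowlist")
    if call.args.get("amount", 0) > policy["max_amount"]:
        return Decision(False, "amount exceeds policy limit")
    return Decision(True, "policy checks passed")

print(evaluate(ToolCall("read_file", {"path": "/tmp/x"})))  # allowed
print(evaluate(ToolCall("delete_repo")))                    # denied
```

The CTF targets the reasoning inside a gate like `evaluate`: not the code itself, but the framing and context that decide which branch fires.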
The five levels:
- Level 1 — Prompt injection basics: vocabulary reframing (no prize, no sign-up)
- Level 2 — Policy ambiguity: find an edge case the policy author didn't anticipate (no prize, no sign-up)
- Level 3 — Context poisoning: manipulate the context window to shift how the policy evaluates ($500)
- Level 4 — Multi-step reasoning: chain individually approved micro-actions into a denied macro-outcome ($1,000)
- Level 5 — Full system bypass: find a systemic weakness in the evaluation architecture ($5,000)
Levels 1 and 2 require no sign-up. Levels 3-5 require GitHub login so we can verify and pay winners.
What we found before launch:
Before opening it publicly, we spent two weeks breaking it ourselves and broke three of eight core policy rules. The most important finding: our guardrail evaluated each call independently, so a denied macro-action split across ten micro-actions passed clean. We only caught it by looking at the full session replay.
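The splitting bug is easy to reproduce in miniature. The sketch below is illustrative (the transfer limit and function names are hypothetical, not APort's code): a per-call check approves every micro-action, while a session-level check over the replay catches the aggregate.

```python
TRANSFER_LIMIT = 1000  # hypothetical policy: deny any transfer over 1000

def check_call(amount: int) -> bool:
    """Per-call evaluation: each call looks fine in isolation."""
    return amount <= TRANSFER_LIMIT

def check_session(amounts: list[int]) -> bool:
    """Session-level evaluation: also sum across the full replay."""
    return all(check_call(a) for a in amounts) and sum(amounts) <= TRANSFER_LIMIT

micro = [900] * 10  # a denied 9000 transfer split into ten small calls
assert all(check_call(a) for a in micro)  # every call passes on its own
assert not check_session(micro)           # the session replay catches it
```

The fix we shipped follows the same shape: evaluation now sees session history, not just the call in front of it.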
We fixed what we found. Then opened it.
What's still unsolved:
Level 4 has been completed by a small number of players so far. Level 5 has not been cracked. I genuinely don't know if it will be during this run.
What's different about this vs other AI security work:
Most AI guardrail approaches filter output after the model decides. We intercept before execution. The attack surface here is the policy evaluator's reasoning, not just the LLM's training. That's a different problem, and most tooling doesn't address it.
LlamaFirewall (Meta) and NeMo Guardrails (NVIDIA) are both post-hoc filters. They detect bad actions after the agent decides. The CTF specifically targets the gap between intent and evaluation, which post-hoc filtering doesn't close.
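The structural difference can be sketched in a few lines. This is a toy contrast, not the real LlamaFirewall, NeMo Guardrails, or APort APIs: a post-hoc filter runs after the side effect and can only redact the result, while an interceptor gates the action before anything happens.

```python
effects = []  # records irreversible side effects

def execute(action: str) -> str:
    effects.append(action)  # the side effect happens here
    return f"done: {action}"

def denied(action: str) -> bool:
    return action.startswith("delete")

def post_hoc_filter(action: str) -> str:
    out = execute(action)   # runs first; the filter only sees the output
    return "[redacted]" if denied(action) else out

def intercept(action: str) -> str:
    if denied(action):      # evaluated before any side effect
        return "DENY"
    return execute(action)

post_hoc_filter("delete repo")
print(effects)              # ['delete repo'] -- hidden, but it happened

effects.clear()
intercept("delete repo")
print(effects)              # [] -- blocked outright
```

The CTF's interesting levels live inside `denied`: can you frame, sequence, or poison context so the gate itself reasons its way to the wrong branch?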
Try it:
vault.aport.io — no sign-up for levels 1 and 2. Competition closes March 12, 2026 at 11:59 PM ET.
Happy to answer questions about the architecture, the policy design, or what we've seen from submissions so far.
Related: AI Passports: A Foundational Framework · aport-agent-guardrails on GitHub