We built a test corpus for AI agent egress security tools

#security #ai #mcp #opensource

Most AI security benchmarks test whether the model behaves correctly. AgentDojo tests whether the LLM resists prompt injection. InjecAgent measures injection success rates. AgentHarm checks if the model refuses harmful tasks.

These are useful. But they all assume the LLM is the last line of defense.

It isn't. Or it shouldn't be. Models fail. They get tricked by prompt injection, they follow tool-poisoned instructions, they leak secrets when asked nicely enough. That's why security tools exist between the agent and the network: proxies, firewalls, MCP wrappers that inspect traffic before it leaves.

But there was no standard way to test those tools. Every vendor tested against their own internal cases. No shared corpus, no common scoring, no way to compare coverage across categories.

So we built one.

What's in the corpus

agent-egress-bench is 72 test cases across 8 categories:

Category	Cases	What it tests
URL DLP	14	Secrets in query strings, encoded paths, high-entropy subdomains, SSRF
Request body DLP	10	Secrets in POST bodies (JSON, YAML, CSV, multipart, base64, hex)
Header DLP	9	API keys in HTTP headers (Bearer, JWT, AWS, multi-header)
Response injection (fetch)	8	Prompt injection in fetched web content
Response injection (MITM)	7	Injection via tampered TLS-intercepted responses
MCP input scanning	9	DLP and injection in MCP tool arguments
MCP tool poisoning	7	Poisoned tool descriptions, schema injection, rug-pull changes
MCP chain detection	8	Multi-step exfiltration sequences (read-then-send, env-to-network)

56 malicious cases that should be blocked. 16 benign cases that should be allowed (to test false positive rates). Each case is a self-contained JSON file with the attack payload, expected verdict, severity, capability tags, and a machine-readable reason for the expected outcome.

How it works

A "runner" connects a security tool to the corpus. The runner feeds each case to the tool, observes whether it blocked or allowed the traffic, and emits JSONL output. Cases the tool can't handle (wrong transport, missing capability) score not_applicable instead of fail. Nobody gets penalized for not supporting something they don't claim to support.

The repo includes a Go validator (stdlib only, no external deps) that validates case files, runner output, and tool profiles against the spec. A runner template provides a working skeleton for building new runners.

What this is not

This is not a leaderboard. There's no rankings page and there won't be one. Each tool publishes its own results independently. The corpus tests observable behavior (did the request get blocked?) not implementation details.

This is also not a model benchmark. If you need to test whether the LLM refuses harmful instructions, use AgentDojo or AgentHarm. If you need to test whether the network security layer catches the attack after the model already failed, use this.

OWASP mapping

All 8 categories map to the OWASP Top 10 for Agentic Applications. URL, body, header, and MCP input cases cover ASI02 (Tool Misuse). Response injection cases cover ASI01 (Goal Hijack) and ASI06 (Memory Poisoning). MCP tool poisoning covers ASI04 (Supply Chain). Chain detection covers ASI08 (Cascading Failures). Full mapping with MITRE ATT&CK techniques is in the repo docs.

Try it

git clone https://github.com/luckyPipewrench/agent-egress-bench
cd agent-egress-bench/validate
go build -o aeb-validate .
./aeb-validate ../cases

If you build agent egress security tools, write a runner. If you find a gap in the corpus, open an issue. Cases and runners from any vendor are welcome.

The repo is Apache 2.0 licensed. Full docs, governance policy, and contribution guidelines are in the repo.

Top comments (0)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.