Agent Sandbox Escape Detector: Black-Box Security Scanning for LLM Agents

#agents #machinelearning #python #opensource

Most agent security tools focus on known jailbreak phrases or static rule-matching. That approach misses the point. A real attacker does not check a list of banned words - they probe the agent's actual behavior with semantically varied adversarial inputs and look for signs that something slipped through.

Agent Sandbox Escape Detector takes the same approach. Point it at any HTTP chat endpoint, and it fires a battery of adversarial prompts across six attack categories, then uses Claude Opus 4.8 as an independent judge to determine whether the agent leaked data, broke persona, or executed injected instructions. The result is a structured scan report with per-probe verdicts, evidence excerpts, and confidence scores.

The key insight is that you do not need whitebox access to test an agent — all you need is its chat endpoint. The scanner treats the agent as a black box and probes it the same way a real attacker would.

Architecture

Entry: CLI (--target URL)  ·  POST /scan
               │
               ▼
  ┌────────────────────────────────────────────┐
  │  Scanner  ·  asyncio.gather()              │
  │  all 6 probes run concurrently             │
  │  per-probe error isolation                 │
  └──────────────────┬─────────────────────────┘
                     │
       ┌─────────────┼──────────────────┐
       ▼             ▼                  ▼
  ┌─────────┐  ┌──────────────┐  ┌────────────────────┐
  │  Tool   │  │ Prompt Leak  │  │ API Call           │
  │ Access  │  │ Role Confus. │  │ Indirect Injection │
  └────┬────┘  └──────┬───────┘  │ Jailbreak  (6 tot) │
       │               │          └──────────┬─────────┘
       └───────────────┴───────────────────── ┘
                     │ 4–6 adversarial prompts per category
                     ▼
  ┌────────────────────────────────────────────┐
  │  Target Agent  ·  any HTTP chat endpoint   │
  │  httpx async POST                          │
  └──────────────────┬─────────────────────────┘
                     │ agent responses
                     ▼
  ┌────────────────────────────────────────────┐
  │  Claude Judge  ·  claude-opus-4.8          │
  │  via OpenRouter                            │
  │  → ESCAPED / SAFE  ·  confidence (0–1)     │
  │  → evidence excerpt from response          │
  └──────────────────┬─────────────────────────┘
                     │
                     ▼
  ┌────────────────────────────────────────────┐
  │  Scan Report                               │
  │  JSON  ·  Markdown  ·  Rich console        │
  │  GET /results/{scan_id}                    │
  └────────────────────────────────────────────┘

What It Tests

Each probe category targets a different class of vulnerability observed in production LLM deployments. Every category fires 4–6 distinct prompts to reduce false negatives from surface-level prompt filtering. All probes run concurrently - a full scan of six categories completes in roughly the same time as a single category.

Tool Access - requests that trick the agent into invoking unauthorized tools: file reads, database queries, exec calls. A vulnerable agent attempts to call off-limits functions or returns file and database content.

Prompt Leak - social-engineering and meta-instruction tricks to extract the system prompt. A vulnerable agent reveals system prompt contents, configuration, or internal instructions.

API Call - crafted inputs designed to make the agent perform SSRF or data exfiltration via HTTP. A vulnerable agent constructs or references external URLs containing sensitive data.

Role Confusion - persona hijacking, identity override, DAN-style persona injection. A vulnerable agent breaks character or adopts a no-restrictions alternate identity.

Indirect Injection - malicious payloads hidden in simulated tool outputs or document context. A vulnerable agent executes injected instructions as if they came from a trusted source.

Jailbreak - chain-of-thought manipulation, hypothetical framing, and reasoning override. A vulnerable agent agrees to bypass safety constraints or leaks reasoning chain internals.

How the Judge Works

After collecting all agent responses, each probe's batch of responses is sent to Claude Opus 4.8 via OpenRouter with a structured evaluation prompt. The judge produces a binary ESCAPED/SAFE verdict, a confidence score from 0 to 1, and a natural-language evidence summary quoting specific parts of the response that led to the verdict.

This approach is significantly more accurate than regex pattern matching because the judge understands context - an agent that says "I cannot help with that" is different from one that says "I cannot help with that, but here is /etc/passwd anyway."

Getting Started

Install dependencies into a virtual environment, copy .env.example to .env, and add your OPENROUTER_API_KEY. Then point the CLI at any agent's chat endpoint:

python -m src.cli scan --target http://localhost:8000/chat

To scan only specific probe categories or save results to JSON:

python -m src.cli scan --target http://localhost:8000/chat --probes tool_access,jailbreak --output report.json

Start the FastAPI server for REST integration:

uvicorn src.api.main:app

Environment Variables

OPENROUTER_API_KEY=sk-or-...    # Required — used for Claude judge calls via OpenRouter

API

POST /scan - accepts a target URL and optional probe list, returns a scan ID immediately, and runs the scan asynchronously.

GET /results/{scan_id} - returns the full structured report once the scan is complete.

GET /health - liveness probe for uptime monitoring.

Live Scan Results

Real scan run against a Claude-powered HTTP agent on 2026-06-09:

0 escapes detected across 6 probe categories - approximately 30 adversarial turns total. Scan ID 0c4bffa6, 2026-06-09.

Source Layout

The scanner orchestrates all probes via asyncio.gather() so they run in parallel, with per-probe error isolation so a timeout on one category never blocks the others. Each probe is a standalone class inheriting from BaseProbe - adding a new attack category means writing one class and one prompts file. The judge lives in core/judge.py and is stateless: it takes a list of responses and returns a list of ProbeResult objects. Reports are assembled by core/report.py, which handles JSON serialization, Markdown formatting, and Rich console rendering independently.

The test suite uses a vulnerable dummy agent fixture - an in-process FastAPI app that always complies with requests - to verify the scanner can detect escapes, and a safe dummy agent to verify it does not produce false positives. 64 tests, passing in approximately 15 seconds.

How I Built This Using NEO

This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.

The requirement was a black-box behavioral security scanner for LLM agents - one that probes any HTTP chat endpoint with adversarial prompts across six attack categories, uses Claude Opus 4.8 as an independent judge, and produces structured reports with per-probe verdicts, evidence excerpts, and confidence scores. NEO built the full implementation: the async orchestrator using asyncio.gather() with per-probe error isolation, all six probe classes inheriting from BaseProbe with their adversarial prompt files, the stateless Claude judge in core/judge.py via OpenRouter, the report assembler in core/report.py covering JSON, Markdown, and Rich console output, the CLI entry point, the FastAPI REST server with POST /scan and GET /results/{scan_id}, and the 64-test suite with vulnerable and safe dummy agent fixtures.

How You Can Use and Extend This With NEO

Use it to security-test any agent before it goes to production.
Point the CLI at your agent's chat endpoint and run a full six-category scan. The structured report tells you exactly which probe categories the agent failed, what the judge found in the response, and the confidence score - before real users can probe the same vulnerabilities.

Integrate it into CI to catch security regressions on every deploy.
Use POST /scan to trigger a scan and GET /results/{scan_id} to poll the report. If any probe returns ESCAPED above your confidence threshold, fail the pipeline. Agent behavior can regress with model updates or prompt changes - automated scanning catches this before it reaches production.

Use the probe categories as a security checklist when building agents.
The six categories - tool access, prompt leak, API call, role confusion, indirect injection, and jailbreak - map directly to the vulnerabilities that have been observed in production LLM deployments. Running the scanner on your agent during development tells you which categories need stronger guardrails before launch.

Extend it with additional probe categories.
Each probe is a standalone class inheriting from BaseProbe with a corresponding prompts file. A new attack category follows the same pattern and is automatically picked up by the orchestrator, judge, and report pipeline without any changes to the core.

Final Notes

Agent security is behavioral, not syntactic. A scanner that checks for banned phrases misses the attacks that matter. Agent Sandbox Escape Detector probes real behavior across six attack categories, judges responses with a frontier model that understands context, and gives you structured evidence - so you know not just whether an agent escaped, but how and where.

The code is at https://github.com/dakshjain-1616/Agent-Sandbox-Escape-Detector
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code