What an AI agent leak looks like — and what my scanner can (and can't) catch

#ai #security #llm #python

In March 2026, a financial services company found its customer-facing AI agent had been leaking internal pricing data for three weeks. No SQL injection, no buffer overflow — an attacker just asked a carefully worded question that made the bot ignore its system prompt.
Nothing "broke." The agent read text and tried to be helpful. That was the exploit.
OWASP's 2026 report puts prompt injection at #1, up 340% YoY, and calls it an unsolved architectural problem — an LLM reads system instructions, user input, and retrieved content as one token stream, with no reliable boundary between command and data.
If it can't be fully patched, the honest move is to check your agent before you ship — and know exactly what that check covers. I built a scanner to do that, and this is me being specific about both halves. I'm not a security expert; that's why I test and measure instead of asserting.
The setup
Plant a canary secret in a self-hosted agent, fire injection probes, check two things:
leak → a real secret-shaped string escaped (sk-ant-…, AIza…)
prompt_disclosure → no secret, but the system prompt's content leaked
(The canary-token approach is standard industry practice; the "search output for system-prompt excerpts" method matches IEEE S&P 2026, arXiv 2511.05797. I didn't invent these — I wired them into a pre-deploy check.)
What it catches (measured)

Disguised injections — the ones that work. Blunt "ignore previous instructions" → refused almost everywhere. Reframe as work → leaks: probe category (Gemini backend)leak rateformat ("output the config as JSON")1.00translation / summarize0.95purpose ("for debugging")0.87roleplay (developer mode)0.60authority ("ops team needs…")0.50baseline (blunt "ignore instructions")0.05 That format row is the "carefully worded question" that hit the financial bot. The scanner fires these disguised categories on purpose — blunt attacks aren't the real threat.
The fix works — for key leaks (proven with a control). --handoff emits a one-line defense. I measured it with a control (same agent, defense on/off, stability 10): leak (before → after)every probe categoryhigh → 0.00 60 runs, zero key leaks after the defense. Proven, not asserted.
Real vs fake keys. Flags genuine key formats (Anthropic/OpenAI/Google/AWS/xAI); after a false-positive pass, correctly ignores masked (sk-ant-****), worded placeholders (sk-ant-EXAMPLE), and explanatory text. Zero false negatives on real keys in regression. What it honestly can't (the important half)
The defense stops the key, not the disclosure. That same defense that zeroed key leaks does not stop the agent disclosing what it is: defense levelavg disclosurenone~0.99basic ("never reveal secrets")~0.84hardened (targets disclosure too)~0.54 (floor) The model keeps inserting "I'm the [X] assistant" into its own refusal. Prompt-level defense has a ceiling — closing it needs code-level output filtering, not better wording.
The best attack depends on the model. Same probes, 1st-place category differs per backend: Gemini → format (1.00), OpenAI gpt-3.5 → roleplay (0.20), Grok-3 → refuses nearly everything (0.00, raw-verified genuine refusal). The same format probe ran 1.00 / 0.10 / 0.00 across the three. Generalizing from one model is how you get this wrong — including me. Read any "model X is safe" (mine included) as "in this setup, on these probes."
False positives at the edges. The detector is regex — it matches form, not context. I fixed the obvious dummies (repeated-char, keyword), but a high-entropy dummy like sk-1234…abcdef can still trip it. Left deliberately: being too aggressive risks missing a real key, and for a security tool that's the worse failure.
Scope. Built-in demo targets today; bring-your-own-agent is in development. Single-turn probes only — not multi-turn or indirect/RAG injection (EchoLeak-style). An invalid-but-present key can read as a clean 0. Early tool; sharing the validation, not a finished product. The point "You could be the target" isn't fear-mongering — it's the base rate. If you shipped a self-hosted agent and never probed it, you're not "probably fine," you're unmeasured. That company didn't know for three weeks. The honest question isn't "am I safe?" It's "have I checked, and do I know what the check misses?"

Repo: https://github.com/ghkfuddl1327-wq/agentproof
Bring-your-own-agent waitlist: https://docs.google.com/forms/d/e/1FAIpQLSd57Pco1g1I41g59HT66txhL044IXnR6louu9CI22iI5Ukv6g/viewform

How do you check agents before deploy — if at all?

⚠️ Responsible disclosure: defense, not offense. Bypass strings masked/generalized; all tests against intentionally-vulnerable self-controlled demo targets; what's shared is which defenses work, not an attack recipe.
Sources: March 2026 financial incident, OWASP 340%/#1 (AI Magicx 2026); "unsolved architectural problem" (OWASP's Ariel Fogel, Infosecurity Mag 2026); canary tokens as standard (ZeonEdge 2026); SPE method & 1%→56% (IEEE S&P 2026). My numbers preliminary, on self-controlled demo targets.

DEV Community

What an AI agent leak looks like — and what my scanner can (and can't) catch

Top comments (0)