Agent security has a survivorship-bias problem — we're armoring the wrong part of the plane

#agents #ai #programming #security

In WWII, the military studied bombers that came back and wanted to armor the spots with the most bullet holes. Abraham Wald pointed out the mistake: those holes mark where a plane can get hit and still fly home. The armor belongs where the returning planes have no holes — because the planes hit there never came back to be studied.

I've spent the last few weeks building a static + behavioral scanner for LLM agents, and ran it across 60+ open-source agent repos — AutoGPT, CrewAI, LangGraph, mem0, and a pile of newer frameworks.

Two things stood out.

One: the findings cluster around the same handful of issues everyone already talks about — eval(model_output) (yes, real, CVE-2025-51472 in SuperAGI), prompt-injection surfaces, LLM calls with no token ceiling (the "$4k overnight bill" stories). These are real. They're also the bullet holes on the planes that came home. Visible, patchable, survivable — and every SAST tool and guardrail vendor is racing to cover them. In 18 months they're table stakes.

Two, and this is the part that nags me: most well-maintained repos scan clean. And a clean scan is not proof of safety — it's the survivorship bias. The agents that failed catastrophically didn't leave a grep-able fingerprint. They left an incident: money wired to the wrong account, prod data deleted, a defamatory email sent, secrets exfiltrated through a poisoned tool. Those planes went down where our scanners have no data — which is exactly why the scans look clean.

So where were the downed planes actually hit? Not in code patterns. In two boundaries static analysis can't read:

Output → consequential action.

The agent's decision was plausible but wrong, and it triggered something irreversible. Every line of code was fine. The failure was in what it did, not what it is. Does anyone check whether irreversible tool actions (pay, delete, deploy, send) are gated behind a confirmation, a dry-run, a human?

Trust boundaries — MCP tools, agent-to-agent handoffs, persistent memory. The agent trusted a poisoned input and acted on it. No eval, no injection string, nothing to grep. Does anyone verify that agent A should trust agent B's output before acting on it? That an MCP tool's description isn't itself an injection?

These are invisible to SAST (no pattern), to guardrails (they filter one input, not the action), to evaluators (they score the text, not the consequence). Nobody is asking the question that actually keeps people up at night: "what can this agent DO, and is the dangerous part gated?"

I don't have this fully solved — I'm building toward it (release-gate, open source, if you're curious). But I'm posting because I think the framing matters more than any tool right now, and I want to be wrong in public.

So, genuinely: if you run an agent anywhere near production — what's the fatal boundary you're most afraid of that nothing you have today would catch? The irreversible action? The tool you can't fully trust? The model quietly drifting under you?

I'd rather learn what the missing planes look like from people who've flown the mission than keep guessing.

DEV Community

Agent security has a survivorship-bias problem — we're armoring the wrong part of the plane

Top comments (0)