DEV Community

Cover image for AI Agent Security Has a Runtime Blind Spot, and Most Scanners Still Miss It
Max Mendes
Max Mendes

Posted on • Originally published at maxmendes.dev

AI Agent Security Has a Runtime Blind Spot, and Most Scanners Still Miss It

AI Agent Security Has a Runtime Blind Spot, and Most Scanners Still Miss It

What happened: OWASP now classifies MCP Tool Poisoning as its own attack class, and Microsoft Defender's team has already published Plug, Play, and Prey on the same gap.
Why it matters: Most agent scanners check prompts, repos, and tool definitions. None of that catches a tool response that behaves like an instruction.
My take: If your agent can call external tools and write to anything sensitive, you are probably one poisoned response away from a problem your scanner cannot see.

Two weeks ago I wrote about why MCP became the USB port for AI tools. The plug standard worked. The problem now is what flows through the cable. Tool registries like Smithery list more than 7,000 public MCP servers. Every one of them can hand the model free text. Every one of them sits inside the same context window as your filesystem, your inbox, and your write actions.

That is the runtime trust gap. The OWASP write-up names it directly: "Tool responses go straight into the LLM context with no equivalent check." This post is about why that line is the most important sentence in agent security right now, and what to actually do about it.

The Blind Spot in Plain Terms

Most agent security tools were designed for the old shape of the problem. Scan the prompt. Scan the connector catalog. Scan the dependency graph. Done.

That model assumes the danger lives at the input. It does not.

What scanners check today:

  • Prompt content and templates
  • Tool definitions and permissions
  • Known package CVEs
  • Server name and reputation

What they miss at runtime:

  • Tool output flowing back into context
  • The response path after the connection is open
  • Free text masquerading as structured data
  • The model treating that data as a plan

If an external tool returns plain text into the same context window as your privileged tools, the model does not file that under "data." It files it under "context." A polite-looking response can ask the agent to read a private file, push a branch, or paste a token, and the agent has no native concept of trust boundaries between tools.

Invariant Labs first showed this in production scale. Their tool poisoning notification demonstrated malicious MCP servers hiding instructions inside tool descriptions, invisible to the user but visible to the model. Then their GitHub MCP exploit went further: a single crafted GitHub Issue hijacked an agent and exfiltrated private repository contents to a public PR. No prompt injection from the user. No malicious package. Just a tool response that did its job too well.

Who Wins and Who Loses

The winners here are predictable once you know what to look for. Teams that isolate privileged tools from untrusted external tools win. Teams that force destructive actions through server-side approval gates win. Teams that constrain output to schemas instead of free text win. Anyone who treats every external response as untrusted text wins.

The losers are the teams shipping demo-grade agent security. They review the system prompt. They run a scanner on the connector list. They click through three approvals. They call it done. Then a tool returns a line that reads like a request, the model treats it like a plan, and the next thing in the audit log is something nobody approved.

If you build automation systems for real businesses, this is exactly where your liability lives. Not at connect time. Not in the prompt. In the response that arrived at 3am while the agent was doing its rounds.

Why Scanners Got It Wrong

The scanner category was built for a previous era of AI security. It assumed the model was the asset, the user was the attacker, and the tools were trusted infrastructure. Two of those three assumptions are now wrong.

Simon Willison calls the new shape the lethal trifecta: an agent with access to private data, exposure to untrusted content, and an outbound communication channel is unconditionally vulnerable to indirect prompt injection. Not "vulnerable if you misconfigure it." Unconditionally. Almost every useful agent setup has all three. Mine does. Yours probably does too.

Lakera's year-of-the-agent review makes the operational point: indirect injection succeeds on fewer attempts than direct injection. Zero-click agent attacks, where a poisoned document sitting in Google Drive triggers an action through an MCP server, moved from research demo to documented incident in a single quarter.

The same pattern showed up last year in the vibe-coded apps Wiz scanned. Static review came back clean. Runtime assumptions were broken from day one. Hardcoded keys, missing auth, trusted-by-default endpoints. Different surface, same lesson: the floor moves at runtime, and review tools that only look at code never notice.

There is also a supply-chain angle nobody likes to talk about. CVE-2025-6514, an RCE in the mcp-remote package, sat inside a dependency downloaded over 437,000 times. Pillar Security found 492 publicly exposed MCP servers leaking secrets or accepting unauthenticated calls. The category that was supposed to standardize tools also standardized the blast radius.

Four Questions Worth More Than Your Scanner

If I were auditing an agent system tomorrow, I would care about four boring questions before anything in a marketing deck.

  1. Free text into privileged context. Can any external tool return arbitrary free text into the same context window as your privileged tools?
  2. Tool isolation. Are privileged tools (file writes, GitHub, email, payments) isolated from the untrusted external ones?
  3. Server-side enforcement. Are destructive or irreversible actions enforced server-side, not just gated by a prompt?
  4. Out-of-LLM approval. Does anything sensitive require explicit human approval outside the model loop?

If any answer is "I'm not sure," that is your real exposure. The scanner is not going to find it for you. The OWASP AI Agent Security Cheat Sheet gets close to the same shape: treat external data as untrusted, push least privilege, watch for memory poisoning. The new MCP specification is even more direct: tool descriptions and tool inputs MUST be treated as untrusted, hosts must require explicit consent. The spec says it. Most implementations still don't enforce it.

The Pattern I See in OpenClaw

In my own setup, one agent can read files, run scripts, update GitHub, check websites, and call external services. That is seven loosely connected tools sharing one context window. It is also the exact shape Willison's trifecta describes.

The tool I distrust most is whichever one most recently called something on the open internet. That is not a fixed answer. It rotates. The point is not "this tool is bad." The point is that the trust level of the agent should drop the moment external text enters its context, and most setups do the opposite. They treat the tool's identity as a trust badge that lasts the whole session.

When I was building FlowMate, the SaaS I built solo, I treated every external API response as untrusted text by default. Parse it. Constrain it. Never let it become an instruction without a check. The same instinct applies to MCP tool output, just with bigger consequences because the surface is larger and the model is the parser.

This is the AI code overload problem applied to security: more code, more tools, more integrations, less time to actually understand any of them. The fix is not heroics. It is structure.

What I'd Do Monday Morning

If you run an agent in production and you have not done this yet, start here. None of it requires a vendor.

  1. Inventory every external tool the agent can call. Write down which ones can return free text. That list is your attack surface.
  2. Map the privilege graph. Which tools can write? Which can read sensitive context? Any pair where an external-text tool sits in the same context as a write tool is a risk to fix today.
  3. Force destructive actions to go through a server-side ACL, not the prompt. The model can ask. The server decides.
  4. Constrain outputs with schemas. A tool that returns JSON with declared fields cannot smuggle a paragraph that asks the agent to read ~/.ssh.
  5. Log every tool call with full input and output, and review the log weekly for anything that looks like an instruction inside a response.

Most of my AI integration work starts with that first item. The inventory alone usually surfaces two or three tools that should never have shared a context window with anything privileged.

The next thing to watch is whether agent security products start acting like runtime proxies instead of static scanners. The MCP runtime proxy from Trail of Bits and the schema-locked execution layers some vendors are previewing point in the right direction. Until that category matures, the only honest answer is structural: shrink the blast radius, isolate the trust levels, and stop pretending a clean scan equals a safe system.

If you are running an agent in production and you are not sure where the runtime gaps are, send me the agent setup. I will tell you what I would worry about. I will write more as this evolves.

This article was originally published on maxmendes.dev.

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.