Two things happened in the same week this month. Apple shipped Xcode 26.3 with autonomous agents that can build, test, and iterate on your project without you. OpenAI hired the creator of OpenClaw - an open-source agent that controls your entire computer. Email, browsers, apps, purchases.
Both run on Model Context Protocol. Neither shipped with input scanning.
If you're building with agents, this is about your workflow.
What MCP Changed
Model Context Protocol connects AI models to external tools through a standard interface. Anthropic built it, and within eighteen months OpenAI, Microsoft, Google, and Apple had all adopted it. The Linux Foundation gave it a permanent home. OpenClaw runs on it. It's infrastructure now.
What that means practically: your agent talks to MCP servers to access tools, data, and APIs. Each server exposes capabilities the agent can call. The agent trusts what those servers tell it - tool descriptions, parameter schemas, response data. All of it enters the agent's context as authoritative input.
There is no built-in validation layer between the server's output and your agent's next action.
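What a minimal version of that missing layer could look like is easy to sketch. The patterns and the gate below are illustrative only - real injections are far more varied than a short regex list, and this is not a substitute for a real scanner:

```python
import re

# Illustrative patterns only - real-world injections are far more varied.
SUSPICIOUS = [
    r"ignore (all )?(previous|prior) instructions",
    r"do not (tell|inform) the user",
    r"<!--.*?-->",            # instructions hidden in HTML comments
    r"\u200b|\u200c|\u2060",  # zero-width characters used to hide text
]

def scan_untrusted(text: str) -> list[str]:
    """Return the suspicious patterns matched in text from an MCP server."""
    return [p for p in SUSPICIOUS
            if re.search(p, text, re.IGNORECASE | re.DOTALL)]

def ingest_tool_description(description: str) -> str:
    """Gate a tool description before it enters the agent's context."""
    hits = scan_untrusted(description)
    if hits:
        raise ValueError(f"tool description flagged: {len(hits)} pattern(s) matched")
    return description
```

The point isn't the pattern list - it's where the check sits: between the server's output and the agent's context, a position the protocol itself leaves empty.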
What's Already Been Exploited
OpenAI's own red team found it first. Their automated attacker planted a malicious email in a test inbox. When a user asked the Atlas browser agent to draft an out-of-office reply, the agent ingested the poisoned email instead, abandoned the task, and sent a resignation letter to the user's CEO. OpenAI's response was unusually candid - prompt injection, they said, is "unlikely to ever be fully solved."
Microsoft's EchoLeak was worse. A crafted business email with hidden prompts, sitting in an Outlook inbox, waiting. When a user asked Copilot to summarize a report, it pulled in the attacker's email, extracted API keys, project documents, and conversation history, and sent them to an external server. Zero clicks. Zero awareness. CVSS 9.3. Microsoft patched it, but the attack class - LLM Scope Violation - isn't specific to Copilot. Any RAG-based agent that mixes trusted and untrusted content is vulnerable.
Anthropic - the company that created MCP - shipped their own Git MCP server with flaws that let an attacker escape the repository they were supposed to be locked into, access any repo on the system, and execute code. A researcher demonstrated it by combining two of Anthropic's own MCP servers. Default install, no modifications needed.
These aren't theoretical. The companies building agents are finding exploits in their own products.
Where This Hits Your Workflow
During a typical session your agent accesses project files, dependencies, documentation, tool metadata from MCP servers, retrieved context, and anything you drag into the conversation. Each of those is an input vector.
Code review and dependency scanning don't reach the places these attacks live: tool metadata, MCP server descriptions, parameter hints. Your agent reads all of it before it writes a single line.
Tool poisoning is prompt injection delivered through tool metadata - a different channel for the same problem.
If your workflow doesn't include scanning inputs before they reach the agent, you have a gap. Not a theoretical one - a gap that's already being exploited in the wild.
What I Built and Why
I started building input scanning tools when the threats were simpler - direct prompt injection in text inputs and prompt decks. Then hidden instructions started showing up in document metadata, tracked changes, EXIF data, HTML comments, email bodies, PNG text chunks. Every hidden zone in a file turned out to be a place to plant instructions an agent would follow.
MPS-Agentic scans all of those zones, extracts the content, and weighs the risk based on how hidden it is - because hiding instructions is inherently suspicious. The output tells you what was found, where, and how it scored, before the file reaches your agent.
It handles images (JPEG/PNG with EXIF and OCR), PDFs, Word documents, HTML, eml and msg email formats, Markdown, text files, ZIP archives, and URLs. The formats agents actually consume.
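The idea of hidden zones as carriers is easy to demonstrate with two of the formats above. Here's a rough stdlib-only sketch - the zone weights are made up for illustration, and MPS-Agentic's actual extraction and scoring are not shown here:

```python
import re
import struct

# Hypothetical weights: the more hidden the zone, the more suspicious.
ZONE_WEIGHT = {"html_comment": 3, "png_text_chunk": 4}

def html_hidden_zones(html: str) -> list[tuple[str, str]]:
    """Extract HTML comments - invisible to the user, readable by the agent."""
    return [("html_comment", m.group(1).strip())
            for m in re.finditer(r"<!--(.*?)-->", html, re.DOTALL)]

def png_text_chunks(data: bytes) -> list[tuple[str, str]]:
    """Walk PNG chunks and pull out tEXt payloads (keyword\\0text)."""
    zones, offset = [], 8  # skip the 8-byte PNG signature
    while offset + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[offset:offset + 8])
        payload = data[offset + 8:offset + 8 + length]
        if ctype == b"tEXt":
            _keyword, _, text = payload.partition(b"\x00")
            zones.append(("png_text_chunk", text.decode("latin-1")))
        offset += 12 + length  # 4 length + 4 type + payload + 4 CRC
    return zones

def score(zones: list[tuple[str, str]]) -> int:
    """Sum zone weights - hidden instructions score higher than visible text."""
    return sum(ZONE_WEIGHT[zone] for zone, _ in zones)
```

A PNG that renders as a blank image can still carry a full paragraph of instructions in a tEXt chunk; nothing in the rendering pipeline will ever show it to you.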
It's a web service - no installation; upload files or use the API. The Starter tier is $10/month for 1,000 scans.
The Minimum You Should Be Doing
MPS-Agentic is one layer. Here's the rest:
Know your MCP connections. Inventory every server your agents talk to. If you can't list them, you can't secure them.
Enforce authentication. By default, MCP servers respond to any client that connects — no identity check, no access control. The SDK doesn't include authentication. You add it, or your server answers anyone who asks. The spec recommends OAuth 2.1.
Scope permissions tight. An agent with broad access that gets compromised gives the attacker everything the agent can reach. Least privilege, same as you'd apply to any service account.
Assume prompt injection will succeed. If a poisoned input takes over your agent's next action, what can the agent access? Your file system? Your credentials? Your deployment pipeline? That's what you're exposing. Tighten permissions before it happens, not after.
Scan before processing. Every file, every piece of context, every tool response that enters your agent's pipeline — treat it as untrusted input. This is the discipline we learned with user input a decade ago. Agent input is the same problem with higher stakes.
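The permission-scoping advice above can be sketched as a simple allowlist guard in front of an agent's file tool. The root directory and the tool shape here are hypothetical - the point is that the check runs before the agent acts, so a successful injection still hits a wall:

```python
from pathlib import Path

# Hypothetical scope: the only directory this agent's file tool may touch.
ALLOWED_ROOT = Path("/srv/agent-workspace").resolve()

def authorize_path(requested: str) -> Path:
    """Resolve the requested path and refuse anything outside the allowed root.

    Resolving first defeats ../ traversal: a poisoned input that steers the
    agent toward ~/.ssh or /etc is rejected before any read or write happens.
    """
    target = (ALLOWED_ROOT / requested).resolve()
    if not target.is_relative_to(ALLOWED_ROOT):
        raise PermissionError(f"outside agent scope: {requested}")
    return target
```

Same least-privilege reasoning as a service account: the guard doesn't try to detect the injection, it just caps what a hijacked agent can reach.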
The Bigger Picture
The governance is catching up — OWASP, Microsoft, and NIST all published agentic security guidance in the last three months. But frameworks describe risks. They don't scan your files, and they don't change your workflow.
If you build with agents, the scanning step isn't optional anymore. It's just not in your pipeline yet.
I'm Marshall Goodman, founder of Strategic Prompt Architect and creator of MPS-Agentic. I write about AI security from the practitioner side — building the tools, not analyzing the frameworks.