The Lethal Trifecta: Securing AI Agents Against Prompt Injection

#ai #security #llm #agents

Prompt injection turns into an actual data breach when one agent has three capabilities at the same time: access to private data, exposure to untrusted content, and a way to send data outside the trust boundary. Hold all three and an attacker with zero credentials can plant instructions in the untrusted content, ride your agent's existing access, and exfiltrate your data through the external channel. This is the security model that explains nearly every public agent breach, and if you're building agents, it should drive your architecture.

We ship AI agents for B2B clients, so this isn't abstract. Security here is a design constraint, not a feature you bolt on after the demo.

Why prompt injection is structural, not a bug you patch

The uncomfortable truth: large language models have no built-in way to tell trusted instructions from untrusted data, because both arrive as the same stream of tokens. When your agent reads a web page, a support ticket, or an email, the model can't reliably distinguish "this is content to process" from "this is a command to obey." An attacker who controls any text the agent ingests can attempt to redirect it.

That's why the security community increasingly treats prompt injection as an architectural property to contain rather than a vulnerability to patch out. You will not prompt your way to immunity. Input filters help at the margin, but they don't change the fact that commands and data share a channel. The defensible move is to assume injection will eventually succeed and design so that success doesn't equal a breach.

The lethal trifecta, made concrete

Named by researcher Simon Willison, the trifecta is the clearest way to reason about agent risk. A breach needs all three:

Access to private data (your database, internal docs, a user's inbox).
Exposure to untrusted content (anything an attacker can influence: web pages, emails, uploaded files, third-party API responses).
External communication (the ability to send data out: an HTTP call, an email, a webhook, even a rendered image URL).

Any one or two of these alone is usually fine. A summarizer that reads untrusted web pages but has no private data and no outbound channel can't leak anything. The danger is exclusively in the combination. So the security question for any agent becomes: does this single execution path hold all three at once? If yes, you have a potential exfiltration path, and you treat it as one.

The Rule of Two

The most practical heuristic going around is Meta's "Agents Rule of Two": an unsupervised agent should hold at most two of the three trifecta capabilities. If a task genuinely needs all three, a human approves the consequential step. Concretely: an agent that reads untrusted content and can call external services should not also have direct access to your private data store on the same path. Break the path, or insert a human at the action that crosses the boundary. You're not crippling the agent; you're making sure no single hijackable path has everything an attacker needs.

Defense in depth, because injection will get through

Assume the injection lands. The job is to make sure it can't do damage:

Least privilege on tools. The agent gets the narrowest scopes that do the job. A read-only task gets read-only credentials.
Egress control. Allowlist outbound destinations. An agent that can only call three approved internal endpoints can't POST your data to an attacker's server, even if hijacked.
Detect the downstream behavior. Watch for what injection enables: unexpected exfiltration, privilege escalation, odd outbound traffic. The behavior is often easier to catch than the prompt.
Human approval on consequential, irreversible actions.

This is the same shape as a real vibe-coded misconfiguration that exposed entire databases through public keys, the model produces something that looks shippable, and the security work between generation and deploy is exactly what was skipped: a single vibe-coded misconfig exposed entire databases, here's the security bill. Agents raise the stakes because they don't just write the risky code, they execute it.

Key takeaways

Prompt injection becomes a breach only when one path has the lethal trifecta: private data, untrusted content, and external communication.
It's structural. LLMs can't separate instructions from data, so assume injection will eventually succeed.
Apply the Rule of Two: an unsupervised agent holds at most two of the three; require human approval for the third.
Layer defenses: least privilege, egress allowlists, behavior detection, and human gates on irreversible actions.

FAQ

Can I just filter out prompt injection?
Filters help but can't fully solve it, because commands and data share the same token stream. Design so a successful injection still can't exfiltrate.

What's the fastest win for an existing agent?
Egress control plus least-privilege tool scopes. Together they mean even a hijacked agent has little it can reach and nowhere to send stolen data.

Does the Rule of Two slow agents down?
Only on paths that need all three trifecta capabilities, which are the genuinely dangerous ones. You add friction exactly where the risk is.

If you're designing agents that touch real data and thinking about the trifecta in your architecture, you're already ahead of most. Happy to compare threat models and guardrail designs at Shanti Infosoft.