Manveer Chawla

Posted on Feb 25 • Originally published at manveerc.substack.com

The Prompt Injection Problem: A Guide to Defense-in-Depth for AI Agents

#ai #agents #security #promptengineering

TL;DR

Prompt injection is an architecture problem, not a benchmarking problem. Anthropic's Sonnet 4.6 system card shows 8% one-shot attack success rate in computer use with all safeguards on, and 50% with unbounded attempts. In coding environments, the same model hits 0%. The difference is the environment, not the model.
Training won't fix prompt injection. Instructions and data share the same context window. SQL injection for the LLM era requires an architectural fix, not a behavioral one.
The "lethal trifecta" is the threat model. When your agent has tools, processes untrusted input, and holds sensitive access, all three at once, prompt injection becomes catastrophic. Almost every use case people want hits all three.
Build the kill chain around the model. A five-layer defense (permission boundaries, action gating, input sanitization, output monitoring, blast radius containment) turns the question from "will injection happen" to "how bad when it does."
Defense-in-depth constrains the autonomy ceiling. Agents that need human review for irreversible actions don't replace humans. They augment them. The companies winning here redesign the loop, not remove the human from it.

Anthropic published the Claude Sonnet 4.6 system card on February 17, 2026. Buried in the safety evaluations is a number that should change how every engineering team thinks about deploying agentic AI.

With every safeguard enabled, including extended thinking, automated adversarial attacks still achieve a successful prompt injection takeover 8% of the time on the first attempt in computer use environments. Scale to unbounded attempts and the success rate climbs to 50%.

Here's what makes this number genuinely interesting, not just alarming. In coding environments with the same model and the same extended thinking, the attack success rate drops to 0.0%.

Zero. The model didn't get smarter between these two evaluations. The environment changed.

Coding environments have structured inputs: code, terminal output, API responses with defined schemas. Computer use environments encounter arbitrary untrusted content: web pages, emails, calendar invites, documents with hidden text, DOM elements with embedded instructions.

The difference isn't the model. It's the attack surface.

A commenter in a Hacker News thread on the system card put it bluntly: "That seems wildly unacceptable. This tech is just a non-starter unless I'm misunderstanding."

He's not misunderstanding. He's looking for the solution in the wrong place.

When I built Zenith's own agent infrastructure, I made the same mistake. I assumed model improvements would close the gap. They won't. Not fully.

The solution isn't a better model. It's a better architecture around the model.

This post explains why prompt injection is an architecture problem, defines precisely where the risk concentrates, and lays out a five-layer defense framework for teams shipping agents into production.

When is Prompt Injection Most Dangerous? The Lethal Trifecta

Not every agent deployment carries the same risk. Understanding exactly where risk concentrates determines where you invest engineering effort.

Simon Willison coined the term "lethal trifecta" to describe the combination of capabilities that makes an agent critically vulnerable to prompt injection. An agent enters the danger zone when three conditions occur simultaneously.

The agent has access to tools. The agent can take actions: send emails, execute code, click buttons, call APIs, move money.

A model that only generates text in a chat window can't cause real-world harm through prompt injection. The moment the model gains the ability to act on systems, the stakes change categorically.

The agent processes untrusted input. The agent reads content it didn't generate: web pages, incoming emails, documents uploaded by third parties, API responses from external services, calendar invites from unknown senders.

Any content the agent ingests that an attacker could have influenced counts as untrusted input.

The agent has access to sensitive data or capabilities. The agent can reach credentials, PII, financial systems, internal APIs, private documents, or anything else that causes damage if exfiltrated or misused.

Any two out of three is manageable. An agent with tools and sensitive access but no untrusted input (an internal automation bot processing only your own data) is reasonably safe.

An agent processing untrusted input with sensitive access but no tools (a summarization engine reading external documents) can't act on injected instructions.

An agent with tools and untrusted input but no sensitive access (a web scraper writing to a sandbox) has limited blast radius.

All three together is where prompt injection becomes catastrophic. And almost every use case people want involves all three.

Use Case	Tools?	Untrusted Input?	Sensitive Access?	Risk Level
Summarize a doc I uploaded	No	No	No	Safe
Browse the web for research	No	Yes	No	Safe
Send emails on my behalf	Yes	No	Yes	Manageable
Read my emails and reply	Yes	Yes	Yes	Lethal
Browse web + write code in my repo	Yes	Yes	Yes	Lethal
Fill out forms on websites	Yes	Yes	Depends	Likely lethal
Computer use (general)	Yes	Yes	Yes	Lethal

The "safe zone" is far narrower than most deployment plans assume. During the HN discussion, one commenter tried to argue for a narrow safe zone limited to internal apps with no external input. Another correctly shot it down: even a calendar invite can contain injection text. Even a PDF from a trusted colleague can carry hidden white-on-white text with embedded instructions.

The Notion 3.0 incident proved this threat is real. Attackers used exactly that technique (hidden text in PDFs) to instruct the Notion AI agent to use its web search tool and exfiltrate client lists and financial data to an attacker-controlled domain.

The EchoLeak vulnerability (CVE-2025-32711) against Microsoft 365 Copilot was even worse: a zero-click indirect injection via a poisoned email enabled remote exfiltration of emails, OneDrive files, and Teams chats. No user interaction required.

Meta has operationalized this threat model through their "Agents Rule of Two" policy, mandating human-in-the-loop supervision whenever all three conditions are met. That's the right starting point for any team deploying agents against untrusted content.

Why "train it away" won't work

The natural response to the 8% number is to assume the next model generation will fix the problem. If training improved resistance from 50% to 8%, surely continued training will push it to 0%.

I held this view for a while. I was wrong.

Prompt injection is fundamentally different from content moderation. Content moderation (blocking harmful outputs, refusing dangerous requests) operates on the semantics of what the model produces. Prompt injection operates on the control plane: the model can't reliably distinguish between "instructions from the user" and "instructions embedded in content the user asked it to read" because both arrive as tokens in the same context window.

The security community spent decades eliminating in-band signaling vulnerabilities. SQL injection existed because queries and data shared the same channel. XSS existed because code and content shared the same rendering context. Command injection existed because shell commands and arguments shared the same string.

In every case, the fix was architectural: parameterized queries, content security policies, structured argument passing. The fix was never "train the database to be smarter about distinguishing queries from data."

LLMs have reintroduced in-band signaling at a fundamental architectural level. Trusted instructions (system prompts, user messages) and untrusted data (web page content, email bodies, document text) get concatenated into a single context window and processed by the same transformer mechanism.

There's no equivalent of a parameterized query. Karpowicz's Impossibility Theorem (June 2025) formalizes this argument, claiming that no LLM can simultaneously guarantee truthfulness and semantic conservation, making manipulation a mathematical certainty under adversarial conditions. OWASP's Top 10 for LLM Applications ranks prompt injection as the number one vulnerability for the second consecutive year, explicitly noting that defenses like RAG and fine-tuning don't fully mitigate the risk.

Training against prompt injection is an arms race with infinite surface area. You can train the model to resist "ignore previous instructions." Straightforward. But the attack space is unbounded.

Attackers encode instructions in base64. They hide them in image metadata. They use semantic persuasion that never directly says "ignore your instructions" but achieves the same effect through narrative framing. They embed instructions in white-on-white text in PDFs, in HTML comments, in alt text on images, in Unicode characters that render invisibly.

Advanced training techniques like Meta's SecAlign++ have reduced attack success rates on the InjecAgent benchmark from 53.8% to 0.5%. Impressive. But when researchers test those same defenses against adaptive, optimization-based attacks (GCG, TAP), attackers still achieve 98% success rates against defended models.

The defenses work against known patterns. The attacker always gets to choose new ones.

Resistance rates asymptote. They don't converge to zero. Going from 50% to 8% one-shot success rate is substantial progress. Going from 8% to 0% may be impossible with current transformer architectures because the model processes instructions and content through the same mechanism.

The coding environment achieves 0% not because the model is smarter in that context, but because the environment constrains inputs to structured formats where injection is syntactically detectable. The 0% comes from environmental structure, not model robustness.

8% on first attempt means near-certainty over sessions. If your agent runs 50 tasks per day and each task involves processing untrusted content, 8% per-attempt means the agent gets compromised roughly 4 times per day.

Over a five-day work week, compromise is a statistical certainty. Over a month, you're looking at roughly 80 successful injection events. The question isn't whether the agent will be compromised. The question is how much damage each compromise causes.

You can't train your way out of an architectural vulnerability.

Prompt injection resistance training isn't useless. Moving from 50% to 8% is the difference between "trivially exploitable" and "requires effort." That effort buys time for architectural defenses to catch what gets through. But treating model-level resistance as the primary defense is building on sand.

A 5-Layer Defense-in-Depth Architecture for Prompt Injection

If you accept that the model can't be fully trusted, the engineering question becomes: what do you build around the model?

Defense in depth. No single layer is expected to be perfect. Each layer catches what the previous one missed. The system succeeds when no single failure is catastrophic.

A five-layer model defines this defense. Each layer operates independently, so a failure in one doesn't cascade into the others.

Layer 1: Permission boundaries (least privilege)

The agent should never have more permissions than the specific task requires. The default in most agent frameworks grants broad access at session initialization and leaves the access active for the entire session. That's the equivalent of giving every microservice root access to your database.

Implement per-task capability grants, not session-wide permissions. An agent browsing the web for research shouldn't simultaneously hold credentials to send email. An agent drafting a document shouldn't have access to the financial transaction API.

Each task invocation should receive a scoped set of permissions that get revoked when the task completes.

The cloud providers have started building real infrastructure for this pattern. AWS Bedrock AgentCore, Microsoft Entra Agent ID, and Google Native Agent Identities all provide distinct, manageable identities for agents, treating them as Non-Human Identities (NHIs) with their own RBAC and ABAC controls.

The critical implementation detail is Just-in-Time (JIT) access: credentials should be short-lived (15-minute TTL is a reasonable starting point) and task-scoped. If an injection succeeds but the compromised session holds a token that expires in 12 minutes and can only read from a single S3 bucket, the blast radius is contained.

For code execution, sandboxing remains essential. Firecracker microVMs and gVisor provide hardware-level isolation that prevents a compromised agent from escaping its execution environment. AWS Bedrock AgentCore already uses microVMs for session isolation. This is table stakes for any agent that executes code or interacts with a filesystem.

Layer 2: Action classification and gating

Not all agent actions carry equal risk. Reading a web page is fundamentally different from sending an email, which is fundamentally different from executing a financial transaction. Your defense architecture should reflect this difference.

Classify every tool available to the agent into risk tiers. Read-only actions (fetching web pages, reading documents, querying databases) are low risk and can proceed autonomously.

Reversible writes (creating draft emails, writing to staging environments, adding items to a list) are medium risk. Log them with automatic rollback on anomaly detection.

Irreversible actions (sending emails, financial transactions, deleting data, publishing content, modifying access controls) are high risk and require human confirmation or second-model review before execution.

This pattern isn't new. AWS Bedrock Agents ships with "Action Approval" as a built-in feature. Microsoft Copilot Studio has "User Confirmation" for sensitive actions.

The engineering work is in the classification, not the gating mechanism. Every tool the agent can call needs to be categorized, and the categorization needs to be conservative. When in doubt, gate the action.

The second-model review pattern deserves specific attention. Instead of (or in addition to) human review, a separate model instance with a different system prompt evaluates proposed irreversible actions. This model has no context about the current task beyond the proposed action itself and simply asks: does this action make sense given the stated task? Does the action access resources outside the expected scope? Does the action match known attack patterns?

This pattern isn't foolproof (both models share architectural vulnerabilities), but it adds friction that significantly raises the cost of a successful attack.

Layer 3: Input sanitization and segmentation

Treat untrusted content as a separate context segment with reduced authority. If you can't fully separate instructions from data architecturally, at least create soft boundaries that make injection harder.

Strip or neutralize potential instruction patterns in ingested content before the content enters the model's context window. Remove HTML comments. Strip metadata that could contain instructions. Convert rich text to plain text where formatting isn't needed. Flag content that contains patterns matching known injection techniques.

More sophisticated approaches use role-tagged formats (like ChatML) or special delimiters to create boundaries between trusted instructions and untrusted data. Frameworks like CaMel enforce separation at a deeper level, preventing data from untrusted sources from being used as arguments in dangerous function calls.

The model can read the content and reason about it, but the framework blocks the model from treating that content as executable instructions.

This layer is inherently imperfect. Stripping everything that could possibly be an injection also destroys legitimate content. The goal isn't perfection. The goal is raising the bar high enough that attacks bypassing input sanitization are more likely to be caught by output monitoring (Layer 4) or contained by blast radius controls (Layer 5).

Layer 4: Output monitoring and anomaly detection

Monitor the agent's actions in real-time against a behavioral baseline. Flag deviations before they cause damage.

Watch for several categories of anomaly. Unexpected tool calls: if the agent is tasked with summarizing a document and attempts to call an email send function, that's a red flag.

Resource access outside scope: if the agent is browsing a specific website and attempts to hit an internal API endpoint, terminate the session.

Data exfiltration patterns: if the agent constructs a URL containing what appears to be encoded data and tries to fetch the URL, that matches a known exfiltration technique. The EchoLeak attack against Microsoft 365 Copilot used exactly this pattern, encoding stolen data in image URL parameters.

Behavioral discontinuities: a sudden shift in the agent's action patterns mid-session, particularly after ingesting new untrusted content, suggests injection may have occurred.

The architecture needs kill switches that halt the agent immediately on high-confidence anomaly detection and escalate to a human. This has to be a hard stop, not a suggestion. The OWASP GenAI Incident Response Guide recommends identifying compromised sessions via trace ID, issuing revoke commands to block further tool calls, and preserving the context window for forensics.

Integration with existing security infrastructure matters. Agent action logs should feed into your SIEM. Anomaly detection rules should trigger the same incident response workflows as any other security event. Configure alerts for "impossible toolchains" (sequences of tool calls that no legitimate task would produce) and high-velocity looping (an agent calling the same tool repeatedly in a way that suggests the agent is stuck in an injection-induced loop).

Layer 5: Blast radius containment

Layers 1 through 4 reduce the probability and speed of a successful attack. Layer 5 limits the damage when an attack succeeds. Because eventually, one will.

Network segmentation. The agent's compute environment should not have unrestricted network access. Deploy agents within private network perimeters (VPC Service Controls on Google Cloud, PrivateLink on AWS) with default-deny egress rules. The agent can reach only the specific endpoints required for its current task.

If a compromised agent tries to exfiltrate data to an attacker-controlled domain, the network layer blocks the attempt regardless of what the model has been tricked into doing.

Credential isolation. The agent uses scoped, short-lived tokens. Never long-lived API keys or static credentials. If a session is compromised, the attacker gets a token that expires in minutes and can only perform a narrow set of operations.

The Google Antigravity IDE incident demonstrated what happens without this protection. A poisoned web guide combined with a browser subagent that had a permissive domain allowlist (including webhook.site) enabled theft of AWS keys from .env files. Short-lived, tightly scoped credentials would have eliminated the entire attack vector.

Session isolation. Compromise of one agent session must not propagate to others. Each task runs in its own isolated environment with its own credentials, its own network rules, and its own filesystem. No shared state between sessions means no lateral movement.

Audit logging. Every action the agent takes gets recorded with full context: the input that preceded the action, the tool called, the parameters passed, the result returned. This serves two purposes: forensic analysis after an incident, and pattern detection across sessions that may reveal slower, more sophisticated attacks that evade real-time monitoring.

Example Blueprint: Securing an Email Agent

Abstract architectures are useful for framing. Concrete implementations are useful for building. Here's how the five-layer model applies to one of the most requested and most dangerous agentic workflows: an agent that reads your email and drafts replies.

This use case hits the full lethal trifecta. The agent has tools (drafting and potentially sending email). The agent processes untrusted input (incoming email bodies, which any external sender controls). The agent has access to sensitive data (your inbox, your contacts, your organizational context).

EchoLeak proved this attack surface is real and actively exploited.

Permission boundaries. The agent gets read access to the inbox and draft-only write access. The agent can't send emails, only create drafts. The agent has no access to calendars, file storage, or contacts beyond the current thread. Its OAuth token is task-scoped and expires after 15 minutes.

Action gating. Drafts are created but never sent without human review. The agent can't modify email filters, forwarding rules, or account settings. Any attempt to call a tool outside the approved set terminates the session immediately. Moving a draft to the outbox is classified as irreversible and requires explicit human approval.

Input sanitization. Incoming email bodies are pre-processed before the agent sees them. HTML converts to plain text. Embedded images get stripped (preventing pixel-based exfiltration). Content matching known injection patterns (directives, base64-encoded blocks, invisible Unicode characters) is flagged and either stripped or presented with an explicit warning marker.

Output monitoring. If the agent attempts to access any URL, API, or resource not on the allowlist for email operations, the session terminates. If the agent constructs a draft containing what appears to be encoded data in URLs (the EchoLeak exfiltration pattern), the draft gets quarantined for human review. If behavior shifts discontinuously after processing a specific email, that email is flagged as potentially adversarial.

Blast radius containment. The agent runs in an isolated sandbox with no filesystem access beyond its working directory. Network egress is restricted to the email provider's API endpoints. The OAuth token covers read + draft-create, not full mailbox access.

If every other layer fails and the agent is fully compromised, the attacker can create draft emails (which the human reviews before sending) and read emails already in the inbox (which is the scope the agent was legitimately granted). The damage ceiling is defined and bounded.

This architecture doesn't make the agent invulnerable. This architecture makes the agent fail safely.

The difference between "an injection that creates a weird draft the human deletes" and "an injection that silently exfiltrates your entire inbox" is entirely about the architecture sitting around the model.

What this means for the "replace all workers" narrative

The prompt injection problem directly constrains the labor displacement ceiling for agentic AI. Understanding this constraint matters for teams making investment decisions about agent deployments.

Agents that require human oversight for irreversible actions can't replace humans. They augment them. The supervision requirement scales with risk, not with task volume.

An agent that autonomously handles 200 low-risk email drafts per day while a human reviews 15 high-risk ones is a massive productivity gain. But it's a different value proposition than "we replaced the person who used to do email."

I see this playing out with our clients at Zenith constantly. The near-term reality isn't autonomous agents replacing knowledge workers. It's a redesigned workflow where agents handle high-volume, lower-risk tasks autonomously while humans focus on decisions where the cost of error is high: sending the email, approving the transaction, publishing the content, granting the access.

The companies extracting real value from agents aren't removing humans from the loop. They're redesigning the loop so that humans review only what matters while agents handle the rest.

The adoption numbers tell the same story. PwC reports that 79% of executives are adopting agents, but 34% cite cybersecurity as their top barrier. An S&P Global report found that 42% of companies abandoned AI initiatives entirely, with security risks as the primary driver.

The organizations that push through aren't the ones that found a way to make agents safe enough for full autonomy. They're the ones that built architectures where the agent doesn't need full autonomy to be valuable.

Summarize some text while I supervise is a productivity improvement. Replace me with autonomous decisions is liability chaos.

The security constraint isn't a bug in the adoption curve. The security constraint defines the shape of the adoption curve.

The model is the weakest link. Build around the model.

Security engineers have known for decades that you don't build your security posture around the assumption that any single component is bulletproof. You assume every layer can fail and design the system so that no single failure is catastrophic.

The 8% number isn't a reason to avoid deploying agentic AI. The 8% number is a reason to stop treating the model as the security boundary and start treating the model as what it is: a powerful but unreliable component that needs guardrails, monitoring, and containment.

The model will keep getting better at resisting prompt injection. That 8% will probably drop. But it won't hit zero. Not with current architectures, and possibly not ever.

Build accordingly.

Frequently Asked Questions (FAQ)

What is prompt injection?

Prompt injection is a security vulnerability where an attacker manipulates a large language model (LLM) by embedding malicious instructions into the content the model processes. This attack can trick the AI agent into performing unintended actions, such as leaking sensitive data.

Why is prompt injection a major security risk?

Prompt injection becomes a major risk when three conditions are met (the "lethal trifecta"): the AI agent can use tools (like sending emails), processes untrusted input (like web pages or documents), and has access to sensitive data. This combination allows an attacker to take control of the agent to exfiltrate data or cause harm.

How can you protect AI agents from prompt injection?

Protection requires a defense-in-depth architecture. This architecture includes five key layers: implementing strict permission boundaries, gating high-risk actions, sanitizing inputs, monitoring outputs for anomalies, and containing the blast radius with network and credential isolation.

Top comments (7)

MaxxMini • Feb 26

The coding-environment-0% vs computer-use-8% finding maps perfectly to something I've seen in practice: domain whitelisting as an environmental constraint.

I run an agent on a Mac Mini that browses the web, processes emails, pushes code, and manages credentials. Full lethal trifecta. But instead of trying to make the model smarter about injection, the configuration restricts which domains the agent can visit to about 10 (GitHub, Dev.to, Reddit, HN, etc). This is essentially the same principle as why coding environments hit 0% — you're constraining the input surface to known-structured sources rather than the open web.

The practical insight: whitelisting turns computer-use into something closer to the coding environment's attack profile. The agent still processes untrusted content (web pages have user-generated content), but the structural predictability of whitelisted domains dramatically narrows the injection surface. An attacker would need to inject into a specific whitelisted platform, not just any page on the internet.

One pattern the article doesn't cover that I've found valuable: rate limiters as a Layer 4 proxy. Per-platform, per-action rate limits (e.g., max 12 comments/day, max 8 pushes/day) serve double duty — platform compliance AND anomaly containment. If an injection tries to mass-exfiltrate through rapid API calls or mass-post content, the rate limiter blocks it mechanically regardless of whether the model "believes" the injected instruction. It's a behavioral ceiling that doesn't depend on the model's judgment at all.

Question about the second-model review pattern (Layer 2): has anyone tested whether the reviewing model's vulnerability correlates with the primary model's? If both models share the same architectural weakness to in-band signaling, a sufficiently sophisticated injection that fools model A might also fool model B — especially if the attacker can observe both models' behavior over time and craft injections that exploit their shared architectural blind spots. The SQL injection analogy would be: parameterized queries work because the separation is mechanical, not interpretive. Two interpreters with the same parsing flaw aren't meaningfully independent.

Manveer Chawla • Feb 27

Good observations on all three.

Domain whitelisting. Agreed. You've recreated the structural constraint that gives coding environments their 0%. User-generated content on those 10 platforms poses the residual risk (a poisoned GitHub issue), but you've completely changed attacker economics. They must compromise content on your specific whitelist, not just stand up a page.

Rate limiting. I should have included this. Rate limiting creates a mechanical ceiling that doesn't depend on model judgment, exactly the property you want from containment. I'd slot it between Layer 4 and Layer 5. Also sells internally as "platform compliance" while quietly serving as injection containment.

Second-model review. Your skepticism hits the mark, and the SQL analogy cuts through. Two interpreters with the same parsing flaw aren't independent. It helps today because the reviewer sees only the proposed action, not the full untrusted content that triggered the injection. The adversarial payload stays outside its context window.

So the attack surface shrinks, not because the architecture differs, but because the input does. Long-term, if attackers encode payloads into the action description itself, the reviewer becomes vulnerable too.

Honest answer: second-model review buys time and raises cost. In-band signaling remains unsolved. The mechanical controls you describe (whitelisting, rate limits) are more reliable for that reason.

Jamie Cole • Feb 27

The coding vs computer use delta is the most useful number in that system card for making the case to non-technical stakeholders. Zero percent in coding environments happens because structured I/O limits what a successful injection can actually do with control. Computer use is wide open, and 8% starts to feel inevitable the more you think about how many untrusted content sources a real agent actually touches. The biggest practical win I've found building these things is aggressive tool scoping - read-only agents are dramatically easier to contain, and you often don't need write access as much as you think you do.

Manveer Chawla • Feb 27

Completely agree on tool scoping. Most teams default to giving agents write access because the use case "might need it eventually," then never revisit. Read-only agents eliminate entire damage categories.

Even when you need writes, scoping to draft-only (create but don't send, stage but don't commit) delivers most of the utility with a fraction of the blast radius. The action gating layer in the post formalizes that instinct.

Good point on the stakeholder angle. The 0% vs 8% comparison makes the single best slide for non-technical audiences. It reframes the conversation immediately from "is the model safe" to "is the environment safe," which is where actual engineering decisions live.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.