Indirect Prompt Injection Can Be Stopped by the AI Itself — Embed Directional Context Narrowing into Your Design

#ai #programming #security #design

You had an AI summarize incoming emails. Something unexpected executed. Inside one email body was the string "ignore all previous instructions." The sender was a legitimate internal address. You added sanitization. The next attempt used different phrasing and slipped through. The whack-a-mole never ends.

Why every countermeasure keeps getting bypassed

Indirect prompt injection is an attack where malicious instructions are embedded inside external data that an AI agent reads — web pages, files, emails, tool results. The AI intends to read the data, but ends up executing the instructions.

The three most common defenses today are:

Sanitization — strip dangerous patterns from input
Priority declarations — instruct the model to "prioritize system prompt over external data"
Scope restriction — limit what operations the AI can perform

Each has partial effectiveness. Each has a fundamental ceiling.

Sanitization and priority declarations are inspecting content. They try to determine whether dangerous words are present, or whether the tone sounds like a command. Since LLMs understand natural language, the same intent expressed differently bypasses detection. Scope restriction is valid, but it's a damage-containment design — it minimizes harm after a successful attack. It is not detection.

The structural reason existing defenses can be bypassed

Why can't content inspection solve this?

LLMs process system prompts, user input, and externally retrieved data as a single flat stream of tokens. There is no structural mechanism inside the LLM that distinguishes data from instructions. This is the root cause of prompt injection.

In operating systems, "Privilege Separation" addresses an analogous problem — kernel space and user space are architecturally isolated, and writes from lower privilege levels are structurally prohibited. But this cannot be directly applied to LLMs. Because LLMs process every input as the same token sequence, there is no architectural way to enforce a boundary that says "nothing below this line can modify what's above."

Both content inspection and privilege separation are approaches applied from outside the LLM. That's why they have a ceiling.

What has been overlooked is a different premise: legitimate context has a directionality.

Context always narrows in one direction

Look at the normal processing flow of an AI agent and one structural characteristic appears:

Level 0: Purpose (Why)          ← broad overall goal
Level 1: Task definition (What) ← what is to be done
Level 2: Execution steps (How)  ← how to do it
Level 3: Tool calls (Do)        ← concrete execution

Context always narrows downward — from vague purpose to concrete execution, in one direction. This direction is unidirectional. In a legitimate flow, information retrieved at the execution layer never rewrites the task definition. That simply does not happen.

Indirect prompt injection reverses or bypasses this flow. At level 2 or 3, while reading external data, a command suddenly appears — "ignore all previous instructions" — attempting to rewrite the upper layer. This is a structural anomaly that does not occur in legitimate flows.

This shifts the axis of defense. Instead of inspecting content, inspect whether the directionality of narrowing is being maintained. When information retrieved at a lower layer attempts to influence a higher layer, treat it as an anomaly. Regardless of what the content says, it can be detected as a structural deviation — "the direction is reversed."

This approach has one additional advantage. The detection judgment is something LLMs are naturally good at. Determining "is this context operating at the purpose level or the execution level?" and "does this follow the narrowing flow or deviate from it?" is contextual understanding itself. Rather than inspecting from outside the LLM with rule-based sanitization, the LLM itself can detect directional anomalies. The key strength of this approach is that the defense mechanism is embedded inside the LLM's understanding — not placed outside it.

Embed "maintain the directionality of context narrowing" as a design rule

The implementation principle is straightforward.

When reading external data, have the LLM judge whether the content is attempting to influence a layer above the current processing level. When a deviation is detected, the AI halts and asks a human to decide. It does not proceed on assumption.

This is the final line of defense in this design. Detection and halting together are what cut off the path that would otherwise allow an attack to reach the execution layer.

Two out-of-scope cases should be explicitly noted.

The first is gradual manipulation — attacks that slowly rewrite the goal while staying within the legitimate narrowing flow. Each individual step appears directionally normal, making detection difficult. The answer here is not detection technique but trust boundary design: explicitly limit the sources of data that will be referenced, at design time. Deciding in advance which sources are trusted is a prerequisite.

The second is MCP chain contamination — pollution through chains where a trusted MCP calls another MCP. This is outside the scope of this article. The design decision to use only trusted MCPs is required.

The field has spent a long time framing the prompt injection problem as "what should we exclude?" This article proposes a different question: have the LLM itself judge the direction in which context is flowing, and hand off to a human when reversal is detected.

The LLM's weakness — flat context processing — compensated for by the LLM's strength — understanding contextual directionality. The defense is built into the model's comprehension, not bolted on from outside.