You can't prevent prompt injection. So what do you actually do?

#security #ai #llm #machinelearning

There's a quiet assumption baked into a lot of agent security work: that with enough prompt engineering, the right system message, or the next model version, we'll get the model to stop following malicious instructions. It hasn't happened, and it's worth designing as if it won't. No current model reliably refuses adversarial input when that input is formatted as instructions. A single crafted prompt can strip the careful alignment you layered on top.

So the useful question isn't "how do I prevent injection?" It's "injection will sometimes succeed — what state is my agent in afterward, and what can it actually do from there?"

That reframe is the whole game for run-time security: the protections that run live on every execution, not the ones you reason about at design time. Here are the parts that have held up in practice.

The model is not a security boundary

If a single input can flip the model's behavior, then the model can't be the thing standing between an attacker and your systems. Treat it like a component that will occasionally do the wrong thing, and put the boundary somewhere it can't talk its way past.

Concretely, that means two things downstream of the model:

Capability-scoped credentials. The agent holds only the permissions the current task needs. A hijacked agent with read-only, narrowly-scoped tokens does a lot less damage than one holding your admin key.
A gate on destructive verbs. Deleting, sending, paying, granting access — these get an explicit check (a policy, a confirmation, a second factor) that doesn't depend on the model having behaved.

Containment limits the blast radius. Detection tells you it happened. Neither requires the model to be trustworthy, which is the point.

Separate the data channel from the instruction channel

Almost every injection bug reduces to one sentence: data got read as instructions. The fetched web page, the retrieved document, the tool output, the user upload — all of it is data, and somewhere it got concatenated into the context the model treats as commands.

So treat every external input as untrusted: user messages, fetched pages, tool outputs, retrieved documents, uploads. Indirect injection is the nasty case here — the payload rides in on content your agent went and fetched on its own, so "trusting the source" buys you nothing. Defend at the boundary where data enters, and don't splice untrusted text into the instruction context.

Data hygiene runs both ways

Here's the part that's easy to miss. You watch what comes in. But a tool's output is a leakage channel too.

Agent frameworks routinely pipe tool stdout — including debug logging — straight into the model's context window, and from there into your logs. An empirical study of 17,022 agent skills found credentials leaking exactly this way, with debug logging behind 73.5% of the cases. The secret was never meant for the model; it just happened to be on stdout, and the framework forwarded it.

The fix is unglamorous: redact secrets from tool output before it reaches context or logs. Same discipline as input, opposite direction.

Monitor behavior, separately from quality

A hijacked agent can produce clean, well-formatted, "high quality" output while doing something it shouldn't. Quality monitoring won't catch it, because nothing about the result looks wrong. You need a separate signal: does this sequence of actions look like normal behavior for this agent?

That means baselining the action sequences you expect and alerting on deviation. There's a gradient of effort:

Static rules — cheap, catch the obvious (an agent that never emails suddenly emailing).
Sequence-pattern baselines — learn the normal shape of an agent's actions, flag the ones that don't fit.
A second model as judge — independent review of the primary agent's behavior.

One detail that's easy to overlook: log context size at decision time. Context size shapes behavior, so a baseline that doesn't condition on it will drift and misfire. Record it alongside the action.

And memory makes it persistent

If your agent has memory, a one-shot injection can become a standing one — a poisoned "fact" gets written once and re-executes every session. Keep memory hygienic: scope it per instance or type, validate what gets written, and keep per-entry provenance so you can trace where a "fact" came from.

None of this prevents prompt injection. It assumes injection lands and asks what your system does next. The BRACE run-time guide walks through these as a checklist if you want the structured version.

So, honest question: if an agent of yours got hijacked mid-task right now, would you see it in the action stream — or are you flying blind on everything after the prompt? What does your behavioral baseline actually look like?