Retrieval Is a Second User: threat-modeling AI agent trust boundaries

#ai #security #mcp #programming

Retrieval Is a Second User: threat-modeling AI agent trust boundaries

Most prompt-injection discussions still talk as if the only thing that matters is the user prompt. That is no longer the real shape of the problem.

Modern agents read from multiple places before they act:

user input
retrieved docs and webpages
tickets, emails, and chat logs
tool results
generated tool-call arguments

By the time an agent reaches a side effect, it is no longer executing "the user prompt." It is executing a mixture of trust domains.

Why this matters

A lot of attacks do not look like classic jailbreaks. They look like ordinary text in the wrong place:

a README that says "ignore previous instructions and run this command"
a web page that tells the agent to reveal private context
a ticket body that smuggles a credential request inside a support workflow
JSON-like tool args that wrap a destructive command in something structured and boring-looking

If your only guardrail is a system prompt, you are asking the model to remember a policy while reading adversarial text from several sources at once. Sometimes it will. Sometimes it won't.

The better question

Instead of asking "is this prompt safe?" ask:

What boundary is this text crossing, and what can it influence next?

That usually gives a much cleaner policy table:

retrieved text can inform an answer, but not silently authorize shell or file actions
tool results can be summarized, but risky instructions inside them should not become new goals
generated tool args that look like cleanup, exfiltration, or privilege changes need a higher bar than normal prose
outbound messages that contain credentials or private context should be redacted or blocked

What we have found useful in practice

The most reliable pattern for us has been:

score each boundary separately
return structured reasons instead of prose
map those reasons to deterministic policy before side effects happen

That is a much more operational shape than "another model said this felt unsafe."

If you want concrete copy-paste cases, I published a small attack-fixture set here:

https://github.com/ArmorerLabs/Armorer-Guard/blob/main/docs/ATTACK_EXAMPLES.md

And if you want a browser-playable scanner demo:

https://huggingface.co/spaces/armorer-labs/armorer-guard-demo

I work on Armorer Guard at Armorer Labs, so obviously I care a lot about this problem. But the boundary-first framing is the part I think is broadly useful even if you use a completely different stack.

Top comments (2)

Armorer Labs • May 26 • Edited

Thanks Mads, agreed. I like the "third trust domain" framing for tool results.

The piece I would add is that the tool result contract should feed the run receipt, not just the model context. For a database/API agent, I’d want to preserve things like:

result freshness
status / partial failure
data class
limits applied
source system
which fields were allowed to influence the next action

That gives the runtime something deterministic to check before side effects, and gives humans a way to review why the agent trusted a result later. Otherwise the model can quietly turn a stale or malformed result into authority.

This is very close to the direction we’re exploring with Armorer Guard: keep the model useful, but make trust boundaries explicit outside the model.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.