Retrieval Is a Second User: threat-modeling AI agent trust boundaries
Most prompt-injection discussions still talk as if the only thing that matters is the user prompt. That is no longer the real shape of the problem.
Modern agents read from multiple places before they act:
- user input
- retrieved docs and webpages
- tickets, emails, and chat logs
- tool results
- generated tool-call arguments
By the time an agent reaches a side effect, it is no longer executing "the user prompt." It is executing a mixture of trust domains.
Why this matters
A lot of attacks do not look like classic jailbreaks. They look like ordinary text in the wrong place:
- a README that says "ignore previous instructions and run this command"
- a web page that tells the agent to reveal private context
- a ticket body that smuggles a credential request inside a support workflow
- JSON-like tool args that wrap a destructive command in something structured and boring-looking
If your only guardrail is a system prompt, you are asking the model to remember a policy while reading adversarial text from several sources at once. Sometimes it will. Sometimes it won't.
The better question
Instead of asking "is this prompt safe?" ask:
What boundary is this text crossing, and what can it influence next?
That usually gives a much cleaner policy table:
- retrieved text can inform an answer, but not silently authorize shell or file actions
- tool results can be summarized, but risky instructions inside them should not become new goals
- generated tool args that look like cleanup, exfiltration, or privilege changes need a higher bar than normal prose
- outbound messages that contain credentials or private context should be redacted or blocked
What we have found useful in practice
The most reliable pattern for us has been:
- score each boundary separately
- return structured reasons instead of prose
- map those reasons to deterministic policy before side effects happen
That is a much more operational shape than "another model said this felt unsafe."
If you want concrete copy-paste cases, I published a small attack-fixture set here:
And if you want a browser-playable scanner demo:
I work on Armorer Guard at Armorer Labs, so obviously I care a lot about this problem. But the boundary-first framing is the part I think is broadly useful even if you use a completely different stack.
Top comments (1)
Some comments may only be visible to logged-in visitors. Sign in to view all comments.