The lethal trifecta closes when one agent can read untrusted input, read private data, and send it out on a shared context. You can't patch that in the model, so gate it before the agent runs. trifecta_gate.py checks reachability on the manifest: the vulnerable fixture returns 2 paths (exit 1); the safe one, same capabilities, returns 0 (exit 0).
AI disclosure: I wrote trifecta_gate.py with an AI assistant and ran it myself, offline, before publishing. Every number in the output blocks below is pasted from a real local run on Python 3.13.5, stdlib only, on the synthetic manifests included in this post. I checked the exit codes (0 / 1 / 2), hashed the STDOUT twice to confirm it is byte-for-byte deterministic, and edited every line. The external figures (Tenet Security's Agentjacking numbers, Simon Willison's term) are theirs, not mine, and I link the primary sources. I label which numbers are theirs and which are mine.
In short:
- Prompt injection is not a bug you patch in the model. The model cannot reliably tell trusted instructions from untrusted data, so the fix is not "detect the injection." The fix is to stop the dangerous capability combination from being reachable in the first place.
- The lethal trifecta (a term coined by Simon Willison) is the combination: in one session the agent can read untrusted input, read private data, and send data outside. When all three reach each other, injected text can read your secrets and mail them out.
- trifecta_gate.py reads a static tool manifest and asks one question by graph reachability: is there a path where untrusted input reaches a private read, and that read reaches an egress sink?
- The key result: a safe manifest and a vulnerable one declare the same three capabilities. The vulnerable one returns 2 paths and exit 1. The safe one isolates egress off the shared context and returns 0 paths and exit 0. The gate decides on the data-flow graph, not on a checklist of flags.
- Stdlib only (json, sys, collections.deque). No network, no model, no subprocess. The run is byte-for-byte deterministic. The tool and all three manifests are in this post.
The incident that makes this concrete
In June 2026, Tenet Security disclosed an attack they call Agentjacking. The mechanics are almost insultingly simple. An attacker pushes a fake error event into a Sentry project using a public DSN. The Sentry MCP server feeds that error to an AI coding agent as a real bug to fix, with malicious instructions hidden in the message body as a fake ## Resolution section. The agent reads it, believes it, and runs attacker text against the developer's machine.
Tenet's figures: passive recon found 2,388 organizations with valid injectable DSNs, 71 of them in the Tranco top one million. In controlled testing against more than a hundred consenting organizations, 100+ agents acted on the injected errors with an 85% exploitation success rate, against Claude Code, Cursor, and Codex among others. What got pulled in their tests: AWS secret keys, GitHub OAuth tokens, SSH agent sockets, Kubernetes tokens, ~/.aws/config, ~/.npmrc. Those are Tenet's numbers, from their writeup, not mine (Tenet Security, Agentjacking).
The part I keep rereading is Sentry's response. They acknowledged it and declined to fix it at the root, calling it "technically not defensible" and noting that model vendors run middleware against it. Read that again. The platform holding the untrusted input is telling you, correctly, that they cannot fix this for you. The model vendor's middleware catches some of it. Neither owns the actual hole, which is the shape of your agent: untrusted in, private read, egress out, all on one bus.
That is the lethal trifecta, and Agentjacking is just one well-documented instance of it.
Why you can't patch this in the model
Simon Willison named the lethal trifecta on 16 June 2025: access to private data, exposure to untrusted content, and the ability to communicate externally. His point about why prompt injection resists a model-level fix is the foundation here. In his words, "LLMs are unable to reliably distinguish the importance of instructions based on where they came from," and "we still don't know how to 100% reliably prevent this from happening." His conclusion is to avoid the combination rather than trust a guardrail to catch every attack. Those positions are his, and I am building on them.
I will say the uncomfortable version plainly. A guardrail that catches 95% of injection attempts sounds great until you remember that an attacker retries. In security, 95% is a failure rate, not a pass rate. If the only thing standing between untrusted text and your AWS keys is a classifier that is wrong one time in twenty, you do not have a control. You have a coin that an attacker gets to flip until it lands their way.
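The retry arithmetic can be made concrete with a small sketch. It assumes every attempt is caught independently with the same probability, which is a simplifying assumption and not a measurement; the function name is hypothetical.

```python
def p_at_least_one_bypass(catch_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` injections slips past the filter.

    Assumption: attempts are independent and the catch rate is constant.
    """
    return 1.0 - catch_rate ** attempts


if __name__ == "__main__":
    for n in (1, 5, 14, 20, 50):
        chance = p_at_least_one_bypass(0.95, n)
        print(f"{n:>3} attempts -> {chance:.1%} chance of at least one bypass")
```

Under that assumption, a 95% filter gives the attacker better-than-even odds by the 14th attempt. Real filters are not independent coin flips, so treat this as an illustration of the direction, not a prediction.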
So the lever moves. Not "detect the injection better." That is tracking, and tracking is the thing I keep arguing against in this series. The lever is to gate the capability composition before the agent runs, so that even a successful injection has nowhere to send what it steals.
The claim, sharp enough to argue with
Here is the falsifiable version: the danger is not the presence of three capabilities, it is their reachability across one shared context, and you can decide that statically from the manifest before the agent starts.
If that claim is wrong, two things would have to be true. First, isolating one leg (taking egress off the shared bus) would not reduce the risk. Second, "all three capabilities present" would always mean "exploitable," regardless of how data flows between the tools. The safe fixture below is a single counter-example to both: three capabilities present, zero reachable paths, and a concrete reason why.
The mechanism worth internalizing is the shared context bus. In a default agent, every tool's output lands in the same LLM context, and that context steers the next tool call. So untrusted text read by one tool can influence any later tool. That is why the three capabilities are not three independent checkboxes. They are nodes on a graph, wired together by the context they share. Willison's own mitigation is to remove a leg from the shared context, for instance by running egress in a separate sandboxed sub-agent that never sees the tainted context. The gate is just a way to check, mechanically, whether you actually did that.
What the lethal trifecta gate checks, and how
The input is one JSON file: an agent manifest. Each tool has an id, a list of capabilities drawn from exactly three classes (ingests_untrusted, reads_private, can_egress), and an optional isolated flag. The manifest also declares a data_flow mode and may list explicit flows, each a from/to pair of tool ids.
In shared_context mode (the realistic default), every non-isolated tool both feeds and is steered by a virtual <shared-context> node. An isolated tool is taken off that bus, modeling a sandboxed sub-agent that receives a structurally fixed input and never sees the shared context. In explicit mode, only the data-flow edges you declare carry taint, which is the honest mode if you actually wire tools point to point.
Then it builds a directed graph and runs breadth-first reachability, visiting neighbours in sorted order so the path it reports is deterministic. The trifecta is reachable if there exist an untrusted tool u, a private tool p, and an egress tool e such that p is reachable from u and e is reachable from p. Those can be the same tool: one tool with all three capabilities is a trifecta by itself. Here is the whole thing.
#!/usr/bin/env python3
"""trifecta_gate.py - a static PRE-RUN gate for the "lethal trifecta".

You cannot patch prompt injection inside the model. The model cannot reliably
tell trusted instructions from untrusted data, so an attacker's text that lands
in the context is treated like a command. The lever is not "detect the injection
better" (that is still tracking). The lever is to gate the *capability
composition* of an agent's tool manifest BEFORE the agent ever runs.

The lethal trifecta (a term coined by Simon Willison) names the dangerous
combination: in one session an agent can (1) read UNTRUSTED input, (2) read
PRIVATE data, and (3) send data to the OUTSIDE (egress). When all three can
reach each other through one shared context, injected text can read your
secrets and mail them out. This script intercepts NOTHING at runtime. It reads
a static tool manifest (JSON) where each tool is tagged with capabilities and
(optionally) data-flow edges, then answers ONE question by graph reachability:

    Is there a path, inside a single agent/session, where UNTRUSTED input can
    reach a PRIVATE read AND that data can then reach an EGRESS sink?

The point the fixtures prove: "all three capabilities are present" is NOT the
same as "the trifecta is reachable". A safe manifest and a vulnerable one can
declare the SAME three capabilities; what differs is whether the egress tool
sits on the shared context bus. The gate decides on the GRAPH, not a checklist.

    exit 0 = trifecta NOT reachable (safe to start)
    exit 1 = trifecta reachable (print the data-flow path(s) that close it)
    exit 2 = bad input

Offline / keyless / read-only / zero-network. Stdlib only (json, sys, deque).
It reads a manifest file and prints a verdict. No network, no child process, no
model load, no install, and it never launches the agent itself.
"""
import json
import sys
from collections import deque

CAPS = {"ingests_untrusted", "reads_private", "can_egress"}
CTX = "<shared-context>"  # virtual node: the LLM context bus all tools share


def die(msg):
    print("trifecta-gate: ERROR: " + msg)
    sys.exit(2)


def load_manifest(path):
    try:
        with open(path, "r") as fh:
            raw = fh.read()
    except OSError:
        die("cannot read manifest file: " + path)
    try:
        data = json.loads(raw)
    except ValueError:
        die("manifest is not valid JSON")
    if not isinstance(data, dict):
        die("manifest must be a JSON object")
    if not isinstance(data.get("tools"), list) or not data["tools"]:
        die("manifest.tools must be a non-empty list")
    mode = data.get("data_flow", "shared_context")
    if mode not in ("shared_context", "explicit"):
        die("data_flow must be 'shared_context' or 'explicit'")
    tools = {}
    for t in data["tools"]:
        if not isinstance(t, dict) or "id" not in t:
            die("each tool must be an object with an 'id'")
        tid = t["id"]
        if not isinstance(tid, str) or not tid:
            die("tool id must be a non-empty string")
        if tid in tools:
            die("duplicate tool id: " + tid)
        caps = t.get("capabilities", [])
        if not isinstance(caps, list):
            die("capabilities of " + tid + " must be a list")
        for c in caps:
            if c not in CAPS:
                die("unknown capability '" + str(c) + "' on tool " + tid)
        tools[tid] = {"caps": set(caps), "isolated": bool(t.get("isolated", False))}
    return data.get("agent", "(unnamed-agent)"), mode, tools, data.get("flows", [])


def build_graph(mode, tools, flows):
    """Directed adjacency. In shared_context mode every non-isolated tool both
    feeds and is steered by the shared LLM context bus; isolated tools are off
    the bus (sandboxed). In explicit mode only declared flows carry taint."""
    adj = {}

    def edge(a, b):
        adj.setdefault(a, set()).add(b)

    if mode == "shared_context":
        for tid, t in tools.items():
            if not t["isolated"]:
                edge(tid, CTX)  # tool output enters the shared context
                edge(CTX, tid)  # context steers this tool's next input
    for f in flows:  # explicit flows are honored in BOTH modes
        if not isinstance(f, dict) or "from" not in f or "to" not in f:
            die("each flow must be an object with 'from' and 'to'")
        if f["from"] not in tools or f["to"] not in tools:
            die("flow references unknown tool: " + json.dumps(f, sort_keys=True))
        edge(f["from"], f["to"])
    # deterministic neighbour order
    return {k: sorted(v) for k, v in adj.items()}


def bfs(adj, src):
    """Shortest path tree from src. Returns (reachable_set, predecessor_map).
    Neighbours visited in sorted order -> path is deterministic."""
    seen = {src}
    pred = {}
    q = deque([src])
    while q:
        cur = q.popleft()
        for nxt in adj.get(cur, ()):  # already sorted
            if nxt not in seen:
                seen.add(nxt)
                pred[nxt] = cur
                q.append(nxt)
    return seen, pred


def path_of(pred, src, dst):
    out = [dst]
    while out[-1] != src:
        out.append(pred[out[-1]])
    out.reverse()
    return out


def find_trifecta(adj, tools):
    U = sorted(t for t, v in tools.items() if "ingests_untrusted" in v["caps"])
    P = sorted(t for t, v in tools.items() if "reads_private" in v["caps"])
    E = sorted(t for t, v in tools.items() if "can_egress" in v["caps"])
    found = []
    for u in U:
        reach_u, pred_u = bfs(adj, u)
        for p in P:
            if p not in reach_u:
                continue
            reach_p, pred_p = bfs(adj, p)
            for e in E:
                if e not in reach_p:
                    continue
                full = path_of(pred_u, u, p) + path_of(pred_p, p, e)[1:]
                found.append((u, p, e, full))
    found.sort(key=lambda x: (x[0], x[1], x[2]))
    return U, P, E, found


def main(argv):
    if len(argv) != 2:
        print("usage: trifecta_gate.py <agent_manifest.json>")
        sys.exit(2)
    agent, mode, tools, flows = load_manifest(argv[1])
    adj = build_graph(mode, tools, flows)
    U, P, E, found = find_trifecta(adj, tools)
    isolated = sorted(t for t, v in tools.items() if v["isolated"])
    print("trifecta-gate: agent=%s mode=%s tools=%d" % (agent, mode, len(tools)))
    print("capabilities: ingests_untrusted=%d reads_private=%d can_egress=%d"
          % (len(U), len(P), len(E)))
    print("isolated (off shared context): " + (", ".join(isolated) if isolated else "(none)"))
    print("")
    if found:
        print("VERDICT: LETHAL TRIFECTA REACHABLE - %d path(s)" % len(found))
        for i, (u, p, e, full) in enumerate(found, 1):
            print(" [%d] untrusted=%s -> private=%s -> egress=%s" % (i, u, p, e))
            print(" flow: " + " -> ".join(full))
        print("")
        print("ACTION: do NOT start agent. Break one leg (isolate the egress or the")
        print(" private read into a separate context/sub-agent) before launch.")
        sys.exit(1)
    print("VERDICT: trifecta NOT reachable (0 paths) - safe to start")
    if U and P and E:
        print("note: all three capability classes are present but no untrusted->private")
        print(" ->egress data-flow path exists (isolation/edges break the chain).")
    else:
        missing = sorted(CAPS - set(
            (["ingests_untrusted"] if U else []) +
            (["reads_private"] if P else []) +
            (["can_egress"] if E else [])))
        print("note: trifecta cannot close - missing capability class(es): " + ", ".join(missing))
    sys.exit(0)


if __name__ == "__main__":
    main(sys.argv)
The vulnerable manifest: a normal inbox agent
The first manifest is a LangGraph-style inbox assistant, the kind of thing people ship in a weekend. Four tools. read_email pulls message bodies, which an attacker controls, so it ingests untrusted input. search_inbox and read_contacts touch private data. send_email can mail anyone, so it can egress. Nothing is isolated, and everything shares one context. Run it:
$ python3 trifecta_gate.py fixtures/vulnerable_manifest.json
trifecta-gate: agent=langgraph-inbox-assistant mode=shared_context tools=4
capabilities: ingests_untrusted=1 reads_private=2 can_egress=1
isolated (off shared context): (none)
VERDICT: LETHAL TRIFECTA REACHABLE - 2 path(s)
[1] untrusted=read_email -> private=read_contacts -> egress=send_email
flow: read_email -> <shared-context> -> read_contacts -> <shared-context> -> send_email
[2] untrusted=read_email -> private=search_inbox -> egress=send_email
flow: read_email -> <shared-context> -> search_inbox -> <shared-context> -> send_email
ACTION: do NOT start agent. Break one leg (isolate the egress or the
private read into a separate context/sub-agent) before launch.
Exit 1. Two paths, both running through the shared context node. To be precise about what "2 paths" means: that is two distinct trifecta closures in this manifest, one through read_contacts and one through search_inbox. It is not a count of attacks in the wild, and it is not a measurement of anything beyond this file. The gate is telling you the shape is exploitable and pointing at exactly where. An attacker who plants instructions in an email body can, in principle, get the agent to read your contacts and forward them. The defense the gate suggests is structural: do not start the agent in this shape.
The safe manifest: same three capabilities, one is moved off the bus
Now the manifest that makes the point. It declares the same four tools and the same three capability classes. The only change: send_email is marked isolated, modeling an egress step that runs as a sandboxed sub-agent receiving a structurally fixed recipient and a templated body, never the shared context.
$ python3 trifecta_gate.py fixtures/safe_manifest.json
trifecta-gate: agent=langgraph-inbox-assistant-isolated-send mode=shared_context tools=4
capabilities: ingests_untrusted=1 reads_private=2 can_egress=1
isolated (off shared context): send_email
VERDICT: trifecta NOT reachable (0 paths) - safe to start
note: all three capability classes are present but no untrusted->private
->egress data-flow path exists (isolation/edges break the chain).
Exit 0. Zero paths. Look at the capabilities line: it is identical to the vulnerable run, ingests_untrusted=1 reads_private=2 can_egress=1. If this gate were a checklist, both manifests would score three out of three and both would fail. They do not. The vulnerable one fails, the safe one passes, and the only difference is whether egress sits on the shared bus. That is the entire argument for doing this by reachability instead of by counting flags. A checklist cannot tell these two apart. The graph can.
This is also why a crypto agent is the expensive version of the same picture. Swap the labels: read_pool_data ingests untrusted on-chain or web input, read_wallet reads a private key or balance, sign_and_send_tx egresses, except here egress is money leaving the wallet. Same graph, same closing path, except a successful injection moves funds instead of contacts. I did not ship a crypto manifest in this post, but the structure is identical, and so is the fix: take the signing step off the shared context.
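For readers who want to see what the two inbox manifests might look like on disk, here is a minimal sketch. It is a reconstruction, not the actual fixture files: the field names (agent, data_flow, tools, id, capabilities, isolated) follow what the loader accepts, the tool ids and agent names come from the output above, and the helper names isolate and capability_counts are hypothetical.

```python
import json

# Reconstructed from the loader's schema and the printed output;
# the real fixture files may differ in detail.
vulnerable = {
    "agent": "langgraph-inbox-assistant",
    "data_flow": "shared_context",
    "tools": [
        {"id": "read_email", "capabilities": ["ingests_untrusted"]},
        {"id": "search_inbox", "capabilities": ["reads_private"]},
        {"id": "read_contacts", "capabilities": ["reads_private"]},
        {"id": "send_email", "capabilities": ["can_egress"]},
    ],
}


def isolate(manifest, tool_id, new_agent):
    """Return a deep copy of the manifest with one tool taken off the shared bus."""
    copy = json.loads(json.dumps(manifest))  # deep copy, valid for plain JSON data
    copy["agent"] = new_agent
    for tool in copy["tools"]:
        if tool["id"] == tool_id:
            tool["isolated"] = True
    return copy


def capability_counts(manifest):
    """Tally tools per capability class, the way a flag checklist would."""
    counts = {"ingests_untrusted": 0, "reads_private": 0, "can_egress": 0}
    for tool in manifest["tools"]:
        for cap in tool.get("capabilities", []):
            counts[cap] += 1
    return counts


safe = isolate(vulnerable, "send_email", "langgraph-inbox-assistant-isolated-send")

if __name__ == "__main__":
    # The checklist view is identical for both manifests.
    print(capability_counts(vulnerable) == capability_counts(safe))
    print(json.dumps(safe, indent=2))
```

The only structural difference between the two dictionaries is the isolated flag on send_email, which is exactly the difference a flag count cannot see.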
The exit code is the gate
The point of returning 0, 1, or 2 is that CI can read it without reading prose. Wire it before you launch the agent or merge a manifest change:
if python3 trifecta_gate.py agent_manifest.json; then
  echo "trifecta not reachable - safe to start the agent"
else
  echo "trifecta reachable or bad manifest - hold the launch"
fi
Exit 0 lets the launch proceed. Exit 1 holds it and prints the paths to break. Exit 2 means the manifest was malformed, and a malformed manifest must never read as "safe." Feed it the broken fixture, where tools is a string instead of a list, and it exits 2 with trifecta-gate: ERROR: manifest.tools must be a non-empty list. Run it with no arguments and it prints usage and exits 2. A gate that cannot tell "all clear" from "I could not check" is not a gate.
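If the pipeline is driven from Python rather than shell, the same contract might be sketched like this. The function names launch_decision and run_gate are hypothetical, and the wrapper is a separate script: the gate itself stays subprocess-free. The one design choice worth copying is that any exit code other than 0 holds the launch, including codes the gate never documents (a crash, a kill signal).

```python
import subprocess
import sys


def launch_decision(exit_code):
    """Map the gate's exit code to (proceed, reason). Fails closed."""
    if exit_code == 0:
        return True, "trifecta not reachable"
    if exit_code == 1:
        return False, "trifecta reachable: break a leg before launch"
    if exit_code == 2:
        return False, "manifest could not be checked"
    # Anything else (traceback, signal) is treated as "could not check".
    return False, "unexpected exit code %d" % exit_code


def run_gate(manifest_path, gate_path="trifecta_gate.py"):
    """Run the gate as a child process and return its exit code.

    Assumes trifecta_gate.py is in the working directory.
    """
    result = subprocess.run([sys.executable, gate_path, manifest_path])
    return result.returncode


# Example wiring, not executed here:
#   proceed, reason = launch_decision(run_gate("agent_manifest.json"))
```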
Deterministic, so it can live in CI
The output carries no timestamps, no map iteration order, no floating point. Neighbours are sorted before traversal, so the path it prints is stable. Hash the STDOUT of each fixture twice and it is identical both times: the vulnerable run is 7718cefc5ffce46aee99111e595262803698d2fab3a741a7eb3137e95a8b11aa, the safe run is a561c84a8f2022f1742fa888268255d00d15221c801065f8ecd206f8e2eeeadc, the bad run is 9ed47c5847aaf49f8d2f9733a1c4565959b775bfce945dfcd65a369a2e624557. That matters because a check you cannot reproduce is a second opinion, not a gate. This one you can pin in a test and diff on every manifest change.
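A determinism check of this kind can be sketched in a few lines. Two assumptions: the digests above are 64 hex characters, which is consistent with SHA-256, but the hash function is not named in the text; and report below is a hypothetical stand-in for the gate's printing, not the gate itself. For the real tool you would hash the captured STDOUT of two separate runs.

```python
import contextlib
import hashlib
import io


def stdout_sha256(fn, *args):
    """Capture everything fn prints and return the SHA-256 hex digest of it."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        fn(*args)
    return hashlib.sha256(buf.getvalue().encode("utf-8")).hexdigest()


def report(paths):
    """Stand-in printer: sorts before printing so set order cannot leak out."""
    for i, path in enumerate(sorted(paths), 1):
        print("[%d] %s" % (i, " -> ".join(path)))


# A set has no guaranteed iteration order; sorting is what makes output stable.
paths = {
    ("read_email", "search_inbox", "send_email"),
    ("read_email", "read_contacts", "send_email"),
}

first = stdout_sha256(report, paths)
second = stdout_sha256(report, paths)

if __name__ == "__main__":
    print("deterministic" if first == second else "NOT deterministic")
```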
What this is NOT
I would rather you know the edges than hit them in production.
- It does not catch the injection. It never sees a prompt, a model, or a runtime call. It reads a static manifest and reasons about reachability. It assumes the injection can happen (because at the model level, it can) and asks whether a successful one would have a path to exfiltrate. If you want to know whether a specific payload gets through, this is the wrong tool.
- It does not replace sandboxing or human approval. Marking a tool isolated is a claim that you actually sandboxed it. The gate checks that your declared architecture breaks the chain; it does not enforce the sandbox. The runtime isolation and the human-in-the-loop on high-risk actions are still your job. This tells you whether the design, as declared, is sound.
- It trusts your capability tags. If you label send_email as not able to egress, or forget that a "read-only" tool can leak data through an error message or a logged URL, the graph is wrong and so is the verdict. The honesty of the tags is the whole foundation. Garbage tags, garbage gate.
- It is a per-session, single-agent model. It does not reason about data that leaves one agent, gets stored, and returns to another later. Cross-agent and persisted-memory flows are a harder graph than this one draws.
- The shared_context model is an assumption, not a law. It encodes the common default where everything shares one bus. If your orchestrator wires tools point to point, use explicit mode and declare the real edges. The contract is "reason about reachability on the data flow you actually have," not "every agent looks like my default."
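Explicit mode is not exercised by the fixtures above, so here is a compact, self-contained sketch of what point-to-point reachability looks like. It reimplements the check in miniature instead of importing the tool; the tool ids (fetch_ticket, summarize, read_secrets, post_webhook) and the function names are hypothetical.

```python
from collections import deque


def reachable(adj, src):
    """Breadth-first set of nodes reachable from src (src included)."""
    seen, queue = {src}, deque([src])
    while queue:
        cur = queue.popleft()
        for nxt in sorted(adj.get(cur, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def closes_trifecta(caps, flows):
    """True if some untrusted tool reaches a private tool that reaches egress."""
    adj = {}
    for src, dst in flows:
        adj.setdefault(src, set()).add(dst)
    untrusted = [t for t, c in caps.items() if "ingests_untrusted" in c]
    private = [t for t, c in caps.items() if "reads_private" in c]
    egress = [t for t, c in caps.items() if "can_egress" in c]
    for u in untrusted:
        reach_u = reachable(adj, u)
        for p in private:
            if p in reach_u and any(e in reachable(adj, p) for e in egress):
                return True
    return False


caps = {
    "fetch_ticket": {"ingests_untrusted"},
    "summarize": set(),
    "read_secrets": {"reads_private"},
    "post_webhook": {"can_egress"},
}
# Only the declared edges carry taint: nothing connects the ticket to the secrets.
wired = [("fetch_ticket", "summarize"), ("read_secrets", "post_webhook")]

if __name__ == "__main__":
    print(closes_trifecta(caps, wired))
    # One extra edge is enough to close the chain.
    print(closes_trifecta(caps, wired + [("summarize", "read_secrets")]))
```

The second call differs from the first by a single edge, which is the practical risk of explicit mode: one undeclared flow and the verdict is wrong.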
Where this sits next to the other gates
This is one more pre-execution check in a series, and it helps to say how it differs from its closest neighbors so you reach for the right one.
It is the manifest-shaped sibling of the pre-execution gate for AI agents: a check that has to pass before an action, applied here to the whole capability graph instead of a single call.
It is not the blast radius of a leaked agent API key. That one measures the scope of damage from one leaked credential. This one measures the composition of many tools, whether their capabilities can reach each other at all. Different object: one is the reach of a key, this is the reach of a graph.
It is not redacting credential leaks at the boundary either. That tool scrubs the secret out of what egresses, the content leaving the wire. This one asks an earlier question: can untrusted input reach the egress sink in the first place? One cleans the payload, the other removes the path.
And it is a different axis from pinning and verifying MCP tools. That guards against the manifest drifting or a tool being swapped under you, a question of version integrity. This reads the same manifest and asks whether its capabilities, as declared, compose into the trifecta. Drift versus reachability.
Finally, it shares a family resemblance with your agent returns 200 and lies: both refuse to trust a surface signal. There it is a clean status code over a wrong effect; here it is "all three capabilities, looks scary" versus the actual reachable paths. Different artifact, same suspicion of the easy read.
The question I am still chewing on
The model assumes one shared context per agent. Real platform-MCP setups are messier. You connect five servers, each adding tools, and some of them quietly add a leg to the trifecta on a context they all share. The honest hard part is explicit mode: drawing the true data-flow edges of a real orchestrator is work, and if you draw them wrong the gate is confidently wrong with you.
So here is the real open question for anyone running an MCP composition or a multi-server agent: how many of your connected servers sit on the same context bus, and which one of them opens egress? If you cannot answer that quickly, the trifecta might already be reachable in your agent and nobody drew the graph. I have a partial answer (tag every tool's three capabilities at registration, fail the build on a reachable path) and no clean way to keep the explicit edges honest as the agent grows. Tell me how you model the context bus in your setup. I read every comment.
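The "tag at registration" half of that partial answer might look like the sketch below. ToolRegistry is a hypothetical class, not part of any MCP SDK or orchestrator; it only shows the idea that a tool cannot be registered without an explicit capability list, and that the registry can emit a manifest in the shape the gate reads. It does nothing for the harder problem of keeping explicit edges honest.

```python
import json

ALLOWED_CAPS = {"ingests_untrusted", "reads_private", "can_egress"}


class ToolRegistry:
    """Hypothetical registry: every tool must declare its capability tags."""

    def __init__(self, agent):
        self.agent = agent
        self._tools = {}

    def register(self, tool_id, capabilities, isolated=False):
        # `capabilities` has no default, so forgetting the tags is a TypeError.
        caps = set(capabilities)
        unknown = caps - ALLOWED_CAPS
        if unknown:
            raise ValueError(
                "unknown capability on %s: %s" % (tool_id, ", ".join(sorted(unknown))))
        if tool_id in self._tools:
            raise ValueError("duplicate tool id: " + tool_id)
        self._tools[tool_id] = {
            "id": tool_id,
            "capabilities": sorted(caps),
            "isolated": bool(isolated),
        }

    def manifest(self):
        """Emit a manifest dict with tools in sorted order for stable diffs."""
        return {
            "agent": self.agent,
            "data_flow": "shared_context",
            "tools": [self._tools[k] for k in sorted(self._tools)],
        }


registry = ToolRegistry("example-agent")
registry.register("read_email", ["ingests_untrusted"])
registry.register("send_email", ["can_egress"], isolated=True)

if __name__ == "__main__":
    print(json.dumps(registry.manifest(), indent=2, sort_keys=True))
```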
Follow for the next runnable gate in this series on controlling agents before you trust them.
Written by Alexey Spinov. AI-assisted, human-verified: the tool, all three manifests, and every number above come from a real local run on 2026-06-30 (Python 3.13.5, stdlib only, offline). I ran it, checked the exit codes (0 / 1 / 2), hashed the STDOUT twice to confirm determinism, and edited every line. The Agentjacking figures (2,388 organizations, 100+ agents at an 85% success rate in controlled testing, the "technically not defensible" quote) are Tenet Security's, from their June 2026 writeup, not my measurements. The term "lethal trifecta" and the position that LLMs cannot reliably distinguish trusted instructions from untrusted data are Simon Willison's, from his 16 June 2025 post. I label which numbers are theirs and which are mine.