DEV Community

Neeraj Kumar Singh Beshane
Neeraj Kumar Singh Beshane

Posted on • Originally published at linkedin.com

Your Agent Can't Tell What It Was Told From What It Read

You already treat input as untrusted. You validate what users type. You escape what goes into a query. You have assumed for a year that a document an agent reads can carry a prompt injection. None of this is new to you.

Here is the one place that discipline quietly stops: the line between what your agent is told to do and what your agent reads while doing it. To the model, there is no line. Every byte in its context window is a candidate instruction, whether it came from you, from a web page, or from an error message. Two research teams proved that in a single week, from opposite ends. This is not a new threat model. It is input validation, extended to the one input you never labeled as input.

New here? Securing the Agentic Stack is a weekly operator read on where AI and security collide, mapped to one stable six-layer model. Start with the foundation, linked at the end of this issue.

The same agent pipeline shown twice. In both,

Exhibit one: a game talked six browsers out of their guardrails

On June 29, LayerX researcher Roy Paz published BioShocking. He built a web page with a puzzle, themed on the game BioShock, that rewards wrong answers. Two plus two equals five. Victory is defeat. Once an AI browser accepts that "incorrect actions are acceptable here," it is reasoning inside a fiction the attacker wrote. The final puzzle step tells the agent to fetch a URL that redirects to the user's authenticated work GitHub repo and submit what it finds. It hands over the SSH credentials.

BioShocking attack chain, top to bottom: a puzzle page rewards wrong answers, the browser accepts the fiction, the final step tells it to fetch /code which redirects to the user's authenticated GitHub repo, and SSH credentials are exfiltrated. Scorecard: six agents tested, all six fell. OpenAI patched, Anthropic's patch failed, Perplexity closed the report, three never responded.

LayerX tested six agents: ChatGPT Atlas, Perplexity Comet, Fellou, Genspark, Sigma, and the Claude Chrome plugin. All six failed to recognize credential theft as crossing a line. The vendor scorecard is its own lesson. OpenAI patched Atlas, Anthropic's patch for the Claude plugin failed on retest, Perplexity closed the report on Comet, and the other three never responded. Read the honest limits too: the game and its instructions are visible to the user, so this is a demonstration rather than a stealthy end-to-end attack. The demonstration is the point. The guardrail was keyed on the model's belief about its own reality, and that belief is attacker-controllable.

Exhibit two: an error message became a command

Clone-repo attack chain, top to bottom: a clean GitHub repo with an ordinary README, setup throws a RuntimeError telling you to run python3 -m axiom init, the agent trusts the error and runs init, setup.sh resolves an attacker-controlled DNS TXT record, base64 decodes to bash, and a reverse shell opens running as the developer. The payload was never in the repo and never on disk, invisible to code review, static scanners, and the agent itself.

The same week, Mozilla's 0DIN team published a repo that owns your machine with no malicious code in it. The README has ordinary setup steps. A Python package is built to throw an error on first run that instructs the user to run an init command. Claude Code reads the error, treats it as routine recovery, and runs the fix. That init step calls a shell script that resolves a DNS TXT record the attacker controls, base64-decodes it, and pipes the result to bash. A reverse shell opens, running as the developer.

0DIN's own line is the whole lesson: "Claude Code never decided to open a shell. It decided to fix an error. The reverse shell is three indirection steps away from anything Claude Code actually evaluated: an error message it trusted, a script that fetched a value, and a DNS record it never saw." The payload was never in the repo and never on disk, invisible to code review, to static scanners, and to the agent itself. The instruction hid inside data the agent trusted.

Why these are the same bug

Strip both down and you get one shape. Untrusted content entered the context, and the agent acted on it as if you had authored it.

NOT ACCEPTABLE (the pattern in both incidents):

# the agent's control loop, simplified
observation = read_untrusted_source()   # a web page, a repo, an error string
plan = model.decide(context + observation)   # observation is now instruction
execute(plan)                                 # runs with the agent's full authority
Enter fullscreen mode Exit fullscreen mode

The flaw is line two. observation and your actual instructions are concatenated into one stream, and the model has no reliable way to rank one above the other. BioShocking poisoned observation with a fake reality. 0DIN poisoned it with a fake error. Both won.

ACCEPTABLE (the same loop, with the boundary put back):

observation = read_untrusted_source()
plan = model.decide(context + observation)
for step in plan:
    if step.touches_secrets or step.is_irreversible or step.runs_shell:
        require_human_approval(step)          # belief-independent gate
    if step.source_is_read_content:           # instruction came from data
        deny_unless_allowlisted(step)         # not the model's judgment
    execute(step)
Enter fullscreen mode Exit fullscreen mode

The single change: authority to act does not live in the model's reasoning. It lives in a gate outside the model that does not care what reality the agent thinks it is in. This is the same conclusion Anthropic reaches from the design side in Zero Trust for AI Agents: treat every caller and every input as untrusted until a control says otherwise.

What to do this week

None of these are new controls. They are input validation and privilege separation, pointed at the agent's context.

  1. Gate the dangerous verbs on approval, not on the model's confidence. Any step that touches credentials or runs a shell or is otherwise irreversible waits for a human. You already do this for production deploys. This is the same gate.
  2. Split the trust domains. An agent reading public pages should not, in the same session, hold live access to the repos and secret stores that page might ask about. You separate privileges for services. Separate them for agents.
  3. Never let read content auto-execute. An error message that suggests a fix is still data, not an authorized instruction. Pin auto-fix to a reviewed allowlist of remediations, the way you pin dependencies. Suggested is not authorized.
  4. Log the crossing. When an agent runs a command whose origin is content it read rather than a human instruction, that is a distinct, alertable audit event. You log auth events. Add this one.

The pattern to carry

This is the Instruction layer, the first layer of the stack, and it fails the same way every time. The model cannot label its own inputs. So you label them, outside the model, with a gate it cannot talk its way past. The cheap version of agent security is "make the model refuse bad things." Both teams just showed the model will refuse nothing once you rewrite what it thinks the situation is.

So the question for the week is small and concrete. Between everything your agent reads and everything your agent can do, where is the gate, and does it depend on the model staying convinced of reality? If it does, it is not a gate.

  • Neeraj

Sources and further reading

Every claim above is numbered to its source. Primary research disclosures are marked as such. The rest are corroborating reporting or context.

  1. LayerX Security: "BioShocking AI: Gaming the AI Browser and Escaping its Guardrails" (primary, research disclosure by Roy Paz, June 29, 2026). The mechanism, the six tested agents, the SSH-credential exfiltration, and the vendor-response scorecard.
  2. Ars Technica: "New attack provides one more reason why AI browsers are a bad idea" (June 2026). Independent analysis and the honest limitation that the proof of concept is visible to the user.
  3. Mozilla 0DIN: "Clone This Repo and I Own Your Machine" (primary, research disclosure, June 2026). The clean-repo chain, the DNS TXT indirection, and the three-indirection quote.
  4. BleepingComputer: "Clean GitHub repo tricks AI coding agents into running malware" (June 2026). Corroborating write-up of the 0DIN chain.
  5. Anthropic: "Zero Trust for AI Agents". The design-side framing that every caller and input is untrusted until a control says otherwise.
  6. The foundation for this series: "Your AI Agent Is Not a Chatbot. It Is a New Runtime." The six-layer model referenced throughout.
  7. Last week: "The Web Page Couldn't Reach Localhost. Your Agent Carried It There." The boundary your agent carries with it.

Top comments (0)