DEV Community

Claude code

TrustFall and the Execution Risk Hiding in Your Coding Agent

Claude Code execution risk is the exposure that arises when an AI coding agent runs commands, edits files, or invokes tools on your machine or in your CI pipeline, and an attacker, a poisoned input, or a careless prompt steers that execution toward something you never intended. The agent doesn't need malicious intent. It needs capability plus a single bad instruction, and most teams hand it the capability on day one. Understanding that risk is what separates treating the agent as a coding assistant from running an unsupervised actor with shell access.

TrustFall is the name we use for a specific failure pattern: trust granted at setup quietly "falls through" to actions you would never have approved in isolation. You allow npm test once. Three turns later the agent reads a dependency's README, the README contains an embedded instruction, and the agent decides the most helpful next step is to run a script that exfiltrates your environment variables. Nothing crashed. No alert fired. The trust you extended to a test command became trust extended to arbitrary execution.

Why this is not a hypothetical

The mechanism is documented and ranked. The OWASP Top 10 for Large Language Model Applications lists Prompt Injection as LLM01 — the single highest-priority risk class for LLM-integrated systems. Coding agents amplify it because they don't just generate text; they act on it. Anthropic's own Claude Code documentation ships a flag, --dangerously-skip-permissions, whose name is itself a warning: it exists because skipping the permission layer is common enough, and dangerous enough, to deserve a deterrent in the flag name.

Put those two facts together. The top-ranked weakness in LLM systems (injection) meets a runtime that executes commands, in a working culture where engineers routinely disable the one guardrail standing between a suggestion and a side effect. That intersection is where execution risk lives.

The attack surface is wider than the terminal

Developers picture the risk as "the agent runs rm -rf in my repo." That's the obvious case. The quieter ones do more damage:

  • Indirect prompt injection. Instructions hidden in a file the agent reads: a dependency's docs, a GitHub issue, an API response, a code comment. The agent treats data as instruction. This is the core of LLM01.
  • Tool-call chaining. One approved action (read a file) feeds an unapproved one (post its contents to a webhook the agent was told to call).
  • CI execution. An agent running in a pipeline has credentials, network egress, and no human watching the approval prompt. The blast radius is your secrets store, not one laptop.
  • Persistence. An agent that can edit files can edit its own config, your hooks, or a git pre-commit script, turning a one-time compromise into a standing one.
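
To make the persistence case concrete, here is a minimal sketch of a guard that flags file writes aimed at locations that outlive a session. The path list and the function name `is_persistence_write` are illustrative assumptions, not part of Claude Code or any other tool, and the check expects paths relative to the repository root.

```python
from pathlib import PurePosixPath

# Hypothetical list of locations where a write can outlive the session.
# Adjust for your repository; none of these come from Claude Code itself.
PERSISTENCE_PATHS = (
    ".git/hooks",
    ".git/config",
    ".github/workflows",
    ".husky",
)


def is_persistence_write(repo_relative_path: str) -> bool:
    """Return True if a write to this path could install a standing foothold."""
    parts = PurePosixPath(repo_relative_path).parts
    # Treat any traversal as suspicious; callers should resolve paths first.
    if ".." in parts:
        return True
    normalized = "/".join(parts)
    return any(
        normalized == protected or normalized.startswith(protected + "/")
        for protected in PERSISTENCE_PATHS
    )


if __name__ == "__main__":
    for candidate in ("src/app.py", ".git/hooks/pre-commit", "src/../.git/config"):
        print(candidate, is_persistence_write(candidate))
```

A check like this belongs in front of the write, not in the approval prompt: the point is to deny or escalate these paths regardless of what the human clicks.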

Why approval prompts don't save you

The standard defense is the human-in-the-loop approval prompt: the agent asks before it acts. On paper this contains execution risk. In practice it degrades, and it degrades predictably.

The failure is behavioral, not technical. An engineer reviewing a refactor approves dozens of agent actions in a session. By the twentieth "Allow this command?" the review has become muscle memory — read the first word, hit yes, keep moving. Security researchers call this approval fatigue, and it's the same dynamic that made click-through TLS warnings useless a decade ago. The prompt that fires constantly is the prompt nobody reads. When the one malicious command arrives in a stream of forty benign ones, it gets the same reflexive approval as the rest.

Allowlists have the inverse problem. Allow git and you've allowed git config core.hooksPath, which redirects every future hook to a directory the agent controls. Allow python and you've allowed any script Python can run. Coarse allowlists are convenient and porous; fine-grained ones generate so many prompts that engineers reach for --dangerously-skip-permissions to make the friction stop. Either way, the human gate fails. If you're hardening agents in automation, our CLaude coe documentation covers permission scoping and CI execution boundaries in detail.
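
The gap between coarse and fine-grained allowlists can be sketched in a few lines. Both policies below are hypothetical examples written for this article; they are not how Claude Code evaluates permissions. The coarse check looks only at the executable name, while the finer one also requires an approved subcommand.

```python
import shlex

# Hypothetical coarse policy: any invocation of these executables passes.
COARSE_ALLOWLIST = {"git", "python", "npm"}

# Hypothetical finer policy: executable plus explicitly permitted subcommands.
FINE_ALLOWLIST = {
    "git": {"status", "diff", "log"},
    "npm": {"test"},
}


def coarse_allows(command: str) -> bool:
    """Inspect only the executable name."""
    argv = shlex.split(command)
    return bool(argv) and argv[0] in COARSE_ALLOWLIST


def fine_allows(command: str) -> bool:
    """Require the first argument to be an approved subcommand."""
    argv = shlex.split(command)
    if len(argv) < 2:
        return False
    return argv[1] in FINE_ALLOWLIST.get(argv[0], set())


if __name__ == "__main__":
    hook_redirect = "git config core.hooksPath /tmp/agent-hooks"
    print(coarse_allows(hook_redirect))  # True: "git" is on the coarse list
    print(fine_allows(hook_redirect))    # False: "config" is not approved
```

Even the finer policy leaks: `npm test` runs whatever script the repository's package.json defines, so an approved subcommand can still reach arbitrary code. That is why the allowlist should be treated as one layer rather than the boundary.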

What actually reduces the risk

You contain execution risk by assuming the approval will be wrong and limiting what a wrong approval can do. Concretely: run the agent in a sandbox with no network egress by default, scope credentials to the narrowest possible token, treat every file the agent reads as untrusted input rather than trusted instruction, and log every executed command to an immutable audit trail so a compromise is detectable after the fact. Sandboxing helps, but it is a boundary, not a cure — a sandbox with broad filesystem and network access just relocates the blast radius. The principle is least privilege applied to a non-deterministic actor. You can see how we structure these controls in the CLaude coe product overview.
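
As one illustration of the audit-trail idea, the sketch below chains each logged command to the hash of the previous entry, so that editing or deleting an earlier record is detectable on verification. This is an assumption about how such a log could be built, not a description of any product's implementation, and a hash chain is tamper-evident rather than truly immutable: it still needs to be shipped to storage the agent cannot write to.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Hash a canonical JSON encoding of the entry body."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log: list, command: str, timestamp: float) -> dict:
    """Append a command record linked to the hash of the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"ts": timestamp, "command": command, "prev": prev_hash}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash and link; any edit or deletion breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {"ts": entry["ts"], "command": entry["command"], "prev": entry["prev"]}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    audit_log = []
    append_entry(audit_log, "npm test", 1700000000.0)
    append_entry(audit_log, "git status", 1700000005.0)
    print(verify_chain(audit_log))
    audit_log[0]["command"] = "echo harmless"  # simulate after-the-fact tampering
    print(verify_chain(audit_log))
```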

At CLaude coe, we treat the coding agent as an untrusted execution context by default and build the policy, sandboxing, and audit layer around it, rather than relying on a human to read every prompt correctly under fatigue. Pricing for teams deploying agents across CI and developer machines is on the CLaude coe pricing page, and we publish ongoing analysis of agent attack patterns on the CLaude coe blog.

Frequently asked questions

Can Claude Code run commands without my approval?

By default Claude Code prompts before executing commands or editing files. However, Anthropic ships a --dangerously-skip-permissions flag that disables those prompts, and many teams enable it to avoid friction. In that mode, or inside a CI pipeline configured to auto-approve, the agent runs commands with no human gate.

How is execution risk different from prompt injection?

Prompt injection (OWASP LLM01) is the input attack: malicious instructions smuggled into content the model reads. Execution risk is what happens next: the agent acting on those instructions by running commands or invoking tools. Injection is the trigger; execution risk is the consequence. Injection against a model with no execution capability has low impact, while the same injection against a misconfigured agent can escalate into full compromise.

Does sandboxing eliminate Claude Code execution risk?

No. Sandboxing reduces the blast radius but does not eliminate the risk. A sandbox that still has network egress, broad filesystem access, or live credentials simply relocates what a compromised agent can reach. Effective sandboxing pairs isolation with no default egress, scoped tokens, and audit logging.

Is the risk worse in CI than on a developer's laptop?

Generally yes. A CI environment has no human watching approval prompts, holds deployment credentials and access to secrets, and has network egress to production systems. An agent compromised in CI can reach far more than one compromised on a workstation, which is why CI execution deserves the tightest permission scoping.
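
One way to narrow what a CI-hosted agent can reach is to launch it with an explicitly constructed environment instead of inheriting the pipeline's. The sketch below is a hedged example: the allowlisted variable names and the helper names `minimal_env` and `run_with_minimal_env` are assumptions for illustration, and a real job may need a few more variables to run at all.

```python
import os
import subprocess

# Hypothetical allowlist of variables the agent process genuinely needs.
SAFE_ENV_KEYS = ("PATH", "HOME", "LANG")


def minimal_env(source: dict) -> dict:
    """Copy only allowlisted variables; tokens and keys are simply not passed on."""
    return {key: source[key] for key in SAFE_ENV_KEYS if key in source}


def run_with_minimal_env(argv: list) -> subprocess.CompletedProcess:
    """Run a command so that it cannot read the parent's secrets from its environment."""
    return subprocess.run(
        argv,
        env=minimal_env(dict(os.environ)),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    pipeline_env = {"PATH": "/usr/bin", "DEPLOY_TOKEN": "not-for-the-agent"}
    print(minimal_env(pipeline_env))
```

Building the environment from an allowlist rather than stripping known-bad names means a newly added secret stays out of reach by default.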

Execution risk in a coding agent is not a flaw to patch — it's an inherent property of giving a non-deterministic system the ability to act. The teams that stay safe are the ones who design for it before the first injection lands. Secure Claude Code by containing what a wrong decision can do.
