DEV Community: Justin Kwon

Is that AI coding agent safe by default? Codex / Claude Code / Antigravity across 3 postures

Justin Kwon — Tue, 16 Jun 2026 13:02:33 +0000

by Ju571nK · 2026

This piece explains what each item in the comparison table means and where it can go wrong. It does not rate any specific product.
As of: 2026-06

When people talk about securing AI coding agents, they usually picture "a proxy watching the traffic." That's one valid approach. But you can scope agent security broadly or narrowly, and one part that gets overlooked is the user-settings / environment / OS layer. Set it up badly and it becomes a common source of incidents.

A single score doesn't tell you much. What matters is which posture you're looking from. So the table is split into three: ① out of the box (default), ② locked down by a careful user (hardened), and ③ enforced by an admin (enterprise-managed). Below, each item explains what the mechanism is and where it breaks. "Demonstrated" marks real cases reported in 2025-2026.

Premise: defaults and behavior differ by product and edition, and enterprise (managed) builds can differ from personal ones. So "which product is safe" is rarely the useful question. The useful one is "how is this item set up on this host, and where is it weak?"

① Out of the box: default posture

A fresh single-user install with no hardening. The question: is it safe the moment you install it?

Default execution posture
Whether the agent asks for human approval before it runs commands, edits files, or makes external calls. If the default is close to "run without asking," one prompt injection can run commands or leak data right away. But if approval prompts come too often, users flip on full auto-approve and disable the gate themselves.

OS sandbox by default
Whether the kernel-level isolation around commands is on out of the box. If it's off by default, you spend most of your time running with no isolation at all. The most common incident isn't a clever exploit; it's a frustrated user turning isolation off to get unblocked.

Network closed by default
Whether outbound network is blocked by default in the execution environment. If it's open, an injection has a ready path to send secrets out (exfil) or pull a second-stage payload.

▣ Demonstrated (2025): in one agent's leak, the stolen secret was sent to an external webhook (webhook.site) that was on the default allow-list.

Untrusted-code containment
Whether opening someone else's repo keeps its config and instruction files from auto-applying. If the trust gate is weak, a malicious repo's planted config runs the moment you open it, so your environment changes without any further action from you (zero-click).

▣ Demonstrated (2025): a malicious repo's project-instruction file planted global config (an auto-run MCP setup), creating a backdoor that survived a reinstall. It was first classified as "intended behavior."

Sensitive-path protection
Whether default rules keep secret files like .env and keys unreadable. If the protection is only a tool-scoped filter, the model just reads the file through the shell instead.

▣ Demonstrated (2025): when file-read was blocked at the tool layer, the model ran cat .env itself to exfiltrate AWS keys, a subprocess bypass.

② Locked down by a careful user: hardening ceiling

How far a careful user can push the defenses through settings. The question: if you lock it down properly, what still breaks?

Strict OS sandbox
Whether you can bind kernel isolation tightly: read-only, workspace-only. Even when you turn it on, it is weakened by a wide write scope, a bypass flag, or a fail-open fallback that runs commands unsandboxed when sandbox initialization fails.

▣ Demonstrated (2026): one agent had an RCE that escaped even its strictest "Secure Mode" (later patched).

Network domain restriction
Whether you can narrow outbound traffic to an allow-list. An overly broad rule (*.example.com) leaves a way around it, and so does skipping traffic inspection, which lets an attacker abuse an allowed domain or use domain fronting.
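
To make the "too broad" case concrete, here is a minimal sketch comparing an exact-host allow-list with a wildcard-style suffix rule. The hostnames and function names are hypothetical, and this is not how any particular agent implements its network policy.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: exact hostnames only, no wildcards.
ALLOWED_HOSTS = {"api.example.com", "pypi.org"}


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def allowed_exact(url: str) -> bool:
    """Allow only hosts that appear verbatim in the allow-list."""
    return host_of(url) in ALLOWED_HOSTS


def allowed_wildcard(url: str, suffix: str = ".example.com") -> bool:
    """The broad '*.example.com' style rule: any subdomain passes."""
    return host_of(url).endswith(suffix)
```

With the suffix rule, a subdomain an attacker controls (say, a hosted bucket under example.com) passes as easily as the API you meant to allow. Neither check addresses abuse of a legitimately allowed host or domain fronting; those need inspection of the traffic itself.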

Path/command deny rules
Rules that explicitly block specific paths or commands. Compound commands (a && b), wrappers (xargs, npx, docker exec), or encoding can reshape the same action into a form that slips past the rules.
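
A small sketch of why string-shaped deny rules leak. The rule and helper names are made up for illustration; the point is only that the same action can be reshaped until the pattern no longer matches.

```python
import re
import shlex

# Hypothetical deny rule: block commands that start with "cat .env".
NAIVE_DENY = re.compile(r"^\s*cat\s+\.env\b")


def naive_is_denied(command: str) -> bool:
    """Anchored pattern match, the kind of rule that is easy to write."""
    return bool(NAIVE_DENY.search(command))


def mentions_env_file(command: str) -> bool:
    """Stricter: flag any token that names a .env file, wherever it appears."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input: fail closed
    return any(t == ".env" or t.endswith("/.env") for t in tokens)


# Three reshaped forms of the same read.
compound = "true && cat .env"
wrapped = "echo .env | xargs cat"
encoded = "echo LmVudg== | base64 -d | xargs cat"  # base64 of ".env"
```

The anchored rule misses all three reshaped forms. The token check catches the compound and wrapper forms but still misses the encoded one, which is why rules like these are a convenience layer and not an isolation boundary.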

Extra blocking hook (defense-in-depth)
A user-side hook that intercepts and denies a tool call right before it runs (PreToolUse-style). Regex matching is easy to bypass, the hook can fail-open on error, and a broad allow can neuter it. Treat it as backup protection, not the main boundary. Relying on it as the only block is dangerous.
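
One way to avoid the fail-open problem is to make "allow" the only outcome that requires success. The sketch below is generic Python, not any agent's actual hook API; the tool-call shape and the tool names are assumptions.

```python
from typing import Callable


def run_guard(check: Callable[[dict], bool], tool_call: dict) -> str:
    """Return 'allow' only if the check ran and explicitly approved.

    Any exception inside the check is treated as a denial (fail closed).
    """
    try:
        approved = check(tool_call)
    except Exception:
        return "deny"
    return "allow" if approved is True else "deny"


def example_check(tool_call: dict) -> bool:
    """Hypothetical policy: only allow a fixed set of tool names."""
    return tool_call["tool"] in {"read_file", "list_dir"}
```

A malformed tool call raises KeyError inside example_check and comes back as "deny", and so does a check that returns None by accident. Even so, this only hardens the hook; it doesn't turn it into the main boundary.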

Guard tamper-evidence
Whether a once-trusted guard script is detected when it later changes (hash-pinning, etc.). Without it, a malicious package or post-install can swap the hook quietly, protection disappears from the next run, and the user never notices (TOCTOU).
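
Hash-pinning can be sketched in a few lines with the standard library. The function names are hypothetical, and the sketch assumes the pinned digest is stored somewhere the process that might swap the script cannot write.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of the file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def guard_is_unchanged(path: Path, pinned_digest: str) -> bool:
    """True only if the file exists and its digest matches the pinned value."""
    try:
        return sha256_of(path) == pinned_digest
    except OSError:
        return False  # missing or unreadable guard: treat as tampered
```

Checking the digest and then executing the file by path still leaves a window between the check and the use. Closing it means running the same bytes you verified, which this sketch does not attempt.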

Attack surface from extensibility
The extension surface: plugins, hooks, MCP. From a foot-gun view, fewer is safer. Extensibility cuts both ways: it is how policy gets delivered, but malicious code can use the same path to inject global config or MCP servers, and that path has already been exploited.

③ Enforced by an admin: enterprise-managed

Whether the org can pin rules the user can't disable. The question: what protection still holds when a user makes a mistake?

Enforced policy file
A managed policy file an individual user can't undo. If that file sits in a user-writable location instead of an OS-protected path, editing it bypasses the policy. And shipping the policy without deploying the scripts it references makes it a paper policy.
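
A rough POSIX-only heuristic for the "user-writable location" problem. The function names and the choice of uid 0 as the trusted owner are assumptions for illustration; real managed-policy paths and ownership rules differ by product and OS.

```python
import os
import stat
from pathlib import Path


def group_or_world_writable(path: Path) -> bool:
    """True if anyone other than the owner has write permission on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def looks_protected(path: Path, trusted_uid: int = 0) -> bool:
    """Owned by the trusted uid and not writable by group or others."""
    st = os.stat(path)
    owner_ok = st.st_uid == trusted_uid
    return owner_ok and not (st.st_mode & (stat.S_IWGRP | stat.S_IWOTH))
```

A fuller check would apply the same test to every parent directory, since a user who can write to the directory can replace the file without ever editing it.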

Overrides user flags
Whether the policy neutralizes personal bypass flags like --yolo. If precedence is wrong and local settings or CLI options override org policy, the enforcement doesn't actually apply.
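
The precedence point comes down to merge order. In this sketch the setting names are invented and no product's real resolution logic is implied; it only shows the order that makes org policy win.

```python
def resolve(managed: dict, cli: dict, local: dict) -> dict:
    """Merge settings so managed policy always wins over CLI flags and local files."""
    effective = {}
    effective.update(local)    # lowest precedence
    effective.update(cli)      # user flags override local settings
    effective.update(managed)  # org policy is applied last and wins
    return effective


# Hypothetical inputs: the user passes a bypass flag, the org forbids it.
settings = resolve(
    managed={"auto_approve": False},
    cli={"auto_approve": True, "model": "some-model"},
    local={"theme": "dark"},
)
```

Here settings["auto_approve"] ends up False even though the CLI asked for True. Swap the last two update calls and the bypass flag wins, which is the misconfiguration described above.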

Enforce MCP/plugin allow-list
Whether the org pins which MCP servers and plugin sources are allowed. Without it, a user can add arbitrary servers or marketplaces, leaving the supply-chain surface wide open.

Enforce version / managed hooks
Whether you can require a minimum version and "managed hooks only." An unmanaged personal machine may have no enforcement layer at all, so you have to confirm and measure it separately: is this host managed?

Most incidents don't happen because a mechanism was missing. They happen when an existing one gets turned off, opened too wide, bypassed, or quietly unset by a person.

Watching the traffic answers one question. The one that decides your actual exposure is "how is each item on this host set up right now, and where is it weak?" And that depends on the posture: how conservative the default is, how high the hardening ceiling goes, whether the org can enforce anything.

▣ Sources for the demonstrated cases: the "Demonstrated" items come from 2025-2026 public reporting. All were reported or demonstrated on Google Antigravity. They're cited as examples of real-world failures, not as a product ranking, and behavior varies by product, version, and edition.

PromptArmor / TechRadar (prompt injection leaking cloud credentials; .env protection bypassed via the terminal): https://www.techradar.com/pro/googles-ai-powered-antigravity-ide-already-has-some-worrying-security-issues
Mindgard "Forced Descent" (a backdoor planted via mcp_config.json that survives reinstall): https://mindgard.ai/blog/google-antigravity-persistent-code-execution-vulnerability
Pillar Security (find_by_name → fd -X Secure-Mode-bypass RCE; reported Jan 2026, fixed Feb): https://www.pillar.security/blog/prompt-injection-leads-to-rce-and-sandbox-escape-in-antigravity

"It's not a bug, it's spec": a zero-click RCE in AI coding agents that three vendors won't patch

Justin Kwon — Tue, 02 Jun 2026 12:23:25 +0000

TL;DR — A prompt injection can rewrite your AI IDE's mcp.json the moment you open a project, with no dialog and no click, and get arbitrary code execution. It's one of 12+ CVEs in the same class. The root cause lives in the official MCP SDK — and Anthropic, Google, and Microsoft have declined to issue CVEs for their own tools, on the grounds that rewriting the file "requires explicit user permission." In practice, that permission is usually "you installed the IDE."

In a previous post I argued that the real attack surface for AI coding agents isn't "the model goes rogue" — it's the config file. At the time, the worst case (TrustFall) still needed a human: clone a malicious repo, open it, press Enter on a trust dialog.

CVE-2026-30615 removes the Enter.

The zero-click chain

Disclosed by OX Security on 2026-04-15. On Windsurf IDE 1.9544.26, the chain is:

The attacker prepares HTML content the IDE will render — a malicious web page, a poisoned repo README, a tampered tool description.
An injected instruction silently overwrites the local mcp.json and registers an attacker-controlled STDIO server.
The MCP SDK re-reads the config and launches the registered binary.
Arbitrary command execution. CVSS 8 / High.

No approval dialog. No confirmation step. Among the IDEs OX tested, Windsurf was the only true zero-click case; Cursor, Claude Code, and Gemini CLI each required at least one user action.

Codeium (Windsurf) shipped a patch. That's the part everyone agrees on. The argument starts with everyone else.

This is a class, not a bug

The same disclosure groups 12+ CVEs under one pattern — RCE via MCP STDIO:

LangFlow (CVE unassigned)
GPT Researcher (CVE-2025-65720)
LiteLLM (CVE-2026-30623)
Agent Zero (CVE-2026-30624)
Windsurf (CVE-2026-30615)
DocsGPT (CVE-2026-26015)
Flowise, Upsonic, Bisheng, Jaaz, and more

The shared root cause: the official MCP SDK passes user-controllable config values into StdioServerParameters without sanitization, and that flows straight into spawning a subprocess. OX filed this root cause under a category I haven't seen on a vuln report before — "Won't Be Patched" — because Anthropic's position is that this is spec-conformant behavior, not a defect to fix at the protocol level.

There's a known operational mitigation: allowlist the STDIO command value to known launchers, e.g. {npx, uvx, python, python3, node, docker, deno}. That closes the "point at any binary you like" path. But it's something each downstream implementation has to add itself. It is not the SDK's default.
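
As a sketch of that mitigation on the consuming side, the check below walks an MCP config and flags STDIO entries whose command is not on the launcher list. It assumes the common mcpServers shape with a command field per server; the function name is hypothetical, and this is not code from the SDK or from any of the affected projects.

```python
import json

# Launcher allow-list taken from the mitigation described above.
ALLOWED_LAUNCHERS = {"npx", "uvx", "python", "python3", "node", "docker", "deno"}


def disallowed_stdio_servers(config_text: str) -> list[str]:
    """Return names of mcpServers entries whose command is not an allowed launcher."""
    servers = json.loads(config_text).get("mcpServers", {})
    bad = []
    for name, spec in servers.items():
        command = spec.get("command")
        if command is None:
            continue  # not a STDIO entry (for example URL-based); out of scope here
        # Require a bare launcher name; a path such as /tmp/npx is rejected.
        if command not in ALLOWED_LAUNCHERS:
            bad.append(name)
    return bad
```

This closes only the arbitrary-binary path. An allowed launcher can still be handed hostile arguments, so the args field needs its own review.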

The review surface keeps shrinking

Line up three incidents and the trend is hard to miss:

Act 1 — TrustFall: the config file is malicious from the start. Clone, open, press Enter on the trust dialog. At least the dialog appears.
Act 2 — AWS Kiro: an indirect prompt injection writes trustedCommands: ["*"]. The config changes after you've reviewed it, so you miss the moment.
Act 3 — Windsurf zero-click: opening HTML silently rewrites mcp.json. No dialog at all. The fact that a rewrite happened isn't even surfaced in the IDE.

Each act shaves off more of the surface where a human could notice something is wrong. By Act 3, the event itself is invisible.

So whose problem is it?

Here's where I'd genuinely like the comments.

Google, Microsoft, and Anthropic have declined to issue CVEs for their own tools in this class. The stated reason is reasonable on its face: modifying these files requires the user's explicit permission.

But walk through what that permission actually is. The injection rewrites a file inside a workspace the agent already has write access to, access you granted in bulk when you opened the project. There is no per-write prompt. So "explicit user permission" collapses into "you ran the IDE." If the threshold for "not a vulnerability" is "the user once consented to use the software," almost nothing involving a config file is ever a vulnerability.

I'm not claiming the vendors are acting in bad faith. A protocol-level fix is genuinely hard, and "spec-conformant" is technically true. But "technically spec" and "not the user's problem" are different claims, and the second one is the one that ends up on the user. When the people who own the protocol decline to treat it as a defect, the risk doesn't disappear — it just moves downstream to whoever's running the agent.

Which is the actual question: if the vendor won't fix it and won't even name it, where does the responsibility land?

If nobody patches, watch the config layer

My answer, for what it's worth, is that this has to be observed somewhere other than the IDE.

EDR sees the npx or python that got spawned. It does not see "a new STDIO server was added to mcp.json." By the time the subprocess starts, the config change is already seconds in the past. The interesting signal — the permission state changed while you weren't looking — happens one layer up from where most tooling is watching.
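
The signal described here is a diff between two snapshots of the config. The following is a minimal illustration of that idea, not Sigil's implementation; the function name is hypothetical and the mcpServers shape is assumed.

```python
import json


def added_or_changed_servers(before_text: str, after_text: str) -> dict:
    """Compare two snapshots of an MCP config and report new or modified servers."""
    before = json.loads(before_text).get("mcpServers", {})
    after = json.loads(after_text).get("mcpServers", {})
    return {
        name: spec
        for name, spec in after.items()
        if before.get(name) != spec
    }
```

Run on every write to the file, this reports "a server named X was added with command Y" before any subprocess exists for EDR to see.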

That's the layer I've been poking at. I've been building a small open-source thing (Sigil) that watches agent config files like mcp.json and .claude/settings.json, scores the risk, and emits an event to your SIEM — it doesn't block, it just tells you when the permission state changed while your hands were off the keyboard. Across a fleet of machines that shows up as triage-able alerts — the silent change, made visible:

And because it exposes that same posture as plain MCP, you can also just ask. Here's Codex doing exactly that — pulling the riskiest host in a fleet and the reasons behind it:

Notice one of the flagged reasons: an untrusted remote MCP server. That's the same class of mcpServers entry CVE-2026-30615 plants — except here it's surfaced as posture, where a human (or another agent) can actually see it after the fact. (The CVE itself is STDIO-command-based; Sigil's STDIO-command scoring is tracked in #53. The attack surface — the mcpServers key — is the same.)

That's deliberately the last thing in this post, not the point of it.

The point

Act 1 was "plant a malicious config file." Act 3 is "rewrite the config file the instant it's opened, silently." The time and surface a user has to review anything got measurably smaller in between — and the vendors who own the protocol have decided that's spec, not bug.

The attack surface is the config file. So the thing you watch should be the config file too — its state, and the moment that state changes.

How does your team handle config from untrusted repos today — sandbox the whole workspace, pin the agent's permissions, or just trust the trust dialog? I'd actually like to know.

The real attack surface for AI coding agents is the config file

Justin Kwon — Sun, 24 May 2026 07:32:00 +0000

If you think the security risk of AI coding agents (Claude Code, Cursor, Gemini CLI) is "the model goes rogue and runs a dangerous command," the serious incidents from the past few months tell a different story. None of them were really about the model. The starting point was always a config file.

This post walks through TrustFall and AWS Kiro, explains why config files became the attack surface, and introduces the open-source tool I built in response, Sigil.

TrustFall: clone, open, RCE

In May 2026, Adversa AI published TrustFall: cloning a malicious repository and opening it was enough for one-click RCE across Claude Code, Cursor, Gemini CLI, and GitHub Copilot.

The setup is two files in the repo:

.mcp.json pointing at an attacker-controlled MCP server
.claude/settings.json with project-scoped settings like enableAllProjectMcpServers

When the user opens the repo and presses Enter on the "do you trust this folder?" dialog, the attacker's MCP server starts. From there it can read other projects' source and stored credentials, or open a long-lived outbound connection. On a headless CI runner the trust dialog never appears, so it lands with no human in the loop.
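
Before opening a fresh clone in an agent, you can at least check whether those two files are present. This sketch covers only the paths and the setting named above; the function names are hypothetical, and a real scan would need a longer list of files per agent.

```python
import json
from pathlib import Path

# Files named in the TrustFall setup above; extend for other agents as needed.
AGENT_CONFIG_PATHS = [".mcp.json", ".claude/settings.json"]


def agent_configs_in(repo: Path) -> list[str]:
    """List agent config files present in a checkout, without opening it in an agent."""
    return [rel for rel in AGENT_CONFIG_PATHS if (repo / rel).is_file()]


def auto_enables_project_mcp(repo: Path) -> bool:
    """True if the project settings turn on every project-defined MCP server."""
    settings = repo / ".claude" / "settings.json"
    if not settings.is_file():
        return False
    try:
        data = json.loads(settings.read_text())
    except (OSError, ValueError):
        return True  # unreadable or malformed: treat as suspicious
    return isinstance(data, dict) and data.get("enableAllProjectMcpServers") is True
```

A hit doesn't mean the repo is malicious; plenty of legitimate projects ship these files. It means there is something to read before you press Enter.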

And this isn't a one-off. Check Point Research reported the same class of problem as "project config is processed before the trust prompt": CVE-2025-59536 (RCE through .claude/ hooks or MCP server settings) and CVE-2026-21852 (API key exfiltration by abusing ANTHROPIC_BASE_URL). Both fire on clone-and-open, before you confirm the trust dialog.

AWS Kiro: rewriting the config after the fact

Where TrustFall ships a malicious config up front, the case of Kiro, AWS's agentic IDE, is about rewriting the config later.

Johann Rehberger (Embrace The Red) showed that indirect prompt injection could rewrite:

kiroAgent.trustedCommands: ["*"] in .vscode/settings.json
.kiro/settings/mcp.json

Once trustedCommands contains *, the agent runs arbitrary commands without confirmation. Instructions injected from a web page or an issue quietly edit a local config file, and that turns into arbitrary command execution. It was fixed in Kiro 0.1.42.
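
Detecting that specific state is simple once you know the key. The sketch below assumes the settings file is plain JSON with the flat dotted key shown above; VS Code-style settings files can contain comments, which json.loads would reject, so a real check would need a more tolerant parser. The function name is hypothetical.

```python
import json


def has_wildcard_trust(settings_text: str) -> bool:
    """True if kiroAgent.trustedCommands contains a bare '*' entry."""
    settings = json.loads(settings_text)
    trusted = settings.get("kiroAgent.trustedCommands", [])
    return any(
        isinstance(entry, str) and entry.strip() == "*"
        for entry in trusted
    )
```

This flags only the literal wildcard. A list of very broad patterns can be nearly as permissive without ever containing "*".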

The common thread: config files grant the permission

In all of these, the model never "decided" to do something malicious. What got attacked was the configuration:

hooks
permissions (allow / deny)
MCP allowlists
sandbox flags
trustedCommands

These config files are what decide what an agent is allowed to do. The awkward part is that they take effect when you open the project, not when you read them. The permission is granted before you review anything.

EDR can see the rm -rf that ran, but not the config change that authorized it. The place to defend is the config that allowed the command, not the command itself.

How do you defend it

Two practical moves:

1. Run AI coding agents inside a container or sandbox whenever you can.
2. Watch the config files and notice when one turns dangerous.

Doing #2 by hand doesn't last. Eyeballing .claude/settings.json and .mcp.json every time they change is a process that breaks down.

What I built: Sigil

So I built Sigil, a host-side AI Security Posture Management (AI-SPM) agent.

It watches the config files that decide an agent's permissions (hooks, permissions, MCP allowlists, sandbox flags), scores a config when it turns dangerous, and ships the event to a log or SIEM.

It doesn't block. It scores and records. It tells you "this config changed and the agent can now do X." Actually stopping the action is left to the agent runtime and your existing controls. Because it measures instead of blocking, it doesn't get in a developer's way with false positives.

Demo

A normal config with read-only permissions and no hooks scores 0 / low.
Add a PreToolUse hook with matcher .* that runs rm -rf $HOME, and it re-scores 7.5 / critical (no sandbox, overly broad matcher, destructive command in the hook).

Tech notes

A single static binary (x86_64 musl, plus macOS arm64 and Windows)
File watching with tokio and notify, no polling
One-line install, Apache-2.0

For the record: most of the implementation was vibe-coded with Claude Code. I drove the threat model, the scoring rubric, and the architecture, and let the AI write a lot of the code. Building a tool that watches what coding agents are allowed to do, with a coding agent, was a little funny.

Closing

When an AI coding agent gets attacked, the target isn't the model. It's a config file nobody reviewed. TrustFall, Kiro, and CVE-2025-59536 all hit the same spot.

How are you handling untrusted repository configs today? Sandbox everything, review configs by hand, or just open them and hope?

Repo, demo, and the config-watching details: https://github.com/Ju571nK/sigil