Justin Kwon

Posted on May 24 • Edited on Jun 21

The real attack surface for AI coding agents is the config file

#ai #rust #security #devsecops

If you think the security risk of AI coding agents (Claude Code, Cursor, Gemini CLI) is "the model goes rogue and runs a dangerous command," the serious incidents from the past few months tell a different story. None of them were really about the model. The starting point was always a config file.

This post walks through TrustFall and AWS Kiro, explains why config files became the attack surface, and introduces the open-source tool I built in response, Sigil.

TrustFall: clone, open, RCE

In May 2026, Adversa AI published TrustFall: cloning a malicious repository and opening it was enough for one-click RCE across Claude Code, Cursor, Gemini CLI, and GitHub Copilot.

The setup is two files in the repo:

.mcp.json pointing at an attacker-controlled MCP server
.claude/settings.json with project-scoped settings like enableAllProjectMcpServers

When the user opens the repo and presses Enter on the "do you trust this folder?" dialog, the attacker's MCP server starts. From there it can read other projects' source and stored credentials, or open a long-lived outbound connection. On a headless CI runner the trust dialog never appears, so the attack lands with no human in the loop.
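To make the two-file pattern concrete, here is a minimal Python sketch of a check you could run on a fresh clone before opening it in an agent. It assumes .mcp.json keeps its servers under a top-level mcpServers key and that enableAllProjectMcpServers is a boolean. The function names are made up for illustration, and this is not how any of the affected tools or Sigil implement it.

```python
import json
import sys
from pathlib import Path


def load_json(path: Path) -> dict:
    """Parse a JSON object from path; return {} if it is missing or invalid."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def trustfall_findings(repo: Path) -> list[str]:
    """Return reasons to review a cloned repo before opening it in an agent."""
    findings = []
    mcp = load_json(repo / ".mcp.json")
    settings = load_json(repo / ".claude" / "settings.json")

    # Assumption: project MCP servers live under a top-level "mcpServers" object.
    servers = mcp.get("mcpServers")
    if isinstance(servers, dict) and servers:
        names = ", ".join(sorted(servers))
        findings.append(f".mcp.json declares MCP servers: {names}")

    # The project-scoped setting named in the TrustFall write-up.
    if settings.get("enableAllProjectMcpServers") is True:
        findings.append(".claude/settings.json auto-approves project MCP servers")

    return findings


if __name__ == "__main__":
    target = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
    for finding in trustfall_findings(target):
        print("REVIEW:", finding)
```

Either finding alone can be legitimate. Together, in a repo you did not write, they are the combination described above.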

And this isn't a one-off. Check Point Research reported the same class of problem as "project config is processed before the trust prompt": CVE-2025-59536 (RCE through .claude/ hooks or MCP server settings) and CVE-2026-21852 (API key exfiltration by abusing ANTHROPIC_BASE_URL). Both fire on clone-and-open, before you confirm the trust dialog.

AWS Kiro: rewriting the config after the fact

TrustFall ships a malicious config up front. The attack on Kiro, AWS's agentic IDE, goes the other way: it rewrites the config after the fact.

Johann Rehberger (Embrace The Red) showed that indirect prompt injection could rewrite:

kiroAgent.trustedCommands: ["*"] in .vscode/settings.json
.kiro/settings/mcp.json

Once trustedCommands contains *, the agent runs arbitrary commands without confirmation. Instructions injected through a web page or an issue get the agent to quietly edit a local config file, and that edit turns into arbitrary command execution. It was fixed in Kiro 0.1.42.
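The dangerous state is easy to test for once you know the key. A minimal Python sketch, assuming the setting is stored as a flat "kiroAgent.trustedCommands" key holding a list of strings (the usual VS Code settings layout). The function name is hypothetical.

```python
import json


def trusts_every_command(settings_text: str) -> bool:
    """Return True if trustedCommands contains a bare "*" wildcard.

    Raises ValueError when the text is not strict JSON, so the caller
    can treat an unreadable file as "unknown" instead of "safe".
    """
    settings = json.loads(settings_text)
    if not isinstance(settings, dict):
        return False
    # Assumption: the setting is a flat dotted key, as in VS Code settings.
    commands = settings.get("kiroAgent.trustedCommands", [])
    if not isinstance(commands, list):
        return False
    return any(isinstance(c, str) and c.strip() == "*" for c in commands)
```

VS Code tolerates comments in settings.json and strict JSON parsing does not, so a real check would need a comment-tolerant parser. The sketch deliberately raises on such a file so that it is never reported as clean.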

The common thread: config files grant the permission

In all of these, the model never "decided" to do something malicious. What got attacked was the configuration:

hooks
permissions (allow / deny)
MCP allowlists
sandbox flags
trustedCommands

These config files decide what an agent is allowed to do. The awkward part is that they take effect when you open the project, not when you read them, so the permission is granted before you review anything.

EDR can see the rm -rf that ran, but not the config change that authorized it. The place to defend is the config that allowed the command, not the command itself.

How do you defend it?

Two practical moves:

1. Run AI coding agents inside a container or sandbox whenever you can.
2. Watch the config files and notice when one turns dangerous.

Doing #2 by hand doesn't last. Eyeballing .claude/settings.json and .mcp.json every time they change is a process that breaks down.
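The smallest automated version of "watch the config files" is a hash snapshot you compare before and after an agent session. A minimal Python sketch follows. The watched paths are the ones named in this post, the function names are made up, and comparing snapshots is a stand-in for event-based file watching, not a description of how Sigil works.

```python
import hashlib
from pathlib import Path

# Config paths mentioned in this post; extend for the agents you use.
WATCHED = [
    ".mcp.json",
    ".claude/settings.json",
    ".vscode/settings.json",
    ".kiro/settings/mcp.json",
]


def snapshot(repo: Path) -> dict[str, str]:
    """Map each watched file that exists to the SHA-256 of its contents."""
    digests = {}
    for rel in WATCHED:
        path = repo / rel
        if path.is_file():
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return watched paths that were added, removed, or modified."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Take a snapshot before the session and another after it. Anything changed() returns is a config edit that somebody should look at.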

What I built: Sigil

So I built Sigil, a host-side AI Security Posture Management (AI-SPM) agent.

It watches the config files that decide an agent's permissions (hooks, permissions, MCP allowlists, sandbox flags), scores a config when it turns dangerous, and ships the event to a log or SIEM.

It doesn't block. It scores and records. It tells you "this config changed and the agent can now do X." Actually stopping the action is left to the agent runtime and your existing controls. Because it measures instead of blocking, a false positive costs you a log entry, not a blocked developer.
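To show what "scores and records" can look like on the wire, here is a minimal Python sketch that turns one scored config change into a single JSON line, a format most log shippers and SIEMs can ingest. The field names are hypothetical and are not Sigil's event schema.

```python
import json
from datetime import datetime, timezone


def config_event(path: str, score: float, severity: str, reasons: list[str]) -> str:
    """Serialize one scored config change as a single JSON line."""
    record = {
        # Hypothetical field names, chosen for illustration only.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "path": path,
        "score": score,
        "severity": severity,
        "reasons": reasons,
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(config_event(".claude/settings.json", 7.5, "critical", ["no sandbox"]))
```

Nothing here blocks anything. The event just lands wherever your other security telemetry already goes.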

Demo

A normal config with read-only permissions and no hooks scores 0 / low.
Add a PreToolUse hook with matcher .* that runs rm -rf $HOME, and the same config re-scores at 7.5 / critical (no sandbox, overly broad matcher, destructive command in the hook).
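The three reasons in that second result can each be detected with a simple rule. The Python sketch below is a toy version that returns the reasons only. It assumes Claude Code's hooks layout (hooks, then PreToolUse, then groups with a matcher and a list of command hooks), takes the sandbox state as a plain argument instead of parsing it, and does not reproduce Sigil's scoring rubric or weights.

```python
import re

# Matchers that apply a hook to every tool call.
BROAD_MATCHERS = {"*", ".*"}
# Deliberately narrow pattern: only "rm -rf" / "rm -fr" for this illustration.
DESTRUCTIVE = re.compile(r"\brm\s+-(rf|fr)\b")


def hook_findings(settings: dict, sandboxed: bool) -> list[str]:
    """Return the reasons a hooks config looks dangerous (empty if none)."""
    findings = []
    hooks = settings.get("hooks", {})
    groups = hooks.get("PreToolUse", []) if isinstance(hooks, dict) else []
    for group in groups:
        matcher = group.get("matcher")
        if matcher in BROAD_MATCHERS:
            findings.append(f"overly broad matcher {matcher!r}")
        for hook in group.get("hooks", []):
            command = hook.get("command", "")
            if DESTRUCTIVE.search(command):
                findings.append(f"destructive command in hook: {command}")
    # A missing sandbox only counts once something risky was found.
    if findings and not sandboxed:
        findings.append("no sandbox")
    return findings
```

A real rubric needs weights, more patterns, and coverage of the other hook events. The sketch only shows that each reason maps to a concrete, checkable property of the file.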

Tech notes

A single static binary (x86_64 musl, plus macOS arm64 and Windows)
File watching with tokio and notify, no polling
One-line install, Apache-2.0

For the record: most of the implementation was vibe-coded with Claude Code. I drove the threat model, the scoring rubric, and the architecture, and let the AI write a lot of the code. Building a tool that watches what coding agents are allowed to do, with a coding agent, was a little funny.

Closing

When an AI coding agent gets attacked, the target isn't the model. It's a config file nobody reviewed. TrustFall, Kiro, and CVE-2025-59536 all hit the same spot.

How are you handling untrusted repository configs today? Sandbox everything, review configs by hand, or just open them and hope?

Repo, demo, and the config-watching details: https://github.com/Ju571nK/sigil

DEV Community