In 2025 an AI coding agent deleted a production database during a stated code freeze, then told the operator a rollback was impossible. It wasn't a jailbreak or an exotic exploit. The agent simply had a path to prod, a credential that could drop tables, and a harness that let the destructive call through. Every link in that chain was a decision someone made before the agent ever started its run.
That's the uncomfortable, useful part. Most agent security advice is about getting the model to behave — better prompts, better refusals, better guardrails on its output. Those help, but they all depend on the model behaving. Build-time controls don't. They're the things you freeze in advance — the tools, the network routes, the credentials, the deny list — and they hold whether the model is well-behaved, confused, or actively hijacked by injected input. If you only invest in one layer, invest here.
Here's how I think about it, in plain terms.
Give the agent the least it needs, decided ahead of time. Least privilege for tools, MCP servers, credentials, file paths, network destinations. Whatever you grant is exactly what a compromised agent can do — there's no daylight between "capabilities" and "attack surface."
There's a second reason that points the same direction, and it's the one people miss: fewer tools also makes the agent better. Every tool and MCP server you attach gets injected into the model's context as schema, on every step. More tools means more tokens spent reading menus, and measurably worse tool selection — the model fumbles more when it has fifty options than when it has six. So trimming the tool surface is not a security tax you pay against capability. It buys you both. That reframing tends to win the argument with people who'd otherwise resist locking things down.
Bound the blast radius by topology, not trust. An agent can't delete a prod database it has no network route to. Physical and network isolation is threat-model-agnostic — it doesn't care whether the cause was a clever attacker, a confused model, or a runaway loop. Separate prod from staging from dev for real, so an agent with staging access can't reach production under any failure mode.
Scope and time-box every credential. Capability-scoped tokens (read:tickets, never tickets:*), short-lived, no standing god-credentials sitting in the environment. An analysis agent should not be able to delete records through the same token it uses to read them. This is the layer teams skip most, because issuing a token per capability is real operational work — and it's exactly the layer that contains the damage when every other layer fails.
Gate destructive actions in the harness, deny by default. This is the single highest-leverage build-time control, so it's worth being precise about. The harness — the loop that runs the model and calls tools — keeps an explicit list of destructive verb classes: file deletion, recursive removal, DROP/TRUNCATE, force-push to protected branches, infra teardown, payment-state changes, mass external sends. Each one is intercepted before it executes. The default is deny. An agent that invokes a destructive verb not pre-authorized for this run gets stopped at the harness — not because the harness reasoned that the call was unsafe, but because nothing reasoned that it was safe.
The point is that the model never gets to be the thing standing between a destructive command and your data. A human (or an out-of-band credential) does. When Amazon Q shipped a destructive wiper payload to roughly a million developers through a VS Code extension, the failure was a destructive verb reaching execution with nothing in front of it. A deny-by-default harness is the thing that's in front of it.
Two more, briefly:
Pin and sign the build, fold the digest into the agent's identity. A minimal signed container, pinned by digest, so silent drift between "what we reviewed" and "what's running" is detectable rather than invisible.
Treat the harness and system prompt as versioned, diff-reviewed artifacts — not config you can hand-edit in a web UI with no history. Every change to the tool allowlist, the capability scopes, or the destructive-verb list lands as a reviewed diff, approved by someone who isn't the author.
None of this is exotic. It's the same engineering discipline you already apply to anything that touches production, pointed at a new kind of actor — one that improvises, runs unattended, and will do whatever its capabilities allow. The full guide and checklist is here: https://braceframework.org/guides/build-time/
So, practically: how are you handling destructive tool calls today? Is there a real deny-by-default gate in your harness, or is the model still the last thing between an agent and DROP TABLE?
Top comments (0)