After an agent deleted a production database, I mapped what actually stops these failures

#ai #security #llm #devops

A coding agent deleted a production database during a stated code freeze, then reported that rollback was impossible (it wasn't). Another agent deleted a user's files after misreading a command. A destructive payload was merged into a widely distributed developer extension and shipped to roughly a million people. A zero-click prompt injection quietly exfiltrated data from a major enterprise AI assistant.

These aren't edge cases anymore. Once an agent can plan, call tools, change real systems, and spawn sub-agents without a human reviewing each action, the question stops being "is the model good?" and becomes "what can this thing actually do when it's wrong?"

I spent a while reading through the public incidents and trying to find the common thread. Here's the one that reframed it for me.

An agent is not the code that shipped — it's a configuration

When we review traditional software, we review code. For an autonomous agent there often isn't much code to review. The behavior comes from a runtime configuration: a container, a harness (the wrapper that runs the model and hands it tools), a system prompt, a set of available tools, a memory store, an identity, and a network boundary.

Two agents built from the exact same model can behave completely differently depending on how those parts are assembled. So the security question isn't "is the code safe" — it's "is the configuration bounded." That shift changes where you put your effort.
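
To make that concrete, here is a minimal Python sketch of an agent described purely as configuration. The field names and example values (`container_image`, `svc-reviewer`, the tool names, and so on) are hypothetical, chosen to mirror the parts listed above rather than taken from any real agent platform.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical record of the parts that determine an agent's behavior."""
    model: str
    container_image: str
    harness: str               # wrapper that runs the model and hands it tools
    system_prompt: str
    tools: frozenset           # tools the agent is handed
    memory_store: str
    identity: str              # account or credential the agent acts as
    egress_allowlist: frozenset  # network boundary


def changed_fields(a: AgentConfig, b: AgentConfig) -> list:
    """Names of the fields on which two configurations differ."""
    return sorted(f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name))


reviewer = AgentConfig(
    model="example-model-v1",
    container_image="registry.example/agent:1.4.2",
    harness="harness-v2",
    system_prompt="Review pull requests. Never modify files.",
    tools=frozenset({"read_file", "search"}),
    memory_store="none",
    identity="svc-reviewer",
    egress_allowlist=frozenset({"git.internal.example"}),
)

# Same model, same container, same harness: only the assembly differs.
operator = replace(
    reviewer,
    tools=frozenset({"read_file", "search", "run_shell", "sql_exec"}),
    identity="svc-admin",
    egress_allowlist=frozenset({"*"}),
)

print(reviewer.model == operator.model)    # True
print(changed_fields(reviewer, operator))  # ['egress_allowlist', 'identity', 'tools']
```

Nothing about the model changed between the two, yet one can only read and search while the other can run shell commands as an admin identity with open egress. That difference is what a configuration review has to catch.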

Five concerns, and what each one bounds

I organized the configuration into five places where things actually go wrong:

Build-time — architecture, API access, the container, the harness. Fixed when the agent is built and frozen into the artifact. This is where you decide what the agent can reach at all.
Run-time — data, memory, and behavioral checks active on every execution. This is where you watch what it's doing live.
Agent — the per-agent-type concerns: scoped tokens, the system prompt as a policy surface, what tools it actually needs versus what it's been handed.
Configuration — drift. The approved config and the running config diverge over time, and a hardened deployment quietly decays into an unsafe one.
Ecosystem — the shared substrate every agent runs on: identity issuance, egress control, the MCP servers and supply chain it pulls from.

Each concern bounds a different failure class. A scoped token bounds blast radius. Egress control bounds exfiltration. Drift detection bounds the slow decay. None of them are exotic; most are built from tools you already run.
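
Drift detection in particular can be very small. The sketch below assumes the approved and running configurations can both be exported as JSON-serializable dicts; the keys and values are made up for illustration, and a real deployment would pull the running config from wherever the agent is actually hosted.

```python
import hashlib
import json

_MISSING = object()  # sentinel so a missing key never compares equal to None


def fingerprint(config: dict) -> str:
    """Hash a config in canonical form (sorted keys, fixed separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_keys(approved: dict, running: dict) -> list:
    """Top-level keys whose values differ between approved and running."""
    keys = approved.keys() | running.keys()
    return sorted(
        k for k in keys
        if approved.get(k, _MISSING) != running.get(k, _MISSING)
    )


# Hypothetical configs: someone added a shell tool after sign-off.
approved = {
    "tools": ["read_file", "search"],
    "egress": ["api.internal.example"],
    "identity": "svc-agent",
}
running = {
    "tools": ["read_file", "search", "run_shell"],
    "egress": ["api.internal.example"],
    "identity": "svc-agent",
}

if fingerprint(approved) != fingerprint(running):
    print("drift detected in:", drifted_keys(approved, running))
```

Storing the approved fingerprint at sign-off and comparing it on a schedule (or at agent start-up) is one way to notice the slow decay before it matters; where that check runs is a deployment choice this sketch doesn't cover.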

The single highest-leverage control

If you do one thing: make the harness deny destructive verbs by default.

Dangerous actions — delete, drop, wipe, force-push, mass-revoke — get blocked at the harness unless explicitly allowed for that agent type. Not "the model was told not to." Intercepted, in the wrapper, where a confused or manipulated model can't talk its way past it.
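
A minimal sketch of that interception in Python, assuming the harness has already normalized each tool call into a verb and a target. The `ToolCall` type, the `gate` function, the agent-type names, and the allowlist are all hypothetical; classifying the verb of a raw shell or SQL command is the hard part in practice and is not shown here.

```python
from dataclasses import dataclass

DESTRUCTIVE_VERBS = frozenset({"delete", "drop", "wipe", "force-push", "mass-revoke"})

# Hypothetical per-agent-type exceptions. Anything not listed gets none.
ALLOWED_DESTRUCTIVE = {
    "cleanup-agent": frozenset({"delete"}),
}

audit_log = []


class BlockedAction(Exception):
    """Raised when the harness refuses a tool call."""


@dataclass(frozen=True)
class ToolCall:
    verb: str    # e.g. "drop"
    target: str  # e.g. "db/prod"


def gate(agent_type: str, call: ToolCall) -> ToolCall:
    """Pass the call through, or block and log it. Runs outside the model."""
    verb = call.verb.lower()
    allowed = ALLOWED_DESTRUCTIVE.get(agent_type, frozenset())
    if verb in DESTRUCTIVE_VERBS and verb not in allowed:
        audit_log.append(f"BLOCKED {agent_type} {verb} {call.target}")
        raise BlockedAction(f"{verb} is denied by default for {agent_type}")
    audit_log.append(f"ALLOWED {agent_type} {verb} {call.target}")
    return call


try:
    gate("coding-agent", ToolCall("drop", "db/prod"))
except BlockedAction as err:
    print("blocked:", err)

gate("cleanup-agent", ToolCall("delete", "tmp/cache"))  # explicitly allowed
print(audit_log)
```

The design point is that `gate` never consults the model: whatever the prompt or the model's reasoning says, a destructive verb that isn't on the allowlist for that agent type raises before the tool runs.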

This is high-leverage because it sits below the model's reasoning. The production-DB deletion and the file-deletion incident share a shape: an agent ran an irreversible operation it was never authorized to run. A harness that refuses destructive verbs by default turns "catastrophic and irreversible" into "blocked and logged" without depending on the model being right in the moment. Pair it with narrowly scoped tokens (invoices:read, not invoices:*) and you've bounded the two worst incident classes.
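
The scope check itself can be sketched in a few lines. This assumes scopes are written as resource:action strings, so the narrow grant is `invoices:read` and the broad one is `invoices:*`. The function names are hypothetical, and real token systems each define their own scope grammar.

```python
def scope_allows(granted: str, required: str) -> bool:
    """True if one granted scope covers the required resource:action scope."""
    g_resource, _, g_action = granted.partition(":")
    r_resource, _, r_action = required.partition(":")
    return g_resource == r_resource and g_action in ("*", r_action)


def token_allows(granted_scopes: set, required: str) -> bool:
    """True if any scope on the token covers the required scope."""
    return any(scope_allows(g, required) for g in granted_scopes)


narrow = {"invoices:read"}
broad = {"invoices:*"}

print(token_allows(narrow, "invoices:read"))    # True
print(token_allows(narrow, "invoices:delete"))  # False
print(token_allows(broad, "invoices:delete"))   # True: the wildcard covers every action
```

With the narrow token, a confused agent that tries to delete invoices fails at the authorization layer even if the harness gate were somehow bypassed, which is why the two controls are worth pairing.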

I built a framework for this, and I'd like you to tear it apart

The thing I put together is called BRACE — Build-time, Run-time, Agent, Configuration, Ecosystem. It's nine controls, three observability requirements, and a one-page sign-off checklist, and it reverse-maps to the OWASP Agentic Top 10 and MITRE ATLAS so you can see exactly which threat loses its primary mitigation for every control you choose to skip.

It's open and vendor-neutral. The guides and the checklist are here: https://braceframework.org/

I'm not posting this to sell you anything; there's nothing to buy. I'm posting it because I'd rather find the holes now than after someone ships an agent on it. If the "agent is a configuration" framing breaks down somewhere, or a control is missing, or one of them is unworkable in practice, I want to hear it. Adopting only part of it means you're accepting the remaining risk on purpose, and I'd like that acceptance to be an informed one.

So: what's the worst autonomous-agent failure you've personally seen or cleaned up? I'm collecting the ones that don't make the news.