When you ship a normal service, security review has an anchor: the diff. Someone opens a pull request, someone reads it, and the thing that runs in production is the thing that got reviewed.
Now put an autonomous agent in production. It plans, calls tools, and changes state, often without a human approving each action. Ask the obvious question — where's the PR for what it just did? — and there isn't one. The agent didn't ship the action in a commit. It decided it at runtime.
So the review you're used to doing is aimed at the wrong artifact. Let me try to point it at the right one.
The agent is a configuration, not the code that shipped
Here's the load-bearing observation: an autonomous agent is not code. It's a runtime configuration of infrastructure — a container, a harness (the loop that runs the model and calls tools on its behalf), a system prompt, a tool surface, memory, an identity, and a network egress policy.
The model is mostly fixed. Everything around it is not. Two agents built from the same model with the same task description can behave completely differently depending on how those parts are configured — what tools they can reach, what the system prompt tells them to do, what memory they carry, what the network will let out. The security-relevant artifact is the running configuration, and a review aimed at your application code walks right past it.
Once you see the agent as a config, the question "what do you review?" has an answer: you review the config.
Treat the system prompt and harness as versioned, diff-reviewed artifacts
Most teams treat the system prompt as settings — a text box, editable in a dashboard, changed by whoever has access. That's the problem. The system prompt encodes the task, the rules for refusing requests, and the shape of the output. A one-line change to it can quietly remove a guardrail. An editable-in-prod system prompt is an unreviewed, unattributed code path with full influence over the agent's behavior.
The incidents bear this out. The "Rules File Backdoor" weaponized editable rules files in Cursor and GitHub Copilot using invisible Unicode. DPD's support bot started swearing at customers after a guardrail-removing system update. NYC's MyCity bot told landlords they could refuse Section 8 tenants — illegal advice, live for weeks. None of these were model failures. They were configuration changes that nobody reviewed because nobody treated the configuration as something you review.
So treat it like code. Put the system prompt, the harness config, hooks, and the MCP server list in version control. Change them only through review. Diff them. Now a guardrail removal shows up as a reviewable change with an author on it, instead of a silent edit in a prod console.
Freeze the config per release, and pin it into identity
If the config is the artifact, then a change to the config is a new build — and you want that to be visible. The practical move is to take a content hash over the deployed configuration — container digest, harness version, system prompt version, model identifier, settings — and make that hash the agent's type identity (BRACE calls it the agent-type-id). Now any change to any of those parts produces a different identity. You can't quietly swap a prompt and keep the same name. The new config is a new build, and your logs say so.
Detect drift at agent granularity
Infrastructure-as-code drift detection is a control your platform team probably already runs. The catch is that it usually runs at the host level, and the changes that matter for an agent hide one level down.
So point it at the agent's surface specifically: container images, harness configurations, MCP server lists, system prompts, and per-agent network egress policy — not just the host's. An MCP server quietly added to one agent's list (see the Postmark-MCP malicious-package incident) won't trip a host-level check. A drift check scoped to that agent's configuration will.
Capture the config-adjacent observables
Two things tell you what configuration actually ran, and both are easy to drop on the floor:
- Decision-time context size — how much context the model had in front of it when it acted. The same agent behaves differently with a near-empty context than a near-full one.
- The parent-passed prompt — in multi-agent setups, what the calling agent actually handed down. That's part of the effective configuration of the child, and it's invisible if you only log the child's own prompt.
If an action goes wrong, these are often the difference between "we can see exactly what config was in play" and "we're guessing."
None of this requires new infrastructure. Version control, content hashing, IaC drift detection, and structured logging are things you already run. What changes is where you point them — at the agent's configuration, which is the artifact that actually decides what the agent does.
This is the Configuration concern in BRACE, an open framework for agent security; the guide goes deeper on each control.
One honest question to leave with: do you version and diff-review your agents' system prompts — or are they editable runtime config that anyone with console access can change without a trace? The answer tells you whether there's anything to review at all.
Top comments (0)