Warwick McIntosh

Posted on Jun 23 • Originally published at productiongrade.substack.com

The Agent Nobody Owns

#ai #aisafety #governance

In mid-2026 a security firm showed that you can take over a developer’s coding agent by filing a bug report. There is no exploit in the usual sense. You find a Sentry ingest key, which is public by design and sits in the front end of half the web, and you POST a fake error into someone’s project. It looks like every other crash. Later the developer tells their agent to go and fix the unresolved Sentry issues. The agent reads your fake error over its Sentry connection, finds the “resolution steps” you wrote, which are markdown dressed up to look exactly like Sentry’s own advice, and runs the npx command inside them with the developer’s credentials.

The researchers, Tenet Security, ran this against more than a hundred real organisations and it worked about 85% of the time. The prompt-level defences did nothing: the agents executed the payload even when their system prompts told them to ignore untrusted data. Sentry’s response was that a proper fix is “not defensible,” so they blocked the specific demo string and left the shape of the problem in place.

You can read that as a Sentry story. I think that misses it. The interesting question arrived afterwards, when people asked whose job it had been to stop it. AppSec owns the application. IT owns the laptop. Nobody owned the thing in the middle that could read an attacker’s text and run code with your keys. We have spent a couple of years learning how to build reliable agents. We have not decided who operates them.

That gap is the subject of this post. It is not an argument against agents, and it is not a security advisory about one tool. It is about the discipline that goes missing when a probabilistic system ships to production with no name attached to it.

The short version: treat an agent as a production service, not a model you prompt. It needs one named owner, a wrong-action budget in place of an uptime SLO, on-call that pages on correctness signals rather than “human in the loop,” and a postmortem process that survives not being able to reproduce the failure. Most of that is SRE practice you already run, pointed at a probabilistic service. The rest of this post is how each piece works, with the flow drawn where it helps.

Build-time is solved. Run-time isn’t.

The build-time playbook is, by now, settled. Give every tool a typed contract so the model cannot call it with garbage. Keep a deterministic gate outside the model, so the LLM proposes and your code disposes. Run evals in CI so a prompt change that regresses gets caught before it ships. Put a human in front of the irreversible actions. This is good engineering and you should do all of it. It is also the baseline, and it answers exactly one question: is the agent built correctly.
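To make "the LLM proposes and your code disposes" concrete, here is a minimal sketch of a deterministic gate with typed tool contracts. The tool names, the contract table and the `NEEDS_HUMAN` set are all hypothetical; the point is only that the check runs in ordinary code, outside the model.

```python
from dataclasses import dataclass

# Hypothetical tool contracts: tool name -> required argument names and types.
TOOL_CONTRACTS = {
    "comment_on_issue": {"issue_id": int, "body": str},
    "close_issue": {"issue_id": int},
}

# Hypothetical set of actions that always need a human in front of them.
NEEDS_HUMAN = {"close_issue"}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def gate(tool: str, args: dict) -> Decision:
    """Deterministic check applied to every tool call the model proposes."""
    contract = TOOL_CONTRACTS.get(tool)
    if contract is None:
        return Decision(False, f"unknown tool: {tool}")
    if set(args) != set(contract):
        return Decision(False, "argument names do not match the contract")
    for name, expected in contract.items():
        if not isinstance(args[name], expected):
            return Decision(False, f"{name} is not {expected.__name__}")
    if tool in NEEDS_HUMAN:
        return Decision(False, "requires human approval")
    return Decision(True, "ok")
```

A proposed `run_shell` call is rejected here simply because nobody ever wrote a contract for it, whatever the prompt said.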

That is the easy half.

It does not answer the other one. When a well-built agent starts giving subtly worse answers at 3am on a Sunday, because a model was updated underneath you or a retrieval index drifted, who finds out, and how? My take is that this is the more expensive question, and it is the one we have spent the least time on. Read the current writing on agents in production and you can see where the attention sits: on architecture and governance, the design-time half. The run-time half, operating the thing after it ships, is usually one sentence that says “monitor it” before the piece moves on.

The reason the run-time half is hard is not that the tooling is immature. It is that nobody is structurally on the hook for it.

The seam: four owners, one behaviour, no accountability

Here is why ownership is so hard for an agent in a way it is not for a normal service. Four different parties own the pieces of its behaviour.

The model vendor owns the weights, and changes them on their schedule, not yours.
Your prompt engineers own the instructions and the system prompt.
Whoever maintains each connected tool owns its API and, crucially, its output.
The platform team owns the harness that wires it all together and runs it.

The agent’s behaviour is the product of all four. So when it misbehaves, the failure lands in the seam between them, where no single team is on the hook. The prompt author says the instructions were fine. The platform team says the harness did what it was told. The tool owner says the API returned a valid response. The vendor says nothing, because you cannot open a ticket against a model update. Everyone is correct, and the agent is still wrong.

I’m not certain four is exactly the right count, and it might be five or six in your shop once you add retrieval and memory, but the shape holds: more owners than the org chart admits, and a behaviour that belongs to none of them.

You can watch the industry flinch away from this. A lot of “agent health” tooling checks that the right files exist rather than that the agent does the right thing. Existence, not behaviour. One widely shared health checker scores an agent for having a death_detector.py on disk, never for whether it has ever caught a death. That is the seam in miniature. We learned to confirm the artifact was built and quietly skipped confirming that anyone owns the behaviour.

The first fix is boring and nearly free: write down who owns what. Not a committee, not “the platform team” in the abstract, but a named accountable owner per agent and an explicit map of the rest. A plain RACI does the job:

The point of the table is not the table. It is the single name that appears in the Accountable column for every row. The agentjacking attack did not need a new class of exploit. It needed an owner who had once sat down and listed what the agent was allowed to touch, and noticed that “read arbitrary error text and run shell commands” was on the list.
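One way to keep the ownership map honest is to hold it as data and check it mechanically. This is a rough sketch: the surface names follow the four owners described earlier, and every person and team name is a placeholder.

```python
# Hypothetical ownership map: one row per surface, RACI roles as keys.
# Every name below is a placeholder.
OWNERSHIP = {
    "model": {"accountable": "dana", "responsible": "vendor"},
    "prompt": {"accountable": "dana", "responsible": "prompt-eng"},
    "tools": {"accountable": "dana", "responsible": "tool-owners"},
    "harness": {"accountable": "dana", "responsible": "platform"},
}


def accountable_owner(ownership: dict) -> str:
    """Return the single accountable name, or raise if there is a gap or a split."""
    names = set()
    for surface, roles in ownership.items():
        owner = roles.get("accountable")
        if not owner:
            raise ValueError(f"no accountable owner for surface: {surface}")
        names.add(owner)
    if len(names) != 1:
        raise ValueError(f"accountability is split across: {sorted(names)}")
    return names.pop()
```

Running a check like this in CI is optional; what matters is that a missing or split Accountable column fails loudly instead of being found during an incident.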

Borrow from SRE, and be honest about where it breaks

We are not the first people who have had to operate something unreliable and keep it honest. SRE has done it for twenty years. Three of its instruments transplant well, and it is worth being straight about where each one breaks against a non-deterministic service.

A wrong-action budget, not an uptime SLO

You cannot put a 99.9% number on a service whose correct output is a distribution rather than a fixed answer. An agent that is “up” can still be steadily, quietly wrong. So translate the idea instead of copying it.

Start by sorting the agent’s actions by blast radius, because not all wrong answers cost the same:

Now you have something you can actually budget. Pick the target rate per class, measure the real rate against a labelled set, and treat the gap as a budget you spend down. When an irreversible-action class burns its budget, you stop shipping changes to that agent and go find out why, the same way an SRE team freezes deploys when they have burned their error budget on outages. The number you choose matters less than the fact that you chose it deliberately, in advance, rather than discovering it during an incident.
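As a sketch of how the budget arithmetic might look, assume three blast-radius classes and target rates that are purely illustrative (pick your own), and a labelled sample of `(action_class, was_wrong)` pairs:

```python
from collections import Counter

# Hypothetical classes and target wrong-action rates; choose your own deliberately.
TARGETS = {"read_only": 0.05, "reversible": 0.01, "irreversible": 0.001}


def budget_report(labelled, targets=TARGETS):
    """Summarise budget spend from (action_class, was_wrong) pairs."""
    total, wrong = Counter(), Counter()
    for action_class, was_wrong in labelled:
        total[action_class] += 1
        wrong[action_class] += int(was_wrong)

    report = {}
    for action_class, target in targets.items():
        n = total[action_class]
        if n == 0:
            continue  # nothing labelled for this class yet
        allowed = target * n  # wrong actions the budget tolerates in this sample
        report[action_class] = {
            "observed_rate": wrong[action_class] / n,
            "budget_remaining": allowed - wrong[action_class],
            "freeze": wrong[action_class] > allowed,
        }
    return report
```

With a 0.1% target, two wrong irreversible actions in roughly a thousand labelled ones already exceed the budget, so `freeze` comes back true for that class.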

The honest caveat: measuring the real wrong-action rate needs a labelled stream of the agent’s outputs, and labelling is work. If you only invest in one piece of run-time infrastructure, make it this, because every other control depends on being able to say how wrong the thing currently is.

On-call, which is not “human in the loop”

This is the instrument the listicles shrink down to “human in the loop.” A human approving a high-stakes action is not the same as a human who gets paged. On-call means a specific person is woken by a specific signal, and for an agent those signals are not CPU and latency. They are correctness signals. Three are worth building first:

Golden transactions. A small, hand-curated set of inputs with known-good outputs that you replay on a schedule. When the agent’s answer to a golden input changes, something moved. It is the cheapest of the three to build, and the closest thing you have to a smoke test.
Shadow diffing. Run the candidate config (new prompt, new model) in parallel with the live one against real traffic, take no action on the shadow, and diff the outputs. A spike in disagreement is your early warning before you promote anything.
Action-distribution drift. Track the histogram of which tools and actions the agent chooses. Agents fail by quietly changing what they do before they change whether they succeed. If “send email” jumps from 2% to 15% of actions overnight, page someone, even if every individual action looks valid.
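The drift signal is the easiest of the three to show in code. A minimal sketch, assuming you already log one action name per agent step; total variation distance is one reasonable choice of metric rather than the only one, and the threshold is a made-up starting point:

```python
from collections import Counter

DRIFT_THRESHOLD = 0.10  # hypothetical; tune against your own history


def action_shares(actions):
    """Turn a list of action names into a {name: share} histogram."""
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no actions recorded in this window")
    return {name: count / total for name, count in counts.items()}


def drift(baseline, current):
    """Total variation distance between two action histograms, in [0, 1]."""
    p, q = action_shares(baseline), action_shares(current)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in set(p) | set(q))


def should_page(baseline, current, threshold=DRIFT_THRESHOLD):
    return drift(baseline, current) > threshold
```

With a two-action histogram, "send email" moving from 2% to 15% gives a distance of 0.13, which clears the illustrative threshold even though each individual email may look valid.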

Wired together, the three signals feed one pager:

None of this fires a pager on its own; you wire these signals into the same alerting path as everything else you run. And keep the expectations sober: the current crop of automated SRE agents resolves only about one scenario in seven on its own. For now, the owner is a person.
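A sketch of that wiring, under some loud assumptions: the agent is any callable from input string to output string, golden outputs are compared by exact match (often too strict for a non-deterministic agent, so swap in your own comparison), and the two thresholds are invented for illustration.

```python
from typing import Callable


def golden_failures(agent: Callable[[str], str], goldens: dict[str, str]) -> list[str]:
    """Replay each golden input and return the inputs whose output changed."""
    return [prompt for prompt, expected in goldens.items() if agent(prompt) != expected]


def page_reasons(
    golden_failed: int,
    shadow_disagreement: float,
    action_drift: float,
    max_disagreement: float = 0.05,  # hypothetical threshold
    max_drift: float = 0.10,         # hypothetical threshold
) -> list[str]:
    """Collapse the three correctness signals into reasons to page the owner."""
    reasons = []
    if golden_failed > 0:
        reasons.append(f"{golden_failed} golden transaction(s) changed")
    if shadow_disagreement > max_disagreement:
        reasons.append(f"shadow disagreement at {shadow_disagreement:.0%}")
    if action_drift > max_drift:
        reasons.append(f"action distribution drifted by {action_drift:.2f}")
    return reasons
```

An empty list means nobody is woken; a non-empty one is the message you would hand to whatever alerting path you already run.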

Change management across four axes

A normal deploy changes one binary. An agent deploy can change the prompt, the model, a tool, or the retrieval index, and those version independently. Treat them as four version axes and write them down on every release:

agent: support-triage
prompt: v14 (changed)
model: opus-4.8 (pinned)
tools: sentry@2.1, github@3.0
retrieval: kb-2026-06-18 (reindexed)

The discipline is simple and almost nobody does it: never change more than one axis at a time without expecting to lose your ability to attribute a regression. If the prompt and the retrieval index both moved and quality dropped, you have no idea which one did it, and you are back to guessing.
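The one-axis rule is easy to enforce mechanically once the versions are written down. A minimal sketch, assuming each release is recorded as a plain dictionary keyed by the four axes; the version strings in the usage note are hypothetical:

```python
AXES = ("prompt", "model", "tools", "retrieval")


def changed_axes(previous: dict, candidate: dict) -> list[str]:
    """List the version axes that differ between two release manifests."""
    return [axis for axis in AXES if previous.get(axis) != candidate.get(axis)]


def check_release(previous: dict, candidate: dict) -> list[str]:
    """Allow at most one changed axis so a regression stays attributable."""
    changed = changed_axes(previous, candidate)
    if len(changed) > 1:
        raise ValueError(f"more than one axis changed: {changed}")
    return changed
```

For example, going from `{"prompt": "v13", "retrieval": "kb-01", ...}` to a candidate that bumps both the prompt and the retrieval index raises, while bumping the prompt alone returns `["prompt"]`.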

The axis you do not control is the model, and that is where blue-green deployment quietly stops working. Blue-green assumes you can keep the old version ready to fall back to. You cannot keep a vendor’s deprecated model ready forever. I don’t have a tidy answer for the day your model is sunset, and I would be wary of anyone who says they do. The honest move is to rehearse the swap before it is forced on you: keep a second model qualified against your golden transactions at all times, so model deprecation is a planned promotion rather than a fire drill.

The postmortem you can’t reproduce

The hardest transplant is the one SRE cares about most, the blameless postmortem. Agents break it, because you often cannot reproduce the incident. The temperature was non-zero, the model was silently updated, the context that produced the bad output has scrolled away. You are asked to write up a failure you cannot make happen again.

The way through is to stop treating the context as disposable. Capture the full input, the retrieved documents, the tool outputs, and the model version at the moment of failure as a first-class artifact, the way you would keep a core dump. If you log nothing else, log enough to answer “what exactly did the model see, and what did it do.” A workable template:

Incident: agent took action X on <date>
Inputs at failure: <prompt + user message + retrieved docs, captured verbatim>
Tool outputs in context: <the raw tool responses the model read>
Versions: prompt v?, model ?, tools ?, retrieval index ?
Reproducible? <usually no, say so>
Which surface failed: <prompt / model / tool output / harness>
Owner of that surface: <name from the RACI>
Action item: <change to that surface, with an owner and a date>
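To have those fields available when the postmortem starts, the capture has to happen at the moment of the action, not afterwards. A rough sketch of such a record follows; the field names mirror the template, and the class itself is hypothetical rather than part of any logging library.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentCapture:
    """Everything the model saw and did, stored verbatim like a core dump."""
    action: str
    prompt: str
    user_message: str
    retrieved_docs: list[str]
    tool_outputs: list[str]
    versions: dict[str, str]  # prompt, model, tools, retrieval index
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep captures diffable across incidents.
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Where you store the JSON is up to you; the design choice that matters is that the raw tool outputs are kept as the model read them, not summarised.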

Then write the action items against the seam rather than against the model. “The model was wrong” is not an action item, because nobody owns it. “Tool output was trusted as instructions, and the tool owner and the platform team had never agreed who sanitises it” is an action item, because it names who does what next. Whether you can write that second sentence is the real test of whether the agent has an owner at all.

A starting point you can finish this week

None of this requires a platform team or a procurement cycle. If you own an agent that is already in production, here is the order I would do it in:

Write the one-line answer to “who is accountable for this agent’s output.” If you can’t, that is the whole finding.
Fill in the RACI table above. It takes an afternoon and it surfaces the seams immediately.
Inventory what the agent can touch, and which of those tools return data an outsider can write to. That list is your real blast radius.
Pick the wrong-action rate for your one irreversible action class, and start labelling outputs against it.
Stand up golden transactions first; shadow diffing and drift alarms can follow.
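The inventory step can start as a few lines of data. A sketch under stated assumptions: the tool names and flags below are invented, and "outsider-writable" means someone outside your organisation can put text into what the tool returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    outsider_writable: bool  # can an outsider put text into this tool's output?
    side_effects: bool       # can this tool change something or run code?


def risky_paths(tools):
    """Pair every attacker-writable source with every tool that can act."""
    sources = [t.name for t in tools if t.outsider_writable]
    sinks = [t.name for t in tools if t.side_effects]
    return [(source, sink) for source in sources for sink in sinks]


# Hypothetical inventory for a triage agent.
INVENTORY = [
    Tool("sentry_read", outsider_writable=True, side_effects=False),
    Tool("kb_search", outsider_writable=False, side_effects=False),
    Tool("shell", outsider_writable=False, side_effects=True),
]
```

Each pair returned by `risky_paths(INVENTORY)` is a route from text an outsider wrote to an action the agent can take, which is the list the accountable owner needs to have looked at.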

Ownership is the cheapest control you have

None of this is exotic. It is the operational hygiene we already apply to every other production service, pointed at a service that happens to think probabilistically and runs on a model you rent rather than own. The only reason it feels new is that we shipped the agents first and asked who owns them second.

So ask it now, before the incident instead of during. Pick the name. Write the RACI. Set the wrong-action budget and decide which signal pages someone. None of those steps needs a new tool or a budget line, which is exactly why I have found them so easy to put off and so hard to justify putting off. They take an afternoon, and they are the difference between an incident with an owner and an incident with an audience.

The agent nobody owns is not the clever one or the fast one. It is the one that hurts you, because when it does, everyone involved will be able to explain, accurately and at length, why it was not their job.

Originally published on Production Grade. I write about operating AI systems in production, the discipline that goes missing between "we built it" and "we run it."
