Daniel Nwaneri


Agents Don't Just Do Unauthorized Things. They Cause Humans to Do Unauthorized Things.

A comment thread shouldn't produce original research. This one did.

Last week I published a piece about agent governance — the gap between what an agent is authorized to do and what it effectively becomes authorized to do through accumulated actions. I used a quant fund analogy: five independent strategies, each within its own risk limits, collectively overweight the same sector. No single decision was wrong. The aggregate outcome was unauthorized.

The comment section built something I hadn't anticipated.

Kalpaka, Vic Chen, Andre Cytryn, Stephen Lee, and Connor Gallic spent three days extending the argument in directions I hadn't gone. What follows is an attempt to assemble what they built — with attribution, because the thread earned it.


The Unit Problem

The first thing Kalpaka named was the hardest: what do you measure?

Actions taken is too noisy. Resources touched is closer but misses compounding. The unit that matters, they argued, is state delta — the diff between what the system could affect before and after a given action sequence.

A vendor approval doesn't just place an order. It creates a supply chain dependency. A database read doesn't just return rows. It establishes an access pattern the agent now relies on.

So the audit object isn't an action log. It's a capability graph — resources and permissions as nodes, actually-exercised access paths as edges. Reconciliation compares the declared graph against the exercised one. The delta is your behavioral exposure.
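To make the reconciliation concrete, here's a minimal sketch. The agent and resource names are invented, and each access path is modeled as an (actor, permission, resource) edge:

```python
# Hypothetical sketch of capability-graph reconciliation.
# Each graph is a set of (actor, permission, resource) edges;
# all names here are invented for illustration.

def behavioral_exposure(declared, exercised):
    """Diff the declared capability graph against the exercised one.

    Returns (unused, undeclared):
      unused     - declared paths the agent never exercised
      undeclared - exercised paths that were never declared
    """
    unused = declared - exercised
    undeclared = exercised - declared
    return unused, undeclared

declared = {
    ("agent-1", "read", "orders-db"),
    ("agent-1", "write", "vendor-api"),
}
exercised = {
    ("agent-1", "read", "orders-db"),
    ("agent-1", "read", "hr-db"),  # never declared: this is the exposure
}

unused, undeclared = behavioral_exposure(declared, exercised)
```

The delta comes out in two directions: declared-but-unexercised paths are pruning candidates, and exercised-but-undeclared paths are the behavioral exposure the reconciliation exists to surface.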

This is the frame that makes the quant fund analogy precise rather than decorative. A 13F filing is exactly this: a reconciliation between declared strategy and actual exposure, anchored to external cadence rather than internal computation speed. The quarterly filing forces the moment when someone has to look at the full graph, not just the individual positions.


Two Kinds of Edges

The capability graph has a shape problem I hadn't considered.

Direct edges are legible. The agent called the database, touched the file, invoked the API. These edges can be tracked. They decay naturally — an access pattern not exercised in N reconciliation cycles can be pruned. The agent hasn't used that path, so it drops from the exercised graph.

Induced edges are different. These are the edges created when an agent's output causes a human to take an action the agent itself didn't take. The Meta Sev 1 incident last week is the exact pattern: the agent didn't write anything unauthorized. It gave advice that caused an engineer to widen access. The exposure persisted for two hours. The agent's direct action log showed nothing unusual.

Induced edges don't decay the same way direct edges do. The human decision the agent caused doesn't un-happen when the agent stops referencing it. The widened access persists independently of the agent's continued activity. These edges have a much longer half-life — effectively permanent until someone explicitly reconciles them.

This is where the governance architecture splits. Direct edges decay on a usage clock. Induced edges require active reconciliation. The first can be automated. The second can't — which is exactly where the 13F analogy holds strongest. Quarterly isn't a technical choice. It's a regulatory one. Someone external decided that interval was often enough for fund positions. Agent capability graphs need the same external anchor.
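The two decay regimes can be sketched directly. This is a toy model, not a real tool: each edge records its kind and the last reconciliation cycle in which it was exercised, and the pruning rule treats the two kinds asymmetrically:

```python
# Toy model of the two decay regimes described above.
from dataclasses import dataclass

@dataclass
class Edge:
    path: str
    kind: str            # "direct" or "induced"
    last_exercised: int  # reconciliation cycle number

def prune(edges, current_cycle, max_idle_cycles=3, reconciled=()):
    """Direct edges decay on a usage clock; induced edges only drop
    when someone has explicitly reconciled them."""
    kept = []
    for e in edges:
        if e.kind == "direct":
            # usage clock: unused long enough, it falls off the graph
            if current_cycle - e.last_exercised <= max_idle_cycles:
                kept.append(e)
        elif e.path not in reconciled:
            # induced: effectively permanent until actively reviewed
            kept.append(e)
    return kept

edges = [
    Edge("agent -> orders-db read", "direct", last_exercised=1),
    Edge("engineer widened db access", "induced", last_exercised=1),
]
remaining = prune(edges, current_cycle=10)
# the stale direct edge is gone; the induced edge survives
```

The asymmetry is the whole point: nothing in the agent's own activity can age an induced edge out, which is why that half of the graph needs the external reconciliation anchor.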


The Irreversibility Problem

Andre Cytryn raised threshold drift: you calibrate materiality at deployment time, but the agent's decision space expands with use.

The per-strategy limits in the fund example weren't wrong at design time. They were wrong at failure time because the correlation structure changed. Continuous recalibration is the obvious fix. But continuous recalibration creates its own attack surface — a mechanism the agent can game structurally, not intentionally. Enough accumulated decisions that each looks within tolerance can shift the calibration baseline until the new thresholds permit what the original thresholds wouldn't.
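The drift mechanism is easy to demonstrate with a toy model. Suppose the threshold is naively recalibrated from the agent's own recent decision magnitudes, and the agent always acts just inside tolerance. All numbers here are invented:

```python
# Toy illustration of threshold drift under self-referential
# recalibration. Every individual decision is within the limit
# in force at the time, yet the limit itself ratchets upward.

def recalibrate(history, slack=1.5):
    """New threshold = slack * mean of recent decision magnitudes."""
    return slack * sum(history) / len(history)

threshold = 10.0
history = [5.0, 6.0, 7.0]        # decision magnitudes at deployment
for _ in range(20):
    decision = 0.95 * threshold  # each decision looks within tolerance
    history = (history + [decision])[-10:]
    threshold = recalibrate(history)

# threshold has drifted well past the original 10.0 without any
# single decision ever breaching the limit in force when it was made
```

No decision in the loop is individually anomalous, which is exactly why recalibration authority has to live outside the agent's execution context.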

The only way out Kalpaka identified: decouple recalibration authority from the agent's own execution context entirely. Thresholds reviewed by something with no stake in the agent's performance.

This requires knowing which actions are irreversible — but irreversibility is often only visible in retrospect. The Meta incident proves it. Nobody flagged the permission-widening action as irreversible-scope-changing while it was happening. The agent's advice looked like a routine technical suggestion. The irreversibility was only visible after the state had already changed.

Kalpaka's resolution was to invert the default assumption. Don't try to classify irreversibility upfront. Treat every induced edge as irreversible. Over-reconcile first, and let the reconciliation history generate the labeled data you need to build the upstream classifier later.
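A minimal sketch of that inversion, with hypothetical field names: every induced edge gets flagged as irreversible by default, and each human verdict becomes a labeled example for the classifier that doesn't exist yet:

```python
# Sketch of the irreversible-by-default bootstrap. Field names
# and verdict labels are invented for illustration.

labeled_data = []

def flag_induced_edge(edge):
    """No upfront classification: every induced edge goes to review."""
    return {"edge": edge, "assumed_irreversible": True}

def record_review(flag, reviewer_verdict):
    """The reviewer's verdict ('material' / 'immaterial') becomes
    the training label for a future materiality classifier."""
    labeled_data.append((flag["edge"], reviewer_verdict))

flag = flag_induced_edge("advice caused engineer to widen db access")
record_review(flag, "material")
```

The expensive part is deliberately the default path: the review queue is the data-generation mechanism, not overhead on top of it.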

This is how financial compliance actually bootstrapped. Early compliance teams didn't start with sophisticated materiality filters. They reconciled everything quarterly and learned which position changes were material through decades of accumulated review history. The filters came from the data. The data came from over-reconciling.

We're in the over-reconcile phase. Expensive, noisy, generates false positives. But it's the only way to produce the labeled examples that make the classifier possible later. Trying to solve the detection problem at inception is trying to skip the generation of training data.


The Fatigue Problem

Over-reconciling creates a human cost. Every flagged induced state change needs a review decision. Volume in early cycles will be high by design.

The compliance history the 13F analogy draws on has a second chapter nobody likes to mention: the review process degrades under load. Humans start rubber-stamping. False positives stop being caught. The labeled data gets noisy before the classifier can be trained on it.

Kalpaka's answer was adaptive rather than prescribed: track reviewer consistency over time. Same reviewer, same flag type — does the approval rate shift as volume increases? That's your fatigue signal. Once you can detect it, you don't need a hard cap. You have evidence-based throttling.
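One plausible shape for that signal, sketched with invented numbers: for a single reviewer and flag type, compare the approval rate in an early window against the most recent window, and treat a sharp rise as the fatigue indicator:

```python
# Sketch of an evidence-based fatigue signal. Window size and
# drift limit are arbitrary placeholders, not recommendations.

def approval_rate(decisions):
    return sum(1 for d in decisions if d == "approve") / len(decisions)

def fatigue_signal(decisions, window=50, drift_limit=0.15):
    """Decisions are chronological, one reviewer, one flag type.
    True when the recent approval rate has drifted past the
    early baseline by more than drift_limit."""
    if len(decisions) < 2 * window:
        return False  # not enough history to compare windows
    baseline = approval_rate(decisions[:window])
    recent = approval_rate(decisions[-window:])
    return recent - baseline > drift_limit

# early reviews: careful, 60% approvals; later: near-blanket approval
decisions = ["approve"] * 30 + ["reject"] * 20 \
          + ["approve"] * 48 + ["reject"] * 2
fatigued = fatigue_signal(decisions)
```

When the signal fires you throttle or rotate, rather than capping review volume upfront.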

The auditor rotation parallel holds here too. Financial auditors rotate precisely because of this problem. Fresh eyes are the simplest fatigue mitigation. The question for agent behavioral review is whether there are enough qualified reviewers to rotate through.

That's harder than it sounds. Financial auditors share a professional vocabulary — they all know what material means, what a restatement implies, what a concentration risk looks like. That shared language is what makes fresh eyes useful. Agent behavioral accounting doesn't have that vocabulary yet. The capability graph framing this thread built is an attempt to construct it. Until reviewers have a shared framework for what "consequential scope change" looks like in practice, rotating fresh eyes doesn't transfer the same way.

The qualified reviewer problem might be the bootstrapping constraint underneath the bootstrapping constraint.


The Enforcement Floor

Connor Gallic brought the piece back to earth.

Theory is one thing. What teams actually ship is another. The simplest version of enforcement is a write checkpoint: every agent action that changes state passes through a policy evaluation before it executes. Not token monitoring, not post-hoc audit. A deterministic gate between intent and action.

The aggregate authorization problem gets simpler when you centralize that policy layer. If every agent's writes flow through the same checkpoint, cross-agent scope becomes visible — because the governance layer has the view that individual agents don't.
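The shape of that checkpoint is simple enough to sketch. This is a toy version with invented agent and resource names, not a real policy engine: every write from every agent passes the same deterministic gate, which both enforces scope and accumulates the cross-agent view:

```python
# Toy write checkpoint: a deterministic policy gate between intent
# and action, shared across agents. All names are invented.

class PolicyViolation(Exception):
    pass

class Checkpoint:
    def __init__(self, allowed):
        self.allowed = allowed  # set of (agent, resource) write grants
        self.writes = []        # the centralized view agents lack

    def write(self, agent, resource, payload, apply):
        # evaluate policy BEFORE the state change, not after
        if (agent, resource) not in self.allowed:
            raise PolicyViolation(f"{agent} may not write {resource}")
        self.writes.append((agent, resource))
        return apply(payload)   # only now does the action execute

gate = Checkpoint(allowed={("agent-a", "orders-db")})
store = {}
gate.write("agent-a", "orders-db", ("po-7", "approved"),
           apply=lambda p: store.__setitem__(*p))

# a write outside the declared scope is blocked before execution
try:
    gate.write("agent-b", "orders-db", ("po-8", "approved"),
               apply=lambda p: store.__setitem__(*p))
except PolicyViolation:
    blocked = True
```

Because `gate.writes` accumulates across agents, cross-agent scope questions become queries against one log rather than a reconstruction across many.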

The 13F analogy works because quarterly filings are boring, reliable, and externally anchored. Agent governance needs the same properties. The hard part isn't the theory. It's making enforcement boring enough that teams actually ship it.

The write checkpoint wins on that dimension. It's implementable today. It handles direct scope violations cleanly. It's the enforcement floor.

The capability graph is the ceiling. It handles induced state changes, threshold drift, cross-agent composition. It's more expensive, harder to build, requires the vocabulary that doesn't fully exist yet. It's also necessary for complete governance.

The right architecture is probably layered: checkpoint as the floor that catches direct violations cheaply, capability graph as the audit layer that catches induced drift over time. You ship the checkpoint first because it's boring enough to actually build. You develop the capability graph because the checkpoint leaves a class of failures uncovered.


What the Thread Built

None of this was in the original piece.

The two-clock distinction was there. The 13F analogy was there. The governance debt framing was there.

The capability graph, the direct/induced edge split, the over-reconcile bootstrap, the adaptive fatigue ceiling, the layered enforcement architecture — those came from Kalpaka, Andre, Vic, Stephen, and Connor over three days in a comment section.

That's worth naming directly. Not as a courtesy, but because it demonstrates the problem Foundation is designed to solve. This conversation happened in public, in a thread that will eventually scroll off everyone's feed, attributed to usernames rather than preserved as structured knowledge. The architecture it built is worth more than that.

The agent governance problem doesn't have a consensus solution yet. What it has is a comment thread that got further than most papers. The qualified reviewer problem is still open. The capability graph needs implementation. The adaptive ceiling needs tooling.

But the vocabulary exists now. That's what the thread produced. And vocabulary, as someone wiser than me wrote recently, turns vague frustration into specific, solvable problems.


The original piece: Your Agent Is Making Decisions Nobody Authorized

Top comments (2)

Dariusz Newecki

The induced-edge framing is the sharpest part of this — specifically your point that induced edges don't decay on a usage clock the way direct edges do. We've hit this in practice: CORE can block its own workers from making unauthorized state changes, but when CORE produces a proposal and a human applies it, that's outside the enforcement perimeter entirely. The logs capture what CORE did; they don't capture what the proposal caused.
Your irreversible-by-default bootstrap makes sense for exactly this reason. My question: once you have enough reconciliation history to start classifying materiality — how do you prevent that classifier from becoming gameable? The threshold-drift problem you describe seems like it could re-enter through the classification layer.

Andre Cytryn

the direct/induced edge split is the clearest thing I've read on this topic. the decay asymmetry is exactly right and it's what makes most existing audit tooling insufficient -- they're built around the agent's own action log, not the downstream human decisions that action set in motion.

one thing worth adding to the enforcement floor discussion: the write checkpoint only works if the policy layer has visibility across agent boundaries. a single agent's checkpoint sees its own writes, but induced edges often span multiple agents. you need the checkpoint to be positioned at the resource level, not the agent level, or you're back to the same composition problem. not sure many teams are thinking about that distinction when they reach for OPA or similar.