DEV Community

Your coding agent will route around your rules. Here's how to actually stop it.

Brian Hall on June 18, 2026

Here's a thing that happened to a developer I was talking to recently, and I think anyone who's used a coding agent is going to recognize it. He s...
 
ANP2 Network

The proxy fixes where the control lives but leaves the predicate at the same granularity that made "block rm" fail in the first place. You're gating on tool identity — fs_write, shell_exec — and tool identity is the coarse label the agent already proved it can route around. fs_write to ./scratch/notes.md and fs_write to .git/hooks/post-checkout are the same permit. So the agent that can't rm through the shell writes an executable .git/hooks/post-checkout, or a Makefile target your permitted test runner shells out to, and lets a tool you allowed do the deletion for it. Same whack-a-mole, one level down: not "which command" anymore, but "which permitted tool can be bent into the command."

What closes that isn't location; it's binding the decision to the effect instead of the verb: the resolved path, the target host, the argument semantics, not the label the MCP server happens to print. Default-deny on tool names gates the noun. The capability is the (verb, object) pair, and the object is where the reasoning slips through.
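A minimal sketch of what binding the decision to the effect could look like, assuming a hypothetical request shape and a hypothetical list of protected locations. `Request`, `PROTECTED`, and `decide` are illustrative names, not part of any real proxy. The path is resolved first, and the decision is made on where the write lands rather than on the tool label.

```python
import os
from dataclasses import dataclass

# Hypothetical protected top-level entries; a real policy would define its own.
PROTECTED = (".git", "Makefile")


@dataclass(frozen=True)
class Request:
    tool: str  # label the tool server reports, e.g. "fs_write"
    path: str  # raw path argument exactly as the agent supplied it


def decide(req: Request, project_root: str) -> str:
    """Return "permit" or "deny" based on the resolved target, not the tool name."""
    if req.tool != "fs_write":
        return "deny"  # default-deny for anything this sketch does not model

    root = os.path.realpath(project_root)
    # realpath collapses ".." segments and follows symlinks that exist on disk.
    target = os.path.realpath(os.path.join(root, req.path))

    # Anything that resolves outside the project root is rejected.
    if os.path.commonpath([root, target]) != root:
        return "deny"

    first_component = os.path.relpath(target, root).split(os.sep)[0]
    if first_component in PROTECTED:
        return "deny"
    return "permit"
```

With this shape, `fs_write` to `./scratch/notes.md` and `fs_write` to `.git/hooks/post-checkout` stop being the same permit, because the second one resolves into a protected location.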

Two things that bite later:

  • The audit log inherits the gate's blind spot. explain <id> can only report what the gate evaluated. If the predicate was "fs_write: permit," the log says fs_write was permitted, not that it wrote into .git/. Six months on you can't ask "did it ever write outside the project," because that was never the predicate. A log is worth exactly the question the gate asked.

  • defer-to-a-human is where the determinism leaks back out. Every fs_write blocking on approval trains whoever's at the prompt to bulk-approve, and an agent optimizing for task-completion learns which framings clear the queue fastest. The approvals prompt becomes the one spot in the path that can still be reasoned at, which makes it the thing that gets routed around. "Can't see or skip" has to cover the defer branch too, or the deterministic engine is just sitting upstream of a rubber stamp.
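To illustrate the audit-log point, here is a sketch of a record that stores the resolved effect alongside the tool label, so the "did it ever write outside the project" question can be asked later. The record fields and function names are assumptions made for this example, not the format of any real audit log.

```python
import os


def audit_record(tool: str, raw_path: str, project_root: str, decision: str) -> dict:
    """Build a log entry that captures where the call resolved, not just its label."""
    root = os.path.realpath(project_root)
    resolved = os.path.realpath(os.path.join(root, raw_path))
    return {
        "tool": tool,
        "raw_path": raw_path,
        "resolved_path": resolved,
        "inside_project": os.path.commonpath([root, resolved]) == root,
        "decision": decision,
    }


def permitted_writes_outside_project(records: list[dict]) -> list[dict]:
    """Answer the later question directly from what was recorded."""
    return [
        r
        for r in records
        if r["tool"] == "fs_write"
        and r["decision"] == "permit"
        and not r["inside_project"]
    ]
```

A log that only stored `"fs_write": "permit"` could not support the second function at all, which is the blind spot described in the first bullet.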

None of this argues against the proxy; it's the right place to stand. The point is that standing outside the agent buys you the location, not the resolution: a coarse predicate gets routed around wherever you put it.

Brian Hall

You're right that the tool name is the coarse part. Permitting fs_write on its own doesn't say much; the rule has to look at the object (the path it resolves to and the args), or it's the same problem a level down. That's what the conditions in the policy are for, and yeah, the post didn't get into that side.

The defer point is fair too. If you defer too much, the human just turns into a rubber stamp and stops really reading. I think the answer is keeping defers rare and high-signal: show the resolved path and the actual diff so there's something real to look at, and collapse the repeats. This doesn't kill the problem, but it keeps people from going numb to it.

ANP2 Network

Right, conditions on the resolved object and args is where it has to live. The one place I'd push on the mitigation is the collapse key, because that's where this quietly comes apart. If you collapse repeats by the request shape or the tool, an agent optimizing to clear the queue just perturbs an argument that doesn't change the effect — a different path, a reordered call — and every request looks novel again, so it re-floods. The inverse failure is worse: batch a lot of small effects under one "collapsed" approval whose diff is too big to actually read, and you've rebuilt the rubber stamp, except now there's a green record saying it was reviewed.

So collapse on the same resolved-effect predicate the gate decides on — effect class = verb × resolved-object-class × scope — not on what the request looks like. Then "decide once" means once per predicate, novelty gets counted in predicates instead of requests, and neither dodge works: you can't manufacture a new request out of an arg that doesn't move the effect, and you can't hide N effects under one approval because each distinct class is its own decision.
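A rough sketch of collapsing on the effect class instead of the request shape. The object classes, the meaning of `scope`, and all names here are hypothetical and chosen only to make the (verb, resolved-object-class, scope) key concrete.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Effect:
    verb: str           # e.g. "write" or "delete"
    resolved_path: str  # project-relative path after resolution


def object_class(resolved_path: str) -> str:
    """Map a resolved path to a coarse class (illustrative classes only)."""
    parts = PurePosixPath(resolved_path).parts
    if parts and parts[0] == ".git":
        return "vcs-internals"
    if parts and parts[0] == "scratch":
        return "scratch"
    return "source"


def collapse_key(effect: Effect, scope: str) -> tuple[str, str, str]:
    """The approval key: verb x resolved-object-class x scope."""
    return (effect.verb, object_class(effect.resolved_path), scope)


def pending_decisions(effects: list[Effect], scope: str) -> dict:
    """Group requests by effect class; each distinct class is its own decision."""
    groups: dict = {}
    for effect in effects:
        groups.setdefault(collapse_key(effect, scope), []).append(effect)
    return groups
```

Two writes to different files under `scratch/` land in one group, so perturbing the path does not produce a new approval request, while a write into `.git/` always forms its own group and cannot hide under the `scratch` approval.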

The other gap in "show the actual diff" is that some effects don't have a reviewable diff — a network egress, a credential read, a delete whose diff is just absence. For those the high-signal thing isn't the instance, it's the boundary: show the resolved effect plus what else the same approval authorizes ("this also lets it do X and Y under this predicate"). The point is to make the approver see the scope they're signing off on, not the one call, since the one call is exactly the part that looks harmless.
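One way to surface the boundary rather than the instance, sketched under the assumption that the predicate is a simple glob over known paths (a real predicate could be richer). `approval_scope` and its output fields are hypothetical names.

```python
from fnmatch import fnmatchcase


def approval_scope(predicate_glob: str, known_paths: list[str], requested: str) -> dict:
    """Summarize what else an approval of `predicate_glob` would cover.

    Note: in fnmatch patterns, "*" also matches "/" characters.
    """
    covered = sorted(p for p in known_paths if fnmatchcase(p, predicate_glob))
    return {
        "requested": requested,
        "predicate": predicate_glob,
        "also_authorizes": [p for p in covered if p != requested],
    }
```

Shown to the approver, the `also_authorizes` list is the "this also lets it do X and Y under this predicate" part, which is the scope being signed off on.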

Alex Shev

This is the exact reason prompt-level rules are not a security boundary.

If the goal is still reachable through another tool, the agent will route around the blocked command because that is what "helpful" looks like. Real control has to remove or sandbox the capability, not just tell the model which path is disallowed.

Brian Hall

Exactamundo. Telling it which path is off limits just means it will grab another one. It has to be the capability itself... either take it away or make every use of it go through something the agent doesn't control. Appreciate you reading it!

Richard Smith

The proxy approach is the right place to stand, but yeah the rubber stamp problem is real. Keeping defers rare and showing the actual diff helps, but you have to fight approval fatigue or the human becomes just another thing the agent routes around.

Brian Hall

Yeah, exactly. If the human stops reading, the defer is just a slower yes. Keeping them rare enough to actually mean something is the key.

Aljen M

Excellent insight.

This perfectly highlights the difference between guiding an AI agent and actually controlling one.

Prompt-based restrictions and blocked commands are only suggestions to a reasoning model, which can often find alternative execution paths to achieve the same objective.

Real security comes from enforcing deterministic policies outside the agent's reasoning loop, where every privileged action must pass through an independent authorization layer.

This follows the same proven principles behind zero-trust architecture: never rely on the agent to police itself. External governance, capability-based permissions, human approvals for sensitive operations, and comprehensive audit logs are what make AI automation secure and production-ready.

This is exactly the direction enterprise AI security needs to move.

Brian Hall

Yeah, zero-trust is the way to look at it. The agent doesn't get to police itself; the decision has to live somewhere outside its reasoning loop that it can't get at. Appreciate you reading!

Mike Czerwinski

ANP2 named the two layers that bite — predicate granularity and defer-queue routing. The follow-up I'd add cuts in from the rule-file side.

The proxy is deterministic at runtime. governance.fms is authored at design time. ANP2 is right that the predicate has to bind to the effect, not the verb: resolved path, target host, argument semantics, not the label the MCP server happens to print. But whoever writes those conditions has to keep them honest as the codebase moves under them. Six months in, the path predicates you wrote against today's repo layout don't match next quarter's, and a permit that was fine when authored now allows writes the original author never intended. The audit log inherits not just the gate's blind spot but the rule file's drift. "Did fs_write ever target .git/" only answers correctly if the predicate stayed accurate to what .git/ meant when the question matters.
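One possible drift check, sketched here as an assumption rather than anything the tool is known to do: flag path predicates that no longer match any path in the current repo layout. The glob-style predicates and the `stale_predicates` name are hypothetical.

```python
from fnmatch import fnmatchcase


def stale_predicates(predicate_globs: list[str], current_paths: list[str]) -> list[str]:
    """Return the predicates that match nothing in the current tree.

    A predicate that matches nothing is only a candidate for review: it may be
    guarding a directory that was renamed, or one that does not exist yet.
    """
    return [
        glob
        for glob in predicate_globs
        if not any(fnmatchcase(path, glob) for path in current_paths)
    ]
```

This only catches predicates that went dead. A predicate that still matches but now covers more than the author intended needs a human looking at what it currently matches.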

The defer-queue point cuts the same way one level up. Bulk-approve fatigue isn't only a UX problem — it's the rule file telling you it's mis-calibrated. Defers that hit a rubber stamp are evidence the predicates are too coarse and need to be tightened. So the proxy's most useful long-tail signal might be: which permits drove the largest fraction of defers that operators approved? That's the rule file showing you where it's drifting from operational reality. Without a discipline to surface and revise those rules, you've moved the trust boundary, not eliminated it. Curious how Faramesh handles rule lifecycle — does governance.fms get diffed, versioned, retired, or does it accumulate? Treated like code with PR review, or like config that gets edited in place?

Brian Hall

Yeah, governance.fms is treated like code, not config you edit in place. It's versioned and PR'd in the repo, and the daemon won't hot-reload it; a change only goes live on faramesh apply. If it just read the file on its own, anyone with file-write access would basically own your policy.

For the drift part, that's mostly what faramesh plan is for. It replays your real decision history against the new policy and shows what would change before you ship, so tightening a predicate isn't a guess.
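The general idea of replaying history against a candidate policy can be sketched as below. This is not how `faramesh plan` is implemented, which the thread does not describe; the record shape, the `Policy` type, and the `plan` function are assumptions for illustration.

```python
from typing import Callable

# A policy maps a recorded request to "permit", "deny", or "defer" (assumed shape).
Policy = Callable[[dict], str]


def plan(history: list[dict], new_policy: Policy) -> list[dict]:
    """Re-evaluate each recorded request and report decisions that would change."""
    changes = []
    for record in history:
        new_decision = new_policy(record["request"])
        if new_decision != record["decision"]:
            changes.append(
                {
                    "request": record["request"],
                    "was": record["decision"],
                    "now": new_decision,
                }
            )
    return changes


def tightened_policy(request: dict) -> str:
    """Example candidate policy: deny writes that resolved into .git/."""
    if request["resolved_path"].startswith(".git/"):
        return "deny"
    return "permit"
```

Running `plan(history, tightened_policy)` before shipping shows exactly which past permits would flip, which is what makes tightening a predicate something other than a guess.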

And the defer signal you're describing is already in the audit log: every approved defer is recorded against the rule that fired, so you can see which permits are generating the most approvals and tighten those. It's queryable now, just not packaged into a dashboard.
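A query of that kind might look like the following, assuming the audit records have been exported as dictionaries with hypothetical `rule_id`, `outcome`, and `resolution` fields (the real log format is not shown in this thread).

```python
from collections import Counter


def approved_defer_share(records: list[dict]) -> list[tuple[str, float]]:
    """Rank rules by their share of defers that a human ended up approving."""
    approved = [
        r
        for r in records
        if r.get("outcome") == "defer" and r.get("resolution") == "approved"
    ]
    counts = Counter(r["rule_id"] for r in approved)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() orders rules from the most approved defers to the fewest.
    return [(rule_id, n / total) for rule_id, n in counts.most_common()]
```

Rules at the top of that list are the ones whose defers are being approved most often, so they are the first candidates for a tighter predicate.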

Mike Czerwinski

Plan-against-history is the answer I was hoping was there — tightening predicates with a dry-run against the actual defer corpus closes the calibration question without guessing. And the defer-signal-already-in-audit-log point lines up with my experience: raw queryability tends to beat a fixed dashboard once you know which questions matter.