Why AI Agents Don't Follow Rules — The Case for Physical Governance

#agents #ai #architecture #security

The incident

One repository carried north of 130 KB of governance Markdown.

An agent consumed it. It answered as if it had understood—then violated those same constraints on its very next Write/Bash.

That rarely means “needs more prompting.” Usually it means the enforcement moment is missing: policy shows up during context load, tool calls happen later.

Why prompt-only bans leak

Teams still anchor on prose in prompts and markdown:

Pattern	Aim
“Never mutate `evals/`”	keep evaluation oracle from being rewritten
“No Writes under `00_Management/`”	guard canonical governance text

The trouble is reliance on attention at ingestion time. Tool calls afterward are not mechanically tied to whether the agent “remembers.” It can skim, reroute, or hallucinate exemptions.

Destructive UNIX commands behave differently: rm -rf / arrives behind a syscall gate, not a PDF. Hardware and OS designers assume humans forget; agents forget faster.

Rough split:

Text-only policy  → warns once, when tokens are assembled
Physical gate       → denies the transition right before disk or shell

When the generator grades itself

Separate problem: self-checking.

If the same conversational loop both authors an artifact and “confirms” it is fine, you import the same biases twice. Mostly not malice—the same shortcuts from generation bleed into adjudication.

A suite that always green may be unplugged instrumentation.

Structural fix: evaluations in different processes (CI, ephemeral runs, reviewers) — not another chat turn in the same session.

What AOS stacks

The AI Operating Standard (AOS) is a small vocabulary for where governance lives. Three slices only:

1 — Zones

Zone	Meaning	Typical write rule
Oracle	Specs and test truth	agents do not write here
Permitted	implementation workspace	scoped by role
Prohibited	outside the agreed tree	sovereign (human operator) clearance only

Oracle is the piece that kills “tests red → loosen expectations.”Truth for pass/fail has to live where automation cannot casually patch it.

2 — Roles

Design / execution / approval stay explicitly disjoint. When an agent crosses its lane, stop and escalate to a human. No sideways title upgrades.

3 — Physical enforcement

Hooks (e.g. Claude Code PreToolUse) inspect JSON before a Write executes. Typical outcomes:

Try this	Typical host response
Write into an Oracle-marked subtree	`exit 2` — canceled call
Forbidden edit patterns (`sed -i`, in-place truncation)	same refusal

Trust is aimed at mechanics, not good intentions.

iron_cage in one breath

iron_cage is just the working name we use for our PreToolUse wiring—it is not magic, it is AOS v0.1 §§4.x rendered as a handful of Python and settings.

Behind it sit two habits we nicknamed Type-91 Governance:

Axis	Aim
Forensic isolation	logs/hashes outsiders can reconstruct
Physical isolation	generation context is not where final evaluations live

Specifications live in AOS-spec on GitHub—iron_cage is one plausible answer. For runnable detail, skim the Hooks companion (#003) first.

Concrete examples of what vanished for us early on: Writes aimed at evaluator JSON under guarded paths and first attempts at sed -i on shared hosts.

Machine-readable preamble

Opening AOS-v0.1.md with machine-facing instructions lets you anchor bans in something outside today’s ephemeral chat.

Not “pretty please”; “this markdown is upstream of the prompt.” It does not automate compliance—it gives reviewers and automation a shared glossary.

Why publish wording at all

Mid-2026, trust in autonomous diffs is still mostly vibes. Everybody reinvents oracle boundaries in private repos. Putting the vocabulary in aos-standard/AOS-spec tries to shave that tax—even if implementations differ.

Related

Long EN walkthrough (ledger #003): binding-ai-agents-with-physics...
CI-heavy companion (ledger #002): ai-governance-one-repo...
Claude Code Hooks primer: docs.claude.com/.../hooks

AOS v0.1 Specification (GitHub)

The "physical governance" approach described in this article is formalized as AOS (AI Operating Standard) v0.1 — a minimal, machine-enforceable spec for AI agent operations.

👉 github.com/aos-standard/AOS-spec

If you find this useful, please ⭐ star the repo. Issues and PRs are welcome — the spec is designed to evolve with real-world usage.