The incident
One repository carried north of 130 KB of governance Markdown.
An agent consumed it. It answered as if it had understood—then violated those same constraints on its very next Write/Bash.
That rarely means “needs more prompting.” Usually it means the enforcement moment is missing: policy shows up during context load, tool calls happen later.
Why prompt-only bans leak
Teams still anchor on prose in prompts and markdown:
| Pattern | Aim |
|---|---|
| “Never mutate evals/” | keep the evaluation oracle from being rewritten |
| “No Writes under 00_Management/” | guard canonical governance text |
The trouble is reliance on attention at ingestion time. Tool calls afterward are not mechanically tied to whether the agent “remembers.” It can skim, reroute, or hallucinate exemptions.
Destructive UNIX commands behave differently: rm -rf / arrives behind a syscall gate, not a PDF. Hardware and OS designers assume humans forget; agents forget faster.
Rough split:
- Text-only policy → warns once, when tokens are assembled
- Physical gate → denies the transition right before disk or shell
When the generator grades itself
Separate problem: self-checking.
If the same conversational loop both authors an artifact and “confirms” it is fine, you import the same biases twice. Mostly not malice—the same shortcuts from generation bleed into adjudication.
A suite that is always green may be unplugged instrumentation.
Structural fix: evaluations in different processes (CI, ephemeral runs, reviewers) — not another chat turn in the same session.
What AOS stacks
The AI Operating Standard (AOS) is a small vocabulary for where governance lives. Three slices only:
1 — Zones
| Zone | Meaning | Typical write rule |
|---|---|---|
| Oracle | Specs and test truth | agents do not write here |
| Permitted | implementation workspace | scoped by role |
| Prohibited | outside the agreed tree | sovereign (human operator) clearance only |
Oracle is the piece that kills “tests red → loosen expectations.” Truth for pass/fail has to live where automation cannot casually patch it.
2 — Roles
Design / execution / approval stay explicitly disjoint. When an agent crosses its lane, stop and escalate to a human. No sideways title upgrades.
3 — Physical enforcement
Hooks (e.g. Claude Code PreToolUse) inspect JSON before a Write executes. Typical outcomes:
| Try this | Typical host response |
|---|---|
| Write into an Oracle-marked subtree | exit 2 — canceled call |
| Forbidden edit patterns (sed -i, in-place truncation) | same refusal |
Trust is aimed at mechanics, not good intentions.
iron_cage in one breath
iron_cage is just the working name we use for our PreToolUse wiring—it is not magic, it is AOS v0.1 §§4.x rendered as a handful of Python and settings.
Behind it sit two habits we nicknamed Type-91 Governance:
| Axis | Aim |
|---|---|
| Forensic isolation | logs/hashes outsiders can reconstruct |
| Physical isolation | generation context is not where final evaluations live |
Specifications live in AOS-spec on GitHub—iron_cage is one plausible answer. For runnable detail, skim the Hooks companion (#003) first.
Concrete examples of what vanished for us early on: Writes aimed at evaluator JSON under guarded paths and first attempts at sed -i on shared hosts.
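As an added illustration of sections 3 and the Type-91 forensic axis (this is example code written for this page, not iron_cage's own wiring and not anything from AOS-spec), here is a minimal PreToolUse gate: it reads one tool-call JSON document from stdin, classifies the target path into the Oracle / Permitted / Prohibited zones, refuses Oracle writes, `sed -i` and shell redirects into guarded paths with exit 2, and appends an audit line carrying a SHA-256 of the canonical payload that an outside reviewer can recompute. Assumptions: the zone table, `IRON_CAGE_REPO`, `IRON_CAGE_AUDIT` and the tool-name set are constructed for the example; the stdin field names (`tool_name`, `tool_input.file_path`, `tool_input.command`) and the "exit 2 cancels the call, stderr goes back to the agent" behaviour follow the Claude Code hooks documentation as recalled here, so confirm them against docs.claude.com; `decide()`, `classify()` and `audit_record()` are names of my own.

```python
#!/usr/bin/env python3
"""Illustrative PreToolUse gate in the spirit of iron_cage (added example, not the project's code).

Reads one tool-call JSON document from stdin, refuses Writes into Oracle/Prohibited
zones and in-place edit patterns in Bash with exit 2, and appends an audit line
whose payload hash an outside reviewer can recompute.

Assumption: the stdin shape (tool_name, tool_input.file_path, tool_input.command)
and "exit 2 cancels the call" follow the Claude Code hooks docs as recalled here.
"""
import hashlib
import json
import os
import re
import sys
import time
from pathlib import PurePosixPath

# Constructed configuration: zone table, repo root and tool names are examples, not AOS defaults.
REPO_ROOT = os.environ.get("IRON_CAGE_REPO", "/repo")           # absolute paths resolve against this
ORACLE_ZONES = ("evals/", "00_Management/")                       # agents do not write here
AUDIT_LOG = os.environ.get("IRON_CAGE_AUDIT", "/tmp/iron_cage.jsonl")
WRITE_TOOLS = {"Write", "Edit", "MultiEdit", "NotebookEdit"}      # disk-touching tools (assumed names)

# `sed -i`, `sed -ni`, `sed --in-place` anywhere in sed's argument list
SED_IN_PLACE = re.compile(r"\bsed\b(?:\s+\S+)*?\s+(?:-[a-zA-Z]*i|--in-place)")
# `> target` / `>> target`; the lookbehind skips fd redirects such as 2>&1 and 2>/dev/null
REDIRECT_TARGET = re.compile(r"(?<![0-9&])>{1,2}\s*([^\s;&|>]+)")


def to_repo_relative(path: str):
    """Return a normalized repo-relative path, or None when the path escapes the tree."""
    p = PurePosixPath(path)
    if p.is_absolute():
        try:
            p = p.relative_to(REPO_ROOT)
        except ValueError:
            return None                      # outside the agreed tree
    parts = []
    for part in p.parts:                     # collapse `.` and `..` so evals/../evals/x is still evals/x
        if part == "..":
            if not parts:
                return None                  # climbs above the repo root
            parts.pop()
        elif part != ".":
            parts.append(part)
    return "/".join(parts)


def classify(path: str) -> str:
    """Map a path onto the AOS zone vocabulary: Oracle, Permitted or Prohibited."""
    rel = to_repo_relative(path)
    if rel is None:
        return "Prohibited"
    for zone in ORACLE_ZONES:
        if rel == zone.rstrip("/") or rel.startswith(zone):
            return "Oracle"
    return "Permitted"


def decide(payload: dict):
    """Return (exit_code, reason): 0 lets the call through, 2 asks the host to cancel it."""
    tool = payload.get("tool_name", "")
    tool_input = payload.get("tool_input") or {}
    if tool in WRITE_TOOLS:
        target = tool_input.get("file_path", "")
        zone = classify(target)
        if zone != "Permitted":
            return 2, f"{tool} into {zone} zone refused: {target}"
        return 0, "ok"
    if tool == "Bash":
        command = tool_input.get("command", "")
        if SED_IN_PLACE.search(command):
            return 2, "in-place edit pattern refused (sed -i)"
        for target in REDIRECT_TARGET.findall(command):
            zone = classify(target)
            if zone != "Permitted":
                return 2, f"shell redirect into {zone} zone refused: {target}"
        return 0, "ok"
    return 0, "ok"                           # tools that cannot reach disk are not gated here


def audit_record(payload: dict, code: int, reason: str) -> dict:
    """Forensic-isolation habit: a line an outsider can verify by rehashing the canonical payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return {
        "ts": time.time(),
        "tool": payload.get("tool_name", ""),
        "payload_sha256": hashlib.sha256(canonical).hexdigest(),
        "exit": code,
        "reason": reason,
    }


def main() -> None:
    payload = json.load(sys.stdin)
    code, reason = decide(payload)
    with open(AUDIT_LOG, "a", encoding="utf-8") as log:
        log.write(json.dumps(audit_record(payload, code, reason)) + "\n")
    if code == 2:
        print(reason, file=sys.stderr)       # exit 2: the host cancels the call and shows stderr to the agent
    sys.exit(code)


if __name__ == "__main__":
    main()
```

Wiring it up means registering the script as a command hook under PreToolUse in the project's Claude Code settings with a matcher covering Write, Edit and Bash (the settings shape is the one in the hooks documentation; verify it there rather than here). Two caveats worth keeping in mind: the shell patterns are a coarse net (tee, mv or an interpreter one-liner can still reach an Oracle path), which is exactly why the page insists that the Oracle zone also be protected at the layer the hook cannot see, i.e. CI and filesystem permissions; and the audit log is only forensic evidence if it is shipped somewhere the agent cannot edit, so point IRON_CAGE_AUDIT outside the Permitted tree.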
Machine-readable preamble
Opening AOS-v0.1.md with machine-facing instructions lets you anchor bans in something outside today’s ephemeral chat.
Not “pretty please”; “this markdown is upstream of the prompt.” It does not automate compliance—it gives reviewers and automation a shared glossary.
Why publish wording at all
Mid-2026, trust in autonomous diffs is still mostly vibes. Everybody reinvents oracle boundaries in private repos. Putting the vocabulary in aos-standard/AOS-spec tries to shave that tax—even if implementations differ.
Related
- Long EN walkthrough (ledger #003): binding-ai-agents-with-physics...
- CI-heavy companion (ledger #002): ai-governance-one-repo...
- Claude Code Hooks primer: docs.claude.com/.../hooks
AOS v0.1 Specification (GitHub)
The "physical governance" approach described in this article is formalized as AOS (AI Operating Standard) v0.1 — a minimal, machine-enforceable spec for AI agent operations.
👉 github.com/aos-standard/AOS-spec
If you find this useful, please ⭐ star the repo. Issues and PRs are welcome — the spec is designed to evolve with real-world usage.