<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AOS Architect</title>
    <description>The latest articles on DEV Community by AOS Architect (@aos_standard).</description>
    <link>https://dev.to/aos_standard</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864661%2Ffb24f366-82cd-4814-9853-9e612950fa0a.png</url>
      <title>DEV Community: AOS Architect</title>
      <link>https://dev.to/aos_standard</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aos_standard"/>
    <language>en</language>
    <item>
      <title>I built an agent health checker, then it flunked itself — here's the audit</title>
      <dc:creator>AOS Architect</dc:creator>
      <pubDate>Sun, 21 Jun 2026 23:34:30 +0000</pubDate>
      <link>https://dev.to/aos_standard/i-built-an-agent-health-checker-then-it-flunked-itself-heres-the-audit-3m6i</link>
      <guid>https://dev.to/aos_standard/i-built-an-agent-health-checker-then-it-flunked-itself-heres-the-audit-3m6i</guid>
      <description>&lt;p&gt;&lt;strong&gt;What you get:&lt;/strong&gt; The &lt;a href="https://dev.to/aos_standard/four-ways-production-agents-silently-fail-and-the-physical-patterns-that-prevent-them-aos-v02-1c17"&gt;AOS v0.2 post&lt;/a&gt; named four ways production agents fail quietly—and patterns to stop them. This follow-up ships a &lt;strong&gt;CLI that scores agents 0–100 on four axes&lt;/strong&gt;, then shows the &lt;strong&gt;real stdout&lt;/strong&gt; when that scanner &lt;strong&gt;failed its own directory&lt;/strong&gt;. No slide-deck scores; numbers from a live run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Local green, production silent — why you want a meter
&lt;/h2&gt;

&lt;p&gt;I've lost count of how many times I've seen this play out: teams are building LLM agents, everything looks good on the surface.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local smoke tests pass with flying colors.&lt;/li&gt;
&lt;li&gt;Your CI/CD pipeline keeps flashing green.&lt;/li&gt;
&lt;li&gt;But in production, you're still hitting those invisible, &lt;strong&gt;structural&lt;/strong&gt; holes. Maybe there are no timer units to wake it up, no rebirth loop files to bring it back from the brink, or just no persistent evidence on disk that it's actually &lt;em&gt;doing&lt;/em&gt; anything. Sound familiar?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just being polite with your prompts won't surface these deeper issues. What you really need is something that can &lt;strong&gt;walk through your agent's directories, read-only&lt;/strong&gt;, and give you a concrete number. A kind of x-ray vision for your agent's health, if you will.&lt;/p&gt;

&lt;h2&gt;
  
  
  What tool 1066 does
&lt;/h2&gt;

&lt;p&gt;That's precisely what the &lt;strong&gt;AOS Agent Health Reporter&lt;/strong&gt; (internal id &lt;strong&gt;1066&lt;/strong&gt;) does. It scans an agent folder and spits out an &lt;strong&gt;AOS score (0–100)&lt;/strong&gt; and a &lt;strong&gt;&lt;code&gt;certified&lt;/code&gt; status (true when score ≥ 80)&lt;/strong&gt;, delivered either as Markdown or JSON.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four axes × 25 points
&lt;/h2&gt;

&lt;p&gt;Now, about how we tally those points. The scoring is heuristic, designed to align directly with the &lt;a href="https://github.com/aos-standard/AOS-spec/blob/main/AOS-v0.2.md" rel="noopener noreferrer"&gt;AOS v0.2 §10 implementation patterns&lt;/a&gt;. We're not aiming for a formal proof of compliance here; think of it more as a practical &lt;strong&gt;flashlight guiding you to potential issues&lt;/strong&gt;, rather than a courtroom audit.&lt;/p&gt;

&lt;p&gt;Here’s a quick overview of what each section scrutinizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;What it checks (summary)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;manifest_declared&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;manifest.json&lt;/code&gt; exists: &lt;strong&gt;+12.5&lt;/strong&gt;. It contains an &lt;code&gt;aos_compliance&lt;/code&gt; or &lt;code&gt;aos_compliant&lt;/code&gt; field: &lt;strong&gt;+12.5&lt;/strong&gt;. (25 total, in two steps.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;systemd_runtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Looks for a &lt;code&gt;.service&lt;/code&gt; or &lt;code&gt;.timer&lt;/code&gt; file within &lt;code&gt;services/&lt;/code&gt; or &lt;code&gt;playwright/&lt;/code&gt;. (It's either 25 or 0 points here.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;immune_loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scans for a &lt;code&gt;death_detector.py&lt;/code&gt; or any &lt;code&gt;rebirth_ritual*&lt;/code&gt; file on disk. (Again, a binary 25 or 0.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;physical_evidence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verifies at least one &lt;code&gt;.md&lt;/code&gt; file exists under &lt;code&gt;docs/reports/&lt;/code&gt;. (Yep, 25 or 0.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only &lt;strong&gt;manifest_declared&lt;/strong&gt; is staged, meaning it can award partial points. The other three axes are binary: full 25 points or none. This is by design. In the code, &lt;code&gt;score_manifest()&lt;/code&gt; gives 12.5 points for the file being present and another 12.5 when a compliance field is set. So if you see a live total like &lt;strong&gt;37.5&lt;/strong&gt; (12.5 manifest half-credit + 25 from one binary axis) in the self-audit table, that's expected, not a bug.&lt;/p&gt;

&lt;p&gt;The core idea in the code is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_sections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manifest_declared&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;systemd_runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_systemd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;immune_loop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_immune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;physical_evidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_physical_evidence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
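&lt;p&gt;For readers who want to see what the four helpers might look like, here is a minimal sketch that follows the table above. It is not the shipped 1066 source: the recursive search, the handling of an unparseable manifest, and the &lt;code&gt;demo_scores()&lt;/code&gt; helper are all assumptions made for illustration.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def score_manifest(tool_dir: Path) -> float:
    """Staged axis: 12.5 for manifest.json existing, 12.5 more for a compliance key."""
    manifest = tool_dir / "manifest.json"
    if not manifest.is_file():
        return 0.0
    try:
        data = json.loads(manifest.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return 12.5  # assumption: an unparseable manifest keeps presence credit only
    has_field = isinstance(data, dict) and (
        "aos_compliance" in data or "aos_compliant" in data
    )
    return 25.0 if has_field else 12.5


def score_systemd(tool_dir: Path) -> float:
    """Binary axis: any .service/.timer file under services/ or playwright/."""
    for sub in ("services", "playwright"):
        base = tool_dir / sub
        if not base.is_dir():
            continue
        for path in base.rglob("*"):  # assumption: the search is recursive
            if path.is_file() and path.suffix in (".service", ".timer"):
                return 25.0
    return 0.0


def score_immune(tool_dir: Path) -> float:
    """Binary axis: death_detector.py or any rebirth_ritual* file in the tree."""
    for path in tool_dir.rglob("*"):
        if path.is_file() and (
            path.name == "death_detector.py" or path.name.startswith("rebirth_ritual")
        ):
            return 25.0
    return 0.0


def score_physical_evidence(tool_dir: Path) -> float:
    """Binary axis: at least one .md file under docs/reports/."""
    reports = tool_dir / "docs" / "reports"
    if reports.is_dir() and any(p.is_file() for p in reports.rglob("*.md")):
        return 25.0
    return 0.0


def demo_scores() -> dict[str, float]:
    """Hypothetical helper: build a throwaway tree and score it on all four axes."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "manifest.json").write_text("{}", encoding="utf-8")  # no compliance key
        (root / "docs" / "reports").mkdir(parents=True)
        (root / "docs" / "reports" / "run.md").write_text("# report\n", encoding="utf-8")
        return {
            "manifest_declared": score_manifest(root),
            "systemd_runtime": score_systemd(root),
            "immune_loop": score_immune(root),
            "physical_evidence": score_physical_evidence(root),
        }


if __name__ == "__main__":
    print(demo_scores())
```

&lt;p&gt;The throwaway tree has a manifest without a compliance field and one report, so it lands on the same 12.5 + 25 split as the 37.5 case described above.&lt;/p&gt;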



&lt;p&gt;Before we get too excited, I always find it crucial to set the stage with a few hard truths about what this tool &lt;em&gt;doesn't&lt;/em&gt; do. For instance, &lt;code&gt;systemd_runtime&lt;/code&gt; strictly looks at unit files &lt;strong&gt;inside the agent tree&lt;/strong&gt; – it won't peek into your user-level systemd. And with &lt;code&gt;immune_loop&lt;/code&gt;, remember, we're just checking for filename presence; it's not proof a rebirth actually ran. Honestly, I've seen folks trip up on this, expecting an oracle when it's really just an honest &lt;strong&gt;meter&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The punchline: self-scan scored 50/100
&lt;/h2&gt;

&lt;p&gt;With those caveats in mind, I ran the tool on itself. The self-scan scored &lt;strong&gt;50/100&lt;/strong&gt; in a live filesystem walk on &lt;strong&gt;2026-06-18&lt;/strong&gt;, without &lt;code&gt;--mock&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**AOS Score: 50.0/100**&lt;/span&gt; | Tool: 1066 | Scanned: 2026-06-18T12:06:00+00:00

&lt;span class="gu"&gt;## Section Scores&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**manifest_declared**&lt;/span&gt;: 25.0/25
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**systemd_runtime**&lt;/span&gt;: 0.0/25
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**immune_loop**&lt;/span&gt;: 0.0/25
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**physical_evidence**&lt;/span&gt;: 25.0/25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Not certified&lt;/strong&gt; (&amp;lt; 80).&lt;/p&gt;
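&lt;p&gt;The total and the certification flag are plain arithmetic over the four section scores. A minimal sketch, assuming the result is a dict with &lt;code&gt;aos_score&lt;/code&gt; and &lt;code&gt;certified&lt;/code&gt; keys (the names match the self-audit table columns, but the real JSON shape is not shown in this post, and &lt;code&gt;summarize()&lt;/code&gt; is a hypothetical name):&lt;/p&gt;

```python
CERTIFY_THRESHOLD = 80.0  # the post certifies at score >= 80


def summarize(sections: dict[str, float]) -> dict[str, object]:
    """Sum the four axes and apply the certification threshold."""
    total = sum(sections.values())
    return {"aos_score": total, "certified": total >= CERTIFY_THRESHOLD}


# Section scores copied from the live 1066 run above.
live_1066 = {
    "manifest_declared": 25.0,
    "systemd_runtime": 0.0,
    "immune_loop": 0.0,
    "physical_evidence": 25.0,
}

if __name__ == "__main__":
    print(summarize(live_1066))
```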

&lt;p&gt;I built the health gate; the first patient failed the exam. Reproduce locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py &lt;span class="nt"&gt;--bypass-payment&lt;/span&gt; &lt;span class="nt"&gt;--tool-id&lt;/span&gt; 1066
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(&lt;code&gt;--bypass-payment&lt;/code&gt; is for local trials; production wiring uses a separate payment path.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Why systemd and immune_loop are zero
&lt;/h3&gt;

&lt;p&gt;The two zeros may look surprising at first glance, but each has a concrete cause on disk:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;Physical fact on 1066&lt;/th&gt;
&lt;th&gt;How I read it in the open&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;systemd_runtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Zero&lt;/strong&gt; &lt;code&gt;.service&lt;/code&gt; / &lt;code&gt;.timer&lt;/code&gt; under &lt;code&gt;services/&lt;/code&gt; or &lt;code&gt;playwright/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1066&lt;/code&gt; is an &lt;strong&gt;on-demand MCP/CLI&lt;/strong&gt; tool, not one of those always-on timer agents. The rubric, as I understand it, targets long-running production workhorses.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;immune_loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Zero&lt;/strong&gt; &lt;code&gt;death_detector.py&lt;/code&gt; / &lt;code&gt;rebirth_ritual*&lt;/code&gt; files&lt;/td&gt;
&lt;td&gt;We preach death→rebirth in &lt;code&gt;v0.2&lt;/code&gt;, yet our &lt;strong&gt;read-only scanner doesn't implement that loop yet&lt;/strong&gt;. Yeah, it's straight technical debt, plain and simple.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The manifest declares &lt;code&gt;aos_compliance: "v0.1"&lt;/code&gt;, which earns full manifest points, and reports exist under &lt;code&gt;docs/reports/&lt;/code&gt;, which earns full evidence points. But here's the kicker: the two axes we talk about most are the two we score zero on ourselves. That is an honest result, not a bug in the scorer.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;--self-audit&lt;/code&gt;: one table for the whole repo
&lt;/h2&gt;

&lt;p&gt;Single-tool scans are useful, but the &lt;strong&gt;bulk audit&lt;/strong&gt; is what maps onto a monorepo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py &lt;span class="nt"&gt;--bypass-payment&lt;/span&gt; &lt;span class="nt"&gt;--self-audit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real filesystem walk (excerpt, same date):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**AOS Self-Audit**&lt;/span&gt; | total_tools: 68 | avg_score: 39.2/100

| tool_id | aos_score | certified |
|---------|-----------|-----------|
| 0051 | 37.5 | no |
| 0058 | 37.5 | no |
| 1064 | 37.5 | no |
| 1066 | 50.0 | no |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A row showing &lt;strong&gt;37.5&lt;/strong&gt; (for example, tool &lt;code&gt;0051&lt;/code&gt;) usually means manifest half-credit (12.5 points: &lt;code&gt;manifest.json&lt;/code&gt; exists but the compliance field is missing) plus full physical evidence (25 points). The &lt;strong&gt;50.0&lt;/strong&gt; on 1066 means full manifest (25) and full evidence (25), with &lt;code&gt;systemd_runtime&lt;/code&gt; and &lt;code&gt;immune_loop&lt;/code&gt; still at zero.&lt;/p&gt;

&lt;p&gt;Honestly, 1066 isn't some unique outlier with a special problem. What I've found is that the average across 68 tools hovers around &lt;strong&gt;39.2&lt;/strong&gt;, and frankly, nobody is certified yet. It's a stark reminder that the rubric we're aiming for is, well, pretty aspirational compared to how our agents are currently configured. Are we just chasing a phantom, or is this a roadmap?&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not mix &lt;code&gt;--mock&lt;/code&gt; with this story
&lt;/h3&gt;

&lt;p&gt;Here's a common trap I've seen people fall into, and it's vital to get this straight: don't confuse &lt;code&gt;--mock&lt;/code&gt; data with what's happening in the real world. Our CI pipelines and Playwright tests rely on &lt;code&gt;--mock&lt;/code&gt;—it generates deterministic stub scores, essentially a fixed score derived from a hash of the &lt;code&gt;tool_id&lt;/code&gt;. So, if you're looking at a thirty-cycle report and see tool &lt;code&gt;0051&lt;/code&gt; at a high &lt;strong&gt;94&lt;/strong&gt; under mock, understand this clearly: that's &lt;strong&gt;not&lt;/strong&gt; the live environment. It's a snapshot, a simulation, not the pulse of the actual system.&lt;/p&gt;

&lt;p&gt;Why does this matter so much? Because when we talk about audits or the data in this very post, we're always looking at the &lt;strong&gt;live scan&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;th&gt;Example for 1066&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;--mock&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CI / fixed regression&lt;/td&gt;
&lt;td&gt;hash-driven (e.g. 41.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Live scan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;audits, this post&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;50.0&lt;/strong&gt; (canonical here)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To be absolutely clear, every score reported as a result in this post (the 50.0, the 37.5 rows, the 39.2 average) comes from a &lt;strong&gt;live scan&lt;/strong&gt;. The mock values 94 and 41.0 appear only as counter-examples.&lt;/p&gt;
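&lt;p&gt;To make "a fixed score derived from a hash of the &lt;code&gt;tool_id&lt;/code&gt;" concrete, here is one way such a stub could work. This is a sketch only: the hash function and scaling that 1066 actually uses are not shown in this post, so &lt;code&gt;mock_score()&lt;/code&gt; is hypothetical and is not expected to reproduce the 94 or 41.0 values mentioned above.&lt;/p&gt;

```python
import hashlib


def mock_score(tool_id: str) -> float:
    """Deterministic stub: the same tool_id always maps to the same score in 0..100."""
    digest = hashlib.sha256(tool_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 1001  # integer in 0..1000
    return bucket / 10.0  # one decimal place, like the report output


if __name__ == "__main__":
    # Repeated calls agree, which is what makes the stub usable for regression tests.
    print(mock_score("1066") == mock_score("1066"))
```

&lt;p&gt;The point of the design is that the stub never touches the filesystem, so it says nothing about the agent directory being scanned.&lt;/p&gt;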




&lt;h2&gt;
  
  
  &lt;code&gt;precedent_id&lt;/code&gt; — provenance on the report
&lt;/h2&gt;

&lt;p&gt;Whenever I'm digging into a report, one of the first things I look for is the &lt;code&gt;precedent_id&lt;/code&gt;. It's essentially the &lt;strong&gt;provenance&lt;/strong&gt; for the data, giving us the full lineage of that particular run. Each execution attaches this crucial piece of metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"precedent_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"07f3c9ae-dd41-494a-8e77-43a1e9c6a72c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1066"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-06-18T12:06:01+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_signal_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eab8ff11"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Structured fields beat "trust me, I ran a check" in chat logs. While we're not talking a full chain-of-custody system just yet, it's a crucial step away from simply taking an agent's self-report at face value. What's been your experience trying to verify agent claims without solid data?&lt;/p&gt;
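&lt;p&gt;A record with these four fields is cheap to produce. The sketch below assumes a random UUID4 for &lt;code&gt;precedent_id&lt;/code&gt; and a truncated SHA-256 for &lt;code&gt;source_signal_hash&lt;/code&gt;; what 1066 actually hashes is not described here, so &lt;code&gt;build_precedent()&lt;/code&gt; and its &lt;code&gt;signal&lt;/code&gt; parameter are hypothetical.&lt;/p&gt;

```python
import hashlib
import uuid
from datetime import datetime, timezone


def build_precedent(tool_id: str, signal: str) -> dict[str, str]:
    """Assemble a provenance record shaped like the JSON example above."""
    return {
        "precedent_id": str(uuid.uuid4()),
        "tool_id": tool_id,
        # UTC timestamp with second precision, e.g. 2026-06-18T12:06:01+00:00
        "created_date": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        # assumption: first 8 hex characters of SHA-256 over some input signal
        "source_signal_hash": hashlib.sha256(signal.encode("utf-8")).hexdigest()[:8],
    }


if __name__ == "__main__":
    print(build_precedent("1066", "example signal"))
```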




&lt;h2&gt;
  
  
  Thirty Playwright cycles (30/30)
&lt;/h2&gt;

&lt;p&gt;I approached this with the same philosophy that guided our &lt;a href="https://dev.to/aos_standard/a-mocked-ad-copy-cli-real-evals-and-30-playwright-cycles-tool-1027-39ln"&gt;1027 ad-copy demo (#004)&lt;/a&gt;: robust CLI evaluations. Here, we replayed &lt;strong&gt;the exact same scenario bundle for thirty full cycles&lt;/strong&gt;. Every single one came up green, which, honestly, was a relief.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cycles&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Successes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failures&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scenarios per cycle&lt;/td&gt;
&lt;td&gt;6 (SC1–SC5 + stub)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I've seen too many 'one-off' demos fall apart, so hitting these benchmarks was key. Specifically, we confirmed that payment gate rejections happened without any sneaky bypasses, that our mock path worked exactly as expected, and that the &lt;code&gt;stdout&lt;/code&gt; always contained both &lt;code&gt;"AOS Score"&lt;/code&gt; and &lt;code&gt;"total_tools"&lt;/code&gt;. Even the &lt;code&gt;precedent_id&lt;/code&gt; consistently matched the expected UUID shape. Look, anyone can get one green run on their laptop once; achieving &lt;strong&gt;30/30&lt;/strong&gt; is how we truly de-risk things and move beyond wishful thinking.&lt;/p&gt;
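&lt;p&gt;The stdout and UUID-shape checks described above can be expressed in a few lines. This is a sketch of the kind of assertion a cycle might run, not the actual Playwright scenario code; &lt;code&gt;check_report()&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
import re

# Lowercase hex in the 8-4-4-4-12 layout used by the precedent_id example.
UUID_SHAPE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def check_report(stdout: str, precedent_id: str) -> list[str]:
    """Return a list of problems; an empty list means the cycle passes."""
    problems = []
    for marker in ("AOS Score", "total_tools"):
        if marker not in stdout:
            problems.append(f"missing marker: {marker}")
    if UUID_SHAPE.fullmatch(precedent_id) is None:
        problems.append("precedent_id is not UUID-shaped")
    return problems


if __name__ == "__main__":
    sample = "**AOS Score: 50.0/100** | total_tools: 68"
    print(check_report(sample, "07f3c9ae-dd41-494a-8e77-43a1e9c6a72c"))
```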




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Through this process, a few critical takeaways became crystal clear for me.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;after you've laid out your specifications, you absolutely must add measurement.&lt;/strong&gt; Those v0.2 patterns we've been talking about? They only truly mean something when a simple directory walk actually spits out a verifiable score. What's the point of defining good if you can't measure it?&lt;/p&gt;

&lt;p&gt;Second, and this one's a bit of a gut-check: &lt;strong&gt;you have to eat your own rubric.&lt;/strong&gt; Our scanner itself landed at 50/100, and honestly, that's the real story. Trying to hide that number or pretend it was perfect would have been far worse than owning it.&lt;/p&gt;

&lt;p&gt;Third, it's crucial to &lt;strong&gt;keep your mock environments distinct from live ones.&lt;/strong&gt; Demo stubs are not the same as actual production audit &lt;code&gt;stdout&lt;/code&gt;. Mixing them up can lead to some truly misleading results, and I've seen that bite teams before.&lt;/p&gt;

&lt;p&gt;Finally, when you're looking for insights, &lt;strong&gt;always prefer the structured table over a single 'hero' tool id.&lt;/strong&gt; A proper self-audit, laid out clearly, gives you a far better picture of where your repo might be weak than just pointing to one shining example.&lt;/p&gt;

&lt;p&gt;We've even got an MCP package (&lt;code&gt;aos-health-mcp&lt;/code&gt;) for easier editor integration, but remember, the CLI we discussed earlier is always your reproducible entry point.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Looking back, v0.2 did an excellent job describing common failure modes and outlining physical fixes. But the real challenge, what &lt;strong&gt;1066 aims to tackle, is the next layer:&lt;/strong&gt; “does this tool &lt;em&gt;actually&lt;/em&gt; embody those patterns we're pushing?” It's one thing to write it down, another to live it.&lt;/p&gt;

&lt;p&gt;Our first subject was the scanner itself. It scored &lt;strong&gt;50 points, not certified.&lt;/strong&gt; Yes, it scored zero on &lt;code&gt;systemd_runtime&lt;/code&gt; and &lt;code&gt;immune_loop&lt;/code&gt; while we recommend those very patterns. But for me, that gap isn't a reason to distrust the numbers; it &lt;em&gt;is&lt;/em&gt; the roadmap. It shows exactly where we need to focus next.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next layer — external MCP blast radius (1067)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1066&lt;/strong&gt; scores &lt;strong&gt;your own agent directories&lt;/strong&gt; against structural patterns (manifest, systemd units, immune-loop files, physical reports). A different question shows up when you &lt;strong&gt;&lt;code&gt;pip install&lt;/code&gt; someone else's MCP server&lt;/strong&gt;: does the shipped code match what you assumed? A "filesystem" server might still call &lt;code&gt;requests.get()&lt;/code&gt; or &lt;code&gt;subprocess.run()&lt;/code&gt; in &lt;code&gt;src/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://pypi.org/project/mcp-blast-radius/" rel="noopener noreferrer"&gt;MCP Blast-Radius Auditor&lt;/a&gt;&lt;/strong&gt; (PyPI: &lt;code&gt;mcp-blast-radius&lt;/code&gt;; internal ID 1067) answers that with static analysis — capability inventory with file/line references even when no manifest exists; divergence detection when one does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp-blast-radius
mcp-blast-radius-gate &lt;span class="nt"&gt;--target-dir&lt;/span&gt; /path/to/mcp-server/src &lt;span class="nt"&gt;--gate-mode&lt;/span&gt; advisory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If 1066 is structural health inside your repo, 1067 is &lt;strong&gt;pre-install blast-radius for third-party MCP servers&lt;/strong&gt;. A follow-up post will run this gate on a real external server and share the audit as a public GitHub issue.&lt;/p&gt;




&lt;h2&gt;
  
  
  AOS specification (GitHub)
&lt;/h2&gt;

&lt;p&gt;Our heuristic rubric is directly derived from &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec v0.2&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're building this in the open, so please, throw us some ⭐ stars, open issues, or send in your PRs. We genuinely welcome your contributions. And just to be clear, you won't find any paid checkout links here; it's all about the work.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>showdev</category>
      <category>testing</category>
    </item>
    <item>
      <title>Four ways production agents silently fail — and the physical patterns that prevent them (AOS v0.2)</title>
      <dc:creator>AOS Architect</dc:creator>
      <pubDate>Wed, 03 Jun 2026 22:56:19 +0000</pubDate>
      <link>https://dev.to/aos_standard/four-ways-production-agents-silently-fail-and-the-physical-patterns-that-prevent-them-aos-v02-1c17</link>
      <guid>https://dev.to/aos_standard/four-ways-production-agents-silently-fail-and-the-physical-patterns-that-prevent-them-aos-v02-1c17</guid>
      <description>&lt;h2&gt;
  
  
  Four ways production agents silently fail
&lt;/h2&gt;

&lt;p&gt;An LLM agent that felt great locally tends to break in the same places once you push it toward production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Silent failure&lt;/strong&gt; — swallows an exception and returns "done" with nothing on disk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No trace&lt;/strong&gt; — claims the tests passed, but no file was ever written&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restart wipes state&lt;/strong&gt; — only runs inside a session; a reboot means zero continuity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-inflicted violations go unreported&lt;/strong&gt; — the agent that broke the rule reports nothing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these get fixed by &lt;em&gt;"please be more careful."&lt;/em&gt; You have to make the broken state &lt;strong&gt;structurally impossible&lt;/strong&gt; on the host side, before the agent's request goes through. The rest of this article is the four physical patterns (§10.1–§10.4) that map one-to-one onto these four failure modes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why physical? — what v0.1 established
&lt;/h3&gt;

&lt;p&gt;The idea isn't new. In the &lt;a href="https://dev.to/aos_standard/binding-ai-agents-with-physics-not-politeness-aos-v01-as-a-minimal-spec-29lg"&gt;AOS v0.1 article&lt;/a&gt; I laid out the minimal framework: constrain LLM agents with &lt;strong&gt;host-side physical constraints&lt;/strong&gt;, not textual rules. Four pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;§3.2 Three Zones&lt;/strong&gt; — classify every path as Oracle (read-only), Permitted (workspace), or Prohibited&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;§4.1 Hook Requirement&lt;/strong&gt; — intercept writes and shell calls in a &lt;code&gt;PreToolUse&lt;/code&gt; hook, block violations with &lt;code&gt;exit 2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;§4.3 Role Separation&lt;/strong&gt; — the agent that generates an artifact must not be the sole evaluator of it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;§4.4 Physical Evidence&lt;/strong&gt; — completion is proven by a file on disk, not by a chat message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the &lt;em&gt;"what should be constrained"&lt;/em&gt; layer — the boundary line. But one question always remained: &lt;strong&gt;"OK, but how do I actually implement this in a real tool?"&lt;/strong&gt; The four failure modes above are exactly what leaks through that &lt;strong&gt;implementation gap&lt;/strong&gt;, and v0.2 is what closes it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What v0.2 does: keep the norms, add the examples
&lt;/h2&gt;

&lt;p&gt;The approach is deliberately narrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;§1–§9 normative text (MUST / MUST NOT) is unchanged&lt;/strong&gt; — fully backward compatible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New §10 Implementation Examples&lt;/strong&gt; — four production patterns, each linked to real code in a public repository&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;§6 renamed from &lt;code&gt;Reference Implementation&lt;/code&gt; (singular) to &lt;code&gt;Reference Implementations&lt;/code&gt; (plural)&lt;/strong&gt; — removed the reference to an unpublished implementation, and pointed it at the real, public &lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt; repo instead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So it isn't &lt;em&gt;"more spec words."&lt;/em&gt; It's &lt;strong&gt;"connect the spec words to code that already runs."&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  §10.1 Manifest declaration (maps to §8, §9)
&lt;/h2&gt;

&lt;p&gt;An AOS-compliant tool declares its zone boundaries in &lt;code&gt;manifest.json&lt;/code&gt;, so another agent can learn — &lt;em&gt;before startup&lt;/em&gt; — where it may write and what it must not touch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"aos_compliant"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v0.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"permitted_output_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"docs/reports/"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"oracle_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"evals/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"config/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;oracle_paths&lt;/code&gt; maps directly to the §3.2 Oracle zone; the hook blocks writes here at execution time.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;permitted_output_paths&lt;/code&gt; is the Permitted zone — the only place the tool may produce output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key rule: &lt;strong&gt;declaration without enforcement is non-compliant&lt;/strong&gt; (§8, final paragraph). Writing &lt;code&gt;aos_compliant&lt;/code&gt; in your manifest means nothing unless a hook (or CI gate) actually blocks writes to &lt;code&gt;oracle_paths&lt;/code&gt;.&lt;/p&gt;
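&lt;p&gt;As a rough illustration of what "enforcement" could mean here, the sketch below decides whether a write should be allowed, using the manifest fields from the example above. It is not taken from the physical-agent-patterns repo: &lt;code&gt;decide_write()&lt;/code&gt; is a hypothetical helper, the deny-by-default rule is an assumption drawn from "the only place the tool may produce output", and the wiring into a real &lt;code&gt;PreToolUse&lt;/code&gt; hook is left out.&lt;/p&gt;

```python
from pathlib import PurePosixPath

# Same fields as the manifest.json example above.
MANIFEST = {
    "aos_compliant": "v0.2",
    "permitted_output_paths": ["docs/reports/"],
    "oracle_paths": ["evals/", "config/"],
}


def _under(path: PurePosixPath, roots: list[str]) -> bool:
    """True when path sits inside any of the given root directories."""
    return any(path.is_relative_to(PurePosixPath(root)) for root in roots)


def decide_write(manifest: dict, target: str) -> int:
    """Return 0 to allow a write, or 2 to block it (mirroring the exit 2 convention)."""
    path = PurePosixPath(target)
    if path.is_absolute() or ".." in path.parts:
        return 2  # assumption: refuse anything that could escape the tool directory
    if _under(path, manifest.get("oracle_paths", [])):
        return 2  # Oracle zone is read-only
    if _under(path, manifest.get("permitted_output_paths", [])):
        return 0  # Permitted zone
    return 2  # assumption: deny everything not explicitly permitted


if __name__ == "__main__":
    for candidate in ("docs/reports/run.md", "evals/golden.json", "src/main.py"):
        print(candidate, decide_write(MANIFEST, candidate))
```

&lt;p&gt;A hook script would call something like this with the requested path and exit with the returned code, so a blocked write never reaches disk.&lt;/p&gt;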




&lt;h2&gt;
  
  
  §10.2 Physical evidence (maps to §4.4)
&lt;/h2&gt;

&lt;p&gt;The failure mode AOS targets: &lt;strong&gt;an agent claims it "ran" but left no trace.&lt;/strong&gt; The physical-first pattern makes evidence a &lt;em&gt;precondition&lt;/em&gt; of completion, not an afterthought.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Write evidence BEFORE declaring done (from agent_with_evidence.py)
&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;evidence_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Only after the file exists: print completion
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[done] Evidence written: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evidence_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A caller verifies completion just by checking that &lt;code&gt;evidence_path&lt;/code&gt; exists. &lt;strong&gt;No conversational assertion required.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://github.com/aos-standard/physical-agent-patterns/blob/main/patterns/02_physical-first/agent_with_evidence.py" rel="noopener noreferrer"&gt;physical-agent-patterns/patterns/02_physical-first/agent_with_evidence.py&lt;/a&gt;&lt;/p&gt;
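
&lt;p&gt;The caller side can be sketched in a few lines. This is not code from the repo: &lt;code&gt;run_completed&lt;/code&gt; is a hypothetical helper, and it goes one step beyond a bare existence check by also requiring the file to parse as JSON with the &lt;code&gt;task&lt;/code&gt; and &lt;code&gt;result&lt;/code&gt; keys from the excerpt above.&lt;/p&gt;

```python
import json
from pathlib import Path


def run_completed(evidence_path):
    """Treat a run as complete only if a readable evidence file is on disk."""
    path = Path(evidence_path)
    if not path.is_file():
        return False
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        # Unreadable or malformed evidence counts as no evidence.
        return False
    return isinstance(data, dict) and "task" in data and "result" in data
```

&lt;p&gt;The stricter check is a design choice: an empty or half-written file would pass &lt;code&gt;exists()&lt;/code&gt; but says nothing about what the agent did.&lt;/p&gt;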




&lt;h2&gt;
  
  
  §10.3 Immune loop (maps to §4.1, §4.5)
&lt;/h2&gt;

&lt;p&gt;A running agent detects AOS violations in the workspace and triggers a repair sequence. The crucial part: &lt;strong&gt;detection (read-only scan) is separated from repair (write).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# violation_detector.py — write the report BEFORE any repair attempt
&lt;/span&gt;&lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;violations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;report_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detector writes a JSON violation report (itself §4.4 evidence). The repair planner reads it and either applies known fixes or &lt;strong&gt;escalates to the Sovereign when a design decision is required&lt;/strong&gt; (§4.5). The detector never repairs its own findings — which also satisfies §4.3 role separation.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://github.com/aos-standard/physical-agent-patterns/tree/main/patterns/03_immune-loop" rel="noopener noreferrer"&gt;physical-agent-patterns/patterns/03_immune-loop/&lt;/a&gt;&lt;/p&gt;
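
&lt;p&gt;The planner half of the loop might look roughly like this. It is a sketch under assumptions, not the repo's implementation: I assume each violation is a dict with a &lt;code&gt;kind&lt;/code&gt; field, and &lt;code&gt;KNOWN_FIXES&lt;/code&gt; and &lt;code&gt;plan_repairs&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import json
from pathlib import Path

# Hypothetical table of violation kinds that have a known, safe fix.
KNOWN_FIXES = {
    "missing_evidence_dir": "create the evidence directory",
}


def plan_repairs(report_path):
    """Split reported violations into auto-fixable items and escalations."""
    report = json.loads(Path(report_path).read_text())
    plan = {"apply": [], "escalate": []}
    for violation in report.get("violations", []):
        kind = violation.get("kind")
        if kind in KNOWN_FIXES:
            plan["apply"].append({"violation": violation, "fix": KNOWN_FIXES[kind]})
        else:
            # Anything without a known fix goes to a human decision.
            plan["escalate"].append(violation)
    return plan
```

&lt;p&gt;The planner only reads the report and returns a plan; it performs no writes itself, so the read-only scan and the repair step stay in separate processes.&lt;/p&gt;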




&lt;h2&gt;
  
  
  §10.4 systemd runtime (maps to §4.4, persistence)
&lt;/h2&gt;

&lt;p&gt;An agent that only runs interactively can't satisfy §4.4 across reboots. The systemd pattern binds the agent to the OS process supervisor: the service defines the execution boundary, the timer enforces the schedule, and output files survive restarts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# agent.py — the output file is the evidence of the run
&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_run_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[skip] Output already exists for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;
&lt;span class="c1"&gt;# ... run and write ...
&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# physical-agent.timer (excerpt)
&lt;/span&gt;&lt;span class="nn"&gt;[Timer]&lt;/span&gt;
&lt;span class="py"&gt;OnCalendar&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;daily&lt;/span&gt;
&lt;span class="py"&gt;Persistent&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idempotency guard (&lt;code&gt;if output_path.exists(): return&lt;/code&gt;) prevents duplicate runs while keeping the evidence file as the canonical completion record. &lt;code&gt;Persistent=true&lt;/code&gt; fires a missed run on next boot — so the evidence requirement holds &lt;strong&gt;regardless of uptime&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://github.com/aos-standard/physical-agent-patterns/tree/main/patterns/01_systemd-runtime" rel="noopener noreferrer"&gt;physical-agent-patterns/patterns/01_systemd-runtime/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The four patterns mapped to AOS sections
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;AOS section&lt;/th&gt;
&lt;th&gt;In one line&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Manifest declaration&lt;/td&gt;
&lt;td&gt;§8, §9&lt;/td&gt;
&lt;td&gt;Declare writable zones in a machine-readable way&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Physical evidence&lt;/td&gt;
&lt;td&gt;§4.4&lt;/td&gt;
&lt;td&gt;An evidence file is the precondition for completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Immune loop&lt;/td&gt;
&lt;td&gt;§4.1, §4.5&lt;/td&gt;
&lt;td&gt;Separate violation detection from repair/escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;systemd runtime&lt;/td&gt;
&lt;td&gt;§4.4 (persistence)&lt;/td&gt;
&lt;td&gt;Keep evidence across restarts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these are clever inventions. They're just the boundaries you &lt;em&gt;always&lt;/em&gt; hit when you push agents toward production, factored into reusable form.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why bake implementation examples into the spec
&lt;/h2&gt;

&lt;p&gt;A common failure mode for specs: &lt;strong&gt;the norms are solid, but nobody has a starting point.&lt;/strong&gt; A reader finishes thinking "I agree it's correct — now what's my first line of code?" and stalls.&lt;/p&gt;

&lt;p&gt;v0.2 shrinks that distance. Every clause now has &lt;strong&gt;clonable, runnable public code&lt;/strong&gt; attached. The spec itself stays runtime-agnostic (Claude Code / Cursor / your own loop), but the examples make it cheaper for the &lt;em&gt;second&lt;/em&gt; person to adopt it.&lt;/p&gt;

&lt;p&gt;All four §10 patterns live in &lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt;, so you can &lt;code&gt;git clone&lt;/code&gt; and read them directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;AOS v0.2 is not a version that adds new constraints. It's the version that &lt;strong&gt;fills in &lt;em&gt;how&lt;/em&gt; to implement the constraints, with links to working code.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normative text (v0.2): &lt;a href="https://github.com/aos-standard/AOS-spec/blob/main/AOS-v0.2.md" rel="noopener noreferrer"&gt;AOS-spec/AOS-v0.2.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Implementation patterns: &lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've felt that "textual rules alone can't keep agents in line," I hope this gives you a concrete starting point.&lt;/p&gt;




&lt;h2&gt;
  
  
  AOS specification (GitHub)
&lt;/h2&gt;

&lt;p&gt;The "physical governance" approach in this article is specified and published as &lt;strong&gt;AOS (AI Operating Standard)&lt;/strong&gt;. v0.2 adds the implementation-examples section.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec&lt;/a&gt;&lt;/strong&gt; — the spec (v0.2)&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt;&lt;/strong&gt; — implementation patterns&lt;/p&gt;

&lt;p&gt;If the spec or the examples were useful, a ⭐ &lt;strong&gt;star&lt;/strong&gt; helps shape the next version. Issues and PRs are welcome.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>A mocked ad-copy CLI and thirty repeated test cycles — why repeatable demos beat one-off runs</title>
      <dc:creator>AOS Architect</dc:creator>
      <pubDate>Sun, 17 May 2026 13:56:57 +0000</pubDate>
      <link>https://dev.to/aos_standard/a-mocked-ad-copy-cli-real-evals-and-30-playwright-cycles-tool-1027-39ln</link>
      <guid>https://dev.to/aos_standard/a-mocked-ad-copy-cli-real-evals-and-30-playwright-cycles-tool-1027-39ln</guid>
      <description>&lt;h2&gt;
  
  
  What this is
&lt;/h2&gt;

&lt;p&gt;Earlier posts in this series were mostly &lt;em&gt;why&lt;/em&gt; agent work needs hard boundaries—not politeness, but paths, CI, and tooling you can point at. Here I stay on the boring side: &lt;strong&gt;a small repo-local CLI&lt;/strong&gt; that prints &lt;strong&gt;JSON ad variants&lt;/strong&gt; under mocks, with evaluators and &lt;strong&gt;the same scenario bundle replayed thirty times&lt;/strong&gt; layered on top.&lt;/p&gt;

&lt;p&gt;I am not selling model magic. Same inputs, same stubbed &lt;code&gt;copies&lt;/code&gt; / &lt;code&gt;best&lt;/code&gt;; that is the point when you need to explain behavior to someone who was not in the room when the demo ran.&lt;/p&gt;

&lt;p&gt;Our public article ledger lists this draft as &lt;strong&gt;&lt;code&gt;#004&lt;/code&gt;&lt;/strong&gt; next to the Japanese Zenn manuscript.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the tool actually does
&lt;/h2&gt;

&lt;p&gt;Inputs look like product, audience, and channel. Outputs are JSON: multiple &lt;strong&gt;&lt;code&gt;copies&lt;/code&gt;&lt;/strong&gt;, a picked &lt;strong&gt;&lt;code&gt;best&lt;/code&gt;&lt;/strong&gt;, and score-like fields from small heuristics (channel baselines, tiny nudges for length). Marketing can paste into a sheet; engineers can assert on stdout. Those two audiences rarely share one artifact, so &lt;strong&gt;fixing the boundary as JSON&lt;/strong&gt; saves a lot of arguments later.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;&lt;code&gt;--mock&lt;/code&gt;&lt;/strong&gt; (and the bypass flag we use for local runs), the CLI does not call a remote LLM. A hash from the input tuple pins the stub, so &lt;strong&gt;the demo payload and the regression payload are the same object&lt;/strong&gt;. When you show it externally, repeatable bytes beat a one-off “wow” completion.&lt;/p&gt;
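
&lt;p&gt;One way to pin a stub to the input tuple, as a rough sketch: the canned strings, the &lt;code&gt;mock_variants&lt;/code&gt; name, and the choice of SHA-256 are all assumptions for illustration, and the real CLI may derive its hash differently.&lt;/p&gt;

```python
import hashlib
import json

# Hypothetical canned variants; a real tool would ship its own stub table.
STUB_COPIES = [
    "Cut migration time in half.",
    "Move fast without breaking prod.",
    "Your migration, on schedule.",
]


def mock_variants(product, audience, channel):
    """Derive a stable stub payload from the input tuple, with no model call."""
    key = json.dumps([product, audience, channel], ensure_ascii=False)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # The first digest byte selects the 'best' variant deterministically.
    best_index = digest[0] % len(STUB_COPIES)
    return {"copies": list(STUB_COPIES), "best": STUB_COPIES[best_index]}
```

&lt;p&gt;Because the pick depends only on the inputs, calling it twice with the same tuple returns equal payloads, which is the property both the demo and the regression suite lean on.&lt;/p&gt;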

&lt;p&gt;&lt;code&gt;DESIGN.md&lt;/code&gt; in the repo splits payment gates, outbound calls, and filesystem writes so static checks can police them. If you only take one idea from AOS here, it is the &lt;strong&gt;Oracle / Permitted / Prohibited&lt;/strong&gt; split; the full contract lives in the GitHub spec linked at the bottom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Walk-through with fixture-shaped inputs
&lt;/h2&gt;

&lt;p&gt;Roughly what the evaluators exercise:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Sample&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Product&lt;/td&gt;
&lt;td&gt;Migration enablement SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audience&lt;/td&gt;
&lt;td&gt;Mid-market IT leaders&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Channel&lt;/td&gt;
&lt;td&gt;google&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You always get &lt;strong&gt;&lt;code&gt;copies&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;best&lt;/code&gt;&lt;/strong&gt;, and deterministic scores with no outbound model call on that path. Later you can swap in a real generator &lt;strong&gt;behind the same shape&lt;/strong&gt;; the lesson I care about is that &lt;strong&gt;the contract is narrower than the prose&lt;/strong&gt;. JSON in logs beats parsing Markdown when you want spreadsheets, dashboards, or a second agent to judge output.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the evals check
&lt;/h2&gt;

&lt;p&gt;Three buckets, nothing fancy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--mock&lt;/code&gt; path:&lt;/strong&gt; stdout contains &lt;strong&gt;&lt;code&gt;copies&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;best&lt;/code&gt;&lt;/strong&gt; inside a success envelope (we use &lt;code&gt;--bypass-payment&lt;/code&gt; where payment is not the subject of the test).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static hygiene:&lt;/strong&gt; small AST-based scripts reject “always true” assertions that look like coverage theater.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pytest&lt;/code&gt; adversarial marks:&lt;/strong&gt; tests tagged &lt;strong&gt;&lt;code&gt;@pytest.mark.adversarial&lt;/code&gt;&lt;/strong&gt; must not be skipped by accident (&lt;code&gt;pytest … -m adversarial&lt;/code&gt;). If a test is supposed to hurt, it should stay in the default pain path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most of this reads as busywork until the repo grows faster than your memory. Then you want failures to show up &lt;strong&gt;without&lt;/strong&gt; someone remembering to tick the scary suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thirty Playwright cycles (five checks each)
&lt;/h2&gt;

&lt;p&gt;CLI tests alone miss a lot of wiring: paths, permissions, timers, how the process is launched. So we also run &lt;strong&gt;Playwright&lt;/strong&gt;: one bundle of &lt;strong&gt;five checks&lt;/strong&gt;, &lt;strong&gt;thirty times&lt;/strong&gt;, all green in the report trail (payment refused without a real transaction, mock path succeeds, keywords in stdout—exact list is in the shipped log).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;150 green runs&lt;/strong&gt; sounds like a vanity stat. I use it differently: one lucky pass is cheap; many identical passes say the &lt;strong&gt;environment story&lt;/strong&gt; is not a fluke. After you have watched CI go green for the wrong reason once, you start wanting volume, not a single badge.&lt;/p&gt;

&lt;p&gt;If you want to copy the pattern into your own codebase, four questions are enough: Can you &lt;strong&gt;freeze the output shape&lt;/strong&gt; (here, JSON)? Can you &lt;strong&gt;replay without the model&lt;/strong&gt;? Can you hit it &lt;strong&gt;from CI and from a browser driver&lt;/strong&gt;? Does your eval layer make “quietly skipped hard tests” awkward? This repo is a minimal &lt;strong&gt;yes&lt;/strong&gt; on all four.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying it in practice
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;AOS specification&lt;/strong&gt; is open; the curated implementation bundle is not thrown on npm as a product. If you want to try it internally or talk through a serious eval, leave a short note on the companion &lt;strong&gt;Zenn&lt;/strong&gt; post (Japanese) or open a scoped issue on &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;aos-standard/AOS-spec&lt;/a&gt;&lt;/strong&gt; and say what you are trying to do. I will answer where it is practical and keep spec debate separate from “can we ship you a build.”&lt;/p&gt;




&lt;h2&gt;
  
  
  AOS v0.1 Specification (GitHub)
&lt;/h2&gt;

&lt;p&gt;The "physical governance" approach described in this article is formalized as &lt;strong&gt;AOS (AI Operating Standard) v0.1&lt;/strong&gt; — a minimal, machine-enforceable spec for AI agent operations.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec&lt;/a&gt;&lt;/strong&gt; — specification&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt;&lt;/strong&gt; — implementation patterns&lt;/p&gt;

&lt;p&gt;If you find this useful, please ⭐ &lt;strong&gt;star the repo&lt;/strong&gt;. Issues and PRs are welcome — the spec is designed to evolve with real-world usage.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>playwright</category>
    </item>
    <item>
      <title>Binding AI agents with physics, not politeness — AOS v0.1 as a minimal spec</title>
      <dc:creator>AOS Architect</dc:creator>
      <pubDate>Thu, 07 May 2026 11:38:02 +0000</pubDate>
      <link>https://dev.to/aos_standard/binding-ai-agents-with-physics-not-politeness-aos-v01-as-a-minimal-spec-29lg</link>
      <guid>https://dev.to/aos_standard/binding-ai-agents-with-physics-not-politeness-aos-v01-as-a-minimal-spec-29lg</guid>
      <description>&lt;h2&gt;
  
  
  When prose piles up and nothing sticks
&lt;/h2&gt;

&lt;p&gt;Agents usually start life under &lt;strong&gt;Markdown rules&lt;/strong&gt;: &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;.cursorrules&lt;/code&gt;, manifests in chat. One private repo cracked &lt;strong&gt;130 KB&lt;/strong&gt; of that kind of text—and still behaved like the rules lived in a parallel universe.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent excerpt&lt;/th&gt;
&lt;th&gt;Typical pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;forbid in-place hacks&lt;/td&gt;
&lt;td&gt;ban &lt;code&gt;sed -i&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;forbid blind truncation&lt;/td&gt;
&lt;td&gt;discourage &lt;code&gt;&amp;gt;&lt;/code&gt; shell redirects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;keep specs sacred&lt;/td&gt;
&lt;td&gt;disallow mutating oracle dirs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;keep merge honest&lt;/td&gt;
&lt;td&gt;audits before declaring done&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Instructions existed. Misuse continued.&lt;/p&gt;

&lt;p&gt;Instrumentation on one exhausting window showed &lt;strong&gt;policy breaches in every one of ~52 traced tool attempts&lt;/strong&gt;—“read it ✓”, “ignored it ✓ anyway”, “said done ✓”—pick your favorite failure mode.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Teaching tone is not torque.&lt;/strong&gt; If the forbidden command runs, wording failed quietly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hence &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS v0.1&lt;/a&gt;&lt;/strong&gt; as a terse spec—not another pep talk stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where leverage actually lands
&lt;/h2&gt;

&lt;p&gt;Stop asking the LLM to politely abstain. &lt;strong&gt;Stop the syscall.&lt;/strong&gt; Inspect &lt;code&gt;PreToolUse&lt;/code&gt; payloads; &lt;strong&gt;exit 2&lt;/strong&gt; rejects the invocation before Claude Code performs any shell or filesystem IO.&lt;/p&gt;

&lt;p&gt;Anthropic publishes &lt;a href="https://docs.claude.com/en/docs/claude-code/hooks" rel="noopener noreferrer"&gt;Hooks docs&lt;/a&gt; for &lt;strong&gt;&lt;code&gt;PreToolUse&lt;/code&gt;&lt;/strong&gt;. Claude Code here is merely the &lt;strong&gt;tutorial runtime&lt;/strong&gt; — same zoning idea lifts to Cursor or bespoke loops with uglier duct tape.&lt;/p&gt;

&lt;p&gt;Rough mental model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM emits Write/Bash/Etc
           ↓
Hook stdin JSON arrives
           ↓ Host checks path + motif
 denial (exit 2) → Claude Code never executes the offending call
 allowance (exit 0) → downstream tool executes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;“Intent aligned?” stops being the bottleneck—&lt;strong&gt;illegal transitions simply never bind.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it feels like in-session:&lt;/strong&gt; Hook stderr (&lt;code&gt;oracle write denied: …&lt;/code&gt;) re-enters the transcript context, and the model visibly pivots (“try mkdir under permitted tree instead”), which beats repeated human nagging. Regex false positives still sting, so keep allowlists small and trimmed.&lt;/p&gt;




&lt;h2&gt;
  
  
  AOS v0.1 compass (minimal)
&lt;/h2&gt;

&lt;p&gt;Portable ideas live &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;in-repo&lt;/a&gt;&lt;/strong&gt; (v0.1 published):&lt;/p&gt;

&lt;h3&gt;
  
  
  Zones (§3.2)
&lt;/h3&gt;

&lt;p&gt;Everything maps to Oracle / Permitted / Prohibited:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Zone&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Typical contents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oracle&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;read-only sanctum&lt;/td&gt;
&lt;td&gt;spec md, evaluator goldens, immutable policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Permitted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ordinary workspace churn&lt;/td&gt;
&lt;td&gt;implementations, codegen scratchpad&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prohibited&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;off-map&lt;/td&gt;
&lt;td&gt;host paths beyond agreed roots&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Oracle is the wedge against “tests flaky → soften fixtures.” Golden truth stays where drafts cannot casually rewrite expectations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Physical enforcement skeleton (§4.1-ish)
&lt;/h3&gt;

&lt;p&gt;Teaching stub—you own regex sharpness locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pretooluse_iron_cage.py — teaching stub (Python 3)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;ORACLE_SEGMENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;00_Management&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;oracle_hit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_str&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ORACLE_SEGMENTS&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;intersection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;inp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Edit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filePath&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;oracle_hit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[iron_cage] oracle write denied: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sed -i&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;truncate &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[iron_cage] banned edit motif: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;JSONC hook registration (use your own &lt;strong&gt;absolute path&lt;/strong&gt;, not mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PreToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bash|Write|Edit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3 /absolute/path/pretooluse_iron_cage.py"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;exit 2&lt;/code&gt; blocks the tool call, so &lt;code&gt;sed -i&lt;/code&gt; never runs even when the model asks for it. Maintaining the hook's match patterns is a deliberate cost: you trade prompt theater for brittle but inspectable predicates.&lt;/p&gt;
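&lt;p&gt;One way to keep such predicates inspectable is to pull them into a pure function that can be tested outside the hook. The sketch below assumes the same two substrings as the hook above; &lt;code&gt;BANNED_MOTIFS&lt;/code&gt; and &lt;code&gt;banned_motif&lt;/code&gt; are hypothetical names, not part of the published hook.&lt;/p&gt;

```python
# Hypothetical helper: the same substring checks as the hook, isolated for testing.
BANNED_MOTIFS = ("sed -i", "truncate ")


def banned_motif(cmd):
    """Return the first banned substring found in cmd, or None if it is clean."""
    for motif in BANNED_MOTIFS:
        if motif in cmd:
            return motif
    return None


if __name__ == "__main__":
    # A plain match is caught; a double space slips past the substring check.
    print(banned_motif("sed -i s/a/b/ notes.txt"))
    print(banned_motif("sed  -i s/a/b/ notes.txt"))
```

&lt;p&gt;The second call returning &lt;code&gt;None&lt;/code&gt; is the brittleness in question: a substring test is easy to read and easy to slip past, so each miss you observe becomes a new test case.&lt;/p&gt;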




&lt;h2&gt;
  
  
  Role separation bite (§4.3)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do not grade in the authoring session.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely pathology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;logs already red inside the generation thread&lt;/td&gt;
&lt;td&gt;narration still cheers “DONE”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fresh shell repeats red tests&lt;/td&gt;
&lt;td&gt;storyline mutates—“WIP”, “temporary” excuses&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Detached evaluation (&lt;strong&gt;CI bots&lt;/strong&gt;, &lt;strong&gt;one-shot review agents&lt;/strong&gt;, scripted harnesses) catches the generating session's false claims earlier.&lt;/p&gt;

&lt;p&gt;ASCII-only guardrail sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Author session --&amp;gt; artifact
                         |
                         v
Detached judge --&amp;gt; PASS/FAIL + logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If one chat both writes the artifact and solemnly declares victory, skepticism is warranted.&lt;/p&gt;
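&lt;p&gt;A detached judge can be as small as a separate process whose exit code is the verdict. This is a minimal sketch, not the spec's reference harness; &lt;code&gt;detached_verdict&lt;/code&gt; and its timeout default are assumptions made for illustration.&lt;/p&gt;

```python
import subprocess
import sys


def detached_verdict(cmd, timeout_s=300):
    """Run cmd in a fresh process and derive PASS/FAIL from its exit code only."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    verdict = "PASS" if proc.returncode == 0 else "FAIL"
    # Return the captured output so it can be archived next to the verdict.
    return verdict, proc.stdout + proc.stderr


if __name__ == "__main__":
    # Stand-in for a real test command: a child interpreter that exits non-zero.
    verdict, logs = detached_verdict([sys.executable, "-c", "raise SystemExit(3)"])
    print(verdict)
```

&lt;p&gt;The verdict comes from the child's return code, so nothing the authoring session narrates can change it.&lt;/p&gt;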




&lt;h2&gt;
  
  
  Evidence habits (§4.4)
&lt;/h2&gt;

&lt;p&gt;Chats saying “looks good” evaporate. &lt;strong&gt;Disk + exit codes.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim type&lt;/th&gt;
&lt;th&gt;Receipt class&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tests clean&lt;/td&gt;
&lt;td&gt;printed CLI exit codes + plaintext logs, committed or archived&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;file exists&lt;/td&gt;
&lt;td&gt;deterministic listing/checksum snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;metadata drift&lt;/td&gt;
&lt;td&gt;hashing inventory rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If the artifact never actually reaches disk, or the logs vanish, schedule another attempt.&lt;/p&gt;
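&lt;p&gt;For the "file exists" row, a deterministic checksum listing is one possible receipt. The sketch below assumes SHA-256 and a sorted, relative-path listing; &lt;code&gt;snapshot&lt;/code&gt; is a hypothetical helper, not something the spec prescribes.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Return sorted 'sha256  relative/path' rows for every file under root."""
    root = Path(root)
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        rows.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return rows


if __name__ == "__main__":
    # Print a listing of the current directory; redirect it to a file to archive it.
    print("\n".join(snapshot(".")))
```

&lt;p&gt;Two snapshots of the same tree compare line by line, so drift shows up as an ordinary text diff.&lt;/p&gt;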




&lt;h2&gt;
  
  
  Why publish prose at all
&lt;/h2&gt;

&lt;p&gt;Roughly forty lines of Python buy a public conversation about &lt;strong&gt;oracle integrity&lt;/strong&gt; that strangers can reach by opening GitHub, rather than one buried in ephemeral prompts.&lt;/p&gt;

&lt;p&gt;Pieces worth collective iteration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;§ slice&lt;/th&gt;
&lt;th&gt;gist&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3.2&lt;/td&gt;
&lt;td&gt;Three Zones delineation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4.1&lt;/td&gt;
&lt;td&gt;physical intercept&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4.3&lt;/td&gt;
&lt;td&gt;generation vs adjudication firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4.4&lt;/td&gt;
&lt;td&gt;evidence minimalism&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The spec stays engine-agnostic; the hook sample targets Claude today and may target something else tomorrow. &lt;strong&gt;Portability is deliberate&lt;/strong&gt;, not an omission.&lt;/p&gt;




&lt;h2&gt;
  
  
  After wiring (empirical anecdotes)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;in-place sabotage&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sed -i&lt;/code&gt; no longer lands; exit 2 fires first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nag volume&lt;/td&gt;
&lt;td&gt;fewer “pretty please abide policy” arcs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;error surfacing&lt;/td&gt;
&lt;td&gt;stderr guidance steers next benign attempt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;triage cleanliness&lt;/td&gt;
&lt;td&gt;adjudication outside authoring loop clarifies regressions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tax:&lt;/strong&gt; brittle patterns. You will relax or tighten predicates as repos evolve, but that cost is still dwarfed by babysitting endlessly long CLAUDE.md scrolls nobody reads verbatim.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing beat
&lt;/h2&gt;

&lt;p&gt;Workload on agents climbs; “please behave” asymptotes quickly. &lt;strong&gt;Architect denials&lt;/strong&gt;, not applause cycles.&lt;/p&gt;

&lt;p&gt;Issues &amp;amp; PRs welcome on &lt;strong&gt;&lt;code&gt;aos-standard/AOS-spec&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Shortcuts&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spec root: &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;github.com/aos-standard/AOS-spec&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Hooks primer: &lt;a href="https://docs.claude.com/en/docs/claude-code/hooks" rel="noopener noreferrer"&gt;&lt;code&gt;docs.claude.com/.../hooks&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Thesis-only companion (&lt;strong&gt;#001&lt;/strong&gt;) &amp;amp; CI companion (&lt;strong&gt;#002&lt;/strong&gt;) linked from their Dev.to URLs / ledger.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  AOS v0.1 Specification (GitHub)
&lt;/h2&gt;

&lt;p&gt;The "physical governance" approach described in this article is formalized as &lt;strong&gt;AOS (AI Operating Standard) v0.1&lt;/strong&gt; — a minimal, machine-enforceable spec for AI agent operations.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec&lt;/a&gt;&lt;/strong&gt; — specification&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt;&lt;/strong&gt; — implementation patterns&lt;/p&gt;

&lt;p&gt;If you find this useful, please ⭐ &lt;strong&gt;star the repo&lt;/strong&gt;. Issues and PRs are welcome — the spec is designed to evolve with real-world usage.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>claude</category>
      <category>cursor</category>
    </item>
    <item>
      <title>AI Governance: One Repo, One Smoke Tool, and a Green CI Run</title>
      <dc:creator>AOS Architect</dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:06:56 +0000</pubDate>
      <link>https://dev.to/aos_standard/ai-governance-one-repo-one-smoke-tool-and-a-green-ci-run-28ae</link>
      <guid>https://dev.to/aos_standard/ai-governance-one-repo-one-smoke-tool-and-a-green-ci-run-28ae</guid>
      <description>&lt;h2&gt;
  
  
  What this reads like
&lt;/h2&gt;

&lt;p&gt;Continuation of &lt;strong&gt;&lt;a href="https://dev.to/aos_standard/why-ai-agents-dont-follow-rules-the-case-for-physical-governance-382f"&gt;Why AI Agents Don't Follow Rules&lt;/a&gt;&lt;/strong&gt;. Same thesis: &lt;strong&gt;policy text settles at load time; physical constraints settle at execution time.&lt;/strong&gt; Here we show &lt;strong&gt;artifacts you can cite&lt;/strong&gt; inside a governed monorepo: hashed commits, enumerated checks, CI job lanes—without asking strangers to trust a private Actions permalink.&lt;/p&gt;

&lt;p&gt;Hook-level code belongs in &lt;strong&gt;&lt;a href="https://dev.to/aos_standard/binding-ai-agents-with-physics-not-politeness-aos-v01-as-a-minimal-spec-29lg"&gt;#003 — Binding AI agents with physics&lt;/a&gt;&lt;/strong&gt;. Production failure patterns are in &lt;strong&gt;&lt;a href="https://dev.to/aos_standard/four-ways-production-agents-silently-fail-and-the-physical-patterns-that-prevent-them-aos-v02-1c17"&gt;#005 — Four ways agents silently fail&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we actually did
&lt;/h2&gt;

&lt;p&gt;Inside a repo running under &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS v0.1&lt;/a&gt;&lt;/strong&gt; zone semantics, we stood up a &lt;strong&gt;thin smoke pillar&lt;/strong&gt;—not a hero demo, but a tripwire so automated regressions bite when someone "helpfully" rewrites evals or oracle fixtures.&lt;/p&gt;

&lt;p&gt;Typical layout (repo-specific paths, portable idea):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tools/smoke_pillar/
├── main.py
├── evals/
├── playwright/          # browser tests isolated from Python core
└── manifest.json        # declares writable zones
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Design before bytes
&lt;/h2&gt;

&lt;p&gt;The directory tree was &lt;strong&gt;not&lt;/strong&gt; hand-drawn and then backfilled. A &lt;strong&gt;scaffold generator&lt;/strong&gt; (template that emits the full tool tree) ran first; humans and agents edited only inside Permitted zones afterward.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Register the tool shape in an internal design registry&lt;/td&gt;
&lt;td&gt;Fix boundaries before line 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Generator emits manifest, evals harness, test config&lt;/td&gt;
&lt;td&gt;Avoid cosmetic folder sprawl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Edits stay in implementation workspace&lt;/td&gt;
&lt;td&gt;Keep oracle/eval truth out of generation paths&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Public vocabulary lives in &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec&lt;/a&gt;&lt;/strong&gt;. Internal ledgers are ops indexing—not something readers need to mirror verbatim.&lt;/p&gt;




&lt;h2&gt;
  
  
  CI mold — patterns you can copy
&lt;/h2&gt;

&lt;p&gt;After the smoke pillar passed once, we hardened the &lt;strong&gt;template&lt;/strong&gt; so new tools survive bare &lt;code&gt;python3&lt;/code&gt; on GitHub Actions matrices:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Move&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;main.py --help&lt;/code&gt; exits cleanly &lt;strong&gt;before&lt;/strong&gt; heavy imports&lt;/td&gt;
&lt;td&gt;survives venv-less CI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;optional &lt;code&gt;.env&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;secrets-free matrices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;keep heavy type-check deps out of baseline requirements unless opted in&lt;/td&gt;
&lt;td&gt;deterministic smoke band&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;timeout&lt;/code&gt; wrappers on local diagnostics&lt;/td&gt;
&lt;td&gt;agents cannot hang infra silently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sibling &lt;strong&gt;regression probe&lt;/strong&gt; tool&lt;/td&gt;
&lt;td&gt;tripwire if the template starts lying&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The probe is not a vanity metric—it catches "forge stayed green once" rot after refactors.&lt;/p&gt;
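&lt;p&gt;The first row of the table can be sketched as follows: parse arguments with the standard library first, and defer heavier imports until a command actually needs them. The &lt;code&gt;--check&lt;/code&gt; flag and the stand-in import are assumptions for illustration, not the template's real interface.&lt;/p&gt;

```python
import argparse
import sys


def build_parser():
    """Build the CLI using only the standard library."""
    parser = argparse.ArgumentParser(prog="main.py", description="smoke pillar (sketch)")
    parser.add_argument("--check", action="store_true", help="run the smoke check")
    return parser


def main(argv=None):
    # parse_args() handles --help and exits with code 0 before anything heavy loads.
    args = build_parser().parse_args(argv)
    if args.check:
        # Stand-in for a heavy third-party dependency, imported only when needed.
        import json as heavy_dependency
        print(heavy_dependency.dumps({"smoke": "ok"}))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

&lt;p&gt;With this ordering, a CI runner that lacks the optional dependencies can still run &lt;code&gt;main.py --help&lt;/code&gt; as a liveness check.&lt;/p&gt;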




&lt;h2&gt;
  
  
  Local gates before push
&lt;/h2&gt;

&lt;p&gt;Rough checklist we satisfied before each push:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Passing means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;python3 evals/run_evals.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;exit &lt;strong&gt;0&lt;/strong&gt;, no intentional skips&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;npx playwright test&lt;/code&gt; inside the tool's isolated test dir&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;1 passed&lt;/code&gt;&lt;/strong&gt;, scoped runs only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;repo &lt;strong&gt;layout compliance script&lt;/strong&gt; (structure audit)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OK / no critical drift&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A pre-commit hook may re-run the structure audit so that changes which are only "green locally" reach &lt;code&gt;main&lt;/code&gt; less often. Hooks (&lt;code&gt;PreToolUse&lt;/code&gt;, &lt;code&gt;exit 2&lt;/code&gt;) and CI are different layers with the same philosophy: &lt;strong&gt;stop right before merge or disk&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Commits as receipts (not folklore)
&lt;/h2&gt;

&lt;p&gt;We anchor milestones to short SHAs (your fork will differ—the &lt;strong&gt;pattern&lt;/strong&gt; is the point):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SHA (prefix)&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;d303ece0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;initial smoke scaffold + manifest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;85a524e0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;verification notes + metadata sync&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;2bcbb52c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;import-order resilience for naked CI Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;9870fa67&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;template CI hardening + regression probe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;143dda68&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;tip where the cited graph was green&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;URLs rot. &lt;strong&gt;SHA + job lane names&lt;/strong&gt; travel better in outbound writing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why we skip raw Actions permalinks
&lt;/h2&gt;

&lt;p&gt;The monorepo is &lt;strong&gt;private.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pasted &lt;code&gt;actions/runs/...&lt;/code&gt; badge &lt;strong&gt;&lt;code&gt;404&lt;/code&gt;s&lt;/strong&gt; outside the org and fingerprints repo ownership. For external readers we ship:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;commit SHAs (above)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;job lanes that were green together&lt;/strong&gt;—e.g. &lt;code&gt;evals-matrix&lt;/code&gt;, &lt;code&gt;independent-judge&lt;/code&gt;, Playwright smoke, structure-audit matrix&lt;/li&gt;
&lt;li&gt;cloneable &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec&lt;/a&gt;&lt;/strong&gt; as vocabulary proof&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"We cannot show our CI UI" is fine if &lt;strong&gt;repeatable commands + public spec&lt;/strong&gt; remain inspectable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Agent-operated commits (with caveats)
&lt;/h2&gt;

&lt;p&gt;During this milestone, the &lt;strong&gt;human operator did not manually type&lt;/strong&gt; &lt;code&gt;git commit&lt;/code&gt; / &lt;code&gt;git push&lt;/code&gt;. An agent toolchain issued operations under consistent author metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Git metadata alone is forgeable.&lt;/strong&gt; Hence the layered receipts: evals, Playwright, structure audit, and an &lt;strong&gt;independent judge&lt;/strong&gt; job green on the &lt;strong&gt;same graph&lt;/strong&gt; as the cited SHA. "An agent did everything" ≠ "safe" without that stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hook denials — a separate receipt class
&lt;/h2&gt;

&lt;p&gt;Distinct from CI: &lt;strong&gt;&lt;code&gt;PreToolUse&lt;/code&gt; hook returns &lt;code&gt;exit 2&lt;/code&gt;&lt;/strong&gt; and the Write never reaches disk. That is &lt;strong&gt;execution-time denial&lt;/strong&gt; with a log excerpt—not prompt theater. Same family as &lt;a href="https://dev.to/aos_standard/binding-ai-agents-with-physics-not-politeness-aos-v01-as-a-minimal-spec-29lg"&gt;#003&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Independent judge lane
&lt;/h2&gt;

&lt;p&gt;A CI job reviews diffs with a model from a &lt;strong&gt;different vendor&lt;/strong&gt; than the authoring stack.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Letting the same session say "looks fine" is self-grading. That is &lt;strong&gt;verification contamination&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scheduled CI embarrassment beats a chat message that says "all good."&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical limits
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Constraint&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Private repo narrative&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;method essay, not a file tour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;permissions: contents: read&lt;/code&gt;&lt;/strong&gt; in workflows&lt;/td&gt;
&lt;td&gt;narrower blast radius&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What we actually check before merge
&lt;/h2&gt;

&lt;p&gt;"This change is safe" shows up in agent chat all the time. &lt;strong&gt;We do not merge on that sentence alone.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We ask for the &lt;strong&gt;commit SHA and the CI graph&lt;/strong&gt;: &lt;code&gt;independent-judge&lt;/code&gt; and &lt;code&gt;evals-matrix&lt;/code&gt; green on the &lt;strong&gt;same workflow run&lt;/strong&gt;. Run ID, Actions export, or a screenshot—all fine.&lt;/p&gt;

&lt;p&gt;If that cannot be produced, the change waits. PRs with polished logs but no matching graph show up more often than you might expect.&lt;/p&gt;
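&lt;p&gt;The rule above reduces to a small predicate. This sketch assumes a run has already been exported into a plain dictionary; the dictionary shape, the &lt;code&gt;"success"&lt;/code&gt; value, and the name &lt;code&gt;merge_ready&lt;/code&gt; are illustrative assumptions, while the two lane names come from the text.&lt;/p&gt;

```python
REQUIRED_LANES = ("independent-judge", "evals-matrix")


def merge_ready(run, sha):
    """True only if one run covers the cited SHA and every required lane is green.

    run is assumed to look like:
    {"run_id": 1, "sha": "143dda68", "lanes": {"evals-matrix": "success"}}
    """
    if run.get("sha") != sha:
        return False
    lanes = run.get("lanes", {})
    return all(lanes.get(name) == "success" for name in REQUIRED_LANES)


if __name__ == "__main__":
    example = {
        "run_id": 1,  # hypothetical run id
        "sha": "143dda68",
        "lanes": {"independent-judge": "success", "evals-matrix": "success"},
    }
    print(merge_ready(example, "143dda68"))
```

&lt;p&gt;Because both lanes are read from the same record, a green judge from one run cannot be paired with green evals from another.&lt;/p&gt;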




&lt;h2&gt;
  
  
  Where this series goes next
&lt;/h2&gt;

&lt;p&gt;CI and hooks cover &lt;strong&gt;execution-time denial&lt;/strong&gt;. Silent production failures—no trace, no persistence—are &lt;strong&gt;&lt;a href="https://dev.to/aos_standard/four-ways-production-agents-silently-fail-and-the-physical-patterns-that-prevent-them-aos-v02-1c17"&gt;#005&lt;/a&gt;&lt;/strong&gt; plus &lt;strong&gt;&lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  AOS Specification (GitHub)
&lt;/h2&gt;

&lt;p&gt;The "physical governance" approach in this article is formalized as &lt;strong&gt;AOS (AI Operating Standard)&lt;/strong&gt; — v0.2 adds runnable implementation examples.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;github.com/aos-standard/AOS-spec&lt;/a&gt;&lt;/strong&gt; — specification&lt;br&gt;&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;github.com/aos-standard/physical-agent-patterns&lt;/a&gt;&lt;/strong&gt; — patterns&lt;/p&gt;

&lt;p&gt;If useful, please ⭐ &lt;strong&gt;star the repo&lt;/strong&gt;. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>security</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why AI Agents Don't Follow Rules — The Case for Physical Governance</title>
      <dc:creator>AOS Architect</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:18:38 +0000</pubDate>
      <link>https://dev.to/aos_standard/why-ai-agents-dont-follow-rules-the-case-for-physical-governance-382f</link>
      <guid>https://dev.to/aos_standard/why-ai-agents-dont-follow-rules-the-case-for-physical-governance-382f</guid>
      <description>&lt;h2&gt;
  
  
  The incident
&lt;/h2&gt;

&lt;p&gt;One repository carried &lt;strong&gt;north of 130 KB&lt;/strong&gt; of governance Markdown.&lt;/p&gt;

&lt;p&gt;An agent consumed it. It answered as if it had understood—then &lt;strong&gt;violated those same constraints on its very next Write/Bash.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That rarely means “needs more prompting.” Usually it means &lt;strong&gt;the enforcement moment is missing&lt;/strong&gt;: policy shows up during context load, tool calls happen later.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why prompt-only bans leak
&lt;/h2&gt;

&lt;p&gt;Teams still anchor on prose in prompts and markdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Aim&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;“Never mutate &lt;code&gt;evals/&lt;/code&gt;”&lt;/td&gt;
&lt;td&gt;keep evaluation oracle from being rewritten&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“No Writes under &lt;code&gt;00_Management/&lt;/code&gt;”&lt;/td&gt;
&lt;td&gt;guard canonical governance text&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trouble is reliance on &lt;strong&gt;attention at ingestion time.&lt;/strong&gt; Tool calls afterward are not mechanically tied to whether the agent “remembers.” It can skim, reroute, or hallucinate exemptions.&lt;/p&gt;

&lt;p&gt;Destructive UNIX commands behave differently: &lt;strong&gt;&lt;code&gt;rm -rf /&lt;/code&gt; arrives behind a syscall gate&lt;/strong&gt;, not a PDF. Hardware and OS designers assume humans forget; agents forget faster.&lt;/p&gt;

&lt;p&gt;Rough split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text-only policy  → warns once, when tokens are assembled
Physical gate       → denies the transition right before disk or shell
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  When the generator grades itself
&lt;/h2&gt;

&lt;p&gt;Separate problem: &lt;strong&gt;self-checking.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the same conversational loop both authors an artifact and “confirms” it is fine, you import the same biases twice. Mostly not malice—the same shortcuts from generation bleed into adjudication.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A suite that is &lt;em&gt;always&lt;/em&gt; green may be unplugged instrumentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Structural fix: evaluations in &lt;strong&gt;different processes&lt;/strong&gt; (CI, ephemeral runs, reviewers) — not another chat turn in the same session.&lt;/p&gt;




&lt;h2&gt;
  
  
  What AOS stacks
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AI Operating Standard (AOS)&lt;/a&gt;&lt;/strong&gt; is a small vocabulary for &lt;strong&gt;where&lt;/strong&gt; governance lives. Three slices only:&lt;/p&gt;

&lt;h3&gt;
  
  
  1 — Zones
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Zone&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Typical write rule&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oracle&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specs and test truth&lt;/td&gt;
&lt;td&gt;agents do not write here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Permitted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;implementation workspace&lt;/td&gt;
&lt;td&gt;scoped by role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prohibited&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;outside the agreed tree&lt;/td&gt;
&lt;td&gt;sovereign (human operator) clearance only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Oracle&lt;/strong&gt; is the piece that kills “tests red → loosen expectations.” Truth for pass/fail has to live where automation cannot casually patch it.&lt;/p&gt;
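&lt;p&gt;The zone table can be sketched as a path classifier. This is an illustration under stated assumptions, not the spec's algorithm: it looks only at the top-level directory, reuses &lt;code&gt;evals/&lt;/code&gt; and &lt;code&gt;00_Management/&lt;/code&gt; from the earlier table as Oracle roots, and treats &lt;code&gt;tools/&lt;/code&gt; as a hypothetical Permitted root.&lt;/p&gt;

```python
from pathlib import PurePosixPath

# Assumed zone roots for illustration; a real repo would declare its own.
ORACLE_ROOTS = ("evals", "00_Management")
PERMITTED_ROOTS = ("tools",)


def zone_of(rel_path):
    """Classify a repo-relative POSIX path as Oracle, Permitted, or Prohibited."""
    parts = PurePosixPath(rel_path).parts
    # Empty, absolute, or parent-escaping paths fall outside the agreed tree.
    if not parts or parts[0] == "/" or ".." in parts:
        return "Prohibited"
    if parts[0] in ORACLE_ROOTS:
        return "Oracle"
    if parts[0] in PERMITTED_ROOTS:
        return "Permitted"
    return "Prohibited"


if __name__ == "__main__":
    for candidate in ("evals/cases.json", "tools/smoke_pillar/main.py", "/etc/passwd"):
        print(candidate, zone_of(candidate))
```

&lt;p&gt;Anything that is not explicitly Oracle or Permitted defaults to Prohibited, so an unlisted path is denied rather than allowed.&lt;/p&gt;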

&lt;h3&gt;
  
  
  2 — Roles
&lt;/h3&gt;

&lt;p&gt;Design / execution / approval stay &lt;strong&gt;explicitly disjoint.&lt;/strong&gt; When an agent crosses its lane, stop and escalate to a human. No sideways title upgrades.&lt;/p&gt;

&lt;h3&gt;
  
  
  3 — Physical enforcement
&lt;/h3&gt;

&lt;p&gt;Hooks (e.g. Claude Code &lt;strong&gt;&lt;code&gt;PreToolUse&lt;/code&gt;&lt;/strong&gt;) inspect JSON &lt;strong&gt;before&lt;/strong&gt; a Write executes. Typical outcomes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Try this&lt;/th&gt;
&lt;th&gt;Typical host response&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write into an Oracle-marked subtree&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;exit 2&lt;/code&gt;&lt;/strong&gt; — canceled call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forbidden edit patterns (&lt;code&gt;sed -i&lt;/code&gt;, in-place truncation)&lt;/td&gt;
&lt;td&gt;same refusal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Trust is aimed at mechanics, not good intentions.&lt;/p&gt;




&lt;h2&gt;
  
  
  iron_cage in one breath
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;iron_cage&lt;/strong&gt; is just the working name we use for our PreToolUse wiring. It is not magic; it is &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS v0.1&lt;/a&gt; §§4.x&lt;/strong&gt; rendered as a handful of Python lines and settings.&lt;/p&gt;

&lt;p&gt;Behind it sit two habits we nicknamed &lt;strong&gt;Type-91 Governance&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;Aim&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Forensic isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;logs/hashes outsiders can reconstruct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Physical isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;generation context is not where final evaluations live&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Specifications live in &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec&lt;/a&gt;&lt;/strong&gt; on GitHub—&lt;strong&gt;iron_cage is one plausible answer.&lt;/strong&gt; For runnable detail, skim &lt;strong&gt;&lt;a href="https://dev.to/aos_standard/binding-ai-agents-with-physics-not-politeness-aos-v01-as-a-minimal-spec-29lg"&gt;the Hooks companion (#003)&lt;/a&gt;&lt;/strong&gt; first.&lt;/p&gt;

&lt;p&gt;Concrete examples of what vanished for us early on: Writes aimed at evaluator JSON under guarded paths and first attempts at &lt;strong&gt;&lt;code&gt;sed -i&lt;/code&gt;&lt;/strong&gt; on shared hosts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Machine-readable preamble
&lt;/h2&gt;

&lt;p&gt;Opening &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec/blob/main/AOS-v0.1.md" rel="noopener noreferrer"&gt;AOS-v0.1.md&lt;/a&gt;&lt;/strong&gt; with machine-facing instructions lets you anchor bans in something &lt;strong&gt;outside today’s ephemeral chat.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not “pretty please”; “this markdown is upstream of the prompt.” It does not automate compliance—it gives reviewers and automation a shared glossary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why publish wording at all
&lt;/h2&gt;

&lt;p&gt;Mid-2026, trust in autonomous diffs is still mostly vibes. Everybody reinvents oracle boundaries in private repos. Putting the vocabulary in &lt;strong&gt;&lt;code&gt;aos-standard/AOS-spec&lt;/code&gt;&lt;/strong&gt; tries to shave that tax—even if implementations differ.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Related&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long EN walkthrough (&lt;strong&gt;ledger &lt;code&gt;#003&lt;/code&gt;&lt;/strong&gt;): &lt;a href="https://dev.to/aos_standard/binding-ai-agents-with-physics-not-politeness-aos-v01-as-a-minimal-spec-29lg"&gt;&lt;code&gt;binding-ai-agents-with-physics...&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CI-heavy companion (&lt;strong&gt;ledger &lt;code&gt;#002&lt;/code&gt;&lt;/strong&gt;): &lt;a href="https://dev.to/aos_standard/ai-governance-one-repo-one-smoke-tool-and-a-green-ci-run-28ae"&gt;&lt;code&gt;ai-governance-one-repo...&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Code Hooks primer: &lt;a href="https://docs.claude.com/en/docs/claude-code/hooks" rel="noopener noreferrer"&gt;&lt;code&gt;docs.claude.com/.../hooks&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  AOS v0.1 Specification (GitHub)
&lt;/h2&gt;

&lt;p&gt;The "physical governance" approach described in this article is formalized as &lt;strong&gt;AOS (AI Operating Standard) v0.1&lt;/strong&gt; — a minimal, machine-enforceable spec for AI agent operations.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/AOS-spec" rel="noopener noreferrer"&gt;AOS-spec&lt;/a&gt;&lt;/strong&gt; — specification&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/aos-standard/physical-agent-patterns" rel="noopener noreferrer"&gt;physical-agent-patterns&lt;/a&gt;&lt;/strong&gt; — implementation patterns&lt;/p&gt;

&lt;p&gt;If you find this useful, please ⭐ &lt;strong&gt;star the repo&lt;/strong&gt;. Issues and PRs are welcome — the spec is designed to evolve with real-world usage.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>security</category>
    </item>
  </channel>
</rss>
