AOS Architect

AI Governance: One Repo, One Smoke Tool, and a Green CI Run

What this reads like

Continuation of Why AI Agents Don't Follow Rules. Same thesis: policy text settles at load time; physical constraints settle at execution time. Here we show artifacts you can cite inside a governed monorepo: hashed commits, enumerated checks, CI job lanes—without asking strangers to trust a private Actions permalink.

Hook-level code belongs in #003 — Binding AI agents with physics. Production failure patterns are in #005 — Four ways agents silently fail.


What we actually did

Inside a repo running under AOS v0.1 zone semantics, we stood up a thin smoke pillar—not a hero demo, but a tripwire so automated regressions bite when someone "helpfully" rewrites evals or oracle fixtures.

Typical layout (repo-specific paths, portable idea):

tools/smoke_pillar/
├── main.py
├── evals/
├── playwright/          # browser tests isolated from Python core
└── manifest.json        # declares writable zones

Design before bytes

The directory tree was not hand-drawn and then backfilled. A scaffold generator (template that emits the full tool tree) ran first; humans and agents edited only inside Permitted zones afterward.

| Step | Action | Why |
| --- | --- | --- |
| 1 | Register the tool shape in an internal design registry | Fix boundaries before line 1 |
| 2 | Generator emits manifest, evals harness, test config | Avoid cosmetic folder sprawl |
| 3 | Edits stay in implementation workspace | Keep oracle/eval truth out of generation paths |
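
The "edits stay inside Permitted zones" rule can be checked mechanically. A minimal sketch, assuming a hypothetical manifest schema with a top-level `writable_zones` list of paths relative to the tool root; the actual AOS manifest fields may differ, and the `workspace` zone name is illustrative only:

```python
import json
from pathlib import PurePosixPath
from typing import List


def load_writable_zones(manifest_text: str) -> List[PurePosixPath]:
    """Parse manifest JSON and return its writable zones.

    Assumes a hypothetical top-level "writable_zones" list of relative paths.
    """
    manifest = json.loads(manifest_text)
    return [PurePosixPath(zone) for zone in manifest.get("writable_zones", [])]


def is_write_permitted(target: str, zones: List[PurePosixPath]) -> bool:
    """Return True only if target sits inside one of the writable zones."""
    path = PurePosixPath(target)
    # Reject absolute paths and parent-directory escapes outright.
    if path.is_absolute() or ".." in path.parts:
        return False
    # Comparison is by path component, so "workspacex/" does not match "workspace".
    return any(path.is_relative_to(zone) for zone in zones)


# Illustrative manifest: only the implementation workspace is writable.
zones = load_writable_zones('{"writable_zones": ["workspace"]}')
print(is_write_permitted("workspace/feature.py", zones))  # True
print(is_write_permitted("evals/cases.json", zones))      # False
```

With no zones declared, every write is refused, which is the safer default for a freshly generated tool. `PurePosixPath.is_relative_to` needs Python 3.9 or newer.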

Public vocabulary lives in AOS-spec. Internal ledgers are ops indexing—not something readers need to mirror verbatim.


CI mold — patterns you can copy

After the smoke pillar passed once, we hardened the template so new tools survive bare python3 on GitHub Actions matrices:

| Move | Purpose |
| --- | --- |
| main.py --help exits cleanly before heavy imports | survives venv-less CI |
| optional .env | secrets-free matrices |
| keep heavy type-check deps out of baseline requirements unless opted in | deterministic smoke band |
| timeout wrappers on local diagnostics | agents cannot hang infra silently |
| sibling regression probe tool | tripwire if the template starts lying |
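
The first row is the easiest to get wrong, so here is a minimal sketch of the idea rather than the template's actual `main.py`. The `--check` flag and the `hypothetical_heavy_dep` module are assumptions made up for illustration:

```python
import argparse
import sys
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI from the standard library only, so --help needs no venv."""
    parser = argparse.ArgumentParser(prog="main.py", description="smoke pillar")
    parser.add_argument("--check", action="store_true", help="run the smoke check")
    return parser


def run_check() -> int:
    """Import heavy dependencies only when the check actually runs."""
    try:
        # Hypothetical third-party module; stands in for any non-stdlib import.
        import hypothetical_heavy_dep  # noqa: F401
    except ImportError:
        print("optional dependency missing; cannot run --check", file=sys.stderr)
        return 1
    return 0


def main(argv: Optional[Sequence[str]] = None) -> int:
    # argparse handles --help inside parse_args and exits 0,
    # so run_check (and its heavy import) is never reached.
    args = build_parser().parse_args(argv)
    if args.check:
        return run_check()
    return 0
```

In a real tool, `main()` would be called from the usual `if __name__ == "__main__":` guard via `sys.exit(main())`. The point of the layout is that nothing at module top level imports beyond the standard library.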

The probe is not a vanity metric—it catches "forge stayed green once" rot after refactors.
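
A probe of this kind can be very small. The sketch below is an assumption about its shape, not the actual sibling tool: it runs one command under a timeout and treats a hang, a missing file, or a non-zero exit as failure.

```python
import subprocess
import sys
from typing import Sequence


def probe(command: Sequence[str], timeout_s: float = 30.0) -> bool:
    """Run one command and report whether it exited 0 within the time limit."""
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hang counts as a failure instead of stalling the CI job.
        return False
    except FileNotFoundError:
        # A missing interpreter or executable is also a failed probe.
        return False
    return result.returncode == 0


def probe_help(tool_main: str) -> bool:
    """Check that a generated tool still answers --help on the bare interpreter."""
    return probe([sys.executable, tool_main, "--help"])
```

Calling `probe_help("tools/smoke_pillar/main.py")` after each template refactor covers both the "--help exits cleanly" and the "timeout wrappers" rows with one check.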


Local gates before push

The rough checklist we satisfied locally before each push:

| Check | Passing means |
| --- | --- |
| python3 evals/run_evals.py | exit 0, no intentional skips |
| npx playwright test inside the tool's isolated test dir | 1 passed, scoped runs only |
| repo layout compliance script (structure audit) | OK / no critical drift |

pre-commit may re-run the structure audit so that "green locally, red on main" happens less often. Hooks (PreToolUse, exit 2) and CI are different layers with the same philosophy: stop a bad change at the last moment before it lands, whether that is a write to disk or a merge.
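
The checklist can be wrapped in one fail-fast runner. This is a sketch under assumptions: the `scripts/structure_audit.py` path is hypothetical (the article does not name the audit script), and the working directories assume the runner is started from the tool root.

```python
import subprocess
import sys
from typing import List, Optional, Sequence, Tuple

Gate = Tuple[str, Sequence[str], Optional[str]]  # (name, command, working dir)

DEFAULT_GATES: List[Gate] = [
    ("evals", ["python3", "evals/run_evals.py"], None),
    # Playwright runs inside its isolated directory, as in the checklist.
    ("playwright", ["npx", "playwright", "test"], "playwright"),
    # Hypothetical script path; substitute your repo's layout compliance script.
    ("structure-audit", ["python3", "scripts/structure_audit.py"], None),
]


def run_gates(gates: Sequence[Gate]) -> List[Tuple[str, int]]:
    """Run gates in order and stop at the first non-zero exit code."""
    results: List[Tuple[str, int]] = []
    for name, command, cwd in gates:
        try:
            code = subprocess.run(list(command), cwd=cwd).returncode
        except FileNotFoundError:
            code = 127  # missing executable or working directory
        results.append((name, code))
        if code != 0:
            print(f"gate failed: {name} (exit {code})", file=sys.stderr)
            break
    return results


def all_green(results: Sequence[Tuple[str, int]], expected: int) -> bool:
    """Green means every gate ran and every gate exited 0."""
    return len(results) == expected and all(code == 0 for _, code in results)
```

Exit codes alone do not prove "no intentional skips" or "1 passed"; those still need the harness output to be inspected or parsed.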


Commits as receipts (not folklore)

We anchor milestones to short SHAs (your fork will differ—the pattern is the point):

| SHA (prefix) | What changed |
| --- | --- |
| d303ece0 | initial smoke scaffold + manifest |
| 85a524e0 | verification notes + metadata sync |
| 2bcbb52c | import-order resilience for naked CI Python |
| 9870fa67 | template CI hardening + regression probe |
| 143dda68 | tip where the cited graph was green |

URLs rot. SHA + job lane names travel better in outbound writing.
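
One way to keep such receipts consistent in outbound writing is a tiny record type. The `Receipt` class below is a hypothetical helper for illustration, not part of AOS; it rejects anything that is not a hex SHA prefix and renders a citation line.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Git accepts abbreviated hashes; 7 to 40 lowercase hex characters is assumed here.
SHA_PREFIX = re.compile(r"^[0-9a-f]{7,40}$")


@dataclass(frozen=True)
class Receipt:
    sha: str
    summary: str
    lanes: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if not SHA_PREFIX.match(self.sha):
            raise ValueError(f"not a hex SHA prefix: {self.sha!r}")

    def cite(self) -> str:
        """Render a URL-free citation: short SHA, summary, green lanes."""
        lanes = ", ".join(self.lanes) if self.lanes else "no lanes recorded"
        return f"{self.sha[:8]} ({self.summary}); green lanes: {lanes}"


receipt = Receipt(
    sha="143dda68",
    summary="tip where the cited graph was green",
    lanes=("evals-matrix", "independent-judge"),
)
print(receipt.cite())
```

Validation only checks the shape of the prefix. Whether the commit exists is something only the repository can answer, for example with `git rev-parse --verify`.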


Why we skip raw Actions permalinks

The monorepo is private.

A pasted actions/runs/... badge 404s outside the org and fingerprints repo ownership. For external readers we ship:

  • commit SHAs (above)
  • job lanes that were green together—e.g. evals-matrix, independent-judge, Playwright smoke, structure-audit matrix
  • cloneable AOS-spec as vocabulary proof

"We cannot show our CI UI" is fine if repeatable commands + public spec remain inspectable.


Agent-operated commits (with caveats)

During this milestone, the human operator did not manually type git commit / git push. An agent toolchain issued operations under consistent author metadata.

Git metadata alone is forgeable. Hence the layered receipts: evals, Playwright, structure audit, and an independent judge job green on the same graph as the cited SHA. "An agent did everything" ≠ "safe" without that stack.


Hook denials — a separate receipt class

Distinct from CI: PreToolUse hook returns exit 2 and the Write never reaches disk. That is execution-time denial with a log excerpt—not prompt theater. Same family as #003.
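
For readers who have not seen #003, the shape of such a hook is roughly as follows. This is a sketch under assumptions: the hook runner is assumed to pass the tool call as JSON on stdin with `tool_name` and `tool_input.file_path` fields (modeled on Claude Code-style hooks), paths are assumed relative to the repo root, and the protected zone names are hypothetical.

```python
import json
import sys
from pathlib import PurePosixPath
from typing import Tuple

# Hypothetical protected zones: paths an agent must never write to.
PROTECTED_ZONES = (PurePosixPath("evals"), PurePosixPath("fixtures/oracle"))

ALLOW, DENY = 0, 2  # exit 2 is the denial code named in the text


def decide(payload: dict) -> Tuple[int, str]:
    """Map one tool-call payload to an exit code and a log line."""
    if payload.get("tool_name") != "Write":
        return ALLOW, "not a Write call"
    file_path = (payload.get("tool_input") or {}).get("file_path", "")
    if not file_path:
        # Fail closed: a Write without a target is refused.
        return DENY, "denied: Write call without a file_path"
    path = PurePosixPath(file_path)
    if ".." in path.parts:
        return DENY, f"denied: path contains a parent-directory step: {file_path}"
    for zone in PROTECTED_ZONES:
        if path.is_relative_to(zone):
            return DENY, f"denied: {file_path} is inside protected zone {zone}"
    return ALLOW, f"allowed: {file_path}"


def main() -> int:
    """Read the payload from stdin, log the decision to stderr, return the code."""
    code, message = decide(json.load(sys.stdin))
    print(message, file=sys.stderr)
    return code
```

Wired up with `sys.exit(main())`, the stderr line becomes the log excerpt and the exit code is the denial itself. Keeping `decide` free of I/O makes the policy testable without a live agent.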


Independent judge lane

A CI job reviews diffs using a model from a different vendor than the authoring stack.

Letting the same session say "looks fine" is self-grading. That is verification contamination.

Scheduled CI embarrassment beats a chat message that says "all good."


Practical limits

| Constraint | Meaning |
| --- | --- |
| Private repo | narrative method essay, not a file tour |
| permissions: contents: read in workflows | narrower blast radius |

What we actually check before merge

"This change is safe" shows up in agent chat all the time. We do not merge on that sentence alone.

We ask for the commit SHA and the CI graph: independent-judge and evals-matrix green on the same workflow run. Run ID, Actions export, or a screenshot—all fine.

If that cannot be produced, the change waits. PRs with polished logs but no matching graph show up more often than you might expect.
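
The "same workflow run" condition is easy to state loosely and then check wrongly, so a sketch helps. The `JobResult` record is a hypothetical shape for whatever the Actions export or run listing provides, and SHAs are compared as exact strings:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set

REQUIRED_LANES = frozenset({"independent-judge", "evals-matrix"})


@dataclass(frozen=True)
class JobResult:
    run_id: int
    sha: str
    lane: str
    conclusion: str  # e.g. "success" or "failure"


def green_run_for(sha: str, jobs: Iterable[JobResult]) -> Optional[int]:
    """Return a run ID where every required lane succeeded on sha, else None."""
    green_lanes_by_run: Dict[int, Set[str]] = {}
    for job in jobs:
        if job.sha == sha and job.conclusion == "success":
            green_lanes_by_run.setdefault(job.run_id, set()).add(job.lane)
    # Lanes green on different runs do not count; one run must hold them all.
    for run_id in sorted(green_lanes_by_run):
        if REQUIRED_LANES <= green_lanes_by_run[run_id]:
            return run_id
    return None
```

A `None` result means the change waits, even if each required lane was green somewhere at some point.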


Where this series goes next

CI and hooks cover execution-time denial. Silent production failures (no trace, no persistence) are covered in #005 and in physical-agent-patterns.


AOS Specification (GitHub)

The "physical governance" approach in this article is formalized as AOS (AI Operating Standard) — v0.2 adds runnable implementation examples.

👉 github.com/aos-standard/AOS-spec — specification

👉 github.com/aos-standard/physical-agent-patterns — patterns

If useful, please ⭐ star the repo. Issues and PRs welcome.
