AOS Architect

Posted on May 17 • Edited on Jun 21

A mocked ad-copy CLI and thirty repeated test cycles — why repeatable demos beat one-off runs

#ai #claude #agents #playwright

What this is

Earlier posts in this series were mostly about why agent work needs hard boundaries: not politeness, but paths, CI, and tooling you can point at. Here I stay on the boring side: a small repo-local CLI that prints JSON ad variants under mocks, with evaluators layered on top and the same scenario bundle replayed thirty times.

I am not selling model magic. The same inputs produce the same stubbed copies and best, and that is the point when you need to explain behavior to someone who was not in the room when the demo ran.

Our public article ledger lists this draft as #004 next to the Japanese Zenn manuscript.

What the tool actually does

Inputs look like product, audience, and channel. Outputs are JSON: multiple copies, a picked best, and score-like fields from small heuristics (channel baselines, tiny nudges for length). Marketing can paste into a sheet; engineers can assert on stdout. Those two audiences rarely share one artifact, so fixing the boundary as JSON saves a lot of arguments later.
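
As a rough illustration of that output shape, here is a minimal Python sketch. The field names (copies, scores, best), the baseline numbers, and the 60-character length nudge are assumptions made up for this example, not the repo's actual schema or heuristics.

```python
import json

# Hypothetical channel baselines; the real values live in the repo.
CHANNEL_BASELINE = {"google": 0.60, "meta": 0.55}
DEFAULT_BASELINE = 0.50


def score(copy: str, channel: str) -> float:
    """Channel baseline plus a tiny nudge that favors shorter copy."""
    base = CHANNEL_BASELINE.get(channel, DEFAULT_BASELINE)
    nudge = 0.05 if len(copy) <= 60 else -0.05  # assumed length threshold
    return round(base + nudge, 2)


def build_payload(copies: list[str], channel: str) -> dict:
    """Assemble a JSON-ready result: all variants, their scores, and the top pick."""
    scores = [score(c, channel) for c in copies]
    best_index = max(range(len(copies)), key=lambda i: scores[i])
    return {"copies": copies, "scores": scores, "best": copies[best_index]}


if __name__ == "__main__":
    variants = [
        "Cut migration risk for your IT team.",
        "A migration enablement platform built for mid-market IT leaders "
        "who need predictable cutovers.",
    ]
    print(json.dumps(build_payload(variants, "google"), indent=2))
```

The point of the sketch is only that the scoring is pure arithmetic on the inputs, so the same variants and channel always serialize to the same JSON.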

With --mock (and the bypass flag we use for local runs), the CLI does not call a remote LLM. A hash of the input tuple selects the stub, so the demo payload and the regression payload are the same object. When you show it externally, repeatable bytes beat a one-off “wow” completion.
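
One way such hash pinning could work, sketched in Python. The stub table, the function names, and the choice of SHA-256 are all assumptions for illustration; the post does not say which hash the CLI uses.

```python
import hashlib

# Hypothetical stub table; in the real CLI the stubs ship with the repo.
STUBS = [
    {"copies": ["Stub A1", "Stub A2"], "best": "Stub A1"},
    {"copies": ["Stub B1", "Stub B2"], "best": "Stub B2"},
    {"copies": ["Stub C1", "Stub C2"], "best": "Stub C1"},
]


def stub_key(product: str, audience: str, channel: str) -> str:
    """Stable hex digest of the input tuple, identical across processes."""
    # The unit separator keeps ("a", "bc") and ("ab", "c") from colliding.
    joined = "\x1f".join((product, audience, channel))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def pick_stub(product: str, audience: str, channel: str) -> dict:
    """Map the digest onto the stub table; same inputs select the same stub."""
    index = int(stub_key(product, audience, channel), 16) % len(STUBS)
    return STUBS[index]
```

The sketch uses hashlib rather than the built-in hash() because Python salts string hashes per process by default, which would make the selection differ between runs.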

DESIGN.md in the repo splits payment gates, outbound calls, and filesystem writes so static checks can police them. If you only take one idea from AOS here, it is the Oracle / Permitted / Prohibited split; the full contract lives in the GitHub spec linked at the bottom.

Walk-through with fixture-shaped inputs

Roughly what the evaluators exercise:

Field	Sample
Product	Migration enablement SaaS
Audience	Mid-market IT leaders
Channel	google

You always get copies, best, and deterministic scores with no outbound model call on that path. Later you can swap in a real generator behind the same shape; the lesson I care about is that the contract is narrower than the prose. JSON in logs beats parsing Markdown when you want spreadsheets, dashboards, or a second agent to judge output.
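
Because the contract is JSON, a check on captured stdout can stay small. The sketch below is hypothetical: the envelope fields status and result, and the rule that best must be one of copies, are assumptions chosen for the example rather than the repo's documented schema.

```python
import json


def check_mock_stdout(stdout: str) -> dict:
    """Parse CLI stdout and verify the assumed success envelope and result keys."""
    envelope = json.loads(stdout)
    if envelope.get("status") != "success":  # assumed envelope field
        raise AssertionError(f"unexpected status: {envelope.get('status')!r}")
    result = envelope.get("result", {})  # assumed envelope field
    copies = result.get("copies")
    if not isinstance(copies, list) or not copies:
        raise AssertionError("copies must be a non-empty list")
    if result.get("best") not in copies:  # assumed invariant
        raise AssertionError("best must be one of copies")
    return result
```

A caller would capture the CLI's stdout with whatever process runner the test suite already uses and pass the string in; nothing here depends on how the text was produced, which is what lets a real generator slot in behind the same shape later.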

What the evals check

Three buckets, nothing fancy:

--mock path: stdout contains copies and best inside a success envelope (we use --bypass-payment where payment is not the subject of the test).
Static hygiene: small AST-based scripts reject “always true” assertions that look like coverage theater.
pytest adversarial marks: tests tagged @pytest.mark.adversarial must not be skipped by accident (pytest … -m adversarial). If a test is supposed to hurt, it should stay in the default pain path.

Most of this reads as busywork until the repo grows faster than your memory. Then you want failures to show up without someone remembering to tick the scary suite.

Thirty Playwright cycles (five checks each)

CLI tests alone miss a lot of wiring: paths, permissions, timers, how the process is launched. So we also run Playwright: one bundle of five checks, thirty times, all green in the report trail (payment refused without a real transaction, mock path succeeds, keywords in stdout—exact list is in the shipped log).

150 green checks (thirty cycles of five) sounds like a vanity stat. I use it differently: one lucky pass is cheap; many identical passes say the environment story is not a fluke. After you have watched CI go green for the wrong reason once, you start wanting volume, not a single badge.
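
The repetition itself is simple to express. Below is a plain-Python sketch of the tallying idea, not the Playwright setup: the check callables are placeholders, and only three of the five names come from this post (the other two are marked as stand-ins because the exact list lives in the shipped log).

```python
from collections import Counter
from typing import Callable

Check = Callable[[], bool]


def run_cycles(checks: dict[str, Check], cycles: int) -> Counter:
    """Run every check once per cycle and count outcomes by (name, result)."""
    tally: Counter = Counter()
    for _ in range(cycles):
        for name, check in checks.items():
            outcome = "pass" if check() else "fail"
            tally[(name, outcome)] += 1
    return tally


if __name__ == "__main__":
    # Placeholder checks that always pass; real ones would drive the CLI or browser.
    bundle: dict[str, Check] = {
        "payment_refused": lambda: True,
        "mock_path_succeeds": lambda: True,
        "keywords_in_stdout": lambda: True,
        "placeholder_4": lambda: True,  # stand-in, not from the shipped log
        "placeholder_5": lambda: True,  # stand-in, not from the shipped log
    }
    tally = run_cycles(bundle, cycles=30)
    print(sum(tally.values()), "outcomes recorded")
```

Keeping the tally keyed by check name and outcome means a single flaky check shows up as its own line instead of disappearing into a total.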

If you want to copy the pattern into your own codebase, four questions are enough: Can you freeze the output shape (here, JSON)? Can you replay without the model? Can you hit it from CI and from a browser driver? Does your eval layer make “quietly skipped hard tests” awkward? This repo is a minimal yes on all four.

Trying it in practice

The AOS specification is open; the curated implementation bundle is not published on npm as a product. If you want to try it internally or talk through a serious eval, leave a short note on the companion Zenn post (Japanese) or open a scoped issue on aos-standard/AOS-spec and say what you are trying to do. I will answer where it is practical and keep spec debate separate from “can we ship you a build.”

AOS v0.1 Specification (GitHub)

The "physical governance" approach described in this article is formalized as AOS (AI Operating Standard) v0.1 — a minimal, machine-enforceable spec for AI agent operations.

👉 AOS-spec — specification
👉 physical-agent-patterns — implementation patterns

If you find this useful, please ⭐ star the repo. Issues and PRs are welcome — the spec is designed to evolve with real-world usage.
