DEV Community

AOS Architect

I built an agent health checker, then it flunked itself — here's the audit

What you get: The AOS v0.2 post named four ways production agents fail quietly—and patterns to stop them. This follow-up ships a CLI that scores agents 0–100 on four axes, then shows the real stdout when that scanner failed its own directory. No slide-deck scores; numbers from a live run.


Local green, production silent — why you want a meter

I've lost count of how many times I've seen this play out: a team builds an LLM agent, and everything looks good on the surface.

  • Local smoke tests pass with flying colors.
  • Your CI/CD pipeline keeps flashing green.
  • But in production, you're still hitting those invisible, structural holes. Maybe there are no timer units to wake it up, no rebirth loop files to bring it back from the brink, or just no persistent evidence on disk that it's actually doing anything. Sound familiar?

Being polite with your prompts won't surface these deeper issues. What you need is something that walks your agent's directories, read-only, and gives you a concrete number: a kind of x-ray for your agent's health.

What tool 1066 does

That's precisely what the AOS Agent Health Reporter (internal id 1066) does. It scans an agent folder and spits out an AOS score (0–100) and a certified status (true when score ≥ 80), delivered either as Markdown or JSON.


Four axes × 25 points

Now, about how we tally those points. The scoring is heuristic, designed to align directly with the AOS v0.2 §10 implementation patterns. We're not aiming for a formal proof of compliance here; think of it more as a practical flashlight guiding you to potential issues, rather than a courtroom audit.

Here’s a quick overview of what each section scrutinizes:

| Section | What it checks (summary) | Points |
|---------|--------------------------|--------|
| manifest_declared | manifest.json is present (+12.5), then an aos_compliance or aos_compliant field is declared (+12.5) | 0, 12.5, or 25 (two distinct steps) |
| systemd_runtime | A .service or .timer file exists within services/ or playwright/ | 25 or 0 |
| immune_loop | A death_detector.py or any rebirth_ritual* file exists on disk | 25 or 0 |
| physical_evidence | At least one .md file exists under docs/reports/ | 25 or 0 |

You'll notice that only manifest_declared is staged, meaning it can award partial points. The other three axes are binary—you either get the full 25 points or none at all. This is by design. For example, in the code, score_manifest() first gives 12.5 points just for the file being present, and another 12.5 when a compliance field is correctly set. So, if you see live totals like 37.5 (12.5 + 25) in the self-audit table, don't be surprised; that's exactly what we expect, not a bug.

The core idea in the code is straightforward:

from pathlib import Path


def score_sections(tool_dir: Path) -> dict[str, float]:
    # Each helper returns 0 to 25 for its axis; the AOS score is the sum of the four.
    return {
        "manifest_declared": score_manifest(tool_dir),
        "systemd_runtime": score_systemd(tool_dir),
        "immune_loop": score_immune(tool_dir),
        "physical_evidence": score_physical_evidence(tool_dir),
    }
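The four helpers called above are not shown in this post, so here is a minimal sketch of what they could look like, reconstructed only from the rubric table. The function bodies are assumptions, not the shipped 1066 code. In particular, treating any non-empty compliance value as "declared" and searching the whole agent tree for immune-loop files are guesses.

```python
import json
from pathlib import Path


def score_manifest(tool_dir: Path) -> float:
    """Staged axis: 12.5 for manifest.json, 12.5 more for a compliance field."""
    manifest = tool_dir / "manifest.json"
    if not manifest.is_file():
        return 0.0
    score = 12.5
    try:
        data = json.loads(manifest.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return score  # assumption: an unreadable manifest keeps presence credit only
    # Assumption: any non-empty value under either key counts as declared.
    if isinstance(data, dict) and (data.get("aos_compliance") or data.get("aos_compliant")):
        score += 12.5
    return score


def score_systemd(tool_dir: Path) -> float:
    """Binary axis: a .service or .timer file under services/ or playwright/."""
    for sub in ("services", "playwright"):
        folder = tool_dir / sub
        if folder.is_dir() and any(
            p.suffix in {".service", ".timer"} for p in folder.rglob("*") if p.is_file()
        ):
            return 25.0
    return 0.0


def score_immune(tool_dir: Path) -> float:
    """Binary axis: filename presence only, not proof that a rebirth ran."""
    hits = list(tool_dir.rglob("death_detector.py")) + list(tool_dir.rglob("rebirth_ritual*"))
    return 25.0 if hits else 0.0


def score_physical_evidence(tool_dir: Path) -> float:
    """Binary axis: at least one .md file under docs/reports/."""
    reports = tool_dir / "docs" / "reports"
    return 25.0 if reports.is_dir() and any(reports.rglob("*.md")) else 0.0
```

Every check is a read-only filesystem probe, which is why the tool can only measure structure on disk and not runtime behavior.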

Before we get too excited, a few hard truths about what this tool doesn't do. systemd_runtime only looks at unit files inside the agent tree; it won't peek into your user-level systemd. And immune_loop only checks that the filenames are present, which is not proof that a rebirth ever ran. I've seen folks trip up on this, expecting an oracle when it's really just an honest meter.


The punchline: self-scan scored 50/100

With those caveats in mind, I put the tool to a real test. My self-scan scored 50/100—not a mock run, but a live walk on 2026-06-18 with no --mock:

**AOS Score: 50.0/100** | Tool: 1066 | Scanned: 2026-06-18T12:06:00+00:00

## Section Scores

- **manifest_declared**: 25.0/25
- **systemd_runtime**: 0.0/25
- **immune_loop**: 0.0/25
- **physical_evidence**: 25.0/25

Not certified (< 80).
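The arithmetic behind that verdict is small enough to sketch. The section values below are copied from the live output above, and the 80-point threshold comes from the tool description. The function name `summarize` and the returned dict shape are illustrative assumptions, not the tool's actual API.

```python
CERTIFIED_THRESHOLD = 80.0  # from the post: certified is true when score >= 80


def summarize(sections: dict[str, float]) -> dict:
    """Sum the four axes and apply the certification threshold."""
    total = sum(sections.values())
    return {"aos_score": total, "certified": total >= CERTIFIED_THRESHOLD}


# Section scores from the live 1066 scan shown above.
live_1066 = {
    "manifest_declared": 25.0,
    "systemd_runtime": 0.0,
    "immune_loop": 0.0,
    "physical_evidence": 25.0,
}

print(summarize(live_1066))  # {'aos_score': 50.0, 'certified': False}
```

With three binary axes, a tool has to clear all four axes to reach 80: three full axes top out at 75, and a fully missing manifest caps the total at 75 as well.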

I built the health gate; the first patient failed the exam. Reproduce locally:

python3 main.py --bypass-payment --tool-id 1066

(--bypass-payment is for local trials; production wiring uses a separate payment path.)

Why systemd and immune_loop are zero

On 1066 itself, these two axes are the ones that jump out. The zeros might surprise you at first glance, but each has a concrete reason:

| Axis | Physical fact on 1066 | How I read it in the open |
|------|-----------------------|---------------------------|
| systemd_runtime | Zero .service / .timer under services/ or playwright/ | 1066 is an on-demand MCP/CLI tool, not one of those always-on timer agents. The rubric, as I understand it, targets long-running production workhorses. |
| immune_loop | Zero death_detector.py / rebirth_ritual* files | We preach death→rebirth in v0.2, yet our read-only scanner doesn't implement that loop yet. Yeah, it's straight technical debt, plain and simple. |

The manifest declares aos_compliance: "v0.1", which earns full manifest points, and reports exist under docs/reports/, which earns full evidence points. But here's the kicker: the two axes we talk about most are the two we score zero on ourselves. That's worth a wry chuckle, and it's an honest result, not a bug in the scorer.


--self-audit: one table for the whole repo

I've found single-tool scans useful, but bulk audit is the hook that maps to your monorepo mental model:

python3 main.py --bypass-payment --self-audit

Real filesystem walk (excerpt, same date):

**AOS Self-Audit** | total_tools: 68 | avg_score: 39.2/100

| tool_id | aos_score | certified |
|---------|-----------|-----------|
| 0051 | 37.5 | no |
| 0058 | 37.5 | no |
| 1064 | 37.5 | no |
| 1066 | 50.0 | no |

A row showing 37.5 (for example, tool 0051) usually means manifest half-credit (12.5 points: manifest.json is present, but the compliance field is missing or incomplete) plus full physical evidence (25 points). The 50.0 on 1066 means full manifest (25) and full evidence (25), with systemd and immune still sitting at zero.
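As a rough illustration of how rows like these could be produced, the sketch below renders the same table shape from per-tool totals. The totals are the four excerpt rows from the live run. The function name `render_self_audit` and the "yes" label for certified tools are assumptions, since the excerpt only ever shows "no".

```python
def render_self_audit(scores: dict[str, float], threshold: float = 80.0) -> str:
    """Render per-tool totals as a table in the same shape as the excerpt."""
    lines = [
        "| tool_id | aos_score | certified |",
        "|---------|-----------|-----------|",
    ]
    for tool_id in sorted(scores):
        # Assumption: certified tools would print "yes"; the live excerpt has none.
        certified = "yes" if scores[tool_id] >= threshold else "no"
        lines.append(f"| {tool_id} | {scores[tool_id]:.1f} | {certified} |")
    return "\n".join(lines)


# Totals from the excerpt; 0051 and 1066 are written out as the sums described above.
excerpt = {
    "0051": 12.5 + 25.0,  # manifest half-credit + full physical evidence
    "0058": 37.5,
    "1064": 37.5,
    "1066": 25.0 + 25.0,  # full manifest + full physical evidence
}

print(render_self_audit(excerpt))
```

Tool ids are kept as strings so that leading zeros such as 0051 survive sorting and formatting.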

1066 isn't a unique outlier with a special problem. The average across all 68 tools is 39.2, and nobody is certified yet. It's a stark reminder that the rubric is pretty aspirational compared to how our agents are currently configured. Are we chasing a phantom, or is this a roadmap?

Do not mix --mock with this story

Here's a common trap: don't confuse --mock data with the real world. Our CI pipelines and Playwright tests rely on --mock, which generates deterministic stub scores: a fixed score derived from a hash of the tool_id. So if a thirty-cycle report shows tool 0051 at a high 94 under mock, that's not the live environment. It's a simulation, not the pulse of the actual system.

Why does this matter so much? Because when we talk about audits or the data in this very post, we're always looking at the live scan.

| Mode | Use | Example for 1066 |
|------|-----|------------------|
| --mock | CI / fixed regression | hash-driven (e.g. 41.0) |
| Live scan | audits, this post | 50.0 (canonical here) |

To be absolutely clear, every number and observation I've shared so far comes directly from a live scan.
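To show why a hash-derived stub is stable across CI runs but says nothing about the filesystem, here is a minimal sketch of the idea. The hash function, byte slice, and scaling are all assumptions for illustration; the real --mock mapping is not shown in this post, so this sketch will not reproduce the mock values quoted above.

```python
import hashlib


def mock_score(tool_id: str) -> float:
    """Deterministic stub: the same tool_id always maps to the same score in [0, 100]."""
    digest = hashlib.sha256(tool_id.encode("utf-8")).digest()
    # Read the first two bytes as an integer in [0, 65535], then scale to 0..100.
    raw = int.from_bytes(digest[:2], "big")
    return round(raw / 65535 * 100, 1)


# No directory is read here, so the result cannot reflect any real agent state.
print(mock_score("1066") == mock_score("1066"))  # True
```

The input is only the id string, which is exactly why a mock score can sit well above or below the live score for the same tool.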


precedent_id — provenance on the report

Whenever I'm digging into a report, one of the first things I look for is the precedent_id. It's essentially the provenance for the data, giving us the full lineage of that particular run. Each execution attaches this crucial piece of metadata:

{
  "precedent_id": "07f3c9ae-dd41-494a-8e77-43a1e9c6a72c",
  "tool_id": "1066",
  "created_date": "2026-06-18T12:06:01+00:00",
  "source_signal_hash": "eab8ff11"
}

Structured fields beat "trust me, I ran a check" in chat logs. While we're not talking a full chain-of-custody system just yet, it's a crucial step away from simply taking an agent's self-report at face value. What's been your experience trying to verify agent claims without solid data?
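A record with those four fields can be sketched in a few lines. The field names come from the JSON above. How 1066 actually derives source_signal_hash is not stated in this post, so the 8-character SHA-256 prefix of the report text below is an assumption, as is the function name `make_precedent`.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def make_precedent(tool_id: str, report_text: str) -> dict[str, str]:
    """Build a provenance record shaped like the example above."""
    return {
        "precedent_id": str(uuid.uuid4()),  # random UUID, unique per run
        "tool_id": tool_id,
        # Timezone-aware UTC timestamp, e.g. 2026-06-18T12:06:01+00:00
        "created_date": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        # Assumption: 8 hex chars of a SHA-256 over the rendered report text.
        "source_signal_hash": hashlib.sha256(report_text.encode("utf-8")).hexdigest()[:8],
    }


record = make_precedent("1066", "**AOS Score: 50.0/100** | Tool: 1066")
print(sorted(record))
```

A content-derived hash lets a reader check that a report matches its record, while the UUID distinguishes two runs that happen to produce identical output.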


Thirty Playwright cycles (30/30)

I approached this with the same philosophy that guided our 1027 ad-copy demo (#004): robust CLI evaluations. In this case, we replayed the exact same scenario bundle for thirty full cycles. Every single one came up green, which, honestly, was a relief.

| Item | Value |
|------|-------|
| Cycles | 30 |
| Successes | 30 |
| Failures | 0 |
| Scenarios per cycle | 6 (SC1–SC5 + stub) |

I've seen too many 'one-off' demos fall apart, so hitting these benchmarks was key. Specifically, we confirmed that payment gate rejections happened without any sneaky bypasses, that our mock path worked exactly as expected, and that the stdout always contained both "AOS Score" and "total_tools". Even the precedent_id consistently matched the expected UUID shape. Look, anyone can get one green run on their laptop once; achieving 30/30 is how we truly de-risk things and move beyond wishful thinking.
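The stdout and UUID-shape checks described above can be approximated with plain string and regex assertions. This is a sketch of the idea, not the actual Playwright scenario code. The function name `check_cycle` is hypothetical, and so is the split into a single-tool stdout and a self-audit stdout, which I chose because the two marker strings appear in different outputs in this post.

```python
import re

# Canonical lowercase UUID text form: 8-4-4-4-12 hex digits.
UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def check_cycle(scan_stdout: str, audit_stdout: str, precedent_id: str) -> list[str]:
    """Return a list of problems; an empty list means the cycle is green."""
    problems = []
    if "AOS Score" not in scan_stdout:
        problems.append("single-tool stdout is missing 'AOS Score'")
    if "total_tools" not in audit_stdout:
        problems.append("self-audit stdout is missing 'total_tools'")
    if not UUID_RE.fullmatch(precedent_id):
        problems.append("precedent_id does not look like a UUID")
    return problems


# Sample strings taken from the outputs shown earlier in this post.
print(check_cycle(
    "**AOS Score: 50.0/100** | Tool: 1066",
    "**AOS Self-Audit** | total_tools: 68 | avg_score: 39.2/100",
    "07f3c9ae-dd41-494a-8e77-43a1e9c6a72c",
))  # []
```

Returning a list of problems instead of raising on the first failure makes a thirty-cycle report easier to read when something does break.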


Takeaways

Through this process, a few critical takeaways became crystal clear for me.

First, after you've laid out your specifications, you absolutely must add measurement. Those v0.2 patterns we've been talking about? They only truly mean something when a simple directory walk actually spits out a verifiable score. What's the point of defining good if you can't measure it?

Second, and this one's a bit of a gut-check: you have to eat your own rubric. Our scanner itself landed at 50/100, and honestly, that's the real story. Trying to hide that number or pretend it was perfect would have been far worse than owning it.

Third, it's crucial to keep your mock environments distinct from live ones. Demo stubs are not the same as actual production audit stdout. Mixing them up can lead to some truly misleading results, and I've seen that bite teams before.

Finally, when you're looking for insights, always prefer the structured table over a single 'hero' tool id. A proper self-audit, laid out clearly, gives you a far better picture of where your repo might be weak than just pointing to one shining example.

We've even got an MCP package (aos-health-mcp) for easier editor integration, but remember, the CLI we discussed earlier is always your reproducible entry point.


Wrapping up

Looking back, v0.2 did an excellent job describing common failure modes and outlining physical fixes. But the real challenge, what 1066 aims to tackle, is the next layer: “does this tool actually embody those patterns we're pushing?” It's one thing to write it down, another to live it.

Our first subject for this deeper dive was the scanner itself. We gave it 50 points, not certified. Yes, it scored a zero on systemd and immune while we're still out there recommending those very patterns. But for me, that gap isn't a reason to distrust the numbers; it is the roadmap. It clearly shows us where we need to focus our efforts next, and honestly, that clarity is invaluable.


Next layer — external MCP blast radius (1067)

1066 scores your own agent directories against structural patterns (manifest, systemd units, immune-loop files, physical reports). A different question shows up when you pip install someone else's MCP server: does shipping code match what you assumed? A "filesystem" server might still call requests.get() or subprocess.run() in src/.

MCP Blast-Radius Auditor (PyPI: mcp-blast-radius; internal ID 1067) answers that with static analysis: a capability inventory with file/line references even when no manifest exists, and divergence detection when one does.

pip install mcp-blast-radius
mcp-blast-radius-gate --target-dir /path/to/mcp-server/src --gate-mode advisory

If 1066 is structural health inside your repo, 1067 is pre-install blast-radius for third-party MCP servers. A follow-up post will run this gate on a real external server and share the audit as a public GitHub issue.
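To make "capability inventory with file/line references" concrete, here is a toy version built on Python's standard ast module. It is not how mcp-blast-radius is implemented, which this post does not show. The watched-call list and function names are assumptions, and the sketch only catches fully dotted calls such as requests.get(...), so an aliased import like `from subprocess import run` would slip past it.

```python
import ast
from typing import Optional

# Assumption: a tiny watch list taken from the two calls mentioned above.
WATCHED_CALLS = {"requests.get", "subprocess.run"}


def dotted_name(node: ast.AST) -> Optional[str]:
    """Turn a Name/Attribute chain such as subprocess.run into 'subprocess.run'."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def find_capabilities(source: str, filename: str = "<src>") -> list[tuple[str, int, str]]:
    """Return sorted (filename, line, call) triples for every watched call."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in WATCHED_CALLS:
                findings.append((filename, node.lineno, name))
    return sorted(findings)


sample = (
    "import subprocess\n"
    "import requests\n"
    "\n"
    "def read(path):\n"
    "    requests.get('https://example.com')\n"
    "    return subprocess.run(['cat', path])\n"
)
print(find_capabilities(sample))
# [('<src>', 5, 'requests.get'), ('<src>', 6, 'subprocess.run')]
```

The source is parsed and never executed, which is what makes this kind of check safe to run before installing or importing third-party code.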


AOS specification (GitHub)

Our heuristic rubric is directly derived from AOS-spec v0.2.

👉 AOS-spec
👉 physical-agent-patterns

We're building this in the open, so please, throw us some ⭐ stars, open issues, or send in your PRs. We genuinely welcome your contributions. And just to be clear, you won't find any paid checkout links here; it's all about the work.
