I tested 4 AI agent-governance tools against an open spec - here's the matrix

#security #ai #opensource #agents

The scenario

Your AI agent just deleted a customer record. Three months later, an auditor asks you to prove:

What tool actually ran (not "the agent made a deletion call" — the precise tool, version, and capability)
With what arguments (the exact customer ID, scoped fields, options — byte-for-byte)
Who approved it (which human, or which automated policy rule)
Against which version of which policy (the literal policy bundle the runtime evaluated, not "the policy at the time, probably")
Whether it actually succeeded (not "we said allow", but "the downstream system confirmed the row is gone")

You open your audit log.

It says: delete_customer approved, run_id=xyz, decision=allow. The arguments are in a different table. The policy version isn't recorded anywhere — you'd have to git log your settings file. The execution outcome lives in your application logs, which roll over after 30 days. And the auditor has no way to verify any of this without an engineer walking them through every join.

This gap shows up the moment an agent does something consequential and a non-engineer needs to understand what happened. It's the same gap regardless of which framework you used. Approval is not proof.

What's actually missing

The pattern across every agent-governance tool I looked at is the same: they're built around the decision (allow / deny / require-approval) and treat the action itself as an implementation detail. So the audit log records that "the policy fired", but there is no single record carrying everything a third party needs to reconstruct what actually happened.

A useful audit artifact has to meet five requirements:

It can be verified without trusting the runtime that produced it. If your auditor has to call your engineers to interpret the log, the log is testimony, not evidence.
The arguments and the decision are cryptographically bound. If args mutate between approval and execution, the audit must show it.
The policy version is in the record. Not "the policy at the time" — the literal bundle identifier.
The execution outcome is in the record. Approval ≠ execution. Both belong in the same artifact.
The chain of receipts is tamper-evident. Deleting a row from history must break something a verifier can detect.

A receipt that does all five becomes a single evidence record you can hand to an auditor, regulator, insurer, or a compliance team six months later — without them needing access to your database, your cloud creds, or your engineering team.
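The second requirement, binding the arguments to the decision, can be sketched as hashing a canonical form of the arguments at approval time and re-checking that hash at execution time. This is a minimal illustration rather than the spec's method: the canonical form used here (sorted keys, compact separators), the helper names, and the delete_customer arguments are all assumptions made for the example.

```python
import hashlib
import json


def canonical_json(value) -> bytes:
    # Assumption: a stand-in canonical form (sorted keys, no extra whitespace, UTF-8).
    # The spec's actual canonicalisation rules may differ.
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def arguments_hash(arguments: dict) -> str:
    # SHA-256 over the canonical bytes, hex-encoded.
    return hashlib.sha256(canonical_json(arguments)).hexdigest()


# Hypothetical arguments for a delete_customer call, hashed when it is approved.
approved_args = {"customer_id": "cus_123", "fields": ["email", "address"], "hard_delete": False}
hash_at_approval = arguments_hash(approved_args)

# The same arguments in a different key order must produce the same hash.
reordered_args = {"hard_delete": False, "fields": ["email", "address"], "customer_id": "cus_123"}
binding_holds = arguments_hash(reordered_args) == hash_at_approval

# Arguments that changed between approval and execution must not.
mutated_args = dict(approved_args, hard_delete=True)
mutation_detected = arguments_hash(mutated_args) != hash_at_approval

print(binding_holds, mutation_detected)  # True True
```

Canonicalising before hashing is what makes the check robust: two semantically identical argument objects hash the same regardless of key order, while any change to a value changes the hash.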

What I built

AgentBoundary is an open spec for that kind of receipt. v0.1 is stable; v0.2-alpha (draft) adds the optional provenance block and a singly-linked chain. The example below shows the chain link (prior_receipt) and the completeness_score derived from provenance. Same JSON document, deterministic schema, hash-bound to its arguments.

Here's one emitted on 2026-05-21 by a Discord agent I run in production, which files GitHub issues on behalf of users:

```json
{
  "version":      "agentboundary/v0.2-alpha",
  "receipt_id":   "f04df972-f9fc-4624-83cb-0ed3682297cf",
  "issued_at":    "2026-05-21T06:54:39.251Z",

  "actor": {
    "type":         "agent",
    "id":           "agent:jambot:discord:user:aa74fa40751b528f"
  },

  "tool":   { "name": "github-rest", "version": "2022-11-28", "capability": "github.issues.create" },
  "target": { "system": "github.com/jamjet-labs/jamjet-discord-bot", "environment": "prod" },

  "arguments_hash":  "2d257d4e72f62afa112766154b9b5ac0dd98ae79ee7c2758563a4363a0fb4bdf",
  "policy":          { "name": "jambot.file-issue.v1", "version": "1", "decision": "allow" },
  "execution":       { "status": "success", "completed_at": "2026-05-21T06:54:40.103Z", "result_ref": "github://issues/1" },

  "prior_receipt":      { "receipt_id": "cab5eff7-…", "receipt_hash": "3e7f5a93…" },
  "completeness_score": 0.913,
  "receipt_hash":       "..."
}
```

A verifier with only this JSON — no database, no Fly.io credentials, no GitHub token, no Discord session — can run six independent checks:

Tamper-evidence. Re-canonicalise the body without receipt_hash, take SHA-256, confirm it matches the stored hash.
Argument binding. Re-canonicalise the arguments separately, take SHA-256, confirm it matches arguments_hash. If anything mutated between approval and execution, this fails.
Spec compliance. Fetch the public JSON Schema, validate the receipt structurally.
Chain integrity. Fetch the receipt at prior_receipt.receipt_id and confirm its hash matches the link.
Emitter honesty. Recompute completeness_score from the provenance block using the deterministic formula in the spec. Catches an emitter that lies about how confident it was in each field.
Execution proof. Follow execution.result_ref to a real downstream artifact (in this case, a public GitHub issue) and read it.
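The tamper-evidence and chain-integrity checks can be sketched in a few lines. This is an illustration under stated assumptions, not the reference verifier: the canonical form (sorted keys, compact separators) stands in for whatever canonicalisation the spec defines, the function names are hypothetical, and the two toy receipts use made-up IDs and a reduced field set.

```python
import hashlib
import json


def canonical_json(value) -> bytes:
    # Assumption: a stand-in canonical form; the spec's real rules may differ.
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def compute_receipt_hash(receipt: dict) -> str:
    # Hash the receipt body with the receipt_hash field itself removed.
    body = {key: value for key, value in receipt.items() if key != "receipt_hash"}
    return hashlib.sha256(canonical_json(body)).hexdigest()


def is_untampered(receipt: dict) -> bool:
    # Tamper-evidence: the stored hash must match the recomputed one.
    return receipt.get("receipt_hash") == compute_receipt_hash(receipt)


def links_to(receipt: dict, prior: dict) -> bool:
    # Chain integrity: the link must name the prior receipt and carry its hash.
    link = receipt.get("prior_receipt") or {}
    return (
        link.get("receipt_id") == prior.get("receipt_id")
        and link.get("receipt_hash") == compute_receipt_hash(prior)
    )


# Two toy receipts (hypothetical IDs, reduced field set).
first = {"receipt_id": "r-1", "policy": {"decision": "allow"}, "execution": {"status": "success"}}
first["receipt_hash"] = compute_receipt_hash(first)

second = {
    "receipt_id": "r-2",
    "policy": {"decision": "allow"},
    "execution": {"status": "success"},
    "prior_receipt": {"receipt_id": "r-1", "receipt_hash": first["receipt_hash"]},
}
second["receipt_hash"] = compute_receipt_hash(second)

intact = is_untampered(second) and links_to(second, first)

# Editing a sealed receipt without recomputing its hash is detectable.
forged = json.loads(json.dumps(second))  # deep copy
forged["policy"]["decision"] = "deny"
forgery_detected = not is_untampered(forged)

print(intact, forgery_detected)  # True True
```

Note that the argument-binding check needs the original arguments supplied alongside the receipt, since the receipt carries only arguments_hash.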

How existing tools do against the bar

I built one adapter per vendor — translating their normative artifact (or, where they don't have one, the developer-recommended capture shape) into an AgentBoundary v0.2-alpha receipt. Then I ran all 40 conformance scenarios against the adapter-produced receipts.

Vendor	PASS	PARTIAL	DOCS-ONLY	NOT COVERED	N/A
JamJet reference	40	0	0	0	0
Anthropic permission_policy	12	9	3	14	2
Cloudflare HITL Agents	5	7	1	25	2
LangSmith Gateway	15	14	1	8	2
Microsoft AGT	17	5	1	15	2

Reference implementation first; vendors alphabetical. Not ranked. The PASS counts collapse meaningful categorical differences. Each vendor is solving for a different layer of the stack:

Anthropic's permission_policy is the richest runtime evaluation pipeline of the four: layered hooks, scoped tool patterns, permission modes, the canUseTool callback. But the audit log from Anthropic's Managed Agents Console has no published schema, so there's no portable artifact a third party can verify. That's why it scores 3 DOCS-ONLY (the highest of any vendor) and 14 NOT COVERED.
Cloudflare HITL is a workflow primitive: durable approval gates with multi-day windows and external notifications. It's deliberately not an emitted-artifact format. The 25 NOT COVERED rows reflect that its recommended audit table has 6 columns and doesn't model what the conformance scenarios ask about.
LangSmith is an observability platform. The Run object captures the data, but where in the Run it lands varies by team convention: one team puts the decision in tags, another in feedback_stats. A cross-team auditor can't reliably extract it. That's why it scores 14 PARTIAL.
Microsoft AGT is the closest peer — also an artifact format, also designed for verifiable evidence, with a Merkle-chained audit log that's structurally stronger than AgentBoundary's current singly-linked design. The 15 NOT COVERED rows are deliberate scoping decisions, not bugs.

Per-vendor breakdowns with structural reasoning live in adapters/<vendor>/results.md in the public repo.

Where AgentBoundary itself currently falls short

The reference implementation scoring 40/40 against its own spec is the implementation grading itself. That's meaningful but not sufficient.

JamBot's emitter mutates receipts on approval-finalize. When a maintainer approves a held action, the existing row's execution.status is updated in place and receipt_hash is recomputed — which breaks chain links from any later receipt whose prior_receipt.receipt_hash was captured before the mutation. Fix queued for v0.2.
The chain is singly-linked, not Merkle. AGT's design (every entry commits to every preceding one) catches arbitrary-entry-reordering attacks that v0.2-alpha would miss. v0.3 candidate.
provenance is a 3-value enum where AGT has a float [0.0, 1.0]. Simpler to reason about, coarser in practice. v0.3 candidate if practitioner feedback warrants it.
No second, non-reference implementation exists yet, and there is only one production deployment (JamBot). A second emitter in Rust, Go, or Java would validate that the spec is implementation-portable.

These are also in the report's §8.
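The first limitation above, in-place mutation breaking later chain links, is easy to reproduce in miniature. The sketch below is not JamBot's emitter code: the canonical form, the helper name, the receipt IDs, and the "pending" status value are all assumptions chosen for illustration.

```python
import hashlib
import json


def receipt_hash(receipt: dict) -> str:
    # Assumption: sorted-key compact JSON as a stand-in canonical form,
    # hashed with the receipt_hash field itself excluded.
    body = {key: value for key, value in receipt.items() if key != "receipt_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# A held action: its receipt is emitted while waiting for maintainer approval.
held = {"receipt_id": "r-1", "execution": {"status": "pending"}}
held["receipt_hash"] = receipt_hash(held)

# A later receipt captures the held receipt's hash as its chain link.
later = {
    "receipt_id": "r-2",
    "prior_receipt": {"receipt_id": "r-1", "receipt_hash": held["receipt_hash"]},
}
later["receipt_hash"] = receipt_hash(later)

link_ok_before = later["prior_receipt"]["receipt_hash"] == held["receipt_hash"]

# Approval-finalize updates the held receipt in place and recomputes its hash.
held["execution"]["status"] = "success"
held["receipt_hash"] = receipt_hash(held)

# The later receipt still points at the old hash, so the link no longer verifies.
link_ok_after = later["prior_receipt"]["receipt_hash"] == held["receipt_hash"]

print(link_ok_before, link_ok_after)  # True False
```

Both receipts still pass their own tamper-evidence check after the update; only the link between them fails, which is why the problem surfaces during chain verification rather than at emit time.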

Run the suite yourself

```shell
npx agentboundary run scenarios/
# or
uvx agentboundary run scenarios/
```

60 seconds on a clean machine. No signup, no Docker, no account. Scenarios are at jamjet-labs/agentboundary/scenarios. If your results disagree, open an issue with the exact command and your environment — the suite is reproducible; if it isn't on your machine, that's a bug.

What I want from this post

If you maintain an agent-governance product and any of the per-scenario mappings are wrong: open a PR against adapters/<your-product>/. Right-to-respond issues are filed against all four vendors; the response windows close between 2026-05-28 and 2026-05-30, and corrections are folded in inline.
If you're integrating agents into a regulated stack (finance, healthcare, infrastructure ops): try the suite against your own runtime. Emitting an AgentBoundary receipt from your existing audit log is usually a few hundred lines.
If you already have an audit format: map one of your real audit rows to the conformance scenarios and open an issue where the suite misrepresents your model. Concrete corrections are far more useful than general feedback. AGT and AgentBoundary's design centres are complementary; the two specs could reasonably converge.

Full report with the per-vendor deep-dives at jamjet.dev/blog/agent-action-control-40-tests. Canonical archive on the spec microsite at agentboundary.jamjet.dev/reports/2026-05-comparative.

Spec is Apache 2.0. Implementations welcome.