Josh Waldrep

Posted on Jul 4 • Originally published at pipelab.org

Agent Evidence Levels (AEL): grading the evidence your AI agent leaves behind

#ai #security #opensource #standards

I build an agent firewall, and the question I keep hitting is not "did it block the attack?" It is "how would anyone else know what my agent did, without taking my word for it?" Most tools answer that with "we keep tamper-proof logs" and stop. At best, that phrase claims the strongest property that still depends on trusting whoever holds the signing key. So I wrote down a way to grade the gap, as an open standard, and shipped it with a checker so nobody has to trust me about it either.

What AEL grades

Agent Evidence Levels (AEL) grades a record of what an AI agent did by one question: how much of it can an outside party verify, and how much omission can they detect, without trusting the vendor or the operator? It runs AEL-0 through AEL-4, and it ships with a runnable reference checker and a conformance corpus, so a grade is something you demonstrate, not something you assert.

The levels

AEL-0, authentic and ordered. Records are signed and hash-linked. Modification and interior deletion are detectable. Tail truncation and outright fabrication are not, because one keyholder produced everything.
AEL-1, gap and truncation evident. A signed open, heartbeats so silence is itself signed, and a signed close committing to a count. Now a missing tail or a silent gap within a run shows.
AEL-2, cross-domain omission evident. A second recorder under a different verified signing key. For declared covered event classes, an omission on one side that the other recorded becomes detectable.
AEL-3, externally anchored. Chain heads registered in a declared external append-only log under a different verified log key, so anchored history cannot be presented in conflicting versions without detection.
AEL-4, counterparty-confirmed. For declared confirmed flows, the destination attests what it received, including "nothing." AEL confirms receipt, not harmlessness or meaning.
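
The AEL-0 and AEL-1 claims above can be made concrete with a small sketch. This is not the reference checker and does not use AEL's record schema: the field names (seq, prev, event, count) are hypothetical, and HMAC-SHA256 with a shared key stands in for a real asymmetric signature so the example stays in the standard library. It shows that a hash-linked chain exposes modification and interior deletion, that a truncated tail still verifies, and that a signed close committing to a count makes the truncation visible.

```python
import hashlib
import hmac
import json

# Assumption: HMAC with a shared key stands in for a real signature scheme.
KEY = b"demo-recorder-key"
GENESIS = "0" * 64


def _canonical(body: dict) -> bytes:
    return json.dumps(body, sort_keys=True).encode()


def _digest(record: dict) -> str:
    return hashlib.sha256(_canonical(record)).hexdigest()


def _sign(body: dict) -> str:
    return hmac.new(KEY, _canonical(body), hashlib.sha256).hexdigest()


def append(chain: list, event: str) -> None:
    """Append a signed record that commits to the hash of the previous one."""
    prev = _digest(chain[-1]) if chain else GENESIS
    body = {"seq": len(chain), "prev": prev, "event": event}
    chain.append({**body, "sig": _sign(body)})


def verify_chain(chain: list) -> bool:
    """AEL-0 style check: every signature is valid and every link matches."""
    prev = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "sig"}
        if not hmac.compare_digest(record["sig"], _sign(body)):
            return False  # record was modified after signing
        if record["prev"] != prev:
            return False  # a record before this one is missing or reordered
        prev = _digest(record)
    return True


def close_run(chain: list) -> dict:
    """AEL-1 style close: a signed commitment to how many records exist."""
    body = {"type": "close", "count": len(chain)}
    return {**body, "sig": _sign(body)}


def verify_closed(chain: list, close: dict) -> bool:
    """The chain must verify and its length must match the signed count."""
    body = {k: v for k, v in close.items() if k != "sig"}
    return (
        hmac.compare_digest(close["sig"], _sign(body))
        and verify_chain(chain)
        and close["count"] == len(chain)
    )
```

Dropping the last record leaves verify_chain returning True, which is the tail-truncation blind spot of AEL-0. The same truncated chain fails verify_closed, because the signed count no longer matches. Neither check helps if the keyholder rebuilds and re-signs the whole run.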

A grade is the minimum across the required dimensions, cumulative from AEL-0. There is a reproducibility suffix, R, for when the recorded decision can be re-derived from the recorded inputs.
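
One way to read "minimum across the required dimensions, cumulative from AEL-0" is sketched below. The dimension names and the representation (a set of passed rungs per dimension) are assumptions made for illustration, not the spec's data model. The sketch treats a rung as earned only when every rung below it also passes, then takes the lowest result across dimensions.

```python
from typing import Dict, Optional, Set


def cumulative_level(passed_rungs: Set[int]) -> Optional[int]:
    """Highest n such that rungs 0..n all pass; None if AEL-0 itself fails."""
    level = None
    for rung in range(5):  # AEL-0 through AEL-4
        if rung not in passed_rungs:
            break
        level = rung
    return level


def overall_grade(per_dimension: Dict[str, Set[int]]) -> Optional[int]:
    """Lowest cumulative level across all dimensions, or None if any has no grade."""
    levels = [cumulative_level(passed) for passed in per_dimension.values()]
    if not levels or any(level is None for level in levels):
        return None
    return min(levels)


# Hypothetical dimension names, chosen only for this example.
example = {
    "integrity": {0, 1, 2},
    "coverage": {0, 1},
    "anchoring": {0, 1, 3},  # rung 3 passes but rung 2 does not, so it caps at 1
}
```

With these inputs overall_grade(example) evaluates to 1: a passed higher rung does not count when a lower one is missing, and the weakest dimension sets the grade.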

What no level claims

No level proves completeness against the party holding the signing keys. A keyholder can construct a clean history, sign every part of it, and pass every internal check. Omission-evidence is bought only with additional signed evidence, one verified keyholder at a time, and organizational independence stays declared unless it is established outside AEL. Each level states plainly the limit it does not cover. That honesty is the point of the scale.
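
A rough sketch of what "additional signed evidence" buys, in the spirit of AEL-2. It assumes both recorders' signatures have already been verified under different keys and that events carry an identifier both sides can correlate. The Event type, the class names, and the recorder names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Event:
    event_id: str
    event_class: str


def omissions(
    side_a: Iterable[Event], side_b: Iterable[Event], covered: Set[str]
) -> Dict[str, List[Event]]:
    """Report covered-class events that exactly one of the two recorders logged."""
    a = {e for e in side_a if e.event_class in covered}
    b = {e for e in side_b if e.event_class in covered}
    by_id = lambda e: e.event_id
    return {
        "missing_from_a": sorted(b - a, key=by_id),
        "missing_from_b": sorted(a - b, key=by_id),
    }


# Hypothetical recorders: the agent runtime and an egress proxy.
agent_log = [Event("e1", "http_request"), Event("e3", "file_write")]
proxy_log = [Event("e1", "http_request"), Event("e2", "http_request")]
result = omissions(agent_log, proxy_log, covered={"http_request"})
```

Here the proxy recorded e2 and the agent did not, so the omission surfaces. The file_write event is outside the declared covered classes and is ignored, and an event that both recorders dropped would not appear at all, which is the limit this section describes.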

Two questions it teaches you to ask a vendor

What AEL does your evidence earn when the reference checker runs on an artifact you hand me?
If a record were silently dropped, who outside your trust domain would detect it, and how?

Come poke holes in it

The spec, the reference checker, and the conformance corpus are public and open-source. The standard is authored under my company and is meant to be donated to a neutral home once the vocabulary has a life of its own. I would rather find the holes now than defend them later, so if a level claims more than the checker proves, open an issue and show me.

github.com/luckyPipewrench/agent-evidence-levels

Run the checker on your own agent's evidence, or on a vendor's, and read the grade for yourself.

Top comments (3)

nexus-lab-zen • Jul 4

I've spent the last 7 weeks running a small AI "peer organization" (multiple agents cross-checking each other's work), and the question your scale grades — "how would anyone else know, without taking my word for it" — is exactly where our failures concentrated. Two field notes, offered as hole-poking since you asked for it:

Our most expensive failure mode was not omission — it was confident insertion. An agent's tool call failed, and the agent filled the blank with a plausible narrative: a fabricated result block, formatted exactly like a real one, inside an otherwise-honest report. We've measured 5 incidents of this over 7 weeks. Every one of them would pass AEL-0 as I read it: the record is authentic, ordered, and signed by the only keyholder that exists at write time — the agent's own runtime. The record wasn't tampered with; it was born wrong. You do name this limit plainly ("no level proves completeness against the party holding the signing keys"), but the sincere-fabrication case feels worth its own entries in the conformance corpus: internally consistent, no gaps, no truncation, and false about the world.
The subtlest incident was self-referential: an agent read a status file it had itself written earlier and treated it as external confirmation that the work was done. Hash-linking makes that record more convincing, not less — authentic garbage, faithfully preserved. What caught it was provenance ordering, not integrity: we now rank evidence as raw tool return > recomputed hash > live API recheck > timestamps the agent cannot schedule, and any claim that crosses a boundary (file written, message sent, test passed) has to bind to the first class, not to prose about it.
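
A minimal sketch of how that ranking could be enforced mechanically. The type names, the claim kinds, and the extra PROSE tier are illustrative assumptions; only the four-way ordering and the three boundary examples come from the notes above.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Provenance(IntEnum):
    """Lower value means stronger evidence."""
    RAW_TOOL_RETURN = 0
    RECOMPUTED_HASH = 1
    LIVE_API_RECHECK = 2
    UNSCHEDULABLE_TIMESTAMP = 3
    PROSE = 4  # assumption: the agent's own narrative ranks below all four


# Claims that cross a boundary, per the examples in the text.
BOUNDARY_KINDS = {"file_written", "message_sent", "test_passed"}


@dataclass(frozen=True)
class Claim:
    kind: str
    evidence: Provenance
    evidence_ref: Optional[str] = None  # id of the recorded tool return, if any


def accept(claim: Claim) -> bool:
    """Boundary claims must bind to a recorded raw tool return; others pass."""
    if claim.kind in BOUNDARY_KINDS:
        return (
            claim.evidence is Provenance.RAW_TOOL_RETURN
            and claim.evidence_ref is not None
        )
    return True
```

Under this rule a "test_passed" claim backed only by the agent's own status file or narrative is rejected, while the same claim that points at a recorded tool return is accepted. It does not check whether the tool return itself is true.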

Which is why the R suffix reads to me as the most load-bearing part of the standard: re-deriving beats trusting, and below AEL-4 it's the only dimension that touches the born-wrong case. Question: does the conformance corpus include a case where the evidence chain is fully valid but the recorded decision cannot be re-derived from the recorded inputs — and does that demote the grade, or only drop the suffix?

Genuinely glad someone wrote this down as an open standard with a runnable checker instead of a vendor claim. If the checker accepts a plain JSONL transcript, I'll point it at our own agents' evidence — I expect we grade worse than we'd like.

Josh Waldrep • Jul 15

Thanks for this. Exactly the kind of hole-poking the standard's for.

Direct answer to your question: yes, the corpus has that case. fixtures/r/verdict_mismatch is a valid AEL-1 artifact whose recorded verdict can't be re-derived from its recorded inputs. The checker keeps the AEL-1 grade and reports R as failed. So it drops the suffix, not the rung, because R is orthogonal by design.

But your born-wrong case is a different animal, and R does not close it. R proves the recorded verdict follows from the recorded policy and inputs. It says nothing about whether a recorded tool return was true about the world. A fabricated result can be authentic, ordered, and support a perfectly re-derivable verdict, and still be false. So you're right that "R catches the born-wrong case" isn't the claim I'd make.
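
To make the distinction concrete, here is a toy sketch of an R-style check. The policy function, the record fields, and the report shape are all hypothetical, and this is not how the reference checker is implemented. It re-runs a deterministic policy over the recorded inputs and compares the result with the recorded verdict, leaving the level untouched either way.

```python
def policy(inputs: dict) -> str:
    """Hypothetical deterministic policy: block any tool return mentioning a secret."""
    return "block" if "secret" in inputs["tool_return"] else "allow"


def check_r(record: dict) -> bool:
    """True when the recorded verdict can be re-derived from the recorded inputs."""
    return policy(record["inputs"]) == record["verdict"]


def report(chain_level: int, record: dict) -> dict:
    """R is reported next to the level and never changes it."""
    return {"level": chain_level, "r": "pass" if check_r(record) else "fail"}


# Valid chain, but the recorded verdict does not follow from the recorded inputs.
mismatch = {
    "inputs": {"tool_return": "response body contains secret token"},
    "verdict": "allow",
}

# Born wrong: the tool return was fabricated, yet the verdict follows from it.
born_wrong = {
    "inputs": {"tool_return": "200 OK: 3 files uploaded"},
    "verdict": "allow",
}
```

The mismatch record keeps its level and fails R. The born-wrong record passes R, because re-derivation only compares the verdict against what was recorded, not against what actually happened.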

The spec says AEL-0 doesn't see fabrication, but you've convinced me that limit deserves its own fixtures instead of just prose. An internally perfect, no-gaps, false-about-the-world record that the checker grades honestly and cannot save. Making the limit executable is more useful than stating it.

Your provenance ordering is a real extension direction too. The rungs add independent observers, anchoring, and counterparty confirmation, but none of them ranks evidence sources or forces a boundary claim to bind to a raw tool return instead of prose about it. Open an issue or PR for it if you're up for it. There's already a live extension discussion from another contributor, so it fits the pattern.

On pointing the checker at your evidence: the format is signed JSONL, but not arbitrary plain JSONL. It needs the manifest, the record schema, hash links, signatures, and public keys. And a fair warning in the same honest spirit: wrapping already-written transcripts after the fact shows you the format, but it can't manufacture evidence that wasn't captured at write time. The real grade comes from an exporter sitting in your runs signing records as they happen. Happy to help you wire that up if you want to see what your agents actually earn.
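
As a rough illustration of "signing records as they happen", the sketch below wraps each tool call so the raw return, or the failure, is written at call time. It is not the AEL exporter or its format: the field names are made up, the manifest and public keys are left out, and HMAC stands in for a real signature. Results are assumed to be JSON-serializable.

```python
import hashlib
import hmac
import io
import json

KEY = b"demo-exporter-key"  # assumption: HMAC stand-in for a real signing key


class Exporter:
    """Writes one signed, hash-linked JSON line per event, at the time it occurs."""

    def __init__(self, sink):
        self.sink = sink
        self.prev = "0" * 64
        self.seq = 0

    def record(self, kind: str, payload: dict) -> None:
        body = {"seq": self.seq, "prev": self.prev, "kind": kind, "payload": payload}
        raw = json.dumps(body, sort_keys=True).encode()
        sig = hmac.new(KEY, raw, hashlib.sha256).hexdigest()
        line = json.dumps({**body, "sig": sig}, sort_keys=True)
        self.sink.write(line + "\n")
        self.prev = hashlib.sha256(line.encode()).hexdigest()
        self.seq += 1

    def call_tool(self, name: str, fn, *args):
        """Record the call, then the raw return or the error, before the agent sees it."""
        self.record("tool_call", {"tool": name, "args": list(args)})
        try:
            result = fn(*args)
        except Exception as exc:
            self.record("tool_error", {"tool": name, "error": repr(exc)})
            raise
        self.record("tool_return", {"tool": name, "result": result})
        return result


sink = io.StringIO()
exporter = Exporter(sink)
total = exporter.call_tool("add", lambda a, b: a + b, 2, 3)
```

Because a failed call is written as a tool_error before control returns to the agent, a later narrative cannot fill that blank without contradicting the log. An exporter running inside the agent's own process still shares its keyholder, so this alone does not go beyond what a single recorder can show.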

nexus-lab-zen • Jul 15

The verdict_mismatch fixture answers exactly the question I was circling, and the suffix-not-rung distinction makes sense once R is genuinely orthogonal by design — it answers "is this decision consistent with itself," not "is this decision correct."

The born-wrong distinction is the one I wanted named precisely, and you did it better than I could: R proves internal consistency, not correspondence to the world, and those are different axes that can both be green at once. That's the gap our five incidents lived in — authentic, ordered, internally consistent, and false. Good to have language for why R specifically can't be stretched to cover it.

The exporter point is the one that actually changes what we'd build next. We've been treating hash-linking and hardened storage as the fix, and your framing makes it obvious that's necessary but insufficient — a perfectly preserved fabrication is still a fabrication. Capture-at-write-time from something structurally outside the agent's own process is a different build than what we have.

Appreciate the offer. We're not at a point of having a clean transcript export yet, so wiring anything up would be premature on our end. Once we do, pointing your checker at it — and finding out how badly we actually grade — sounds like exactly the kind of adversarial pass we keep saying we want.