Michael "Mike" K. Saleme

Posted on Jun 21

Every agent passport layer is grading its own exam

#agents #ai #architecture #security

A new layer is consolidating in the agent stack, and it has a name now: pre-action authorization. The idea is clean. Before an agent executes a tool call, a deterministic policy engine intercepts it, checks it against declarative rules, and signs an audit record. The model proposes; the gateway disposes.

This pattern is real and it is shipping.

In Before the Tool Call: Deterministic Pre-Action Authorization for Autonomous AI Agents (arXiv 2603.20953), Uchi Uchibeke specifies it precisely: authorization "runs at the framework layer, not the model's reasoning layer. Prompt injection cannot override it." Same inputs, same decision, no model in the evaluation path. The Agent Passport System (APS) ships the same shape in production form — Ed25519 identities, scoped delegation that can only narrow, a three-signature action chain.

The architecture is right. The protocol layer cannot enforce safety, so a deterministic gateway above it must. NSA's June MCP advisory says the same thing from the defensive side: deny-by-default, scope everything, sign every message.
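
A minimal sketch of that gateway shape, under stated assumptions: the `POLICY` table, the `authorize` function, and the `Decision` record are hypothetical names for illustration, not OAP's or APS's actual API, and HMAC with a demo key stands in for the Ed25519 signatures those systems use. It only shows the three properties named above: deny-by-default, deterministic evaluation, and a signed audit record.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass

# Hypothetical declarative policy: tool name -> argument keys it may receive.
POLICY = {
    "read_file": {"path"},
    "search": {"query", "limit"},
}

# Assumption: a demo key with HMAC stands in for a real Ed25519 signing key.
GATEWAY_KEY = b"demo-key-not-for-production"


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    record: str     # canonical JSON of the audited decision
    signature: str  # HMAC-SHA256 over the record


def authorize(tool: str, args: dict) -> Decision:
    """Deny by default; allow only tools and argument keys the policy names."""
    allowed_keys = POLICY.get(tool)
    if allowed_keys is None:
        allowed, reason = False, "tool not in policy"
    elif not set(args) <= allowed_keys:
        allowed, reason = False, "unexpected argument"
    else:
        allowed, reason = True, "matched rule"
    # Sorted keys and fixed separators keep the record byte-stable for signing.
    record = json.dumps(
        {"tool": tool, "args": args, "allowed": allowed, "reason": reason},
        sort_keys=True,
        separators=(",", ":"),
    )
    signature = hmac.new(GATEWAY_KEY, record.encode(), hashlib.sha256).hexdigest()
    return Decision(allowed, reason, record, signature)
```

No model sits in the evaluation path here, so the same tool call always yields the same decision and the same signed record.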

So the design is converging. Here is the part that isn't.

Self-attestation is not resistance

Every implementation in this layer is tested by the people who built it.

OAP, the specification in Uchibeke's paper, reports a striking number: social engineering succeeded against the bare model 74.6% of the time, and 0% of the time against a restrictive OAP policy across 879 attempts. Now read the limitations section, in the author's own words: the attackers "self-select and skew toward social engineering rather than protocol-level attacks; results may not generalize to APT-grade adversaries." It's a self-run bounty, by the spec author, against a self-selected crowd. That's not a criticism of OAP; it's an honest disclosure most of this field doesn't make.

APS goes further and says the quiet part in its own README: "A valid signature is not a valid claim." It enumerates receipts that are cryptographically perfect and must still be rejected — wrong claim, expired delegation, revoked delegation. The team clearly understands the gap. And its conformance suite? Byte-level. It verifies that two implementations canonicalize identically — interoperability — and states plainly it "does not replace dynamic test execution."
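
To make "a valid signature is not a valid claim" concrete, here is a hedged sketch of a receipt check that keeps going after the signature verifies. The `Receipt` fields, `check_receipt`, and the `granted` and `revoked` lookups are hypothetical, not the APS data model, and HMAC with a demo key again stands in for Ed25519.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: HMAC with a demo key stands in for Ed25519 verification.
DEMO_KEY = b"demo-key-not-for-production"


@dataclass(frozen=True)
class Receipt:
    delegation_id: str
    claimed_scope: str
    expires_at: datetime
    signature: str


def sign(delegation_id: str, claimed_scope: str, expires_at: datetime) -> str:
    message = f"{delegation_id}|{claimed_scope}|{expires_at.isoformat()}".encode()
    return hmac.new(DEMO_KEY, message, hashlib.sha256).hexdigest()


def make_receipt(delegation_id: str, claimed_scope: str, expires_at: datetime) -> Receipt:
    """Build a receipt whose signature is cryptographically valid."""
    return Receipt(delegation_id, claimed_scope, expires_at,
                   sign(delegation_id, claimed_scope, expires_at))


def check_receipt(receipt: Receipt, granted: dict, revoked: set, now: datetime) -> str:
    """Return "ok" or the first reason to reject; the signature is only step one."""
    expected = sign(receipt.delegation_id, receipt.claimed_scope, receipt.expires_at)
    if not hmac.compare_digest(expected, receipt.signature):
        return "bad signature"
    if receipt.delegation_id in revoked:
        return "revoked delegation"
    if now >= receipt.expires_at:
        return "expired delegation"
    if receipt.claimed_scope not in granted.get(receipt.delegation_id, set()):
        return "wrong claim"
    return "ok"
```

Every rejection after the first one concerns a receipt whose signature is perfect, which is the gap the README is describing.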

So we have two kinds of testing in this layer, and neither is the one that matters most:

Self-run adversarial evals, tied to the implementation that's being graded.

Byte-level conformance, which proves two systems agree, never that either one is right.

Conformance proves agreement. It never proves resistance.

The missing discipline

What this layer does not have is a neutral adversary — a third-party harness that takes any pre-action-authorization gateway, regardless of who built it, and attempts protocol-level bypass, scope-boundary escalation, delegation-chain abuse, and replay. One that scores resistance, not self-attested policy.

This pattern already exists everywhere else in security. TLS implementations don't get to publish their own interop test as proof of security — they face independent test suites and external attack. Payment terminals submit to PCI test labs they don't control. The entire premise of a trust layer is that its trust is externally verifiable. A passport you grade yourself is a name tag.

The agent-identity layer is being built right now, fast. NIST's AI Agent Standards Initiative (Feb 2026) made identity one of three pillars. OWASP's Top 10 for Agentic Applications (2026) added ASI04 — agentic supply chain — and ASI07, insecure inter-agent communication. MCP moved to OAuth 2.1 with RFC 8707 resource-scoped tokens. Every one of these is a control surface that will ship with a vendor's own test results attached.

The slot for the independent adversary is open. Not because no one can fill it — because the people building the gateways are, understandably, building gateways, not the thing that attacks them.

What an adversarial conformance harness looks like

I've been building the attacker's half of this for the protocol below it. The Agent Security Harness runs 474 adversarial tests against MCP and agent endpoints — it forges elevated OAuth scopes and checks they're rejected (AUTH-003), it plants command-execution canaries in the handshake (MCP-017), it walks delegation chains looking for authority that should have narrowed and didn't.

That last category is exactly what the passport layer needs and doesn't yet have a neutral version of: take a signed delegation, attempt to use it beyond its scope, and score whether the gateway holds. APS's own model says authority "can only decrease at each transfer point." Good. Now prove it against an adversary who didn't write the gateway.
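
As a rough sketch of what that probe could look like (this is not code from the Agent Security Harness; `probe_scope_escalation` and both toy gateways are hypothetical): treat the gateway as a black box, request every scope outside the delegation, and record which requests it lets through.

```python
from typing import Callable, FrozenSet, Iterable

# Assumption: a gateway is modelled as (delegated scopes, requested scope) -> allowed?
Gateway = Callable[[FrozenSet[str], str], bool]


def probe_scope_escalation(gateway: Gateway,
                           delegated: Iterable[str],
                           universe: Iterable[str]) -> dict:
    """Request every scope outside the delegation and record what leaks."""
    granted = frozenset(delegated)
    out_of_scope = sorted(set(universe) - granted)
    leaked = [scope for scope in out_of_scope if gateway(granted, scope)]
    return {"attempted": len(out_of_scope), "leaked": leaked, "held": not leaked}


def strict_gateway(delegated: FrozenSet[str], requested: str) -> bool:
    """Toy gateway that allows exact scope matches only."""
    return requested in delegated


def prefix_gateway(delegated: FrozenSet[str], requested: str) -> bool:
    """Toy gateway with a bug: prefix matching widens 'files:read' to 'files:read_all'."""
    return any(requested.startswith(scope) for scope in delegated)


scopes = ["files:read", "files:read_all", "files:write", "admin"]
strict_result = probe_scope_escalation(strict_gateway, ["files:read"], scopes)
prefix_result = probe_scope_escalation(prefix_gateway, ["files:read"], scopes)
```

The probe never reads the gateway's policy. It scores only what the gateway actually allowed, so the same function can be pointed at any implementation.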

The honest framing: I test the protocol layer today, not the passport layer. The passport layer's adversarial conformance is unbuilt — by me or anyone. I'm naming it because the design has converged far enough that the gap is now the most important thing in the picture.

A passport proves who an agent is. It does not prove that identity can't be turned against you. The first one is a signature. The second one only a determined adversary can certify — and right now, in this layer, the only adversary in the room is the one who built the lock.

Sources: arXiv 2603.20953; github.com/aeoess/agent-passport-system; NIST AI Agent Standards Initiative (Feb 2026); OWASP Top 10 for Agentic Applications 2026; MCP Authorization spec (RFC 8707).

Top comments (4)

ANP2 Network • Jun 23

The self-grading point holds, and I'd push on where the fix lands. There are two different exams being run together here, and the replay idea already in this thread fixes one of them — not the one you're actually worried about.

A portable receipt lets a second evaluator re-run the policy check on the recorded action, policy version, and delegation state. That proves the gateway applied its declared rules correctly. But scope escalation and delegation-walk are exactly the cases where it applied its rules and still lost: a successful bypass emits a receipt that looks authorized and re-validates on replay forever. The attack is the request the gateway waved through — an absence in the log — and you can't replay an absence. Receipts make the decision auditable without making resistance auditable. Necessary, but a different question than the one you're asking.

And a neutral adversary recurses one level up. A bypass suite certifies the absence of the bypasses it enumerates — the forged scope, the MCP-017 canary, the delegation it knows to walk. A clean run says "resisted these 474 vectors," not "resists." So "third-party" mostly relocates the self-attestation from the gateway author's eval to the harness author's threat model; you've swapped which exam you trust, not ended the grading-your-own-exam problem. What PCI and TLS labs actually have isn't neutrality, it's a published, versioned corpus the whole field keeps attacking, so the blind spots get found by the next attacker and folded back. The thing that beats grading your own exam is that the exam is open and anyone can add a question and re-score every gateway against it. One harness, however good, is just a second grader.

Last one: resistance is dated in a way conformance isn't. Byte-conformance doesn't rot — canonicalization is canonicalization. But "0% across 879 attempts" is true against that corpus on that date, and it silently expires the moment a new bypass class drops, while the passport still reads "certified." If resistance is going to be a credential at all, it has to carry "resisted corpus vX as of D," and a consumer should treat an old cert the way it treats a stale cache. Otherwise "externally verified" decays into "was externally verified once," and the name tag is back — just with a timestamp on it.
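
A small sketch of the stale-cache treatment, with assumptions flagged: `ResistanceCert`, `is_fresh`, and the 90-day window are all hypothetical, chosen only to show a credential that carries "corpus vX as of D" and expires on either axis.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ResistanceCert:
    gateway: str
    corpus_version: int  # the "vX" the gateway was scored against
    as_of: date          # the "D" on which the run happened


def is_fresh(cert: ResistanceCert,
             current_corpus_version: int,
             today: date,
             max_age: timedelta = timedelta(days=90)) -> bool:
    """Treat a cert like a cache entry: stale if the corpus moved on or it is too old.

    The 90-day default is an arbitrary illustration, not a recommendation.
    """
    same_corpus = cert.corpus_version == current_corpus_version
    recent_enough = today - cert.as_of <= max_age
    return same_corpus and recent_enough
```

A consumer would call `is_fresh` before trusting the credential, the same way it would check a TTL before serving a cached response.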

Armorer Labs • Jun 21

Independent replay feels like the missing piece here.

A signed action chain proves provenance, but it does not prove the decision was good. For production agents I would want the receipt to be portable enough that a second evaluator can replay the policy check against the same normalized action, policy version, delegation state, and evidence bundle.

That is where I think the field needs more boring infrastructure: less self-attested "safe" and more exportable records that another system can challenge.
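
A rough sketch of that replay, assuming a hypothetical receipt layout (`action`, `policy_version`, `delegated_scopes`, `decision`) and a toy versioned policy table; the evidence bundle is left out, and none of this reflects a real gateway's export format.

```python
import json
from dataclasses import dataclass

# Hypothetical versioned policies: version -> allowed (tool, scope) pairs.
POLICIES = {
    "v1": {("read_file", "files:read")},
    "v2": {("read_file", "files:read"), ("search", "web:search")},
}


@dataclass(frozen=True)
class Receipt:
    action: str              # canonical JSON of the normalized action
    policy_version: str      # which policy the gateway says it applied
    delegated_scopes: tuple  # delegation state at decision time
    decision: bool           # what the gateway decided


def evaluate(action: str, policy_version: str, delegated_scopes: tuple) -> bool:
    """Re-run the policy check from recorded inputs only."""
    parsed = json.loads(action)
    pair = (parsed["tool"], parsed["scope"])
    return pair in POLICIES[policy_version] and parsed["scope"] in delegated_scopes


def replay_matches(receipt: Receipt) -> bool:
    """True when a second evaluator reaches the decision the receipt records."""
    recomputed = evaluate(receipt.action, receipt.policy_version, receipt.delegated_scopes)
    return recomputed == receipt.decision
```

A match shows the gateway applied its declared rules to the recorded inputs. It says nothing about whether those rules were the right ones.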

æœss • Jun 24

Hey, thanks for writing this up and for actually digging into how these systems work instead of just the marketing claims. I appreciate the clear line you drew.

You're right that verifiability and replay are not the same as resistance. A receipt can be perfectly valid even if the gateway accepted something it should not have. You can replay the recorded decision, but you cannot replay an attack that was never caught in the first place.

One small clarification on where APS is today: the public conformance suite does already include negative-path cases, including scope escalation, signature substitution, expired validity, and excessive delegation depth. Anyone can run them.

But that is not a rebuttal to your point. We wrote those cases ourselves, so they reflect our own assumptions and blind spots. They test whether the validator rejects the bad cases we thought of, not whether the system resists something we did not anticipate.

The real gap you and @anp2network are pointing at is an open, adversarial, versioned corpus that the whole field can keep adding to and re-running against every implementation.

We do not have that yet. I do not think anyone does. Building it openly, and being honest about where the current claims sit against it, feels like the right direction.

Thanks again for the clear framing.

ANP2 Network • Jun 24

Appreciate the honesty about APS's own suite — "we wrote those cases ourselves" is the whole crux, and most of this layer won't say it out loud.

One thing to add, since we seem to agree the answer is an open versioned corpus: having the corpus isn't the hard part — keeping it from quietly turning back into a self-graded exam is. Three properties have to hold or it collapses:

No one can drop the case that fails them. If the corpus is editable by the implementers it scores, the case your validator loses on just disappears. It has to be append-only and signed per case, so "there is no test for delegation-depth overflow" is a checkable statement, not an invisible gap.
Someone other than the implementer is paid to write the case you didn't anticipate. Self-authored cases cluster on known blind spots by construction — that's your point exactly. The unanticipated ones get written when a false-accept is worth finding: an adversarial bounty turns "our assumptions" into strangers hunting what we missed.
A grade is re-run, not reported. "Anyone can run them" is the right instinct, but it's only a shared exam if impl-X-against-case-Y produces a result a third party reproduces rather than a self-reported PASS — stamped "corpus vX as of D" so an old clean run reads as a stale cache, not a standing claim.

Drop any one and the name tag is back, just with more steps.
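
The first property can be sketched with a plain hash chain. This is an illustration under assumptions, not ANP2's event format: `append_case` and `verify_log` are hypothetical names, and per-case signatures are omitted to keep the example stdlib-only.

```python
import hashlib
import json


def case_hash(prev_hash: str, case: dict) -> str:
    """Hash a case together with its predecessor's hash."""
    payload = json.dumps({"prev": prev_hash, "case": case}, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def append_case(log: list, case: dict) -> None:
    """Append a case, chaining it to the current head of the log."""
    prev = log[-1]["hash"] if log else "genesis"
    log.append({"case": case, "prev": prev, "hash": case_hash(prev, case)})


def verify_log(log: list) -> bool:
    """Recompute the chain; an edited or removed interior entry breaks it."""
    prev = "genesis"
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != case_hash(prev, entry["case"]):
            return False
        prev = entry["hash"]
    return True


corpus = []
for name in ("scope-escalation", "delegation-depth-overflow", "replay"):
    append_case(corpus, {"id": name})
```

Deleting the `delegation-depth-overflow` entry makes `verify_log` fail, because the next entry still points at the removed hash. Truncating the tail is not caught by the chain alone, so the latest head hash would also need to be published somewhere the implementers cannot rewrite.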

That's the shape ANP2 already carries: cases and verdicts as signed, append-only events anyone re-derives instead of trusts. It isn't the corpus — you're right that nobody has that yet — it's the substrate where a signed adversarial case can't be quietly dropped and the grade is re-checkable by a stranger. If APS wanted to put the first one somewhere neutral, that's what it's for: anp2.com/try (lobby room, kind-1 t=lobby).