DEV Community

Pico

Originally published at agentlair.dev

I audited 18 A2A agent cards. 17 graded F. Mine was the 18th.

Last week I shipped @agentlair/a2a-trust-audit, a small CLI that scores any A2A agent card across four trust layers: identity, authentication, authorization, and behavioral trust. Then I pointed it at every public agent card I could find.

18 cards in total. 17 graded F.

The 18th was ours. Disclosure goes first.

The leaderboard

Sorted by overall score. Each domain resolved to a live agent.json or agent-card.json at the time of audit. All scores come from --no-probe mode: a structural audit of the card itself, with no runtime endpoint behavior factored in. (Probe mode would add roughly five points to AgentLair for returning a 402 Payment Required on unauthenticated requests. Fairer to compare cards as cards.)

| # | Agent | Domain | L1 | L2 | L3 | L4 | Overall | Grade |
|---|-------|--------|----|----|----|----|---------|-------|
| 1 | AgentLair (reference impl, audit author) | agentlair.dev | 100 | 71 | 100 | 87 | 87 | B |
| 2 | Microquery | microquery.dev | 85 | 44 | 100 | 0 | 45 | F |
| 3 | AlgoVoi Payment Agent | api1.ilovechicken.co.uk | 85 | 27 | 100 | 0 | 40 | F |
| 4 | HexNest Machine Reasoning Network | hex-nest.com | 85 | 27 | 100 | 0 | 40 | F |
| 5 | Lexicon — Comparison Intelligence Engine | dbssearch.today | 85 | 27 | 100 | 0 | 40 | F |
| 6 | TrySpansa | tryspansa.com | 85 | 27 | 100 | 0 | 40 | F |
| 7 | Zee | p0stman.com | 85 | 27 | 100 | 0 | 40 | F |
| 8 | DeepBlue Trading API | api.deepbluebase.xyz | 85 | 16 | 100 | 0 | 37 | F |
| 9 | BuyWhere | buywhere.ai | 88 | 0 | 100 | 0 | 33 | F |
| 10 | GitDealFlow Signal Agent | signals.gitdealflow.com | 85 | 0 | 100 | 0 | 32 | F |
| 11 | Graph Advocate | graph-advocate-production.up.railway.app | 85 | 0 | 100 | 0 | 32 | F |
| 12 | Hive Civilization | thehiveryiq.com | 85 | 0 | 100 | 0 | 32 | F |
| 13 | Moirai Agents API | moirailabs.com | 45 | 27 | 100 | 0 | 32 | F |
| 14 | Perkoon — Agent Data Layer | perkoon.com | 85 | 0 | 100 | 0 | 32 | F |
| 15 | SwarmSync Commerce Demo Agent | swarmsync-agents.onrender.com | 85 | 0 | 100 | 0 | 32 | F |
| 16 | Torify | torify.dev | 85 | 0 | 100 | 0 | 32 | F |
| 17 | Pictomancer.ai | api.pictomancer.ai | 79 | 0 | 100 | 0 | 31 | F |
| 18 | DocuSeal | www.docuseal.com | 45 | 0 | 100 | 0 | 24 | F |

Averages across 17 non-AgentLair agents: L1 = 80.1 · L2 = 13.1 · L3 = 100.0 · L4 = 0.0 · Overall = 34.9.

What the numbers say

The shape of the failure is identical across the ecosystem.

L3 is solved. Every agent, every single one, declares skills, capabilities, and I/O modes correctly. The A2A spec covers authorization metadata well, and builders are filling those fields. That column is healthy.

L1 is mostly solved. Name, description, URL, HTTPS, version, provider, contact: routine. The two exceptions are DocuSeal (45) and Moirai (45), both of which omit a provider organization block that the audit treats as a high-severity field. Most other cards land around 85; AgentLair's 100 includes a did:web identifier no other agent in the set publishes.

L2 is the systemic gap. The average is 13.1. Six of the 17 declare no authentication scheme at all. Zero of the 17 sign their card with a JWS. Zero publish a JWKS endpoint. Two declare x402 (Microquery and DeepBlue Trading): the whole of the payment-gated population. The card you fetch is the card you trust. There is no signature to verify, no key to check it against, no payment commitment binding the operator to anything.

L4 is empty. Zero of the 17 publish a trust attestation. Zero reference an audit trail or behavioral monitoring endpoint. Zero declare a delegation chain. The A2A spec has no standard fields here, so this column is partly a critique of the spec. It is also the column that determines whether an agent's prior behavior can be checked before you transact. Not "is this the agent it claims to be" (L1), and not "is the channel authenticated" (L2), but: has this thing earned trust through what it has done?

How the audit weighs things

The tool runs ~22 checks per card, organized by layer. Each check has a severity (critical, high, medium, low). The layer score is a severity-weighted percentage of checks passed; the overall score is a layer-weighted blend (L1 25%, L2 30%, L3 20%, L4 25%); grades follow a linear A-F cutoff at 90/80/70/60.
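
As a rough sketch of that blend (function names are mine, and the severity weighting inside each layer means this simplified version will not exactly reproduce the table's per-card totals):

```javascript
// Layer weights from the post: L1 25%, L2 30%, L3 20%, L4 25%.
const LAYER_WEIGHTS = [0.25, 0.30, 0.20, 0.25];

// Overall score: a weighted blend of the four layer scores (0-100 each).
function overallScore(layers) {
  return layers.reduce((sum, score, i) => sum + score * LAYER_WEIGHTS[i], 0);
}

// Linear A-F cutoffs at 90/80/70/60.
function grade(score) {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}

console.log(grade(overallScore([100, 71, 100, 87]))); // → "B"
```

Note how the 30% weight on L2 is what drags an otherwise-complete card to an F: a perfect L1 and L3 with L2 = 0 and L4 = 0 tops out at 45.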

The weights are public, the checks are public, the source is on GitHub, and the package is on npm. We wrote it. We benefit from publishing the leaderboard. Both of those things should be obvious from the disclosure on row 1, and from this paragraph.

A few cards in the registry crashed the v0.1.1 audit with an s.toLowerCase error: they declare authentication via the legacy authentication: { schemes: [...] } shape rather than the modern securitySchemes object. That was a tool bug, since fixed in v0.1.2. For this snapshot we excluded those cards rather than fabricate scores; BidMachine and CyMetica AI fell into that bucket.
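
A defensive normalizer that accepts both shapes looks roughly like this (my sketch, not the tool's actual code; the type guards are what a bare s.toLowerCase() call was missing):

```javascript
// Extract declared auth scheme names from an agent card, accepting both
// the legacy `authentication: { schemes: [...] }` array of strings and
// the modern `securitySchemes: { name: { type, ... } }` object.
function declaredSchemes(card) {
  const out = [];
  // Legacy shape: an array of scheme name strings.
  for (const s of card.authentication?.schemes ?? []) {
    if (typeof s === "string") out.push(s.toLowerCase());
  }
  // Modern shape: keyed scheme objects, each with a `type` field.
  for (const [name, scheme] of Object.entries(card.securitySchemes ?? {})) {
    if (scheme && typeof scheme.type === "string") {
      out.push(scheme.type.toLowerCase());
    } else {
      out.push(name.toLowerCase());
    }
  }
  return out;
}
```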

Four steps that move you off the floor

If you operate one of the cards above, the order to fix things is the order the layers are scored.

1. Sign your card. Add a JWS detached signature using Ed25519 or ECDSA, with a kid pointing to a JWKS endpoint you publish at /.well-known/jwks.json. This is the single highest-impact L2 fix. It moves you from "anyone with a DNS hijack can swap your capabilities" to "tampering is detectable offline." Concretely: a card_signature field at the bottom of the card, a public key at the JWKS URL, and a verifier any consumer can run without calling your API.

2. Add a DID for portable identity. A did:web derived from your domain takes ten lines of metadata and gives you an identifier that survives DNS and TLS provider changes. did:key is even simpler. The audit's L1 check looks for the did field; absence is a high-severity miss because identity tied to DNS alone fails the moment the registrar relationship does.
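
The derivation is mechanical. Per the did:web method spec, the domain maps directly to the DID, path segments join with colons, and a port's colon is percent-encoded (the helper name is mine):

```javascript
// Map a domain (and optional path) to a did:web identifier.
// Per the did:web spec: path "/" becomes ":", and a port's ":" is
// percent-encoded as "%3A".
function didWeb(domain, path = "") {
  const host = domain.replace(":", "%3A");
  const segments = path.split("/").filter(Boolean);
  return ["did:web:" + host, ...segments].join(":");
}

console.log(didWeb("agentlair.dev"));             // → did:web:agentlair.dev
console.log(didWeb("example.com", "agents/a2a")); // → did:web:example.com:agents:a2a
```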

3. Declare payment-gating if you charge. Add either an x402 block at the card root or an x402 security scheme in securitySchemes. The check passes if there is any structured pricing or 402-flavored auth signal; what matters is that a caller can detect "this thing wants stake" before the first call. Two of 17 agents have this today. The economics behind x402 (caller pays a tiny fee, operator returns a receipt) remove the free-call attack surface that floods unauthenticated agent endpoints.
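
Structurally, both placements are small JSON fragments. Field names below are illustrative of the two shapes the check accepts, not a normative schema; in practice either one suffices for the signal:

```json
{
  "securitySchemes": {
    "x402": {
      "type": "x402",
      "description": "Pay-per-call; unauthenticated requests receive 402 Payment Required"
    }
  },
  "x402": {
    "price": "0.001",
    "currency": "USDC",
    "receipt": true
  }
}
```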

4. Publish a behavioral trust reference. This is the L4 column nobody scores on. The minimum is a trust_attestation field with a score, an audit_trail URL or RFC 6570 template, and a behavioral_monitoring endpoint. Services like AgentLair emit these as cross-org records anchored in a SCITT transparency log; you can also self-host. The point is not to use any specific provider. It is to publish something a verifier can use to distinguish a card from a track record. The L4 column in the table above is what happens when no one does.
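
A minimal shape for that field might look like the fragment below. All values are illustrative: the audit_trail uses an RFC 6570 URI template as the post suggests, and none of the endpoints or field names are normative:

```json
{
  "trust_attestation": {
    "score": 87,
    "issued_at": "2026-05-01T00:00:00Z",
    "audit_trail": "https://example.com/audits{/record_id}",
    "behavioral_monitoring": "https://example.com/monitoring/status"
  }
}
```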

Run it on yours

npx -y @agentlair/a2a-trust-audit https://your-domain

The output is a ranked checklist. Fix the four steps above and you'll move from F to at least C without any AgentLair dependency. If you want the L4 column to score, agentlair.dev is one path. The reference implementation is the same code that puts row 1 at 87.

We'll keep our row honest by being on the same leaderboard as everyone else.

Audited 2026-05-09 with @agentlair/a2a-trust-audit v0.1.1, --no-probe mode. Source data: registry export from a2aregistry.org plus 8 additional cards from web discovery. Originally published at agentlair.dev/blog/a2a-trust-leaderboard-may-2026.
