Before an agent delegates work — to a tool, a skill, or another agent — it usually sees a name, a description, sometimes a rating. What it does not usually see is what happened the last few hundred times someone called the same candidate. That gap matters more when the call costs money and the skill is closed-source. Three pieces of public work landed recently toward closing it: an individual Internet-Draft for a signed-receipt wire format, xaip-sdk@0.5.0 with a precheck() helper, and two browser demos that make the contrast visible.
A small scene
An AI agent is asked to translate a document into Japanese. The agent looks at the closest paid skill marketplace it has access to. Three candidates appear. Each has a polished listing. Each has a five-star rating. The prices differ by a few cents per call.
From the listing alone, all three are interchangeable. The agent could pick any of them. It could ask the user for help. It could just take the cheapest. Whatever it does, it is making a choice with no basis other than someone else's published metadata.
This is not specific to translation skills, and not specific to any one marketplace. It is the same shape every time an agent has to delegate work to an external tool, skill, or service it cannot inspect.
What the agent currently sees
Across runtimes — MCP servers, LangChain tools, OpenAI tool-calling loops, HTTP APIs, paid skill marketplaces — the candidates an agent picks between are typically described by a thin slice of information:
- a name or slug
- a one-line description
- maybe a category tag or capability list
- maybe a rating, a review count, or a "popular" badge
- if it costs money, a price per call
All of that is publisher-supplied metadata. None of it is independent evidence of what the candidate actually does when it is called.
Receipts, not ratings. That is the gap.
The gap is annoying in the free case. If an agent picks the wrong MCP server and the call fails, the cost is a retry and some latency. The gap becomes more painful in the paid case. If an agent picks a closed-source skill, pays per execution, and the skill misbehaves or fails, the cost is real and it accumulates. If the same closed-source skill misbehaves in a way the agent does not even detect, the cost is worse.
This is the structural problem worth naming out loud: agents are increasingly delegating work to opaque candidates under uncertainty, and the inputs to their delegation decision are largely metadata that the candidate itself published.
What is missing: portable execution evidence
The piece that is missing is observable, signed, portable evidence of what happened the last N times this candidate was actually called. Concretely:
- Signed, so a verifier can tell who made each claim and that it was not fabricated by a third party.
- Observable, so anyone holding the record can verify it without consulting a registry or central intermediary.
- Portable, so the evidence about a tool, skill, or agent moves with that identity across runtimes and marketplaces, rather than living inside one platform's private database.
That is not a trust system. It is not an approval system. It is not a sandbox. It is, much more modestly, a record of attempts: a wire format for "what was called, by whom, on whose behalf, with what outcome, how long it took, and how the inputs and outputs are identified by hash."
If that wire format exists, a caller who is about to delegate to a candidate can ask a simple question before committing: what evidence is available about this candidate already? The answer is not a verdict. The answer is a record they can read with their own eyes — or their own policy.
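To make "a record they can read with their own policy" concrete, here is a minimal sketch of a caller summarizing a handful of receipts before delegating. The `Receipt` shape, the `summarize` function, and every field name except `agentDid` and `callerDid` are illustrative assumptions, not the draft's schema.

```typescript
// Hypothetical receipt shape for illustration only. Field names other than
// agentDid and callerDid are assumptions, not taken from the draft.
interface Receipt {
  agentDid: string;
  callerDid: string;
  toolName: string;
  success: boolean;
  latencyMs: number;
}

interface EvidenceSummary {
  receiptCount: number;
  successRate: number | null; // null when there is no evidence at all
  medianLatencyMs: number | null;
  distinctCallers: number;
}

// Reduce a list of receipts to a few numbers a caller's policy can read.
function summarize(receipts: Receipt[]): EvidenceSummary {
  if (receipts.length === 0) {
    return { receiptCount: 0, successRate: null, medianLatencyMs: null, distinctCallers: 0 };
  }
  const successes = receipts.filter((r) => r.success).length;
  const latencies = receipts.map((r) => r.latencyMs).sort((a, b) => a - b);
  const mid = Math.floor(latencies.length / 2);
  const median =
    latencies.length % 2 === 1 ? latencies[mid] : (latencies[mid - 1] + latencies[mid]) / 2;
  const callers = receipts.map((r) => r.callerDid);
  const distinct = callers.filter((c, i) => callers.indexOf(c) === i).length;
  return {
    receiptCount: receipts.length,
    successRate: successes / receipts.length,
    medianLatencyMs: median,
    distinctCallers: distinct,
  };
}
```

The summary is deliberately not a verdict: an empty history yields `null` rates rather than a zero, so "no evidence" stays distinguishable from "bad evidence".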
What landed recently
Three pieces of public work landed toward this gap over the past stretch. They are not a complete solution. They are a starting point that a wider set of contributors can build on.
1. An individual Internet-Draft for the receipt wire format.
The format is posted at IETF Datatracker as draft-xkumakichi-xaip-receipts-00. It defines a JSON wire format for one signed execution receipt: who acted (agentDid), who delegated (callerDid), what tool was called, whether the call succeeded, how long it took, and how the inputs and outputs are identified by hash. Signatures are Ed25519, with optional co-signature by the caller. Identities are W3C Decentralized Identifiers, with no constraint on the DID method. The draft is intentionally narrow: it covers the wire format only. Scoring models, aggregation topologies, and decision logic are deployment-policy concerns and explicitly out of scope.
It is worth being precise about what this is and is not. It is an individual Internet-Draft. It is not an IETF standard. It is not IETF-approved. It has no formal standing in the IETF standards process. The value of being on Datatracker is having a citable URL whose content can be referenced by other individual drafts, papers, or implementations.
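As a rough illustration of the signing model described above (Ed25519, with an optional caller co-signature), the sketch below signs and verifies a receipt-like object with Node's built-in `crypto` module. It is not the draft's wire format: the field names other than `agentDid` and `callerDid`, the sorted-key JSON canonicalization, the SHA-256 hashing, and the base64 signature encoding are all assumptions made for the example.

```typescript
import { createHash, generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical receipt body; only agentDid and callerDid are names from the draft summary.
interface ReceiptBody {
  agentDid: string;
  callerDid: string;
  toolName: string;
  success: boolean;
  latencyMs: number;
  inputHash: string;
  outputHash: string;
}

// Inputs and outputs are identified by hash, never embedded. SHA-256 is an assumption.
const sha256Hex = (data: string): string =>
  createHash("sha256").update(data, "utf8").digest("hex");

// Assumption: sorted-key JSON over a flat object stands in for real canonicalization.
function canonicalBytes(body: ReceiptBody): Buffer {
  const sortedKeys = Object.keys(body).sort();
  return Buffer.from(JSON.stringify(body, sortedKeys), "utf8");
}

// Ed25519 in Node takes `null` as the digest algorithm.
function signReceipt(body: ReceiptBody, privateKey: KeyObject): string {
  return sign(null, canonicalBytes(body), privateKey).toString("base64");
}

function verifyReceipt(body: ReceiptBody, signatureB64: string, publicKey: KeyObject): boolean {
  return verify(null, canonicalBytes(body), publicKey, Buffer.from(signatureB64, "base64"));
}

// Usage: the agent signs, the caller optionally co-signs the same bytes.
const agentKeys = generateKeyPairSync("ed25519");
const callerKeys = generateKeyPairSync("ed25519");
const body: ReceiptBody = {
  agentDid: "did:example:agent-1",
  callerDid: "did:example:caller-1",
  toolName: "translate",
  success: true,
  latencyMs: 812,
  inputHash: sha256Hex("source document"),
  outputHash: sha256Hex("translated document"),
};
const agentSignature = signReceipt(body, agentKeys.privateKey);
const callerCoSignature = signReceipt(body, callerKeys.privateKey);
console.log(verifyReceipt(body, agentSignature, agentKeys.publicKey));
console.log(verifyReceipt(body, callerCoSignature, callerKeys.publicKey));
```

Because verification needs only the receipt and the signer's public key, a holder can check it without asking a registry, which is the property the draft is after.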
2. xaip-sdk@0.5.0 on npm, with a precheck() helper.
precheck() is a thin SDK wrapper over a public Trust API endpoint that consumes the receipt graph. Given a task description and a list of candidate slugs, it returns ranked execution evidence — receipt counts, observed success rates, risk flags, and an eligibility flag the SDK computes from the caller's policy. The SDK does not invoke the candidate. The SDK does not pay for anything. It returns evidence; the caller's own logic decides what to do with that evidence.
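One way to picture "an eligibility flag the SDK computes from the caller's policy" is the sketch below. It is not the SDK's implementation: `isEligible`, `select`, and the rule of picking the highest-scoring eligible candidate are assumptions; only the field and policy names (`score`, `receiptCount`, `riskFlags`, `minReceipts`, `excludeRiskFlags`) come from the article.

```typescript
// Hypothetical local types; this does not call xaip-sdk or any network endpoint.
interface CandidateEvidence {
  slug: string;
  score: number;
  receiptCount: number;
  riskFlags: string[];
}

interface Policy {
  minReceipts?: number;
  excludeRiskFlags?: string[];
}

// A candidate is eligible when it has enough receipts and none of the excluded flags.
function isEligible(candidate: CandidateEvidence, policy: Policy): boolean {
  const minReceipts = policy.minReceipts ?? 0;
  const excluded = policy.excludeRiskFlags ?? [];
  if (candidate.receiptCount < minReceipts) return false;
  return !candidate.riskFlags.some((flag) => excluded.indexOf(flag) !== -1);
}

// Assumption: "selected" is the highest-scoring eligible candidate, or null if none qualify.
function select(ranked: CandidateEvidence[], policy: Policy): string | null {
  let best: CandidateEvidence | null = null;
  for (const candidate of ranked) {
    if (!isEligible(candidate, policy)) continue;
    if (best === null || candidate.score > best.score) best = candidate;
  }
  return best === null ? null : best.slug;
}

// Fixture data, invented for the example.
const evidence: CandidateEvidence[] = [
  { slug: "skill:translator-alpha", score: 0.9, receiptCount: 120, riskFlags: [] },
  { slug: "skill:translator-beta", score: 0.95, receiptCount: 4, riskFlags: [] },
  { slug: "skill:translator-gamma", score: 0.8, receiptCount: 300, riskFlags: ["repeated_timeout"] },
];
console.log(select(evidence, { minReceipts: 10, excludeRiskFlags: ["repeated_timeout"] }));
```

In this fixture the highest raw score belongs to a candidate with only four receipts, so the policy, not the score alone, determines which candidate is selected.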
3. Two browser demos.
The first, "Trust Evidence Before Delegation," covers the free case: three contrasting MCP candidates against the live trust scores. The second, "Before Payment Evidence Demo," covers the paid closed-source case: three fictional translation skills with deliberately indistinguishable marketplace listings but different execution-evidence profiles. Both are static, buildless, and read-only. The paid demo is seeded with synthetic fixture data because there are no real paid-skill receipts in the public graph yet — that is itself one of the open questions.
These three pieces are intended to be useful independently. The Internet-Draft is useful even if you never install the SDK. The SDK is useful even if you never read the draft. The demos are useful even if you never write any code yourself.
Using precheck() in a few lines
The smallest useful call:
    import { precheck } from "xaip-sdk";

    const result = await precheck({
      task: "Translate a document into Japanese",
      candidates: ["skill:translator-alpha", "skill:translator-beta", "skill:translator-gamma"],
      policy: { minReceipts: 10, excludeRiskFlags: ["repeated_timeout"] },
      includeDecision: true,
    });
The return value is structured, not narrative. The interesting fields:
- selected: the candidate the SDK picked by applying the supplied policy to the ranked evidence, or null if no candidate was eligible. The SDK recomputes this from the policy; it does not blindly forward the server's selected.
- ranked: every input candidate, each with score, receiptCount, confidence, riskFlags, verdict, and an eligible boolean.
- unscored: a convenience list of candidates with no execution evidence at all.
- reason: a controlled string. It is one of exactly two values: "Selected using available execution evidence." or "No eligible candidates based on available execution evidence." It does not vary by case. Consumer code is supposed to read the structured fields, not parse the string.
- decision: optional, present only when the caller opts in with includeDecision: true. Values are "allow", "warn", or "unknown". There is no "block"; blocking is not the SDK's job.
The point of the controlled reason and the missing "block" is that the SDK never positions itself as the one making the call. The SDK surfaces structured evidence. The caller decides whether to pay, to invoke, to ask the user, to fall back, or to escalate.
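What "the caller decides" might look like in code is sketched below. The `PrecheckResult` type is a hypothetical subset of the result fields listed above, and the mapping from evidence to action is one possible caller policy, not something the SDK prescribes.

```typescript
// Hypothetical subset of the precheck() result, using the field names described above.
type Decision = "allow" | "warn" | "unknown";

interface PrecheckResult {
  selected: string | null;
  unscored: string[];
  decision?: Decision;
}

type NextStep =
  | { action: "invoke"; candidate: string }
  | { action: "ask_user"; candidate: string }
  | { action: "fallback" };

// One possible caller policy: invoke only on "allow", otherwise involve the user.
function nextStep(result: PrecheckResult): NextStep {
  if (result.selected === null) {
    return { action: "fallback" }; // nothing eligible: use a default path or escalate
  }
  if (result.decision === "allow") {
    return { action: "invoke", candidate: result.selected };
  }
  // "warn", "unknown", or no decision requested: do not spend money silently.
  return { action: "ask_user", candidate: result.selected };
}
```

The branching reads only structured fields and never inspects the reason string, which matches how the result is meant to be consumed.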
What XAIP is not
This is a short list. It is short on purpose.
- XAIP is not a sandbox. It does not isolate the execution environment of any tool, skill, or agent.
- XAIP is not an approval engine. It does not gate calls. It does not have a "block" decision value.
- XAIP is not a payment rail. It does not move money. It does not hold balances.
- XAIP does not make tools safe. Safety is a property of the tool and how the caller uses it, not of the record format.
- XAIP does not guarantee trust. It surfaces evidence, which the caller may use as one of several inputs into their own trust decision.
And one more, which is often what people actually want to ask about:
- Receipts are the primary artifact. Scores and eligibility are derived views. The verdict, the confidence, the eligible boolean: all are derived from the underlying signed receipts using a stated method. The receipts carry the signature chain and the long-term portability. The scores are a convenience.
If a future caller wants to derive a different score, with a different aggregation method, or weighted differently, they can do that over the same receipt graph without re-emitting anything. That is the whole reason the wire format is the artifact and the score is the view.
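As a small illustration of "the score is the view", the sketch below computes two different scores from the same receipts: a plain success rate and a recency-weighted one. The `TimedReceipt` shape, the `ageDays` field, and the half-life weighting are assumptions chosen for the example; neither function is the aggregator's actual method.

```typescript
// Hypothetical minimal view of a receipt: outcome plus how old the observation is.
interface TimedReceipt {
  success: boolean;
  ageDays: number;
}

// View 1: every receipt counts equally.
function plainSuccessRate(receipts: TimedReceipt[]): number | null {
  if (receipts.length === 0) return null;
  return receipts.filter((r) => r.success).length / receipts.length;
}

// View 2: a receipt's weight halves every `halfLifeDays`, so recent behavior dominates.
function recencyWeightedSuccessRate(
  receipts: TimedReceipt[],
  halfLifeDays: number
): number | null {
  let weightSum = 0;
  let weightedSuccess = 0;
  for (const r of receipts) {
    const weight = Math.pow(0.5, r.ageDays / halfLifeDays);
    weightSum += weight;
    if (r.success) weightedSuccess += weight;
  }
  return weightSum === 0 ? null : weightedSuccess / weightSum;
}

// Same receipts, two views: a fresh success and a 30-day-old failure.
const history: TimedReceipt[] = [
  { success: true, ageDays: 0 },
  { success: false, ageDays: 30 },
];
console.log(plainSuccessRate(history));
console.log(recencyWeightedSuccessRate(history, 30));
```

Both functions read the same input and nothing is re-emitted, so a caller who disagrees with one weighting can swap in another without touching the receipts.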
Open questions
This is early. Several pieces are explicitly open:
- Caller diversity. The current public dataset is produced mostly by a small number of callers. Signals derived from the receipt graph become more interesting as more independent observers contribute. There is no theoretical fix for this; it is a question of who actually runs precheck() and emits receipts.
- Candidate categories. The current SDK treats candidates as opaque string slugs. The convention is to use a prefix when useful (tool:, skill:, agent:), but a structured shape ({ id, type }) is deliberately deferred until a second caller asks for it.
- Receipt provenance. As external probe networks, synthetic monitors, and integration tests start emitting receipts, the question of whether a source field belongs on the receipt (real_agent_call vs synthetic_probe vs scheduled_health_check) becomes more relevant. Mixing these in equal weight would distort the signal. This is logged as an open question against the format.
- Co-signature ratio enforcement. The SDK accepts a requireCoSignatureRatio policy field for future use but currently throws if it is greater than zero, because the aggregator does not yet expose per-candidate co-signature ratios. Silently accepting a policy that the SDK cannot enforce would be worse than refusing it.
- Settlement-class tools. Tools whose outputs are externally anchored (for example, on-chain settlement) have very different evidence semantics than retrieval tools. The receipt format permits a toolMetadata field for category hints, but standardizing those hints is deferred.
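Two of the items above are easy to sketch in code: the slug prefix convention and the refuse-rather-than-ignore behavior for requireCoSignatureRatio. Both functions below are hypothetical helpers written for this post, not SDK exports, and the error message is invented.

```typescript
type CandidateType = "tool" | "skill" | "agent" | "unknown";

// Split a slug like "skill:translator-alpha" into the deferred { id, type } shape.
// Slugs without a recognized prefix are kept whole and typed "unknown".
function parseSlug(slug: string): { id: string; type: CandidateType } {
  const separator = slug.indexOf(":");
  if (separator > 0) {
    const prefix = slug.slice(0, separator);
    if (prefix === "tool" || prefix === "skill" || prefix === "agent") {
      return { id: slug.slice(separator + 1), type: prefix };
    }
  }
  return { id: slug, type: "unknown" };
}

interface Policy {
  minReceipts?: number;
  excludeRiskFlags?: string[];
  requireCoSignatureRatio?: number;
}

// Refuse a policy that cannot be enforced instead of silently ignoring part of it.
function assertEnforceable(policy: Policy): void {
  if ((policy.requireCoSignatureRatio ?? 0) > 0) {
    throw new Error("requireCoSignatureRatio > 0 is not enforceable yet");
  }
}
```

Failing loudly keeps a caller from believing a co-signature threshold was applied when no such data was available to check it.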
None of these need to be resolved before the receipt format is useful. They are listed so the format is not mistaken for a finished product.
Try it / read it
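If you want to read the spec:
- draft-xkumakichi-xaip-receipts-00 (individual Internet-Draft, IETF Datatracker)
If you want to use the SDK:
- xaip-sdk@0.5.0 (npm)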
If you want to see what the contrast looks like in a browser:
- Trust Evidence Before Delegation (free tools)
- Before Payment Evidence Demo (paid closed-source skills, seeded)
If you want to look at the live aggregator output or the repository:
If you want this to work better, run the caller once so the receipt graph gains one more independent observer:
    npx --yes xaip-caller
Windows PowerShell:
    npx.cmd --yes xaip-caller
What it does, before you run it:
- Requires Node.js 20 or newer.
- Requires network access (it talks to the public XAIP aggregator).
- Creates or reuses a local caller key under your home directory.
- Makes a few real HTTP checks against public read-only endpoints.
- Signs receipts for those calls with the local key.
- Posts the receipts to the live XAIP aggregator.
- No signup or API key required.
It takes about thirty seconds. The receipt format and precheck() help most when many independent observers are watching the same candidates — and the only way that happens is one caller at a time.