DEV Community: xaip-agent

Evidence Before Delegation — Especially Before Payment

xaip-agent — Thu, 28 May 2026 02:58:47 +0000

Before an agent delegates work — to a tool, a skill, or another agent — it usually sees a name, a description, sometimes a rating. What it does not usually see is what happened the last few hundred times someone called the same candidate. That gap matters more when the call costs money and the skill is closed-source. Three pieces of public work landed recently toward closing it: an individual Internet-Draft for a signed-receipt wire format, xaip-sdk@0.5.0 with a precheck() helper, and two browser demos that make the contrast visible.

A small scene

An AI agent is asked to translate a document into Japanese. The agent looks at the closest paid skill marketplace it has access to. Three candidates appear. Each has a polished listing. Each has a five-star rating. The prices differ by a few cents per call.

From the listing alone, all three are interchangeable. The agent could pick any of them. It could ask the user for help. It could just take the cheapest. Whatever it does, it is making a choice with no basis other than someone else's published metadata.

This is not specific to translation skills, and not specific to any one marketplace. It is the same shape every time an agent has to delegate work to an external tool, skill, or service it cannot inspect.

What the agent currently sees

Across runtimes — MCP servers, LangChain tools, OpenAI tool-calling loops, HTTP APIs, paid skill marketplaces — the candidates an agent picks between are typically described by a thin slice of information:

a name or slug
a one-line description
maybe a category tag or capability list
maybe a rating, a review count, or a "popular" badge
if it costs money, a price per call

All of that is publisher-supplied metadata. None of it is independent evidence of what the candidate actually does when it is called.

Receipts, not ratings. That is the gap.

The gap is annoying in the free case. If an agent picks the wrong MCP server and the call fails, the cost is a retry and some latency. The gap becomes more painful in the paid case. If an agent picks a closed-source skill, pays per execution, and the skill misbehaves or fails, the cost is real and it accumulates. If the same closed-source skill misbehaves in a way the agent does not even detect, the cost is worse.

This is the structural problem worth naming out loud: agents are increasingly delegating work to opaque candidates under uncertainty, and the inputs to their delegation decision are largely metadata that the candidate itself published.

What is missing: portable execution evidence

The piece that is missing is observable, signed, portable evidence of what happened the last N times this candidate was actually called. Concretely:

Signed, so a verifier can tell who made each claim and that it was not fabricated by a third party.
Observable, so anyone holding the record can verify it without consulting a registry or central intermediary.
Portable, so the evidence about a tool, skill, or agent moves with that identity across runtimes and marketplaces, rather than living inside one platform's private database.

That is not a trust system. It is not an approval system. It is not a sandbox. It is, much more modestly, a record of attempts: a wire format for "what was called, by whom, on whose behalf, with what outcome, how long it took, and how the inputs and outputs are identified by hash."

If that wire format exists, a caller who is about to delegate to a candidate can ask a simple question before committing: what evidence is available about this candidate already? The answer is not a verdict. The answer is a record they can read with their own eyes — or their own policy.
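As a rough sketch of what one such record could look like in code, the type below borrows its field names from the sample receipt published with the Claude Code hook post further down this feed. It is an assumption that the Internet-Draft uses exactly these names; `ExecutionReceipt` and `looksLikeReceipt` are hypothetical helpers, not part of any published SDK.

```typescript
// Hypothetical shape of one execution receipt. Field names follow the sample
// receipt from the hook post; the Internet-Draft may name or group them differently.
interface ExecutionReceipt {
  agentDid: string;         // who acted
  callerDid: string;        // who delegated
  toolName: string;         // what was called
  taskHash: string;         // hash identifying the input
  resultHash: string;       // hash identifying the output
  success: boolean;         // outcome of the call
  latencyMs: number;        // how long it took
  timestamp: string;        // when it happened (ISO 8601)
  signature: string;        // signature by the acting party
  callerSignature?: string; // optional co-signature by the caller
}

// Structural check only. It does NOT verify any signature.
function looksLikeReceipt(value: unknown): value is ExecutionReceipt {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  const stringFields = [
    "agentDid", "callerDid", "toolName",
    "taskHash", "resultHash", "timestamp", "signature",
  ];
  return (
    stringFields.every((key) => typeof r[key] === "string") &&
    typeof r["success"] === "boolean" &&
    typeof r["latencyMs"] === "number" &&
    (r["callerSignature"] === undefined || typeof r["callerSignature"] === "string")
  );
}
```

A shape check like this only says the record is well-formed. Whether the record is believable depends on verifying the signatures, which is a separate step.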

What landed recently

Three pieces of public work aimed at this gap landed recently. They are not a complete solution. They are a starting point that a wider set of contributors can build on.

1. An individual Internet-Draft for the receipt wire format.
The format is posted at IETF Datatracker as draft-xkumakichi-xaip-receipts-00. It defines a JSON wire format for one signed execution receipt: who acted (agentDid), who delegated (callerDid), what tool was called, whether the call succeeded, how long it took, and how the inputs and outputs are identified by hash. Signatures are Ed25519, with optional co-signature by the caller. Identities are W3C Decentralized Identifiers, with no constraint on the DID method. The draft is intentionally narrow: it covers the wire format only. Scoring models, aggregation topologies, and decision logic are deployment-policy concerns and explicitly out of scope.

It is worth being precise about what this is and is not. It is an individual Internet-Draft. It is not an IETF standard. It is not IETF-approved. It has no formal standing in the IETF standards process. The value of being on Datatracker is having a citable URL whose content can be referenced by other individual drafts, papers, or implementations.

2. xaip-sdk@0.5.0 on npm, with a precheck() helper.
precheck() is a thin SDK wrapper over a public Trust API endpoint that consumes the receipt graph. Given a task description and a list of candidate slugs, it returns ranked execution evidence — receipt counts, observed success rates, risk flags, and an eligibility flag the SDK computes from the caller's policy. The SDK does not invoke the candidate. The SDK does not pay for anything. It returns evidence; the caller's own logic decides what to do with that evidence.

3. Two browser demos.
The first, "Trust Evidence Before Delegation," covers the free case: three contrasting MCP candidates against the live trust scores. The second, "Before Payment Evidence Demo," covers the paid closed-source case: three fictional translation skills with deliberately indistinguishable marketplace listings but different execution-evidence profiles. Both are static, buildless, and read-only. The paid demo is seeded with synthetic fixture data because there are no real paid-skill receipts in the public graph yet — that is itself one of the open questions.

These three pieces are intended to be useful independently. The Internet-Draft is useful even if you never install the SDK. The SDK is useful even if you never read the draft. The demos are useful even if you never write any code yourself.

Using `precheck()` in a few lines

The smallest useful call:

import { precheck } from "xaip-sdk";

const result = await precheck({
  task: "Translate a document into Japanese",
  candidates: ["skill:translator-alpha", "skill:translator-beta", "skill:translator-gamma"],
  policy: { minReceipts: 10, excludeRiskFlags: ["repeated_timeout"] },
  includeDecision: true,
});

The return value is structured, not narrative. The interesting fields:

selected — the candidate the SDK picked by applying the supplied policy to the ranked evidence. null if no candidate was eligible. The SDK recomputes this from the policy; it does not blindly forward the server's selected.
ranked — every input candidate, each with score, receiptCount, confidence, riskFlags, verdict, and an eligible boolean.
unscored — convenience list of candidates with no execution evidence at all.
reason — a controlled string. It is one of exactly two values: "Selected using available execution evidence." or "No eligible candidates based on available execution evidence." It does not vary by case. Consumer code is supposed to read the structured fields, not parse the string.
decision — optional, present only when the caller opts in with includeDecision: true. Values are "allow", "warn", or "unknown". There is no "block"; blocking is not the SDK's job.

The point of the controlled reason and the missing "block" is that the SDK never positions itself as the one making the call. The SDK surfaces structured evidence. The caller decides whether to pay, to invoke, to ask the user, to fall back, or to escalate.
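To make the "caller decides" part concrete, here is a minimal sketch of caller-side logic that reads only the structured fields. The `PrecheckResult` type is reconstructed from the field list above and may not match the SDK's real typings; the `slug` key, `NextStep`, and `nextStep` are hypothetical names, and the policy shown (proceed automatically only on "allow") is just one possible choice.

```typescript
// Assumed result shape, reconstructed from the prose; not the SDK's own types.
type Decision = "allow" | "warn" | "unknown";

interface RankedCandidate {
  slug: string;          // assumption: key holding the candidate identifier
  receiptCount: number;
  riskFlags: string[];
  eligible: boolean;
}

interface PrecheckResult {
  selected: string | null;
  ranked: RankedCandidate[];
  unscored: string[];
  reason: string;        // controlled string; never parsed here
  decision?: Decision;   // present only when includeDecision was requested
}

type NextStep =
  | { kind: "invoke"; candidate: string }
  | { kind: "ask_user"; candidate: string | null };

// One possible caller policy: proceed automatically only on "allow".
// "warn", "unknown", a missing decision, or no selection all go to the user.
function nextStep(result: PrecheckResult): NextStep {
  if (result.selected === null) {
    return { kind: "ask_user", candidate: null };
  }
  if (result.decision === "allow") {
    return { kind: "invoke", candidate: result.selected };
  }
  return { kind: "ask_user", candidate: result.selected };
}
```

A stricter caller could refuse to pay on "warn", and a looser one could invoke on "warn" after logging it. Either way the branching lives in the caller's code, not in the SDK.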

What XAIP is not

This is a short list. It is short on purpose.

XAIP is not a sandbox. It does not isolate the execution environment of any tool, skill, or agent.
XAIP is not an approval engine. It does not gate calls. It does not have a "block" decision value.
XAIP is not a payment rail. It does not move money. It does not hold balances.
XAIP does not make tools safe. Safety is a property of the tool and how the caller uses it, not of the record format.
XAIP does not guarantee trust. It surfaces evidence, which the caller may use as one of several inputs into their own trust decision.

And one more, which is often what people actually want to ask about:

Receipts are the primary artifact. Scores and eligibility are derived views. The verdict, the confidence, the eligible boolean — they are derived from the underlying signed receipts using a stated method. The receipts carry the signature chain and the long-term portability. The scores are a convenience.

If a future caller wants to derive a different score, with a different aggregation method, or weighted differently, they can do that over the same receipt graph without re-emitting anything. That is the whole reason the wire format is the artifact and the score is the view.
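As an illustration of "same receipts, different view", the sketch below computes two scores over one list of outcomes: a plain success rate and a recency-weighted one. Neither is claimed to be the method the aggregator uses. `ReceiptOutcome`, both function names, and the 14-day half-life are assumptions chosen for the example.

```typescript
// Minimal projection of a receipt: only the fields these two views need.
interface ReceiptOutcome {
  success: boolean;
  timestamp: string; // ISO 8601
}

// View 1: plain observed success rate. Returns null when there is no evidence.
function successRate(receipts: ReceiptOutcome[]): number | null {
  if (receipts.length === 0) return null;
  const ok = receipts.filter((r) => r.success).length;
  return ok / receipts.length;
}

// View 2: success rate where each receipt's weight halves every halfLifeDays.
// The half-life is a hypothetical tuning knob, not something the format defines.
function recencyWeightedRate(
  receipts: ReceiptOutcome[],
  now: Date,
  halfLifeDays: number = 14,
): number | null {
  const msPerDay = 86400000;
  let weightedOk = 0;
  let totalWeight = 0;
  for (const r of receipts) {
    const ageDays = Math.max(0, (now.getTime() - Date.parse(r.timestamp)) / msPerDay);
    const weight = Math.pow(0.5, ageDays / halfLifeDays);
    totalWeight += weight;
    if (r.success) weightedOk += weight;
  }
  return totalWeight === 0 ? null : weightedOk / totalWeight;
}
```

Both functions read the same inputs and neither modifies or re-signs a receipt, which is the property the paragraph above describes.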

Open questions

This is early. Several pieces are explicitly open:

Caller diversity. The current public dataset is produced mostly by a small number of callers. Signals derived from the receipt graph become more interesting as more independent observers contribute. There is no theoretical fix for this; it is a question of who actually runs precheck() and emits receipts.
Candidate categories. The current SDK treats candidates as opaque string slugs. The convention is to use a prefix when useful — tool:, skill:, agent: — but a structured shape ({ id, type }) is deliberately deferred until a second caller asks for it.
Receipt provenance. As external probe networks, synthetic monitors, and integration tests start emitting receipts, the question of whether a source field belongs on the receipt — real_agent_call vs synthetic_probe vs scheduled_health_check — becomes more relevant. Mixing these in equal weight would distort the signal. This is logged as an open question against the format.
Co-signature ratio enforcement. The SDK accepts a requireCoSignatureRatio policy field for future use but currently throws if it is greater than zero, because the aggregator does not yet expose per-candidate co-signature ratios. Silently accepting a policy that the SDK cannot enforce would be worse than refusing it.
Settlement-class tools. Tools whose outputs are externally anchored (for example, on-chain settlement) have very different evidence semantics than retrieval tools. The receipt format permits a toolMetadata field for category hints, but standardizing those hints is deferred.

None of these need to be resolved before the receipt format is useful. They are listed so the format is not mistaken for a finished product.
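The co-signature item above describes a "refuse rather than silently accept" rule. A minimal sketch of that idea follows. The policy field names come from this post, but `Policy` and `assertEnforceable` are hypothetical and this is not the SDK's actual validation code.

```typescript
// Policy fields mentioned in this post. The type itself is an assumption.
interface Policy {
  minReceipts?: number;
  excludeRiskFlags?: string[];
  requireCoSignatureRatio?: number;
}

// Fail loudly on any policy field that cannot be enforced yet, instead of
// accepting it and quietly ignoring it.
function assertEnforceable(policy: Policy): void {
  const ratio = policy.requireCoSignatureRatio ?? 0;
  if (ratio > 0) {
    throw new Error(
      "requireCoSignatureRatio > 0 is not enforceable: " +
        "per-candidate co-signature ratios are not available",
    );
  }
}
```

Throwing here means a caller who believes co-signatures are being required finds out immediately that they are not, rather than after paying for a call.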

Try it / read it

If you want to read the spec:

Internet-Draft: draft-xkumakichi-xaip-receipts-00

If you want to use the SDK:

If you want to see what the contrast looks like in a browser:

If you want to look at the live aggregator output or the repository:

If you want this to work better:

npx --yes xaip-caller

Windows PowerShell:

npx.cmd --yes xaip-caller

What it does, before you run it:

Requires Node.js 20 or newer.
Requires network access (it talks to the public XAIP aggregator).
Creates or reuses a local caller key under your home directory.
Makes a few real HTTP checks against public read-only endpoints.
Signs receipts for those calls with the local key.
Posts the receipts to the live XAIP aggregator.
No signup or API key required.

It takes about thirty seconds. The receipt format and precheck() help most when many independent observers are watching the same candidates — and the only way that happens is one caller at a time.

Receipts before AI tool calls

xaip-agent — Mon, 11 May 2026 04:08:55 +0000

This is a short update on XAIP since my earlier write-up on portable trust.

The main changes are: a new public demo, refreshed live numbers, and
receipts from MCP, LangChain.js callbacks, and OpenAI-compatible
tool-call loops in the same public trust graph.

I've been building XAIP, a provider-neutral signed execution evidence
layer for AI agent tool calls.

The basic idea is simple: before an agent delegates work to an external
tool, it should be able to inspect historical execution evidence from
previous signed receipts.

XAIP is not another agent framework. It sits underneath agent runtimes
as a portable receipt layer. The receipt format is the same regardless
of which runtime emitted it.

Where receipts come from today

Current live integrations:

MCP servers
LangChain.js callback handlers
OpenAI-compatible tool-call loops

Current snapshot (2026-05-11)

10 servers in the public trust graph
3,239 signed execution receipts
Receipts from MCP, LangChain.js callbacks, and OpenAI-compatible tool-call loops

What the demo shows

The public demo shows a simple contrast:

without XAIP: candidate tools look interchangeable
with XAIP: signed receipt history, observed failures, and unscored candidates are visible before delegation

XAIP does not make tools safe, and it does not guarantee trust. It
makes execution evidence visible before delegation. Trust scores are
one derived view over receipts — receipts themselves are the primary
artifact.

Links

Feedback on the receipt model and the pre-delegation evidence framing
is very welcome.

Previously

信頼は持ち運べる ("Trust Is Portable", 2026-04-22): an earlier Japanese-language intro to XAIP, focused on the portability of trust signals.

This post updates the framing toward receipts as the primary artifact,
with a new public demo and refreshed live numbers.

Portable Trust

xaip-agent — Tue, 21 Apr 2026 09:54:57 +0000

TL;DR — When an AI agent picks a tool, it makes a trust decision. The quality of that decision depends entirely on where the trust data comes from. If trust flows through a single gatekeeper — a registry, a platform's curation, a community's moderation — the agent inherits that gatekeeper's failure modes. This post argues that trust infrastructure for AI agents must be provider-neutral and behavior-derived, and walks through what a concrete implementation of that principle looks like, with live data.

The tool-choice problem

An AI agent receives a task: "fetch the React hooks docs."

Its planner produces a candidate list: three documentation tools, two search tools, one fallback web scraper. Which one does it pick?

Today, the honest answer is: it picks based on name recognition in the model's training data plus whatever the platform decided to show it. There is no runtime trust signal. The agent does not know which tool succeeded yesterday, which one is quietly returning stale data, which one has been silently deprecated.

This is the tool-choice problem, and it is a trust-data problem.

Three places trust data can live

Trust data for tools can come from three very different places:

1. Self-declared: the tool's README says it's good.
2. Platform-curated: the platform it's published on has a list of "recommended" tools.
3. Behavior-derived: past executions are logged, signed, and aggregated; trust is computed from outcomes, not claims.

Only (3) is robust against gaming, drift, and upstream policy changes. But (3) is also the hardest to deliver, because it requires infrastructure: signed receipts, a canonical aggregation model, and an identity system that doesn't depend on any single platform.

Why provider-neutrality matters, structurally

Suppose you build trust scores on top of a single community's registry.

The registry is itself a trust layer — it decides what's visible, what's highlighted, what's removed. When visibility rules change — whether to promote some tools, demote others, or restrict participation — the scoring space implicitly changes with them. Tools that were previously indexed can disappear from consideration. Projects whose contributors cannot register never accumulate receipts in the first place. None of this reflects anything about the tools' behavior; it reflects the registry's state at a point in time.

This is not a critique of any particular community. It's a structural property of any layered system where upstream visibility decisions feed downstream trust signals. Those decisions become an implicit input to the trust model, whether or not you want them to.

Without a portable trust layer, agents are not choosing tools — they are inheriting decisions.

The implication for trust infrastructure: the receipts, identity, and scoring must all be portable. If a community exits, the data must remain queryable. If a platform changes policy, the scoring must still compute. If an identity provider goes away, the agent must still be verifiable. Trust infrastructure that depends on a single upstream is not trust infrastructure — it is a brittle proxy for that upstream's preferences.

What portable trust looks like

XAIP is one implementation of this principle. Its design follows from the structural requirement:

Signed receipts, not self-reports. Every tool execution produces an Ed25519-signed receipt: { agentDid, callerDid, taskHash, resultHash, success, latencyMs, timestamp }. The caller co-signs so the tool cannot unilaterally inflate its own reputation.
Standards-based identity. Agents and callers use W3C DIDs (did:key, did:web, did:xrpl). No platform account required. An agent expelled from one community retains its identity in every other.
Bayesian trust, not thresholds. Scores are computed as bayesianScore × callerDiversity × coSignFactor, with DID-method-dependent priors. Cheap identities don't get free trust; expensive identities converge to the same score given enough evidence.
Provider-neutral receipt producers. The same receipt format is emitted by integrations for MCP, LangChain.js, and OpenAI tool calling. A receipt produced by a LangChain agent is byte-compatible with one from an OpenAI chat completion. The trust graph is one graph, regardless of how the agent was built.
Aggregation you can run yourself. The reference aggregator is a Cloudflare Worker (open source, small). If you don't trust the public instance, you run your own. Multi-aggregator quorum is part of the spec.
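The scoring bullet above gives the product bayesianScore × callerDiversity × coSignFactor but not how each factor is computed. The sketch below assumes a Beta-Bernoulli posterior mean for the Bayesian part and treats the other two factors as precomputed inputs in [0, 1]. The prior values, `ScoreInput`, and `trustScore` are illustrative assumptions, not the reference aggregator's code.

```typescript
// Posterior mean of a Beta(priorAlpha, priorBeta) prior after observing
// successes and failures. A different prior per DID method could be passed in.
function bayesianScore(
  successes: number,
  failures: number,
  priorAlpha: number = 1,
  priorBeta: number = 1,
): number {
  return (priorAlpha + successes) / (priorAlpha + priorBeta + successes + failures);
}

interface ScoreInput {
  successes: number;
  failures: number;
  callerDiversity: number; // assumed to be precomputed, in [0, 1]
  coSignFactor: number;    // assumed to be precomputed, in [0, 1]
}

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Product form from the post: bayesianScore × callerDiversity × coSignFactor.
function trustScore(input: ScoreInput): number {
  return (
    bayesianScore(input.successes, input.failures) *
    clamp01(input.callerDiversity) *
    clamp01(input.coSignFactor)
  );
}
```

With a uniform prior, a server with no receipts starts at 0.5 and moves toward its observed success rate as evidence accumulates. A lower diversity or co-signature factor then scales the result down.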

Live data

The reference deployment has been running for a few weeks. As of writing:

10 tool servers scored (docs retrieval, reasoning, memory, filesystem, search, DB, VCS, and more)
2,100+ signed execution receipts
Automated daily collection via CI with fresh caller keys each run (caller diversity is a first-class signal)

Live dashboard: xkumakichi.github.io/xaip-protocol
Trust API: https://xaip-trust-api.kuma-github.workers.dev/v1/servers

You can ask it which tool to pick right now:

curl -X POST https://xaip-trust-api.kuma-github.workers.dev/v1/select \
  -H "Content-Type: application/json" \
  -d '{"task":"Fetch React docs","candidates":["context7","sequential-thinking","unknown-server"]}'

The response includes both the selection and a counterfactual: what would happen if you chose randomly with no trust data. That counterfactual is the value proposition: trust data either saves an agent from a wasted call or it doesn't.

What "provider-neutral" buys you, concretely

An agent built on LangChain and an agent built on OpenAI's SDK can share trust data about the same underlying tool. Today, they can't — each framework has its own observability silo.
A tool whose author is gated out of one community still accumulates trust from callers in every other community.
A grant reviewer evaluating agent infrastructure projects can verify receipts independently, without relying on any single platform's dashboard.
A future regulatory regime that asks "what's your trust basis for this agent's tool choices?" has a portable, auditable answer.

What's next

The spec is open, the aggregator is live, the three framework integrations are on npm. The next frontier is class-aware risk evaluation — a settlement tool whose outcomes are anchored to an external ledger doesn't need the same trust signals as an advisory tool whose outputs are freely consumed. The v0.5 draft tackles that.

The underlying claim is simple: trust infrastructure for AI agents is too important to depend on any one platform, community, or moderator. The sooner we build it as a portable layer, the sooner the ecosystem can reason about tool choices the way we already reason about TLS certificates and package signatures — with math, not vibes.

XAIP is MIT-licensed and open source. Feedback on the v0.5 draft is welcome via GitHub issues.

What the agent stack is still missing

xaip-agent — Mon, 20 Apr 2026 23:23:49 +0000

This week the agent economy narrative crystallized in three posts.

Cameron Winklevoss (Gemini): "Humans may have built crypto, but crypto is not so much money for humans as it is money for machines."

Brian Armstrong (Coinbase): launched Agentic.market, a discovery layer where AI agents find and pay for services over x402.

t54.ai: "Every check in today's financial stack was designed around a human. Signatures, IDs, clicks, chargebacks. When an AI agent is the one transacting, each of those checks has a gap."

Three different angles, one convergent thesis: agents are becoming first-class economic actors, and the existing stack doesn't fit them.

Payments have a shipped answer (x402). Discovery now has a shipped answer (Agentic.market). The question I've been sitting with is what sits underneath both of those:

When an agent calls a service, how does it know the service is trustworthy in practice, not just in documentation?

That's the trust layer. It's the one that's still missing — and it's the one I've been building.

The gap

A signed transaction proves an agent authorized a call. It doesn't prove the call was safe to make.

The repo can look well-maintained and still ship a buggy release.
The marketplace listing can be legitimate and still be an attack (see the Ox Security research on MCP marketplace poisoning published April 16).
The provider can be fine at T=0 and compromised at T=30 days.

These are problems payments don't solve. Discovery doesn't solve them either — an agent finding a service via Agentic.market still needs to know if that service has been acting suspiciously over the last 1,000 calls.

t54.ai's framing — "each of those checks has a gap" — applies one layer lower than they were writing about. The same gap exists for which services an agent should call at all.

What a trust layer actually is

Three things, in order of difficulty:

Signed receipts — an attestation that agent A called server B, dual-signed, hashes only (no raw content).
Aggregation with defense — receipts feed a score. The scoring must be Byzantine-robust or the whole thing is theater.
Live scores agents can query before calling — one HTTP GET, no auth, no SDK.

Code is the easy part. The hard parts are:

Cold start. A trust layer with no receipts is useless. A trust layer with 10 receipts is misleading.
Caller diversity. If one participant dominates the dataset, you're scoring their experience, not the server's.
Adversarial robustness. Someone will try to tank a competitor's score. The math has to make that expensive.

The XAIP receipt layer

I shipped one implementation of this. If you want the hook-level walkthrough, the first article covers installation and the developer-facing side.

Briefly:

Ed25519-signed receipts per MCP tool call (hashed I/O only)
Public Cloudflare Worker aggregator, Bayesian scoring, per-server flags (high_error_rate, low_caller_diversity, etc.)
One-command Claude Code hook that consumes the scores and contributes receipts

Live scores right now (8 servers, ~1,500 receipts, small but real):

memory      0.800  trusted
git         0.775  trusted
sqlite      0.753  trusted
puppeteer   0.671  caution  (high_error_rate)
context7    0.618  caution  (low_caller_diversity)
filesystem  0.579  caution  (low_caller_diversity)
playwright  0.394  low_trust (high_error_rate)
fetch       0.365  low_trust (high_error_rate)

curl https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7

Why this is an ecosystem problem, not a product

A trust layer only works if many independent participants contribute receipts. One person running it alone — which is the current state of XAIP — triggers low_caller_diversity on every high-volume server. That's not a bug; that's the flag working correctly. It's literally telling you not to trust the scores until more callers are in the dataset.
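One way a flag like low_caller_diversity could be derived is from the share of receipts contributed by the single largest caller. This is a guess at the idea, not the aggregator's actual rule: `topCallerShare`, `hasLowCallerDiversity`, and the 0.8 threshold are all hypothetical.

```typescript
// Fraction of receipts that come from the most active caller.
function topCallerShare(callerDids: string[]): number {
  if (callerDids.length === 0) return 0;
  const counts: Record<string, number> = {};
  let max = 0;
  for (const did of callerDids) {
    const n = (counts[did] ?? 0) + 1;
    counts[did] = n;
    if (n > max) max = n;
  }
  return max / callerDids.length;
}

// Flag a server when one caller dominates its receipts.
// maxShare is an illustrative threshold, not a documented value.
function hasLowCallerDiversity(callerDids: string[], maxShare: number = 0.8): boolean {
  return callerDids.length > 0 && topCallerShare(callerDids) > maxShare;
}
```

Under a rule of this kind, the flag clears only when other callers contribute enough receipts to pull the top share down, which matches the ask in this section.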

So I'm not pitching a product. I'm asking: if you're building in the agent space and you think trust scoring is a layer that should exist, contribute receipts. Or run an aggregator node (the spec is in the repo, BFT quorum is the next milestone). Or tell me why the design is wrong.

Stack picture

Agent economy layers (rough)
───────────────────────────────
Payments       → x402 (shipped)
Discovery      → Agentic.market (shipped)
Trust scoring  → XAIP + ?          (small, needs company)
Identity       → DID / passkeys    (fragmented)

XAIP is one attempt at the trust row. Almost certainly not the final one — but the row has to get filled, and waiting for Anthropic or a well-funded startup to do it means the first large-scale MCP compromise happens before the layer exists.

Links

Live dashboard: https://xkumakichi.github.io/xaip-protocol/ (scores auto-refresh, no auth)
Previous article: https://dev.to/xkumakichi/a-claude-code-hook-that-warns-you-before-calling-a-low-trust-mcp-server-ckk
Repo: https://github.com/xkumakichi/xaip-protocol (MIT, zero deps)
npm: https://www.npmjs.com/package/xaip-claude-hook
Trust API: https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7

If you're working on adjacent layers — payment, discovery, identity for agents — I'd be glad to compare notes. The interesting question isn't whose trust layer wins; it's whether any trust layer exists by the time the stack starts mattering.

— xkumakichi

A Claude Code hook that warns you before calling a low-trust MCP server

xaip-agent — Mon, 20 Apr 2026 14:15:40 +0000

Last week researchers at Ox Security published findings showing that the MCP STDIO transport lets arbitrary command execution slip through unchecked, and that 9 of 11 MCP marketplaces they tested were poisonable. Anthropic's response: STDIO is out of scope for protocol-level fixes, and the ecosystem is responsible for operational trust.

Fair — Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation in December 2025 specifically so independent infrastructure could grow around it. But that leaves a real gap for anyone running Claude Code today: how do you know whether an MCP server you're about to invoke is trustworthy?

The Anthropic official registry is pure metadata (license, commit count, popularity). mcp-scorecard.ai scores repos, not behavior. BlueRock runs OWASP-style static scans. None of these ask the one question that actually matters:

Does this MCP server, in real call-time use, work?

So I built a small thing to answer it.

The hook

A zero-config Claude Code hook that does two things on every MCP tool call:

Before the call — queries a public trust API for that server. If the score is low, Claude shows an inline warning:

   ⚠ XAIP: "some-server" trust=0.32 (caution, 87 receipts) Risk: high_error_rate

After the call — emits an Ed25519-signed receipt (success, latency, hashed input/output) to a public aggregator that updates the score.

Install:

npm install -g xaip-claude-hook
xaip-claude-hook install

The next MCP call fires the hook. That's the whole UX.

What a receipt looks like

No raw content leaves your machine — only hashes.

{
  "agentDid":      "did:web:context7",
  "callerDid":     "did:key:a1c6cd34…",
  "toolName":      "resolve-library-id",
  "taskHash":      "9f3e…",   // sha256(input).slice(0,16)
  "resultHash":    "1b78…",   // sha256(response).slice(0,16)
  "success":       true,
  "latencyMs":     668,
  "failureType":   "",
  "timestamp":     "2026-04-17T04:24:59.925Z",
  "signature":     "...",     // Ed25519 over canonical JSON (agent key)
  "callerSignature": "..."    // Ed25519 over canonical JSON (caller key)
}

The aggregator rejects anything that fails signature verification. The trust API computes a Bayesian score across all verified receipts per server, weighted by caller diversity — so one enthusiastic installer can't fake a reputation.
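For readers who want to see the sign-and-verify step in code, here is a minimal sketch using Node's built-in crypto module. It assumes "canonical JSON" means keys sorted alphabetically, which may not match the spec's canonicalization. The DIDs and payload strings are made up, and in practice the agent key and caller key would belong to different parties rather than live in one script.

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

type ReceiptBody = Record<string, string | number | boolean>;

// Deterministic serialization: keys sorted alphabetically (assumed stand-in
// for whatever canonical form the spec defines).
function canonicalJson(fields: ReceiptBody): string {
  const sorted: ReceiptBody = {};
  for (const key of Object.keys(fields).sort()) {
    sorted[key] = fields[key];
  }
  return JSON.stringify(sorted);
}

// Truncated SHA-256 hex digest, mirroring sha256(x).slice(0,16) in the sample.
function shortHash(content: string): string {
  return createHash("sha256").update(content).digest("hex").slice(0, 16);
}

// Throwaway keys for the demo only.
const agentKeys = generateKeyPairSync("ed25519");
const callerKeys = generateKeyPairSync("ed25519");

const body: ReceiptBody = {
  agentDid: "did:web:example-tool",        // made-up identifier
  callerDid: "did:key:example-caller",     // made-up identifier
  toolName: "resolve-library-id",
  taskHash: shortHash("input payload"),    // only the hash leaves the machine
  resultHash: shortHash("response payload"),
  success: true,
  latencyMs: 668,
  timestamp: "2026-04-17T04:24:59.925Z",
};

// Ed25519 takes no separate digest algorithm, hence the null first argument.
const payload = Buffer.from(canonicalJson(body));
const signature = sign(null, payload, agentKeys.privateKey);
const callerSignature = sign(null, payload, callerKeys.privateKey);

const agentOk = verify(null, payload, agentKeys.publicKey, signature);
const callerOk = verify(null, payload, callerKeys.publicKey, callerSignature);

// Flipping one field invalidates the signature.
const tampered = Buffer.from(canonicalJson({ ...body, success: false }));
const tamperedOk = verify(null, tampered, agentKeys.publicKey, signature);

console.log({ agentOk, callerOk, tamperedOk });
```

The tampered check is the part that matters for an aggregator: a receipt whose fields were edited after signing no longer verifies, so it can be rejected without trusting whoever submitted it.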

What the scores actually look like right now

Being transparent: the dataset is small. A curl against the live trust API today:

Server      Trust  Verdict    Receipts  Flag
memory      0.800  trusted    112       -
git         0.775  trusted    35        -
sqlite      0.753  trusted    42        -
puppeteer   0.671  caution    32        high_error_rate
context7    0.618  caution    560       low_caller_diversity
filesystem  0.579  caution    610       low_caller_diversity
playwright  0.394  low_trust  37        high_error_rate
fetch       0.365  low_trust  36        high_error_rate

Verify any of these yourself:

curl https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7

The low_caller_diversity flag on high-volume servers is the single most honest number in that table. It means: I'm the biggest caller right now, and that's exactly the problem this tool is supposed to solve. The flag only clears when independent installers start generating receipts — which is what the npm package is for.

Why this is architecturally different from existing approaches

Every other "MCP trust" project I've seen scores the repository:

Commit frequency, license, stars, contributor count (mcp-scorecard.ai)
Static source-code vulnerability scans (BlueRock)
Registry inclusion as implicit trust (official MCP registry)

These are useful proxies, but none of them tell you whether a server works in practice. A well-maintained repo can have a buggy release; a single-author repo can be rock solid; a newly-forked malicious repo looks identical to the original under static scan.

XAIP scores observed behavior. Every call is a signed attestation. The scoring is Bayesian, so:

Servers with few receipts get insufficient_data — no verdict, no warning
High-variance patterns (mixed success/failure) get lower confidence
The high_error_rate flag is computed from real response content, classifying quota exceeded, rate limit, unauthorized, and "isError": true as failures

This is the same philosophy as OpenSSF Scorecard vs. runtime attestation in supply chain: you want both, but only one of them catches regressions in production.
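The failure classification described above could be sketched roughly as follows. The error patterns are the ones named in this post; the matching rule (case-insensitive substring) and the `ToolResponse` shape are assumptions. This is not the hook's actual inferSuccess implementation.

```typescript
// Error patterns named in the post. Substring matching is an assumption.
const ERROR_PATTERNS: string[] = ["quota exceeded", "rate limit", "unauthorized"];

// Simplified stand-in for a tool response: an optional error marker plus text.
interface ToolResponse {
  isError?: boolean;
  text: string;
}

// Heuristic: a call counts as failed if the response is marked as an error
// or its text contains one of the known error patterns.
function inferSuccess(response: ToolResponse): boolean {
  if (response.isError === true) return false;
  const text = response.text.toLowerCase();
  return !ERROR_PATTERNS.some((pattern) => text.indexOf(pattern) !== -1);
}
```

A heuristic like this will misclassify some calls. For example, a documentation page that legitimately mentions "rate limit" would be counted as a failure, so the resulting error rates are best read as estimates.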

What's missing / where this could go wrong

I want to be specific about limitations, because "AI trust protocol" posts tend to overpromise:

~10 servers, ~1,500 receipts total. Small. This post is partly an ask for installers to fix that.
One aggregator node. Byzantine fault tolerance requires quorum; right now there's one Cloudflare Worker. Quorum needs multiple operators, which is the next milestone.
Client-side inferSuccess is heuristic. We look at response text for error patterns. False positives and negatives are possible — fetch's 36% error rate might be over-counted (legit 404s shouldn't hurt the server's score) or real.
Privacy model relies on hashes, not ZK. Inputs and outputs are hashed before transmission, but statistical correlation across taskHashes is possible in principle. Migration to ZK receipt aggregation is a future idea, not a current feature.
I personally generated most of the high-volume receipts. The low_caller_diversity flag you see on context7 and filesystem is me.

Running it yourself

npm install -g xaip-claude-hook
xaip-claude-hook install
xaip-claude-hook status

Open a new Claude Code session. Call any MCP tool. Check:

cat ~/.xaip/hook.log

You'll see lines like:

2026-04-17T04:24:59Z POST context7/resolve-library-id ok=true lat=668ms → 200

And the next time you (or Claude) invoke a low-trust server, the warning shows up inline.

Uninstall is a single command. Keys under ~/.xaip/ persist — delete manually to wipe.

AI Agents Pick Tools Blind

xaip-agent — Tue, 14 Apr 2026 23:43:14 +0000

I connected my AI agent to 3 MCP servers.

It picked one at random.

The first call failed after 8 seconds. Then it tried a different one. Then it finally hit one that worked.

$ node without-xaip.js

→ Trying: unknown-server...
  ✗ error — package not found (8.2s)

→ Trying: sequential-thinking...
  ✓ connected — but wrong tool for docs task

→ Trying: context7...
  ✓ success (3.1s)

Total: 11.3 seconds, 2 wasted calls

There are over 1,000 MCP servers now. Your agent has no way to tell which ones are reliable, which ones are broken, and which ones are the right fit.

So I built a fix: one API call that picks the right server first.

$ node with-xaip.js

→ XAIP selected: context7 (trust: 1.0, 248 verified executions)
  ✓ success (3.1s)

Total: 3.1 seconds, 0 wasted calls

This is XAIP — trust scoring for AI agents, backed by real execution data. Not benchmarks. Not self-reported metrics. Actual tool-call results, cryptographically signed.

A live API you can try right now

No signup, no API key. Just curl:

# Trust score for a specific MCP server
curl https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7

{
  "slug": "context7",
  "trust": 1.0,
  "verdict": "trusted",
  "receipts": 248,
  "confidence": 1,
  "source": "xaip-aggregator (quorum:1)",
  "riskFlags": [],
  "computedFrom": "248 receipts via XAIP Aggregator BFT (1 nodes)"
}
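One way an agent might act on a response like this is sketched below. The `TrustResponse` fields are copied from the JSON above, but the `decide` function, the `Decision` labels, the 0.9 threshold, and the treatment of insufficient_data as a verdict value are all assumptions made for illustration, not part of the API.

```typescript
// Shape of the /v1/trust/<slug> response, limited to fields used here.
interface TrustResponse {
  slug: string;
  trust: number;
  verdict: string;
  receipts: number;
  confidence: number;
  riskFlags: string[];
}

// Hypothetical caller-side policy; the API does not prescribe one.
type Decision = "proceed" | "warn" | "no_data";

function decide(t: TrustResponse, minTrust: number = 0.9): Decision {
  // No evidence is different from bad evidence, so report it separately.
  if (t.verdict === "insufficient_data" || t.receipts === 0) return "no_data";
  // Anything short of a clean "trusted" verdict is surfaced as a warning.
  if (t.verdict !== "trusted" || t.trust < minTrust || t.riskFlags.length > 0) {
    return "warn";
  }
  return "proceed";
}
```

Under this policy the context7 response above yields "proceed", while a server with a caution verdict or any risk flag yields "warn".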

Or let XAIP pick the best server for your task:

curl -X POST https://xaip-trust-api.kuma-github.workers.dev/v1/select \
  -H "Content-Type: application/json" \
  -d '{
    "task": "Fetch React documentation",
    "candidates": ["context7", "sequential-thinking", "unknown-server"]
  }'

{
  "selected": "context7",
  "reason": "Highest trust (1) from 248 verified executions",
  "rejected": [
    { "slug": "unknown-server", "reason": "unscored — no execution data" }
  ],
  "withoutXAIP": "Random selection would pick an unscored server 33% of the time — no execution data, no safety guarantee"
}

The withoutXAIP field exists to make the risk visible. It's the answer to "why do I need this?"
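The same call can be made from TypeScript. The sketch below only builds the request and validates the reply, so it stays independent of any HTTP library. The URL and field names come from the curl example above; `buildSelectRequest`, `parseSelectResponse`, and the two interfaces are illustrative helpers, not part of an official SDK.

```typescript
// Endpoint copied from the curl example above.
const SELECT_URL = "https://xaip-trust-api.kuma-github.workers.dev/v1/select";

interface SelectRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

interface SelectResponse {
  selected: string;
  reason: string;
  rejected: { slug: string; reason: string }[];
}

// Builds the POST request for /v1/select without sending it.
function buildSelectRequest(task: string, candidates: string[]): SelectRequest {
  // Local guard added for this sketch; the API's own validation is not documented here.
  if (candidates.length === 0) throw new Error("at least one candidate is required");
  return {
    url: SELECT_URL,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ task, candidates }),
  };
}

// Parses the response body and checks the one field the caller cannot do without.
function parseSelectResponse(raw: string): SelectResponse {
  const data = JSON.parse(raw) as Partial<SelectResponse>;
  if (typeof data.selected !== "string") {
    throw new Error("response has no 'selected' field");
  }
  return {
    selected: data.selected,
    reason: data.reason ?? "",
    rejected: data.rejected ?? [],
  };
}
```

In a runtime that provides fetch, the request could be sent with fetch(req.url, { method: req.method, headers: req.headers, body: req.body }) and the response text passed to parseSelectResponse.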

How it works

XAIP has three moving parts:

1. Trust API — Returns trust scores for MCP servers. Scores come from real execution data, not self-reported metrics.

2. Decision Engine — POST /v1/select takes a task and a list of candidate servers, returns the best pick with reasoning. Unscored servers are automatically excluded.

3. Aggregator — Collects Ed25519-signed execution receipts. Every tool call produces a cryptographic receipt that feeds back into trust scores.

The trust model is Bayesian (Beta distribution), weighted by caller diversity to prevent single-caller gaming. If only one caller submits receipts for a server, the score reflects that limited evidence.
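To make the idea concrete, here is a minimal sketch of a Beta-posterior score with a per-caller cap. It is not the aggregator's actual formula, which this post does not spell out: the uniform Beta(1, 1) prior, the cap of 50 receipts per caller, the 10-receipt minimum, and all names in the block are assumptions chosen for illustration.

```typescript
interface ObservedCall {
  caller: string;   // identifier of the caller that signed the receipt
  success: boolean;
}

interface TrustEstimate {
  verdict: "insufficient_data" | "scored";
  trust: number | null; // posterior mean, null when there is too little data
  counted: number;      // receipts remaining after the per-caller cap
  callers: number;      // distinct callers seen
}

const MIN_RECEIPTS = 10;   // assumed cut-off for issuing any score
const PER_CALLER_CAP = 50; // assumed limit so one caller cannot dominate

function estimateTrust(calls: ObservedCall[]): TrustEstimate {
  // Null-prototype object so caller ids like "constructor" are safe keys.
  const perCaller: Record<string, number> = Object.create(null);
  let ok = 0;
  let fail = 0;
  for (const c of calls) {
    const seen = perCaller[c.caller] ?? 0;
    perCaller[c.caller] = seen + 1;
    if (seen >= PER_CALLER_CAP) continue; // ignore receipts beyond the cap
    if (c.success) ok += 1;
    else fail += 1;
  }
  const counted = ok + fail;
  const callers = Object.keys(perCaller).length;
  if (counted < MIN_RECEIPTS) {
    return { verdict: "insufficient_data", trust: null, counted, callers };
  }
  // Mean of Beta(1 + ok, 1 + fail), i.e. a uniform prior updated with the evidence.
  return { verdict: "scored", trust: (ok + 1) / (counted + 2), counted, callers };
}
```

With these assumed constants, a thousand successful receipts from a single caller count as only 50, so the estimate stays at 51/52 rather than climbing toward 1 as that one caller keeps submitting.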

Select → Execute → Report
  ↑                    │
  └────────────────────┘
     scores improve

The data is real

This isn't a mock API. Trust scores are computed from 1,127 actual MCP tool-call executions:

Server	Trust	Receipts	Verdict
context7	1.000	248	trusted
sequential-thinking	1.000	285	trusted
filesystem	0.909	594	caution

The executions are monitored via Veridict, a runtime execution monitor that tracks success rates, latency, and failure types.

filesystem scores lower because it has real failures in its history — that's the system working correctly. A trust score should reflect reality, not optimism.

Try the full demo

The dogfooding demo runs the complete loop: select a server, execute MCP tool calls, submit a signed receipt, check the updated score.

git clone https://github.com/xkumakichi/xaip-protocol.git
cd xaip-protocol/demo
npm install
npx tsx dogfood.ts

Takes about 15 seconds. You'll see XAIP select context7, execute real tool calls against it, submit a receipt to the Aggregator, and print the comparison table.

What's next

XAIP is at v0.4.0. The infrastructure is live and the data is real, but adoption is the bottleneck:

More servers — Currently scoring 3 MCP servers. The system scales to any server, but needs execution data flowing in.
More callers — Caller diversity is the main lever for score accuracy. More independent callers = higher confidence.
Platform integrations — Working toward integration with MCP registries like Smithery.

If you're building AI agents that use MCP, you can start using the API today. Scores will keep improving as more execution data flows in.

Why this matters beyond today

Right now, XAIP helps agents pick working tools.

But this becomes critical when agents start doing more than calling APIs — paying for services, delegating tasks across organizations, executing autonomous workflows.

At that point, the question changes from "does this tool work?" to "can I trust this agent with money?"

XAIP is designed for that future. But it already solves a real problem today.


Using `precheck()` in a few lines