<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: xaip-agent</title>
    <description>The latest articles on DEV Community by xaip-agent (@xkumakichi).</description>
    <link>https://dev.to/xkumakichi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879438%2F973a5c17-3aa5-4b12-9c4f-50ef1b572d8a.png</url>
      <title>DEV Community: xaip-agent</title>
      <link>https://dev.to/xkumakichi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xkumakichi"/>
    <language>en</language>
    <item>
      <title>Evidence Before Delegation — Especially Before Payment</title>
      <dc:creator>xaip-agent</dc:creator>
      <pubDate>Thu, 28 May 2026 02:58:47 +0000</pubDate>
      <link>https://dev.to/xkumakichi/evidence-before-delegation-especially-before-payment-3g50</link>
      <guid>https://dev.to/xkumakichi/evidence-before-delegation-especially-before-payment-3g50</guid>
      <description>&lt;p&gt;Before an agent delegates work — to a tool, a skill, or another agent — it usually sees a name, a description, sometimes a rating. What it does not usually see is what happened the last few hundred times someone called the same candidate. That gap matters more when the call costs money and the skill is closed-source. Three pieces of public work landed recently toward closing it: an individual Internet-Draft for a signed-receipt wire format, &lt;code&gt;xaip-sdk@0.5.0&lt;/code&gt; with a &lt;code&gt;precheck()&lt;/code&gt; helper, and two browser demos that make the contrast visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  A small scene
&lt;/h2&gt;

&lt;p&gt;An AI agent is asked to translate a document into Japanese. The agent looks at the closest paid skill marketplace it has access to. Three candidates appear. Each has a polished listing. Each has a five-star rating. The prices differ by a few cents per call.&lt;/p&gt;

&lt;p&gt;From the listing alone, all three are interchangeable. The agent could pick any of them. It could ask the user for help. It could just take the cheapest. Whatever it does, it is making a choice with no basis other than someone else's published metadata.&lt;/p&gt;

&lt;p&gt;This is not specific to translation skills, and not specific to any one marketplace. It is the same shape every time an agent has to delegate work to an external tool, skill, or service it cannot inspect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the agent currently sees
&lt;/h2&gt;

&lt;p&gt;Across runtimes — MCP servers, LangChain tools, OpenAI tool-calling loops, HTTP APIs, paid skill marketplaces — the candidates an agent picks between are typically described by a thin slice of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a name or slug&lt;/li&gt;
&lt;li&gt;a one-line description&lt;/li&gt;
&lt;li&gt;maybe a category tag or capability list&lt;/li&gt;
&lt;li&gt;maybe a rating, a review count, or a "popular" badge&lt;/li&gt;
&lt;li&gt;if it costs money, a price per call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of that is &lt;strong&gt;publisher-supplied metadata&lt;/strong&gt;. None of it is independent evidence of what the candidate actually does when it is called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Receipts, not ratings.&lt;/strong&gt; The candidate's own claims are visible; a record of what it actually did when called is not. That is the gap.&lt;/p&gt;

&lt;p&gt;The gap is annoying in the free case. If an agent picks the wrong MCP server and the call fails, the cost is a retry and some latency. The gap becomes more painful in the paid case. If an agent picks a closed-source skill, pays per execution, and the skill misbehaves or fails, the cost is real and it accumulates. If the same closed-source skill misbehaves in a way the agent does not even detect, the cost is worse.&lt;/p&gt;

&lt;p&gt;This is the structural problem worth naming out loud: agents are increasingly delegating work to opaque candidates under uncertainty, and the inputs to their delegation decision are largely metadata that the candidate itself published.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is missing: portable execution evidence
&lt;/h2&gt;

&lt;p&gt;The piece that is missing is observable, signed, portable evidence of what happened the last N times this candidate was actually called. Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signed&lt;/strong&gt;, so a verifier can tell who made each claim and that it was not fabricated by a third party.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observable&lt;/strong&gt;, so anyone holding the record can verify it without consulting a registry or central intermediary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portable&lt;/strong&gt;, so the evidence about a tool, skill, or agent moves with that identity across runtimes and marketplaces, rather than living inside one platform's private database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a trust system. It is not an approval system. It is not a sandbox. It is, much more modestly, a &lt;strong&gt;record of attempts&lt;/strong&gt;: a wire format for "what was called, by whom, on whose behalf, with what outcome, how long it took, and how the inputs and outputs are identified by hash."&lt;/p&gt;

&lt;p&gt;If that wire format exists, a caller who is about to delegate to a candidate can ask a simple question before committing: &lt;em&gt;what evidence is available about this candidate already?&lt;/em&gt; The answer is not a verdict. The answer is a record they can read with their own eyes — or their own policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What landed recently
&lt;/h2&gt;

&lt;p&gt;Three pieces of public work toward closing this gap landed recently. They are not a complete solution. They are a starting point that a wider set of contributors can build on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. An individual Internet-Draft for the receipt wire format.&lt;/strong&gt;&lt;br&gt;
The format is posted on the IETF Datatracker as &lt;code&gt;draft-xkumakichi-xaip-receipts-00&lt;/code&gt;. It defines a JSON wire format for one signed execution receipt: who acted (&lt;code&gt;agentDid&lt;/code&gt;), who delegated (&lt;code&gt;callerDid&lt;/code&gt;), what tool was called, whether the call succeeded, how long it took, and how the inputs and outputs are identified by hash. Signatures are Ed25519, with optional co-signature by the caller. Identities are W3C Decentralized Identifiers, with no constraint on the DID method. The draft is intentionally narrow: it covers the wire format only. Scoring models, aggregation topologies, and decision logic are deployment-policy concerns and explicitly out of scope.&lt;/p&gt;

&lt;p&gt;It is worth being precise about what this is and is not. It is an &lt;strong&gt;individual Internet-Draft&lt;/strong&gt;. It is not an IETF standard. It is not IETF-approved. It has no formal standing in the IETF standards process. The value of being on Datatracker is having a citable URL whose content can be referenced by other individual drafts, papers, or implementations.&lt;/p&gt;
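
&lt;p&gt;For a concrete sense of what one signed receipt involves, here is a minimal TypeScript sketch that builds, signs, and verifies a receipt using the Ed25519 support in Node's built-in &lt;code&gt;node:crypto&lt;/code&gt; module. Only &lt;code&gt;agentDid&lt;/code&gt; and &lt;code&gt;callerDid&lt;/code&gt; are field names taken from the description above. The other field names, the example DIDs, and the sorted-key JSON canonicalization are assumptions made for illustration, not the draft's normative definitions.&lt;/p&gt;

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

// Illustrative receipt body. Field names other than agentDid and callerDid are assumptions.
interface ReceiptBody {
  agentDid: string;   // who acted
  callerDid: string;  // who delegated
  toolName: string;   // what was called
  success: boolean;   // whether the call succeeded
  latencyMs: number;  // how long it took
  taskHash: string;   // hash identifying the inputs
  resultHash: string; // hash identifying the outputs
  timestamp: string;
}

const sha256Hex = (data: string): string =>
  createHash("sha256").update(data).digest("hex");

// Assumed canonical form: JSON with keys in alphabetical order (flat objects only).
const canonicalize = (b: ReceiptBody): string =>
  JSON.stringify(b, Object.keys(b).sort());

// A throwaway key pair standing in for the signer's real key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

const body: ReceiptBody = {
  agentDid: "did:example:translator-alpha",
  callerDid: "did:example:caller-1",
  toolName: "translate",
  success: true,
  latencyMs: 412,
  taskHash: sha256Hex("Translate a document into Japanese"),
  resultHash: sha256Hex("(translated output)"),
  timestamp: "2026-05-28T00:00:00Z",
};

// Ed25519 takes no separate digest algorithm, so the first argument is null.
const signature = sign(null, Buffer.from(canonicalize(body)), privateKey);

const verifyReceipt = (b: ReceiptBody, sig: Buffer): boolean =>
  verify(null, Buffer.from(canonicalize(b)), publicKey, sig);

console.log(verifyReceipt(body, signature));                        // true
console.log(verifyReceipt({ ...body, success: false }, signature)); // false
```

&lt;p&gt;The second check is the point of signing: flipping &lt;code&gt;success&lt;/code&gt; after the fact invalidates the signature, so a verifier holding only the receipt and the public key can detect the change.&lt;/p&gt;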

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;xaip-sdk@0.5.0&lt;/code&gt; on npm, with a &lt;code&gt;precheck()&lt;/code&gt; helper.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;precheck()&lt;/code&gt; is a thin SDK wrapper over a public Trust API endpoint that consumes the receipt graph. Given a task description and a list of candidate slugs, it returns ranked execution evidence — receipt counts, observed success rates, risk flags, and an eligibility flag the SDK computes from the caller's policy. The SDK does not invoke the candidate. The SDK does not pay for anything. It returns evidence; the caller's own logic decides what to do with that evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Two browser demos.&lt;/strong&gt;&lt;br&gt;
The first, "Trust Evidence Before Delegation," covers the free case: three contrasting MCP candidates against the live trust scores. The second, "Before Payment Evidence Demo," covers the paid closed-source case: three fictional translation skills with deliberately indistinguishable marketplace listings but different execution-evidence profiles. Both are static, buildless, and read-only. The paid demo is seeded with synthetic fixture data because there are no real paid-skill receipts in the public graph yet — that is itself one of the open questions.&lt;/p&gt;

&lt;p&gt;These three pieces are intended to be useful independently. The Internet-Draft is useful even if you never install the SDK. The SDK is useful even if you never read the draft. The demos are useful even if you never write any code yourself.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using &lt;code&gt;precheck()&lt;/code&gt; in a few lines
&lt;/h2&gt;

&lt;p&gt;The smallest useful call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;precheck&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;xaip-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;precheck&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Translate a document into Japanese&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;skill:translator-alpha&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;skill:translator-beta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;skill:translator-gamma&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;minReceipts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;excludeRiskFlags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;repeated_timeout&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;includeDecision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The return value is structured, not narrative. The interesting fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;selected&lt;/code&gt; — the candidate the SDK picked by applying the supplied policy to the ranked evidence. &lt;code&gt;null&lt;/code&gt; if no candidate was eligible. The SDK recomputes this from the policy; it does not blindly forward the server's &lt;code&gt;selected&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ranked&lt;/code&gt; — every input candidate, each with &lt;code&gt;score&lt;/code&gt;, &lt;code&gt;receiptCount&lt;/code&gt;, &lt;code&gt;confidence&lt;/code&gt;, &lt;code&gt;riskFlags&lt;/code&gt;, &lt;code&gt;verdict&lt;/code&gt;, and an &lt;code&gt;eligible&lt;/code&gt; boolean.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unscored&lt;/code&gt; — convenience list of candidates with no execution evidence at all.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reason&lt;/code&gt; — a controlled string. It is one of exactly two values: &lt;code&gt;"Selected using available execution evidence."&lt;/code&gt; or &lt;code&gt;"No eligible candidates based on available execution evidence."&lt;/code&gt; It does not vary by case. Consumer code is supposed to read the structured fields, not parse the string.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;decision&lt;/code&gt; — optional, present only when the caller opts in with &lt;code&gt;includeDecision: true&lt;/code&gt;. Values are &lt;code&gt;"allow"&lt;/code&gt;, &lt;code&gt;"warn"&lt;/code&gt;, or &lt;code&gt;"unknown"&lt;/code&gt;. There is no &lt;code&gt;"block"&lt;/code&gt;; blocking is not the SDK's job.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point of the controlled &lt;code&gt;reason&lt;/code&gt; and the missing &lt;code&gt;"block"&lt;/code&gt; is that the SDK never positions itself as the one making the call. The SDK surfaces structured evidence. The caller decides whether to pay, to invoke, to ask the user, to fall back, or to escalate.&lt;/p&gt;
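
&lt;p&gt;One way a caller might act on that structure is sketched below. The &lt;code&gt;PrecheckResult&lt;/code&gt; type here is a hand-written approximation covering a subset of the fields listed above, not the SDK's exported type. The field types, the &lt;code&gt;slug&lt;/code&gt; property, and the action names are assumptions.&lt;/p&gt;

```typescript
// Hand-written approximation of the result shape described above; not the SDK's own types.
type Decision = "allow" | "warn" | "unknown";

interface RankedCandidate {
  slug: string; // assumed name for the candidate identifier
  score: number;
  receiptCount: number;
  riskFlags: string[];
  eligible: boolean;
}

interface PrecheckResult {
  selected: string | null;
  ranked: RankedCandidate[];
  unscored: string[];
  decision?: Decision; // present only when includeDecision is true
}

type Action =
  | { kind: "invoke"; slug: string }
  | { kind: "ask_user"; slug: string; why: string }
  | { kind: "fallback"; why: string };

// Caller-side policy: the SDK surfaces evidence, this function makes the call.
function chooseAction(result: PrecheckResult): Action {
  if (result.selected === null) {
    return { kind: "fallback", why: "no eligible candidate" };
  }
  const slug = result.selected;
  if (result.decision === "allow") {
    return { kind: "invoke", slug };
  }
  // "warn", "unknown", or no decision requested: escalate rather than pay blindly.
  return { kind: "ask_user", slug, why: `decision=${result.decision ?? "absent"}` };
}

const example: PrecheckResult = {
  selected: "skill:translator-alpha",
  ranked: [
    { slug: "skill:translator-alpha", score: 0.9, receiptCount: 120, riskFlags: [], eligible: true },
  ],
  unscored: ["skill:translator-gamma"],
  decision: "warn",
};

console.log(chooseAction(example).kind); // "ask_user"
```

&lt;p&gt;The branch structure reads only structured fields and never parses &lt;code&gt;reason&lt;/code&gt;. Whether a &lt;code&gt;"warn"&lt;/code&gt; should escalate to the user or proceed is the caller's policy, not the SDK's.&lt;/p&gt;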

&lt;h2&gt;
  
  
  What XAIP is not
&lt;/h2&gt;

&lt;p&gt;This is a short list. It is short on purpose.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;XAIP is not a sandbox.&lt;/strong&gt; It does not isolate the execution environment of any tool, skill, or agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XAIP is not an approval engine.&lt;/strong&gt; It does not gate calls. It does not have a &lt;code&gt;"block"&lt;/code&gt; decision value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XAIP is not a payment rail.&lt;/strong&gt; It does not move money. It does not hold balances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XAIP does not make tools safe.&lt;/strong&gt; Safety is a property of the tool and how the caller uses it, not of the record format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XAIP does not guarantee trust.&lt;/strong&gt; It surfaces evidence, which the caller may use as one of several inputs into their own trust decision.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And one more, which is often what people actually want to ask about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Receipts are the primary artifact. Scores and eligibility are derived views.&lt;/strong&gt; The &lt;code&gt;verdict&lt;/code&gt;, the &lt;code&gt;confidence&lt;/code&gt;, the &lt;code&gt;eligible&lt;/code&gt; boolean — they are derived from the underlying signed receipts using a stated method. The receipts carry the signature chain and the long-term portability. The scores are a convenience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a future caller wants to derive a different score, with a different aggregation method, or weighted differently, they can do that over the same receipt graph without re-emitting anything. That is the whole reason the wire format is the artifact and the score is the view.&lt;/p&gt;
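
&lt;p&gt;As an illustration of a score being a view over receipts, the sketch below derives a Laplace-smoothed success rate per candidate from a plain list of receipts. This is a deliberately simple stand-in chosen for the example, not the method the XAIP aggregator uses, and the trimmed-down receipt shape is an assumption.&lt;/p&gt;

```typescript
// Trimmed-down receipt for illustration; real receipts also carry signatures and hashes.
interface MiniReceipt {
  agentDid: string;
  success: boolean;
}

interface DerivedView {
  receiptCount: number;
  smoothedSuccessRate: number;
}

// Laplace smoothing: (successes + 1) / (n + 2), so a single lucky call does not read as 100%.
function deriveViews(receipts: MiniReceipt[]): { [agentDid: string]: DerivedView } {
  const counts: { [agentDid: string]: { n: number; ok: number } } = {};
  for (const r of receipts) {
    const c = counts[r.agentDid] ?? { n: 0, ok: 0 };
    c.n += 1;
    if (r.success) c.ok += 1;
    counts[r.agentDid] = c;
  }
  const views: { [agentDid: string]: DerivedView } = {};
  for (const [did, c] of Object.entries(counts)) {
    views[did] = { receiptCount: c.n, smoothedSuccessRate: (c.ok + 1) / (c.n + 2) };
  }
  return views;
}

const views = deriveViews([
  { agentDid: "did:example:alpha", success: true },
  { agentDid: "did:example:alpha", success: true },
  { agentDid: "did:example:alpha", success: false },
  { agentDid: "did:example:beta", success: true },
]);

console.log(views["did:example:alpha"].smoothedSuccessRate); // 0.6
console.log(views["did:example:beta"].receiptCount);         // 1
```

&lt;p&gt;Swapping in a different aggregation only means replacing &lt;code&gt;deriveViews&lt;/code&gt;. The receipts it reads stay exactly as they were emitted.&lt;/p&gt;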

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;This is early. Several pieces are explicitly open:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Caller diversity.&lt;/strong&gt; The current public dataset is heavily produced by a small number of callers. Signals derived from the receipt graph become more interesting as more independent observers contribute. There is no theoretical fix for this; it is a question of who actually runs &lt;code&gt;precheck()&lt;/code&gt; and emits receipts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Candidate categories.&lt;/strong&gt; The current SDK treats &lt;code&gt;candidates&lt;/code&gt; as opaque string slugs. The convention is to use a prefix when useful — &lt;code&gt;tool:&lt;/code&gt;, &lt;code&gt;skill:&lt;/code&gt;, &lt;code&gt;agent:&lt;/code&gt; — but a structured shape (&lt;code&gt;{ id, type }&lt;/code&gt;) is deliberately deferred until a second caller asks for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Receipt provenance.&lt;/strong&gt; As external probe networks, synthetic monitors, and integration tests start emitting receipts, the question of whether a &lt;code&gt;source&lt;/code&gt; field belongs on the receipt — &lt;code&gt;real_agent_call&lt;/code&gt; vs &lt;code&gt;synthetic_probe&lt;/code&gt; vs &lt;code&gt;scheduled_health_check&lt;/code&gt; — becomes more relevant. Mixing these in equal weight would distort the signal. This is logged as an open question against the format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Co-signature ratio enforcement.&lt;/strong&gt; The SDK accepts a &lt;code&gt;requireCoSignatureRatio&lt;/code&gt; policy field for future use but currently throws if it is greater than zero, because the aggregator does not yet expose per-candidate co-signature ratios. Silently accepting a policy that the SDK cannot enforce would be worse than refusing it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settlement-class tools.&lt;/strong&gt; Tools whose outputs are externally anchored (for example, on-chain settlement) have very different evidence semantics than retrieval tools. The receipt format permits a &lt;code&gt;toolMetadata&lt;/code&gt; field for category hints, but standardizing those hints is deferred.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these need to be resolved before the receipt format is useful. They are listed so the format is not mistaken for a finished product.&lt;/p&gt;
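
&lt;p&gt;The fail-closed handling of &lt;code&gt;requireCoSignatureRatio&lt;/code&gt; described above can be sketched as a small validator. The function name, the error message, and the &lt;code&gt;Policy&lt;/code&gt; type are hypothetical. Only the policy field names come from this post, and the SDK's real implementation may differ.&lt;/p&gt;

```typescript
// Hypothetical policy shape using the field names mentioned in this post.
interface Policy {
  minReceipts?: number;
  excludeRiskFlags?: string[];
  requireCoSignatureRatio?: number;
}

// Refuse a policy that cannot be enforced yet instead of silently ignoring it.
function assertEnforceable(policy: Policy): Policy {
  const ratio = policy.requireCoSignatureRatio ?? 0;
  if (ratio > 0) {
    throw new Error(
      "requireCoSignatureRatio above zero cannot be enforced: per-candidate co-signature ratios are unavailable"
    );
  }
  return policy;
}

// An enforceable policy passes through unchanged.
console.log(assertEnforceable({ minReceipts: 10 }).minReceipts); // 10

// An unenforceable one is rejected loudly.
try {
  assertEnforceable({ requireCoSignatureRatio: 0.5 });
} catch (err) {
  console.log((err as Error).message);
}
```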

&lt;h2&gt;
  
  
  Try it / read it
&lt;/h2&gt;

&lt;p&gt;If you want to read the spec:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://datatracker.ietf.org/doc/draft-xkumakichi-xaip-receipts/" rel="noopener noreferrer"&gt;Internet-Draft: draft-xkumakichi-xaip-receipts-00&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to use the SDK:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/xaip-sdk" rel="noopener noreferrer"&gt;xaip-sdk on npm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/xkumakichi/xaip-protocol/blob/main/docs/precheck.md" rel="noopener noreferrer"&gt;precheck() API guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to see what the contrast looks like in a browser:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://xkumakichi.github.io/xaip-protocol/evidence-before-delegation.html" rel="noopener noreferrer"&gt;Trust Evidence Before Delegation (free tools)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://xkumakichi.github.io/xaip-protocol/before-payment-demo.html" rel="noopener noreferrer"&gt;Before Payment Evidence Demo (paid closed-source skills, seeded)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to look at the live aggregator output or the repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://xkumakichi.github.io/xaip-protocol/" rel="noopener noreferrer"&gt;Live trust scores dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/xkumakichi/xaip-protocol" rel="noopener noreferrer"&gt;Repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to help this work better, contribute receipts as an independent caller:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;--yes&lt;/span&gt; xaip-caller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;npx.cmd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--yes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xaip-caller&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it does, before you run it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Node.js 20 or newer.&lt;/li&gt;
&lt;li&gt;Requires network access (it talks to the public XAIP aggregator).&lt;/li&gt;
&lt;li&gt;Creates or reuses a local caller key under your home directory.&lt;/li&gt;
&lt;li&gt;Makes a few real HTTP checks against public read-only endpoints.&lt;/li&gt;
&lt;li&gt;Signs receipts for those calls with the local key.&lt;/li&gt;
&lt;li&gt;Posts the receipts to the live XAIP aggregator.&lt;/li&gt;
&lt;li&gt;No signup or API key required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It takes about thirty seconds. The receipt format and &lt;code&gt;precheck()&lt;/code&gt; help most when many independent observers are watching the same candidates — and the only way that happens is one caller at a time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Receipts before AI tool calls</title>
      <dc:creator>xaip-agent</dc:creator>
      <pubDate>Mon, 11 May 2026 04:08:55 +0000</pubDate>
      <link>https://dev.to/xkumakichi/receipts-before-ai-tool-calls-pbj</link>
      <guid>https://dev.to/xkumakichi/receipts-before-ai-tool-calls-pbj</guid>
      <description>&lt;p&gt;&lt;a href="https://xkumakichi.github.io/xaip-protocol/evidence-before-delegation.html" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o21vt793ab3x20xlx6l.gif" alt="Trust Evidence Before Delegation" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a short update on XAIP since my earlier write-up on portable trust.&lt;/p&gt;

&lt;p&gt;The main changes are: a new public demo, refreshed live numbers, and receipts from MCP, LangChain.js callbacks, and OpenAI-compatible tool-call loops in the same public trust graph.&lt;/p&gt;

&lt;p&gt;I've been building XAIP, a provider-neutral signed execution evidence layer for AI agent tool calls.&lt;/p&gt;

&lt;p&gt;The basic idea is simple: before an agent delegates work to an external tool, it should be able to inspect historical execution evidence from previous signed receipts.&lt;/p&gt;

&lt;p&gt;XAIP is not another agent framework. It sits underneath agent runtimes as a portable receipt layer. The receipt format is the same regardless of which runtime emitted it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where receipts come from today
&lt;/h2&gt;

&lt;p&gt;Current live integrations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP servers&lt;/li&gt;
&lt;li&gt;LangChain.js callback handlers&lt;/li&gt;
&lt;li&gt;OpenAI-compatible tool-call loops&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Current snapshot (2026-05-11)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;10 servers in the public trust graph&lt;/li&gt;
&lt;li&gt;3,239 signed execution receipts&lt;/li&gt;
&lt;li&gt;Receipts from MCP, LangChain.js callbacks, and OpenAI-compatible
tool-call loops&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the demo shows
&lt;/h2&gt;

&lt;p&gt;The public demo shows a simple contrast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;without XAIP: candidate tools look interchangeable&lt;/li&gt;
&lt;li&gt;with XAIP: signed receipt history, observed failures, and unscored
candidates are visible before delegation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;XAIP does not make tools safe, and it does not guarantee trust. It makes execution evidence visible before delegation. Trust scores are one derived view over receipts; receipts themselves are the primary artifact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Demo: &lt;a href="https://xkumakichi.github.io/xaip-protocol/evidence-before-delegation.html" rel="noopener noreferrer"&gt;https://xkumakichi.github.io/xaip-protocol/evidence-before-delegation.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/xkumakichi/xaip-protocol" rel="noopener noreferrer"&gt;https://github.com/xkumakichi/xaip-protocol&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feedback on the receipt model and the pre-delegation evidence framing is very welcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Previously
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://zenn.dev/xkumakichi/articles/e93a438265a682" rel="noopener noreferrer"&gt;信頼は持ち運べる (2026-04-22)&lt;/a&gt;
— earlier Japanese intro to XAIP focused on portability of trust signals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post updates the framing toward receipts as the primary artifact, with a new public demo and refreshed live numbers.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Portable Trust</title>
      <dc:creator>xaip-agent</dc:creator>
      <pubDate>Tue, 21 Apr 2026 09:54:57 +0000</pubDate>
      <link>https://dev.to/xkumakichi/portable-trust-o4o</link>
      <guid>https://dev.to/xkumakichi/portable-trust-o4o</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — When an AI agent picks a tool, it makes a trust decision. The quality of that decision depends entirely on &lt;em&gt;where the trust data comes from&lt;/em&gt;. If trust flows through a single gatekeeper — a registry, a platform's curation, a community's moderation — the agent inherits that gatekeeper's failure modes. This post argues that trust infrastructure for AI agents must be provider-neutral and behavior-derived, and walks through what a concrete implementation of that principle looks like, with live data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The tool-choice problem
&lt;/h2&gt;

&lt;p&gt;An AI agent receives a task: "fetch the React hooks docs."&lt;/p&gt;

&lt;p&gt;Its planner produces a candidate list: three documentation tools, two search tools, one fallback web scraper. Which one does it pick?&lt;/p&gt;

&lt;p&gt;Today, the honest answer is: it picks based on &lt;em&gt;name recognition in the model's training data&lt;/em&gt; plus &lt;em&gt;whatever the platform decided to show it&lt;/em&gt;. There is no runtime trust signal. The agent does not know which tool succeeded yesterday, which one is quietly returning stale data, or which one has been silently deprecated.&lt;/p&gt;

&lt;p&gt;This is the tool-choice problem, and it is a trust-data problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three places trust data can live
&lt;/h2&gt;

&lt;p&gt;Trust data for tools can come from three very different places:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Self-declared&lt;/strong&gt; — the tool's README says it's good.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform-curated&lt;/strong&gt; — the platform it's published on has a list of "recommended" tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavior-derived&lt;/strong&gt; — past executions are logged, signed, and aggregated; trust is computed from outcomes, not claims.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Only (3) is robust against gaming, drift, and upstream policy changes. But (3) is also the hardest to deliver, because it requires infrastructure: signed receipts, a canonical aggregation model, and an identity system that doesn't depend on any single platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why provider-neutrality matters, structurally
&lt;/h2&gt;

&lt;p&gt;Suppose you build trust scores on top of a single community's registry.&lt;/p&gt;

&lt;p&gt;The registry is itself a trust layer — it decides what's visible, what's highlighted, what's removed. When visibility rules change — whether to promote some tools, demote others, or restrict participation — the scoring space implicitly changes with them. Tools that were previously indexed can disappear from consideration. Projects whose contributors cannot register never accumulate receipts in the first place. None of this reflects anything about the tools' behavior; it reflects the registry's state at a point in time.&lt;/p&gt;

&lt;p&gt;This is not a critique of any particular community. It's a structural property of &lt;strong&gt;any layered system where upstream visibility decisions feed downstream trust signals&lt;/strong&gt;. Those decisions become an implicit input to the trust model, whether or not you want them to.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Without a portable trust layer, agents are not choosing tools — they are inheriting decisions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The implication for trust infrastructure: the &lt;strong&gt;receipts, identity, and scoring must all be portable&lt;/strong&gt;. If a community exits, the data must remain queryable. If a platform changes policy, the scoring must still compute. If an identity provider goes away, the agent must still be verifiable. Trust infrastructure that depends on a single upstream is not trust infrastructure — it is a brittle proxy for that upstream's preferences.&lt;/p&gt;

&lt;h2&gt;
  
  
  What portable trust looks like
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/xkumakichi/xaip-protocol" rel="noopener noreferrer"&gt;XAIP&lt;/a&gt; is one implementation of this principle. Its design follows from the structural requirement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signed receipts&lt;/strong&gt;, not self-reports. Every tool execution produces an Ed25519-signed receipt: &lt;code&gt;{ agentDid, callerDid, taskHash, resultHash, success, latencyMs, timestamp }&lt;/code&gt;. The caller co-signs so the tool cannot unilaterally inflate its own reputation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standards-based identity&lt;/strong&gt;. Agents and callers use &lt;a href="https://www.w3.org/TR/did-core/" rel="noopener noreferrer"&gt;W3C DIDs&lt;/a&gt; (&lt;code&gt;did:key&lt;/code&gt;, &lt;code&gt;did:web&lt;/code&gt;, &lt;code&gt;did:xrpl&lt;/code&gt;). No platform account required. An agent expelled from one community retains its identity in every other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bayesian trust, not thresholds&lt;/strong&gt;. Scores are computed as &lt;code&gt;bayesianScore × callerDiversity × coSignFactor&lt;/code&gt;, with DID-method-dependent priors. Cheap identities don't get free trust; expensive identities converge to the same score given enough evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider-neutral receipt producers&lt;/strong&gt;. The same receipt format is emitted by integrations for &lt;a href="https://github.com/xkumakichi/xaip-protocol/tree/main/clients/claude-code-hook" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;, &lt;a href="https://www.npmjs.com/package/xaip-langchain" rel="noopener noreferrer"&gt;LangChain.js&lt;/a&gt;, and &lt;a href="https://www.npmjs.com/package/xaip-openai" rel="noopener noreferrer"&gt;OpenAI tool calling&lt;/a&gt;. A receipt produced by a LangChain agent is byte-compatible with one from an OpenAI chat completion. The trust graph is one graph, regardless of how the agent was built.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation you can run yourself&lt;/strong&gt;. The reference aggregator is a Cloudflare Worker (open source, small). If you don't trust the public instance, you run your own. Multi-aggregator quorum is part of the spec.&lt;/li&gt;
&lt;/ul&gt;
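
&lt;p&gt;The product form of the score can be sketched as follows. Only the three factor names come from the list above. The Beta prior parameters, the diversity formula, and the co-sign weighting are placeholder assumptions chosen to show the shape of the computation, not XAIP's actual definitions.&lt;/p&gt;

```typescript
interface Evidence {
  successes: number;
  failures: number;
  distinctCallers: number;
  coSignedReceipts: number;
}

// Posterior mean of a Beta(priorA, priorB) prior after the observed successes and failures.
function bayesianScore(e: Evidence, priorA: number, priorB: number): number {
  return (e.successes + priorA) / (e.successes + e.failures + priorA + priorB);
}

// Placeholder: approaches 1 as more independent callers contribute.
function callerDiversity(e: Evidence): number {
  return e.distinctCallers / (e.distinctCallers + 1);
}

// Placeholder: half weight with no co-signatures, full weight when every receipt is co-signed.
function coSignFactor(e: Evidence): number {
  const total = e.successes + e.failures;
  return total === 0 ? 0.5 : 0.5 + 0.5 * (e.coSignedReceipts / total);
}

// A DID-method-dependent prior would be passed in through priorA and priorB.
function trustScore(e: Evidence, priorA = 1, priorB = 1): number {
  return bayesianScore(e, priorA, priorB) * callerDiversity(e) * coSignFactor(e);
}

const few: Evidence = { successes: 3, failures: 0, distinctCallers: 1, coSignedReceipts: 0 };
const many: Evidence = { successes: 300, failures: 10, distinctCallers: 9, coSignedReceipts: 310 };

console.log(trustScore(few).toFixed(3));  // "0.200"
console.log(trustScore(many).toFixed(3)); // "0.868"
```

&lt;p&gt;Three clean calls from one caller with no co-signatures score low here, even though the raw success rate is perfect. That is the intended effect of multiplying the evidence-based score by diversity and co-signature factors.&lt;/p&gt;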

&lt;h2&gt;
  
  
  Live data
&lt;/h2&gt;

&lt;p&gt;The reference deployment has been running for a few weeks. As of writing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10 tool servers&lt;/strong&gt; scored (docs retrieval, reasoning, memory, filesystem, search, DB, VCS, and more)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2,100+&lt;/strong&gt; signed execution receipts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated daily collection&lt;/strong&gt; via CI with fresh caller keys each run (caller diversity is a first-class signal)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Live dashboard: &lt;a href="https://xkumakichi.github.io/xaip-protocol/" rel="noopener noreferrer"&gt;xkumakichi.github.io/xaip-protocol&lt;/a&gt;&lt;br&gt;
Trust API: &lt;code&gt;https://xaip-trust-api.kuma-github.workers.dev/v1/servers&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can ask it which tool to pick right now:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://xaip-trust-api.kuma-github.workers.dev/v1/select &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"task":"Fetch React docs","candidates":["context7","sequential-thinking","unknown-server"]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes both the selection and a counterfactual: what would have happened had you chosen at random with no trust data. That counterfactual is the value proposition, because it shows whether trust data actually saved the agent a wasted call.&lt;/p&gt;
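
&lt;p&gt;The same request can be issued from code. The sketch below uses only the Python standard library and is not the &lt;code&gt;xaip-sdk&lt;/code&gt; API. The helper names are hypothetical, and the &lt;code&gt;selected&lt;/code&gt; and &lt;code&gt;rejected&lt;/code&gt; fields are assumed from a sample response published for this endpoint. The counterfactual part of the payload is left untouched because its field names are not shown here.&lt;/p&gt;

```python
import json
import urllib.request

SELECT_URL = "https://xaip-trust-api.kuma-github.workers.dev/v1/select"


def build_select_request(task, candidates):
    """Build the POST request for /v1/select without sending it."""
    body = json.dumps({"task": task, "candidates": candidates}).encode("utf-8")
    return urllib.request.Request(
        SELECT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_selection(response):
    """Return (selected slug, list of rejected slugs) from a decoded response."""
    rejected = [entry["slug"] for entry in response.get("rejected", [])]
    return response.get("selected"), rejected


def select_server(task, candidates, timeout=10):
    """Send the request and decode the JSON reply (this performs a network call)."""
    request = build_select_request(task, candidates)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

&lt;p&gt;Calling &lt;code&gt;summarize_selection(select_server("Fetch React docs", ["context7", "sequential-thinking", "unknown-server"]))&lt;/code&gt; would return the chosen slug plus the candidates that were ruled out, assuming the endpoint is reachable.&lt;/p&gt;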

&lt;h2&gt;
  
  
  What "provider-neutral" buys you, concretely
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An agent built on LangChain and an agent built on OpenAI's SDK can share trust data about the same underlying tool. Today, they can't — each framework has its own observability silo.&lt;/li&gt;
&lt;li&gt;A tool whose author is gated out of one community still accumulates trust from callers in every other community.&lt;/li&gt;
&lt;li&gt;A grant reviewer evaluating agent infrastructure projects can verify receipts independently, without relying on any single platform's dashboard.&lt;/li&gt;
&lt;li&gt;A future regulatory regime that asks "what's your trust basis for this agent's tool choices?" has a portable, auditable answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The spec is open, the aggregator is live, the three framework integrations are on npm. The next frontier is &lt;strong&gt;class-aware risk evaluation&lt;/strong&gt; — a settlement tool whose outcomes are anchored to an external ledger doesn't need the same trust signals as an advisory tool whose outputs are freely consumed. The &lt;a href="https://github.com/xkumakichi/xaip-protocol/blob/main/XAIP-SPEC-v0.5-DRAFT.md" rel="noopener noreferrer"&gt;v0.5 draft&lt;/a&gt; tackles that.&lt;/p&gt;

&lt;p&gt;The underlying claim is simple: trust infrastructure for AI agents is too important to depend on any one platform, community, or moderator. The sooner we build it as a portable layer, the sooner the ecosystem can reason about tool choices the way we already reason about TLS certificates and package signatures — with math, not vibes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;XAIP is MIT-licensed and open source. Feedback on the v0.5 draft is welcome via &lt;a href="https://github.com/xkumakichi/xaip-protocol/issues" rel="noopener noreferrer"&gt;GitHub issues&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
    <item>
      <title>What the agent stack is still missing</title>
      <dc:creator>xaip-agent</dc:creator>
      <pubDate>Mon, 20 Apr 2026 23:23:49 +0000</pubDate>
      <link>https://dev.to/xkumakichi/what-the-agent-stack-is-still-missing-3hcn</link>
      <guid>https://dev.to/xkumakichi/what-the-agent-stack-is-still-missing-3hcn</guid>
      <description>&lt;p&gt;This week the agent economy narrative crystallized in three posts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cameron Winklevoss (Gemini):&lt;/strong&gt; "Humans may have built crypto, but crypto is not so much money for humans as it is money for machines."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brian Armstrong (Coinbase):&lt;/strong&gt; launched &lt;a href="https://agentic.market" rel="noopener noreferrer"&gt;Agentic.market&lt;/a&gt;, a discovery layer where AI agents find and pay for services over x402.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t54.ai:&lt;/strong&gt; "Every check in today's financial stack was designed around a human. Signatures, IDs, clicks, chargebacks. When an AI agent is the one transacting, each of those checks has a gap."&lt;/p&gt;

&lt;p&gt;Three different angles, one convergent thesis: &lt;strong&gt;agents are becoming first-class economic actors, and the existing stack doesn't fit them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Payments have a shipped answer (x402). Discovery now has a shipped answer (Agentic.market). The question I've been sitting with is what sits underneath both of those:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When an agent calls a service, how does it know the service is trustworthy &lt;em&gt;in practice&lt;/em&gt;, not just in documentation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the trust layer. It's the one that's still missing — and it's the one I've been building.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap
&lt;/h2&gt;

&lt;p&gt;A signed transaction proves an agent &lt;em&gt;authorized&lt;/em&gt; a call. It doesn't prove the call was &lt;em&gt;safe to make&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The repo can look well-maintained and still ship a buggy release.&lt;/li&gt;
&lt;li&gt;The marketplace listing can be legitimate and still be an attack (see the Ox Security research on MCP marketplace poisoning published April 16).&lt;/li&gt;
&lt;li&gt;The provider can be fine at T=0 and compromised at T=30 days.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are problems payments don't solve. Discovery doesn't solve them either — an agent finding a service via Agentic.market still needs to know if that service has been acting suspiciously over the last 1,000 calls.&lt;/p&gt;

&lt;p&gt;t54.ai's framing — "each of those checks has a gap" — applies one layer lower than they were writing about. The same gap exists for &lt;em&gt;which services an agent should call at all&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a trust layer actually is
&lt;/h2&gt;

&lt;p&gt;Three things, in order of difficulty:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Signed receipts&lt;/strong&gt; — an attestation that agent A called server B, dual-signed, hashes only (no raw content).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation with defense&lt;/strong&gt; — receipts feed a score. The scoring must be Byzantine-robust or the whole thing is theater.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live scores agents can query before calling&lt;/strong&gt; — one HTTP GET, no auth, no SDK.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Code is the easy part. The hard parts are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cold start.&lt;/strong&gt; A trust layer with no receipts is useless. A trust layer with 10 receipts is misleading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caller diversity.&lt;/strong&gt; If one participant dominates the dataset, you're scoring their experience, not the server's.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial robustness.&lt;/strong&gt; Someone will try to tank a competitor's score. The math has to make that expensive.&lt;/li&gt;
&lt;/ul&gt;
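
&lt;p&gt;The caller-diversity problem is easy to show with a toy measurement. The Python sketch below computes the share of receipts that come from the single most active caller and flags the dataset when that share passes a cutoff. Both the rule and the 0.5 cutoff are assumptions made for illustration, not the aggregator's actual definition.&lt;/p&gt;

```python
from collections import Counter


def top_caller_share(caller_dids):
    """Fraction of receipts contributed by the single most active caller."""
    if not caller_dids:
        return 0.0
    counts = Counter(caller_dids)
    return max(counts.values()) / len(caller_dids)


def has_low_caller_diversity(caller_dids, max_share=0.5):
    """Hypothetical rule: flag when one caller exceeds max_share of all receipts."""
    return top_caller_share(caller_dids) > max_share


# 100 receipts, 90 of them from one caller.
receipts = ["did:key:alice"] * 90 + ["did:key:bob"] * 6 + ["did:key:carol"] * 4
print(top_caller_share(receipts))          # 0.9
print(has_low_caller_diversity(receipts))  # True
```

&lt;p&gt;A score built on that dataset mostly describes one caller's experience, which is exactly the failure mode named above.&lt;/p&gt;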

&lt;h2&gt;
  
  
  The XAIP receipt layer
&lt;/h2&gt;

&lt;p&gt;I shipped one implementation of this. If you want the hook-level walkthrough, the &lt;a href="https://dev.to/xkumakichi/a-claude-code-hook-that-warns-you-before-calling-a-low-trust-mcp-server-ckk"&gt;first article&lt;/a&gt; covers installation and the developer-facing side.&lt;/p&gt;

&lt;p&gt;Briefly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ed25519-signed receipts per MCP tool call (hashed I/O only)&lt;/li&gt;
&lt;li&gt;Public Cloudflare Worker aggregator, Bayesian scoring, per-server flags (&lt;code&gt;high_error_rate&lt;/code&gt;, &lt;code&gt;low_caller_diversity&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;One-command Claude Code hook that consumes the scores and contributes receipts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Live scores right now (8 servers, ~1,500 receipts, small but real):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;memory      0.800  trusted
git         0.775  trusted
sqlite      0.753  trusted
puppeteer   0.671  caution  (high_error_rate)
context7    0.618  caution  (low_caller_diversity)
filesystem  0.579  caution  (low_caller_diversity)
playwright  0.394  low_trust (high_error_rate)
fetch       0.365  low_trust (high_error_rate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;curl https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is an ecosystem problem, not a product
&lt;/h2&gt;

&lt;p&gt;A trust layer only works if many independent participants contribute receipts. One person running it alone — which is the current state of XAIP — triggers &lt;code&gt;low_caller_diversity&lt;/code&gt; on every high-volume server. That's not a bug; that's the flag working correctly. It's literally telling you not to trust the scores until more callers are in the dataset.&lt;/p&gt;

&lt;p&gt;So I'm not pitching a product. I'm asking: if you're building in the agent space and you think trust scoring is a layer that should exist, contribute receipts. Or run an aggregator node (the spec is in the repo, BFT quorum is the next milestone). Or tell me why the design is wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stack picture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent economy layers (rough)
───────────────────────────────
Payments       → x402 (shipped)
Discovery      → Agentic.market (shipped)
Trust scoring  → XAIP + ?          (small, needs company)
Identity       → DID / passkeys    (fragmented)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;XAIP is one attempt at the trust row. Almost certainly not the final one — but the row has to get filled, and waiting for Anthropic or a well-funded startup to do it means the first large-scale MCP compromise happens before the layer exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Live dashboard: &lt;a href="https://xkumakichi.github.io/xaip-protocol/" rel="noopener noreferrer"&gt;https://xkumakichi.github.io/xaip-protocol/&lt;/a&gt; (scores auto-refresh, no auth)&lt;/li&gt;
&lt;li&gt;Previous article: &lt;a href="https://dev.to/xkumakichi/a-claude-code-hook-that-warns-you-before-calling-a-low-trust-mcp-server-ckk"&gt;https://dev.to/xkumakichi/a-claude-code-hook-that-warns-you-before-calling-a-low-trust-mcp-server-ckk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/xkumakichi/xaip-protocol" rel="noopener noreferrer"&gt;https://github.com/xkumakichi/xaip-protocol&lt;/a&gt; (MIT, zero deps)&lt;/li&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/xaip-claude-hook" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/xaip-claude-hook&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Trust API: &lt;a href="https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7" rel="noopener noreferrer"&gt;https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're working on adjacent layers — payment, discovery, identity for agents — I'd be glad to compare notes. The interesting question isn't whose trust layer wins; it's whether &lt;em&gt;any&lt;/em&gt; trust layer exists by the time the stack starts mattering.&lt;/p&gt;

&lt;p&gt;— xkumakichi&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>cryptocurrency</category>
      <category>web3</category>
    </item>
    <item>
      <title>A Claude Code hook that warns you before calling a low-trust MCP server</title>
      <dc:creator>xaip-agent</dc:creator>
      <pubDate>Mon, 20 Apr 2026 14:15:40 +0000</pubDate>
      <link>https://dev.to/xkumakichi/a-claude-code-hook-that-warns-you-before-calling-a-low-trust-mcp-server-ckk</link>
      <guid>https://dev.to/xkumakichi/a-claude-code-hook-that-warns-you-before-calling-a-low-trust-mcp-server-ckk</guid>
      <description>&lt;p&gt;Last week researchers at Ox published findings showing that the MCP STDIO transport lets arbitrary command execution slip through unchecked, and that &lt;a href="https://www.theregister.com/2026/04/16/anthropic_mcp_design_flaw/" rel="noopener noreferrer"&gt;9 of 11 MCP marketplaces they tested were poisonable&lt;/a&gt;. Anthropic's response: STDIO is out of scope for protocol-level fixes, the ecosystem is responsible for operational trust.&lt;/p&gt;

&lt;p&gt;Fair — Anthropic &lt;a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation" rel="noopener noreferrer"&gt;donated MCP to the Linux Foundation's Agentic AI Foundation in December 2025&lt;/a&gt; specifically so independent infrastructure could grow around it. But that leaves a real gap for anyone running Claude Code today: &lt;strong&gt;how do you know whether an MCP server you're about to invoke is trustworthy?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Anthropic official registry is pure metadata (license, commit count, popularity). mcp-scorecard.ai scores repos, not behavior. BlueRock runs OWASP-style static scans. None of these ask the one question that actually matters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does this MCP server, in real call-time use, work?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I built a small thing to answer it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hook
&lt;/h2&gt;

&lt;p&gt;A zero-config Claude Code hook that does two things on every MCP tool call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Before the call&lt;/strong&gt; — queries a public trust API for that server. If the score is low, Claude shows an inline warning:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   ⚠ XAIP: "some-server" trust=0.32 (low_trust, 87 receipts) Risk: high_error_rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;After the call&lt;/strong&gt; — emits an Ed25519-signed receipt (success, latency, hashed input/output) to a public aggregator that updates the score.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; xaip-claude-hook
xaip-claude-hook &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next MCP call fires the hook. That's the whole UX.&lt;/p&gt;
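
&lt;p&gt;The pre-call check can be approximated in a few lines. This Python sketch is not the hook's source. It assumes a decoded trust response with &lt;code&gt;slug&lt;/code&gt;, &lt;code&gt;trust&lt;/code&gt;, &lt;code&gt;verdict&lt;/code&gt;, &lt;code&gt;receipts&lt;/code&gt; and &lt;code&gt;riskFlags&lt;/code&gt; fields, and it assumes that only the &lt;code&gt;caution&lt;/code&gt; and &lt;code&gt;low_trust&lt;/code&gt; verdicts produce a warning. The function name is hypothetical.&lt;/p&gt;

```python
WARN_VERDICTS = {"caution", "low_trust"}  # assumption: verdicts that trigger a warning


def format_warning(response):
    """Return a one-line warning for a decoded trust response, or None if none applies."""
    if response.get("verdict") not in WARN_VERDICTS:
        return None
    flags = ", ".join(response.get("riskFlags", [])) or "none"
    return '⚠ XAIP: "{slug}" trust={trust:.2f} ({verdict}, {receipts} receipts) Risk: {flags}'.format(
        slug=response["slug"],
        trust=response["trust"],
        verdict=response["verdict"],
        receipts=response["receipts"],
        flags=flags,
    )


sample = {"slug": "playwright", "trust": 0.394, "verdict": "low_trust",
          "receipts": 37, "riskFlags": ["high_error_rate"]}
print(format_warning(sample))
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for every other verdict keeps the hook silent when a server is trusted or when there is not enough data to say anything.&lt;/p&gt;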

&lt;h2&gt;
  
  
  What a receipt looks like
&lt;/h2&gt;

&lt;p&gt;No raw content leaves your machine — only hashes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agentDid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s2"&gt;"did:web:context7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"callerDid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s2"&gt;"did:key:a1c6cd34…"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"toolName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s2"&gt;"resolve-library-id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"taskHash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s2"&gt;"9f3e…"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sha&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="err"&gt;(input).slice(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resultHash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;"1b78…"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sha&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="err"&gt;(response).slice(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"latencyMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="mi"&gt;668&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failureType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-17T04:24:59.925Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Ed&lt;/span&gt;&lt;span class="mi"&gt;25519&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;over&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;canonical&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(agent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"callerSignature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Ed&lt;/span&gt;&lt;span class="mi"&gt;25519&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;over&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;canonical&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(caller&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The aggregator rejects anything that fails signature verification. The trust API computes a Bayesian score across all verified receipts per server, weighted by caller diversity — so one enthusiastic installer can't fake a reputation.&lt;/p&gt;
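
&lt;p&gt;For readers who want to see the hashing step in isolation, here is a small Python sketch of how the unsigned part of a receipt could be assembled. It assumes the truncated hash is the first 16 hex characters of SHA-256 and that "canonical JSON" means sorted keys with compact separators; the real canonicalization rules are defined by the spec and may differ. Signing is left out because Ed25519 is not in the Python standard library. The function names and the sample payloads are hypothetical.&lt;/p&gt;

```python
import hashlib
import json


def short_hash(data):
    """First 16 hex characters of SHA-256 over the raw bytes (assumed truncation)."""
    return hashlib.sha256(data).hexdigest()[:16]


def build_unsigned_receipt(agent_did, caller_did, tool_name, tool_input, tool_output,
                           success, latency_ms, failure_type, timestamp):
    """Assemble receipt fields; payloads appear only as hashes, never as raw content."""
    return {
        "agentDid": agent_did,
        "callerDid": caller_did,
        "toolName": tool_name,
        "taskHash": short_hash(tool_input),
        "resultHash": short_hash(tool_output),
        "success": success,
        "latencyMs": latency_ms,
        "failureType": failure_type,
        "timestamp": timestamp,
    }


def canonical_bytes(receipt):
    """Assumed canonical form: sorted keys, compact separators, UTF-8 bytes."""
    return json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode("utf-8")


receipt = build_unsigned_receipt(
    "did:web:context7", "did:key:example-caller", "resolve-library-id",
    b'{"libraryName": "react"}', b'{"id": "example"}',
    True, 668, "", "2026-04-17T04:24:59.925Z",
)
print(canonical_bytes(receipt).decode("utf-8"))
```

&lt;p&gt;Both parties would sign the same canonical bytes, so any change to a field after signing makes verification fail at the aggregator.&lt;/p&gt;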

&lt;h2&gt;
  
  
  What the scores actually look like right now
&lt;/h2&gt;

&lt;p&gt;Being transparent: the dataset is small. A &lt;code&gt;curl&lt;/code&gt; against the live trust API today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Trust&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;Receipts&lt;/th&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;memory&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;trusted&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;git&lt;/td&gt;
&lt;td&gt;0.775&lt;/td&gt;
&lt;td&gt;trusted&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sqlite&lt;/td&gt;
&lt;td&gt;0.753&lt;/td&gt;
&lt;td&gt;trusted&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;puppeteer&lt;/td&gt;
&lt;td&gt;0.671&lt;/td&gt;
&lt;td&gt;caution&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;high_error_rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;context7&lt;/td&gt;
&lt;td&gt;0.618&lt;/td&gt;
&lt;td&gt;caution&lt;/td&gt;
&lt;td&gt;560&lt;/td&gt;
&lt;td&gt;low_caller_diversity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;filesystem&lt;/td&gt;
&lt;td&gt;0.579&lt;/td&gt;
&lt;td&gt;caution&lt;/td&gt;
&lt;td&gt;610&lt;/td&gt;
&lt;td&gt;low_caller_diversity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;playwright&lt;/td&gt;
&lt;td&gt;0.394&lt;/td&gt;
&lt;td&gt;low_trust&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;high_error_rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fetch&lt;/td&gt;
&lt;td&gt;0.365&lt;/td&gt;
&lt;td&gt;low_trust&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;high_error_rate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Verify any of these yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;low_caller_diversity&lt;/code&gt; flag on high-volume servers is the single most honest number in that table. It means: &lt;strong&gt;I'm the biggest caller right now, and that's exactly the problem this tool is supposed to solve&lt;/strong&gt;. The flag only clears when independent installers start generating receipts — which is what the npm package is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is architecturally different from existing approaches
&lt;/h2&gt;

&lt;p&gt;Every other "MCP trust" project I've seen scores the &lt;em&gt;repository&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit frequency, license, stars, contributor count (mcp-scorecard.ai)&lt;/li&gt;
&lt;li&gt;Static source-code vulnerability scans (BlueRock)&lt;/li&gt;
&lt;li&gt;Registry inclusion as implicit trust (official MCP registry)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are useful proxies, but none of them tells you whether a server works in practice. A well-maintained repo can have a buggy release; a single-author repo can be rock solid; a newly forked malicious repo can look identical to the original under a static scan.&lt;/p&gt;

&lt;p&gt;XAIP scores &lt;strong&gt;observed behavior&lt;/strong&gt;. Every call is a signed attestation. The scoring is Bayesian, so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Servers with few receipts get &lt;code&gt;insufficient_data&lt;/code&gt; — no verdict, no warning&lt;/li&gt;
&lt;li&gt;High-variance patterns (mixed success/failure) get lower confidence&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;high_error_rate&lt;/code&gt; flag is computed from real response content, classifying &lt;code&gt;quota exceeded&lt;/code&gt;, &lt;code&gt;rate limit&lt;/code&gt;, &lt;code&gt;unauthorized&lt;/code&gt;, and &lt;code&gt;"isError": true&lt;/code&gt; as failures&lt;/li&gt;
&lt;/ul&gt;
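
&lt;p&gt;The failure classification in the last bullet can be sketched as a plain text heuristic. This Python version is an illustration, not the hook's implementation: it treats the three phrases as case-insensitive substrings and checks for a top-level &lt;code&gt;isError&lt;/code&gt; field when the response parses as JSON. The function name and the matching rules are assumptions.&lt;/p&gt;

```python
import json

ERROR_PHRASES = ("quota exceeded", "rate limit", "unauthorized")


def infer_success(response_text):
    """Return False when the response looks like a failure, True otherwise."""
    lowered = response_text.lower()
    if any(phrase in lowered for phrase in ERROR_PHRASES):
        return False
    try:
        parsed = json.loads(response_text)
    except ValueError:
        return True  # not JSON, and no error phrase was found
    # Treat an explicit top-level "isError": true as a failure.
    return not (isinstance(parsed, dict) and parsed.get("isError") is True)


print(infer_success('{"isError": true, "content": []}'))
print(infer_success("Rate limit reached, retry later"))
print(infer_success('{"content": [{"type": "text", "text": "ok"}]}'))
```

&lt;p&gt;Substring matching is cheap but blunt. A successful response that merely quotes the word "unauthorized" would be counted as a failure, so a heuristic of this kind needs review against real traffic.&lt;/p&gt;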

&lt;p&gt;This mirrors the split between OpenSSF Scorecard and runtime attestation in supply-chain security: you want both, but &lt;em&gt;only one of them catches regressions in production&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's missing / where this could go wrong
&lt;/h2&gt;

&lt;p&gt;I want to be specific about limitations, because "AI trust protocol" posts tend to overpromise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~10 servers, ~1,500 receipts total.&lt;/strong&gt; Small. This post is partly an ask for installers to fix that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One aggregator node.&lt;/strong&gt; Byzantine fault tolerance requires quorum; right now there's one Cloudflare Worker. Quorum needs multiple operators, which is the next milestone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side inferSuccess is heuristic.&lt;/strong&gt; We look at response text for error patterns. False positives and negatives are possible — fetch's 36% error rate might be over-counted (legit 404s shouldn't hurt the server's score) or real.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy model relies on hashes, not ZK.&lt;/strong&gt; Inputs and outputs are hashed before transmission, but statistical correlation across taskHashes is possible in principle. Migration to ZK receipt aggregation is a future idea, not a current feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I personally generated most of the high-volume receipts.&lt;/strong&gt; The &lt;code&gt;low_caller_diversity&lt;/code&gt; flag you see on context7 and filesystem is me.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Running it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; xaip-claude-hook
xaip-claude-hook &lt;span class="nb"&gt;install
&lt;/span&gt;xaip-claude-hook status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open a new Claude Code session. Call any MCP tool. Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.xaip/hook.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see lines like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2026-04-17T04:24:59Z POST context7/resolve-library-id ok=true lat=668ms → 200
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the next time you (or Claude) invoke a low-trust server, the warning shows up inline.&lt;/p&gt;

&lt;p&gt;Uninstall is a single command. Keys under &lt;code&gt;~/.xaip/&lt;/code&gt; persist — delete manually to wipe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/xaip-claude-hook" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/xaip-claude-hook&lt;/a&gt; — &lt;code&gt;npm install -g xaip-claude-hook&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/xkumakichi/xaip-protocol" rel="noopener noreferrer"&gt;https://github.com/xkumakichi/xaip-protocol&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hook source:&lt;/strong&gt; &lt;a href="https://github.com/xkumakichi/xaip-protocol/tree/main/clients/claude-code-hook" rel="noopener noreferrer"&gt;https://github.com/xkumakichi/xaip-protocol/tree/main/clients/claude-code-hook&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live Trust API:&lt;/strong&gt; &lt;a href="https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7" rel="noopener noreferrer"&gt;https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregator:&lt;/strong&gt; &lt;a href="https://xaip-aggregator.kuma-github.workers.dev" rel="noopener noreferrer"&gt;https://xaip-aggregator.kuma-github.workers.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Issues, scoring bugs, angry takes — all welcome on GitHub. If you maintain an MCP server and your score looks wrong, I want to hear about it first.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>claude</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI Agents Pick Tools Blind</title>
      <dc:creator>xaip-agent</dc:creator>
      <pubDate>Tue, 14 Apr 2026 23:43:14 +0000</pubDate>
      <link>https://dev.to/xkumakichi/stop-your-ai-agent-from-picking-broken-mcp-servers-4pa0</link>
      <guid>https://dev.to/xkumakichi/stop-your-ai-agent-from-picking-broken-mcp-servers-4pa0</guid>
      <description>&lt;p&gt;I connected my AI agent to 3 MCP servers.&lt;/p&gt;

&lt;p&gt;It picked one at random.&lt;/p&gt;

&lt;p&gt;It timed out. Then retried a different one. Then finally hit one that worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;node without-xaip.js
&lt;span class="go"&gt;
→ Trying: unknown-server...
  ✗ error — package not found (8.2s)

→ Trying: sequential-thinking...
  ✓ connected — but wrong tool for docs task

→ Trying: context7...
  ✓ success (3.1s)

Total: 11.3 seconds, 2 wasted calls
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are over 1,000 MCP servers now. Your agent has no way to tell which ones are reliable, which ones are broken, and which ones are the right fit.&lt;/p&gt;

&lt;p&gt;So I built a fix: one API call that picks the right server first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;node with-xaip.js
&lt;span class="go"&gt;
→ XAIP selected: context7 (trust: 1.0, 248 verified executions)
  ✓ success (3.1s)

Total: 3.1 seconds, 0 wasted calls
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;a href="https://github.com/xkumakichi/xaip-protocol" rel="noopener noreferrer"&gt;XAIP&lt;/a&gt; — trust scoring for AI agents, backed by real execution data. Not benchmarks. Not self-reported metrics. Actual tool-call results, cryptographically signed.&lt;/p&gt;

&lt;h2&gt;
  
  
  A live API you can try right now
&lt;/h2&gt;

&lt;p&gt;No signup, no API key. Just curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Trust score for a specific MCP server&lt;/span&gt;
curl https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"context7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trust"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"trusted"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"receipts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;248&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"xaip-aggregator (quorum:1)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"riskFlags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"computedFrom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"248 receipts via XAIP Aggregator BFT (1 nodes)"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
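
&lt;p&gt;One way a caller might act on this response is a small gate that runs before delegating. The sketch below is in Python and parses the sample body shown above. The function name &lt;code&gt;should_delegate&lt;/code&gt; and both thresholds are hypothetical choices made for illustration, not XAIP defaults or part of the SDK.&lt;/p&gt;

```python
import json

# Sample body copied from the response shown above.
SAMPLE_RESPONSE = """
{
  "slug": "context7",
  "trust": 1.0,
  "verdict": "trusted",
  "receipts": 248,
  "confidence": 1,
  "source": "xaip-aggregator (quorum:1)",
  "riskFlags": [],
  "computedFrom": "248 receipts via XAIP Aggregator BFT (1 nodes)"
}
"""


def should_delegate(body, min_trust=0.95, min_receipts=50):
    """Hypothetical caller-side gate; thresholds are illustrative only."""
    data = json.loads(body)
    if data.get("riskFlags"):
        return False  # any risk flag blocks delegation in this sketch
    if data.get("verdict") != "trusted":
        return False
    enough_evidence = data.get("receipts", 0) >= min_receipts
    return enough_evidence and data.get("trust", 0.0) >= min_trust


print(should_delegate(SAMPLE_RESPONSE))
```

&lt;p&gt;Requiring a minimum receipt count alongside the score keeps a server with very little history from passing on a high score alone.&lt;/p&gt;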



&lt;p&gt;Or let XAIP pick the best server for your task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://xaip-trust-api.kuma-github.workers.dev/v1/select &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "task": "Fetch React documentation",
    "candidates": ["context7", "sequential-thinking", "unknown-server"]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"selected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"context7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Highest trust (1) from 248 verified executions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rejected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unknown-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unscored — no execution data"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"withoutXAIP"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Random selection would pick an unscored server 33% of the time — no execution data, no safety guarantee"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;withoutXAIP&lt;/code&gt; field exists to make the risk visible. It's the answer to "why do I need this?"&lt;/p&gt;
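
&lt;p&gt;The selection rule and the 33% figure can be reproduced locally with a rough Python sketch. This is not the server's implementation: the &lt;code&gt;select&lt;/code&gt; function and the &lt;code&gt;scores&lt;/code&gt; dictionary are hypothetical, the task string is ignored, and ties are broken by input order, which is an assumption.&lt;/p&gt;

```python
def select(candidates, scores):
    """Pick the highest-trust scored candidate; unscored ones are rejected.

    `scores` maps a slug to (trust, receipts). The shape is an assumption
    modeled on the responses shown above.
    """
    scored = [slug for slug in candidates if slug in scores]
    rejected = [slug for slug in candidates if slug not in scores]
    # Chance that a uniform random pick lands on an unscored candidate.
    risk = len(rejected) / len(candidates) if candidates else 0.0
    best = None
    if scored:
        # max() keeps the first candidate on ties, so input order breaks ties.
        best = max(scored, key=lambda slug: scores[slug][0])
    return {"selected": best, "rejected": rejected, "random_unscored_risk": risk}


scores = {"context7": (1.0, 248), "sequential-thinking": (1.0, 285)}
result = select(["context7", "sequential-thinking", "unknown-server"], scores)
print(result["selected"], result["rejected"], round(result["random_unscored_risk"] * 100))
```

&lt;p&gt;With one unscored candidate out of three, a uniform random pick hits it one time in three, which is where the 33% in &lt;code&gt;withoutXAIP&lt;/code&gt; comes from.&lt;/p&gt;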

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;XAIP has three moving parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Trust API&lt;/strong&gt; — Returns trust scores for MCP servers. Scores come from real execution data, not self-reported metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Decision Engine&lt;/strong&gt; — &lt;code&gt;POST /v1/select&lt;/code&gt; takes a task and a list of candidate servers, returns the best pick with reasoning. Unscored servers are automatically excluded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Aggregator&lt;/strong&gt; — Collects Ed25519-signed execution receipts. Every tool call produces a cryptographic receipt that feeds back into trust scores.&lt;/p&gt;

&lt;p&gt;The trust model is Bayesian: each server's score is a Beta-distribution estimate built from its execution receipts, weighted by caller diversity so that a single caller cannot game the score on its own. If only one caller submits receipts for a server, the score reflects that limited evidence.&lt;br&gt;
&lt;/p&gt;
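
&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of a Beta-based score with a per-caller evidence cap. It is one possible reading of "weighted by caller diversity", not XAIP's actual formula: the function name &lt;code&gt;trust_score&lt;/code&gt;, the uniform Beta(1, 1) prior, and the cap of 20 receipts per caller are all assumptions, and the sketch is not expected to reproduce the published scores.&lt;/p&gt;

```python
from collections import defaultdict


def trust_score(receipts, per_caller_cap=20.0, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Beta model with capped per-caller evidence.

    `receipts` is a list of (caller_id, succeeded) pairs. The cap and the
    uniform Beta(1, 1) prior are illustrative assumptions.
    """
    by_caller = defaultdict(lambda: [0, 0])  # caller -> [successes, failures]
    for caller, succeeded in receipts:
        by_caller[caller][0 if succeeded else 1] += 1

    a, b = prior_a, prior_b
    for successes, failures in by_caller.values():
        total = successes + failures
        # Scale a prolific caller down so it counts for at most the cap.
        weight = min(1.0, per_caller_cap / total)
        a += successes * weight
        b += failures * weight
    return a / (a + b)


# The same 200 successful receipts, from one caller versus five callers.
one_caller = [("agent-a", True)] * 200
five_callers = [(f"agent-{i}", True) for i in range(5) for _ in range(40)]
print(round(trust_score(one_caller), 3), round(trust_score(five_callers), 3))
```

&lt;p&gt;Under these assumptions the single-caller history scores about 0.955 and the five-caller history about 0.990, even though both contain 200 successes and no failures. Spreading the same evidence across independent callers moves the estimate further from the prior.&lt;/p&gt;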

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Select → Execute → Report
  ↑                    │
  └────────────────────┘
     scores improve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The data is real
&lt;/h2&gt;

&lt;p&gt;This isn't a mock API. Trust scores are computed from 1,127 actual MCP tool-call executions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Trust&lt;/th&gt;
&lt;th&gt;Receipts&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;context7&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;248&lt;/td&gt;
&lt;td&gt;trusted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sequential-thinking&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;285&lt;/td&gt;
&lt;td&gt;trusted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;filesystem&lt;/td&gt;
&lt;td&gt;0.909&lt;/td&gt;
&lt;td&gt;594&lt;/td&gt;
&lt;td&gt;caution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These executions are monitored via &lt;a href="https://github.com/xkumakichi/veridict" rel="noopener noreferrer"&gt;Veridict&lt;/a&gt;, a runtime execution monitor that tracks success rates, latency, and failure types.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;filesystem&lt;/code&gt; scores lower because it has real failures in its history — that's the system working correctly. A trust score should reflect reality, not optimism.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try the full demo
&lt;/h2&gt;

&lt;p&gt;The dogfooding demo runs the complete loop: select a server, execute MCP tool calls, submit a signed receipt, and check the updated score.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/xkumakichi/xaip-protocol.git
&lt;span class="nb"&gt;cd &lt;/span&gt;xaip-protocol/demo
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npx tsx dogfood.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The run takes about 15 seconds. You'll see XAIP select &lt;code&gt;context7&lt;/code&gt;, execute real tool calls against it, submit a receipt to the Aggregator, and print the comparison table.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;XAIP is at v0.4.0. The infrastructure is live and the data is real, but adoption is the bottleneck:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More servers&lt;/strong&gt; — Currently scoring 3 MCP servers. The system scales to any server, but needs execution data flowing in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More callers&lt;/strong&gt; — Caller diversity is the main lever for score accuracy. More independent callers = higher confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform integrations&lt;/strong&gt; — Working toward integration with MCP registries like Smithery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building AI agents that use MCP, you can start using the API today. Scores will keep improving as more execution data flows in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond today
&lt;/h2&gt;

&lt;p&gt;Right now, XAIP helps agents pick working tools.&lt;/p&gt;

&lt;p&gt;But this becomes critical when agents start doing more than calling APIs — paying for services, delegating tasks across organizations, executing autonomous workflows.&lt;/p&gt;

&lt;p&gt;At that point, the question changes from "does this tool work?" to "can I trust this agent with money?"&lt;/p&gt;

&lt;p&gt;XAIP is designed for that future. But it already solves a real problem today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API&lt;/strong&gt;: &lt;code&gt;https://xaip-trust-api.kuma-github.workers.dev&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/xkumakichi/xaip-protocol" rel="noopener noreferrer"&gt;xkumakichi/xaip-protocol&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm&lt;/strong&gt;: &lt;a href="https://www.npmjs.com/package/xaip-sdk" rel="noopener noreferrer"&gt;xaip-sdk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime monitor&lt;/strong&gt;: &lt;a href="https://github.com/xkumakichi/veridict" rel="noopener noreferrer"&gt;xkumakichi/veridict&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;XAIP doesn't make agents smarter. It prevents them from making dumb choices.&lt;/p&gt;

&lt;p&gt;Built this because I needed it. If your agent is still picking servers blind, &lt;a href="https://github.com/xkumakichi/xaip-protocol" rel="noopener noreferrer"&gt;give it a try&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
