xaip-agent

A Claude Code hook that warns you before calling a low-trust MCP server

Last week researchers at Ox published findings showing that the MCP STDIO transport lets arbitrary command execution slip through unchecked, and that 9 of 11 MCP marketplaces they tested were poisonable. Anthropic's response: STDIO is out of scope for protocol-level fixes, the ecosystem is responsible for operational trust.

Fair — Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation in December 2025 specifically so independent infrastructure could grow around it. But that leaves a real gap for anyone running Claude Code today: how do you know whether an MCP server you're about to invoke is trustworthy?

Anthropic's official registry is pure metadata (license, commit count, popularity). mcp-scorecard.ai scores repos, not behavior. BlueRock runs OWASP-style static scans. None of these ask the one question that actually matters:

Does this MCP server, in real call-time use, work?

So I built a small thing to answer it.

The hook

A zero-config Claude Code hook that does two things on every MCP tool call:

  1. Before the call, it queries a public trust API for that server. If the score is low, Claude shows an inline warning:

     ⚠ XAIP: "some-server" trust=0.32 (caution, 87 receipts) Risk: high_error_rate

  2. After the call, it emits an Ed25519-signed receipt (success, latency, hashed input/output) to a public aggregator that updates the score.
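Under the hood, the installer registers itself as PreToolUse and PostToolUse hooks in Claude Code's settings. A minimal sketch of what that registration could look like, assuming the standard Claude Code hooks schema (the `pre`/`post` subcommand names are hypothetical, not necessarily what the package writes):

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "mcp__.*",
        "hooks": [
          { "type": "command", "command": "xaip-claude-hook pre" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "mcp__.*",
        "hooks": [
          { "type": "command", "command": "xaip-claude-hook post" }
        ]
      }
    ]
  }
}
```

The `mcp__.*` matcher is what scopes the hook to MCP tool calls only; built-in tools like Bash or Edit never hit the trust API.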

Install:

npm install -g xaip-claude-hook
xaip-claude-hook install

Next MCP call fires the hook. That's the whole UX.

What a receipt looks like

No raw content leaves your machine — only hashes.

{
  "agentDid":      "did:web:context7",
  "callerDid":     "did:key:a1c6cd34…",
  "toolName":      "resolve-library-id",
  "taskHash":      "9f3e…",   // sha256(input).slice(0,16)
  "resultHash":    "1b78…",   // sha256(response).slice(0,16)
  "success":       true,
  "latencyMs":     668,
  "failureType":   "",
  "timestamp":     "2026-04-17T04:24:59.925Z",
  "signature":     "...",     // Ed25519 over canonical JSON (agent key)
  "callerSignature": "..."    // Ed25519 over canonical JSON (caller key)
}

The aggregator rejects anything that fails signature verification. The trust API computes a Bayesian score across all verified receipts per server, weighted by caller diversity — so one enthusiastic installer can't fake a reputation.
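The post doesn't publish XAIP's exact formula, but the two ideas it names (a Bayesian score, weighted by caller diversity) can be sketched in a few lines. Everything below is an illustration of the concept, not the aggregator's actual math:

```javascript
// Sketch: Beta(1,1)-prior success rate, with each caller's receipts
// down-weighted to sqrt(count) effective observations so one prolific
// installer can't single-handedly build (or wreck) a reputation.
function trustScore(receipts) {
  const byCaller = new Map();
  for (const r of receipts) {
    const list = byCaller.get(r.callerDid) || [];
    list.push(r);
    byCaller.set(r.callerDid, list);
  }
  let success = 0, total = 0;
  for (const list of byCaller.values()) {
    const w = Math.sqrt(list.length) / list.length; // per-receipt weight
    for (const r of list) {
      total += w;
      if (r.success) success += w;
    }
  }
  // Posterior mean under a uniform Beta(1,1) prior:
  // sparse data gets pulled toward 0.5 instead of jumping to 0 or 1
  return (success + 1) / (total + 2);
}

// 100 successes from a single caller collapse to sqrt(100) = 10 effective
// observations, so the score lands near 11/12 ≈ 0.917 rather than ~0.99
const solo = Array.from({ length: 100 }, () => ({ callerDid: "did:key:a", success: true }));
console.log(trustScore(solo).toFixed(3));
```

The same structure also explains the insufficient_data verdict: with zero receipts the posterior mean is exactly the prior, 0.5, and no sane system issues a verdict off a prior.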

What the scores actually look like right now

Being transparent: the dataset is small. A curl against the live trust API today:

Server       Trust   Verdict     Receipts   Flag
memory       0.800   trusted     112
git          0.775   trusted     35
sqlite       0.753   trusted     42
puppeteer    0.671   caution     32         high_error_rate
context7     0.618   caution     560        low_caller_diversity
filesystem   0.579   caution     610        low_caller_diversity
playwright   0.394   low_trust   37         high_error_rate
fetch        0.365   low_trust   36         high_error_rate

Verify any of these yourself:

curl https://xaip-trust-api.kuma-github.workers.dev/v1/trust/context7

The low_caller_diversity flag on high-volume servers is the single most honest number in that table. It means: I'm the biggest caller right now, and that's exactly the problem this tool is supposed to solve. The flag only clears when independent installers start generating receipts, which is what the npm package is for.

Why this is architecturally different from existing approaches

Every other "MCP trust" project I've seen scores the repository:

  • Commit frequency, license, stars, contributor count (mcp-scorecard.ai)
  • Static source-code vulnerability scans (BlueRock)
  • Registry inclusion as implicit trust (official MCP registry)

These are useful proxies, but none of them tell you whether a server works in practice. A well-maintained repo can have a buggy release; a single-author repo can be rock solid; a newly-forked malicious repo looks identical to the original under static scan.

XAIP scores observed behavior. Every call is a signed attestation. The scoring is Bayesian, so:

  • Servers with few receipts get insufficient_data — no verdict, no warning
  • High-variance patterns (mixed success/failure) get lower confidence
  • The high_error_rate flag is computed from real response content, classifying quota exceeded, rate limit, unauthorized, and "isError": true as failures
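The failure classification in that last bullet is simple pattern matching over the response. A sketch, with an illustrative pattern list (the real hook's rules may differ):

```javascript
// Heuristic success inference over an MCP tool result.
// Pattern list is illustrative; the shipped hook may classify differently.
const FAILURE_PATTERNS = [
  /quota exceeded/i,
  /rate limit/i,
  /unauthorized/i,
];

function inferSuccess(response) {
  // MCP tool results can carry an explicit isError flag
  if (response.isError === true) return false;
  const text = JSON.stringify(response.content ?? "");
  return !FAILURE_PATTERNS.some((p) => p.test(text));
}

console.log(inferSuccess({ content: [{ type: "text", text: "done" }] }));           // true
console.log(inferSuccess({ content: [{ type: "text", text: "Rate limit hit" }] })); // false
console.log(inferSuccess({ isError: true, content: [] }));                          // false
```

This is also where the false-positive risk discussed later comes from: a page that legitimately contains the words "rate limit" would be misclassified by a naive substring approach like this one.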

This is the same philosophy as OpenSSF Scorecard vs. runtime attestation in supply chain: you want both, but only one of them catches regressions in production.

What's missing / where this could go wrong

I want to be specific about limitations, because "AI trust protocol" posts tend to overpromise:

  • ~10 servers, ~1500 receipts total. Small. This post is partly an ask for installers to fix that.
  • One aggregator node. Byzantine fault tolerance requires quorum; right now there's one Cloudflare Worker. Quorum needs multiple operators, which is the next milestone.
  • Client-side inferSuccess is heuristic. We look at response text for error patterns. False positives and negatives are possible — fetch's 36% error rate might be over-counted (legit 404s shouldn't hurt the server's score) or real.
  • Privacy model relies on hashes, not ZK. Inputs and outputs are hashed before transmission, but statistical correlation across taskHashes is possible in principle. Migration to ZK receipt aggregation is a future idea, not a current feature.
  • I personally generated most of the high-volume receipts. The low_caller_diversity flag you see on context7 and filesystem is me.

Running it yourself

npm install -g xaip-claude-hook
xaip-claude-hook install
xaip-claude-hook status

Open a new Claude Code session. Call any MCP tool. Check:

cat ~/.xaip/hook.log

You'll see lines like:

2026-04-17T04:24:59Z POST context7/resolve-library-id ok=true lat=668ms → 200

And the next time you (or Claude) invoke a low-trust server, the warning shows up inline.

Uninstall is a single command. Keys under ~/.xaip/ persist after uninstall; delete the directory manually to wipe them.

Links

Issues, scoring bugs, angry takes — all welcome on GitHub. If you maintain an MCP server and your score looks wrong, I want to hear about it first.
