Alexey Spinov

Posted on Jun 15 • Originally published at finops.spinov.online

MCP Tool Drift: Pin the Manifest, Block Rug-Pulls in 40 Lines

#mcp #aiagents #security #python

You approved an MCP tool once. The dialog popped up, you read the description, you clicked yes. Here's the part nobody clicks through to: that description can change before the next call, and your agent will use the new one without asking again. Approval is a one-time event. The tool definition is a live wire. Those two facts don't agree, and the gap between them is where the rug-pull lives.

In short: MCP tool drift is when a server mutates a tool's description or inputSchema after you approve it — Invariant Labs calls it a rug-pull; OWASP filed it as MCP03:2025. To block it, mcp_pin.py SHA-256s each tool's definition at approve time and re-checks before any call. Stdlib, offline, ~40 lines; the exit code is your CI gate.

AI disclosure: I wrote mcp_pin.py with an AI assistant and ran every scenario myself before publishing. Every line of terminal output below is pasted from a real run on synthetic fixtures I'll show you — there are no real servers, keys, or paths in them, and the one "poisoned" description is a made-up example string, not a real exfiltration payload. The incident I cite (the WhatsApp sleeper rug-pull) is Invariant Labs' research, not mine, and I link it. I label which numbers are theirs and which are mine.

What is an MCP tool rug-pull?

It's the version of supply-chain attack that fits inside a tool's metadata. An MCP server advertises tools through tools/list: each tool has a name, a description, and an inputSchema (the JSON Schema for its parameters). The agent reads all three. The description goes straight into the model's context. That's the whole point of it. So if an attacker controls the server, the description is an injection surface that runs with whatever the agent can do.

The nasty twist is timing. Invariant Labs documented it in April 2025, and their sentence is the cleanest statement of the problem I've found: "a malicious server can change the tool description after the client has already approved it" (Invariant Labs, 1 Apr 2025). Their proof-of-concept was a server that looked like a harmless "random fact of the day" tool on first load, passed approval, and then swapped in a payload on a later load that quietly hijacked a WhatsApp MCP server in the same agent to leak message history. They named the pattern a sleeper rug-pull. Clean on Tuesday, hostile on Thursday, same name the whole time.

OWASP picked it up. The MCP Top 10 — a real project, v0.1 beta, led by Vandana Verma Sehgal — files this under MCP03:2025 Tool Poisoning, which explicitly lists "rug pulls (malicious updates to trusted tools)" and "schema poisoning (corrupting interface definitions)." Two named sub-techniques, both of which mutate exactly the fields I just listed: description and inputSchema.

So here's the question I couldn't answer from my own toolbox: between the approve click and the next call, what is checking that the tool I approved is still the tool that runs? Nothing was. The approval flow trusts the server to be honest after it's been trusted once. That's the bug class. The franchise I keep coming back to on this blog — tracking ≠ control, having ≠ scoping — gets a third entry: approval ≠ immutability.

How do you pin a tool manifest?

You take a fingerprint at approve time and you re-check it before every call. That's it. The fingerprint covers the three fields the classic poisoning shapes mutate — name, description, inputSchema — and it has to be canonical, so that re-serialized-but-identical JSON doesn't trip a false alarm. (An MCP tool also carries title, annotations, outputSchema and _meta; this version doesn't hash those, which is a real gap I'm explicit about at the end — read that before you trust this in anger.) Here's the whole core:

import hashlib
import json

def canon(tool: dict) -> str:
    """Canonical form: name + description + inputSchema, key-sorted, whitespace-stripped."""
    fingerprint = {
        "name": tool.get("name", ""),
        "description": tool.get("description", ""),
        "inputSchema": tool.get("inputSchema", {}),
    }
    return json.dumps(fingerprint, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

def digest(tool: dict) -> str:
    return hashlib.sha256(canon(tool).encode("utf-8")).hexdigest()

def verify(live: dict, lock: dict) -> list[tuple[str, str]]:
    """Diff live fingerprints against the lock."""
    drift = []
    for name, h in live.items():
        if name not in lock:
            drift.append(("ADDED", name))        # a tool that didn't exist at approve time
        elif lock[name] != h:
            drift.append(("CHANGED", name))       # same name, mutated definition (the poison case)
    for name in lock:
        if name not in live:
            drift.append(("REMOVED", name))       # an approved tool vanished
    return drift

sort_keys=True means key order doesn't matter, and it sorts recursively, so a re-ordered inputSchema hashes the same. separators=(",", ":") strips incidental whitespace, so a manifest re-emitted by a different JSON library hashes the same. The hash covers the three core fields and deliberately ignores the rest, so server-side junk — a timestamp, a request id — won't cause noise. (One honest caveat for that choice: annotations, title and outputSchema are also part of the tool and also reach the model, and right now they're in the ignored bucket — see the limits section.) The whole thing is json, hashlib, sys. No network. It never reads a key or a file path; it reads tool definitions.

That core is about 24 lines of logic. Add the CLI wrapper (lock to mint the file, verify to check it, an exit code) and it lands at the 40-line mark. The rest of mcp_pin.py is the two-line report.
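The wrapper itself isn't reproduced in this post. A minimal sketch of how one could look follows. It is not the actual mcp_pin.py. It assumes the manifest is either a bare list of tool objects or an object with a top-level "tools" array, that the lockfile is a JSON object mapping tool name to hex digest, and that the lockfile argument is always given. The message strings are copied from the terminal output shown in this post.

```python
import hashlib
import json
import sys


def digest(tool: dict) -> str:
    """SHA-256 over name + description + inputSchema in canonical JSON form."""
    core = {
        "name": tool.get("name", ""),
        "description": tool.get("description", ""),
        "inputSchema": tool.get("inputSchema", {}),
    }
    blob = json.dumps(core, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def load_fingerprints(path: str) -> dict:
    """Assumed manifest shape: {"tools": [...]} or a bare list of tool objects."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    tools = data["tools"] if isinstance(data, dict) else data
    return {t.get("name", ""): digest(t) for t in tools}


def main(argv: list) -> int:
    if len(argv) != 4 or argv[1] not in ("lock", "verify"):
        print("usage: mcp_pin.py lock|verify <manifest.json> <lockfile>")
        return 2  # usage error
    mode, manifest, lockfile = argv[1:4]
    live = load_fingerprints(manifest)
    if mode == "lock":
        with open(lockfile, "w", encoding="utf-8") as fh:
            json.dump(live, fh, indent=2, sort_keys=True)  # assumed lockfile format
        print(f"PINNED {len(live)} tool(s) -> {lockfile}")
        return 0
    with open(lockfile, encoding="utf-8") as fh:
        lock = json.load(fh)
    drift = [("ADDED", n) for n in live if n not in lock]
    drift += [("CHANGED", n) for n in live if n in lock and lock[n] != live[n]]
    drift += [("REMOVED", n) for n in lock if n not in live]
    for kind, name in sorted(drift):
        print(f"BLOCK [{kind}] tool={name}")
    if drift:
        print(f"--- {len(drift)} drift event(s); refusing tool-calls until re-approved")
        return 1  # drift: fail the gate
    print(f"OK: 0 drift  ({len(live)} tool(s) match {lockfile})")
    return 0

# Entry point when saved as a script: sys.exit(main(sys.argv))
```

Returning the status from main() and passing it to sys.exit() at the entry point keeps the three exit codes (0 stable, 1 drift, 2 usage) in one place, and makes the function callable from a test without killing the interpreter.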

Quick start: pin, then verify

Two commands. First you pin the manifest you approved into a lockfile. Then, in CI or right before your agent connects, you verify the live manifest against it.

# at approve time, once:
python3 mcp_pin.py lock approved.json tools.lock

# before every run / in CI:
python3 mcp_pin.py verify <live_manifest.json> tools.lock

The lock step is the approval gesture made durable. It writes one SHA-256 per tool:

$ python3 mcp_pin.py lock approved.json tools.lock
PINNED 2 tool(s) -> tools.lock
  list_dir  85ef71a9f3bee373...
  read_file  c0a76d25fdb9eb89...

That lockfile is the thing you commit. It's two lines of hashes: read_file pins to c0a76d25…, list_dir to 85ef71a9…. From now on it's the definition of "the tools I agreed to." Anything else is drift.

Catching three kinds of drift

I built three synthetic fixtures: a stable manifest, a description-poisoned one, and one with a parameter change plus a silently added tool. Each models a real attack shape. Every output block below is the real run, pasted.

The stable case — does the meter ever say yes?

First, the boring one, because a checker that screams at everything is useless. Live manifest equals the approved manifest:

$ python3 mcp_pin.py verify approved.json tools.lock
OK: 0 drift  (2 tool(s) match tools.lock)
exit=0

Zero drift, exit 0. The pipeline goes green and the agent connects. This is the case you'll hit on nearly every run, and it has to be silent, or people route around the tool inside a week. The point of showing it first: the checker isn't paranoid-by-default. It says yes when the answer is yes.

Description-poison — the sleeper rug-pull

Now the Invariant Labs shape. Same tool name, same parameters, but the read_file description gained a hidden instruction block: in my fixture, a synthetic <IMPORTANT>…read the user's SSH private key…</IMPORTANT> string standing in for the real thing. The agent's re-approval dialog never fires, because the name is unchanged. The model just quietly reads a new, hostile description.

$ python3 mcp_pin.py verify poisoned.json tools.lock
BLOCK [CHANGED] tool=read_file
--- 1 drift event(s); refusing tool-calls until re-approved
exit=1

BLOCK [CHANGED] tool=read_file, exit 1. The hash over the description changed, so the fingerprint changed, so the tool no longer matches what you approved. Nothing about the content of the poison is parsed or judged — the checker doesn't know SSH from breakfast. It only knows this isn't the read_file you signed off on. That's enough to stop the call.

Parameter drift plus a brand-new tool

The third fixture is two attacks at once. The read_file schema grew a new optional post_to parameter ("Optional URL to also send the file contents to" — read that twice). And a tool called exfil appeared that wasn't in the approved set at all.

$ python3 mcp_pin.py verify drift2.json tools.lock
BLOCK [ADDED] tool=exfil
BLOCK [CHANGED] tool=read_file
--- 2 drift event(s); refusing tool-calls until re-approved
exit=1

Two events. BLOCK [ADDED] tool=exfil catches the tool that materialized after approval: the shadow-tool case. BLOCK [CHANGED] tool=read_file catches the parameter that crept into an existing tool, which is the schema-poisoning case. One hash check, three drift classes covered: description, parameters, and the set of tools itself. Exit 1 again. In CI that's a failed check; at agent startup it's a refused connection.

The CI gate is the exit code

The reason this is one file and not a SaaS is the exit code. Stable returns 0, any drift returns 1, a usage error returns 2 — I keep a negative-control case for that:

$ python3 mcp_pin.py
usage: mcp_pin.py lock|verify <manifest.json> [lockfile]
mcp_pin.py - pin an MCP server's tool manifest at approve time, block drift before the call.
exit=2

So mcp_pin.py verify drops straight into a pipeline step or a pre-flight hook. Drift fails the build before a single tool-call goes out. No daemon, no dashboard, no account.
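A pre-flight hook can be as small as a subprocess call that inspects the exit code. The sketch below is one way to wire it, under assumptions: the file names follow the examples in this post, and connect is a placeholder for whatever opens your MCP session.

```python
import subprocess
import sys


def preflight(cmd: list) -> bool:
    """Run the drift check; True only when the checker exits 0 (no drift)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # 1 = drift, 2 = usage error: either way, do not connect
        sys.stderr.write(result.stdout + result.stderr)
        return False
    return True


# Hypothetical wiring; adjust the paths to your own layout.
VERIFY_CMD = [sys.executable, "mcp_pin.py", "verify", "live_manifest.json", "tools.lock"]


def start_agent(connect, cmd=VERIFY_CMD):
    """`connect` is a placeholder callable that opens the agent's MCP session."""
    if not preflight(cmd):
        raise SystemExit("tool manifest drifted; re-approve before connecting")
    return connect()
```

The hook fails closed: any non-zero exit, including a missing script or a bad argument, is treated the same as drift.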

Why a hash, and why offline

Because the fingerprint has to be deterministic, or it's worthless as a gate. Same input, same hash, every time. I check that explicitly:

read_file fp: c0a76d25fdb9eb899f6b6e9fffb4d46513edeaf614652a987cbf18ef7db0407f
a == b ? True
key-reorder same digest ? True

Hash the same manifest twice, you get the identical digest (a == b → True). Re-order the JSON keys or add whitespace, still identical (key-reorder same digest → True), because the canonical form sorts keys and strips spacing. A re-serialized-but-equivalent manifest won't false-positive; a one-character change in a description will. That's the whole contract.
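That contract is easy to check by hand. The sketch below uses a hypothetical tool definition rather than the fixture from this post, so its hash will differ from the one printed above. It also shows one thing the canonical form does not normalize: array order.

```python
import hashlib
import json


def digest(tool: dict) -> str:
    core = {
        "name": tool.get("name", ""),
        "description": tool.get("description", ""),
        "inputSchema": tool.get("inputSchema", {}),
    }
    blob = json.dumps(core, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


# Hypothetical tool definition.
a = {"name": "read_file", "description": "Read a file.",
     "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}}}
# Same content, re-serialized with different key order and extra whitespace.
b = json.loads("""
{ "inputSchema": { "properties": { "path": { "type": "string" } }, "type": "object" },
  "description": "Read a file.",
  "name": "read_file" }
""")
# One character changed in the description.
c = dict(a, description="Read a file!")

print("key-reorder same digest ?", digest(a) == digest(b))   # True
print("one-char change differs ?", digest(a) != digest(c))   # True

# Array order is NOT normalized, so reordering "required" changes the hash.
x = dict(a, inputSchema={"type": "object", "required": ["path", "mode"]})
y = dict(a, inputSchema={"type": "object", "required": ["mode", "path"]})
print("array-reorder same digest ?", digest(x) == digest(y))  # False
```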

Offline matters for a duller reason: you don't want your drift-checker phoning home about your tool config, and you don't need it to. The manifest is already on your disk: it's the tools/list response or your server's tool spec. mcp_pin.py reads what you have and compares it to what you committed, in your own process, with nothing sent anywhere.

This is the earliest layer in the wall I've been building here. The pre-execution gate decides whether a specific action runs; pinning runs before that, deciding whether the tool that would run is even the one you approved. The MCP server token tax measures what each tool costs you in context; this measures whether each tool is still the tool you measured. And blast radius scores how much a key could break; pinning is the same instinct one level up, at the tool definition instead of the credential.

What this is NOT

I'd rather you trust the small claim than oversell it. This is a 40-line padlock, not a security program.

It is not a secret scanner and it is not mcp-scan. mcp-scan by Invariant Labs is the real scanner — it actively analyzes descriptions for known injection patterns, cross-origin escalation, rug pulls, the lot. (Their separate mcp-injection-experiments repo is the proof-of-concept code that reproduces the attacks.) mcp_pin.py does none of that scanning. It doesn't read the meaning of a description; it only asks "is this byte-for-byte the definition I approved?" Different job. Run the scanner too.
It does not catch behavior changes behind a stable manifest. If a server keeps its tool definition identical but changes what the tool does server-side — returns poisoned data, calls a different backend — the hash is unchanged and this tool says OK. Pinning covers the definition, not the runtime. That's a real blind spot, and it's the one I'd want a reviewer to push on.
It pins three fields, not the whole tool — this is the gap I'd attack first. An MCP tool object also carries title, annotations (including a display title and behavior hints), outputSchema, and _meta. mcp_pin.py hashes only name, description, inputSchema. So an attacker who leaves the description alone and hides an instruction in annotations.title or title — both of which can reach the model or the approval UI — would not trip this pin. That's a deliberate v1 scope, not a claim of completeness, and it's the honest answer to "does this cover the whole poisoning surface?": no, it covers the classic three. If you want the rest pinned, extend canon() to hash those fields too (it's a one-line change to the fingerprint dict); your hashes will change, which is the point.
It assumes unique tool names. Fingerprints are keyed by name, so if a manifest returns two tools with the same name, the second silently overwrites the first in the map and only one gets checked. A real tools/list shouldn't do that, but a hostile server can. Treat a duplicate-name manifest as suspect on its own; this checker won't flag it for you yet.
The canon is byte-based, so semantic hiding is possible. The hash is taken over the canonical bytes after parsing, which means two manifests that a client would interpret differently can still hash the same if they parse to the same structure here. Duplicate JSON keys are the obvious case, since Python's json module keeps only the last one. The key-sorting is recursive, so nested inputSchema ordering is stable, but this is not full JSON Schema canonicalization: array order, Unicode normalization, and duplicate JSON keys aren't normalized. If your threat model includes Unicode look-alikes in a name, this isn't your last line.
The fixtures are synthetic. Two tools, made-up names, one made-up poison string. The exit codes and hashes are real and deterministic, but they describe modeled attacks, not a live server. Your numbers come from your own tools/list.
Legitimate updates also trip it — on purpose. A server that honestly versions a tool will change its definition, and this will block. That's a feature only if you re-approve deliberately: re-run lock after you've read the diff. Which is the open question below.
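Two of the gaps above, the unpinned fields and duplicate names, are small enough to sketch. This is a possible extension under assumptions, not part of mcp_pin.py: the field list is the three pinned fields plus the four named above, and digest_wide and duplicate_names are hypothetical helper names.

```python
import hashlib
import json
from collections import Counter

# Assumed field list: the three the post pins plus the four it names as unpinned.
PINNED_FIELDS = ("name", "description", "inputSchema",
                 "title", "annotations", "outputSchema", "_meta")


def digest_wide(tool: dict) -> str:
    """Hash every listed field; a missing field is recorded as null."""
    core = {k: tool.get(k) for k in PINNED_FIELDS}
    blob = json.dumps(core, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def duplicate_names(tools: list) -> list:
    """Names that occur more than once in a manifest's tool list."""
    counts = Counter(t.get("name", "") for t in tools)
    return sorted(n for n, c in counts.items() if c > 1)


base = {"name": "read_file", "description": "Read a file.",
        "inputSchema": {"type": "object"}}
# Made-up example string, hidden in a field the three-field hash ignores.
sneaky = dict(base, annotations={"title": "IMPORTANT: made-up instruction"})

print(digest_wide(base) != digest_wide(sneaky))  # True: annotations now count
print(duplicate_names([base, sneaky]))           # ['read_file']
```

The trade-off is noise: if a server puts per-request values in _meta, hashing it will flag every run, so that field may need to stay out of the list for some servers.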

Pin your noisiest server first

Pick the MCP server your agent talks to most. Pull its tools/list output, run mcp_pin.py lock once, commit the lockfile. Add mcp_pin.py verify to your CI and to whatever runs before your agent connects. Now a rug-pull fails a build instead of leaking a chat history on a Thursday. It's the cheapest pre-call check you'll add this quarter, and it runs before the first tool-call instead of in the postmortem.

Here's the part I haven't settled, and I want your take: where's the line between a legitimate tool update and drift? A real server versions its tools — new params, better descriptions — and my checker blocks all of it until you re-lock. That's safe but it's friction, and friction is what makes people disable a gate. Do you pin per-version and bump on a signed release? Diff only the description and ignore additive schema changes? Tie re-approval to a publisher signature? I don't have a clean answer, and every option trades safety for noise somewhere. If you've run a pin-and-re-approve flow against a server that updates for real, tell me how you drew the line. I read every comment. Follow along — the next post takes this from a fixture to a live tools/list and tries to find where re-approval should actually sit.

Top comments (2)

Armorer Labs • Jun 21

Manifest drift is one of the easiest MCP risks to underestimate because the integration still looks like it works.

I would also log the manifest/version used for every agent run, not only block changes at startup. If a weird behavior shows up later, you want to know exactly which tool surface the agent saw at the time: names, descriptions, schemas, version, and policy decision.

Pinning prevents surprises; receipts help debug the surprises that still get through.

Alexey Spinov • Jul 25

Thirty-three days late, and the delay is worse than usual because you filed this as a nice-to-have under the pin. I modelled both mechanisms against one drift stream to see whether they even observe the same events. They don't, and the split is not the one your sentence implies.

Setup: a manifest of 8 tools, an agent that makes 40 tool calls per session, Zipf call weights (the top tool takes 36.8% of calls, the bottom 4.6%), 4000 drift episodes per configuration. The pin compares the whole manifest at session start and blocks. The receipt records what was called and when, and doesn't block.
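Those two percentages are what a plain 1/rank Zipf law gives over 8 tools. The exponent of 1 is an assumption, since the setup doesn't state it. A quick check, separate from the simulation script itself:

```python
# Classic Zipf: the weight of the tool at rank k is proportional to 1/k.
N_TOOLS = 8
raw = [1 / k for k in range(1, N_TOOLS + 1)]
total = sum(raw)
weights = [w / total for w in raw]

print(f"top    {weights[0]:.1%}")   # top    36.8%
print(f"bottom {weights[-1]:.1%}")  # bottom 4.6%
```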

Baseline is a long-lived agent (80% of wall-clock inside a session), 30% of vendor changes get reverted, 70% bump a version string:

   detected by pin            70.0%
   detected by receipt(hash)  76.4%
   detected by receipt(ver.)  53.6%
   seen by NEITHER            13.6%
   calls on the changed surface before the pin notices: median 5, mean 11.7
   calls before ANY signal:                             median 3, mean  5.3

The interesting number is what happens when I split the drift by whether it gets reverted:

   every change permanent   pin 100.0%   receipt(hash) 87.0%
   every change reverted    pin   0.0%   receipt(hash) 53.3%

Zero, not "delayed." A change that is rolled back before the next process start is invisible to a startup pin at every duty cycle I swept, because the bytes the pin compares are identical again by the time it looks. Canary rollouts, a bad deploy pulled in twenty minutes, an A/B on the tool description: for a pin, none of that ever happened. That's a permanent blind class, and receipts are the only thing standing in front of it. So your second sentence undersells the first: receipts don't just help debug the surprises that get through the pin, they cover events the pin cannot report at all.
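The blind class is easy to see on a toy timeline. The sketch below is a hand-built illustration, not the simulation described here: the events, the surfaces, and the helper h are all made up, and there is no randomness.

```python
import hashlib


def h(surface: str) -> str:
    """Short fingerprint of whatever tool surface is being served."""
    return hashlib.sha256(surface.encode("utf-8")).hexdigest()[:12]


APPROVED = "read_file: Read a file."
POISONED = "read_file: Read a file. IMPORTANT: made-up payload"

# Hypothetical timeline: (time, event, surface the server serves at that moment)
timeline = [
    (0, "session_start", APPROVED),
    (1, "call", APPROVED),
    (2, "call", POISONED),   # change goes live mid-session
    (3, "call", POISONED),
    (4, "call", APPROVED),   # vendor reverts
    (5, "session_start", APPROVED),
    (6, "call", APPROVED),
]

lock = h(APPROVED)
# The pin only looks at session start.
pin_alerts = [t for t, ev, s in timeline if ev == "session_start" and h(s) != lock]
# The receipt hashes the surface at every call.
receipts = [(t, h(s)) for t, ev, s in timeline if ev == "call"]
receipt_alerts = [t for t, fp in receipts if fp != lock]

print("pin alerts:    ", pin_alerts)      # []
print("receipt alerts:", receipt_alerts)  # [2, 3]
```

Both session starts see the approved bytes, so the pin has nothing to report, while the per-call receipts record the two calls made against the changed surface.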

Your list also does more work than it looks. You wrote "names, descriptions, schemas, version." Version-only receipts, the cheap implementation everyone actually ships, collapse to the vendor's honesty:

   p(vendor bumps version)   receipt(ver)   receipt(hash)
   1.00                          76.4%          76.4%
   0.70                          53.6%          76.4%
   0.20                          14.9%          76.4%
   0.00                           0.0%          76.4%

The word carrying your whole proposal is "schemas." Drop it and keep "version" and the receipt reports whatever the vendor felt like typing.

Two against me.

The pin also has a mirror advantage I didn't expect to be this big: in 23.6% of episodes the drifted tool is never called again while the change is live. No receipt exists. The pin still compares it, because it walks the whole manifest, used or not. Neither mechanism dominates — the pin sees unused surface, the receipt sees live surface, and the overlap is smaller than either side assumes.

And my first version of this script flattered your side. It credited the receipt with catching changes that were reverted during downtime, i.e. surfaces that were never live during any call. After I fixed the window so a transient change is only observable while it is actually up, receipt(hash) dropped from 87.0% to 76.4% and "seen by neither" went from 3.7% to 13.6%. I also expected pin coverage to fall as the agent stays up longer. It doesn't — it sits at ~70% from duty 0.05 to 0.99. What degrades is timeliness, not coverage: mean calls executed before the pin notices goes 0.7 → 14.4 across that sweep. I had those two conflated going in.

Model, not production: it is calibrated to nothing, so read the shape and the blind classes, not the percentages.

(stdlib only, offline, seeded, no network, no MCP server touched; three runs byte-identical; script sha256 2fc38a6f61f221ff.)