You approved an MCP tool once. The dialog popped up, you read the description, you clicked yes. Here's the part nobody clicks through to: that description can change before the next call, and your agent will use the new one without asking again. Approval is a one-time event. The tool definition is a live wire. Those two facts don't agree, and the gap between them is where the rug-pull lives.
In short: MCP tool drift is when a server mutates a tool's description or inputSchema after you approve it — Invariant Labs calls it a rug-pull; OWASP filed it as MCP03:2025. To block it, mcp_pin.py SHA-256s each tool's definition at approve time and re-checks before any call. Stdlib, offline, ~40 lines; the exit code is your CI gate.
AI disclosure: I wrote
mcp_pin.pywith an AI assistant and ran every scenario myself before publishing. Every line of terminal output below is pasted from a real run on synthetic fixtures I'll show you — there are no real servers, keys, or paths in them, and the one "poisoned" description is a made-up example string, not a real exfiltration payload. The incident I cite (the WhatsApp sleeper rug-pull) is Invariant Labs' research, not mine, and I link it. I label which numbers are theirs and which are mine.
What is an MCP tool rug-pull?
It's the version of supply-chain attack that fits inside a tool's metadata. An MCP server advertises tools through tools/list: each tool has a name, a description, and an inputSchema (the JSON Schema for its parameters). The agent reads all three. The description goes straight into the model's context. That's the whole point of it. So if an attacker controls the server, the description is an injection surface that runs with whatever the agent can do.
The nasty twist is timing. Invariant Labs documented it in April 2025, and their sentence is the cleanest statement of the problem I've found: "a malicious server can change the tool description after the client has already approved it" (Invariant Labs, 1 Apr 2025). Their proof-of-concept was a server that looked like a harmless "random fact of the day" tool on first load, passed approval, and then swapped in a payload on a later load that quietly hijacked a WhatsApp MCP server in the same agent to leak message history. They named the pattern a sleeper rug-pull. Clean on Tuesday, hostile on Thursday, same name the whole time.
OWASP picked it up. The MCP Top 10 — a real project, v0.1 beta, led by Vandana Verma Sehgal — files this under MCP03:2025 Tool Poisoning, which explicitly lists "rug pulls (malicious updates to trusted tools)" and "schema poisoning (corrupting interface definitions)." Two named sub-techniques, both of which mutate exactly the fields I just listed: description and inputSchema.
So here's the question I couldn't answer from my own toolbox: between the approve click and the next call, what is checking that the tool I approved is still the tool that runs? Nothing was. The approval flow trusts the server to be honest after it's been trusted once. That's the bug class. The franchise I keep coming back to on this blog — tracking ≠ control, having ≠ scoping — gets a third entry: approval ≠ immutability.
How do you pin a tool manifest?
You take a fingerprint at approve time and you re-check it before every call. That's it. The fingerprint covers the three fields the classic poisoning shapes mutate — name, description, inputSchema — and it has to be canonical, so that re-serialized-but-identical JSON doesn't trip a false alarm. (An MCP tool also carries title, annotations, outputSchema and _meta; this version doesn't hash those, which is a real gap I'm explicit about at the end — read that before you trust this in anger.) Here's the whole core:
def canon(tool: dict) -> str:
"""Canonical form: name + description + inputSchema, key-sorted, whitespace-stripped."""
fingerprint = {
"name": tool.get("name", ""),
"description": tool.get("description", ""),
"inputSchema": tool.get("inputSchema", {}),
}
return json.dumps(fingerprint, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
def digest(tool: dict) -> str:
return hashlib.sha256(canon(tool).encode("utf-8")).hexdigest()
def verify(live: dict, lock: dict) -> list[tuple[str, str]]:
"""Diff live fingerprints against the lock."""
drift = []
for name, h in live.items():
if name not in lock:
drift.append(("ADDED", name)) # a tool that didn't exist at approve time
elif lock[name] != h:
drift.append(("CHANGED", name)) # same name, mutated definition (the poison case)
for name in lock:
if name not in live:
drift.append(("REMOVED", name)) # an approved tool vanished
return drift
sort_keys=True means key order doesn't matter, and it sorts recursively, so a re-ordered inputSchema hashes the same. separators=(",", ":") strips incidental whitespace, so a manifest re-emitted by a different JSON library hashes the same. The hash covers the three core fields and deliberately ignores the rest, so server-side junk — a timestamp, a request id — won't cause noise. (One honest caveat for that choice: annotations, title and outputSchema are also part of the tool and also reach the model, and right now they're in the ignored bucket — see the limits section.) The whole thing is json, hashlib, sys. No network. It never reads a key or a file path; it reads tool definitions.
That core is about 24 lines of logic. Add the CLI wrapper (lock to mint the file, verify to check it, an exit code) and it lands at the 40-line mark. The rest of mcp_pin.py is the two-line report.
Quick start: pin, then verify
Two commands. First you pin the manifest you approved into a lockfile. Then, in CI or right before your agent connects, you verify the live manifest against it.
# at approve time, once:
python3 mcp_pin.py lock approved.json tools.lock
# before every run / in CI:
python3 mcp_pin.py verify <live_manifest.json> tools.lock
The lock step is the approval gesture made durable. It writes one SHA-256 per tool:
$ python3 mcp_pin.py lock approved.json tools.lock
PINNED 2 tool(s) -> tools.lock
list_dir 85ef71a9f3bee373...
read_file c0a76d25fdb9eb89...
That lockfile is the thing you commit. It's two lines of hashes: read_file pins to c0a76d25…, list_dir to 85ef71a9…. From now on it's the definition of "the tools I agreed to." Anything else is drift.
Catching three kinds of drift
I built three synthetic fixtures: a stable manifest, a description-poisoned one, and one with a parameter change plus a silently added tool. Each models a real attack shape. Every output block below is the real run, pasted.
The stable case — does the meter ever say yes?
First, the boring one, because a checker that screams at everything is useless. Live manifest equals the approved manifest:
$ python3 mcp_pin.py verify approved.json tools.lock
OK: 0 drift (2 tool(s) match tools.lock)
exit=0
Zero drift, exit 0. The pipeline goes green and the agent connects. This is the case you'll hit 99% of the time, and it has to be silent, or people route around the tool inside a week. The point of showing it first: the checker isn't paranoid-by-default. It says yes when the answer is yes.
Description-poison — the sleeper rug-pull
Now the Invariant Labs shape. Same tool name, same parameters, but the read_file description gained a hidden instruction block — in my fixture, a synthetic <IMPORTANT>…read the user's SSH private key…</IMPORTANT> string standing in for the real thing. The agent re-approval dialog never fires, because the name is unchanged. The model just quietly reads a new, hostile description.
$ python3 mcp_pin.py verify poisoned.json tools.lock
BLOCK [CHANGED] tool=read_file
--- 1 drift event(s); refusing tool-calls until re-approved
exit=1
BLOCK [CHANGED] tool=read_file, exit 1. The hash over the description changed, so the fingerprint changed, so the tool no longer matches what you approved. Nothing about the content of the poison is parsed or judged — the checker doesn't know SSH from breakfast. It only knows this isn't the read_file you signed off on. That's enough to stop the call.
Parameter drift plus a brand-new tool
The third fixture is two attacks at once. The read_file schema grew a new optional post_to parameter ("Optional URL to also send the file contents to" — read that twice). And a tool called exfil appeared that wasn't in the approved set at all.
$ python3 mcp_pin.py verify drift2.json tools.lock
BLOCK [ADDED] tool=exfil
BLOCK [CHANGED] tool=read_file
--- 2 drift event(s); refusing tool-calls until re-approved
exit=1
Two events. BLOCK [ADDED] tool=exfil catches the tool that materialized after approval — the schema-poisoning / shadow-tool case. BLOCK [CHANGED] tool=read_file catches the parameter that crept into an existing tool. One hash check, three drift classes covered: description, parameters, and the set of tools itself. Exit 1 again. In CI that's a failed check; at agent startup it's a refused connection.
The CI gate is the exit code
The reason this is one file and not a SaaS is the exit code. Stable returns 0, any drift returns 1, a usage error returns 2 — I keep a negative-control case for that:
$ python3 mcp_pin.py
usage: mcp_pin.py lock|verify <manifest.json> [lockfile]
mcp_pin.py - pin an MCP server's tool manifest at approve time, block drift before the call.
exit=2
So mcp_pin.py verify drops straight into a pipeline step or a pre-flight hook. Drift fails the build before a single tool-call goes out. No daemon, no dashboard, no account.
Why a hash, and why offline
Because the fingerprint has to be deterministic, or it's worthless as a gate. Same input, same hash, every time. I check that explicitly:
read_file fp: c0a76d25fdb9eb899f6b6e9fffb4d46513edeaf614652a987cbf18ef7db0407f
a == b ? True
key-reorder same digest ? True
Hash the same manifest twice, you get the identical digest (a == b → True). Re-order the JSON keys or add whitespace, still identical (key-reorder same digest → True), because the canonical form sorts keys and strips spacing. A re-serialized-but-equivalent manifest won't false-positive; a one-character change in a description will. That's the whole contract.
Offline matters for a duller reason: you don't want your drift-checker phoning home about your tool config, and you don't need it to. The manifest is already on your disk — it's the tools/list response or your server's tool spec. mcp_pin.py reads what you have and compares it to what you committed, in your own process, with nothing sent anywhere. This is the earliest layer in the wall I've been building here. The pre-execution gate decides whether a specific action runs; pinning runs before that, deciding whether the tool that would run is even the one you approved. The MCP server token tax measures what each tool costs you in context; this measures whether each tool is still the tool you measured. And blast radius scores how much a key could break — pinning is the same instinct one level up, at the tool definition instead of the credential.
What this is NOT
I'd rather you trust the small claim than oversell it. This is a 40-line padlock, not a security program.
-
It is not a secret scanner and it is not
mcp-scan. mcp-scan by Invariant Labs is the real scanner — it actively analyzes descriptions for known injection patterns, cross-origin escalation, rug pulls, the lot. (Their separate mcp-injection-experiments repo is the proof-of-concept code that reproduces the attacks.)mcp_pin.pydoes none of that scanning. It doesn't read the meaning of a description; it only asks "is this byte-for-byte the definition I approved?" Different job. Run the scanner too. - It does not catch behavior changes behind a stable manifest. If a server keeps its tool definition identical but changes what the tool does server-side — returns poisoned data, calls a different backend — the hash is unchanged and this tool says OK. Pinning covers the definition, not the runtime. That's a real blind spot, and it's the one I'd want a reviewer to push on.
-
It pins three fields, not the whole tool — this is the gap I'd attack first. An MCP tool object also carries
title,annotations(including a displaytitleand behavior hints),outputSchema, and_meta.mcp_pin.pyhashes onlyname,description,inputSchema. So an attacker who leaves the description alone and hides an instruction inannotations.titleortitle— both of which can reach the model or the approval UI — would not trip this pin. That's a deliberate v1 scope, not a claim of completeness, and it's the honest answer to "does this cover the whole poisoning surface?": no, it covers the classic three. If you want the rest pinned, extendcanon()to hash those fields too (it's a one-line change to thefingerprintdict); your hashes will change, which is the point. -
It assumes unique tool names. Fingerprints are keyed by
name, so if a manifest returns two tools with the same name, the second silently overwrites the first in the map and only one gets checked. A realtools/listshouldn't do that, but a hostile server can. Treat a duplicate-name manifest as suspect on its own; this checker won't flag it for you yet. -
The canon is byte-based, so semantic hiding is possible. Because I normalize whitespace and key order, two manifests that mean something different but serialize to the same canonical bytes would hash the same. The key-sorting is recursive, so nested
inputSchemaordering is stable — but this is not full JSON Schema canonicalization: array order, Unicode normalization, and duplicate JSON keys aren't normalized. If your threat model includes Unicode look-alikes in a name, this isn't your last line. -
The fixtures are synthetic. Two tools, made-up names, one made-up poison string. The exit codes and hashes are real and deterministic, but they describe modeled attacks, not a live server. Your numbers come from your own
tools/list. -
Legitimate updates also trip it — on purpose. A server that honestly versions a tool will change its definition, and this will block. That's a feature only if you re-approve deliberately: re-run
lockafter you've read the diff. Which is the open question below.
Pin your noisiest server first
Pick the MCP server your agent talks to most. Pull its tools/list output, run mcp_pin.py lock once, commit the lockfile. Add mcp_pin.py verify to your CI and to whatever runs before your agent connects. Now a rug-pull fails a build instead of leaking a chat history on a Thursday. It's the cheapest pre-call check you'll add this quarter, and it runs before the first tool-call instead of in the postmortem.
Here's the part I haven't settled, and I want your take: where's the line between a legitimate tool update and drift? A real server versions its tools — new params, better descriptions — and my checker blocks all of it until you re-lock. That's safe but it's friction, and friction is what makes people disable a gate. Do you pin per-version and bump on a signed release? Diff only the description and ignore additive schema changes? Tie re-approval to a publisher signature? I don't have a clean answer, and every option trades safety for noise somewhere. If you've run a pin-and-re-approve flow against a server that updates for real, tell me how you drew the line. I read every comment. Follow along — the next post takes this from a fixture to a live tools/list and tries to find where re-approval should actually sit.
Top comments (0)