Alex Spinov

Posted on Jun 21 • Originally published at blog.spinov.online

The MCP Tool Your Agent Calls Changed Its Schema. It Didn't Notice.

#ai #agents #llm

Your tool call didn't crash. It returned 200 and valid JSON. And it ran against the wrong contract: the upstream had renamed one field three days earlier, your agent kept sending the old name, the server quietly dropped it, and the result came back empty. No error. Nobody noticed for a week.

That's the failure I want to talk about. Not the loud kind, where a tool throws and you see red in the logs. The quiet kind, where the call goes through, the server smiles back, and the data is subtly wrong.

The short version: MCP servers can change a tool's contract between calls (rename a param, re-describe what a field means, narrow an enum) in ways a lenient server still accepts. So the call goes through, the server answers 200, and your agent acts on the wrong slice of data without a single error. The fix: pin a SHA-256 of each tool's whole contract (schema and description), and stop any call whose contract has drifted before it runs. Standard library only. The demo below is deterministic.

Everyone watches the response. Almost nobody watches the contract.

Here's the part that bugs me. We've gotten good at watching whether a tool answered. We check the status code. We set timeouts. We retry on 5xx. We have whole libraries for that.

Almost nobody checks whether the tool's contract changed between the moment the agent learned its schema and the moment it actually called it.

Quick refresher on the moving parts. An MCP server exposes its tools through tools/list. Each tool carries an inputSchema, a JSON Schema describing its parameters, plus a human-readable description the model leans on to decide how to call it. The MCP specification is plain about this: a tool definition includes "inputSchema: JSON Schema defining expected parameters." Your agent (or the model behind it) reads that contract once, usually at session start, and builds calls from it.
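
To make the moving parts concrete, here is a minimal sketch that turns a tools/list result into the name -> {description, schema} shape the demo later in this post uses. The payload is hand-written for illustration, and manifest_from_tools_list is a hypothetical helper name, not part of any MCP SDK.

```python
# Hand-written stand-in for the result of a tools/list request.
# Real servers return more fields; only the three used here matter for pinning.
TOOLS_LIST_RESULT = {
    "tools": [
        {
            "name": "search_reviews",
            "description": "Search reviews by text query.",
            "inputSchema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    ]
}


def manifest_from_tools_list(result):
    """Index tools by name, keeping only the parts that form the contract."""
    manifest = {}
    for tool in result.get("tools", []):
        manifest[tool["name"]] = {
            "description": tool.get("description", ""),
            "schema": tool.get("inputSchema", {}),
        }
    return manifest


manifest = manifest_from_tools_list(TOOLS_LIST_RESULT)
print(sorted(manifest))
```

Everything the agent later assumes about search_reviews comes from this one snapshot, which is why it is the thing worth pinning.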

Now the upstream ships a release. Between two of your calls, the server can:

rename a parameter (query becomes q) while leaving additionalProperties open, so your old query is silently dropped,
keep a field's name and type but flip its meaning (limit went from "max results" to "page index"),
narrow an enum so a once-valid value gets coerced into a new default (format: "text" quietly becomes markdown),
and, the loud case, make a brand-new parameter required.

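Three of those four go through without an error on a lenient server. The rename and the re-description pass even a strict JSON Schema check, so validation says nothing. The enum change slips through whenever the server coerces a stale value instead of enforcing the enum (a validator that does enforce it would reject the call, which turns it into the loud case). The server answers 200 with valid JSON. The only thing wrong is the answer. Nothing in HTTP-land tells you the ground moved.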

"But the spec has a notification for this"

It does. If a server declares the listChanged capability, the spec says it SHOULD send notifications/tools/list_changed when the tool list changes, and the client SHOULD re-fetch.

Two SHOULDs. Not MUST. That word choice is the whole problem.

Re-read it as an operator and the gap jumps out. The server should notify, so plenty won't, especially the third-party wrapper someone vibe-coded last month. The client should re-validate, so plenty don't. And even a server that does fire the notification only tells you the list changed. It says nothing about whether the schema of a tool that's still in the list got reshaped under you. A rename inside inputSchema doesn't add or remove a tool. The list looks identical. The contract isn't.
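
A small sketch of that gap, under the assumption that you keep two snapshots of the manifest: comparing tool names says nothing changed, while comparing contract hashes finds the reshaped tool. Both snapshots are synthetic, and same_tool_list and reshaped_tools are hypothetical helper names.

```python
import hashlib
import json


def contract_hash(tool):
    """SHA-256 over the canonical JSON of a tool's description + schema."""
    blob = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def same_tool_list(old, new):
    """What a name-level view of the list can tell you."""
    return set(old) == set(new)


def reshaped_tools(old, new):
    """Tools present in both snapshots whose contract hash differs."""
    shared = set(old) & set(new)
    return sorted(n for n in shared if contract_hash(old[n]) != contract_hash(new[n]))


old = {"search_reviews": {
    "description": "Search reviews by text query.",
    "schema": {"type": "object",
               "properties": {"query": {"type": "string"}},
               "required": ["query"]}}}
new = {"search_reviews": {
    "description": "Search reviews by text query.",
    "schema": {"type": "object",
               "properties": {"q": {"type": "string"}},
               "required": []}}}

print(same_tool_list(old, new))   # True: the list of names is identical
print(reshaped_tools(old, new))   # ['search_reviews']: the contract is not
```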

So the protocol acknowledges the surface exists and then politely leaves the runtime enforcement to you. Fine. Let's enforce it.

Why the quiet version is the expensive one

I run production scrapers: 2,190 lifetime runs across 32 published actors, the busiest being a Trustpilot review scraper at 962 runs. I bring that up not to wave a number around but because the lesson came from there, not from the MCP spec.

Across those runs, the upstreams changed their contracts on me more than once. A target site renamed a JSON field and started ignoring the old one. A third-party endpoint kept a page param but quietly switched it from 1-based to 0-based, so every result was off by one page. A field that used to mean "rows" started meaning something else entirely. None of these threw. Every one of them was a 200.

And the worst class of failure I've shipped wasn't a crash. A crash you catch in minutes. It's loud, it pages you, you fix it. The expensive one is the silently wrong call against a stale contract: the call validates, the server accepts it, the data is off, and you find out days later when someone downstream asks why the numbers look weird. By then you've built three more things on top of the bad slice.

Here's the uncomfortable part. Strict JSON Schema validation does not reliably save you. A renamed-but-tolerated field and a re-described param are structurally valid calls under any validator, and a stale enum value gets through wherever the server coerces it to a default instead of rejecting it. The shape is fine; the meaning moved. MCP doesn't invent this problem. It relocates the same upstream-drift problem to the tool-contract layer and, helpfully, hands you one clean place to pin it. So pin it.

The fix: a lockfile for tool contracts

You already trust this idea. package-lock.json, poetry.lock, Cargo.lock: they all do the same thing, record the exact thing you depend on and scream when it changes out from under you.

Do that for MCP tools. On first contact, take a SHA-256 of each tool's canonical contract and pin it. By contract I mean the inputSchema and the description together. Before every call, recompute the hash of the current contract and compare. Match → proceed. Mismatch → flag the call and tell me exactly what changed, instead of firing blind against a contract that no longer exists.

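Hashing the descriptions, not just the names and types, is the whole trick. A param that goes from "max results" to "page index" keeps the same name and the same integer type. A hash over the structural shape alone (names, types, required), and any type validator, would shrug. But the contract hash changes, because the words the model reads to build the call changed. That covers property-level descriptions inside inputSchema (where the demo's list_items change lives) and the tool-level description next to it. That's how you catch the silent semantic drift that validation can't.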
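
Here is a minimal sketch of that difference, using the list_items change as a hand-written fixture. shape_only is a hypothetical helper that keeps only names, types, and the required list, which is roughly what a type-level check looks at.

```python
import hashlib
import json


def sha(obj):
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def shape_only(tool):
    """Structural view of a contract: field names, types, required. No prose."""
    schema = tool["schema"]
    props = schema.get("properties", {})
    return {
        "types": {name: spec.get("type") for name, spec in props.items()},
        "required": sorted(schema.get("required", [])),
    }


before = {"description": "List items.",
          "schema": {"type": "object",
                     "properties": {"limit": {"type": "integer",
                                              "description": "max results to return"}},
                     "required": []}}
after = {"description": "List items.",
         "schema": {"type": "object",
                    "properties": {"limit": {"type": "integer",
                                             "description": "page index (0-based)"}},
                    "required": []}}

shape_matches = sha(shape_only(before)) == sha(shape_only(after))
contract_matches = sha(before) == sha(after)
print(shape_matches, contract_matches)  # True False
```

The structural view sees two identical tools. The full contract hash sees the re-description.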

"Canonical" is doing real work too. JSON object key order isn't meaningful, so I sort keys recursively before hashing. Otherwise a server that re-serializes its contract in a different key order would look like drift when nothing actually changed. The JSON Schema docs back the field semantics I lean on here: "By default, the properties defined by the properties keyword are not required. However, one can provide a list of required properties using the required keyword. The required keyword takes an array of zero or more strings." So a field moving into required is a contract change, and the diff has to catch it — along with the renames, re-descriptions, and enum narrowings that validation never will.
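
A quick sketch of what sort_keys does and does not canonicalize. Object key order is neutralized, but array order is not, so a server that merely reorders required or enum would still register as drift. The normalized helper is an optional, hypothetical extension that sorts those two arrays first; it assumes their order carries no meaning for your tools.

```python
import hashlib
import json


def contract_hash(tool):
    blob = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def normalized(node):
    """Return a copy with string-only `required` / `enum` arrays sorted."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            is_str_list = isinstance(value, list) and all(isinstance(x, str) for x in value)
            if key in ("required", "enum") and is_str_list:
                out[key] = sorted(value)
            else:
                out[key] = normalized(value)
        return out
    if isinstance(node, list):
        return [normalized(x) for x in node]
    return node


a = {"description": "Start an export job.",
     "schema": {"type": "object", "required": ["dataset", "region"]}}
# Same content as `a`, keys inserted in a different order.
b = {"schema": {"required": ["dataset", "region"], "type": "object"},
     "description": "Start an export job."}
# Same content as `a`, but the `required` array is reordered.
c = {"description": "Start an export job.",
     "schema": {"type": "object", "required": ["region", "dataset"]}}

print(contract_hash(a) == contract_hash(b))  # True: key order is ignored
print(contract_hash(a) == contract_hash(c))  # False: array order is not
print(contract_hash(normalized(a)) == contract_hash(normalized(c)))  # True
```

Whether to normalize arrays is a judgment call: it removes one source of false alarms at the cost of hashing something slightly different from what the server sent.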

Here's the whole thing. Standard library only: hashlib and json. No network, no clock, no randomness, which means the output is byte-for-byte identical on every run (I lean on that for verification, more on it below). Treat it as a runnable local demo, not production code: the fixtures are hand-written.

#!/usr/bin/env python3
"""
schema_pin_check.py - a lockfile for MCP tool contracts.

Pin a SHA-256 of each tool's JSON Schema (plus its description) when you first
see it, then catch any call whose contract drifted before it executes. No
network, no clock, no RNG -> deterministic stdout (same MD5 on every run).

The two manifests below are SYNTHETIC fixtures. The failure mode is real: an
upstream renames a field, re-describes a param, or narrows an enum default while
the server still answers 200 with valid JSON. JSON validation stays silent
because the call is still structurally valid; the result is just wrong. A pin
on the contract hash catches it. Run: python3 -I schema_pin_check.py
"""

import hashlib
import json

# --- Fixture T0: the manifest as we pinned it on first contact -------------
# (what tools/list returned: each tool's `inputSchema` per the MCP spec, and
#  the human-readable `description` the model reads to decide how to call it)
MANIFEST_T0 = {
    "search_reviews": {
        "description": "Search reviews by text query.",
        "schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "free-text search"},
                "limit": {"type": "integer", "description": "max results"},
            },
            "required": ["query"],
        },
    },
    "list_items": {
        "description": "List items.",
        "schema": {
            "type": "object",
            "properties": {
                # at T0 `limit` caps how many rows you get back
                "limit": {"type": "integer", "description": "max results to return"},
            },
            "required": [],
        },
    },
    "get_page": {
        "description": "Fetch a page in a given format.",
        "schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "page url"},
                "format": {
                    "type": "string",
                    "enum": ["text", "html"],
                    "default": "text",
                    "description": "output format",
                },
            },
            "required": ["url"],
        },
    },
    "get_profile": {
        "description": "Fetch a user profile by id.",
        "schema": {
            "type": "object",
            "properties": {"user_id": {"type": "string", "description": "user id"}},
            "required": ["user_id"],
        },
    },
    "create_export": {
        "description": "Start an export job.",
        "schema": {
            "type": "object",
            "properties": {"dataset": {"type": "string", "description": "dataset id"}},
            "required": ["dataset"],
        },
    },
}

# --- Fixture T1: same server, after an upstream release ---------------------
# 4 of 5 tools changed their contract. The server still answers 200. Three of
# the changes are STRUCTURALLY VALID (validation stays silent, result is wrong);
# one is a genuinely breaking required-arg (a loud reject, kept for contrast).
MANIFEST_T1 = {
    "search_reviews": {  # SILENT DRIFT: query -> q, server ignores unknown args
        "description": "Search reviews by text query.",
        "schema": {
            "type": "object",
            "properties": {
                "q": {"type": "string", "description": "free-text search"},
                "limit": {"type": "integer", "description": "max results"},
            },
            "required": [],  # q optional; missing -> empty search, not an error
            "additionalProperties": True,  # the old `query` is silently dropped
        },
    },
    "list_items": {  # SILENT DRIFT: same type, meaning of `limit` flipped
        "description": "List items.",
        "schema": {
            "type": "object",
            "properties": {
                # T1: `limit` now means the PAGE INDEX, not the row cap
                "limit": {"type": "integer", "description": "page index (0-based)"},
            },
            "required": [],
        },
    },
    "get_page": {  # SILENT DRIFT: enum narrowed, old value maps to new default
        "description": "Fetch a page in a given format.",
        "schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "page url"},
                "format": {
                    "type": "string",
                    "enum": ["markdown"],
                    "default": "markdown",  # "text" is gone, silently coerced
                    "description": "output format",
                },
            },
            "required": ["url"],
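            # `format` is a declared property, so this flag is not what lets the
            # stale "text" through: the demo validator skips enum checks, modeling
            # a server that coerces an unknown value to the default.
            "additionalProperties": True,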
        },
    },
    "get_profile": {  # unchanged
        "description": "Fetch a user profile by id.",
        "schema": {
            "type": "object",
            "properties": {"user_id": {"type": "string", "description": "user id"}},
            "required": ["user_id"],
        },
    },
    "create_export": {  # LOUD DRIFT: a new required arg, a strict server rejects
        "description": "Start an export job.",
        "schema": {
            "type": "object",
            "properties": {
                "dataset": {"type": "string", "description": "dataset id"},
                "region": {"type": "string", "description": "data region"},
            },
            "required": ["dataset", "region"],
        },
    },
}

# What the agent intends to send for each tool, built from the T0 contract it
# learned at startup. (Values matter only for which fields/types it assumes.)
PLANNED_CALLS = {
    "search_reviews": {"query": "noise cancelling", "limit": 50},
    "list_items": {"limit": 50},  # agent means "give me 50 rows"
    "get_page": {"url": "https://example.com/p/1", "format": "text"},
    "get_profile": {"user_id": "u_42"},
    "create_export": {"dataset": "ds_7"},
}


def canonical_contract_hash(tool):
    """SHA-256 of the tool's contract: inputSchema AND description, with keys
    sorted recursively so key order / cosmetics never trigger a false drift.
    Pure function of content. Hashing the description too is the point: a
    semantic re-description (same types) still changes the hash."""
    blob = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def field_types(schema):
    props = schema.get("properties", {})
    return {name: spec.get("type") for name, spec in props.items()}


def diff_contracts(pinned, current):
    """What changed between the pinned and current contract: structural diffs
    plus semantic (description / enum) diffs that a type check can't see."""
    changes = []
    if pinned.get("description") != current.get("description"):
        changes.append("description changed")

    p_s, c_s = pinned["schema"], current["schema"]
    p_props = set(p_s.get("properties", {}))
    c_props = set(c_s.get("properties", {}))
    p_req = set(p_s.get("required", []))
    c_req = set(c_s.get("required", []))
    p_types = field_types(p_s)
    c_types = field_types(c_s)

    # rename = exactly one field gone + one new, both same type (heuristic)
    removed = sorted(p_props - c_props)
    added = sorted(c_props - p_props)
    if len(removed) == 1 and len(added) == 1:
        if p_types.get(removed[0]) == c_types.get(added[0]):
            changes.append(f"renamed: {removed[0]} -> {added[0]}")
            removed, added = [], []
    for f in removed:
        changes.append(f"removed: {f}")
    for f in added:
        changes.append(f"added: {f}")

    for f in sorted(c_req - p_req):
        if f in c_props:
            changes.append(f"now-required: {f}")
    for f in sorted(p_req - c_req):
        if f in p_props:
            changes.append(f"no-longer-required: {f}")

    for f in sorted(p_props & c_props):
        if p_types[f] != c_types[f]:
            changes.append(f"type-changed: {f} ({p_types[f]} -> {c_types[f]})")
        # field kept, but its meaning was re-described (same type)
        p_desc = p_s["properties"][f].get("description")
        c_desc = c_s["properties"][f].get("description")
        if p_desc != c_desc:
            changes.append(f"re-described: {f} ('{p_desc}' -> '{c_desc}')")
        # enum narrowed (a once-valid value silently coerced to the new default)
        p_enum = p_s["properties"][f].get("enum")
        c_enum = c_s["properties"][f].get("enum")
        if p_enum and c_enum and set(p_enum) != set(c_enum):
            dropped = sorted(set(p_enum) - set(c_enum))
            changes.append(f"enum-narrowed: {f} (dropped {dropped})")
    return changes


def validates(schema, call):
    """Does this call satisfy the CURRENT schema, the way a lenient server does?
    JSON Schema treats an unset `additionalProperties` as permissive, so a
    server that follows the schema literally accepts unknown args. We model
    that: unknown args are tolerated unless the schema explicitly sets
    additionalProperties=False; required + basic types are still enforced.
    Enum membership is deliberately NOT checked, modeling a server that
    coerces a stale value to its default (a strict validator would reject it).
    So a structurally-valid-but-semantically-wrong call passes here."""
    props = schema.get("properties", {})
    allow_extra = schema.get("additionalProperties", True)
    for field in schema.get("required", []):
        if field not in call:
            return False
    for field, value in call.items():
        if field not in props:
            if not allow_extra:
                return False
            continue  # unknown arg: server silently drops it, no error
        want = props[field].get("type")
        if want == "string" and not isinstance(value, str):
            return False
        if want == "integer":
            if isinstance(value, bool) or not isinstance(value, int):
                return False
    return True


def run_naive(manifest_now):
    """Agent calls everything using the T0 contract it remembers. It trusts
    the wire: if the server accepts the call, the agent assumes it's correct."""
    print("NAIVE  (no pin, agent trusts the server's 200):")
    passed = rejected = 0
    for tool in PLANNED_CALLS:
        ok = validates(manifest_now[tool]["schema"], PLANNED_CALLS[tool])
        if ok:
            passed += 1
            print(f"  call {tool:<14} -> 200 OK (server accepts; agent moves on)")
        else:
            rejected += 1
            print(f"  call {tool:<14} -> rejected (-32602; loud, agent sees the error)")
    # Of the accepted calls, how many were built on a stale contract -> wrong?
    silently_wrong = []
    for tool in PLANNED_CALLS:
        if validates(manifest_now[tool]["schema"], PLANNED_CALLS[tool]):
            if canonical_contract_hash(MANIFEST_T0[tool]) != \
               canonical_contract_hash(manifest_now[tool]):
                silently_wrong.append(tool)
    print(f"  result: {passed} accepted / {rejected} rejected; "
          f"{len(silently_wrong)} accepted-but-WRONG, detected by agent: 0")
    print(f"  silently wrong: {silently_wrong}")
    return passed, rejected, silently_wrong


def run_pinned(manifest_pinned, manifest_now):
    """Recompute each contract hash, flag any call whose contract drifted,
    BEFORE it runs. Catches the silent ones the wire can't tell you about."""
    print("PINNED (compare current contract hash to the pin, check before call):")
    drifted = []
    for tool in PLANNED_CALLS:
        pin = canonical_contract_hash(manifest_pinned[tool])
        now = canonical_contract_hash(manifest_now[tool])
        if pin != now:
            drifted.append(tool)
            changes = "; ".join(diff_contracts(manifest_pinned[tool], manifest_now[tool]))
            print(f"  call {tool:<14} -> DRIFT  [{changes}]")
        else:
            print(f"  call {tool:<14} -> safe (pin matches), proceed")
    safe = len(PLANNED_CALLS) - len(drifted)
    print(f"  result: {len(drifted)} drifts caught before call, {safe} safe, "
          f"detected: {len(drifted)}")
    print(f"  drifted tools: {drifted}")
    return drifted, safe


def main():
    print("=== schema_pin_check (synthetic fixtures, real failure mode) ===")
    print()
    run_naive(MANIFEST_T1)
    print()
    run_pinned(MANIFEST_T0, MANIFEST_T1)
    print()
    print("pins (SHA-256 of canonical contract, taken at T0):")
    for tool in MANIFEST_T0:
        print(f"  {tool:<14} {canonical_contract_hash(MANIFEST_T0[tool])[:16]}")


if __name__ == "__main__":
    main()

Run it with python3 -I schema_pin_check.py and you get this, every time:

=== schema_pin_check (synthetic fixtures, real failure mode) ===

NAIVE  (no pin, agent trusts the server's 200):
  call search_reviews -> 200 OK (server accepts; agent moves on)
  call list_items     -> 200 OK (server accepts; agent moves on)
  call get_page       -> 200 OK (server accepts; agent moves on)
  call get_profile    -> 200 OK (server accepts; agent moves on)
  call create_export  -> rejected (-32602; loud, agent sees the error)
  result: 4 accepted / 1 rejected; 3 accepted-but-WRONG, detected by agent: 0
  silently wrong: ['search_reviews', 'list_items', 'get_page']

PINNED (compare current contract hash to the pin, check before call):
  call search_reviews -> DRIFT  [renamed: query -> q; no-longer-required: query]
  call list_items     -> DRIFT  [re-described: limit ('max results to return' -> 'page index (0-based)')]
  call get_page       -> DRIFT  [enum-narrowed: format (dropped ['html', 'text'])]
  call get_profile    -> safe (pin matches), proceed
  call create_export  -> DRIFT  [added: region; now-required: region]
  result: 4 drifts caught before call, 1 safe, detected: 4
  drifted tools: ['search_reviews', 'list_items', 'get_page', 'create_export']

pins (SHA-256 of canonical contract, taken at T0):
  search_reviews 523c03021e8245b2
  list_items     d9df519d261ead81
  get_page       dd8de02649201598
  get_profile    9d2bbdf36c1c632b
  create_export  2e921e25010eee65

Reading the output

The NAIVE block is the world without a pin, where the agent trusts the wire. It built five calls from the T0 contract it remembered. Four of them sail through with a 200, and the agent moves on, satisfied. But three of those four 200s are wrong, and the agent knows about exactly zero of them. That's the line that should make you uncomfortable.

Look at what those three silently-wrong calls actually do:

search_reviews still sends query. The tool now wants q, and since additionalProperties is open, the server shrugs off query and runs an empty search. 200, valid JSON, zero results. No error.
list_items sends limit: 50 meaning "give me 50 rows." The field kept its name and its integer type, but its meaning flipped to "page index." So the agent quietly asks for page 50 of a list it wanted the top of. Same type, totally different answer.
get_page sends format: "text". That enum value no longer exists; the server coerces to the new default, markdown. The agent gets back markdown where it expected plain text, parses it as text, and the downstream is subtly off.

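None of those three trips the validator in the demo, which models a lenient server. For the rename and the re-description the shapes are fine under any validator: structural validation cannot see a meaning change. The enum case differs in one respect: a validator that enforces enum membership would reject format: "text", so it stays silent only on servers that coerce instead. Only create_export fails loud here, because region became required and a strict server returns -32602. That's the one failure mode everyone already handles. The expensive three are the quiet ones above.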

The PINNED block is the same five calls with the contract-hash check in front. It catches all four drifts before the call leaves, each with a plain-English reason, including the three the validator was blind to. get_profile didn't change, its pin still matches, so it sails through untouched. The point isn't to flag everything. It's to flag exactly the drifted contracts and let the stable ones run.

One honest wart, because the diff is real code, not a marketing screenshot: look at search_reviews. The output says renamed: query -> q; no-longer-required: query. Both are true, but the second entry is redundant noise once you've seen the rename. It's verbose, not wrong. If you want a single clean "renamed" entry you'd add precedence to the diff so a detected rename suppresses the required-ness entries for the renamed field. I left it raw on purpose, so what you read here is byte-for-byte what the script prints.
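
For what that precedence could look like, here is a sketch of a post-filter over the change strings. suppress_rename_noise is a hypothetical helper, not part of the script above, and it assumes the exact "renamed: old -> new" and "now-required: field" string formats the diff emits.

```python
def suppress_rename_noise(changes):
    """Drop required-ness entries that refer to a field already reported as renamed."""
    renamed = set()
    for change in changes:
        if change.startswith("renamed: "):
            old, new = change[len("renamed: "):].split(" -> ", 1)
            renamed.update((old, new))

    noisy_prefixes = ("now-required: ", "no-longer-required: ")
    kept = []
    for change in changes:
        for prefix in noisy_prefixes:
            if change.startswith(prefix) and change[len(prefix):] in renamed:
                break  # redundant with the rename entry, skip it
        else:
            kept.append(change)
    return kept


raw = ["renamed: query -> q", "no-longer-required: query"]
print(suppress_rename_noise(raw))  # ['renamed: query -> q']
```

Entries about fields that were not renamed, such as a genuinely new required arg, pass through unchanged.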

What this is and isn't

Be clear about the limits, because I don't want you shipping this thinking it's more than it is.

The two manifests are synthetic fixtures. I did not capture a real server doing this on a specific Tuesday, and I'm not going to pretend I did. The fixtures are hand-written to show the drift shapes I've actually hit in production over those 2,190 runs: a tolerated rename, a re-described field that kept its type, a narrowed enum, and one genuinely-breaking required arg. The failure mode is real; this particular reproduction is staged so it's deterministic.

Worth being precise about loud versus silent, since it's the whole point. The demo shows both on purpose. create_export is the loud class: a new required arg, a -32602, the kind of break your error handling already catches. The other three are the silent class: structurally valid calls that pass validation and still return wrong data. In production I've shipped both, but the silent three are the ones that cost me, because nothing pages you. The pin doesn't care which kind it is. It compares the contract, so it catches the loud break and the three quiet ones with the same check.

The rename detection is a heuristic: one field gone, one field new, same type. A release that renames and retypes in one shot won't be classified as a rename. It'll show as a remove plus an add. That's acceptable. You still catch the drift; you just describe the change less prettily.

This is a runtime gate, not a migration tool. When it flags a drift, that's the start of work, not the end: a human (or a more careful agent step) looks at the diff, decides whether the new contract is fine, and re-pins. Drift isn't always malicious. Most of the time the upstream just shipped a reasonable v2. A pin doesn't judge intent. It refuses to guess.

And it's all local, no network. We're comparing two JSON manifests and flagging a call. That's a different gate than the one in my earlier MCP web-fetch piece, where the check sat in front of the network to stop an agent from fetching internal addresses. This check sits in front of the call itself, on the tool's contract. Same agent, two different trust boundaries.

Why I trust this output

A claim I make a lot: I don't publish numbers I haven't run. So here's how I keep myself honest on this one.

The script touches no clock, no RNG, no environment, no socket. Every input is baked into the source. The hash is a pure function of content. That means two runs produce identical stdout (same bytes, same MD5) and anyone can reproduce mine. Run it twice, diff the output, hash it. If your MD5 of the output doesn't match across two runs on your own machine, something in your environment is non-deterministic, not the logic. That reproducibility is the whole point: a "pin" you can't reproduce is just a vibe.
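
One way to run that check, sketched with the standard library only. stdout_md5 and is_deterministic are hypothetical helper names. The demo line uses a trivial inline command so the block runs anywhere; point it at the script instead by passing [sys.executable, "-I", "schema_pin_check.py"], assuming the file is in the current directory.

```python
import hashlib
import subprocess
import sys


def stdout_md5(argv):
    """Run a command and return the MD5 hex digest of its raw stdout bytes."""
    completed = subprocess.run(argv, capture_output=True, check=True)
    return hashlib.md5(completed.stdout).hexdigest()


def is_deterministic(argv, runs=2):
    """True if every run of the command produced byte-identical stdout."""
    digests = {stdout_md5(argv) for _ in range(runs)}
    return len(digests) == 1


# Stand-in command; swap in the real script path to verify the demo itself.
print(is_deterministic([sys.executable, "-c", "print('ok')"]))
```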

What I'd do Monday

Wrap your MCP client's call path. Snapshot canonical_contract_hash for every tool the first time you see it (schema and description), store the pins next to your config like a lockfile, and gate tools/call on a hash match. On mismatch, don't retry and don't guess — pause, log the structural and semantic diff, and surface it. Treat a contract change like a failed lockfile check, because that's what it is.
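
A minimal sketch of that wrapper, under stated assumptions: PinnedGate and ContractDrift are hypothetical names, send stands in for whatever function your client uses to issue tools/call, and the pin store is an in-memory dict you would persist yourself. It pins on first contact (trust on first use) and only raises on mismatch; producing the readable diff is left to a function like diff_contracts above.

```python
import hashlib
import json


class ContractDrift(Exception):
    """Raised when a tool's current contract no longer matches its pin."""


def contract_hash(tool):
    blob = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class PinnedGate:
    def __init__(self, pins=None):
        self.pins = dict(pins or {})  # tool name -> pinned SHA-256 hex digest

    def check(self, name, tool):
        current = contract_hash(tool)
        pinned = self.pins.setdefault(name, current)  # first contact: pin it
        if pinned != current:
            raise ContractDrift(f"{name}: pinned {pinned[:16]}, now {current[:16]}")

    def call(self, name, tool, args, send):
        self.check(name, tool)      # gate first, no retry and no guessing
        return send(name, args)     # only reached when the pin matches

    def dump_lockfile(self):
        return json.dumps(self.pins, sort_keys=True, indent=2)


def fake_send(name, args):
    """Stand-in for the real tools/call transport."""
    return {"tool": name, "args": args}


t0 = {"description": "List items.",
      "schema": {"type": "object",
                 "properties": {"limit": {"type": "integer", "description": "max results"}}}}
t1 = {"description": "List items.",
      "schema": {"type": "object",
                 "properties": {"limit": {"type": "integer", "description": "page index"}}}}

gate = PinnedGate()
first = gate.call("list_items", t0, {"limit": 50}, fake_send)
try:
    gate.call("list_items", t1, {"limit": 50}, fake_send)
    blocked = False
except ContractDrift:
    blocked = True
print(first, blocked)
```

Re-pinning is deliberately absent from the class: replacing an entry in pins is the approval step, and who gets to do that is the open question below.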

The open question I keep chewing on: when a pin legitimately needs updating (the upstream shipped a real, intended v2), who's allowed to re-pin, and on what evidence? A human eyeballing a diff doesn't scale past a handful of tools. An agent re-pinning itself defeats the lock. I don't have a clean answer yet. If you've built an approval flow for contract changes that isn't just "trust me," I want to hear how it's structured.

I write about the unglamorous failure modes of running data and AI agents in production — the silent ones that cost the most. Follow for the next teardown, and tell me the weirdest silent-200 you've shipped. I read every comment.

Code in this post is a synthetic, runnable-local demo; the failure mode is from real production experience. AI-assisted drafting, all numbers and code verified by hand.