DEV Community

Mike


My MCP Tools Broke Silently — Schema Drift Is the New Dependency Hell

Cover image: MCP schema drift — duck comparing two JSON documents

Here's a failure mode that looks like nothing went wrong.

An agent queries an MCP search tool. The tool returns valid JSON — empty results. The model receives the empty response, thinks about it, and returns: "No results found. The query might be too specific — try broadening your search terms."

Helpful. Confident. Wrong.

The cause: the upstream MCP tool renamed a parameter from query to search_query. The agent was still sending query. The tool didn't reject it — it silently ignored the unknown field, used its default (empty string), and dutifully searched for nothing. The model got the empty result, reasoned around it like a good language model does, and produced a polished explanation of why nothing was found.

No error. No warning. No stack trace. Just a quiet lie wrapped in perfect grammar.

Why didn't the client validate tool inputs against the schema before calling? Many agent frameworks don't validate by default — they pass the model's tool call directly to the server. And many MCP servers don't enforce strict input validation either (if you're building one, configure your validator to reject unknown fields — Zod's .strict(), Pydantic's extra = "forbid"). Both sides assume the other will catch problems. Neither does.
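The strict-validation idea above can be sketched without any framework at all. This is a minimal, dependency-free stand-in for what Zod's `.strict()` or Pydantic's `extra = "forbid"` give you; the schema shape and field names are illustrative, not MCP spec:

```python
# Minimal sketch of strict input validation: reject unknown fields
# instead of silently ignoring them. Field names are illustrative.
def validate_strict(params: dict, schema: dict) -> dict:
    """Return params if they match the schema; raise on unknown or missing fields."""
    allowed = set(schema["properties"])
    unknown = set(params) - allowed
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    missing = set(schema.get("required", [])) - set(params)
    if missing:
        raise ValueError(f"missing required parameters: {sorted(missing)}")
    return params

schema = {"properties": {"search_query": {}, "max_results": {}},
          "required": ["search_query"]}

# The agent still sends the old name -- this now fails loudly:
try:
    validate_strict({"query": "llm drift"}, schema)
except ValueError as e:
    print(e)  # unknown parameters: ['query']
```

Ten lines of checking turns the silent empty search into an immediate, debuggable error.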

Why This Is Worse Than REST Versioning

If you've built anything with REST APIs, you know what usually happens when you send the wrong parameter: you get a 400 Bad Request. Clear, debuggable, immediate. Your monitoring catches it. Your typed client catches it. The feedback loop is tight.

To be fair, REST can fail silently too. Lenient deserializers ignore unknown JSON fields. A 200 OK with an unexpected payload shape is a real thing. But REST has mature contract enforcement norms — OpenAPI specs, generated clients, CI schema checks — that catch most drift before it hits production. LLM tool use often lacks these guardrails entirely.

MCP tools called by LLMs break the feedback loop in three specific ways.

Some tool servers silently ignore unknown parameters. This isn't an MCP protocol thing — it's an implementation choice. Many MCP servers use permissive JSON parsing and simply ignore fields they don't recognize, same as lenient REST APIs. The difference: in a REST context, your client code would typically fail when it doesn't get the expected response. In an MCP context, the "client" is an LLM that will cheerfully work with whatever it gets back.

The model reasons around bad data instead of failing. This is the insidious part. An LLM that receives empty search results doesn't think "the tool call might be wrong." It thinks "the search returned no results, I should explain why." It writes a paragraph about how the query might be too narrow, or the data might not exist, or maybe you should try different terms. It's doing exactly what it's trained to do — be helpful — and that helpfulness masks the failure.

The failure is semantic, not syntactic. The JSON is valid. The types match. The HTTP status is 200. Every automated check passes. But the meaning has shifted. You asked for search results and got the tool's default empty response. Nothing is broken except the contract — and nobody's checking the contract.

A REST client that gets an empty response raises an exception or returns null. An LLM agent that gets an empty response writes a confident paragraph about why that's expected. The model's helpfulness is the amplifier that turns a minor integration bug into an invisible failure. And it's an expensive one — you pay for the input tokens, wait for inference, pay for the tool execution, and then pay for the model to eloquently explain why the wrong answer is fine.

Three Flavors of Schema Drift

Not all schema drift is the same. Three distinct flavors, each progressively harder to catch.

1. Parameter Rename

The most common. A field changes names.

Before:

{
  "name": "search",
  "parameters": {
    "query": { "type": "string", "description": "Search query" },
    "max_results": { "type": "number" }
  }
}

After:

{
  "name": "search",
  "parameters": {
    "search_query": { "type": "string", "description": "Search query" },
    "max_results": { "type": "number" }
  }
}

Easiest to detect if you're looking. Most teams aren't looking — because nothing looks broken.

2. Type Change

The field name stays the same, but the type shifts.

Before: { "max_results": { "type": "string", "description": "Maximum results (e.g. '10')" } }

After: { "max_results": { "type": "number", "description": "Maximum results (e.g. 10)" } }

LLMs are flexible about types — they might send "10" or 10 depending on the prompt and the phase of the moon. Some tools are lenient and coerce the type. Some aren't. You get inconsistent behavior that depends on which model is calling the tool and how it's feeling that day. Good luck debugging that.

3. Semantic Shift

The name stays the same. The type stays the same. The meaning changes.

Before — format controls the response format:

{ "format": { "type": "string", "description": "Output format: json, text, or markdown" } }

After — format now controls the underlying API mode:

{ "format": { "type": "string", "description": "API mode: json_mode, text, or streaming" } }

Same field. Same type. Completely different contract. Your agent sends format: "json" expecting a JSON-formatted response and instead activates the tool's JSON structured output mode, which changes the response envelope entirely. The response comes back fine — it's just not what you meant.

This is the hardest to detect because no structural schema change occurred. The description changed, but models don't always read descriptions carefully. Even if they do, the drift is in the intent, not the structure. And here's the kicker: in MCP, description changes are breaking changes — they alter the model's probability of selecting and correctly invoking the tool. Most teams don't treat them that way.

The Validation Layer: Schema Snapshots + Diff Detection

Here's the practical fix. It's not complicated, but it requires discipline.

Snapshot every MCP tool's schema on first connection. Store the JSON Schema of each tool's input parameters. This is your baseline — the contract your agent was built against.

{
  "tool": "search",
  "version_hash": "a1b2c3d4",
  "captured_at": "2026-02-15T10:00:00Z",
  "parameters": {
    "query": { "type": "string", "required": true },
    "max_results": { "type": "number", "required": false }
  }
}

On every subsequent connection, diff the current schema against the snapshot.

┌──────────────┐     ┌──────────────┐
│  MCP Server  │────▶│ Schema Fetch │
└──────────────┘     └──────┬───────┘
                            │
                     ┌──────▼───────┐     ┌──────────────────┐
                     │  Diff Engine │────▶│  Snapshot Store  │
                     └──────┬───────┘     └──────────────────┘
                            │
                  ┌─────────▼──────────┐
                  │  Change Detected?  │
                  └─────────┬──────────┘
                     ╱             ╲
                   Yes              No
                   ╱                 ╲
          ┌───────▼────────┐  ┌───────▼──────┐
          │  Alert / Block │  │  Proceed as  │
          │  the tool      │  │  normal      │
          └────────────────┘  └──────────────┘

Flag the things that matter:

| Change | Severity | Action |
| --- | --- | --- |
| New required parameter | Breaking | Block |
| Removed required parameter | Breaking | Block |
| Type change (e.g. string → number) | Breaking | Block |
| Possible rename (param removed + similar new one) | Likely breaking | Block until confirmed |
| Enum value removed | Breaking | Block |
| Removed optional parameter | Warning | Warn |
| New optional parameter | Compatible | Allow |
| Enum value added | Compatible | Allow |
| Description-only change | Review | Warn (can change model behavior) |

The decision defaults to block. The cost of a blocked tool is visible and immediate. The cost of a silently wrong answer is invisible and compounding.
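A classifier implementing those rules fits in one function. This is a sketch, not a complete diff engine: it covers required/optional changes, type changes, and description-only changes, and the snapshot format (a dict of parameter name to `{"type", "required", "description"}`) is an assumption from the snapshot example above:

```python
# Sketch of a diff classifier for the severity table above. Snapshot
# format ({name: {"type", "required", "description"}}) is illustrative.
def classify_drift(old: dict, new: dict) -> str:
    """Return 'block', 'warn', or 'allow' for a snapshot-vs-current diff."""
    actions = ["allow"]
    old_req = {k for k, v in old.items() if v.get("required")}
    new_req = {k for k, v in new.items() if v.get("required")}

    if new_req - old.keys():              # new required parameter
        actions.append("block")
    if old_req - new.keys():              # removed required parameter
        actions.append("block")
    for name in old.keys() & new.keys():
        if old[name].get("type") != new[name].get("type"):
            actions.append("block")       # type change
        elif old[name].get("description") != new[name].get("description"):
            actions.append("warn")        # description-only change
    if (old.keys() - new.keys()) - old_req:
        actions.append("warn")            # removed optional parameter

    order = {"allow": 0, "warn": 1, "block": 2}
    return max(actions, key=order.get)    # most severe action wins

old = {"query": {"type": "string", "required": True}}
new = {"search_query": {"type": "string", "required": True}}
print(classify_drift(old, new))  # block (looks like a rename: removed + new required)
```

Note the default: when changes of different severities stack up, the most severe one decides.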

Validate before calling. Before sending the model's tool call to the server, validate it against the current schema. If the model sends query but the schema expects search_query, catch it before the call — not after. Return a corrective error to the model and let it retry. This is the cheapest guard and the one most frameworks skip.

What happens when validation fails? Don't just throw an error into the void. Return a structured tool error to the model: "Schema mismatch: expected parameter 'query', found 'search_query'. Tool blocked." Allow one automatic retry where the model regenerates the call against the current schema. If it still fails, disable the tool for this session and surface the issue to the user and your logs. Fail closed, not silent.
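The validate-then-retry loop can be sketched in a few lines. `call_model` and `call_tool` are stand-ins for whatever your framework provides; the retry budget of one matches the policy above:

```python
# Sketch of validate-before-calling with one corrective retry.
# call_model / call_tool are hypothetical stand-ins for your framework.
def guarded_tool_call(tool_schema, tool_call, call_model, call_tool):
    for attempt in range(2):  # initial attempt + one automatic retry
        unknown = set(tool_call["arguments"]) - set(tool_schema["properties"])
        missing = set(tool_schema.get("required", [])) - set(tool_call["arguments"])
        if not unknown and not missing:
            return call_tool(tool_call)  # arguments match the current schema
        # Feed a structured, corrective error back to the model.
        error = (f"Schema mismatch: unknown={sorted(unknown)}, "
                 f"missing={sorted(missing)}. Regenerate against current schema.")
        tool_call = call_model(error, tool_schema)
    # Still wrong after the retry: fail closed, not silent.
    raise RuntimeError("Tool disabled for this session: schema mismatch persists")
```

The important property is the last line: after the retry budget is spent, the tool is disabled and the failure surfaces, rather than letting a malformed call through.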

When you're connecting to multiple MCP providers — like mcp-rubber-duck does, routing across different LLMs and tool servers — this isn't optional. Schema drift in one provider propagates through every agent that touches it. One renamed parameter in one tool can corrupt results across your entire pipeline.

Runtime Guards: When the Schema Lies

Schema diffing catches structural changes. But what about behavioral changes that don't touch the schema?

The tool still accepts query: string. It still returns results: array. But it now interprets the query differently, or filters results by a new default, or paginates where it didn't before. The schema hasn't changed. The behavior has.

Three runtime guards that help.

Response shape validation. Define what a "normal" response looks like for each tool. If your search tool typically returns 5-15 results and suddenly returns 0, that's a signal — not proof, but a signal worth logging and alerting on.

Anomaly detection on response patterns. Track response sizes, field counts, and structure over time. A sudden change in the distribution — even if each individual response is valid — suggests something upstream changed. Simple statistical checks (rolling average, standard deviation) work surprisingly well here.
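Those simple statistical checks really are simple. Here's a sketch of a rolling-window monitor on response sizes; the window, warmup threshold, and 3-sigma cutoff are illustrative defaults, not recommendations:

```python
# Sketch: flag responses whose size deviates sharply from recent history.
# Window size, warmup length, and 3-sigma threshold are illustrative.
from collections import deque
from statistics import mean, stdev

class ResponseSizeMonitor:
    def __init__(self, window: int = 50, sigmas: float = 3.0):
        self.sizes = deque(maxlen=window)
        self.sigmas = sigmas

    def is_anomalous(self, size: int) -> bool:
        anomalous = False
        if len(self.sizes) >= 10:  # need some history before judging
            mu, sd = mean(self.sizes), stdev(self.sizes)
            anomalous = sd > 0 and abs(size - mu) > self.sigmas * sd
        self.sizes.append(size)
        return anomalous

monitor = ResponseSizeMonitor()
for n in [12, 11, 13, 12, 14, 11, 12, 13, 12, 11]:  # normal traffic
    monitor.is_anomalous(n)
print(monitor.is_anomalous(0))  # a sudden empty response -> True
```

The same pattern works for field counts or nesting depth; track whatever a drifted tool is likely to change.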

Canary queries. Known-good queries with known-expected responses, run on a schedule. If your canary query for "test search" used to return 3 specific items and now returns 0, you know the tool's behavior changed before your users do. This is the cheapest, most effective runtime guard. Run canaries hourly — they catch silent behavioral breaks that schema diffing misses entirely.
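A canary runner is equally small. `run_tool` below is a hypothetical stand-in for your MCP client call, and the canary definition format is an assumption:

```python
# Sketch of a scheduled canary check: known-good queries with pinned
# baselines. run_tool is a hypothetical stand-in for your MCP client.
CANARIES = [
    {"tool": "search", "args": {"search_query": "canary test"},
     "expect_min_results": 3},
]

def run_canaries(run_tool):
    failures = []
    for c in CANARIES:
        result = run_tool(c["tool"], c["args"])
        if len(result.get("results", [])) < c["expect_min_results"]:
            failures.append(c["tool"])  # behavior changed upstream
    return failures
```

Wire `run_canaries` into an hourly scheduler and alert on any non-empty return value.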

Semantic Versioning for MCP Tool Schemas

MCP should adopt semver for tool schemas. This isn't novel — it's how every other ecosystem solved this problem. And the community agrees: SEP-1400 proposes moving the MCP spec itself from date-based to semantic versioning, and SEP-1575 proposes tool-level semantic versioning. Neither is in the spec yet, but both signal that this is the direction.

{
  "name": "search",
  "schema_version": "2.1.0",
  "parameters": {
    "search_query": { "type": "string", "required": true },
    "max_results": { "type": "number", "required": false },
    "format": { "type": "string", "required": false }
  }
}

The rules:

  • MAJOR (2.x → 3.0): breaking changes. Parameter renames, type changes, semantic shifts. Clients built for v2 should not call v3 without updating.
  • MINOR (2.1 → 2.2): new optional parameters, new return fields. Backward compatible.
  • PATCH (2.1.0 → 2.1.1): bug fixes, no contract changes. In MCP, even description tweaks can shift model behavior — so description-only changes should be MINOR at minimum.

The client declares what it understands:

{
  "tool": "search",
  "supported_schema_version": "^2.0.0"
}

If the server is on 3.0.0, the client gets a clear error: "schema version mismatch, expected ^2.0.0, got 3.0.0." Not a silent empty result. Not a confident wrong answer. A clear, debuggable, immediate error.
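The caret check itself is trivial to implement, which is part of the argument for adopting it. This stdlib sketch handles `^MAJOR.MINOR.PATCH` only (a real client might reach for a semver library instead):

```python
# Sketch of a caret-range check like "^2.0.0": same major version,
# and not older than the declared minimum. Caret ranges only.
def satisfies_caret(server_version: str, requirement: str) -> bool:
    assert requirement.startswith("^")
    want = tuple(int(x) for x in requirement[1:].split("."))
    have = tuple(int(x) for x in server_version.split("."))
    return have[0] == want[0] and have >= want

print(satisfies_caret("2.1.0", "^2.0.0"))  # True: compatible minor bump
print(satisfies_caret("3.0.0", "^2.0.0"))  # False: refuse, report clearly
```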

MCP already lets servers advertise tool schemas — what's missing is a standardized version-negotiation story. The SEPs above are working toward this. Until they land, you're on your own.

What You Can Build Today

The spec will catch up eventually. In the meantime, here's what you can do without waiting for anyone.

Build the snapshot layer. On every MCP connection, hash the tool schemas. Compare against stored hashes. Alert on any change. This takes an afternoon to implement and will save you days of debugging silent failures.
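The hashing step is the whole afternoon. One detail worth getting right: hash a canonical serialization (sorted keys) so that key-order churn doesn't trigger false alarms. A sketch:

```python
# Sketch of the snapshot hash: canonical JSON (sorted keys, no whitespace)
# so that key-order changes don't produce false alarms.
import hashlib
import json

def schema_hash(schema: dict) -> str:
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:8]

old = {"query": {"type": "string"}, "max_results": {"type": "number"}}
new = {"search_query": {"type": "string"}, "max_results": {"type": "number"}}
print(schema_hash(old) == schema_hash(old))  # True: stable baseline
print(schema_hash(old) == schema_hash(new))  # False: drift detected, alert
```

Store the hash next to the full schema (as in the snapshot JSON earlier); the hash tells you *that* something changed, the stored schema tells you *what*.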

Run canary queries. Pick 2-3 known-good queries per tool. Run them on a schedule. Compare results against baselines. If the results change, investigate before your agents use the tool.

Log tool inputs and outputs. Not just for debugging — for drift detection. When you can see that your agent sent query: "test" and got 0 results when it used to get 5, the problem becomes visible. Most MCP failures are invisible by default. Logging makes them visible.

Pin tool versions where possible. If your MCP server supports versioned tools, pin to the version you tested against. If it doesn't — and most don't — the snapshot layer is your substitute for version pinning.

Use existing tooling. You don't have to build everything from scratch. Specmatic MCP Auto-Test already detects schema drift and automates regression testing for MCP servers. Tools like AgentAudit track schema changes continuously. The ecosystem is young, but it's not empty.

None of this is glamorous. Contract testing never is. But the alternative is agents that fail silently and confidently — and you finding out from a user who got a wrong answer, not from your monitoring.

Limitations

No validation layer is perfect. Schema diffs generate false positives when tools include non-semantic churn — reformatted descriptions, reordered fields, added-then-removed experimental params. Semantic shifts can't be auto-detected at all; canary queries help but won't catch every behavioral change. And blocking tools aggressively can degrade user experience if you don't have a fallback — either a previous pinned version, a safe-mode prompt that doesn't rely on the tool, or at minimum a clear message to the user explaining why the tool is unavailable.

The goal isn't zero drift. It's making drift visible so you can decide what to do about it, instead of finding out from a user who got a confidently wrong answer.

The New Dependency Hell

Schema drift is dependency hell for the agent era. In REST, we solved it with OpenAPI specs, contract testing, and semantic versioning. It took years, and we still mess it up. In MCP, we're at the "it works on my machine" stage — no standardized versioning, no contract testing, no breaking change detection.

The difference is that REST failures are loud. MCP failures are quiet. A broken REST endpoint gives you a 500 and a stack trace. A drifted MCP tool gives you a confident wrong answer and an agent that explains why the wrong answer is actually fine.

Schema drift has a security angle too — malicious schema expansion as a supply chain attack vector. This article focuses on the engineering side: accidental drift breaking agents silently. Different threat model, same root cause — nobody's tracking how tool schemas evolve.

An earlier post, Agents Lie to Each Other, described reasoning chain contamination — where uncertainty gets laundered into confidence across agent hops. Schema drift is the same class of bug, one layer down: unreliable communication contracts across system boundaries. That article was about agents corrupting each other's reasoning at the handoff layer. This one is about the tools underneath them silently changing the ground truth. Different layer, same pattern — agents build perfect reasoning on a broken foundation and never know.

We'll solve this. OpenAPI took years to become table stakes. MCP schema versioning will too — and with SEP-1400 and SEP-1575 in progress, it's already starting. In the meantime: if you ship MCP tools, reject unknown parameters by default. If you consume them, validate inputs and outputs on both sides, and run canaries. The question isn't whether schema drift will bite you — it's whether you'll find out from your monitoring or from a user who got a confident wrong answer.

Have you caught your agent lying about a tool failure? How did you find it? I'd love to hear war stories in the comments.


Schema drift becomes a multiplied risk when connecting to multiple MCP providers — as mcp-rubber-duck does, routing across different LLMs and tool servers. For the architecture that routes between those providers, see the fleet architecture article. For what happens when agents corrupt each other's reasoning, see Agents Lie to Each Other.
