If you maintain an MCP server, there is a class of breakage that no amount of unit testing will catch. Someone on your team renames a tool parameter from query to search_query, or rephrases a tool description from "Search the web" to "Search the web for recent results," and the change passes every test in the suite because nothing actually validates the protocol surface your server exposes to AI agents.
Schema drift of this kind is rarely visible right away. It accumulates quietly and surfaces as baffling agent failures — tools that stop being selected, arguments that arrive malformed, responses that get misinterpreted — precisely because MCP tool descriptions are not documentation in the traditional sense. They are instructions. The model reads them to decide when to call a tool, how to invoke it, and what to do with the result. A reworded description is not a cosmetic change. It is a change in the instruction set the model operates from.
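Both of the changes above land in the same place: the tool's entry in the server's tools/list response. A representative entry (field values illustrative) looks like:

```json
{
  "name": "search",
  "description": "Search the web for recent results",
  "inputSchema": {
    "type": "object",
    "properties": {
      "search_query": { "type": "string" }
    },
    "required": ["search_query"]
  }
}
```

Every string in that object is read by the model, and nothing in a typical build pipeline objects when one of them changes.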
The ecosystem has clearly started feeling this. Over the past few weeks, a small wave of projects has appeared — schema drift detectors, description diffing tools, supply chain auditors for tool surfaces — all trying to address some facet of the same underlying problem: MCP servers need the kind of regression testing discipline that REST APIs have enjoyed for the better part of a decade.
We solved the analogous problem for HTTP a long time ago
Consider how we test REST APIs today. We have OpenAPI specs that serve as a committed, diffable contract. We have Pact for consumer-driven contract testing. We have VCR.py and its equivalents in every language — record an HTTP exchange, commit the cassette to the repo, replay it in tests so you never depend on a live server. When the contract changes, the diff appears in your pull request, attributed to an author, ready for review.
MCP has none of this infrastructure yet. There are active proposals for tool semantic versioning in the spec, but nothing has shipped. There is no static artifact that describes a server's tool surface — when the server starts, it responds to tools/list with whatever it happens to have in memory at that moment. Nothing to commit, nothing to pin, nothing to diff against.
But the pattern that works for REST ought to work here too: record a known-good protocol exchange, commit it to the repository, and verify on every subsequent change that the server still produces the same output.
The difference lies in what you record. For REST, it is HTTP request/response pairs. For MCP, it is the full JSON-RPC lifecycle: the initialize handshake (where capabilities are negotiated), the tools/list response (where every tool schema lives), and the actual tools/call results (where behavioural regressions hide). Capture that entire exchange into a single artifact, and you have both a regression test and living documentation of your server's public interface.
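Concretely, the client side of that lifecycle is a short sequence of JSON-RPC messages (abridged — the initialized notification is omitted, and the protocol version string is illustrative):

```json
{"jsonrpc": "2.0", "id": 1, "method": "initialize",
 "params": {"protocolVersion": "2024-11-05", "capabilities": {},
            "clientInfo": {"name": "recorder", "version": "0.1.0"}}}
{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}
{"jsonrpc": "2.0", "id": 3, "method": "tools/call",
 "params": {"name": "search", "arguments": {"query": "mcp testing"}}}
```

Each of these, paired with the response the server actually sent, is what ends up in the recorded artifact.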
Record, commit, verify
I built mcp-recorder to implement this pattern. The mental model is VCR.py, applied to the MCP protocol rather than raw HTTP.
It works as a transparent proxy (for HTTP servers) or subprocess wrapper (for stdio servers) that captures the full MCP exchange into a JSON cassette file. That single recording unlocks two testing directions:
Record:  Client → mcp-recorder → Real Server → cassette.json
                  (HTTP or stdio)
Replay:  Client → mcp-recorder (mock) → cassette.json      (test your client)
Verify:  mcp-recorder (client mock) → Real Server          (test your server)
Replay serves recorded responses to your client without a real server — no credentials, no network, deterministic every time. Verify sends the recorded requests to your (possibly changed) server and diffs the actual responses against the golden recording. The verify output looks like this:
Verifying golden.json against node dist/index.js
1. initialize [PASS]
2. tools/list [PASS]
3. tools/call [search] [FAIL]
     $.result.content[0].text: "old output" != "new output"
4. tools/call [analyze] [PASS]
Result: 3/4 passed, 1 failed
Exit code is non-zero on any diff, so it plugs directly into CI.
The distinction from schema-only tools matters: because you are recording actual protocol exchanges rather than just comparing tools/list snapshots, you capture behavioural regression as well. If a tools/call used to return a specific error format and now returns something different, the cassette catches it. If capabilities that were previously advertised during initialize quietly disappear, the cassette catches that too. Schema diffing alone would miss both.
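The JSONPath-style locations in the verify output can be produced by an ordinary recursive comparison. A minimal sketch of the idea — not mcp-recorder's actual implementation:

```python
def json_diff(expected, actual, path="$"):
    """Recursively diff two JSON values, reporting JSONPath-style locations."""
    diffs = []
    if type(expected) is not type(actual):
        diffs.append(f"{path}: type changed "
                     f"({type(expected).__name__} != {type(actual).__name__})")
    elif isinstance(expected, dict):
        # Walk the union of keys so removals and additions both show up.
        for key in sorted(expected.keys() | actual.keys()):
            if key not in actual:
                diffs.append(f"{path}.{key}: missing in actual response")
            elif key not in expected:
                diffs.append(f"{path}.{key}: unexpected key in actual response")
            else:
                diffs.extend(json_diff(expected[key], actual[key], f"{path}.{key}"))
    elif isinstance(expected, list):
        if len(expected) != len(actual):
            diffs.append(f"{path}: length {len(expected)} != {len(actual)}")
        for i, (exp, act) in enumerate(zip(expected, actual)):
            diffs.extend(json_diff(exp, act, f"{path}[{i}]"))
    elif expected != actual:
        diffs.append(f'{path}: "{expected}" != "{actual}"')
    return diffs
```

The real tool's diffing is doubtless more involved, but the output shape — one path per divergence — is the idea.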
Both transports are first-class citizens. Most MCP servers in local development communicate over stdio — you spawn a subprocess and exchange JSON-RPC over stdin/stdout. Remote and cloud-hosted servers use HTTP (Streamable HTTP or SSE). The cassette format is identical regardless of transport:
# stdio — the typical case for locally developed MCP servers
mcp-recorder verify --cassette golden.json \
  --target-stdio "node dist/index.js"

# HTTP — remote or hosted servers
mcp-recorder verify --cassette golden.json \
  --target https://your-mcp-server.example.com
Applying this to a real server
To make this concrete rather than hypothetical, let's look at what it takes to add regression testing to an existing, production MCP server.
monday.com's MCP server is a good candidate. It is a TypeScript server exposing 20+ tools to AI agents — boards, items, updates, documents, workflows — and its only CI workflow at the time of writing is an npm publish step. There is no test that would catch a renamed tool, a removed parameter, or a changed description.
I submitted PR #222 to add schema regression testing. The entire integration consists of a scenarios file, a golden cassette directory, and one CI step. Here's the scenarios file in full:
schema_version: "1.0"
target:
  command: "node"
  args: ["packages/monday-api-mcp/dist/index.js"]
  env:
    MONDAY_TOKEN: "test-token"
scenarios:
  list_tools:
    description: "Capture all tool schemas, descriptions, and annotations"
    actions:
      - list_tools
When you run mcp-recorder record-scenarios scenarios.yml, it spawns the server as a subprocess, performs the MCP handshake, calls tools/list, and writes everything into a cassette. The MONDAY_TOKEN is set to a dummy value because initialize and tools/list don't validate the token — they simply enumerate the in-memory tool registry. No network calls, no secrets, no real API access required.
The resulting golden cassette is roughly 3,200 lines of JSON, capturing every tool's name, description, input schema, and annotations. Because this cassette is committed to the repo, it functions as living documentation — and more importantly, when someone opens a pull request that changes the tool surface, the diff tells you precisely what changed. "This PR added a required workspace_id parameter to get_items" or "this PR renamed the create_board tool" are not things you need to discover by reading source code — they appear as JSON diffs in the PR, ready for review.
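When such a change lands, review happens on an ordinary git diff of the cassette. An illustrative hunk for the workspace_id example (JSON layout and the board_id parameter are hypothetical):

```diff
   "name": "get_items",
   "inputSchema": {
     "type": "object",
-    "required": ["board_id"],
+    "required": ["board_id", "workspace_id"],
```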
I took a similar approach with Tavily's MCP server (a search API, 5 tools, 50+ parameters), but pushed it further by including actual tools/call invocations in the scenarios. Because the server is spawned without a TAVILY_API_KEY, tool calls hit the API key validation guard and return a deterministic McpError — which means the cassette captures not only the full schema surface but also the error contract. If someone changes the error message format or the error code, the cassette catches it.
Both integrations are fully additive — no existing files were modified.
Try it yourself
The quickest way to see this in action is against a public demo server. Save the following as scenarios.yml:
schema_version: "1.0"
target: https://mcp.devhelm.io
scenarios:
  demo_walkthrough:
    description: "Record tool schemas and a sample tool call"
    actions:
      - list_tools
      - list_resources
      - call_tool:
          name: add
          arguments: { a: 2, b: 3 }
      - call_tool:
          name: greet
          arguments: { name: "world", style: "pirate" }
Then:
pip install mcp-recorder
# Record cassettes from the scenarios file
mcp-recorder record-scenarios scenarios.yml
# Inspect what was captured
mcp-recorder inspect cassettes/demo_walkthrough.json
# Verify — should pass against the same server
mcp-recorder verify \
  --cassette cassettes/demo_walkthrough.json \
  --target https://mcp.devhelm.io
For your own server, the pattern is the same. Write a scenarios file pointing at your stdio command or HTTP URL, record, commit the cassettes, and add a verify step to CI:
# .github/workflows/mcp-regression.yml (the relevant step)
- run: pip install mcp-recorder
- run: |
    mcp-recorder verify \
      --cassette cassettes/tools_and_schemas.json \
      --target-stdio "node dist/index.js"
If you are working in a Python project, the pytest plugin activates automatically on install. Each test gets an isolated replay server on a random port:
import pytest

@pytest.mark.mcp_cassette("cassettes/golden.json")
def test_no_regression(mcp_verify_result):
    assert mcp_verify_result.failed == 0, mcp_verify_result.results
When a change is intentional — you genuinely meant to rename that tool — update the cassette with --update and the new snapshot becomes the baseline.
The cassette as contract
Once you commit a cassette, something quietly useful emerges: your git history becomes an audit trail of your MCP server's public interface. Every tool rename, every schema change, every description edit appears as a diff, attributed to an author, tied to a pull request. You did not set out to build a changelog of your tool surface, but you have one.
This has a natural extension for the other side of the relationship. If you consume an MCP server you do not control — a third-party integration, a vendor API — the same approach works in reverse. Record what the server exposes today, run verify on a schedule, and detect when the upstream shifts before your agents do.
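A sketch of that consumer-side check as a scheduled GitHub Actions job — the cassette path, vendor URL, and schedule are placeholders:

```yaml
# .github/workflows/upstream-drift.yml
on:
  schedule:
    - cron: "0 6 * * *"  # check the upstream tool surface daily
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install mcp-recorder
      - run: |
          mcp-recorder verify \
            --cassette cassettes/vendor-server.json \
            --target https://vendor-mcp.example.com
```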
The MCP spec does not yet have a mechanism for pinning tool versions — the proposals are still under discussion. Until something ships, a committed cassette is the closest thing to a pinned contract.
mcp-recorder is MIT-licensed and on PyPI. I would be glad to hear what works and what does not — issues and pull requests are welcome.
We're working on more tooling for MCP and agent reliability — sign up for updates at devhelm.io.