Most MCP servers ship with exactly one test: a developer typing a prompt into Claude and checking if the output looks right.
That is not testing. That is hoping. And it breaks the moment you change a tool signature, add a parameter, or update your database schema.
## Why MCP Servers Are Hard to Test
MCP servers sit between deterministic code and non-deterministic LLMs. Your tools are pure functions — they take inputs, return outputs. But the consumers of those tools are language models that interpret schemas, pick tools based on descriptions, and pass arguments based on inference.
This creates a testing gap. Traditional unit tests cover your business logic. LLM integration tests are slow, expensive, and non-deterministic. The four patterns below close that gap using real MCP protocol interactions — without calling an LLM.
## Pattern 1: In-Memory Unit Tests With FastMCP Client
FastMCP 2.x includes a Client class that connects directly to your server in-memory. No subprocess. No network. No LLM. You test your actual server logic through the real MCP protocol — in milliseconds.
Install the testing dependency:

```bash
pip install pytest-asyncio
```
Configure pytest to handle async tests automatically in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
asyncio_mode = "auto"
```
Here is a server with two tools:
```python
# server.py
from fastmcp import FastMCP

mcp = FastMCP("WeatherServer")


@mcp.tool()
def get_forecast(city: str, days: int = 3) -> dict:
    """Get weather forecast for a city."""
    if days < 1 or days > 14:
        raise ValueError(f"Days must be 1-14, got {days}")
    return {
        "city": city,
        "days": days,
        "forecast": [{"day": i + 1, "temp_c": 20 + i} for i in range(days)],
    }


@mcp.tool()
def get_alerts(region: str) -> list[str]:
    """Get active weather alerts for a region."""
    alerts_db = {"northwest": ["Wind advisory until 6 PM"]}
    return alerts_db.get(region.lower(), [])
```
Now write the tests. Create a pytest fixture that wraps your server in a `Client`:
```python
# test_server.py
import pytest
from fastmcp import Client

from server import mcp  # your FastMCP server instance


@pytest.fixture
async def client():
    async with Client(transport=mcp) as c:
        yield c


async def test_forecast_returns_correct_days(client):
    result = await client.call_tool("get_forecast", {"city": "Seattle", "days": 5})
    data = result.data
    assert data["city"] == "Seattle"
    assert len(data["forecast"]) == 5


async def test_forecast_default_days(client):
    result = await client.call_tool("get_forecast", {"city": "Portland"})
    assert len(result.data["forecast"]) == 3


async def test_forecast_invalid_days_raises(client):
    with pytest.raises(Exception):
        await client.call_tool("get_forecast", {"city": "Seattle", "days": 30})


async def test_alerts_existing_region(client):
    result = await client.call_tool("get_alerts", {"region": "Northwest"})
    assert len(result.data) == 1
    assert "Wind advisory" in result.data[0]


async def test_alerts_unknown_region(client):
    result = await client.call_tool("get_alerts", {"region": "Antarctica"})
    assert result.data == []
```
Run with `pytest -v`. Every test executes in milliseconds because it uses in-memory transport. No HTTP, no subprocess, no LLM calls.
This pattern catches three categories of bugs immediately: broken tool logic, wrong return types, and missing error handling.
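The wrong-return-type category deserves extra attention, because `result.data` is just deserialized JSON and a shape change slips through assertions that only check one field. One approach is a small shape-checking helper shared across tests; here is a minimal stdlib sketch (the helper name and the pinned shape are illustrative, not part of FastMCP):

```python
def assert_forecast_shape(data: dict) -> None:
    """Fail loudly if a forecast payload drifts from the documented shape."""
    assert isinstance(data, dict), f"expected dict, got {type(data).__name__}"
    assert isinstance(data.get("city"), str), "city must be a string"
    assert isinstance(data.get("days"), int), "days must be an integer"
    forecast = data.get("forecast")
    assert isinstance(forecast, list), "forecast must be a list"
    assert len(forecast) == data["days"], "forecast length must match days"
    for entry in forecast:
        assert isinstance(entry.get("day"), int), "each entry needs an int day"
        assert isinstance(entry.get("temp_c"), (int, float)), "temp_c must be numeric"


# Example: the shape produced by get_forecast("Seattle", days=2)
assert_forecast_shape(
    {
        "city": "Seattle",
        "days": 2,
        "forecast": [{"day": 1, "temp_c": 20}, {"day": 2, "temp_c": 21}],
    }
)
```

A test would simply call `assert_forecast_shape(result.data)` after `call_tool`, so every forecast test also pins the full contract.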
## Pattern 2: Schema Validation Tests
Your MCP tools expose JSON schemas to LLMs. If the schema drifts from your implementation — a renamed parameter, a changed type, a missing description — the LLM picks the wrong tool or passes wrong arguments. Schema tests lock this contract down.
```python
async def test_tool_registry_completeness(client):
    """Verify all expected tools are registered."""
    tools = await client.list_tools()
    tool_names = {t.name for t in tools}
    assert tool_names == {"get_forecast", "get_alerts"}


async def test_forecast_schema_has_required_params(client):
    """Verify the forecast tool schema matches expectations."""
    tools = await client.list_tools()
    forecast = next(t for t in tools if t.name == "get_forecast")
    schema = forecast.inputSchema
    assert "city" in schema["properties"]
    assert schema["properties"]["city"]["type"] == "string"
    assert "city" in schema.get("required", [])


async def test_all_tools_have_descriptions(client):
    """LLMs select tools based on descriptions. Missing = broken routing."""
    tools = await client.list_tools()
    for tool in tools:
        assert tool.description, f"Tool '{tool.name}' has no description"
        assert len(tool.description) > 10, (
            f"Tool '{tool.name}' description too short: '{tool.description}'"
        )
```
Schema tests catch a specific class of failure that unit tests miss entirely: your code works, but the LLM cannot use it. This happens more often than you think. A developer renames a parameter from `city` to `location`. The tool still works. The schema updates automatically. But every prompt template, every LLM workflow, and every agent built against the old schema now sends `city` and gets a validation error.
Schema tests make this failure loud and immediate. When the parameter name changes, `test_forecast_schema_has_required_params` fails in CI. The developer sees the break before it ships.
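To make renames like `city` to `location` impossible to miss, you can also diff the live schema against a pinned copy and report every difference at once. A minimal stdlib sketch (the helper name and pinned dicts are illustrative, not a FastMCP API; the live schema would come from `tool.inputSchema`):

```python
def schema_drift(pinned: dict, live: dict) -> list[str]:
    """Return human-readable differences between two tool input schemas."""
    problems = []
    pinned_props = set(pinned.get("properties", {}))
    live_props = set(live.get("properties", {}))
    # A rename shows up as one removal plus one addition
    for name in sorted(pinned_props - live_props):
        problems.append(f"parameter removed or renamed: {name}")
    for name in sorted(live_props - pinned_props):
        problems.append(f"unexpected new parameter: {name}")
    for name in sorted(pinned_props & live_props):
        p_type = pinned["properties"][name].get("type")
        l_type = live["properties"][name].get("type")
        if p_type != l_type:
            problems.append(f"type changed for {name}: {p_type} -> {l_type}")
    if set(pinned.get("required", [])) != set(live.get("required", [])):
        problems.append("required parameters changed")
    return problems


# The rename scenario from above: `city` becomes `location`
pinned = {"properties": {"city": {"type": "string"}}, "required": ["city"]}
live = {"properties": {"location": {"type": "string"}}, "required": ["location"]}
assert schema_drift(pinned, live) == [
    "parameter removed or renamed: city",
    "unexpected new parameter: location",
    "required parameters changed",
]
```

A test can then assert `schema_drift(pinned, tool.inputSchema) == []`, which produces a readable failure message listing every drifted field.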
If your server exposes resources alongside tools, test those registrations too:
```python
async def test_resources_are_registered(client):
    """Verify static resources are accessible."""
    resources = await client.list_resources()
    assert len(resources) > 0, "Server exposes no resources"
```
Run schema tests in CI on every commit. Schema drift is silent and devastating — these tests make it visible.
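One way to wire this into CI is a golden-file check: dump every tool's input schema into a committed JSON file and compare on each run. A stdlib sketch under assumed names (`check_against_golden` and `tool_schemas.json` are illustrative; in a real suite `live_schemas` would be built from `client.list_tools()`):

```python
import json
import tempfile
from pathlib import Path


def check_against_golden(live_schemas: dict, golden_path: Path) -> None:
    """Fail if live tool schemas differ from the committed golden file.

    On the very first run the golden file does not exist yet, so we write it;
    committing that file pins the contract for every later CI run.
    """
    if not golden_path.exists():
        golden_path.write_text(json.dumps(live_schemas, indent=2, sort_keys=True))
        return
    golden = json.loads(golden_path.read_text())
    assert live_schemas == golden, (
        "Schema drift detected. If the change is intentional, delete "
        f"{golden_path} and re-run to regenerate it."
    )


# Example with a temporary directory standing in for the repo
with tempfile.TemporaryDirectory() as tmp:
    golden_file = Path(tmp) / "tool_schemas.json"
    schemas = {"get_forecast": {"properties": {"city": {"type": "string"}}}}
    check_against_golden(schemas, golden_file)  # first run: writes the file
    check_against_golden(schemas, golden_file)  # second run: passes
```

The regeneration step being manual is the point: a schema change forces a human to look at the diff before CI goes green again.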
## Pattern 3: Parameterized Edge Case Testing
MCP tools receive arguments from LLMs. LLMs send unexpected inputs — empty strings, extreme values, wrong types. Parameterized tests cover these systematically.
```python
@pytest.mark.parametrize(
    "city, days, expected_count",
    [
        ("Seattle", 1, 1),
        ("Tokyo", 7, 7),
        ("São Paulo", 14, 14),
        ("New York", 3, 3),
    ],
)
async def test_forecast_valid_ranges(client, city, days, expected_count):
    result = await client.call_tool("get_forecast", {"city": city, "days": days})
    assert len(result.data["forecast"]) == expected_count


@pytest.mark.parametrize("invalid_days", [0, -1, 15, 100])
async def test_forecast_rejects_invalid_days(client, invalid_days):
    with pytest.raises(Exception):
        await client.call_tool(
            "get_forecast", {"city": "Seattle", "days": invalid_days}
        )


@pytest.mark.parametrize(
    "region, has_alerts",
    [
        ("Northwest", True),
        ("northwest", True),
        ("NORTHWEST", True),
        ("southeast", False),
        ("", False),
    ],
)
async def test_alerts_case_insensitive(client, region, has_alerts):
    result = await client.call_tool("get_alerts", {"region": region})
    if has_alerts:
        assert len(result.data) > 0
    else:
        assert len(result.data) == 0
```
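FastMCP validates argument types against the generated schema, but tools often want to be forgiving about borderline values an LLM sends, such as `days` arriving as the string `"5"` or as whitespace. A stdlib sketch of that kind of normalization (`coerce_days` is a hypothetical helper, not part of the server above or of FastMCP):

```python
def coerce_days(value, default: int = 3) -> int:
    """Coerce an LLM-supplied `days` argument into a valid int, or raise."""
    if value is None:
        return default
    if isinstance(value, str):
        value = value.strip()
        if not value:
            return default  # treat empty/whitespace as "use the default"
        if not value.lstrip("-").isdigit():
            raise ValueError(f"days must be an integer, got {value!r}")
        value = int(value)
    # bool is a subclass of int, so reject it explicitly
    if not isinstance(value, int) or isinstance(value, bool):
        raise ValueError(f"days must be an integer, got {type(value).__name__}")
    if not 1 <= value <= 14:
        raise ValueError(f"Days must be 1-14, got {value}")
    return value


assert coerce_days(None) == 3
assert coerce_days("7") == 7
```

A parameterized test can then sweep `None`, `""`, `"7"`, `True`, and out-of-range values through the helper exactly like the `invalid_days` cases above.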
For complex return structures, the inline-snapshot library eliminates manual assertion writing. Install it with `pip install inline-snapshot`, then write:
```python
from inline_snapshot import snapshot


async def test_forecast_structure(client):
    result = await client.call_tool("get_forecast", {"city": "Seattle", "days": 2})
    assert result.data == snapshot()
```
Run `pytest --inline-snapshot=fix,create` once. The library fills in the expected value automatically from the actual output. On subsequent runs, it asserts against the stored snapshot. When your output changes intentionally, run `pytest --inline-snapshot=fix` to update.
## Pattern 4: Interactive Testing With MCP Inspector
Unit tests verify code paths. But sometimes you need to see what the LLM sees — the exact schemas, the raw responses, the protocol messages. MCP Inspector is the official visual testing tool for this.
Launch it against your server:
```bash
# For a Python MCP server
npx @modelcontextprotocol/inspector uv --directory ./my-server run my-server

# For a published PyPI package
npx @modelcontextprotocol/inspector uvx mcp-server-git --repository ~/code/repo.git
```
This opens a web UI at `http://localhost:6274` that connects to your MCP server through a local proxy. From the Inspector, you can:
- Browse all tools — see the JSON schema exactly as an LLM receives it
- Call any tool — fill in parameters through a form, see the raw response
- Inspect resources — view static context your server exposes
- Test prompts — verify prompt templates render correctly
MCP Inspector serves a different purpose than pytest. Use it for:
- Exploratory testing during development — try edge cases manually
- Schema review — verify descriptions are clear enough for LLM tool selection
- Debugging failures — reproduce exact inputs that caused production issues
- Demo and documentation — show stakeholders what your server exposes
The Inspector does not replace automated tests. It complements them. Write pytest tests for regression coverage. Use Inspector for exploration and debugging.
## A Practical Test Strategy
Combine all four patterns into a layered strategy:
- **Layer 1: In-memory unit tests (Pattern 1)** → run on every save. Sub-second feedback. Catches logic bugs.
- **Layer 2: Schema validation tests (Pattern 2)** → run in CI on every commit. Catches contract drift.
- **Layer 3: Parameterized edge cases (Pattern 3)** → run in CI. Catches boundary failures and type handling.
- **Layer 4: MCP Inspector (Pattern 4)** → use during development. Manual exploration and debugging.
Your `pyproject.toml` test configuration:
```toml
[tool.pytest.ini_options]
asyncio_mode = "auto"
markers = [
    "schema: schema validation tests",
    "edge: edge case and boundary tests",
]
```
Run fast tests during development:

```bash
pytest -v -m "not schema and not edge"
```

Run everything in CI:

```bash
pytest -v --tb=short
```
## What This Costs You
Setting up in-memory MCP tests takes about 30 minutes for an existing server. The fixture is 5 lines. Each test is 3-6 lines. You get sub-second feedback on every change.
Compare that to the alternative: a user discovers your tool returns the wrong type, the LLM hallucinates a workaround, and you spend two hours debugging a production trace.
Thirty minutes of test setup prevents hours of production debugging. That trade is worth it every time.
The MCP ecosystem is growing fast. As of March 2026, thousands of MCP servers exist on npm and PyPI, and the number is accelerating. Most of them have zero automated tests. If you ship yours with a proper test suite, you are already ahead of 90% of the ecosystem. More importantly, your users will trust your server because it actually works when they upgrade.
## Top comments
Spot on with the testing patterns! However, the real headache in testing MCP servers usually lies in the transport layer (especially when dealing with stdio vs SSE) and the non-deterministic nature of the client-side LLM.
When applying these 4 patterns, how do you reliably mock the client context injection and simulate multi-step, complex tool executions without making the test suite inherently flaky? Do you have a specific approach for strictly isolating the transport layer from the tool execution state?
Great point on transport being the real pain — we bypass it entirely by testing at the protocol level with `ClientSession` against an in-memory server, skipping stdio/SSE. For LLM non-determinism, we assert on tool call structure, not content.

Transport layer isolation is the key challenge you're pointing at. The approach that works: test tool logic with direct function calls (no transport), then test transport separately with a deterministic harness that replays recorded stdio/SSE sequences.
For the LLM non-determinism, skip it entirely in unit tests. Mock the client context injection with fixed payloads that represent realistic tool call chains. Your test asserts that given input X, the server produces output Y — the LLM is out of scope at that layer.
For multi-step tool executions, a state snapshot approach helps: capture server state after each step, assert invariants between steps, reset to known state before each test. That eliminates flakiness from accumulated state.
The transport-vs-execution split matters because stdio has buffering quirks that SSE doesn't. Testing them together means a stdio buffer flush timing issue looks like a tool logic bug. Separate the layers, test each in isolation, then run a small set of integration tests that exercise the full path.
Transport layer testing is genuinely the hardest part. For stdio, we wrap the server process in a subprocess fixture that captures stdin/stdout directly — no mocking the transport, just exercising the real pipe. For SSE, a lightweight test HTTP server that records events works better than trying to mock the streaming connection. The LLM non-determinism side we handle by testing the tool dispatch layer separately from the model — assert that given a specific tool call payload, the server returns the right structured result. Keeps the tests deterministic where it matters.
Transport isolation is the hardest part. What works: abstract the transport behind an interface so tool tests never see stdio/SSE at all — they get typed input and produce typed output. For multi-step executions, define the expected tool call sequence as fixtures and assert against that sequence deterministically. The flakiness usually comes from mixing transport concerns with tool logic in the same test. Separate those layers and each becomes independently testable.