DEV Community

Cover image for pytest-aitest: Unit Tests Can't Test Your MCP Server. AI Can.
Stefan Broenner
Stefan Broenner

Posted on

pytest-aitest: Unit Tests Can't Test Your MCP Server. AI Can.

I Learned This the Hard Way

I built two MCP servers — Excel MCP Server and Windows MCP Server. Both had solid test suites. Both broke the moment a real LLM tried to use them.

I spent weeks doing manual testing with GitHub Copilot. Open a chat, type a prompt, watch the LLM pick the wrong tool, tweak the description, try again. Sometimes the design was fundamentally broken and I spent weeks on a wild goose chase before realizing the whole approach needed rethinking.

The failure modes were always the same:

  • The LLM picks the wrong tool out of 15 similar-sounding options
  • It passes {"account_id": "checking"} when the parameter is account
  • It ignores the system prompt entirely
  • It asks the user "Would you like me to do that?" instead of just doing it

Why? Because I tested the code, not the AI interface.

For LLMs, your API isn't functions and types — it's tool descriptions, parameter schemas, and system prompts. That's what the model actually reads. No compiler catches a bad tool description. No unit test validates that an LLM will pick the right tool. And if you also inject Agent Skills — do they actually help? Or make things worse? Do LLMs really behave the way you think they will?

(No. They don't.)

So I built pytest-aitest, heavily inspired by agent-benchmark by Dmytro Mykhaliev.

It's a pytest plugin — uv add pytest-aitest and you're done. No new CLI, no new syntax. Works with your existing fixtures, markers, and CI/CD pipelines.

Write Tests as Prompts

Your test is a prompt. Write what a user would say. Let the LLM figure out how to use your tools. Assert on what happened.

from pytest_aitest import Agent, Provider, MCPServer

async def test_balance_query(aitest_run):
    agent = Agent(
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[MCPServer(command=["python", "-m", "my_banking_server"])],
    )

    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success
    assert result.tool_was_called("get_balance")
Enter fullscreen mode Exit fullscreen mode

If this fails, the problem isn't your code — it's your tool description. The LLM couldn't figure out which tool to call or what parameters to pass. Fix the description, run again. This is TDD for AI interfaces.

The Red/Green/Refactor Cycle — For Tool Descriptions

🔴 Red: Write a failing test

async def test_transfer(aitest_run):
    result = await aitest_run(agent, "Move $200 from checking to savings")
    assert result.tool_was_called("transfer")
Enter fullscreen mode Exit fullscreen mode

The LLM reads your tool descriptions, gets confused, calls the wrong thing. Test fails.

🟢 Green: Fix the interface

# Before — too vague
@mcp.tool()
def transfer(from_acct: str, to_acct: str, amount: float) -> str:
    """Transfer money."""

# After — the LLM knows exactly what to do
@mcp.tool()
def transfer(from_account: str, to_account: str, amount: float) -> str:
    """Transfer money between accounts (checking, savings).
    Amount must be positive. Returns new balances for both accounts."""
Enter fullscreen mode Exit fullscreen mode

Run again. Test passes.

🔄 Refactor: Let AI analysis tell you what else to fix

This is where it gets interesting. pytest-aitest doesn't just tell you pass/fail — it runs a second LLM that analyzes every failure and tells you why it happened and what to improve. Traditional testing requires a human to interpret failures. Here, the AI does it:

Screenshot of pytest-aitest report showing deploy recommendation for gpt-5-mini, pass rate comparison across models, cost metrics, and AI-generated failure analysis

The report tells you which model to deploy, why it wins, and what to fix. It analyzes cost efficiency, tool usage patterns, and prompt effectiveness across all your configurations. Unused tools? The AI flags them. Prompt causing permission-seeking behavior? It explains the mechanism. See a full sample report →

Compare Models, Prompts, and Server Versions

The real power is comparison. Test multiple configurations against the same test suite:

MODELS = ["gpt-5-mini", "gpt-4.1"]
PROMPTS = {"brief": "Be concise.", "detailed": "Explain your reasoning."}

AGENTS = [
    Agent(
        name=f"{model}-{prompt_name}",
        provider=Provider(model=f"azure/{model}"),
        mcp_servers=[banking_server],
        system_prompt=prompt,
    )
    for model in MODELS
    for prompt_name, prompt in PROMPTS.items()
]

@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
Enter fullscreen mode Exit fullscreen mode

4 configurations. Same tests. The report generates an Agent Leaderboard — winner by pass rate, then cost as tiebreaker:

Agent Pass Rate Tokens Cost
gpt-5-mini-brief 100% 747 $0.002
gpt-4.1-brief 100% 560 $0.008
gpt-5-mini-detailed 100% 1,203 $0.004

Deploy: gpt-5-mini (brief prompt) — 100% pass rate at lowest cost.

The same pattern works for A/B testing server versions (did your refactor break tool discoverability?), comparing system prompts, and measuring the impact of Agent Skills.

Multi-Turn Sessions

Real users don't ask one question. They have conversations:

@pytest.mark.session("banking-chat")
class TestBankingConversation:
    async def test_check_balance(self, aitest_run):
        result = await aitest_run(agent, "What's my checking balance?")
        assert result.success

    async def test_transfer(self, aitest_run):
        # Agent remembers we were talking about checking
        result = await aitest_run(agent, "Transfer $200 to savings")
        assert result.tool_was_called("transfer")

    async def test_verify(self, aitest_run):
        # Agent remembers the transfer
        result = await aitest_run(agent, "What are my new balances?")
        assert result.success
Enter fullscreen mode Exit fullscreen mode

Tests share conversation history. The report shows the full session flow with sequence diagrams.

Who This Is For

  • MCP server authors — Validate that LLMs can actually use your tools, not just that the code works
  • Agent builders — Find the cheapest model + prompt combo that passes your test suite
  • Teams shipping AI products — Gate deployments on LLM-facing regression tests in CI/CD

Works with 100+ LLM providers via LiteLLM — Azure, OpenAI, Anthropic, Google, local models, whatever you're running.

The Key Insight

The test is a prompt. The LLM is the test harness. The report tells you what to fix.

Traditional testing validates that your code works. pytest-aitest validates that an LLM can understand and use your code. These are different things, and the gap between them is where your production bugs live.

Your tool descriptions are an API. Test them like one.

Get Started

pytest-aitest is open source. Contributions welcome!

Star pytest-aitest on GitHub

GitHub logo sbroenne / pytest-aitest

A pytest plugin for validating whether language models can actually understand and operate your interfaces: MCP servers, system prompts, agent skills and tools.

pytest-aitest

PyPI version Python versions CI License: MIT

Test your AI interfaces. AI analyzes your results.

A pytest plugin for test-driven development of MCP servers, tools, prompts, and skills. Write tests first. Let the AI analysis drive your design.

Why?

Your MCP server passes all unit tests. Then an LLM tries to use it and picks the wrong tool, passes garbage parameters, or ignores your system prompt.

Because you tested the code, not the AI interface. For LLMs, your API is tool descriptions, schemas, and prompts — not functions and types. No compiler catches a bad tool description. No linter flags a confusing schema. Traditional tests can't validate them.

How It Works

So I built pytest-aitest: write tests as natural language prompts. An Agent bundles an LLM with your tools — you assert on what happened:

from pytest_aitest import Agent, Provider, MCPServer
async def test_balance_query(aitest_run):
    agent = Agent(
        provider=Provider
Enter fullscreen mode Exit fullscreen mode

Top comments (0)