DEV Community

Cover image for pytest-aitest: Unit Tests Can't Test Your MCP Server. AI Can.
Stefan Broenner
Stefan Broenner

Posted on

pytest-aitest: Unit Tests Can't Test Your MCP Server. AI Can.

I Learned This the Hard Way

I built two MCP servers — Excel MCP Server and Windows MCP Server. Both had solid test suites. Both broke the moment a real LLM tried to use them.

I spent weeks doing manual testing with GitHub Copilot. Open a chat, type a prompt, watch the LLM pick the wrong tool, tweak the description, try again. Sometimes the design was fundamentally broken and I spent weeks on a wild goose chase before realizing the whole approach needed rethinking.

The failure modes were always the same:

  • The LLM picks the wrong tool out of 15 similar-sounding options
  • It passes {"account_id": "checking"} when the parameter is account
  • It ignores the system prompt entirely
  • It asks the user "Would you like me to do that?" instead of just doing it

Why? Because I tested the code, not the AI interface.

For LLMs, your API isn't functions and types — it's tool descriptions, parameter schemas, and system prompts. That's what the model actually reads. No compiler catches a bad tool description. No unit test validates that an LLM will pick the right tool. And if you also inject Agent Skills — do they actually help? Or make things worse? Do LLMs really behave the way you think they will?

(No. They don't.)

So I built pytest-aitest, heavily inspired by agent-benchmark by Dmytro Mykhaliev.

It's a pytest plugin — uv add pytest-aitest and you're done. No new CLI, no new syntax. Works with your existing fixtures, markers, and CI/CD pipelines.

Write Tests as Prompts

Your test is a prompt. Write what a user would say. Let the LLM figure out how to use your tools. Assert on what happened.

from pytest_aitest import Agent, Provider, MCPServer

async def test_balance_query(aitest_run):
    agent = Agent(
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[MCPServer(command=["python", "-m", "my_banking_server"])],
    )

    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success
    assert result.tool_was_called("get_balance")
Enter fullscreen mode Exit fullscreen mode

If this fails, the problem isn't your code — it's your tool description. The LLM couldn't figure out which tool to call or what parameters to pass. Fix the description, run again. This is TDD for AI interfaces.

The Red/Green/Refactor Cycle — For Tool Descriptions

🔴 Red: Write a failing test

async def test_transfer(aitest_run):
    result = await aitest_run(agent, "Move $200 from checking to savings")
    assert result.tool_was_called("transfer")
Enter fullscreen mode Exit fullscreen mode

The LLM reads your tool descriptions, gets confused, calls the wrong thing. Test fails.

🟢 Green: Fix the interface

# Before — too vague
@mcp.tool()
def transfer(from_acct: str, to_acct: str, amount: float) -> str:
    """Transfer money."""

# After — the LLM knows exactly what to do
@mcp.tool()
def transfer(from_account: str, to_account: str, amount: float) -> str:
    """Transfer money between accounts (checking, savings).
    Amount must be positive. Returns new balances for both accounts."""
Enter fullscreen mode Exit fullscreen mode

Run again. Test passes.

🔄 Refactor: Let AI analysis tell you what else to fix

This is where it gets interesting. pytest-aitest doesn't just tell you pass/fail — it runs a second LLM that analyzes every failure and tells you why it happened and what to improve. Traditional testing requires a human to interpret failures. Here, the AI does it:

Screenshot of pytest-aitest report showing deploy recommendation for gpt-5-mini, pass rate comparison across models, cost metrics, and AI-generated failure analysis

The report tells you which model to deploy, why it wins, and what to fix. It analyzes cost efficiency, tool usage patterns, and prompt effectiveness across all your configurations. Unused tools? The AI flags them. Prompt causing permission-seeking behavior? It explains the mechanism. See a full sample report →

Compare Models, Prompts, and Server Versions

The real power is comparison. Test multiple configurations against the same test suite:

MODELS = ["gpt-5-mini", "gpt-4.1"]
PROMPTS = {"brief": "Be concise.", "detailed": "Explain your reasoning."}

AGENTS = [
    Agent(
        name=f"{model}-{prompt_name}",
        provider=Provider(model=f"azure/{model}"),
        mcp_servers=[banking_server],
        system_prompt=prompt,
    )
    for model in MODELS
    for prompt_name, prompt in PROMPTS.items()
]

@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
Enter fullscreen mode Exit fullscreen mode

4 configurations. Same tests. The report generates an Agent Leaderboard — winner by pass rate, then cost as tiebreaker:

Agent Pass Rate Tokens Cost
gpt-5-mini-brief 100% 747 $0.002
gpt-4.1-brief 100% 560 $0.008
gpt-5-mini-detailed 100% 1,203 $0.004

Deploy: gpt-5-mini (brief prompt) — 100% pass rate at lowest cost.

The same pattern works for A/B testing server versions (did your refactor break tool discoverability?), comparing system prompts, and measuring the impact of Agent Skills.

Multi-Turn Sessions

Real users don't ask one question. They have conversations:

@pytest.mark.session("banking-chat")
class TestBankingConversation:
    async def test_check_balance(self, aitest_run):
        result = await aitest_run(agent, "What's my checking balance?")
        assert result.success

    async def test_transfer(self, aitest_run):
        # Agent remembers we were talking about checking
        result = await aitest_run(agent, "Transfer $200 to savings")
        assert result.tool_was_called("transfer")

    async def test_verify(self, aitest_run):
        # Agent remembers the transfer
        result = await aitest_run(agent, "What are my new balances?")
        assert result.success
Enter fullscreen mode Exit fullscreen mode

Tests share conversation history. The report shows the full session flow with sequence diagrams.

Who This Is For

  • MCP server authors — Validate that LLMs can actually use your tools, not just that the code works
  • Agent builders — Find the cheapest model + prompt combo that passes your test suite
  • Teams shipping AI products — Gate deployments on LLM-facing regression tests in CI/CD

Works with 100+ LLM providers via LiteLLM — Azure, OpenAI, Anthropic, Google, local models, whatever you're running.

The Key Insight

The test is a prompt. The LLM is the test harness. The report tells you what to fix.

Traditional testing validates that your code works. pytest-aitest validates that an LLM can understand and use your code. These are different things, and the gap between them is where your production bugs live.

Your tool descriptions are an API. Test them like one.

Get Started

pytest-aitest is open source. Contributions welcome!

Star pytest-aitest on GitHub

GitHub logo sbroenne / pytest-aitest

The testing framework for skill engineering. Test tool descriptions, prompt templates, agent skills, and custom agents with real LLMs. AI analyzes results and tells you what to fix.

🗄️ This project is archived and no longer maintained. It has been replaced by pytest-skill-engineering. Do not use this project for new work. This repository is kept as a read-only archive.

pytest-aitest

PyPI version Python versions CI License: MIT

Skill Engineering. Test-driven. AI-analyzed.

A pytest plugin for skill engineering — test your MCP server tools, prompt templates, agent skills, and custom agents with real LLMs. Red/Green/Refactor for the skill stack. Let AI analysis tell you what to fix.

Why?

Modern AI systems are built on skill engineering — the discipline of designing modular, reliable, callable capabilities that an LLM can discover, invoke, and orchestrate to perform real tasks. Skills are what separate "text generator" from "agent that actually does things."

An MCP server is the runtime for those skills. It doesn't ship alone — it comes bundled with the full skill engineering stack: tools (callable functions), prompt templates (server-side reasoning starters), agent skills (domain knowledge…

Top comments (0)