I Learned This the Hard Way
I built two MCP servers — Excel MCP Server and Windows MCP Server. Both had solid test suites. Both broke the moment a real LLM tried to use them.
I spent weeks doing manual testing with GitHub Copilot. Open a chat, type a prompt, watch the LLM pick the wrong tool, tweak the description, try again. Sometimes the design was fundamentally broken and I spent weeks on a wild goose chase before realizing the whole approach needed rethinking.
The failure modes were always the same:
- The LLM picks the wrong tool out of 15 similar-sounding options
- It passes
{"account_id": "checking"}when the parameter isaccount - It ignores the system prompt entirely
- It asks the user "Would you like me to do that?" instead of just doing it
Why? Because I tested the code, not the AI interface.
For LLMs, your API isn't functions and types — it's tool descriptions, parameter schemas, and system prompts. That's what the model actually reads. No compiler catches a bad tool description. No unit test validates that an LLM will pick the right tool. And if you also inject Agent Skills — do they actually help? Or make things worse? Do LLMs really behave the way you think they will?
(No. They don't.)
So I built pytest-aitest, heavily inspired by agent-benchmark by Dmytro Mykhaliev.
It's a pytest plugin — uv add pytest-aitest and you're done. No new CLI, no new syntax. Works with your existing fixtures, markers, and CI/CD pipelines.
Write Tests as Prompts
Your test is a prompt. Write what a user would say. Let the LLM figure out how to use your tools. Assert on what happened.
from pytest_aitest import Agent, Provider, MCPServer
async def test_balance_query(aitest_run):
agent = Agent(
provider=Provider(model="azure/gpt-5-mini"),
mcp_servers=[MCPServer(command=["python", "-m", "my_banking_server"])],
)
result = await aitest_run(agent, "What's my checking balance?")
assert result.success
assert result.tool_was_called("get_balance")
If this fails, the problem isn't your code — it's your tool description. The LLM couldn't figure out which tool to call or what parameters to pass. Fix the description, run again. This is TDD for AI interfaces.
The Red/Green/Refactor Cycle — For Tool Descriptions
🔴 Red: Write a failing test
async def test_transfer(aitest_run):
result = await aitest_run(agent, "Move $200 from checking to savings")
assert result.tool_was_called("transfer")
The LLM reads your tool descriptions, gets confused, calls the wrong thing. Test fails.
🟢 Green: Fix the interface
# Before — too vague
@mcp.tool()
def transfer(from_acct: str, to_acct: str, amount: float) -> str:
"""Transfer money."""
# After — the LLM knows exactly what to do
@mcp.tool()
def transfer(from_account: str, to_account: str, amount: float) -> str:
"""Transfer money between accounts (checking, savings).
Amount must be positive. Returns new balances for both accounts."""
Run again. Test passes.
🔄 Refactor: Let AI analysis tell you what else to fix
This is where it gets interesting. pytest-aitest doesn't just tell you pass/fail — it runs a second LLM that analyzes every failure and tells you why it happened and what to improve. Traditional testing requires a human to interpret failures. Here, the AI does it:
The report tells you which model to deploy, why it wins, and what to fix. It analyzes cost efficiency, tool usage patterns, and prompt effectiveness across all your configurations. Unused tools? The AI flags them. Prompt causing permission-seeking behavior? It explains the mechanism. See a full sample report →
Compare Models, Prompts, and Server Versions
The real power is comparison. Test multiple configurations against the same test suite:
MODELS = ["gpt-5-mini", "gpt-4.1"]
PROMPTS = {"brief": "Be concise.", "detailed": "Explain your reasoning."}
AGENTS = [
Agent(
name=f"{model}-{prompt_name}",
provider=Provider(model=f"azure/{model}"),
mcp_servers=[banking_server],
system_prompt=prompt,
)
for model in MODELS
for prompt_name, prompt in PROMPTS.items()
]
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(aitest_run, agent):
result = await aitest_run(agent, "What's my checking balance?")
assert result.success
4 configurations. Same tests. The report generates an Agent Leaderboard — winner by pass rate, then cost as tiebreaker:
| Agent | Pass Rate | Tokens | Cost |
|---|---|---|---|
| gpt-5-mini-brief | 100% | 747 | $0.002 |
| gpt-4.1-brief | 100% | 560 | $0.008 |
| gpt-5-mini-detailed | 100% | 1,203 | $0.004 |
Deploy: gpt-5-mini (brief prompt) — 100% pass rate at lowest cost.
The same pattern works for A/B testing server versions (did your refactor break tool discoverability?), comparing system prompts, and measuring the impact of Agent Skills.
Multi-Turn Sessions
Real users don't ask one question. They have conversations:
@pytest.mark.session("banking-chat")
class TestBankingConversation:
async def test_check_balance(self, aitest_run):
result = await aitest_run(agent, "What's my checking balance?")
assert result.success
async def test_transfer(self, aitest_run):
# Agent remembers we were talking about checking
result = await aitest_run(agent, "Transfer $200 to savings")
assert result.tool_was_called("transfer")
async def test_verify(self, aitest_run):
# Agent remembers the transfer
result = await aitest_run(agent, "What are my new balances?")
assert result.success
Tests share conversation history. The report shows the full session flow with sequence diagrams.
Who This Is For
- MCP server authors — Validate that LLMs can actually use your tools, not just that the code works
- Agent builders — Find the cheapest model + prompt combo that passes your test suite
- Teams shipping AI products — Gate deployments on LLM-facing regression tests in CI/CD
Works with 100+ LLM providers via LiteLLM — Azure, OpenAI, Anthropic, Google, local models, whatever you're running.
The Key Insight
The test is a prompt. The LLM is the test harness. The report tells you what to fix.
Traditional testing validates that your code works. pytest-aitest validates that an LLM can understand and use your code. These are different things, and the gap between them is where your production bugs live.
Your tool descriptions are an API. Test them like one.
Get Started
pytest-aitest is open source. Contributions welcome!
- Documentation — Full guides and API reference
-
PyPI —
uv add pytest-aitest - Sample Report — See AI analysis in action
sbroenne
/
pytest-aitest
A pytest plugin for validating whether language models can actually understand and operate your interfaces: MCP servers, system prompts, agent skills and tools.
pytest-aitest
Test your AI interfaces. AI analyzes your results.
A pytest plugin for test-driven development of MCP servers, tools, prompts, and skills. Write tests first. Let the AI analysis drive your design.
Why?
Your MCP server passes all unit tests. Then an LLM tries to use it and picks the wrong tool, passes garbage parameters, or ignores your system prompt.
Because you tested the code, not the AI interface. For LLMs, your API is tool descriptions, schemas, and prompts — not functions and types. No compiler catches a bad tool description. No linter flags a confusing schema. Traditional tests can't validate them.
How It Works
So I built pytest-aitest: write tests as natural language prompts. An Agent bundles an LLM with your tools — you assert on what happened:
from pytest_aitest import Agent, Provider, MCPServer
async def test_balance_query(aitest_run):
agent = Agent(
provider=Provider…

Top comments (0)