Last month I shipped an MCP agent that triages GitHub issues. It works great — until it silently breaks and nobody notices.
Here are the last three bugs I hit:
1. I tweaked the system prompt. The agent stopped calling create_issue and just summarised the bug report in plain text. CI didn't catch it — CI tests the code, not the agent behavior.
2. I swapped Sonnet for Haiku to save cost. The agent started calling list_issues four times before each create_issue. Integration tests still passed. Token bill tripled.
3. GitHub rate-limited me mid-test. The entire pytest suite went red. I rolled back a perfectly good change because I couldn't tell flake from regression.
Every one of those would have been caught by a tool that tests the agent's trajectory — which tools it picks, in what order, with what arguments — against a fast, hermetic mock.
That tool is mcptest. Here's how I use it.
The Scenario
I have an agent that reads a bug report and decides whether to:
- Open a new issue if the bug is novel
- Comment on an existing issue if a similar one is already filed
- Do nothing if the report is spam or unclear
The agent is about thirty lines of Python wrapping an LLM call and the MCP client:
# agent.py
import asyncio, json, os, sys

from mcp import ClientSession
from mcp.client.stdio import StdioServerParameters, stdio_client

SYSTEM_PROMPT = """You are an issue triage assistant for acme/api.
Before filing a new issue, ALWAYS check list_issues for duplicates.
If a duplicate exists, call add_comment instead."""

async def main():
    fx = json.loads(os.environ["MCPTEST_FIXTURES"])[0]
    params = StdioServerParameters(
        command=sys.executable,
        args=["-m", "mcptest.mock_server", fx],
        env=os.environ.copy(),
    )
    user = sys.stdin.read()
    async with stdio_client(params) as (r, w):
        async with ClientSession(r, w) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Your real agent calls an LLM here with SYSTEM_PROMPT + tools
            # and executes whatever tool_calls it returns
            plan = your_llm_plan(SYSTEM_PROMPT, user, tools)
            for call in plan:
                await session.call_tool(call.name, arguments=call.args)

asyncio.run(main())
That's it. The rest of this post is about testing what this agent does — which tools it picks, in what order — without ever calling a real LLM.
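For the examples in this post, `your_llm_plan` doesn't even need a model behind it. Here's a hypothetical deterministic stand-in I use when I only want to exercise the MCP plumbing; the `ToolCall` shape and the keyword heuristics are my own illustrative assumptions, not part of mcptest or the mcp SDK:

```python
# A deterministic stand-in for your_llm_plan, useful for exercising
# the MCP plumbing without a model in the loop. The ToolCall shape
# and the keyword rules below are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def your_llm_plan(system_prompt: str, user: str, tools) -> list[ToolCall]:
    repo = "acme/api"
    # Always check for duplicates first, as the system prompt demands.
    plan = [ToolCall("list_issues", {"repo": repo, "query": user[:40]})]
    if "dark mode" in user:
        # Pretend the duplicate check found existing issue #12.
        plan.append(ToolCall("add_comment",
                             {"repo": repo, "number": 12, "body": user}))
    elif "spam" not in user:
        plan.append(ToolCall("create_issue", {"repo": repo, "title": user}))
    return plan
```

A real agent would replace this with an actual LLM call; the point is that everything downstream (the mock server, the assertions) is identical either way.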
I want to verify four things:
- For a clear bug report, it calls create_issue exactly once.
- It checks list_issues before create_issue (duplicate check).
- When the server rate-limits it, it recovers gracefully.
- It never calls delete_issue. Ever.
Step 1 — Mock the GitHub MCP Server in YAML
No code required. Declare the tools, canned responses, and error scenarios:
# fixtures/github.yaml
server:
name: mock-github
version: "1.0"
tools:
- name: list_issues
input_schema:
type: object
properties:
repo: { type: string }
query: { type: string }
required: [repo]
responses:
- match: { query: "login 500" }
return:
issues: [] # No duplicate → should open a new one
- match: { query: "dark mode" }
return:
issues: [{ number: 12, title: "Add dark mode" }]
- default: true
return: { issues: [] }
- name: create_issue
input_schema:
type: object
properties:
repo: { type: string }
title: { type: string }
body: { type: string }
required: [repo, title]
responses:
- match: { repo: "acme/api" }
return: { number: 42, url: "https://github.com/acme/api/issues/42" }
- default: true
error: rate_limited # Simulate real-world flake
- name: add_comment
input_schema:
type: object
properties:
repo: { type: string }
number: { type: integer }
body: { type: string }
required: [repo, number, body]
responses:
- default: true
return: { ok: true }
- name: delete_issue
responses:
- default: true
return: { deleted: true }
errors:
- name: rate_limited
error_code: -32000
message: "GitHub API rate limit exceeded"
That's a real MCP server. It speaks MCP over stdio, just like the real GitHub server. Your agent connects to it the same way.
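My mental model for how those `match` blocks resolve — an assumption about the semantics, not a reading of mcptest's source — is first-match-wins with `default` as the fallback:

```python
# Illustrative sketch of fixture response matching: the first rule whose
# match key/value pairs all appear in the call's arguments wins;
# otherwise the `default` rule applies. This is my assumption about the
# semantics, not mcptest's actual implementation.
def pick_response(responses: list[dict], arguments: dict) -> dict:
    default = None
    for rule in responses:
        if rule.get("default"):
            default = rule
            continue
        match = rule.get("match", {})
        if all(arguments.get(k) == v for k, v in match.items()):
            return rule
    if default is None:
        raise LookupError("no matching response and no default")
    return default

rules = [
    {"match": {"query": "dark mode"},
     "return": {"issues": [{"number": 12, "title": "Add dark mode"}]}},
    {"default": True, "return": {"issues": []}},
]
```

Whatever the exact semantics, the important property is determinism: the same call always gets the same canned answer, so a failing test means the agent changed, not the world.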
Step 2 — Write the Tests
# tests/test_triage.yaml
name: triage agent
fixtures:
- ../fixtures/github.yaml
agent:
command: python agent.py
cases:
- name: opens a new issue for a novel bug
input: "login page returns 500 on Safari"
assertions:
- tool_called: create_issue
- tool_call_count: { tool: create_issue, count: 1 }
- param_matches:
tool: create_issue
param: repo
value: "acme/api"
- no_errors: true
- name: checks duplicates before creating
input: "login page returns 500 on Safari"
assertions:
- tool_order:
- list_issues
- create_issue
- name: comments on an existing duplicate instead of creating
input: "add dark mode support"
assertions:
- tool_called: add_comment
- tool_not_called: create_issue
- name: never deletes anything
input: "spam: buy crypto now!!!"
assertions:
- tool_not_called: delete_issue
- name: recovers from rate-limit gracefully
input: "file bug for org: wrongorg/wrongrepo"
assertions:
- tool_called: create_issue
- error_handled: "rate limit"
Five test cases. Every one maps directly to a bug I've actually hit in production.
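Under the hood, assertions like tool_order and tool_not_called are just predicates over the recorded call sequence. A rough sketch of the idea (my own, not mcptest's code):

```python
# Rough sketch of trajectory assertions: tool_order requires the
# expected tools to appear in that relative order (other calls may be
# interleaved); tool_not_called is a simple membership check.
# My own sketch of the idea, not mcptest's implementation.
def tool_order_ok(trajectory: list[str], expected: list[str]) -> bool:
    it = iter(trajectory)          # `in` on an iterator consumes it,
    return all(t in it for t in expected)  # giving a subsequence check

def tool_not_called(trajectory: list[str], tool: str) -> bool:
    return tool not in trajectory

trajectory = ["list_issues", "list_issues", "create_issue"]
```

That's the whole trick: once the trajectory is data, behavioral guarantees become one-line checks.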
Step 3 — Run It
$ mcptest run
mcptest results
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Suite ┃ Case ┃ Status ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ triage agent │ opens a new issue for a novel bug │ PASS │
│ triage agent │ checks duplicates before creating │ PASS │
│ triage agent │ comments on existing duplicate │ PASS │
│ triage agent │ never deletes anything │ PASS │
│ triage agent │ recovers from rate-limit gracefully│ PASS │
└────────────────┴────────────────────────────────────┴────────┘
5 passed, 0 failed (5 total)
⏱ 1.6s
1.6 seconds. Zero tokens. No GitHub API calls. No rate-limit flake.
Step 4 — The Regression-Diff Trick
This is where the tool earns its keep.
First, snapshot the current (known-good) agent trajectories as baselines:
$ mcptest snapshot
✓ saved baseline for triage agent::opens a new issue... (2 tool calls)
✓ saved baseline for triage agent::checks duplicates... (2 tool calls)
✓ saved baseline for triage agent::comments on duplicate (2 tool calls)
...
Now go tweak the system prompt. Something innocuous — change this:
"You are an issue triage assistant for acme/api. Before filing a new issue, ALWAYS check list_issues for duplicates. If a duplicate exists, call add_comment instead."
To this:
"You are a helpful assistant that handles GitHub issue reports."
No errors in the code. All unit tests still pass. Linter is happy.
$ mcptest diff --ci
✗ triage agent::checks duplicates before creating
tool_order REGRESSION:
baseline: list_issues → create_issue
current: create_issue ← list_issues was dropped!
✗ triage agent::comments on existing duplicate
tool_called REGRESSION:
baseline: add_comment was called
current: add_comment was never called
2 regression(s) across 5 case(s)
Exit: 1
Exit code 1 — CI blocks the merge. The agent silently lost its duplicate-check behavior because of a one-sentence prompt change.
This is the bug that cost me a weekend. I never want to hit it again.
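The diff itself doesn't need anything fancier than comparing baseline and current call sequences. A sketch, assuming trajectories are stored as plain lists of tool names (that storage format is my assumption):

```python
# Sketch of a trajectory regression diff: report tools that were
# dropped or newly added relative to the baseline, and flag pure
# reorderings. Assumes trajectories are plain lists of tool names.
def diff_trajectory(baseline: list[str], current: list[str]) -> list[str]:
    problems = []
    for t in baseline:
        if t not in current:
            problems.append(f"{t} was dropped")
    for t in current:
        if t not in baseline:
            problems.append(f"{t} is new")
    if not problems and baseline != current:
        problems.append(f"order changed: {baseline} -> {current}")
    return problems
```

An empty result means the trajectory is unchanged; anything else is exactly the kind of silent behavior shift the snapshot/diff workflow exists to surface.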
Step 5 — Wire It into CI
# .github/workflows/agent-tests.yml
name: Agent tests
on: [pull_request]
jobs:
mcptest:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: "3.11" }
- run: pip install mcp-agent-test
- name: Run tests
run: mcptest run
- name: Diff against baselines
run: mcptest diff --ci
- name: Post PR summary
if: always()
run: mcptest github-comment
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Now every PR that changes the prompt, the model, or the agent code gets its trajectories diffed against main. If behavior shifts, a comment lands on the PR with the exact tool-order delta. Reviewers see agent behavior changed as clearly as they see code changed.
Why MCP-Specific Matters
Eval tooling is consolidating fast — independent evaluation startups keep getting folded into model-provider platforms. That's useful if you're all-in on one vendor. It's a lock-in risk if you're not.
mcptest is independent, MIT-licensed, and specifically shaped for MCP agents. The tool_called / tool_order / error_handled primitives exist because that's what an MCP trajectory actually looks like — not because someone ported a generic LLM-eval DSL.
MCP agent testing is particularly underserved. DeepEval is great for prompt evaluation. Inspect AI is great for benchmarks. Neither gives you "run your agent against a mock GitHub server and assert it didn't call delete_issue."
mcptest does.
Try It
pip install mcp-agent-test
# Scaffold a new project
mcptest init
# Or clone the quickstart
git clone https://github.com/josephgec/mcptest
cd mcptest/examples/quickstart
mcptest run
# Or install a pre-built fixture pack
mcptest install-pack github ./my-project
mcptest install-pack slack ./my-project
mcptest install-pack filesystem ./my-project
Six packs ship out of the box: GitHub, Slack, filesystem, database, HTTP, and git. Each one is a realistic mock with error scenarios baked in and tests that actually assert something.
📦 Source: github.com/josephgec/mcptest
📦 PyPI: pypi.org/project/mcp-agent-test
If you're building an MCP agent and haven't started writing tests yet, you're accumulating the same three bugs I accumulated. Start with one fixture and one test case. Catch the first prompt-change regression. Then you'll understand why MCP agents need this.
If this was useful, a ⭐ on the repo helps others find it.