If you're building AI agents with the Model Context Protocol (MCP), you've probably asked yourself: "How do I systematically test my MCP servers across different LLMs?" After wrestling with this problem myself, I built agent-benchmark - an open-source testing framework that makes it very simple to validate MCP server behavior across Claude, GPT, Gemini, and more.
Why I Built This
My main goal was simplicity. Inspired by Taurus (https://gettaurus.org), I wanted QA engineers to create comprehensive tests without writing code, just YAML. But I also needed power: parameterization, data extraction, and seamless CI/CD integration.
Testing MCP integrations is challenging. You need to test across providers, validate tool calls and parameters, check performance metrics, simulate conversations, and run regression tests. I couldn't find a tool that did all this well, so I built one.
Key Features
- Multi-provider support (Claude, GPT, Gemini, Groq, Azure, Vertex AI)
- Connect multiple MCP servers simultaneously, with per-session tool restrictions to save tokens (see the sketch after this list)
- Multiple assertion types for comprehensive validation
- HTML, JSON, and Markdown reports with performance comparisons
- Session-based testing for multi-turn conversations
- Test suite support with shared configuration, parameters, and success criteria across multiple logically organized test files
- Built-in template system with Handlebars and Faker integrations, plus support for user-defined variables and environment variables
- Extract data from tool responses using JSONPath to pass between tests
- CI/CD ready
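As a quick illustration of the multi-server point above, here is a minimal sketch of two servers attached to one agent. The second weather server, the tools allow-list key, and its placement under the agent's server entry are assumptions on my part for illustration only; check the documentation for the exact syntax.
# Sketch: two MCP servers on one agent, with a restricted tool set.
# The weather server and the `tools` allow-list key are assumed, not confirmed syntax.
servers:
  - name: calculator
    type: stdio
    command: node calculator-server.js
  - name: weather                        # hypothetical second server
    type: stdio
    command: node weather-server.js
agents:
  - name: claude-agent
    provider: claude
    servers:
      - name: calculator
        tools: [calculate]               # assumed: expose only this tool to save tokens
      - name: weather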
Quick Start
Installation
One-line install on macOS/Linux:
curl -fsSL https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.sh | bash
Or download pre-built binaries from the releases page, or build from source.
Your First Test: Cross-Provider Testing
This test runs the same calculation request across Claude and GPT simultaneously. You'll see how each model handles tool calls, compare their response times and token usage, and identify which providers work best with your MCP server:
providers:
  - name: claude
    type: ANTHROPIC
    token: {{ANTHROPIC_API_KEY}}
    model: claude-sonnet-4-20250514
  - name: gpt
    type: OPENAI
    token: {{OPENAI_API_KEY}}
    model: gpt-4o-mini
servers:
  - name: calculator
    type: stdio
    command: node calculator-server.js
agents:
  - name: claude-agent
    provider: claude
    servers:
      - name: calculator
  - name: gpt-agent
    provider: gpt
    servers:
      - name: calculator
settings:
  verbose: true
  max_iterations: 10
sessions:
  - name: Math Operations
    tests:
      - name: Complex calculation
        prompt: "Calculate (42 * 13) + 87"
        assertions:
          - type: tool_called
            tool: calculate
          - type: output_contains
            value: "633"
          - type: max_latency_ms
            value: 5000
Run it:
export ANTHROPIC_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
agent-benchmark -f test-calculator.yaml -reportType html
The HTML report will show you side-by-side comparisons - which agent passed, response times, token usage, and any failures.
Assertions
agent-benchmark includes multiple assertion types (a combined example follows the lists below):
Tool Validation:
- tool_called - Verify a tool was invoked
- tool_not_called - Ensure a tool wasn't used
- tool_call_count - Exact number of calls
- tool_call_order - Sequential tool usage
- tool_param_equals - Parameter validation (supports nested params!)
- tool_param_matches_regex - Regex parameter matching
- no_hallucinated_tools - Catch when agents call non-existent tools
Output Validation:
- output_contains / output_not_contains
- output_regex - Pattern matching
- tool_result_matches_json - JSONPath validation
Performance:
- max_tokens - Token usage limits
- max_latency_ms - Response time validation
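To show how these combine, here is a sketch of a single test mixing the categories above. The shapes of tool_called, tool_param_equals, output_contains, and max_latency_ms follow the examples elsewhere in this post; the exact fields for tool_not_called and max_tokens, and the convert_currency tool itself, are my assumptions.
sessions:
  - name: Assertion Showcase
    tests:
      - name: Convert currency safely
        prompt: "Convert 100 USD to EUR"
        assertions:
          - type: tool_called            # the right tool was used
            tool: convert_currency       # hypothetical tool name
          - type: tool_param_equals      # with the expected parameters
            tool: convert_currency
            params:
              amount: 100
              from: "USD"
              to: "EUR"
          - type: tool_not_called        # and no unrelated tool (assumed shape)
            tool: delete_account
          - type: output_contains        # the answer mentions EUR
            value: "EUR"
          - type: max_latency_ms         # within the latency budget
            value: 5000
          - type: max_tokens             # and the token budget (assumed shape)
            value: 2000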
Template System & Faker
Generate dynamic test data with built-in Handlebars helpers:
variables:
  # Random values
  user_id: "{{randomInt lower=1000 upper=9999}}"
  api_key: "{{randomValue type='ALPHANUMERIC' length=32}}"
  # Timestamps
  created_at: "{{now format='unix'}}"
  future_date: "{{now offset='7 days' format='yyyy-MM-dd'}}"
  # Fake data
  email: "{{faker 'Internet.email'}}"
  name: "{{faker 'Name.full_name'}}"
  address: "{{faker 'Address.street'}}"
  company: "{{faker 'Company.name'}}"
This is perfect for creating realistic test scenarios without hardcoding values.
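For instance, those generated variables can be referenced in prompts and assertions with the same {{name}} syntax used in the session examples below; the register_user tool here is hypothetical.
sessions:
  - name: Signup Flow
    tests:
      - name: Register a generated user
        prompt: "Register a user named {{name}} with email {{email}}"
        assertions:
          - type: tool_called
            tool: register_user          # hypothetical tool name
          - type: tool_param_equals
            tool: register_user
            params:
              email: "{{email}}"         # resolves to the same Faker-generated value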
Session-Based Testing
Tests within a session share conversation history, letting you test multi-turn interactions:
sessions:
  - name: User Workflow
    tests:
      - name: Create user
        prompt: "Create a new user named {{name}}"
        extractors:
          - type: jsonpath
            tool: create_user
            path: "$.data.user.id"
            variable_name: user_id
      - name: Update user
        prompt: "Update user {{user_id}} email to {{email}}"
        assertions:
          - type: tool_param_equals
            tool: update_user
            params:
              id: "{{user_id}}"
              email: "{{email}}"
The second test has access to the conversation from the first test, plus extracted variables.
CI/CD Integration
agent-benchmark is designed for automation:
# GitHub Actions
- name: Test MCP Servers
  run: |
    agent-benchmark -s test-suite.yaml -o results.json -reportType json
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- name: Upload Results
  uses: actions/upload-artifact@v3
  with:
    name: test-results
    path: results.json
Set success criteria in your suite:
criteria:
success_rate: 0.80 # 80% pass rate required
Exit codes are standard: 0 for success, 1 for failure.
Data Extraction
Extract data from tool results and pass it between tests to create realistic multi-step workflows:
sessions:
  - name: User Workflow
    tests:
      - name: Create user
        prompt: "Create a new user"
        extractors:
          - type: jsonpath
            tool: create_user
            path: "$.data.user.id"
            variable_name: user_id
        assertions:
          - type: tool_called
            tool: create_user
      - name: Get user details
        prompt: "Get details for user {{user_id}}"
        assertions:
          - type: tool_called
            tool: get_user
          - type: tool_param_equals
            tool: get_user
            params:
              id: "{{user_id}}"
Extractors use JSONPath to pull values from tool responses and make them available as variables in subsequent tests. This enables you to test complete user journeys where each step depends on data from previous operations - like creating a record and then updating or querying it using the generated ID.
Test Suites
For larger projects, organize tests into suites:
# suite.yaml
name: "Complete MCP Test Suite"
test_files:
  - tests/basic-operations.yaml
  - tests/error-handling.yaml
  - tests/performance.yaml
providers:
  # Shared provider config
servers:
  # Shared server config
criteria:
  success_rate: 0.90
Run the entire suite:
agent-benchmark -s suite.yaml -o full-report -reportType html,json,md
Real-World Use Cases
I've been using agent-benchmark for:
- Regression Testing: Catch breaking changes in MCP servers before they hit production
- Performance Benchmarking: Compare latency and token usage across LLM providers to optimize costs
- Tool Schema Validation: Identify poorly designed tool schemas that cause multiple models to fail - if Claude, GPT, and Gemini all struggle with your tool, it's the schema, not the models
- Integration Testing: Test complex multi-tool workflows and multi-server setups to catch configuration issues and interaction problems between different MCP servers
What's Next?
I'm actively working on:
- Interactive conversation support: Testing scenarios where the LLM requests additional information from the user mid-conversation. This will enable validation of more complex, multi-turn interactions where agents need clarification or supplementary data.
- Integration with BlazeMeter and popular testing platforms: This will unlock centralized test management, historical trending, team collaboration, and enterprise reporting. Imagine running your MCP tests alongside your API and performance tests in one unified dashboard.
- New assertions and enhanced templating: More validation options for complex scenarios and even more powerful template helpers to reduce test maintenance and increase reusability.
Get Started
Check out the GitHub repo for:
- Full documentation
- More examples
- Installation instructions
- Contributing guidelines
The project is Apache 2.0 licensed and contributions are very welcome!
Wrapping Up
Testing MCP servers doesn't have to be complicated. With agent-benchmark, you can define comprehensive tests in YAML, run them across multiple LLM providers, and get beautiful reports - all with a single command.
Give it a try and let me know what you think! I'd love to hear about your use cases or feature requests.
Links:
- GitHub: https://github.com/mykhaliev/agent-benchmark
Have questions? Drop them in the comments or open an issue on GitHub!