Dmytro Mykhaliev

Testing MCP Servers Made Easy with agent-benchmark

If you're building AI agents with the Model Context Protocol (MCP), you've probably asked yourself: "How do I systematically test my MCP servers across different LLMs?" After wrestling with this problem myself, I built agent-benchmark - an open-source testing framework that makes it very simple to validate MCP server behavior across Claude, GPT, Gemini, and more.

Why I Built This

My main goal was simplicity. Inspired by Taurus (https://gettaurus.org), I wanted QA engineers to create comprehensive tests without writing code - just YAML. But I also needed power: parameterization, data extraction, and seamless CI/CD integration.

Testing MCP integrations is challenging. You need to test across providers, validate tool calls and parameters, check performance metrics, simulate conversations, and run regression tests. I couldn't find a tool that did all this well, so I built one.

Key Features

  • Multi-provider support (Claude, GPT, Gemini, Groq, Azure, Vertex AI)
  • Connect multiple MCP servers simultaneously, with per-session tool restrictions to save tokens (see the multi-server sketch after the Quick Start example below)
  • Multiple assertion types for comprehensive validation
  • HTML, JSON, and Markdown reports with performance comparisons
  • Session-based testing for multi-turn conversations
  • Test suite support with shared configuration, parameters, and success criteria across multiple logically organized test files
  • Built-in template system with Handlebars and Faker integrations, plus support for user-defined variables and environment variables
  • Extract data from tool responses with JSONPath and pass it between tests
  • CI/CD ready

Quick Start

Installation

One-line install on macOS/Linux:

curl -fsSL https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.sh | bash

Alternatively, download a pre-built binary from the releases page or build from source.

Your First Test: Cross-Provider Testing

This test runs the same calculation request across Claude and GPT simultaneously. You'll see how each model handles tool calls, compare their response times and token usage, and identify which providers work best with your MCP server:

providers:
  - name: claude
    type: ANTHROPIC
    token: {{ANTHROPIC_API_KEY}}
    model: claude-sonnet-4-20250514

  - name: gpt
    type: OPENAI
    token: {{OPENAI_API_KEY}}
    model: gpt-4o-mini

servers:
  - name: calculator
    type: stdio
    command: node calculator-server.js

agents:
  - name: claude-agent
    provider: claude
    servers:
      - name: calculator    

  - name: gpt-agent
    provider: gpt
    servers:
      - name: calculator

settings:
  verbose: true
  max_iterations: 10

sessions:
  - name: Math Operations
    tests:
      - name: Complex calculation
        prompt: "Calculate (42 * 13) + 87"
        assertions:
          - type: tool_called
            tool: calculate
          - type: output_contains
            value: "633"
          - type: max_latency_ms
            value: 5000

Run it:

export ANTHROPIC_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
agent-benchmark -f test-calculator.yaml -reportType html

The HTML report will show you side-by-side comparisons - which agent passed, response times, token usage, and any failures.
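
The config above pairs each agent with a single server, but an agent can attach several MCP servers at once. Here is a minimal sketch of that, assuming a hypothetical second filesystem server; the tools key used for the per-session tool restrictions mentioned in the feature list is also my assumption, so check the repo docs for the exact field name:

servers:
  - name: calculator
    type: stdio
    command: node calculator-server.js

  - name: filesystem            # hypothetical second server, for illustration only
    type: stdio
    command: node fs-server.js

agents:
  - name: claude-agent
    provider: claude
    servers:
      - name: calculator
      - name: filesystem
        tools:                  # assumed key: expose only the tools a session needs
          - read_file

Restricting the exposed tools keeps the tool schemas sent to the model small, which is where the token savings come from.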

Assertions

agent-benchmark includes multiple assertion types (a combined example follows the lists):

Tool Validation:

  • tool_called - Verify a tool was invoked
  • tool_not_called - Ensure a tool wasn't used
  • tool_call_count - Exact number of calls
  • tool_call_order - Sequential tool usage
  • tool_param_equals - Parameter validation (supports nested params!)
  • tool_param_matches_regex - Regex parameter matching
  • no_hallucinated_tools - Catch when agents call non-existent tools

Output Validation:

  • output_contains / output_not_contains
  • output_regex - Pattern matching
  • tool_result_matches_json - JSONPath validation

Performance:

  • max_tokens - Token usage limits
  • max_latency_ms - Response time validation
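
As a concrete illustration, here is a sketch of one test combining several of these assertion types. The tool_called, tool_param_equals, output_contains, and max_latency_ms shapes follow the earlier example; the tool_not_called, nested-parameter, and max_tokens shapes are my assumptions based on the same pattern, and the get_order/delete_order tools are hypothetical:

tests:
  - name: Fetch order safely
    prompt: "Look up order 42 for customer 7"
    assertions:
      - type: tool_called
        tool: get_order
      - type: tool_not_called          # assumed shape, mirrors tool_called
        tool: delete_order
      - type: tool_param_equals        # nested params, assumed layout
        tool: get_order
        params:
          order:
            id: "42"
            customer_id: "7"
      - type: output_contains
        value: "42"
      - type: max_tokens               # assumed shape, mirrors max_latency_ms
        value: 2000
      - type: max_latency_ms
        value: 5000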

Template System & Faker

Generate dynamic test data with built-in Handlebars helpers:

variables:
  # Random values
  user_id: "{{randomInt lower=1000 upper=9999}}"
  api_key: "{{randomValue type='ALPHANUMERIC' length=32}}"

  # Timestamps
  created_at: "{{now format='unix'}}"
  future_date: "{{now offset='7 days' format='yyyy-MM-dd'}}"

  # Fake data
  email: "{{faker 'Internet.email'}}"
  name: "{{faker 'Name.full_name'}}"
  address: "{{faker 'Address.street'}}"
  company: "{{faker 'Company.name'}}"

This is perfect for creating realistic test scenarios without hardcoding values.
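
Once defined, these variables can be referenced in prompts and assertions the same way extracted values are used later in this post. A small sketch, with register_user being a hypothetical tool:

sessions:
  - name: Signup flow
    tests:
      - name: Register a generated user
        prompt: "Register {{name}} with email {{email}} at {{company}}"
        assertions:
          - type: tool_called
            tool: register_user
          - type: tool_param_equals
            tool: register_user
            params:
              email: "{{email}}"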

Session-Based Testing

Tests within a session share conversation history, letting you test multi-turn interactions:

sessions:
  - name: User Workflow
    tests:
      - name: Create user
        prompt: "Create a new user named {{name}}"
        extractors:
          - type: jsonpath
            tool: create_user
            path: "$.data.user.id"
            variable_name: user_id

      - name: Update user
        prompt: "Update user {{user_id}} email to {{email}}"
        assertions:
          - type: tool_param_equals
            tool: update_user
            params:
              id: "{{user_id}}"
              email: "{{email}}"

The second test has access to the conversation from the first test, plus extracted variables.

CI/CD Integration

agent-benchmark is designed for automation:

## GitHub Actions
- name: Test MCP Servers
  run: |
    agent-benchmark -s test-suite.yaml -o results.json -reportType json
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Upload Results
  if: always()  # upload the report even when assertions fail and the run exits 1
  uses: actions/upload-artifact@v3
  with:
    name: test-results
    path: results.json

Set success criteria in your suite:

criteria:
  success_rate: 0.80  # 80% pass rate required

Exit codes are standard: 0 for success, 1 for failure.

Data Extraction

Extract data from tool results and pass it between tests to create realistic multi-step workflows:

sessions:
  - name: User Workflow
    tests:
      - name: Create user
        prompt: "Create a new user"
        extractors:
          - type: jsonpath
            tool: create_user
            path: "$.data.user.id"
            variable_name: user_id
        assertions:
          - type: tool_called
            tool: create_user

      - name: Get user details
        prompt: "Get details for user {{user_id}}"
        assertions:
          - type: tool_called
            tool: get_user
          - type: tool_param_equals
            tool: get_user
            params:
              id: "{{user_id}}"

Extractors use JSONPath to pull values from tool responses and make them available as variables in subsequent tests. This enables you to test complete user journeys where each step depends on data from previous operations - like creating a record and then updating or querying it using the generated ID.
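
For reference, the $.data.user.id path above would match a create_user tool result shaped roughly like this (the actual payload is whatever your MCP server returns):

{
  "data": {
    "user": {
      "id": "u_1042",
      "name": "Jane Doe"
    }
  }
}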

Test Suites

For larger projects, organize tests into suites:

# suite.yaml
name: "Complete MCP Test Suite"
test_files:
  - tests/basic-operations.yaml
  - tests/error-handling.yaml
  - tests/performance.yaml

providers:
  # Shared provider config

servers:
  # Shared server config

criteria:
  success_rate: 0.90
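
Each referenced test file then only needs what is specific to it. My assumption, based on the shared-configuration feature, is that a file such as tests/basic-operations.yaml defines its agents and sessions while inheriting the suite-level providers and servers (check the repo docs for the exact inheritance rules):

# tests/basic-operations.yaml
agents:
  - name: claude-agent
    provider: claude          # provider defined in suite.yaml
    servers:
      - name: calculator      # server defined in suite.yaml

sessions:
  - name: Basic operations
    tests:
      - name: Simple addition
        prompt: "What is 19 + 23?"
        assertions:
          - type: tool_called
            tool: calculate
          - type: output_contains
            value: "42"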

Run the entire suite:

agent-benchmark -s suite.yaml -o full-report -reportType html,json,md

Real-World Use Cases

I've been using agent-benchmark for:

  1. Regression Testing: Catch breaking changes in MCP servers before they hit production
  2. Performance Benchmarking: Compare latency and token usage across LLM providers to optimize costs
  3. Tool Schema Validation: Identify poorly designed tool schemas that cause multiple models to fail - if Claude, GPT, and Gemini all struggle with your tool, it's the schema, not the models
  4. Integration Testing: Test complex multi-tool workflows and multi-server setups to catch configuration issues and interaction problems between different MCP servers

What's Next?

I'm actively working on:

  • Interactive conversation support: Testing scenarios where the LLM requests additional information from the user mid-conversation. This will enable validation of more complex, multi-turn interactions where agents need clarification or supplementary data.
  • Integration with BlazeMeter and popular testing platforms: This will unlock centralized test management, historical trending, team collaboration, and enterprise reporting. Imagine running your MCP tests alongside your API and performance tests in one unified dashboard.
  • New assertions and enhanced templating: More validation options for complex scenarios and even more powerful template helpers to reduce test maintenance and increase reusability.

Get Started

Check out the GitHub repo for:

  • Full documentation
  • More examples
  • Installation instructions
  • Contributing guidelines

The project is Apache 2.0 licensed and contributions are very welcome!

Wrapping Up

Testing MCP servers doesn't have to be complicated. With agent-benchmark, you can define comprehensive tests in YAML, run them across multiple LLM providers, and get beautiful reports - all with a single command.

Give it a try and let me know what you think! I'd love to hear about your use cases or feature requests.


Links:

  • GitHub: https://github.com/mykhaliev/agent-benchmark

Have questions? Drop them in the comments or open an issue on GitHub!
