<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dmytro Mykhaliev</title>
    <description>The latest articles on DEV Community by Dmytro Mykhaliev (@dmytro_mykhalievioioio).</description>
    <link>https://dev.to/dmytro_mykhalievioioio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3679938%2Febfd8285-5cdd-4814-9fd6-3b423c864679.png</url>
      <title>DEV Community: Dmytro Mykhaliev</title>
      <link>https://dev.to/dmytro_mykhalievioioio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dmytro_mykhalievioioio"/>
    <language>en</language>
    <item>
      <title>Testing MCP Servers Made Easy with agent-benchmark</title>
      <dc:creator>Dmytro Mykhaliev</dc:creator>
      <pubDate>Sun, 28 Dec 2025 13:49:23 +0000</pubDate>
      <link>https://dev.to/dmytro_mykhalievioioio/testing-mcp-servers-made-easy-with-agent-benchmark-41dg</link>
      <guid>https://dev.to/dmytro_mykhalievioioio/testing-mcp-servers-made-easy-with-agent-benchmark-41dg</guid>
      <description>&lt;p&gt;If you're building AI agents with the Model Context Protocol (MCP), you've probably asked yourself: "How do I systematically test my MCP servers across different LLMs?" After wrestling with this problem myself, I built &lt;strong&gt;agent-benchmark&lt;/strong&gt; - an open-source testing framework that makes it very simple to validate MCP server behavior across Claude, GPT, Gemini, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I Built This
&lt;/h3&gt;

&lt;p&gt;My main goal was simplicity. Inspired by Taurus (&lt;a href="https://gettaurus.org" rel="noopener noreferrer"&gt;https://gettaurus.org&lt;/a&gt;), I wanted QA engineers to create comprehensive tests without writing code—just YAML. But I also needed power: parameterization, data extraction, and seamless CI/CD integration.&lt;br&gt;
Testing MCP integrations is challenging. You need to test across providers, validate tool calls and parameters, check performance metrics, simulate conversations, and run regression tests. I couldn't find a tool that did all this well, so I built one.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🔄 Multi-provider support (Claude, GPT, Gemini, Groq, Azure, Vertex AI)&lt;/li&gt;
&lt;li&gt;🔌 Connect multiple MCP servers simultaneously with per-session tool restrictions to save tokens&lt;/li&gt;
&lt;li&gt;🧪 Multiple assertion types for comprehensive validation&lt;/li&gt;
&lt;li&gt;📊 HTML, json and markdown reports with performance comparisons&lt;/li&gt;
&lt;li&gt;🎯 Session-based testing for multi-turn conversations&lt;/li&gt;
&lt;li&gt;📦 Test suite support with shared configuration, parameters, and success criteria across multiple logically organized test files&lt;/li&gt;
&lt;li&gt;🎲 Built-in template system with Handlebars and Faker integrations and support for user-defined variables and environment variables&lt;/li&gt;
&lt;li&gt;🔄 Extract data from tool responses using JSONPath to pass between tests&lt;/li&gt;
&lt;li&gt;🔗 CI/CD ready&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Installation
&lt;/h4&gt;

&lt;p&gt;One-line install on macOS/Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or download pre-built binaries or build from source on the &lt;a href="https://github.com/mykhaliev/agent-benchmark/releases" rel="noopener noreferrer"&gt;releases page&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Your First Test: Cross-Provider Testing
&lt;/h4&gt;

&lt;p&gt;This test runs the same calculation request across Claude and GPT simultaneously. You'll see how each model handles tool calls, compare their response times and token usage, and identify which providers work best with your MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC&lt;/span&gt;
    &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;}}&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OPENAI&lt;/span&gt;
    &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;}}&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;

&lt;span class="na"&gt;servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;calculator&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stdio&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node calculator-server.js&lt;/span&gt;

&lt;span class="na"&gt;agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-agent&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude&lt;/span&gt;
    &lt;span class="na"&gt;servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;calculator&lt;/span&gt;    

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-agent&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt&lt;/span&gt;
    &lt;span class="na"&gt;servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;calculator&lt;/span&gt;

&lt;span class="na"&gt;settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;verbose&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;max_iterations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;sessions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Math Operations&lt;/span&gt;
    &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Complex calculation&lt;/span&gt;
        &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(42&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;13)&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;87"&lt;/span&gt;
        &lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_called&lt;/span&gt;
            &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;calculate&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output_contains&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;633"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;max_latency_ms&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-key"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-key"&lt;/span&gt;
agent-benchmark &lt;span class="nt"&gt;-f&lt;/span&gt; test-calculator.yaml &lt;span class="nt"&gt;-reportType&lt;/span&gt; html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTML report will show you side-by-side comparisons - which agent passed, response times, token usage, and any failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assertions
&lt;/h3&gt;

&lt;p&gt;agent-benchmark includes multiple assertion types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Validation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tool_called&lt;/code&gt; - Verify a tool was invoked&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tool_not_called&lt;/code&gt; - Ensure a tool wasn't used&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tool_call_count&lt;/code&gt; - Exact number of calls&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tool_call_order&lt;/code&gt; - Sequential tool usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tool_param_equals&lt;/code&gt; - Parameter validation (supports nested params!)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tool_param_matches_regex&lt;/code&gt; - Regex parameter matching&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;no_hallucinated_tools&lt;/code&gt; - Catch when agents call non-existent tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Output Validation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;output_contains&lt;/code&gt; / &lt;code&gt;output_not_contains&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output_regex&lt;/code&gt; - Pattern matching&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tool_result_matches_json&lt;/code&gt; - JSONPath validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_tokens&lt;/code&gt; - Token usage limits&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_latency_ms&lt;/code&gt; - Response time validation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Template System &amp;amp; Faker
&lt;/h3&gt;

&lt;p&gt;Generate dynamic test data with built-in Handlebars helpers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Random values&lt;/span&gt;
  &lt;span class="na"&gt;user_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{randomInt&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lower=1000&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;upper=9999}}"&lt;/span&gt;
  &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{randomValue&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;type='ALPHANUMERIC'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;length=32}}"&lt;/span&gt;

  &lt;span class="c1"&gt;# Timestamps&lt;/span&gt;
  &lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{now&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;format='unix'}}"&lt;/span&gt;
  &lt;span class="na"&gt;future_date&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{now&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;offset='7&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;format='yyyy-MM-dd'}}"&lt;/span&gt;

  &lt;span class="c1"&gt;# Fake data&lt;/span&gt;
  &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{faker&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'Internet.email'}}"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{faker&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'Name.full_name'}}"&lt;/span&gt;
  &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{faker&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'Address.street'}}"&lt;/span&gt;
  &lt;span class="na"&gt;company&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{faker&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'Company.name'}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is perfect for creating realistic test scenarios without hardcoding values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session-Based Testing
&lt;/h3&gt;

&lt;p&gt;Tests within a session share conversation history, letting you test multi-turn interactions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sessions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;User Workflow&lt;/span&gt;
    &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create user&lt;/span&gt;
        &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;named&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{name}}"&lt;/span&gt;
        &lt;span class="na"&gt;extractors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jsonpath&lt;/span&gt;
            &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;create_user&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.data.user.id"&lt;/span&gt;
            &lt;span class="na"&gt;variable_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user_id&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update user&lt;/span&gt;
        &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Update&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{user_id}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{email}}"&lt;/span&gt;
        &lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_param_equals&lt;/span&gt;
            &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;update_user&lt;/span&gt;
            &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{user_id}}"&lt;/span&gt;
              &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{email}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second test has access to the conversation from the first test, plus extracted variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  CI/CD Integration
&lt;/h3&gt;

&lt;p&gt;agent-benchmark is designed for automation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;## GitHub Actions&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test MCP Servers&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;agent-benchmark -s test-suite.yaml -o results.json -reportType json&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;secrets.ANTHROPIC_API_KEY&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
    &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;secrets.OPENAI_API_KEY&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload Results&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-results&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;results.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set success criteria in your suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;criteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;success_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.80&lt;/span&gt;  &lt;span class="c1"&gt;# 80% pass rate required&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit codes are standard: 0 for success, 1 for failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Extraction
&lt;/h3&gt;

&lt;p&gt;Extract data from tool results and pass it between tests to create realistic multi-step workflows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sessions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;User Workflow&lt;/span&gt;
    &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create user&lt;/span&gt;
        &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user"&lt;/span&gt;
        &lt;span class="na"&gt;extractors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jsonpath&lt;/span&gt;
            &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;create_user&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.data.user.id"&lt;/span&gt;
            &lt;span class="na"&gt;variable_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user_id&lt;/span&gt;
        &lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_called&lt;/span&gt;
            &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;create_user&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Get user details&lt;/span&gt;
        &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;details&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{user_id}}"&lt;/span&gt;
        &lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_called&lt;/span&gt;
            &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_user&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_param_equals&lt;/span&gt;
            &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_user&lt;/span&gt;
            &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{user_id}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extractors use JSONPath to pull values from tool responses and make them available as variables in subsequent tests. This enables you to test complete user journeys where each step depends on data from previous operations - like creating a record and then updating or querying it using the generated ID.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Suites
&lt;/h3&gt;

&lt;p&gt;For larger projects, organize tests into suites:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# suite.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Complete&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MCP&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Suite"&lt;/span&gt;
&lt;span class="na"&gt;test_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tests/basic-operations.yaml&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tests/error-handling.yaml&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tests/performance.yaml&lt;/span&gt;

&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Shared provider config&lt;/span&gt;

&lt;span class="na"&gt;servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Shared server config&lt;/span&gt;

&lt;span class="na"&gt;criteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;success_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.90&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the entire suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-benchmark &lt;span class="nt"&gt;-s&lt;/span&gt; suite.yaml &lt;span class="nt"&gt;-o&lt;/span&gt; full-report &lt;span class="nt"&gt;-reportType&lt;/span&gt; html,json,md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real-World Use Cases
&lt;/h3&gt;

&lt;p&gt;I've been using agent-benchmark for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Regression Testing&lt;/strong&gt;: Catch breaking changes in MCP servers before they hit production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Benchmarking&lt;/strong&gt;: Compare latency and token usage across LLM providers to optimize costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Schema Validation&lt;/strong&gt;: Identify poorly designed tool schemas that cause multiple models to fail - if Claude, GPT, and Gemini all struggle with your tool, it's the schema, not the models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration Testing&lt;/strong&gt;: Test complex multi-tool workflows and multi-server setups to catch configuration issues and interaction problems between different MCP servers&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What's Next?
&lt;/h3&gt;

&lt;p&gt;I'm actively working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interactive conversation support&lt;/strong&gt;: Testing scenarios where the LLM requests additional information from the user mid-conversation. This will enable validation of more complex, multi-turn interactions where agents need clarification or supplementary data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with BlazeMeter and popular testing platforms&lt;/strong&gt;: This will unlock centralized test management, historical trending, team collaboration, and enterprise reporting. Imagine running your MCP tests alongside your API and performance tests in one unified dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New assertions and enhanced templating&lt;/strong&gt;: More validation options for complex scenarios and even more powerful template helpers to reduce test maintenance and increase reusability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/mykhaliev/agent-benchmark" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full documentation&lt;/li&gt;
&lt;li&gt;More examples&lt;/li&gt;
&lt;li&gt;Installation instructions&lt;/li&gt;
&lt;li&gt;Contributing guidelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is Apache 2.0 licensed and contributions are very welcome!&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrapping Up
&lt;/h3&gt;

&lt;p&gt;Testing MCP servers doesn't have to be complicated. With agent-benchmark, you can define comprehensive tests in YAML, run them across multiple LLM providers, and get beautiful reports - all with a single command.&lt;/p&gt;

&lt;p&gt;Give it a try and let me know what you think! I'd love to hear about your use cases or feature requests.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/mykhaliev/agent-benchmark" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;MCP Protocol Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mykhaliev/agent-benchmark/releases" rel="noopener noreferrer"&gt;Latest Release&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Have questions? Drop them in the comments or open an issue on GitHub!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>llm</category>
      <category>testing</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
