Testing MCP Server Tools in AI Agents — A Practical Guide

#mcp #vercel #agentskills

Building an MCP server is only half the job. The other half, testing its tools, is the part that tends to get skipped.

If you're using the Vercel AI SDK to build AI agents with Model Context Protocol (MCP), you already know how powerful tool-calling can be. But an untested tool is a liability. A hallucinating LLM calling a broken tool can cause cascading failures that are notoriously hard to debug in production. This guide walks you through a layered testing strategy — from fast unit tests to full end-to-end evaluations.

Why Testing MCP Tools Is Uniquely Hard

MCP tools sit at the intersection of three systems: your server logic, the LLM's tool selection behavior, and the AI SDK's execution pipeline. A bug can live in any one of them. Testing just the server in isolation gives you false confidence — you also need to verify that the LLM actually picks the right tool for a given prompt, and that the AI SDK correctly routes the call.

Layer 1: Unit Test Tool Logic Directly

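Before involving any LLM, test your tool's execute() function in complete isolation. The AI SDK lets you call execute() directly: the first argument is the tool input, and the second is the options object the SDK would normally supply during a run.

// tools is the tool map under test (the same object you would pass to generateText)
const result = await tools['search-docs'].execute(
  { query: 'MCP transport options' },
  { messages: [], toolCallId: 'test-001' } // options the SDK normally fills in
);

// Assert on the output shape, not just that the call resolved
expect(result.documents.length).toBeGreaterThan(0);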

This runs in milliseconds, costs nothing, and catches the most common bugs — wrong output shapes, missing error handling, and schema mismatches.
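
As a library-free illustration of what a Layer 1 test can cover, the sketch below defines a hypothetical `searchDocs` tool backed by an in-memory list. Both the tool and the list are assumptions made for this example and do not come from any real server. It checks the happy path, the no-match path, and the invalid-input path.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical tool types: only the parts a unit test needs.
interface SearchInput { query: string }
interface SearchOutput { documents: { title: string }[] }

// Hypothetical in-memory corpus standing in for a real document index.
const corpus = [
  { title: "MCP transport options" },
  { title: "Writing tool descriptions" },
];

const searchDocs = {
  description: "Search the documentation by keyword",
  async execute({ query }: SearchInput): Promise<SearchOutput> {
    const needle = query.trim().toLowerCase();
    if (needle === "") {
      throw new Error("query must not be empty");
    }
    return {
      documents: corpus.filter((d) => d.title.toLowerCase().includes(needle)),
    };
  },
};

async function runUnitTests(): Promise<void> {
  // Happy path: a matching query returns documents in the expected shape.
  const hit = await searchDocs.execute({ query: "transport" });
  assert.deepEqual(hit, { documents: [{ title: "MCP transport options" }] });

  // No match: an empty list, not an error.
  const miss = await searchDocs.execute({ query: "billing" });
  assert.deepEqual(miss.documents, []);

  // Invalid input: the tool rejects instead of returning a malformed result.
  await assert.rejects(searchDocs.execute({ query: "   " }), /must not be empty/);
}

runUnitTests().then(() => console.log("unit tests passed"));
```

The three cases map onto the bug classes named above: output shape, empty results, and error handling. In a real suite each case would usually be its own test so that a failure points at one behavior.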

Layer 2: Inspect Schemas with the MCP Inspector

Run the official @modelcontextprotocol/inspector CLI against your server before writing any integration tests:

npx @modelcontextprotocol/inspector node dist/server.js

The browser UI lets you browse all registered tools, manually invoke them with custom inputs, and inspect raw JSON-RPC payloads. This is your smoke test — it confirms your server is reachable and your schemas are well-formed. Do this after every new tool you add.
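
The checks you do by eye in the Inspector can also be captured as a small lint. The sketch below assumes you have copied a `tools/list` result into a test fixture; `checkToolListing` and the sample entries are hypothetical. The protocol requires `inputSchema` to be a JSON Schema of type `object`. A description is optional in the protocol, so flagging a missing one is a project-level rule chosen here, on the grounds that the model relies on descriptions to pick tools.

```typescript
import { strict as assert } from "node:assert";

// Loose view of one entry in an MCP `tools/list` result.
interface ToolListing {
  name?: unknown;
  description?: unknown;
  inputSchema?: { type?: unknown; properties?: unknown };
}

// Returns human-readable problems; an empty array means the entry looks well-formed.
function checkToolListing(tool: ToolListing): string[] {
  const problems: string[] = [];
  if (typeof tool.name !== "string" || tool.name.length === 0) {
    problems.push("missing name");
  }
  // Project-level rule (not a protocol requirement): every tool needs a description.
  if (typeof tool.description !== "string" || tool.description.trim().length === 0) {
    problems.push("missing description");
  }
  if (tool.inputSchema?.type !== "object") {
    problems.push('inputSchema.type must be "object"');
  }
  return problems;
}

// Hypothetical fixture, as if pasted from the Inspector's raw payload view.
const listed: ToolListing[] = [
  {
    name: "get-weather",
    description: "Get current weather for a city",
    inputSchema: { type: "object", properties: { location: { type: "string" } } },
  },
  { name: "search-docs", inputSchema: { type: "string" } },
];

assert.deepEqual(checkToolListing(listed[0]), []);
assert.deepEqual(checkToolListing(listed[1]), [
  "missing description",
  'inputSchema.type must be "object"',
]);
```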

Layer 3: Mock the LLM for Integration Tests

Real LLM API calls are slow, flaky, and expensive in CI pipelines. In AI SDK 4.x, the ai/test entry point exports MockLanguageModelV1 for exactly this purpose (AI SDK 5 replaces it with MockLanguageModelV2, whose result shape differs):

import { MockLanguageModelV1 } from 'ai/test';
import { generateText } from 'ai';

const mockModel = new MockLanguageModelV1({
  doGenerate: async () => ({
    finishReason: 'tool-calls',
    toolCalls: [{
      toolCallType: 'function',
      toolCallId: 'call-1',
      toolName: 'get-weather',
      args: JSON.stringify({ location: 'Bengaluru' }),
    }],
    rawCall: { rawPrompt: null, rawSettings: {} },
    usage: { promptTokens: 10, completionTokens: 5 },
  }),
});

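const { toolResults } = await generateText({
  model: mockModel,
  tools: mcpTools,
  prompt: 'What is the weather in Bengaluru?',
});

// The scripted tool call should have been routed to the registered tool
expect(toolResults).toHaveLength(1);
expect(toolResults[0].toolName).toBe('get-weather');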

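This exercises the agent pipeline around the model (tool registration, argument parsing, execution, and result handling) without a single API call. It does not tell you whether a real model would choose that tool; that is what Layer 4 covers.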
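
To make the idea concrete without depending on the SDK, here is a minimal sketch of the same pattern: a scripted stand-in for the model emits a fixed tool call, and a tiny dispatcher routes it to the registered tool. The `Model`, `Tool`, and `runAgent` names are hypothetical and far simpler than the AI SDK's real types. The point is only to show which parts of the pipeline a mocked model still exercises.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical minimal types; the real AI SDK types are richer.
type ToolCall = { toolName: string; args: string }; // args is a JSON string
type Tool = { execute: (input: any) => Promise<unknown> };
type Model = { generate: (prompt: string) => Promise<ToolCall[]> };

// Runs every tool call the model emits and collects the results.
async function runAgent(model: Model, tools: Record<string, Tool>, prompt: string) {
  const results: { toolName: string; result: unknown }[] = [];
  for (const call of await model.generate(prompt)) {
    const tool = tools[call.toolName];
    if (!tool) throw new Error(`unknown tool: ${call.toolName}`);
    const input = JSON.parse(call.args); // argument parsing is part of what gets tested
    results.push({ toolName: call.toolName, result: await tool.execute(input) });
  }
  return results;
}

// Scripted stand-in for the LLM: always requests the same tool call.
const scriptedModel: Model = {
  generate: async () => [
    { toolName: "get-weather", args: JSON.stringify({ location: "Bengaluru" }) },
  ],
};

// Hypothetical tool that echoes its input so the test can check routing and parsing.
const tools: Record<string, Tool> = {
  "get-weather": { execute: async ({ location }) => ({ location, tempC: 27 }) },
};

runAgent(scriptedModel, tools, "What is the weather in Bengaluru?").then((results) => {
  assert.deepEqual(results, [
    { toolName: "get-weather", result: { location: "Bengaluru", tempC: 27 } },
  ]);
});
```

Because the model's output is fixed, a failure here can only come from registration, parsing, or the tool itself, which is what makes this layer cheap to debug.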

Layer 4: Evaluate Tool Selection Accuracy

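The hardest thing to test is whether the LLM will choose the right tool. A tool with a vague description might get ignored. A poorly scoped tool might be called when it shouldn't be. Use mcp-evals for structured grading. The snippet shows the general shape of such an eval; confirm the exact exports and options against the mcp-evals README for the version you install: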

import { evalTool } from 'mcp-evals';

const result = await evalTool({
  prompt: 'Find all open GitHub issues labeled bug',
  expectedTool: 'list-github-issues',
  grader: 'openai/gpt-4o',
});

console.log(`Score: ${result.score}, Passed: ${result.passed}`);

Run these evals whenever you change a tool's name, description, or input schema — those are the highest-risk moments.
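
Whatever grader you use, the bookkeeping around it is simple enough to sketch. The example below scores a list of prompt/expected-tool pairs against a selector function. Everything in it is hypothetical: `scoreSelection` is not part of any library, and `keywordSelector` is a keyword stub standing in for a real (asynchronous) model call so that the example runs offline.

```typescript
import { strict as assert } from "node:assert";

// expectedTool is null when no tool should be called for the prompt.
interface SelectionCase { prompt: string; expectedTool: string | null }
type Selector = (prompt: string) => string | null;

// Compares the selected tool with the expected one for every case.
function scoreSelection(cases: SelectionCase[], select: Selector) {
  if (cases.length === 0) throw new Error("no cases to score");
  const failures = cases
    .map((c) => ({ ...c, actualTool: select(c.prompt) }))
    .filter((c) => c.actualTool !== c.expectedTool);
  return { accuracy: (cases.length - failures.length) / cases.length, failures };
}

// Keyword stub standing in for the model's tool choice (an assumption for this sketch).
const keywordSelector: Selector = (prompt) => {
  const p = prompt.toLowerCase();
  if (p.includes("issue")) return "list-github-issues";
  if (p.includes("weather")) return "get-weather";
  return null;
};

const cases: SelectionCase[] = [
  { prompt: "Find all open GitHub issues labeled bug", expectedTool: "list-github-issues" },
  { prompt: "What is the weather in Bengaluru?", expectedTool: "get-weather" },
  { prompt: "Tell me a joke", expectedTool: null },
  // Overlapping vocabulary: the kind of prompt that exposes a poorly scoped tool.
  { prompt: "Is the weather service having issues?", expectedTool: "get-weather" },
];

const report = scoreSelection(cases, keywordSelector);
assert.equal(report.accuracy, 0.75);
assert.equal(report.failures[0].actualTool, "list-github-issues");
```

Including prompts where no tool should fire, and prompts that overlap two tools, is what turns an accuracy number into something that catches scoping regressions.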

The One Rule That Saves Hours of Debugging

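Always call await mcpClient.close() when a test suite finishes. An MCP client left open keeps its server process alive, can hold ports on HTTP-based transports, and makes CI runs flaky. Create the client in beforeAll and close it in afterAll (createMCPClient here stands for the SDK's client factory, which AI SDK 4.x exports from ai as experimental_createMCPClient):

let mcpClient;

beforeAll(async () => { mcpClient = await createMCPClient({ ... }); });
afterAll(async () => { await mcpClient.close(); });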
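
For one-off scripts or tests that need their own client, the same guarantee can be expressed with try/finally. This is a minimal sketch: `withClient` is a hypothetical helper, and the fake client exists only so the example runs without a server. The only assumption about the real client is that it has an async `close()` method, as used above.

```typescript
import { strict as assert } from "node:assert";

// Only the part of the client this helper relies on.
interface Closable { close(): Promise<void> }

// Creates a client, runs the body, and closes the client even if the body throws.
async function withClient<C extends Closable, T>(
  create: () => Promise<C>,
  body: (client: C) => Promise<T>,
): Promise<T> {
  const client = await create();
  try {
    return await body(client);
  } finally {
    await client.close();
  }
}

// Hypothetical fake client that records whether close() ran.
function makeFakeClient() {
  const state = { closed: false };
  const client = { close: async () => { state.closed = true; } };
  return { client, state };
}

const failing = makeFakeClient();
assert
  .rejects(
    withClient(async () => failing.client, async () => { throw new Error("boom"); }),
    /boom/,
  )
  .then(() => assert.equal(failing.state.closed, true)); // closed despite the failure
```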

What to Test at Each Layer

| Layer | Speed | Cost | Tests |
| --- | --- | --- | --- |
| `tool.execute()` directly | Fast | Free | Logic, output schema, error handling |
| MCP Inspector | Manual | Free | Schema validity, tool discoverability |
| `MockLanguageModelV1` | Fast | Free | Full agent pipeline, tool routing |
| `mcp-evals` grading | Slow | LLM cost | Tool selection accuracy by description |

Start with Layer 1 for every tool, use Layer 3 for CI, and run Layer 4 before any deployment that touches tool descriptions or names. Your future self — staring at a 3 AM production incident — will thank you.

Have questions about testing a specific tool type? Drop them in the comments below.
