Building an MCP server is only half the job. The other half — testing its tools — is where most developers drop the ball.
If you're using the Vercel AI SDK to build AI agents with Model Context Protocol (MCP), you already know how powerful tool-calling can be. But an untested tool is a liability. A hallucinating LLM calling a broken tool can cause cascading failures that are notoriously hard to debug in production. This guide walks you through a layered testing strategy — from fast unit tests to full end-to-end evaluations.
Why Testing MCP Tools Is Uniquely Hard
MCP tools sit at the intersection of three systems: your server logic, the LLM's tool selection behavior, and the AI SDK's execution pipeline. A bug can live in any one of them. Testing just the server in isolation gives you false confidence — you also need to verify that the LLM actually picks the right tool for a given prompt, and that the AI SDK correctly routes the call.
Layer 1: Unit Test Tool Logic Directly
Before involving any LLM, test your tool's execute() function in complete isolation. The AI SDK lets you invoke tool execution directly with a typed input:
    const result = await tools['search-docs'].execute(
      { query: 'MCP transport options' },
      { messages: [], toolCallId: 'test-001' }
    );
    expect(result.documents.length).toBeGreaterThan(0);
This runs in milliseconds, costs nothing, and catches the most common bugs — wrong output shapes, missing error handling, and schema mismatches.
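Checking only `documents.length` will not catch a wrong output shape, so a small runtime guard can help. The sketch below is one way to do it, not part of the AI SDK: the `SearchResult` interface and its `title` and `url` fields are assumptions made for illustration, and only the `documents` array comes from the example above.

```typescript
// Hypothetical output shape for the search-docs tool; adjust to match your tool.
interface SearchResult {
  documents: { title: string; url: string }[];
}

// Runtime type guard: returns true only if `value` has the expected shape.
function isSearchResult(value: unknown): value is SearchResult {
  if (typeof value !== 'object' || value === null) return false;
  const docs = (value as { documents?: unknown }).documents;
  return (
    Array.isArray(docs) &&
    docs.every(
      (d) =>
        typeof d === 'object' &&
        d !== null &&
        typeof (d as { title?: unknown }).title === 'string' &&
        typeof (d as { url?: unknown }).url === 'string',
    )
  );
}
```

In a unit test you could then assert `expect(isSearchResult(result)).toBe(true)` before checking individual fields.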
Layer 2: Inspect Schemas with the MCP Inspector
Before writing any integration tests, run the official @modelcontextprotocol/inspector CLI against your server:
    npx @modelcontextprotocol/inspector node dist/server.js
The browser UI lets you browse all registered tools, manually invoke them with custom inputs, and inspect raw JSON-RPC payloads. This is your smoke test — it confirms your server is reachable and your schemas are well-formed. Do this after every new tool you add.
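It helps to know what those raw payloads look like. In MCP, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`, carrying the tool name and its arguments in `params`. The sketch below builds one by hand; the `buildToolCallRequest` helper is hypothetical, and the `get-weather` tool is borrowed from the mock example later in this guide. JSON-RPC ids may also be strings; a number is used here for simplicity.

```typescript
// Shape of an MCP tools/call request as a JSON-RPC 2.0 message.
interface ToolCallRequest {
  jsonrpc: '2.0';
  id: number;
  method: 'tools/call';
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper that assembles the request object.
function buildToolCallRequest(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: '2.0', id, method: 'tools/call', params: { name, arguments: args } };
}

const request = buildToolCallRequest(1, 'get-weather', { location: 'Bengaluru' });
console.log(JSON.stringify(request, null, 2));
```

If the payload you see in the Inspector has a different tool name or argument keys than you expect, the mismatch is usually in your schema, not in the client.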
Layer 3: Mock the LLM for Integration Tests
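Real LLM API calls are slow, flaky, and expensive in CI pipelines. The AI SDK ships a MockLanguageModelV1 in ai/test specifically for this. That class name matches AI SDK 4.x; AI SDK 5 renames it to MockLanguageModelV2 and changes the result shape, so import the mock that matches your installed version: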
    import { MockLanguageModelV1 } from 'ai/test';
    import { generateText } from 'ai';

    // The mock skips the network and always "decides" to call get-weather.
    const mockModel = new MockLanguageModelV1({
      doGenerate: async () => ({
        finishReason: 'tool-calls',
        toolCalls: [{
          toolCallType: 'function',
          toolCallId: 'call-1',
          toolName: 'get-weather',
          args: JSON.stringify({ location: 'Bengaluru' }),
        }],
        rawCall: { rawPrompt: null, rawSettings: {} },
        usage: { promptTokens: 10, completionTokens: 5 },
      }),
    });

    // mcpTools is assumed to be the tool set loaded from your MCP client.
    const { toolResults } = await generateText({
      model: mockModel,
      tools: mcpTools,
      prompt: 'What is the weather in Bengaluru?',
    });
    // The mocked call should have been routed to the tool and executed once.
    expect(toolResults).toHaveLength(1);
This exercises tool registration, execution, and result parsing without a single API call. It does not tell you whether a real model would pick the tool; that is what Layer 4 is for.
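Once several tests need a mocked tool call, the result object gets repetitive. One option is a small factory, sketched below. `mockToolCallResult` is a hypothetical helper, not part of the AI SDK; its field names are copied from the MockLanguageModelV1 example in this section, and the token counts are placeholder values.

```typescript
// Hypothetical helper: builds the object a mocked doGenerate returns for one tool call.
function mockToolCallResult(
  toolName: string,
  args: Record<string, unknown>,
  toolCallId = 'call-1',
) {
  return {
    finishReason: 'tool-calls' as const,
    toolCalls: [
      {
        toolCallType: 'function' as const,
        toolCallId,
        toolName,
        // The V1 mock format in the example passes arguments as a JSON string.
        args: JSON.stringify(args),
      },
    ],
    rawCall: { rawPrompt: null, rawSettings: {} },
    // Placeholder token counts; the values do not matter for routing tests.
    usage: { promptTokens: 10, completionTokens: 5 },
  };
}
```

A mock would then use it as `doGenerate: async () => mockToolCallResult('get-weather', { location: 'Bengaluru' })`, which keeps each test focused on the tool name and arguments it cares about.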
Layer 4: Evaluate Tool Selection Accuracy
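The hardest thing to test is whether the LLM will choose the right tool. A tool with a vague description might get ignored. A poorly scoped tool might be called when it shouldn't be. A grading library such as mcp-evals can score this. The call below is illustrative, so check the package README for the exact exports in your version: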
    import { evalTool } from 'mcp-evals';

    const result = await evalTool({
      prompt: 'Find all open GitHub issues labeled bug',
      expectedTool: 'list-github-issues',
      grader: 'openai/gpt-4o',
    });
    console.log(`Score: ${result.score}, Passed: ${result.passed}`);
Run these evals whenever you change a tool's name, description, or input schema — those are the highest-risk moments.
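Whatever grader you use, it is worth tracking a single accuracy number across a fixed set of prompts so regressions are visible. The sketch below is independent of any eval library: it assumes you have already recorded which tool the model picked for each prompt, and the `SelectionCase` type and `selectionAccuracy` function are hypothetical names.

```typescript
interface SelectionCase {
  prompt: string;
  expectedTool: string | null; // null means no tool should be called
  actualTool: string | null; // what the model picked in a recorded run
}

// Returns the fraction of cases where the model picked the expected tool,
// plus the failing cases for inspection. An empty list scores 0 so that
// an empty suite cannot pass a threshold by accident.
function selectionAccuracy(cases: SelectionCase[]): {
  accuracy: number;
  failures: SelectionCase[];
} {
  const failures = cases.filter((c) => c.expectedTool !== c.actualTool);
  const accuracy =
    cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { accuracy, failures };
}

const report = selectionAccuracy([
  { prompt: 'Find all open GitHub issues labeled bug', expectedTool: 'list-github-issues', actualTool: 'list-github-issues' },
  { prompt: 'What is the weather in Bengaluru?', expectedTool: 'get-weather', actualTool: 'get-weather' },
  { prompt: 'Tell me a joke', expectedTool: null, actualTool: 'search-docs' },
  { prompt: 'How do MCP transports work?', expectedTool: 'search-docs', actualTool: 'search-docs' },
]);
console.log(report.accuracy, report.failures.map((f) => f.prompt));
```

Including cases where no tool should be called, like the third one here, is what surfaces poorly scoped tools.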
The One Rule That Saves Hours of Debugging
Always close the MCP client when a test suite finishes by calling await mcpClient.close(). Leaving MCP connections open causes port conflicts, zombie processes, and flaky tests in CI. Create the client in beforeAll and close it in afterAll:
    let mcpClient;
    beforeAll(async () => { mcpClient = await createMCPClient({ ... }); });
    afterAll(async () => { await mcpClient.close(); });
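For one-off scripts or tests that need their own client, a try/finally wrapper gives the same guarantee without relying on test hooks. This is a minimal sketch: `withClient` and the `Closable` interface are hypothetical, and the only assumption about the MCP client is that it exposes an async `close()` method, as in the snippet above.

```typescript
// Anything with an async close() method; an MCP client is assumed to fit this shape.
interface Closable {
  close(): Promise<void>;
}

// Hypothetical helper: create a client, run the callback, and always close the client.
async function withClient<C extends Closable, R>(
  create: () => Promise<C>,
  use: (client: C) => Promise<R>,
): Promise<R> {
  const client = await create();
  try {
    return await use(client);
  } finally {
    // Runs whether `use` resolved or threw.
    await client.close();
  }
}
```

The callback's return value is passed through, and an error thrown inside it still propagates to the test after the client has been closed.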
What to Test at Each Layer
| Layer | Speed | Cost | What it verifies |
|---|---|---|---|
| tool.execute() directly | Fast | Free | Logic, output schema, error handling |
| MCP Inspector | Manual | Free | Schema validity, tool discoverability |
| MockLanguageModelV1 | Fast | Free | Full agent pipeline, tool routing |
| mcp-evals grading | Slow | LLM cost | Tool selection accuracy by description |
Start with Layer 1 for every tool, use Layer 3 for CI, and run Layer 4 before any deployment that touches tool descriptions or names. Your future self — staring at a 3 AM production incident — will thank you.
Have questions about testing a specific tool type? Drop them in the comments below.