Introducing mcp-tef - Testing Your MCP Tool Descriptions Before They Cause Problems

Nigel Brown for Stacklok

TL;DR

When you build MCP tools, vague or overlapping descriptions cause LLMs to select the wrong tools—or no tools at all. Testing in production frustrates users and damages trust. mcp-tef is an open-source tool evaluation system that lets you test tool descriptions systematically before deployment, catching problems early with real LLM testing, similarity detection, and quality analysis.


The Problem: Tool Description Failures in Production

When you write an MCP tool, you provide a name and description. The LLM reads this description and decides whether to use your tool based on user prompts. But here's what goes wrong:

Vague descriptions confuse LLMs. A tool called search with description "Search for things" gives the LLM no information about what can be searched, how to search it, or what it returns.

Overlapping descriptions cause conflicts. You might have your own create_issue tool, but then add a third-party GitHub MCP server that also has create_issue. The LLM sees two tools with identical names doing similar things and can't determine which to select.

The result: The LLM either picks the wrong tool entirely or becomes so confused that it picks no tool at all. Users get frustrated, trust erodes, and you're debugging in production.

It gets worse with mixed environments. The MCP ecosystem is growing fast. You're mixing custom tools with third-party MCP servers, and maybe multiple third-party servers together. Each has its own set of tools, and they all need to play nicely together. Without systematic testing, conflicts and confusion multiply.


Why This Matters: The Cost of Getting It Wrong

Testing in production is expensive. By the time you realize your tool descriptions are broken, you've already frustrated users. You're fixing problems reactively instead of preventing them proactively.

Manual testing doesn't scale. How do you know if your fix actually works? How do you know if two descriptions are too similar? How do you test that the LLM will actually pick the right tool when a user asks a real question? You can't manually test every possible prompt against every combination of tools.

The solution: Test tool descriptions systematically before deployment, with real LLM testing and actionable feedback.


How mcp-tef Solves This

mcp-tef is an open source (Apache 2.0 licensed) tool evaluation system that helps you create correct, non-clashing tool descriptions from the start. It provides three core capabilities:

1. Tool evaluation

Create test cases with real user prompts (queries), and mcp-tef tests whether the LLM picks the right tool. It provides metrics (precision, recall, F1 scores), validates parameter extraction, and analyzes confidence. If the LLM is highly confident but wrong, that's a "misleading" description that needs immediate attention.

Example:

# Create a test case
mtef test-case create \
  --url https://localhost:8000 \
  --name "GitHub repository search" \
  --query "Find repositories related to MCP tools" \
  --expected-server "http://localhost:8080/github/mcp" \
  --expected-tool "search_repositories" \
  --servers "http://localhost:8080/github/mcp:streamable-http" \
  --insecure

✓ Test case created successfully
ID: d2fcb4bf-8334-4339-a0a8-c1ead2deeea6

# Run the test
mtef test-run execute d2fcb4bf-8334-4339-a0a8-c1ead2deeea6 \
  --url https://localhost:8000 \
  --model-provider openrouter \
  --model-name anthropic/claude-3.5-sonnet \
  --api-key sk-or-v1-... \
  --insecure

Result:

✓ Test run completed successfully
Status: completed
Classification: TP (True Positive)
Tool Match: Correct
Confidence: high (robust description)
Param Score: 10.0/10
Execution: 9,295 ms

2. Similarity detection

Uses embeddings to find tools with similar descriptions. Generates similarity matrices showing which tools overlap, and flags high-similarity pairs (e.g., 0.87 similarity) that might confuse the LLM. Provides specific recommendations for differentiation, including revised descriptions you can use.

Example:

mtef similarity analyze \
  --url https://localhost:8000 \
  --server-urls "http://localhost:8080/fetch/mcp:streamable-http,http://localhost:8080/toolhive-doc-mcp/mcp:streamable-http,http://localhost:8080/mcp-optimizer/mcp:streamable-http,http://localhost:8080/github/mcp:streamable-http" \
  --threshold 0.85 \
  --insecure

Result:

✓ Analysis complete: 18 pairs flagged above 0.85 threshold
Analyzed 55 tools across 4 servers

3. Tool quality analysis

Scores tool descriptions on clarity, completeness, and conciseness (1-10 scale). Tells you what's missing, what's vague, and what could be improved. Provides suggested improved descriptions.

Example:

$ mtef tool-quality \
  --url https://localhost:8000 \
  --server-urls "http://localhost:8080/toolhive-doc-mcp/mcp" \
  --model-provider openrouter \
  --model-name anthropic/claude-3.5-sonnet \
  --insecure \
  --timeout 120

Result:

ℹ Using mcp-tef at https://localhost:8000

Tool Quality Evaluation Results
============================================================

┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Tool Name  ┃ Clarity ┃ Completeness ┃ Conciseness ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ query_docs │  7/10   │     6/10     │    9/10     │
│ get_chunk  │  6/10   │     4/10     │    8/10     │
└────────────┴─────────┴──────────────┴─────────────┘

✓ Evaluated 2 tool(s)


Note on transport support:

  • Supported: mcp-tef connects to MCP servers using the Streamable HTTP or SSE (deprecated) transports.
  • Not supported: mcp-tef does not connect to stdio servers directly, but you can run stdio-based MCP servers with ToolHive, which exposes them via a Streamable HTTP endpoint.
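
For example, ToolHive's thv CLI can wrap a stdio-based server and expose it over Streamable HTTP so mcp-tef can reach it. This is just a sketch; the server name ("fetch") is an illustration, and the exact flags and output depend on your ToolHive setup:

# Run a stdio-based MCP server via ToolHive, which exposes it over HTTP
thv run fetch

# List running servers to find the URL to pass to mcp-tef's --server-urls flag
thv list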

Using mcp-tef: CLI and HTTP API

All the examples in this post use the mtef CLI tool, but every operation can also be performed directly via HTTP API calls. The mcp-tef server exposes a REST API with OpenAPI documentation, so you can integrate it into your own workflows, CI/CD pipelines, or applications. The server provides interactive API documentation at /docs and an OpenAPI specification at /openapi.json.
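
For example, you can fetch the OpenAPI specification to see which endpoints and schemas the server exposes (shown against a local deployment; the -k flag mirrors the --insecure flag used in the CLI examples):

# Retrieve the OpenAPI spec from a locally running mcp-tef server
curl -k https://localhost:8000/openapi.json

# Interactive documentation is served at https://localhost:8000/docs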

Both approaches provide the same functionality—choose the one that fits your workflow.


Where You Use It

Your own MCP servers: Test descriptions before deployment. Create test cases for common user prompts, run them through mcp-tef, iterate on descriptions until tests pass.

Third-party MCP servers: Evaluate tools before integrating. Test server tools in isolation, see how well they perform, make informed decisions about which servers to use.

Mixed environments: Before mixing multiple servers together, run similarity detection. See which tools conflict, use mcp-tef's recommendations to understand how to differentiate them—maybe you'll need vMCP's prefixing, or maybe you can improve descriptions.

Continuous testing: As you add new tools or update descriptions, keep testing. Make mcp-tef part of your CI/CD pipeline. Catch problems before they reach users.
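
A minimal sketch of what that could look like as a CI step, assuming mtef exits with a non-zero status when a test run fails (the test case ID, URL, and environment variables here are placeholders for your own pipeline):

#!/usr/bin/env bash
set -euo pipefail

# Fail the pipeline if the LLM no longer selects the expected tool
mtef test-run execute "$TEST_CASE_ID" \
  --url "$MCP_TEF_URL" \
  --model-provider openrouter \
  --model-name anthropic/claude-3.5-sonnet \
  --api-key "$OPENROUTER_API_KEY" \
  --insecure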

LLM comparison and migration: Validate that different models (e.g., Anthropic Claude vs. Ollama Llama) correctly select tools using the same test cases. Compare performance across providers to ensure tool descriptions work consistently.
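
For example, you might re-run the test case created earlier against a local model and compare the results. This assumes a local Ollama instance is running; the provider and model names are assumptions, and no API key should be needed for a local model:

# Same test case, different model: compare tool selection across providers
mtef test-run execute d2fcb4bf-8334-4339-a0a8-c1ead2deeea6 \
  --url https://localhost:8000 \
  --model-provider ollama \
  --model-name llama3.1 \
  --insecure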


Real-World Example

You're building a document management MCP server with a tool called search and description: "Search for documents."

mcp-tef flags it:

  • Clarity: 3/10
  • Missing: what can you search? Content? Filenames? Metadata? What does it return?

You improve it to: "Search document CONTENT using keywords and boolean operators. Supports PDF, TXT, DOCX, and MD files. Returns ranked results with highlighted excerpts and relevance scores."

You test it: create a test case, run it, and the LLM correctly selects your tool. Great!
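
A hypothetical test case for that flow, reusing the CLI flags shown earlier (the server URL, query, and tool name are placeholders for your own setup):

mtef test-case create \
  --url https://localhost:8000 \
  --name "Document content search" \
  --query "Find documents that mention the Q3 budget review" \
  --expected-server "http://localhost:8080/docs/mcp" \
  --expected-tool "search" \
  --servers "http://localhost:8080/docs/mcp:streamable-http" \
  --insecure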

But then you add a third-party file system MCP server with find_files: "Find files by searching with patterns."

Similarity detection catches it: 0.87 similarity. Recommendation: "Emphasize that search searches CONTENT, while find_files searches FILENAMES."

You differentiate them clearly: Now the LLM can distinguish between searching document content and finding files by name. If the third-party server doesn't update its description, you can still use vMCP to prefix its tools, but now the descriptions are also clear, so the LLM makes better choices.


Getting Started

mcp-tef is open source and works with several LLM providers: Anthropic, OpenAI, OpenRouter, and Ollama.

Prerequisites

Required:

  • uv — to install the mtef CLI (the install step below uses uv tool install)

Optional:

  • Ollama — for local LLM testing (no API keys needed)
  • Docker — if deploying via the CLI (mtef deploy)
  • API keys — for cloud LLM providers (e.g., OpenRouter) if not using Ollama

Install:

uv tool install \
"mcp-tef-cli@git+https://github.com/StacklokLabs/mcp-tef.git#subdirectory=cli"

Deploy:

mtef deploy --health-check

Test your tools, using the examples above:

# Check quality
mtef tool-quality ...

# Create test case
mtef test-case create --name "My first test" --query ...

# Run test
mtef test-run execute <test-case-id> ...

The whole process takes just a few minutes. You'll immediately see if your descriptions work or if they need improvement.


How mcp-tef Works with vMCP and MCP Optimizer

These tools are designed to work together, each solving different parts of the MCP ecosystem challenge:

mcp-tef helps you write better tool descriptions from the start. It tests whether descriptions are clear, complete, and differentiated. When descriptions are good, LLMs make better tool selection decisions.

vMCP (Virtual MCP Server) provides a unified gateway for multiple MCP servers, handling tool name conflicts through intelligent prefixing and routing. When you've tested your descriptions with mcp-tef, vMCP's prefixing works even better—the LLM can distinguish tools not just by name, but by their clear, well-differentiated descriptions.

MCP Optimizer intelligently routes requests to the right tools across your MCP ecosystem. With well-tested descriptions from mcp-tef, Optimizer has better information to work with, requiring fewer manual overrides and making smarter routing decisions.

The workflow: Use mcp-tef to test and improve your tool descriptions. Deploy with vMCP to handle multi-server coordination. Let MCP Optimizer route requests intelligently. Good descriptions make all these solutions work better together, creating a more reliable and maintainable system.


The Verdict

mcp-tef helps you write better tool descriptions systematically, with real LLM testing and actionable feedback. But great descriptions work even better when combined with the right infrastructure tools.

Key takeaway: Test your tool descriptions before deploying. Good descriptions lead to better tool selection, which leads to happier users. And when you combine well-tested descriptions with tools like vMCP and MCP Optimizer, you get a robust, maintainable MCP ecosystem that works reliably at scale.


Key Points Summary

  1. The problem: Vague or overlapping tool descriptions confuse LLMs, leading to incorrect tool selection.
  2. Why it matters: Testing in production frustrates users; prevention is better than reactive fixes.
  3. The solution: mcp-tef provides systematic testing with tool evaluation, similarity detection, and quality analysis.
  4. Where to use it: Your own servers, third-party servers, mixed environments, continuous testing, LLM comparison.
  5. The goal: Create descriptions that are correct and don't clash, making your entire MCP ecosystem work better.
  6. Working together: mcp-tef, vMCP, and MCP Optimizer complement each other. Good descriptions make infrastructure tools work even better.

Want to join in the MCP fun? Visit toolhive.dev and join the ToolHive community on Discord.
