Alejandro Ponce de León for Stacklok

Stacklok's MCP Optimizer vs Anthropic's Tool Search Tool: A Head-to-Head Comparison

TL;DR

Both solutions tackle the critical problem of token bloat from excessive tool definitions. However, our testing with 2,792 tools reveals a stark performance gap: Stacklok MCP Optimizer achieves 94% accuracy in selecting the right tools, while Anthropic's Tool Search Tool achieves only 34% accuracy. If you're building production AI agents that need reliable tool selection without breaking the bank on tokens, these numbers matter.

The Problem Both Are Solving

When you connect AI agents to multiple Model Context Protocol (MCP) servers, tool definitions quickly consume massive portions of your context window, often before your actual conversation even begins.

The reality? Most queries only need a handful of these tools. Loading all of them wastes tokens (read: money) and degrades model performance as the tool count grows.

Both Stacklok MCP Optimizer (launched October 28, 2025) and Anthropic's Tool Search Tool (launched November 20, 2025 as part of their advanced tool use beta) address this by exposing a single search tool that discovers and loads only the necessary tools on demand.
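
To make the pattern concrete, here is a minimal sketch of the idea (the names and the naive keyword matching are hypothetical, not the actual API of either product): the model's context contains a single search tool, and full definitions are pulled in only for the tools a query actually needs.

```python
# Hypothetical sketch of the "single search tool" pattern, not either product's API.
ALL_TOOLS = {
    "github.create_pull_request": "Open a pull request between two branches in a GitHub repository.",
    "slack.channels_list": "List all channels in a Slack workspace.",
    # ...in practice, thousands of definitions live here and are never sent upfront
}

def search_tools(query: str, limit: int = 5) -> list[dict]:
    """Return the tool definitions most relevant to the query.

    A real implementation would use BM25, embeddings, or both; this sketch
    uses naive keyword overlap purely to illustrate the control flow.
    """
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(desc.lower().split())), name, desc)
        for name, desc in ALL_TOOLS.items()
    ]
    scored.sort(reverse=True)
    return [{"name": n, "description": d} for hits, n, d in scored[:limit] if hits > 0]

# The model only ever sees `search_tools` plus whatever it returns:
print(search_tools("Show me all channels in my Slack workspace"))
```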

Why This Matters: Real Benefits and Trade-offs

The Upside

Token savings are substantial. We've observed up to 80% reductions in input tokens. In their internal testing, Anthropic reports their approach preserves 191,300 tokens of context compared to loading all tools upfront, an 85% reduction. In rate-limited enterprise environments, this translates directly to cost savings and faster response times.

Improved model performance. Reducing token overhead doesn't just save money, it can improve model accuracy. Anthropic's internal testing showed substantial improvements with Tool Search Tool enabled: Opus 4 jumped from 49% to 74%, and Opus 4.5 improved from 79.5% to 88.1% on MCP evaluations. However, it's important to note that Anthropic's experiments and datasets are not publicly available, making direct comparisons challenging.

Our own testing with MCP Optimizer across different model tiers revealed an interesting pattern: while state-of-the-art models like Claude Sonnet 4 maintained strong performance when benchmarking tool selection accuracy (94.6% → 93.4%), mid-tier and smaller models showed significant improvements. Gemini 2.5 Flash increased from 83.2% to 92.4%, and the gpt-oss-20B model nearly doubled its accuracy from 38% to 69.4%. This suggests that efficient tool loading particularly benefits models with tighter context constraints, making MCP Optimizer valuable across different deployment scenarios, from resource-constrained edge deployments to cost-optimized production systems.

The Downside

Risk of tool retrieval failure. The benefits above assume the search tool successfully finds the right tool. But what happens when it doesn't? If the search tool can't find the right tool, your task fails or produces unexpected behavior. While the agent can retry searches, this introduces latency and still consumes tokens. The critical question becomes: How often does the search actually work in practice? This is precisely what our head-to-head comparison measures.

How Each Approach Works

Both solutions introduce a lightweight search tool, but their algorithms differ significantly:

  • Stacklok MCP Optimizer: Combines semantic search with BM25 for hybrid tool discovery
  • Anthropic Tool Search Tool: Offers two variants, BM25-only or regex-based pattern matching

The algorithmic difference has profound implications for real-world performance, as our testing reveals.
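
As a rough illustration of what "hybrid" means here, the sketch below blends a lexical BM25 score with a semantic similarity score into one ranking. The 50/50 weighting and min-max normalization are assumptions for the example, not MCP Optimizer's actual algorithm or parameters.

```python
def normalize(scores: list[float]) -> list[float]:
    """Min-max scale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def hybrid_rank(bm25_scores: list[float], semantic_scores: list[float], alpha: float = 0.5) -> list[int]:
    """Return tool indices ranked by a blend of lexical and semantic relevance."""
    lexical = normalize(bm25_scores)
    semantic = normalize(semantic_scores)
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)

# Tool 2 matches best on keywords, tool 0 on meaning; the blend ranks both highly.
print(hybrid_rank([1.2, 0.1, 3.4], [0.91, 0.20, 0.55]))  # -> [2, 0, 1]
```

A BM25-only or regex-only ranker sees just the lexical signal, which is exactly where it can miss tools whose descriptions don't share keywords with the query.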

The Head-to-Head Comparison

We conducted a comprehensive evaluation to answer the question: Which approach is more effective? (Source code and full results)

Test Methodology

  1. Loaded 2,792 tools from various MCP servers using the MCP-tools dataset
  2. For each tool, generated a synthetic query using an LLM that would naturally require that specific tool
    • Example: For GitHub's create_pull_request tool → Generated query: "Create a pull request from feature-branch to main branch in the octocat/Hello-World repository on GitHub"
    • Example: Slack's channels_list tool → Generated query: "Show me all channels in my Slack workspace"
  3. Used Claude Sonnet 4.5 to test whether each approach could correctly search and select the original tool that generated the query
    • Retrieval Accuracy: Does the correct tool appear anywhere in the search results returned by the search tool?
    • Selection Accuracy: Is the correct tool actually selected by the model for use?
    • This direct mapping lets us objectively measure retrieval accuracy: we know the ground truth for every query. In the examples above, the correct tools would be GitHub's create_pull_request and Slack's channels_list
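
In pseudocode terms, the harness boils down to a loop like the one below (the helper functions are placeholders passed in by the caller; the real implementation lives in the linked source code):

```python
def evaluate(tools, generate_query, search_tools, select_tool):
    """Measure retrieval and selection accuracy over a labelled tool set.

    Each tool is the known ground truth for its own generated query, so both
    metrics reduce to simple counting.
    """
    retrieved = selected = 0
    for tool in tools:
        query = generate_query(tool)                  # LLM-written query targeting this tool
        results = search_tools(query)                 # what the search tool surfaces
        if tool["name"] in {r["name"] for r in results}:
            retrieved += 1                            # retrieval: correct tool appears in results
        if select_tool(query, results) == tool["name"]:
            selected += 1                             # selection: model actually picks it
    n = len(tools)
    return {"retrieval_accuracy": retrieved / n, "selection_accuracy": selected / n}
```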

Results

Accuracy comparison chart showing MCP Optimizer at 93.95% selection accuracy and 98.03% retrieval accuracy versus Tool Search Tool at 33.70%/47.85% (BM25) and 30.01%/39.00% (regex)

The stark difference in selection accuracy between approaches primarily reflects retrieval effectiveness rather than model performance. Since all approaches used the same model (Claude Sonnet 4.5) for tool selection, the 94% vs 34% accuracy gap stems from MCP Optimizer's superior retrieval accuracy (98% vs 48%). Put simply: if the correct tool doesn't appear in the search results, even the best model cannot select it. MCP Optimizer's hybrid semantic + BM25 search successfully surfaces the correct tool in 98% of cases, giving the model the opportunity to make the right selection. In contrast, Tool Search Tool's lower retrieval rates mean the model often never sees the correct tool among its options.

These results align with independent testing from other organizations. Arcade reported that Anthropic's Tool Search achieved only 56% retrieval accuracy with regex and 64% with BM25 across 4,027 tools.

Runtime Performance Characteristics

| Approach | Average execution time | Average tools retrieved | Average input tokens* |
| --- | --- | --- | --- |
| MCP Optimizer | 5.75 seconds | 5.2 | 3,296 |
| Tool Search Tool (BM25) | 12.05 seconds | 5.0 | 2,823 |
| Tool Search Tool (regex) | 13.55 seconds | 5.2 | 3,679 |

* Average Input Tokens: The total number of tokens sent to the model per request, including system prompt, tool definitions, and user query.

Beyond accuracy, the operational characteristics of each approach reveal important trade-offs. Tool Search Tool (BM25) achieves the lowest token consumption at 2,823 tokens per request, which likely stems from retrieving slightly fewer tools on average (5.0 vs 5.2). However, MCP Optimizer's token count of 3,296 still represents substantial savings compared to attempting to load all 2,792 tools upfront, which would require 206,073 tokens and cause an error due to context window limitations.
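
For a back-of-the-envelope sense of scale, plugging the benchmark's own numbers into a two-line calculation:

```python
all_tools_tokens = 206_073   # loading every one of the 2,792 definitions upfront
optimizer_tokens = 3_296     # average input tokens per request with MCP Optimizer
print(f"{1 - optimizer_tokens / all_tools_tokens:.1%} fewer input tokens")  # ≈ 98.4%
```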

The execution time differences are noteworthy: MCP Optimizer completes searches in 5.75 seconds on average, while Tool Search Tool takes 12.05 seconds (BM25) and 13.55 seconds (regex). However, this comparison requires context. MCP Optimizer was executed locally in our test environment, while Tool Search Tool operates as an internal Anthropic service with unknown infrastructure requirements and potential network latency.

What This Means

The numbers tell a clear story: MCP Optimizer consistently finds the correct tool 94% of the time, while Tool Search Tool's accuracy hovers around 30-34% in environments with thousands of tools. For production systems where reliability and performance matter, this gap is significant.

The Verdict

Anthropic's Tool Search Tool correctly identifies a real problem facing production AI deployments. The concept of on-demand tool loading is sound, and the token savings are genuine. However, the current implementation isn't production-ready for environments with large tool catalogs. Limited to Claude Sonnet 4.5 and Opus 4.5, it remains a proprietary solution exclusive to Anthropic's ecosystem.

MCP Optimizer, on the other hand, delivers on the promise: reliable tool selection (94% accuracy) combined with significant token savings. Built into the ToolHive runtime as a free and open-source solution, it seamlessly integrates with all major AI clients including Claude Code, GitHub Copilot, Cursor, and others, providing vendor flexibility and broader compatibility across different AI platforms. For teams building AI agents that need to work consistently across hundreds or thousands of tools, this performance difference and deployment flexibility are critical.

Looking Forward

The future of AI agents depends on solving context window constraints without sacrificing reliability. For that future to arrive, we need tool selection systems that work reliably. MCP Optimizer proves that hybrid semantic + keyword search can deliver both token efficiency and production-grade accuracy. As Anthropic's Tool Search Tool matures beyond beta, we hope to see similar reliability gains.

For now, if you're deploying AI agents in production and need dependable tool selection across extensive tool catalogs, the data points to MCP Optimizer as the more reliable choice.


Interested in learning more about MCP Optimizer? Check out the ToolHive documentation or visit stacklok.com.
