<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alejandro Ponce de León</title>
    <description>The latest articles on DEV Community by Alejandro Ponce de León (@aponcedeleonch).</description>
    <link>https://dev.to/aponcedeleonch</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2422993%2F224b6352-bc81-4a2a-9412-5814b6f3bc78.jpg</url>
      <title>DEV Community: Alejandro Ponce de León</title>
      <link>https://dev.to/aponcedeleonch</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aponcedeleonch"/>
    <language>en</language>
    <item>
      <title>Cut token waste across your entire team with the MCP Optimizer</title>
      <dc:creator>Alejandro Ponce de León</dc:creator>
      <pubDate>Wed, 11 Mar 2026 17:14:51 +0000</pubDate>
      <link>https://dev.to/stacklok/cut-token-waste-across-your-entire-team-with-the-mcp-optimizer-7e</link>
      <guid>https://dev.to/stacklok/cut-token-waste-across-your-entire-team-with-the-mcp-optimizer-7e</guid>
      <description>&lt;p&gt;You already cut your own token bill. Now imagine doing that for every member on your team, without them lifting a finger.&lt;/p&gt;

&lt;p&gt;Here's what you'll learn in this post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why per-person Optimizer setups don't scale, and what to do instead
&lt;/li&gt;
&lt;li&gt;How Stacklok's &lt;a href="https://stacklok.com/blog/introducing-virtual-mcp-server-unified-gateway-for-multi-mcp-workflows/" rel="noopener noreferrer"&gt;Virtual MCP Server (vMCP)&lt;/a&gt; delivers team-wide token savings from a single deployment
&lt;/li&gt;
&lt;li&gt;How AI agents benefit automatically, with no per-agent configuration required
&lt;/li&gt;
&lt;li&gt;How to deploy the Optimizer in Kubernetes in two steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8v141h8mzwkhsxd2b5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8v141h8mzwkhsxd2b5x.png" alt=" " width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The MCP Optimizer dynamically finds and exposes the right tools to clients only when needed, via a unified vMCP Gateway endpoint.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem at scale
&lt;/h2&gt;

&lt;p&gt;If you read &lt;a href="https://stacklok.com/blog/cut-token-waste-from-your-ai-workflow-with-the-toolhive-mcp-optimizer/" rel="noopener noreferrer"&gt;Cut Token Waste from Your AI Workflow with the ToolHive MCP Optimizer&lt;/a&gt;, you know the local Optimizer works great — download it, run it, and watch your token bill drop by 60-85% per request in our benchmarks. But individual setups aren't enterprise setups. You can't ask every team member to install an embedding model, tune search parameters, and keep the whole thing running alongside their other tools. And you can't ask your platform team to verify that each of those setups is configured correctly and stays that way. You need a solution that everyone benefits from the moment they connect.&lt;/p&gt;

&lt;p&gt;Configuration drift is the first headache. One person runs a different embedding model than another. Someone tweaked the hybrid search ratio three weeks ago and forgot to tell anyone. Someone else doesn't even know the Optimizer needs configuring and wonders why their token bill is 3x everyone else's. Meanwhile, each machine burns CPU and memory running its own embedding inference — resources that could be doing literally anything else.&lt;/p&gt;

&lt;p&gt;AI agents amplify both the problem and the payoff. Agents that fan out across multiple MCP servers stuff the full tool catalog into the context window on every invocation. When an agent connects to five or six MCP servers, that catalog grows quickly. The token bill climbs, inference slows, and the LLM starts picking the wrong tools because it's drowning in descriptions.&lt;/p&gt;

&lt;p&gt;Multiply that by hundreds of agent runs a day. Without a centralized Optimizer, you'd have to manually wire it up for each agent and each server combination.&lt;/p&gt;

&lt;p&gt;What you actually want — for users and AI agents alike — is to configure it once, in one place, and have everyone benefit automatically. That's exactly what Stacklok now delivers through vMCP and the ToolHive Kubernetes Operator.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Optimizer works
&lt;/h2&gt;

&lt;p&gt;The core idea is simple. Instead of sending your AI agent the full list of every tool from every MCP server (which can easily run to hundreds of descriptions), the Optimizer collapses them into two meta-tools: &lt;code&gt;find_tool&lt;/code&gt; and &lt;code&gt;call_tool&lt;/code&gt;. A typical request flows like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your agent receives a prompt&lt;/strong&gt; that requires tool use.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It calls &lt;code&gt;find_tool&lt;/code&gt;&lt;/strong&gt; with a natural language description of what it needs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Optimizer runs hybrid search&lt;/strong&gt; (semantic and keyword) against all registered tools.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only the relevant tools come back&lt;/strong&gt; — typically 8 instead of 200+.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The agent calls &lt;code&gt;call_tool&lt;/code&gt;&lt;/strong&gt; to invoke the one it needs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your agent never sees the full tool catalog. It discovers tools on demand, pays only for the descriptions it actually needs, and the LLM stays focused on fewer, more relevant options.&lt;/p&gt;
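&lt;p&gt;To make the flow concrete, here's a hypothetical exchange. The two meta-tool names come from the Optimizer, but the &lt;code&gt;github.*&lt;/code&gt; tool names and payload shapes below are made up for the example, not the exact wire format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;find_tool(description="open a pull request and assign reviewers")
  returns: a handful of matches, e.g. github.create_pull_request,
           github.request_reviewers -- instead of the full 200+ catalog

call_tool(name="github.create_pull_request",
          arguments={"title": "Add retry logic", "base": "main"})
  returns: the backing MCP server's normal response, proxied through vMCP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;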

&lt;p&gt;For a deeper dive into the mechanics and benchmarks, see &lt;a href="https://stacklok.com/blog/cut-token-waste-from-your-ai-workflow-with-the-toolhive-mcp-optimizer/" rel="noopener noreferrer"&gt;the original Optimizer blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  All the power of vMCP, now with cost savings
&lt;/h2&gt;

&lt;p&gt;If you're already running Stacklok in Kubernetes, you're likely using vMCP, a unified gateway that aggregates multiple MCP servers behind a single endpoint. vMCP gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified gateway&lt;/strong&gt;. One endpoint for all your MCP servers. Onboarding a new team member means sharing one URL, not configuring five connections.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication and authorization&lt;/strong&gt;. Centralized auth for incoming clients (OIDC, anonymous, etc.) and outgoing connections, so you can enforce access policies without modifying each MCP server.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation and conflict resolution&lt;/strong&gt;. Automatic prefixing, priority ordering, or manual overrides when tool names collide across MCP servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Optimizer adds one more layer on top:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token optimization&lt;/strong&gt;. Every tool behind the gateway gets indexed. Clients see only &lt;code&gt;find_tool&lt;/code&gt; and &lt;code&gt;call_tool&lt;/code&gt; instead of the full catalog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The savings are real. The &lt;a href="https://stacklok.com/blog/cut-token-waste-from-your-ai-workflow-with-the-toolhive-mcp-optimizer/" rel="noopener noreferrer"&gt;original Optimizer blog post&lt;/a&gt; walks through the benchmarks in detail, showing 60-85% token reductions per request. In a &lt;a href="https://stacklok.com/blog/stackloks-mcp-optimizer-vs-anthropics-tool-search-tool-a-head-to-head-comparison/" rel="noopener noreferrer"&gt;head-to-head comparison with Anthropic's tool search tool&lt;/a&gt;, the Optimizer matched or exceeded a first-party solution.&lt;/p&gt;

&lt;p&gt;Token savings aren't the only benefit. Fewer tool descriptions means less noise for the LLM to wade through, which means better tool selection and fewer hallucinated tool calls. You're saving tokens and getting better results.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to deploy the Optimizer in Kubernetes
&lt;/h2&gt;

&lt;p&gt;The Kubernetes setup is deliberately minimal. You need two things: an &lt;code&gt;EmbeddingServer&lt;/code&gt; and a reference to it from your &lt;code&gt;VirtualMCPServer&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Deploy an EmbeddingServer
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;EmbeddingServer&lt;/code&gt; Custom Resource Definition (CRD) manages a shared embedding model for the whole team. With sensible defaults baked in, the minimal configuration is just this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;toolhive.stacklok.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EmbeddingServer&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;optimizer-embedding&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operator defaults to &lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; as the model and runs the &lt;a href="https://github.com/huggingface/text-embeddings-inference" rel="noopener noreferrer"&gt;HuggingFace Text Embeddings Inference&lt;/a&gt; server. You can increase the replica count via &lt;code&gt;spec.replicas&lt;/code&gt; to match your team's throughput needs. One shared instance serves every vMCP in the namespace. For all available configuration options, see the &lt;a href="https://docs.stacklok.com/toolhive/guides-vmcp/optimizer" rel="noopener noreferrer"&gt;Optimizer docs&lt;/a&gt;.&lt;/p&gt;
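&lt;p&gt;Scaling up is then a one-field change. A sketch, assuming the same resource as above (only &lt;code&gt;spec.replicas&lt;/code&gt; is shown; see the Optimizer docs for the remaining fields):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
  name: optimizer-embedding
spec:
  replicas: 2  # scale embedding inference to match your team's load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;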

&lt;h3&gt;
  
  
  Step 2: Reference it from your VirtualMCPServer
&lt;/h3&gt;

&lt;p&gt;Add a single field to your existing &lt;code&gt;VirtualMCPServer&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;embeddingServerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;optimizer-embedding&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the only change. When the operator sees &lt;code&gt;embeddingServerRef&lt;/code&gt; without an explicit &lt;code&gt;optimizer&lt;/code&gt; config block, it auto-populates the optimizer with sensible defaults and resolves the embedding server URL automatically. You don't need any manual wiring.&lt;/p&gt;
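&lt;p&gt;In context, the resource looks something like this; the name is illustrative, and the commented-out portion stands in for whatever your existing spec already contains:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: team-vmcp  # illustrative name
spec:
  # ...your existing vMCP configuration (backends, auth, etc.)...
  embeddingServerRef:
    name: optimizer-embedding  # must match the EmbeddingServer above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;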

&lt;p&gt;For finer control — tuning search parameters, timeouts, and more — see the &lt;a href="https://docs.stacklok.com/toolhive/guides-vmcp/optimizer" rel="noopener noreferrer"&gt;Optimizer docs&lt;/a&gt; for the full reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost savings add up
&lt;/h2&gt;

&lt;p&gt;The per-request savings are compelling on their own, but they compound quickly when you multiply across a team: every team member, every request, every day. At typical API pricing, those savings add up fast. Fewer tokens also mean faster responses across your organization.&lt;/p&gt;

&lt;p&gt;Beyond the raw savings, the Kubernetes approach gives you operational advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitOps-friendly&lt;/strong&gt;. &lt;code&gt;EmbeddingServer&lt;/code&gt; and &lt;code&gt;VirtualMCPServer&lt;/code&gt; configurations live in Git, get reviewed in PRs, and deploy through your existing CI/CD pipeline. That gives you full change history and rollback for compliance requirements.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One shared embedding server&lt;/strong&gt;. Instead of every machine running a local embedding model, one instance serves the whole team. Less resource waste, consistent behavior.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero end-user setup&lt;/strong&gt;. Users point their MCP client at the vMCP endpoint. The Optimizer is transparent; they don't need to know it's there.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized security boundary&lt;/strong&gt;. All tool discovery flows through one place, giving you a single point to audit and control which tools your team can access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;Here's everything referenced above and some extra resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimizer docs:&lt;/strong&gt; &lt;a href="https://docs.stacklok.com/toolhive/guides-vmcp/optimizer" rel="noopener noreferrer"&gt;Configuration guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vMCP blog post:&lt;/strong&gt; &lt;a href="https://stacklok.com/blog/introducing-virtual-mcp-server-unified-gateway-for-multi-mcp-workflows/" rel="noopener noreferrer"&gt;Introducing Virtual MCP Server: a unified gateway for multi-MCP workflows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vMCP docs:&lt;/strong&gt; &lt;a href="https://docs.stacklok.com/toolhive/guides-vmcp" rel="noopener noreferrer"&gt;Virtual MCP Server configuration guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quickstart example:&lt;/strong&gt; &lt;a href="https://github.com/stacklok/toolhive/blob/main/examples/operator/virtual-mcps/vmcp_optimizer_quickstart.yaml" rel="noopener noreferrer"&gt;vmcp_optimizer_quickstart.yaml&lt;/a&gt;: deploys several MCP backends with a fully auto-configured optimizer
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All options example:&lt;/strong&gt; &lt;a href="https://github.com/stacklok/toolhive/blob/main/examples/operator/virtual-mcps/vmcp_optimizer_all_options.yaml" rel="noopener noreferrer"&gt;vmcp_optimizer_all_options.yaml&lt;/a&gt;: every tuning knob exposed
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Original Optimizer blog:&lt;/strong&gt; &lt;a href="https://stacklok.com/blog/cut-token-waste-from-your-ai-workflow-with-the-toolhive-mcp-optimizer/" rel="noopener noreferrer"&gt;Cut Token Waste from Your AI Workflow&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ToolHive GitHub:&lt;/strong&gt; &lt;a href="https://github.com/stacklok/toolhive" rel="noopener noreferrer"&gt;github.com/stacklok/toolhive&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Want to see what Stacklok can do for your organization? &lt;a href="https://stacklok.com/contact" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; or get started right away with &lt;a href="https://github.com/stacklok/toolhive" rel="noopener noreferrer"&gt;ToolHive&lt;/a&gt;, our open source project. Join the conversation and engage directly with our team on &lt;a href="https://discord.gg/stacklok" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>stacklok</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Build your first enterprise MCP server with GitHub Copilot</title>
      <dc:creator>Alejandro Ponce de León</dc:creator>
      <pubDate>Mon, 02 Feb 2026 08:53:54 +0000</pubDate>
      <link>https://dev.to/stacklok/build-your-first-enterprise-mcp-server-with-github-copilot-4gll</link>
      <guid>https://dev.to/stacklok/build-your-first-enterprise-mcp-server-with-github-copilot-4gll</guid>
      <description>&lt;p&gt;&lt;em&gt;Ever wondered how to bridge the gap between your company's private knowledge and AI assistants? You're about to vibecode your way there.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What all the fuss with MCP is about
&lt;/h2&gt;

&lt;p&gt;Back in November 2022, the world changed when OpenAI launched ChatGPT. It wasn't the first Large Language Model (LLM), but it was the most capable at the time, and most importantly, it was available for everyone to explore. To make a small analogy: it got to the moon first. LLMs sparked everyone's imagination and forever changed the way we work. Maybe that's a little far-fetched, but they definitely boosted productivity across many areas.&lt;/p&gt;

&lt;p&gt;Yet LLMs weren't (and still aren't) all-mighty. They've been trained on vast amounts of internet content, but they have two critical limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;They're not trained on private content.&lt;/strong&gt; No company wikis, internal docs, or how-tos.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They have a knowledge cutoff.&lt;/strong&gt; Their training stops at a fixed date, usually months in the past.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So if you ask ChatGPT something like "How was feature X designed in product Y, and how can I integrate it with my new feature Z?", it will have no idea what you're talking about. Most probably it never had access to the implementation details, since they fall under an organization's private content. And even if it did, the model is frozen in time; it doesn’t know what’s changed in the world since its training cutoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP to the rescue
&lt;/h2&gt;

&lt;p&gt;Fortunately, both problems can be solved with &lt;a href="https://huggingface.co/docs/smolagents/en/tutorials/tools" rel="noopener noreferrer"&gt;&lt;strong&gt;tools&lt;/strong&gt;&lt;/a&gt;. Tools empower LLMs with capabilities beyond their training. To solve the two issues above, we can create tools that tell the LLM: "When you're asked about product X at company Y, use tool Z to get the most up-to-date information." That tool might, for example, search an internal knowledge base.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;has quickly become the standard for tool calling.&lt;/strong&gt; Modern AI systems have two essential parts: the client (VS Code, Cursor, ChatGPT, Claude Code, etc.) and the model itself. Tools live on the client side. When the model doesn't know something, it calls a tool that the client executes. Originally introduced by Anthropic, MCP’s open design and community adoption have made it the clear industry standard, now supported by &lt;a href="https://techcrunch.com/2025/03/26/openai-adopts-rival-anthropics-standard-for-connecting-ai-models-to-data/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;, &lt;a href="https://techcrunch.com/2025/04/09/google-says-itll-embrace-anthropics-standard-for-connecting-ai-models-to-data/" rel="noopener noreferrer"&gt;Google&lt;/a&gt;, Microsoft, and others. That means you can write an MCP server once and use it with your favorite clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building your first MCP, the AI scrappy way
&lt;/h2&gt;

&lt;p&gt;Let's say your boss just tasked you with connecting your AI assistants to the corporate Confluence wiki. This is a perfect use case for MCP; you need to expose enterprise knowledge to AI tools in a standardized way.&lt;/p&gt;

&lt;p&gt;For this tutorial, we'll assume you already have a querying system in place, whether that's a Retrieval Augmented Generation (RAG) pipeline, a search API, or another knowledge retrieval mechanism. Our job is to wrap that existing system with an MCP server so AI assistants can access it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our approach: vibecoding
&lt;/h3&gt;

&lt;p&gt;We're going to build this MCP server using what &lt;a href="https://x.com/karpathy/status/1886192184808149383?lang=en" rel="noopener noreferrer"&gt;Andrej Karpathy&lt;/a&gt; half-jokingly dubbed "&lt;strong&gt;vibecoding&lt;/strong&gt;": letting LLMs write most, if not all, of the code. The term spread like wildfire because, well, it works surprisingly well for certain tasks. It's not a silver bullet, but it's perfect for handling boilerplate and getting something functional quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingredients
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.13+
&lt;/li&gt;
&lt;li&gt;VS Code with Copilot
&lt;/li&gt;
&lt;li&gt;uv for package management&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Copilot?
&lt;/h3&gt;

&lt;p&gt;While tools like Cursor, Codex, Windsurf, and Claude Code have gained wide popularity for their deep AI integration, GitHub Copilot remains the most widely available option for enterprise developers. It’s often already included in Microsoft or GitHub contracts, making it simple to deploy without extra approvals. We’ll use Copilot here because it’s what most teams already have, and it gets the job done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The initial prompt
&lt;/h4&gt;

&lt;p&gt;Getting started with AI-assisted development is all about setting clear expectations. Here's the first prompt I used to kick off the project. Being specific about tooling and goals helps guide the AI toward the implementation you actually want. After this initial prompt, we should have the scaffolding of the project and most of the implementation ready.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This is a new project called enterprise-mcp. It is a Python project using 3.13 or greater. The project is meant to be an MCP server that will access enterprise knowledge and make it available to LLMs. The project should:
- Use uv as package manager
- For adding packages use `uv add &amp;lt;package_name&amp;gt;`
- All configuration should be centralized in pyproject.toml file
- Use uv dependency groups when adding development dependencies like pytest, e.g. `uv add pytest --dev` or `uv add --group dev pytest`
- I would also like a Taskfile to centralize running commands, like `task format`, `task test`, or `task typecheck`
- Use `ruff` for linting and formatting
- Use `ty` for typechecking https://docs.astral.sh/ty/
- Use `async` functions wherever possible and `asyncio.gather` when parallelizing multiple tasks
- Use the official Python MCP SDK: https://github.com/modelcontextprotocol/python-sdk
- For now, make a single tool called search_enterprise_knowledge. Make sure the tool has appropriate descriptions that are descriptive enough for LLM usage
- Make the implementation with tests. I don't care so much about unit tests but about testing the overall functionality of the application
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Where the AI got confused
&lt;/h4&gt;

&lt;p&gt;Even with a detailed prompt, the first pass required some corrections. Still, the very first prompt produced a working implementation, which is impressive in itself. Two main issues emerged, both likely related to the AI's knowledge cutoff:&lt;/p&gt;

&lt;h5&gt;
  
  
  1. Misunderstanding the MCP SDK
&lt;/h5&gt;

&lt;p&gt;Instead of using the official Python SDK, Copilot attempted to semi-reimplement the MCP protocol from scratch, creating custom &lt;code&gt;list_tools&lt;/code&gt; and &lt;code&gt;call_tool&lt;/code&gt; endpoints. Since the MCP SDK is fairly recent, it wasn't in the training data, and crucially, the AI didn't check the documentation before implementing.&lt;/p&gt;
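&lt;p&gt;For contrast, correct use of the official SDK takes only a few lines. This is a minimal sketch based on the SDK's &lt;code&gt;FastMCP&lt;/code&gt; helper; treat it as an approximation and check the SDK documentation for the current API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from mcp.server.fastmcp import FastMCP

mcp = FastMCP("enterprise-mcp")


@mcp.tool()
async def search_enterprise_knowledge(query: str):
    """Search the enterprise knowledge base for up-to-date internal docs."""
    # Call your existing RAG pipeline or search API here (stubbed out).
    ...


if __name__ == "__main__":
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The decorator registers the function as an MCP tool and derives its input schema from the type hints, which is why descriptive docstrings and annotations matter so much here.&lt;/p&gt;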

&lt;h5&gt;
  
  
  2. Using Mypy instead of Ty
&lt;/h5&gt;

&lt;p&gt;Similar story here. The AI defaulted to the more established Mypy rather than looking up the newer Ty package I'd specified.&lt;/p&gt;

&lt;h4&gt;
  
  
  Manual refinements
&lt;/h4&gt;

&lt;p&gt;Beyond fixing the AI's mistakes, I made some personal preference edits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure of pyproject.toml.&lt;/strong&gt; To this day, no coding assistant has nailed my pyproject.toml preferences on the first try (it may well be a me problem rather than an AI problem). I referenced configurations from past projects I liked and adapted them here.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Taskfile.yml adjustments.&lt;/strong&gt; Same deal with the Taskfile.yml. That said, the AI got me 80-90% of the way there, which is pretty remarkable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Iterating with prompt #2
&lt;/h4&gt;

&lt;p&gt;After the initial implementation and manual edits, a few minor improvements remained. Rather than handle them myself, I asked Copilot to finish the job, since it would certainly be faster than doing it by hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I have made some changes in my server.py to correctly use the Python SDK. I want you to:
1. Transform my server to a streamable HTTP server.
2. Add a comprehensive docstring for my handle_search method so that it's usable by LLMs whenever enterprise knowledge is needed.
Check the documentation of the Python SDK to know how to correctly transform the server to streamable HTTP: https://github.com/modelcontextprotocol/python-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Closing the loop
&lt;/h4&gt;

&lt;p&gt;The final step is updating project memory: the context file that helps future AI sessions (and human developers) understand your project quickly. This usually lives in &lt;code&gt;AGENTS.md&lt;/code&gt; or &lt;code&gt;CLAUDE.md&lt;/code&gt; at the project root, and most coding assistants recognize either. It's a good place to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document the project structure, so the agent knows where to implement a new feature or fix a bug
&lt;/li&gt;
&lt;li&gt;Outline the project's best practices
&lt;/li&gt;
&lt;li&gt;Give instructions that can be repeated across runs, e.g., always run unit tests along with code linting
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Perfect, 3 final tasks after some manual modifications:
1. Make sure my commands `task format`, `task typecheck` and `task test` work and return without errors
2. Update the file AGENTS.md with relevant context information for coding agents. Take into account the best practices signaled at the beginning, like centralizing everything in pyproject.toml and using `task ..` commands to run relevant project commands. The code formatting and tests commands should be used every time a coding task is finished. Read the repo again for any other relevant information
3. Finally, update a README.md with a summary of the project and the development process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key lessons
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Tools are not API endpoints
&lt;/h4&gt;

&lt;p&gt;This is crucial to understand when building MCP servers: an MCP tool is fundamentally different from an API endpoint, even though it's tempting to map them one-to-one.&lt;/p&gt;

&lt;p&gt;API endpoints are designed as small, atomic, reusable operations. They're the building blocks you compose together: one endpoint to fetch user data, another to update preferences, another to send notifications. Each is focused and modular, meant to serve multiple use cases across your application.&lt;/p&gt;

&lt;p&gt;MCP tools, by contrast, are meant to accomplish complete deterministic workflows or actions. Think of an API as giving you a toolbox of small buttons, each doing one thing, that you wire together. An MCP tool is a single big button that says "do the thing." It handles an entire task from start to finish.&lt;/p&gt;

&lt;p&gt;For example, instead of separate tools for "search documents," "filter by date," and "format results," you'd create one &lt;code&gt;search_enterprise_knowledge&lt;/code&gt; tool that handles the full workflow of finding, filtering, and returning relevant information in one shot.&lt;/p&gt;
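&lt;p&gt;A plain-Python sketch of that "one big button" idea, with hypothetical helper names and stub data standing in for a real retrieval backend:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# All function names and documents here are illustrative stubs.
def search_documents(query):
    """Fine-grained step 1: raw search over a stub corpus."""
    corpus = [
        {"title": "API Authentication Guide", "source": "confluence", "body": "OAuth 2.0 setup"},
        {"title": "Random snippet", "source": "gist", "body": "OAuth snippets"},
    ]
    return [d for d in corpus if query.lower() in d["body"].lower()]


def filter_by_source(docs, source):
    """Fine-grained step 2: keep only trusted sources."""
    return [d for d in docs if d["source"] == source]


def format_results(docs):
    """Fine-grained step 3: render the hits for the LLM."""
    return "\n".join("- {} ({})".format(d["title"], d["source"]) for d in docs)


def search_enterprise_knowledge(query):
    """The single MCP-style tool: the whole workflow in one call."""
    return format_results(filter_by_source(search_documents(query), "confluence"))


print(search_enterprise_knowledge("oauth"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A REST design would expose the three small functions as separate endpoints; the MCP tool exposes only the last one.&lt;/p&gt;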

&lt;h4&gt;
  
  
  You're still accountable
&lt;/h4&gt;

&lt;p&gt;Whatever code the AI produces, you own it. If it breaks in production, you can't blame Copilot or Claude. Humans remain accountable for the code we ship.&lt;/p&gt;

&lt;p&gt;This means you should always review what gets generated. Not necessarily line-by-line, but at minimum: understand what it does, verify it follows your standards, and run it through your normal quality checks. A quick sanity check is never wasted time, especially when you're the one who'll be called at 2am to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the MCP server
&lt;/h2&gt;

&lt;p&gt;For this first iteration, it's best to take variables like coding assistants and client configuration files out of the equation. The easiest way to do that is with the &lt;a href="https://modelcontextprotocol.io/docs/tools/inspector" rel="noopener noreferrer"&gt;&lt;strong&gt;MCP Inspector&lt;/strong&gt;&lt;/a&gt;, a tool from Anthropic for inspecting an MCP server and querying it directly. To run the inspector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @modelcontextprotocol/inspector 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example response
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fduk1l72xrtqu7m7bvy28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fduk1l72xrtqu7m7bvy28.png" alt="The MCP Inspector connected to the enterprise knowledge MCP server" width="800" height="346"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Result 1
**Title:** API Documentation - Authentication
**Content:**
# API Authentication Guide
## Overview
Our REST API uses OAuth 2.0 for authentication.
## Getting Started
1. Register your application
2. Obtain client credentials
3. Request access token
4. Include token in requests

## Example

curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://api.company.com/v1/users
Access tokens expire after 1 hour.


**Metadata:**
- author: API Team
- created: 2024-02-01
- last_updated: 2024-10-20
- tags: ['api', 'authentication', 'oauth', 'documentation']
- source: confluence
**URL:** https://company.atlassian.net/wiki/spaces/API/pages/987654321
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;We've successfully built a working MCP server using Copilot and vibecoding, ready to access enterprise knowledge through a standardized protocol!&lt;/p&gt;

&lt;p&gt;By letting GitHub Copilot handle most of the boilerplate code, we created a functional Python MCP server with proper tooling, testing, and documentation, all while maintaining code quality and best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full code repository:&lt;/strong&gt; &lt;a href="https://github.com/aponcedeleonch/enterprise-mcp" rel="noopener noreferrer"&gt;https://github.com/aponcedeleonch/enterprise-mcp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next blog post, we're taking this further by introducing &lt;a href="https://docs.stacklok.com/toolhive" rel="noopener noreferrer"&gt;&lt;strong&gt;ToolHive&lt;/strong&gt;&lt;/a&gt;, a powerful platform that makes deploying and managing MCP servers effortless. ToolHive offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant deployment&lt;/strong&gt; using Docker containers or source packages (Python, TypeScript, or Go)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure by default&lt;/strong&gt; with isolated containers, customizable permissions, and encrypted secrets management
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless integration&lt;/strong&gt; with GitHub Copilot, Cursor, and other popular AI clients
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-ready features,&lt;/strong&gt; including OAuth-based authorization and Kubernetes deployment via the ToolHive Kubernetes Operator
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A curated registry&lt;/strong&gt; of verified MCP servers you can discover and run immediately, or create your own custom registry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned to learn how to evolve our enterprise MCP server from a prototype into a production-ready service!&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>githubcopilot</category>
      <category>ai</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>Stacklok's MCP Optimizer vs Anthropic's Tool Search Tool: A Head-to-Head Comparison</title>
      <dc:creator>Alejandro Ponce de León</dc:creator>
      <pubDate>Wed, 10 Dec 2025 15:36:56 +0000</pubDate>
      <link>https://dev.to/stacklok/stackloks-mcp-optimizer-vs-anthropics-tool-search-tool-a-head-to-head-comparison-2f32</link>
      <guid>https://dev.to/stacklok/stackloks-mcp-optimizer-vs-anthropics-tool-search-tool-a-head-to-head-comparison-2f32</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both solutions tackle the critical problem of token bloat from excessive tool definitions. However, our testing with 2,792 tools reveals a stark performance gap: &lt;strong&gt;Stacklok MCP Optimizer achieves 94% accuracy&lt;/strong&gt; in selecting the right tools, while &lt;strong&gt;Anthropic's Tool Search Tool achieves only 34% accuracy&lt;/strong&gt;. If you're building production AI agents that need reliable tool selection without breaking the bank on tokens, these numbers matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Problem Both Are Solving&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When you connect AI agents to multiple Model Context Protocol (MCP) servers, tool definitions quickly consume massive portions of your context window, often before your actual conversation even begins. &lt;/p&gt;

&lt;p&gt;The reality? Most queries only need a handful of these tools. Loading all of them wastes tokens (read: money) and degrades model performance as the tool count grows.&lt;/p&gt;

&lt;p&gt;Both &lt;a href="https://dev.to/stacklok/cut-token-waste-from-your-ai-workflow-with-the-toolhive-mcp-optimizer-3oo6"&gt;Stacklok MCP Optimizer&lt;/a&gt; (launched October 28, 2025) and &lt;a href="https://www.anthropic.com/engineering/advanced-tool-use" rel="noopener noreferrer"&gt;Anthropic's Tool Search Tool&lt;/a&gt; (launched November 20, 2025 as part of their advanced tool use beta) address this by loading a single search tool that finds and loads only the necessary tools on demand.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters: Real Benefits and Trade-offs&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Upside&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Token savings are substantial.&lt;/strong&gt; We've observed up to 80% reductions in input tokens. In their internal testing, Anthropic reports their approach preserves 191,300 tokens of context compared to loading all tools upfront, an 85% reduction. In rate-limited enterprise environments, this translates directly to &lt;a href="https://docs.stacklok.com/toolhive/tutorials/mcp-optimizer" rel="noopener noreferrer"&gt;cost savings and faster response times&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improved model performance.&lt;/strong&gt; Reducing token overhead doesn't just save money, it can improve model accuracy. Anthropic's internal testing showed substantial improvements with Tool Search Tool enabled: Opus 4 jumped from 49% to 74%, and Opus 4.5 improved from 79.5% to 88.1% on MCP evaluations. However, it's important to note that Anthropic's experiments and datasets are not publicly available, making direct comparisons challenging.&lt;/p&gt;

&lt;p&gt;Our own testing with MCP Optimizer across different model tiers revealed an interesting pattern: while state-of-the-art models like Claude Sonnet 4 maintained strong performance when benchmarking tool selection accuracy (94.6% → 93.4%), mid-tier and smaller models showed significant improvements. Gemini 2.5 Flash increased from 83.2% to 92.4%, and the gpt-oss-20B model nearly doubled its accuracy from 38% to 69.4%. This suggests that efficient tool loading particularly benefits models with tighter context constraints, making MCP Optimizer valuable across different deployment scenarios, from resource-constrained edge deployments to cost-optimized production systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Downside&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Risk of tool retrieval failure.&lt;/strong&gt; The benefits above assume the search tool successfully finds the right tool. When it doesn't, your task fails or produces unexpected behavior. While the agent can retry searches, this introduces latency and still consumes tokens. The critical question becomes: &lt;em&gt;How often does the search actually work in practice?&lt;/em&gt; This is precisely what our head-to-head comparison measures.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Each Approach Works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both solutions introduce a lightweight search tool, but their algorithms differ significantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stacklok MCP Optimizer&lt;/strong&gt;: Combines semantic search with BM25 for hybrid tool discovery
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic Tool Search Tool&lt;/strong&gt;: Offers two variants, BM25-only or regex-based pattern matching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The algorithmic difference has profound implications for real-world performance, as our testing reveals.&lt;/p&gt;
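&lt;p&gt;To make that difference concrete, here is a minimal, self-contained sketch of hybrid ranking: BM25 keyword scores blended with a similarity score. The tool names and descriptions are hypothetical, and the bag-of-words cosine is only a stand-in for a real embedding model; this illustrates the technique, not the Optimizer's actual implementation.&lt;/p&gt;

```python
import math
from collections import Counter

# Hypothetical tool catalog: name -> description
TOOLS = {
    "create_pull_request": "open a new pull request between two branches",
    "channels_list": "list all channels in a slack workspace",
    "send_email": "send an email message to a recipient",
}

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25 keyword score for each doc against the query."""
    tokenized = {name: tokenize(d) for name, d in docs.items()}
    avgdl = sum(len(t) for t in tokenized.values()) / len(tokenized)
    n = len(tokenized)
    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for t in tokenized.values() if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(toks) / avgdl))
        scores[name] = score
    return scores

def semantic_scores(query, docs):
    """Stand-in for embedding similarity: cosine over bags of words.
    A real system would use a sentence-embedding model here."""
    q = Counter(tokenize(query))
    scores = {}
    for name, d in docs.items():
        v = Counter(tokenize(d))
        dot = sum(q[t] * v[t] for t in q)
        norm = math.sqrt(sum(c * c for c in q.values())) * math.sqrt(sum(c * c for c in v.values()))
        scores[name] = dot / norm if norm else 0.0
    return scores

def hybrid_rank(query, docs, alpha=0.5):
    """Blend normalized BM25 with semantic similarity, best match first."""
    bm25 = bm25_scores(query, docs)
    sem = semantic_scores(query, docs)
    max_bm = max(bm25.values()) or 1.0
    blended = {n: alpha * (bm25[n] / max_bm) + (1 - alpha) * sem[n] for n in docs}
    return sorted(docs, key=lambda n: blended[n], reverse=True)

print(hybrid_rank("show me all channels in my slack workspace", TOOLS))
```

&lt;p&gt;The &lt;code&gt;alpha&lt;/code&gt; weight controls the keyword/semantic trade-off: BM25 catches exact identifiers like &lt;code&gt;channels_list&lt;/code&gt;, while the semantic component handles paraphrased queries that share no keywords with the tool description.&lt;/p&gt;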

&lt;h2&gt;
  
  
  &lt;strong&gt;The Head-to-Head Comparison&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We conducted a comprehensive evaluation to answer the question: &lt;em&gt;Which approach is more effective?&lt;/em&gt; (&lt;a href="https://github.com/StacklokLabs/mcp-optimizer/pull/148" rel="noopener noreferrer"&gt;Source code and full results&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Test Methodology&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Loaded 2,792 tools from various MCP servers using the &lt;a href="https://github.com/xfey/MCP-Zero?tab=readme-ov-file#dataset-mcp-tools" rel="noopener noreferrer"&gt;MCP-tools dataset&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For each tool, generated a synthetic query using an LLM that would naturally require that specific tool

&lt;ul&gt;
&lt;li&gt;Example: For GitHub's &lt;code&gt;create_pull_request&lt;/code&gt; tool → Generated query: "Create a pull request from feature-branch to main branch in the octocat/Hello-World repository on GitHub"
&lt;/li&gt;
&lt;li&gt;Example: Slack's &lt;code&gt;channels_list&lt;/code&gt; tool → Generated query: "Show me all channels in my Slack workspace"
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Used Claude Sonnet 4.5 to test whether each approach could correctly search and select the original tool that generated the query

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval Accuracy&lt;/strong&gt;: Does the correct tool appear anywhere in the search results returned by the search tool?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selection Accuracy&lt;/strong&gt;: Is the correct tool actually selected by the model for use?
&lt;/li&gt;
&lt;li&gt;This direct mapping lets us objectively measure retrieval accuracy: we know the ground truth for every query. In the examples above, the correct tools would be GitHub's &lt;code&gt;create_pull_request&lt;/code&gt; and Slack's &lt;code&gt;channels_list&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
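&lt;p&gt;Because every synthetic query maps to exactly one ground-truth tool, both metrics reduce to simple counting over per-query records. The sketch below uses hypothetical records and field names to show how the two metrics differ; it mirrors the definitions above, not our actual harness.&lt;/p&gt;

```python
# Hypothetical per-query evaluation records: the ground-truth tool,
# the tools the search returned, and the tool the model actually called.
results = [
    {"expected": "create_pull_request",
     "retrieved": ["create_pull_request", "merge_pull_request"],
     "selected": "create_pull_request"},
    {"expected": "channels_list",
     "retrieved": ["channels_list", "users_list"],
     "selected": "channels_list"},
    {"expected": "send_email",
     "retrieved": ["create_draft"],   # search missed the right tool
     "selected": "create_draft"},
]

def accuracies(records):
    """Retrieval: ground truth appears anywhere in the search results.
    Selection: the model actually called the ground-truth tool."""
    retrieval = sum(r["expected"] in r["retrieved"] for r in records) / len(records)
    selection = sum(r["expected"] == r["selected"] for r in records) / len(records)
    return retrieval, selection

retrieval, selection = accuracies(results)
print(f"retrieval={retrieval:.2%} selection={selection:.2%}")
```

&lt;p&gt;Note that selection accuracy is bounded by retrieval accuracy: as in the third record, a tool the search never surfaced can never be selected.&lt;/p&gt;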

&lt;h3&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnshjekx84tc9k5yps2a6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnshjekx84tc9k5yps2a6.png" alt="Accuracy comparison chart showing MCP Optimizer at 93.95% selection accuracy and 98.03% retrieval accuracy versus Tool Search Tool at 33.70%/47.85% (BM25) and 30.01%/39.00% (regex)" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The stark difference in selection accuracy between approaches primarily reflects retrieval effectiveness rather than model performance. Since all approaches used the same model (Claude Sonnet 4.5) for tool selection, the 94% vs 34% accuracy gap stems from MCP Optimizer's superior retrieval accuracy (98% vs 48%). Put simply: if the correct tool doesn't appear in the search results, even the best model cannot select it. MCP Optimizer's hybrid semantic + BM25 search successfully surfaces the correct tool in 98% of cases, giving the model the opportunity to make the right selection. In contrast, Tool Search Tool's lower retrieval rates mean the model often never sees the correct tool among its options.&lt;/p&gt;

&lt;p&gt;These results align with independent testing from other organizations. &lt;a href="https://blog.arcade.dev/anthropic-tool-search-4000-tools-test" rel="noopener noreferrer"&gt;Arcade reported&lt;/a&gt; that Anthropic's Tool Search achieved only 56% retrieval accuracy with regex and 64% with BM25 across 4,027 tools.&lt;/p&gt;

&lt;h4&gt;
  
  
  Runtime Performance Characteristics
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Average execution time&lt;/th&gt;
&lt;th&gt;Average tools retrieved&lt;/th&gt;
&lt;th&gt;Average input tokens*&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Optimizer&lt;/td&gt;
&lt;td&gt;5.75 seconds&lt;/td&gt;
&lt;td&gt;5.2&lt;/td&gt;
&lt;td&gt;3296&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Search Tool (BM25)&lt;/td&gt;
&lt;td&gt;12.05 seconds&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;2823&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Search Tool (regex)&lt;/td&gt;
&lt;td&gt;13.55 seconds&lt;/td&gt;
&lt;td&gt;5.2&lt;/td&gt;
&lt;td&gt;3679&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;* &lt;em&gt;Average Input Tokens: The total number of tokens sent to the model per request, including system prompt, tool definitions, and user query.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Beyond accuracy, the operational characteristics of each approach reveal important trade-offs. Tool Search Tool (BM25) achieves the lowest token consumption at 2,823 tokens per request, which likely stems from retrieving slightly fewer tools on average (5.0 vs 5.2). However, MCP Optimizer's token count of 3,296 still represents substantial savings compared to attempting to load all 2,792 tools upfront, which would require 206,073 tokens and cause an error due to context window limitations.&lt;/p&gt;
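&lt;p&gt;The headline saving follows directly from those two numbers:&lt;/p&gt;

```python
# Input-token reduction implied by the figures above
all_tools_tokens = 206_073   # all 2,792 tool definitions loaded upfront
optimizer_tokens = 3_296     # average per-request tokens with MCP Optimizer

savings = 1 - optimizer_tokens / all_tools_tokens
print(f"input-token reduction: {savings:.1%}")  # input-token reduction: 98.4%
```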

&lt;p&gt;The execution time differences are noteworthy: MCP Optimizer completes searches in 5.75 seconds on average, while Tool Search Tool takes 12.05 (BM25) and 13.55 seconds (regex). However, this comparison requires context. MCP Optimizer was executed locally in our test environment, while Tool Search Tool operates as an internal Anthropic service with unknown infrastructure requirements and potential network latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Means
&lt;/h3&gt;

&lt;p&gt;The numbers tell a clear story: &lt;strong&gt;MCP Optimizer consistently finds the correct tool 94% of the time&lt;/strong&gt;, while Tool Search Tool's accuracy hovers around 30-34% in environments with thousands of tools. For production systems where reliability and performance matter, this gap is significant.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Verdict&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Anthropic's Tool Search Tool correctly identifies a real problem facing production AI deployments. The concept of on-demand tool loading is sound, and the token savings are genuine. However, &lt;strong&gt;the current implementation isn't production-ready&lt;/strong&gt; for environments with large tool catalogs. Limited to Claude Sonnet 4.5 and Opus 4.5, it remains a proprietary solution exclusive to Anthropic's ecosystem.&lt;/p&gt;

&lt;p&gt;MCP Optimizer, on the other hand, delivers on the promise: reliable tool selection (94% accuracy) combined with significant token savings. Built into the ToolHive runtime as a free and open-source solution, it seamlessly integrates with all major &lt;a href="https://docs.stacklok.com/toolhive/reference/client-compatibility" rel="noopener noreferrer"&gt;AI clients&lt;/a&gt; including Claude Code, GitHub Copilot, Cursor, and others, providing vendor flexibility and broader compatibility across different AI platforms. For teams building AI agents that need to work consistently across hundreds or thousands of tools, this performance difference and deployment flexibility are critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Looking Forward&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The future of AI agents depends on solving context window constraints without sacrificing reliability, and that means tool selection systems must work dependably at scale. MCP Optimizer proves that hybrid semantic + keyword search can deliver both token efficiency and production-grade accuracy. As Anthropic's Tool Search Tool matures beyond beta, we hope to see similar reliability gains.&lt;/p&gt;

&lt;p&gt;For now, if you're deploying AI agents in production and need dependable tool selection across extensive tool catalogs, the data points to MCP Optimizer as the more reliable choice.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Interested in learning more about MCP Optimizer? Check out the &lt;a href="https://docs.stacklok.com/toolhive/tutorials/mcp-optimizer" rel="noopener noreferrer"&gt;ToolHive documentation&lt;/a&gt; or visit &lt;a href="https://stacklok.com/" rel="noopener noreferrer"&gt;stacklok.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>anthropic</category>
      <category>stacklok</category>
    </item>
  </channel>
</rss>
