Om Shree

Posted on • Originally published at glama.ai

Running Efficient MCP Servers in Production: Metrics, Patterns & Pitfalls

The Model Context Protocol (MCP) is emerging as a foundational interface for new types of AI-driven interactions, where intelligent agents act as a new kind of user, or persona. Just as businesses meticulously design web UIs and REST APIs for human users and third-party integrations, they now have the opportunity to design the perfect "agentic experience" for these autonomous entities. The primary objective is to maximize the task completion rate, which is the ability for an MCP client with its underlying model to successfully complete a user-given task. This article explores the challenges and best practices for achieving this efficiency, with a focus on measurement, design patterns, and operational pitfalls.

Measuring Agentic Experience Quality

Measuring an MCP server's efficiency is a complex undertaking. The ideal metric, task completion rate, is often not feasible to track directly in a production environment. This is due to two primary challenges:

  1. Limited Observability: As an MCP server developer, you only observe the requests coming into your server. You do not have full visibility into the agent's entire conversation with the user or the internal workings of the client application and its large language model (LLM). You receive tool and resource requests, but the broader conversational context from which they originated remains opaque.
  2. Disparity in Models and Clients: The performance of an agent is highly dependent on the model and the client application it uses. Tool selection accuracy, a crucial factor in task completion, varies drastically from one model to another. The Berkeley Function Calling Leaderboard (BFCL) [1] has shown a wide disparity in model performance on this front [2]. Similarly, different MCP clients may not support the full spectrum of features offered by a server, further complicating a uniform measure of success.

Given these limitations, a more practical approach is to rely on proxy metrics that provide a qualitative sense of the agentic experience. Two key proxy metrics that can be measured in production are cost and latency.

  • Cost: The number of tokens the MCP server returns to the model. Lower token consumption is better, as it reduces the model's context window footprint and increases the likelihood of task completion before the context is exhausted.
  • Latency: The number of round trips required between the MCP client and the server to complete a task. Fewer successive calls are better, as each interaction adds another chance of failure or of the model diverging from the intended path.


By focusing on these measurable proxies, developers can iterate on their server design to improve efficiency.
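As a rough illustration of how these two proxies could be tracked server-side, here is a minimal TypeScript sketch that wraps a tool handler and records round trips and an approximate token count per session. The session bookkeeping, the helper names, and the four-characters-per-token heuristic are assumptions for illustration, not part of the MCP specification or the talk.

```typescript
// Rough server-side proxies for "cost" (tokens returned) and "latency"
// (round trips per session). Names and the chars/4 token heuristic are
// illustrative assumptions, not part of the MCP spec.

type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

interface SessionStats {
  calls: number;          // proxy for latency: round trips in this session
  approxTokens: number;   // proxy for cost: tokens streamed back to the model
}

const stats = new Map<string, SessionStats>();

// Very rough heuristic: roughly 4 characters per token for English text/JSON.
const approxTokenCount = (text: string) => Math.ceil(text.length / 4);

// Wrap any tool handler so each invocation updates the session's proxies.
function instrument(sessionId: string, handler: ToolHandler): ToolHandler {
  return async (args) => {
    const result = await handler(args);
    const s = stats.get(sessionId) ?? { calls: 0, approxTokens: 0 };
    s.calls += 1;
    s.approxTokens += approxTokenCount(result);
    stats.set(sessionId, s);
    return result;
  };
}

// Example usage with a stubbed tool.
const getOrderStatus: ToolHandler = async ({ orderId }) =>
  JSON.stringify({ orderId, status: "shipped" });

const wrapped = instrument("session-123", getOrderStatus);
wrapped({ orderId: "ord_42" }).then(() => console.log(stats.get("session-123")));
```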

Impacting Server Design: Three Actionable Domains

There are three main areas where developers can impact the efficiency of their MCP server: the tool list, tool responses, and notifications.


1. Tool List

The number and structure of tools exposed by a server have a significant impact on tool selection accuracy and token consumption. The more tools a model has to choose from, the more likely it is to select the wrong one. Benchmarks show that tool selection accuracy degrades as the tool count grows, following a roughly logarithmic curve. This leads to a common anti-pattern: the one-to-one API wrapper, which exposes a new MCP tool for every single API endpoint. This approach quickly inflates the number of tools, leading to a dramatic drop in the task completion rate.

A more effective alternative is to favor a polymorphic design, which exposes fewer tools with more parameters. This design approach is exemplified by Block's Layered Tool Pattern, which reduced their entire Square API platform to just three conceptual tools: a discovery tool to explore available services, a planning tool to understand method signatures, and an execution tool to make the final API request [4, 5]. This pattern guides the agent through a structured, multi-step reasoning process, improving reliability.
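To make the shape of this pattern concrete, here is a minimal sketch using the official TypeScript MCP SDK. The tool names mirror those described for Block's pattern, but the in-memory catalog, the services, and the handler bodies are illustrative stand-ins, not Block's actual implementation.

```typescript
// Sketch of a layered tool surface: discovery -> planning -> execution.
// The catalog and handler logic are invented for illustration.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical in-memory catalog standing in for a real API description.
const catalog: Record<string, Record<string, unknown>> = {
  payments: { createPayment: { params: { amountCents: "number", currency: "string" } } },
  customers: { getCustomer: { params: { customerId: "string" } } },
};

const server = new McpServer({ name: "layered-demo", version: "0.1.0" });

// Layer 1: discovery -- what services exist?
server.tool(
  "get_service_info",
  "List the available services on this server.",
  {},
  async () => ({
    content: [{ type: "text", text: JSON.stringify(Object.keys(catalog)) }],
  })
);

// Layer 2: planning -- what does a specific method look like?
server.tool(
  "get_type_info",
  "Describe the parameters of one method on one service.",
  { service: z.string(), method: z.string() },
  async ({ service, method }) => ({
    content: [
      { type: "text", text: JSON.stringify(catalog[service]?.[method] ?? "unknown method") },
    ],
  })
);

// Layer 3: execution -- make the actual call.
server.tool(
  "make_api_request",
  "Invoke a service method with JSON-encoded arguments.",
  { service: z.string(), method: z.string(), args: z.string() },
  async ({ service, method, args }) => ({
    content: [{ type: "text", text: `called ${service}.${method} with ${args}` }],
  })
);

await server.connect(new StdioServerTransport());
```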

Instead of thinking of tools as a one-to-one mapping to an existing API, developers should think of them as packaged agent stories, encapsulating a complete unit of work or a common user workflow. A great example of this is GitHub's MCP server [3], which bundles multiple CLI commands into a single push_files tool to handle a complete file-pushing task [4].
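The sketch below shows what such a "packaged agent story" tool might look like: one tool that covers a whole workflow (branching, committing several files, and pushing) rather than three endpoint wrappers. It is loosely inspired by the push_files idea, but the parameter names and handler body are hypothetical and do not reproduce GitHub's actual schema.

```typescript
// One "agent story" tool covering a complete workflow in a single call.
// Parameter names and the stubbed handler are illustrative only.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "story-demo", version: "0.1.0" });

server.tool(
  "push_files",
  "Create the branch if needed, commit the given files, and push, in one step.",
  {
    repo: z.string(),
    branch: z.string(),
    message: z.string(),
    files: z.array(z.object({ path: z.string(), content: z.string() })),
  },
  async ({ repo, branch, message, files }) => {
    // Stub: a real server would call its VCS backend here.
    const summary = `pushed ${files.length} file(s) to ${repo}@${branch}: "${message}"`;
    return { content: [{ type: "text", text: summary }] };
  }
);
```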

2. Tool Responses

The way an MCP server formats its response payload back to the client also has a direct impact on efficiency.

  • Stripping Useless Attributes: Many APIs return bloated JSON responses full of redundant or irrelevant information. For example, a Stripe API response may include a plethora of details about an object when an agent only needs one or two key values. Reducing the quantity of information returned to the model's context window is a simple yet powerful way to decrease token consumption and improve task completion. The general principle is to return as little information as possible: only what is strictly necessary for the LLM to complete its job. This may even involve returning plain text instead of structured JSON for certain tasks.
  • Leveraging Error Messages: Unlike traditional applications where errors are often a dead end, an agent can leverage a well-designed error message to self-correct and proceed. If a model calls tool B before tool A, the server can return an error that explicitly suggests calling tool A first. Similarly, a validation error for a date parameter can include the current date, helping the model correct its mistake without user intervention. By providing meaningful, actionable error messages, developers can remove nondeterministic behavior and improve the agent's ability to recover from failures. A sketch covering both ideas follows this list.
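The following TypeScript sketch illustrates both points: projecting a bloated upstream payload down to the few fields the agent needs, and returning a validation error that tells the model how to fix its call. The upstream shape and field names are invented for illustration.

```typescript
// (1) Strip a bloated upstream payload before it reaches the model's
// context window; (2) make errors actionable so the agent can self-correct.
// The UpstreamCharge shape is a made-up stand-in for a verbose API object.

interface UpstreamCharge {
  id: string;
  amount: number;
  currency: string;
  status: string;
  metadata: Record<string, string>; // ...plus many more fields the agent never uses
  livemode: boolean;
}

// (1) Return only what the LLM needs, as plain text rather than full JSON.
function toToolResponse(charge: UpstreamCharge): string {
  return `Charge ${charge.id}: ${charge.amount} ${charge.currency}, status ${charge.status}.`;
}

// (2) A validation error that includes the current date, so the model can
// correct a malformed date parameter without asking the user.
function validateDateParam(value: string): { ok: true } | { ok: false; error: string } {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    const today = new Date().toISOString().slice(0, 10);
    return {
      ok: false,
      error: `Invalid "date": expected YYYY-MM-DD. For reference, today is ${today}.`,
    };
  }
  return { ok: true };
}

// Example usage.
console.log(validateDateParam("next tuesday")); // actionable error the agent can act on
```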

3. Notifications

The MCP standard includes a tool list change notification feature, but this should be used with caution. Most model providers use caching to reduce cost, and this caching relies on a stable tool list. Changing the list of tools mid-session can invalidate the cache, increasing the cost for the client and reducing overall efficiency. It is generally advisable to avoid changing the tool list during a session to ensure a consistent and cost-effective experience.
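One way to honor this in practice, sketched below with the TypeScript MCP SDK, is to register every tool up front and handle "not available right now" inside the handler rather than adding or removing tools mid-session. The export_report tool and the isFeatureEnabled check are hypothetical.

```typescript
// Keep the tool list stable for the whole session: register everything at
// startup and report unavailable functionality from inside the handler,
// rather than mutating the tool list and emitting list-changed notifications.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const isFeatureEnabled = (feature: string): boolean => feature !== "exports"; // stub

const server = new McpServer({ name: "stable-tools-demo", version: "0.1.0" });

server.tool(
  "export_report",
  "Export a report as CSV. Explains what to do if exports are disabled.",
  { reportId: z.string() },
  async ({ reportId }) => {
    if (!isFeatureEnabled("exports")) {
      // An actionable message lets the agent recover without a changed tool list.
      return {
        content: [
          {
            type: "text",
            text: "Exports are disabled for this account. Ask the user to enable exports in settings, then retry.",
          },
        ],
      };
    }
    return { content: [{ type: "text", text: `Exported report ${reportId} as CSV.` }] };
  }
);
```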

Behind the Scenes: MCP Logic & Design Trade-offs

The core of an MCP server’s efficiency lies in its ability to manage the delicate balance between the breadth of functionality it offers and the cognitive load it places on the consuming model. The logarithmic decrease in tool selection accuracy as tool count increases is a direct reflection of the LLM’s struggle to parse and choose from a large list of descriptions and signatures within its limited context window.

Consider the contrast between the API wrapper and the Layered Tool Pattern:

| Feature | API Wrapper One-to-One | Layered Tool Pattern |
| --- | --- | --- |
| Tool Count | High, scales with API endpoints | Low, typically 3 tools |
| Tool Use Accuracy | Decreases logarithmically | More stable, linear decrease with parameters |
| Token Consumption | Linear increase with tool count | Linear increase with parameters |
| Agent Behavior | Relies on single-step function calling | Guides agent through multi-step reasoning |
| Real-world Example | A server with 200 tools, each for a different API endpoint | Block's Square platform with three tools [5] |


The Layered Tool Pattern works because it aligns with a core principle of agentic behavior: self-discovery. By giving the agent tools to discover what’s possible (get_service_info), understand how to use it (get_type_info), and then execute (make_api_request), the server shifts the burden of planning and reasoning from a monolithic tool list to a guided, multi-step process. This improves the agent's reliability and reduces the chances of it "hallucinating" or attempting to use a non-existent tool.
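Seen from the client side, the same flow is three successive tool calls. The sketch below uses the TypeScript MCP SDK client to show the sequence; the transport command, service names, and arguments are illustrative, and in practice the agent itself decides when to make each call.

```typescript
// The layered flow from a client's perspective: discover, plan, execute.
// Server command and arguments are placeholders for illustration.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "layered-flow-demo", version: "0.1.0" });
await client.connect(
  new StdioClientTransport({ command: "node", args: ["layered-server.js"] })
);

// Step 1: discovery -- what can this server do?
const services = await client.callTool({ name: "get_service_info", arguments: {} });

// Step 2: planning -- how do I call the method I care about?
const signature = await client.callTool({
  name: "get_type_info",
  arguments: { service: "payments", method: "createPayment" },
});

// Step 3: execution -- make the request with validated arguments.
const result = await client.callTool({
  name: "make_api_request",
  arguments: {
    service: "payments",
    method: "createPayment",
    args: JSON.stringify({ amountCents: 1999, currency: "USD" }),
  },
});

console.log(services, signature, result);
```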

Furthermore, the optimization of response payloads is a direct play on the LLM's context window. By stripping away redundant information, the server ensures that the most relevant data is available for the agent to use, leaving more space for the conversation history and the agent's own reasoning. This increases the probability of task completion, as the agent is not cluttered with irrelevant data.

My Thoughts

The talk provided a clear, practical framework for approaching MCP server design from an efficiency standpoint. The focus on task completion rate as a north-star metric, even if measured through proxies, provides a valuable shift in mindset from traditional API development. The comparison between the API wrapper and Layered Tool patterns is particularly insightful, offering a concrete alternative to a common pitfall.

One area for future exploration is the role of prompts and resources. While the talk briefly mentioned them as ways to give back control to the user, a deeper dive into their architectural implications would be valuable. The use of a prompt as a way to "prime" a conversation or a resource to provide up-front, structured data could further reduce the need for multiple tool calls, thereby improving latency and cost.

Ultimately, the core message is that designing for AI agents is fundamentally different from designing for humans or traditional software integrations. It requires an understanding of how LLMs reason, how they consume context, and how to design APIs that complement their strengths rather than exposing their weaknesses.

Acknowledgements

I would like to extend my gratitude to Frédéric Barthelet (CTO & Co-founder, Alpic) for his insightful presentation, "Running Efficient MCP Servers in Production: Metrics, Patterns & Pitfalls," which provided the foundation for this article. I am also grateful to the broader MCP and AI communities for their continuous work in advancing the field.

References

  1. Berkeley Function Calling Leaderboard (BFCL) V4
  2. The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models
  3. GitHub MCP Server
  4. A practical guide on how to use the GitHub MCP server
  5. Build MCP Tools Like Ogres... With Layers
