Guy for AWS Heroes

MCP Tool Design: Why Your AI Agent Is Failing (And How to Fix It)

The Reports of MCP's Death Have Been Greatly Exaggerated

Scroll through developer forums in early 2026, and you'll find a recurring theme: MCP is dead. The takes range from dismissive ("just a fad") to resigned ("we tried it, our agents kept failing"). And the frustrations behind them are real. Teams are building MCP servers with 50+ tools, watching their agents stumble through tool selection, and concluding that the protocol itself is broken.

It isn't. MCP isn't dead; it's being used poorly. And the evidence for how to use it well is now overwhelming.

Over the past year, teams at GitHub, Block, and dozens of smaller shops have converged on the same set of principles. GitHub Copilot cut its tool count from 40 to 13 and saw measurable benchmark improvements. Block rebuilt its Linear MCP server three times, going from 30+ tools to just 2. The pattern is consistent: fewer tools, better descriptions, outcome-oriented design. The problem isn't the protocol. It's tool design.

This article lays out the framework. We'll start with the Capability Triangle, the mental model that makes everything else click, then walk through the anatomy of a well-designed tool. Subsequent articles in this series cover the quantitative evidence, description quality, and the anti-patterns that cause most failures.

What Is MCP? (The 30-Second Version)

The Model Context Protocol (MCP) is an open protocol that connects AI models to external tools and data sources. The simplest way to think about it: websites and mobile apps are the interface between humans and online services. MCP is the interface between AI and those same services. Over decades, we've invested heavily in improving human interfaces, including the iPhone's gesture language, years of UX research, accessibility standards, and usability testing. AI needs the same investment in its interface to online services. MCP is that interface, and tool design is its UX discipline.

One of our clients came to us with exactly this gap. They wanted AI agents to operate their web forms: filling in fields, clicking buttons, navigating multi-step workflows through a browser. They asked us to run tests evaluating how well browser-based agents could complete their online forms, and to help "fix" the forms for agent compatibility. We explained that this was significant effort in the wrong direction. Their web forms were designed for humans, with visual layout, hover states, and drag-and-drop interactions. Instead, we showed them that adding an MCP server to the same API sitting behind those forms gave AI agents a native interface purpose-built for how they work: structured inputs, clear descriptions, typed responses. The agents went from struggling with form fields to completing tasks reliably. The lesson: don't retrofit human interfaces for AI. Build AI-native interfaces alongside them: MCP servers for your internal and external services.

The parallels between UX design and MCP tool design run deep. Decades of UX research have produced principles that transfer directly:

  • Affordance: a door handle should look pullable. This maps to tool names and parameter descriptions: if a field is named id but requires a UUID, the affordance is broken.

  • Recognition over recall: it's easier to pick from a list than to type from memory. This maps to using enums and example values in schemas so the LLM recognizes valid inputs instead of guessing.

  • Visibility of system status: users need feedback when something goes wrong. This maps to error messages that explain what happened and how to fix it, rather than a cryptic "invalid input."

These aren't metaphors. They're the same design discipline applied to a different kind of user.

The Capability Triangle: Three Parties, One Tool

Even if you've been building MCP servers for months, don't skip this section. The Capability Triangle reframes tool design around a party that most MCP discussions ignore entirely: the domain expert. Every MCP tool sits at the intersection of three parties, each with distinct strengths and weaknesses. Understanding this balance is the foundation of good tool design.

The LLM (MCP Client)

The large language model (LLM) is the reasoning engine inside each MCP client, such as ChatGPT, Claude Desktop, or a custom agent. It brings language understanding, reasoning, and tool-calling intelligence. It's good at interpreting ambiguous user requests ("where's my package?"), choosing between available tools, composing multi-step plans, and recovering gracefully from errors.

What it's bad at: domain knowledge and symbolic computation. An LLM doesn't know which API capabilities matter for your specific users, and it can't access your databases. It doesn't know that your customer support team needs order tracking but never touches inventory management. It doesn't know your compliance requirements or your business rules.

The MCP Server

The server provides symbolic computation, data access, and validated operations. It's good at precise calculations, database queries, API calls with proper authentication, input validation, and returning structured results. It runs deterministically, and it is more predictable and easier to validate than LLM reasoning.

What it's bad at: understanding user intent. A server can't interpret "check if we have enough widgets for the Johnson order" without a tool specifically designed for that workflow. It doesn't adapt to ambiguity. It does exactly what it's told, nothing more.

The Human (Domain Expert and Server Designer)

This is the party that's most often overlooked, and it's the one that matters most. The human can be the developer, the product manager, the domain expert who designs the MCP server. They bring knowledge that neither the LLM nor the server possesses. They know which 20% of an API serves 80% of their users' actual requests. They understand the user personas and their existing processes. They know the business context.

What they're bad at: being present at runtime. The domain expert's knowledge has to be encoded into the tool's name, description, schema, and error messages. Every design choice is a message to the LLM about how to use the tool.

But "not present at runtime" doesn't mean "design it and walk away." Tool design is iterative. Your first design is a hypothesis about what your users need, and like any hypothesis, it needs validation. Usage logs tell you which tools are called, which fail, which are never used, and which requests produce no tool match at all. The domain expert reviews these logs and refines: renaming tools that confuse the LLM, improving descriptions that lead to wrong selections, adding tools for workflows that users need but the initial design missed.

This iterative loop is where MCP shines compared to direct API integration. Changing a tool's name, description, or input schema is a server-side change: no client updates, no SDK version bumps, no breaking changes propagated to consumers. The MCP protocol decouples tool discovery from tool invocation, so the LLM rediscovers the improved schema on the next connection. This makes the feedback cycle fast: observe failures, update the tool design, deploy, and measure again. Teams that treat tool design as a one-time exercise miss the biggest advantage of having MCP in the middle. Invest the effort to get the initial design right ("you never get a second chance to make a first impression"), but keep monitoring the server's usage logs and adjust to the patterns of real users.
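The log review described above can start as a simple summary: calls and failures per tool, plus registered tools that are never called. Here's a minimal pure-Rust sketch; the ToolCall record, its fields, and the tool names are illustrative assumptions, not part of the PMCP SDK:

```rust
use std::collections::HashMap;

/// Hypothetical usage-log record: which tool was called and whether it succeeded.
struct ToolCall<'a> {
    tool: &'a str,
    ok: bool,
}

/// Per-tool (calls, failures), plus registered tools that never appear in the log.
fn summarize<'a>(
    registered: &[&'a str],
    log: &[ToolCall<'a>],
) -> (HashMap<&'a str, (u32, u32)>, Vec<&'a str>) {
    let mut stats: HashMap<&str, (u32, u32)> = HashMap::new();
    for call in log {
        let entry = stats.entry(call.tool).or_insert((0, 0));
        entry.0 += 1; // total calls
        if !call.ok {
            entry.1 += 1; // failures
        }
    }
    let unused = registered
        .iter()
        .copied()
        .filter(|t| !stats.contains_key(t))
        .collect();
    (stats, unused)
}

fn main() {
    let registered = ["check_inventory", "track_latest_order", "list_warehouses"];
    let log = [
        ToolCall { tool: "check_inventory", ok: true },
        ToolCall { tool: "check_inventory", ok: false },
        ToolCall { tool: "track_latest_order", ok: true },
    ];
    let (stats, unused) = summarize(&registered, &log);
    println!("per-tool (calls, failures): {stats:?}");
    println!("never called: {unused:?}");
}
```

A tool with a high failure ratio needs a better description or schema; a tool that's never called is a candidate for renaming or removal.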

Why the Triangle Matters

Each party compensates for the others' weaknesses. The LLM handles ambiguity that the server can't. The server provides precision that the LLM can't. The human provides domain context that neither possesses at runtime.

This has a practical consequence that trips up most teams: the same API should produce different MCP servers for different users.

Consider the London Transit API. A daily commuter wants trip planning: "fastest route from Paddington to Canary Wharf avoiding the Jubilee line." An event organizer wants logistics: "how many bus routes serve Wembley Stadium, and what's the last departure after a 10 PM concert?" A municipal planner wants construction impact analysis: "if we close three stations on the Northern line for six weeks, which bus routes need capacity increases?"

Same API. Three completely different MCP servers. Three different sets of tools, with different names, different descriptions, and different response shapes, because the domain expert for each server knows their users.

Here's the key insight: when you ask an LLM to auto-wrap an API, it lacks this domain context. It can't know which 20% matters because it doesn't know who the user is. Auto-generated MCP servers produce generic tool sets that serve no one well. The domain expert's judgment, which is encoded into tool selection, naming, and descriptions, is what makes an MCP server effective.

How do you know your triangle is balanced? Measure task completion across the specific requests your users actually make, and not only the three test cases you tried during development. If completion is low, one vertex of the triangle is weak. Either the LLM can't understand your tools (fix descriptions), the server can't handle the requests (add or redesign tools), or the domain expert chose the wrong tools to expose (talk to your users).

Tool Anatomy: What Makes an MCP Tool

An MCP tool has six components: a name, a description, an input schema, an output schema, a handler, and error handling. Each one is a communication channel to the LLM, and each one matters.

Here's a complete tool in Rust using the PMCP SDK. Don't worry if you're not fluent in Rust; the comments walk through every important line:

// -- Dependencies --
// pmcp: the PMCP SDK for building MCP servers
// serde: serialization/deserialization (parses JSON input, formats JSON output)
// schemars: generates JSON Schema from Rust types (so the LLM knows what to send)
use pmcp::server::typed_tool::TypedToolWithOutput;
use pmcp::RequestHandlerExtra;
use serde::{Deserialize, Serialize};
use schemars::JsonSchema;

// -- Input Schema --
// This struct defines what the LLM must send. Each field becomes a property
// in the JSON Schema that the LLM sees when it discovers this tool.
// The doc comments (///) become the schema descriptions automatically.
//
// Annotations on each field define constraints that flow into the
// JSON Schema. The LLM sees "maxLength": 16 on the SKU field and
// "minimum": 1 on quantity BEFORE it calls the tool. A well-behaved
// client respects these; the server enforces them at runtime too.
// deny_unknown_fields rejects any extra fields the LLM might add.
#[derive(Debug, Deserialize, JsonSchema)]
#[schemars(deny_unknown_fields)]
struct CheckInventoryInput {
    /// Product SKU to look up (e.g., "WIDGET-42", "BOLT-7")
    #[schemars(length(max = 16))]
    sku: String,

    /// Number of items needed. Defaults to 1 if not specified.
    /// Use this to check whether a specific quantity is available
    /// before quoting delivery dates.
    #[serde(default = "default_quantity")]
    #[schemars(range(min = 1, max = 10000))]
    quantity_needed: u32,
}

// Default value: if the LLM doesn't specify a quantity, assume 1
fn default_quantity() -> u32 { 1 }

// -- Output Schema --
// Defining the output shape serves two purposes:
// 1. The LLM knows exactly what fields to expect in the response
// 2. Downstream tools or MCP Apps can rely on this structure
#[derive(Debug, Serialize, JsonSchema)]
struct InventoryResult {
    /// The product SKU that was checked
    sku: String,
    /// Whether the requested quantity is currently in stock
    in_stock: bool,
    /// Total quantity available in warehouse
    available: u32,
    /// Whether the requested quantity can be fulfilled
    sufficient: bool,
}

// -- Register the tool with the server --
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let server = pmcp::ServerBuilder::new()
        .name("inventory-server")
        .version("1.0.0")
        // Register the tool: name, handler with typed input and output,
        // plus a description that tells the LLM WHAT it does, WHEN to
        // use it, and what it RETURNS.
        .tool(
            "check_inventory",
            TypedToolWithOutput::new(
                "check_inventory",
                |input: CheckInventoryInput, _extra: RequestHandlerExtra| {
                    Box::pin(async move {
                        // In production, this queries your inventory database.
                        // Here we return a mock response for clarity.
                        let available = 847_u32;
                        Ok(InventoryResult {
                            sku: input.sku,
                            in_stock: available > 0,
                            available,
                            sufficient: available >= input.quantity_needed,
                        })
                    })
                },
            )
            .with_description(
                "Check inventory levels for a product by SKU. Returns stock \
                 status, available quantity, and whether the requested amount \
                 can be fulfilled. Use this before quoting delivery dates \
                 to customers."
            ),
        )
        .build()?;

    // Start the server over Streamable HTTP -- the production transport.
    // This makes your server accessible to any MCP client over the network:
    // Claude Desktop, ChatGPT, custom agents, or browser-based tools.
    // Unlike stdio (which requires local installation), HTTP lets
    // non-technical users connect without touching a terminal.
    server.run_streamable_http("0.0.0.0:3000").await?;
    Ok(())
}

While we're using Rust and the PMCP SDK throughout this series, the design principles (typed schemas, descriptive names, structured output) apply to any MCP-compliant server, whether TypeScript, Python, or anything else that speaks the protocol. These are protocol-level concerns, not language-level ones.

Let's walk through each component.

Name ("check_inventory"): The name follows a verb_noun pattern. It's unambiguous, and the LLM won't confuse this with a tool for updating inventory or listing products. Avoid generic names like get_data or process_request. The name is the LLM's first signal about what a tool does.

Description: This is the LLM's primary decision surface. Notice it does three things: it says what the tool does ("check inventory levels"), what it returns ("stock status, available quantity, and whether the requested amount can be fulfilled"), and when to use it ("before quoting delivery dates to customers"). The returns clause helps the LLM judge whether the tool can answer the user's request. The when-to-use clause is critical: it encodes workflow context that the description's author, the domain expert, knows and the LLM doesn't.

Input schema: The CheckInventoryInput struct defines what the LLM must send. Each field has a type (the LLM can't accidentally pass a string where a number is expected), a doc comment that becomes the JSON Schema description (the LLM sees "Product SKU to look up" when it discovers the tool), and optional defaults (quantity_needed defaults to 1 if omitted). The #[schemars(...)] annotations are the single source of truth for constraints: length(max = 16) on the SKU field generates "maxLength": 16 in the JSON Schema, and range(min = 1, max = 10000) on quantity generates "minimum": 1, "maximum": 10000. The LLM sees these rules when it discovers the tool, before it ever makes a call. And #[schemars(deny_unknown_fields)] on the struct means the LLM can't sneak in extra fields, as anything outside sku and quantity_needed is rejected.

Output schema: The InventoryResult struct defines what the tool returns. This is optional in the MCP spec, but we strongly recommend it. A defined output schema serves two purposes: the LLM knows exactly what fields to expect (it won't hallucinate response fields that don't exist), and downstream consumers, whether another tool in a chain or an MCP App rendering a UI widget, can rely on the structure. The sufficient field is a good example: it does the comparison server-side rather than asking the LLM to compare available against quantity_needed and risk getting it wrong.

Handler: The async closure that does the actual work. In this example, it returns a mock response for clarity. In production, this would query your inventory database, call a warehouse API, or perform whatever computation the tool promises. Notice that the handler receives a typed CheckInventoryInput and not raw JSON. The parsing already happened. Your handler code focuses on business logic, not input validation. This is the server's contribution to the Capability Triangle: reliable, deterministic execution.

Validation: Notice that constraints are declared once, on the struct fields, using #[schemars(...)] annotations. The same annotation serves two purposes: it generates the JSON Schema that the LLM reads at discovery time, and it defines the contract the server enforces at runtime. No duplication between schema and validation logic, where the struct is the single source of truth.

Security in MCP servers works in layers, and schema constraints are one of the easiest layers to add. First, serde enforces type safety: sku must be a string, quantity_needed must be an unsigned integer, and type-level attacks are blocked at deserialization before your code runs. Second, #[schemars(length(max = 16))] constrains input shape: it won't prevent SQL injection on its own (that's the job of parameterized queries and safe query construction in your database layer), but it does reject obviously malformed or abusive input early, before it reaches any downstream system. Real SKUs are short; a 200-character string is either a mistake or a probe, and there's no reason to let it through. Third, deny_unknown_fields prevents unexpected fields from slipping past the schema entirely. Each layer is simple, but together they reduce the attack surface significantly. The deeper security story, such as parameterized queries, OAuth 2.1, Rust's memory safety guarantees, and the OWASP MCP threat model, gets its own article later in this series.

Error handling: If the LLM sends input that doesn't match CheckInventoryInput, such as passing "sku": 42 instead of "sku": "WIDGET-42", serde produces an error message explaining the type mismatch. If the SKU exceeds 16 characters, the schema constraint rejects it before the handler runs. For business logic errors inside the handler, use pmcp::Error::validation() with actionable messages following a three-part template: what went wrong, what was expected, and an example of correct input. Good error messages suggest one or two specific fixes, since multiple options force the LLM to guess, and guessing wastes tokens and user patience.

Notice this isn't a local development tool. This is a server designed for a specific user, who needs to quote delivery dates. The domain expert decided that inventory checks matter for their users, and they encoded that context into the description and the output shape. The sufficient field exists because the domain expert knows that customers ask "do you have enough?" not "how many do you have?" A different domain expert building for a warehouse manager might expose entirely different tools from the same inventory system.

Can an LLM discover this tool and call it correctly on the first try? If not, your name or description needs work. That's the simplest measurement of tool design quality, and it's one you can test in five minutes with any MCP client.

Outcomes, Not Operations

The domain expert in the Capability Triangle knows something the LLM never will: what outcome the user actually wants. When a customer asks "where's my order?", they don't want a customer ID, then a list of order IDs, then a status lookup. They want a tracking link and an ETA. The difference between those two experiences is the difference between operation-oriented and outcome-oriented tool design.

Here's the anti-pattern. A team with a REST background wraps their existing endpoints as MCP tools:

  • get_customer_by_email(email) returns a customer_id
  • list_customer_orders(customer_id) returns an array of order_id values
  • get_order_status(order_id) returns a status string

To answer "where's my order?", the LLM must chain all three calls in the correct sequence. The costs compound at every step:

  • More tokens. The LLM processes the full response from each tool call and generates the next call. Three round trips means three times the input and output tokens, which is cost that the user pays for without getting any additional value.
  • More latency. Each step requires a network round trip to the MCP server plus LLM processing time to interpret the result and formulate the next call. What could be a sub-second single call becomes a multi-second chain.
  • Growing risk of misstep. The probability of a correct sequence is the product of each step's success rate. If each tool call has a 95% chance of correct execution, three chained calls drop to 85.7%. At five steps, you're at 77.4%. The LLM must remember variable names and values from earlier calls, handle edge cases at each step, and maintain coherence across the full chain. Each step is another opportunity for the model to hallucinate a parameter, misinterpret a response, or lose track of its plan.

|             | Operation-Oriented (REST style)            | Outcome-Oriented (MCP style)    |
| ----------- | ------------------------------------------ | ------------------------------- |
| Tool count  | High (1 per endpoint)                      | Low (1 per user goal)           |
| LLM effort  | High (choreographing multi-step chains)    | Low (single-shot invocation)    |
| Token cost  | High (processing every intermediate result)| Low (one request, one response) |
| Latency     | High (N round trips + N LLM inferences)    | Low (single round trip)         |
| Reliability | Low (3+ compounding points of failure)     | High (deterministic server logic) |
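The compounding-reliability arithmetic above is easy to verify: the success rate of a chain is the product of the per-step success rates. A quick sketch:

```rust
/// Probability that an entire chain of tool calls succeeds, given a
/// per-step success rate. Each step multiplies in another chance to fail.
fn chain_success(per_step: f64, steps: u32) -> f64 {
    per_step.powi(steps as i32)
}

fn main() {
    // 95% per-step reliability degrades quickly as the chain grows.
    for steps in [1, 3, 5] {
        println!("{steps} step(s): {:.1}%", chain_success(0.95, steps) * 100.0);
    }
}
```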

Now consider the outcome-oriented alternative:

// -- Input: just the customer's email --
#[derive(Debug, Deserialize, JsonSchema)]
struct TrackOrderInput {
    /// Customer email address (e.g., "alice@company.com")
    #[schemars(length(max = 254))]
    email: String,
}

// -- Status enum: the LLM sees valid values in the schema --
// Instead of a free-form string, an enum lets the LLM "recognize"
// valid statuses rather than "recall" them from memory.
#[derive(Debug, Serialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
enum OrderStatus {
    Processing,
    Shipped,
    InTransit,
    Delivered,
}

// -- Output: everything the LLM needs to answer the question --
#[derive(Debug, Serialize, JsonSchema)]
struct OrderTrackingResult {
    /// Customer name for the greeting
    customer: String,
    /// Order identifier
    order_id: String,
    /// Current order status
    status: OrderStatus,
    /// Shipping carrier name
    carrier: String,
    /// Estimated delivery date (ISO 8601)
    eta: String,
    /// Direct tracking URL the customer can click
    tracking_url: String,
}

The tool registration follows the same pattern as check_inventory:

.tool(
    "track_latest_order",
    TypedToolWithOutput::new(
        "track_latest_order",
        |input: TrackOrderInput, _extra: RequestHandlerExtra| {
            Box::pin(async move {
                // Internally: resolve customer, find latest order, get status.
                // The server handles the entire chain -- three API calls
                // collapsed into one deterministic operation.
                Ok(OrderTrackingResult {
                    customer: "Alice Chen".into(),
                    order_id: "ORD-8834".into(),
                    status: OrderStatus::InTransit,
                    carrier: "FedEx".into(),
                    eta: "2026-03-20".into(),
                    tracking_url: "https://fedex.com/track/ABC123".into(),
                })
            })
        },
    )
    .with_description(
        "Track the most recent order for a customer using their email. \
         Returns order status, carrier info, and tracking link. Use this \
         when a customer asks 'where is my order?' or 'when will it arrive?'"
    ),
)

One tool. One user outcome. The output struct gives the LLM a rich, typed response, with customer name, status, carrier, ETA, and a clickable tracking URL, which is everything it needs to answer the question in a single turn. The server handles the chaining internally (resolve customer, find latest order, fetch status) because that's what servers are good at: deterministic, multi-step computation. In a production environment, your server handles requests from users who don't know MCP exists and don't care about your API structure. They just want answers. In the Capability Triangle, symbolic computation and data access are the server's strengths. Let the server do the work it's built for, and let the LLM do what it's built for: understanding the user's intent and presenting a clear answer.

This isn't a theoretical pattern. Block built 60+ production MCP servers. Their Linear integration started with 30+ tools mirroring GraphQL endpoints, with one tool per query, one tool per mutation. After three iterations, they were down to 2 tools. The tool count dropped because the team learned to design for outcomes. Each iteration moved complexity from the LLM (which had to choreograph multi-tool sequences) into the server (which could handle the orchestration deterministically).

Measurement point: Test this yourself. Give 10 users the same task ("find my latest order status"). With the 3-tool REST mapping, measure how many succeed on the first try. Now try the single outcome-oriented tool. The difference in task completion rate is your design quality signal.

Less Is More: The Evidence for Tool Reduction

Outcome-oriented design naturally reduces tool count. But how much does reduction actually matter? The research is unambiguous.

GitHub reduced their Copilot MCP integration from 40 built-in tools to 13 core tools. The result: a 2 to 5 percentage point improvement across the SWE-Lancer and SWE-bench Verified benchmarks, plus a 400ms latency reduction. Fewer tools meant the model spent less time on tool selection and more time on the actual task. The gains came not from adding capability, but from removing it.

The Speakeasy team ran a controlled experiment using a Pet Store API. At 107 tools, both large and small models failed completely, and task success collapsed. At 20 tools, large models scored 19 out of 20 correct. At 10 tools, performance was perfect. The failure wasn't gradual. It was a cliff: past a threshold, models don't degrade gracefully. They fall off.

Why does success collapse rather than degrade? Two mechanisms compound. First, context window bloat: every tool name, description, and parameter schema consumes tokens on every request. At 50+ tools, this can eat 5 to 7 percent of the model's context before a single user message arrives, thus crowding out conversation history, document content, and reasoning space. Second, and more insidious, is tool hallucination: when the LLM's attention is spread across too many similar-sounding tools, it starts inventing tool names that don't exist, conflating parameters between tools, or calling the right tool with arguments from a different tool's schema. This is the same "instruction following degradation" that causes LLMs to drift off-task in long prompts, except here, each hallucinated tool call is a hard failure, not a soft one. The model doesn't produce a slightly wrong answer. It produces no answer at all.
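The context-bloat figure is back-of-the-envelope arithmetic. A sketch with assumed numbers (roughly 150 tokens per tool definition and a 128k-token window; your real numbers will vary by model and schema size):

```rust
/// Percentage of the context window consumed by tool definitions alone,
/// before any user message arrives.
fn schema_overhead_pct(tools: u32, tokens_per_tool: u32, context_window: u32) -> f64 {
    (tools * tokens_per_tool) as f64 / context_window as f64 * 100.0
}

fn main() {
    // Assumed: ~150 tokens per tool (name + description + parameter schema).
    let pct = schema_overhead_pct(50, 150, 128_000);
    println!("50 tools consume ~{pct:.1}% of a 128k context on every request");
}
```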

In UX terms, this is information overload. Just as a human can't choose from a menu of 100 items without decision fatigue, an LLM's attention fragments across too many similar-sounding options. The threshold varies by model size. Small models (8B parameters) hit their sweet spot around 19 tools and fail at 46. Even the largest models struggle past 100 tools.

As Hugging Face's Phil Schmid puts it: "Curate ruthlessly. 5 to 15 tools per server. One server, one job."

This raises an obvious question: if you expose only 10 to 15 tools, aren't you leaving functionality on the table? Yes, deliberately, and that's the right choice. We'll see why shortly, when we look at how much of an API your users actually need.

Measurement point: Count your tools. If you have more than 15 per server, you're likely past the diminishing returns threshold. Benchmark your task completion rate before and after pruning, and the numbers will make the case for you.

The 97% Problem: Tool Description Quality

You can have the right number of tools, designed for the right outcomes, and still fail. A 2025 study analyzing MCP tool descriptions across the ecosystem found that 97.1% contain at least one quality issue. More than half (56%) have unclear purpose statements. Your tools might be well-designed, but if the LLM can't understand when to use them, that design is invisible.

Tool descriptions are not documentation. They are the LLM's primary decision surface. When the LLM sees 15 tools and must choose one, the description is the only signal it has. A vague description is like a restaurant menu that says "food" for every dish, which is technically accurate, practically useless.

The research identified six components of a quality tool description: Purpose (what the tool does), Guidelines (when and how to use it), Limitations (what it cannot do or when to use something else), Parameter Explanation (input format and constraints), Length (enough detail without overwhelming), and Examples (concrete usage scenarios). Most descriptions fail on multiple components simultaneously.

Here's what the improvement looks like in practice. Consider a flight search tool across three levels of description quality:

// LEVEL 1 -- Vague (56% of MCP tools have this problem)
.with_description("Search for flights")

// LEVEL 2 -- Better purpose, but missing guidelines and limitations
.with_description("Search for available flights between two airports on a given date")

// LEVEL 3 -- Full rubric: purpose + guidelines + limitations
.with_description(
    "Search for available flights between two airports on a specific date. \
     Returns up to 20 results sorted by price. Use 3-letter IATA airport \
     codes (e.g., 'LAX', 'JFK'). Only searches economy class. For business \
     or first class, use the premium_flight_search tool. Dates must be \
     within the next 330 days."
)

Level 1 tells the LLM nothing about parameters, constraints, or when to use an alternative tool. The LLM has to guess at everything: input format, result shape, scope. Level 2 adds purpose and the LLM knows it needs two airports and a date, but it doesn't know about airport code format, result limits, or class restrictions. It might pass "Los Angeles" instead of "LAX", or ask for business class flights and get wrong results. Level 3 gives the LLM everything it needs to (a) decide to use this tool, (b) provide correct inputs, and (c) know when NOT to use it, that last point being critical for multi-tool servers where the LLM must choose between similar options.

In the same study, augmented descriptions improved task success by 5.85 percentage points in controlled testing. That may sound modest, but at scale it's the difference between a tool that works most of the time and one that works almost all of the time. For a customer-facing agent handling thousands of requests per day, those percentage points represent real users getting real answers.
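To translate percentage points into daily impact, multiply by request volume. A sketch assuming a hypothetical volume of 10,000 requests per day:

```rust
/// Additional successful requests per day from a given improvement
/// in task success rate, expressed in percentage points.
fn extra_successes(daily_requests: u32, improvement_pp: f64) -> f64 {
    daily_requests as f64 * improvement_pp / 100.0
}

fn main() {
    // Assumed volume; 5.85pp is the improvement reported in the study.
    let extra = extra_successes(10_000, 5.85);
    println!("~{extra:.0} additional successful requests per day");
}
```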

Description quality extends to error messages. When a tool receives invalid input, the error message is the LLM's only guide for recovery. Compare these two approaches:

// BAD: LLM tries random fixes
return Err(pmcp::Error::validation("Invalid input"));

// GOOD: Problem + expectation + example
return Err(pmcp::Error::validation(
    "Invalid date format for 'departure': '15/04/2026'. \
     Use ISO 8601 format (YYYY-MM-DD). \
     Example: '2026-04-15'"
));

The first error forces the LLM to guess. It might try a different date format, or remove the date entirely, or change a different parameter. Each wrong guess wastes a round trip and user patience. The second error follows a three-part template: what went wrong ("invalid date format for 'departure'"), what was expected ("ISO 8601 format"), and an example of correct input ("2026-04-15"). Suggest one or two fixes maximum. Multiple options force the LLM to guess, and guessing is what we're trying to eliminate.
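The three-part template is easy to standardize with a small helper. This is a sketch in plain Rust; the `validation_error` function is our own name, not part of the PMCP SDK, and in real code you would wrap its output in `pmcp::Error::validation`:

```rust
/// Formats a validation error using the three-part template:
/// what went wrong, what was expected, and one example of correct input.
fn validation_error(field: &str, got: &str, expected: &str, example: &str) -> String {
    format!(
        "Invalid value for '{field}': '{got}'. Expected {expected}. Example: '{example}'"
    )
}

fn main() {
    let msg = validation_error(
        "departure",
        "15/04/2026",
        "ISO 8601 format (YYYY-MM-DD)",
        "2026-04-15",
    );
    // One problem, one expectation, one example -- no menu of guesses.
    println!("{msg}");
}
```

Funneling every validation failure through one formatter keeps the recovery instructions consistent across tools, so the LLM learns a single error shape instead of a different one per handler.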

This is where the typed struct pattern we saw in the Tool Anatomy section pays off. Remember how CheckInventoryInput used doc comments on each field to generate JSON Schema descriptions? The same pattern applies to every tool. When the domain expert writes /// Product SKU to look up (e.g., "WIDGET-42", "BOLT-7") on a struct field, that text becomes the LLM's guide for formatting its input. The type system enforces correctness at parse time, before the handler code ever runs. And the output schema tells the LLM exactly what fields to expect, so it won't hallucinate response fields that don't exist.

This connects back to the Capability Triangle. The domain expert writes these descriptions. The LLM reads them. The server validates against them. All three parties of the triangle are aligned, and the domain expert's knowledge, encoded at design time, guides the LLM's decisions at runtime and is enforced by the server's type system.

Measurement point: Test your description quality by presenting your tool list to an LLM and asking it to select the right tool for 10 different user requests. If tool selection accuracy is below 90%, your descriptions need work. This test takes five minutes and tells you more about your MCP server's real-world effectiveness than any benchmark.
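The scoring side of that five-minute test is trivial to automate. In this sketch the LLM's picks are hard-coded; in practice you would collect them by prompting your model with the tool list and each request (the function name and data are illustrative):

```rust
/// Computes tool-selection accuracy from (expected_tool, chosen_tool) pairs.
fn selection_accuracy(trials: &[(&str, &str)]) -> f64 {
    let correct = trials.iter().filter(|t| t.0 == t.1).count();
    correct as f64 / trials.len() as f64
}

fn main() {
    // Ten user requests: the tool we expected vs. the tool the LLM chose.
    let trials = [
        ("flight_search", "flight_search"),
        ("flight_search", "flight_search"),
        ("flight_search", "flight_search"),
        ("premium_flight_search", "premium_flight_search"),
        ("premium_flight_search", "premium_flight_search"),
        ("flight_search", "flight_search"),
        ("flight_search", "premium_flight_search"), // wrong pick
        ("premium_flight_search", "flight_search"), // wrong pick
        ("flight_search", "flight_search"),
        ("flight_search", "flight_search"),
    ];
    let acc = selection_accuracy(&trials);
    println!("selection accuracy: {:.0}%", acc * 100.0);
    if acc < 0.90 {
        println!("below 90% -- descriptions need work");
    }
}
```

Confusions between similar tools (here, `flight_search` vs `premium_flight_search`) are exactly the signal to act on: they point at the description pair whose boundaries need sharpening.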

The Full-API Trap (And the Pareto Escape)

We said that exposing only 10 to 15 tools means leaving functionality on the table. Now let's talk about why that's the right call.

The most tempting mistake in MCP server design is wrapping your entire API. You have 200 endpoints, so you generate 200 tools. The OpenAPI-to-MCP converter makes it easy. The result is a server that does everything and succeeds at nothing. The LLM sees 200 tool descriptions, burns through context window space parsing them, and still picks the wrong one, because 200 options is not a menu, it's a phone book.

The deeper problem is semantic noise. When you auto-wrap an API, you inject your backend's implementation details into the LLM's reasoning space. The LLM shouldn't have to understand your database normalization, your internal microservice boundaries, or your pagination cursor format. It should see tools that map cleanly to user intent. Auto-wrapping exposes tools like get_customer_by_internal_id and list_orders_with_cursor_pagination, which are concepts that exist because of how your backend is built, not because of what your users need. Every implementation-detail tool is noise that the LLM must parse, evaluate, and reject before it can find the tool that actually answers the user's question.

The Pareto Escape

The way out is the 80/20 rule. In practice, roughly 20% of an API's capabilities serve 80% of user requests. The domain expert, who is the third vertex of the Capability Triangle, is the person who knows which 20%.

Request Distribution: The 80/20 Rule

The left side of the curve is where your MCP tools live: the high-frequency request types that your users ask for every day. These are the 10 to 15 outcome-oriented tools you design carefully, with typed schemas, quality descriptions, and validation constraints. They handle the bulk of traffic reliably and fast.

The right side is the long tail: rare, unpredictable requests that don't justify a dedicated tool. Creating tools for every edge case pushes you back into the 50+ tool zone where LLM performance collapses. The Pareto line is where you stop adding tools, and start thinking about a different mechanism for everything to its right.
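If you have request logs, finding the Pareto line is a short script. This sketch (endpoint names and counts are made up for illustration) sorts endpoints by traffic and keeps the smallest set that covers a target share of requests:

```rust
/// Returns the smallest set of endpoints, highest-traffic first,
/// whose combined request counts reach the given coverage fraction.
fn pareto_cut(mut counts: Vec<(&str, u64)>, coverage: f64) -> Vec<&str> {
    counts.sort_by(|a, b| b.1.cmp(&a.1)); // highest traffic first
    let total: u64 = counts.iter().map(|(_, c)| c).sum();
    let target = (total as f64 * coverage).ceil() as u64;
    let mut covered = 0;
    let mut keep = Vec::new();
    for (name, c) in counts {
        if covered >= target {
            break;
        }
        covered += c;
        keep.push(name);
    }
    keep
}

fn main() {
    // Hypothetical request counts per API endpoint.
    let counts = vec![
        ("track_order", 500),
        ("search_products", 300),
        ("check_inventory", 120),
        ("update_address", 40),
        ("export_invoices", 25),
        ("merge_accounts", 15),
    ];
    // Candidates for dedicated MCP tools; everything else is long tail.
    let tools = pareto_cut(counts, 0.80);
    println!("endpoints covering 80% of traffic: {tools:?}");
}
```

The output is a starting point, not the final tool list: the surviving endpoints still need to be regrouped into outcome-oriented tools by the domain expert.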

Block has built over 60 production MCP servers, and one practice has held across all of them: generate, then prune. Generate tools from your API spec, then ruthlessly cut. Most teams end up keeping 10 to 15 percent of what they started with. The tools that survive are the ones that map to actual user outcomes, and identifying those outcomes requires the domain expert who knows the users.

Measurement point: Give your MCP server to 20 users representing your target persona. Track task completion rates across their actual requests, not your limited test cases. If completion is below 80%, you're either exposing too many tools (confusion) or the wrong tools (coverage gap). The Capability Triangle tells you which: if the LLM selects wrong tools, fix descriptions. If the right tool doesn't exist, you chose the wrong 20%. If the tool exists but returns unhelpful results, fix the server's implementation.

For the Other 80%: A Preview of Code Mode

So you've curated your tools to the critical 20%. But users will inevitably ask for something outside that set. What then?

This is where code mode enters. Instead of creating a tool for every possible request, you let the LLM write code that calls your API directly. Anthropic's engineering team found that code execution reduced token usage from 150,000 to 2,000 tokens (98.7% reduction) in a Google Drive to Salesforce workflow. The LLM writes a targeted script, executes it, and returns just the result. No 200-tool context window bloat. No multi-step tool chaining. Just precise, one-shot computation.

The playbook: design 10 to 15 outcome-oriented tools for the common 80% of requests. For the long tail, provide code mode access with appropriate guardrails. This gives you broad coverage without the tool count explosion that kills LLM performance. Your curated tools handle the predictable workflows fast. Code mode handles the unpredictable ones flexibly.

We'll cover code mode in depth in a later article in this series: how to set it up, how to secure it, and when it's the right (and wrong) choice. For now, the key insight is that tool reduction isn't about limiting your users. It's about choosing the right mechanism for each type of request.

Key Takeaways

  1. The Capability Triangle drives everything. Good tool design requires balancing what the LLM can do (interpret intent, select tools), what the server should do (precise computation, data access), and what the domain expert knows (which capabilities matter for which users). When one vertex is weak, task completion suffers.

  2. Design for outcomes, not operations. One tool per user goal, not one tool per API endpoint. The customer asking "where's my order?" wants a tracking link, not three chained API calls. Move orchestration complexity into the server, where it runs deterministically.

  3. Less is more. Keep servers to 5 to 15 tools. Evidence from GitHub Copilot, Speakeasy, and Block consistently shows that performance degrades sharply past 20 tools. The failure is not gradual; it's a cliff.

  4. Descriptions are the user interface. Use the six-component rubric: Purpose, Guidelines, Limitations, Parameters, Length, and Examples. With 97% of tool descriptions containing quality issues, this is a big opportunity for immediate improvement.

  5. Error messages are recovery instructions. Use the three-part template: what went wrong, what was expected, and an example of correct input. Suggest one or two fixes, not five. Ambiguity in error messages wastes round trips and user patience.

  6. Know your users. The domain expert vertex of the Capability Triangle determines which 20% of API capability to expose. The same API should produce different MCP servers for different user personas. Auto-wrapping skips this judgment and produces servers that serve no one well.

  7. Measure task completion across diverse user requests. Not your three test cases during development, but real requests from real users representing your target persona. If completion is low, the Capability Triangle tells you which vertex to fix.

Continue the Series

This article covered the foundation: how to design MCP tools that LLMs can actually use. The rest of the series goes deeper.

  • Want to add user-controlled workflows? Read our article on Prompts and Resources, where we cover MCP's underutilized primitives for guided interactions.
  • Ready to test your server? See Testing MCP Servers for unit testing, integration testing, and description quality validation.
  • Concerned about security? MCP Security covers OAuth 2.1, input validation, and the common vulnerabilities that affect 43% of MCP servers.
  • Building from an existing API spec? Schema-Driven MCP Servers shows the generate-then-prune workflow in detail, from OpenAPI spec to curated tool set.
  • Interested in code mode? Code Mode for MCP explores the long-tail strategy we previewed above: how to let the LLM write code safely against your API.

For hands-on practice with these patterns, the Advanced MCP course provides guided exercises building production MCP servers in Rust with the PMCP SDK.
