Guy for AWS Heroes

MCP Tool Design: Why Your AI Agent Is Failing (And How to Fix It)

The Reports of MCP's Death Have Been Greatly Exaggerated

Scroll through developer forums in early 2026, and you'll find a recurring theme: MCP is dead. The takes range from dismissive ("just a fad") to resigned ("we tried it, our agents kept failing"). And the frustrations behind them are real. Teams are building MCP servers with 50+ tools, watching their agents stumble through tool selection, and concluding that the protocol itself is broken.

It isn't. MCP isn't dead; it's being used poorly. And the evidence for how to use it well is now overwhelming.

Over the past year, teams at GitHub, Block, and dozens of smaller shops have converged on the same set of principles. GitHub Copilot cut its tool count from 40 to 13 and saw measurable benchmark improvements. Block rebuilt its Linear MCP server three times, going from 30+ tools to just 2. The pattern is consistent: fewer tools, better descriptions, outcome-oriented design. The problem isn't the protocol. It's tool design.

This article lays out the framework. We'll start with the Capability Triangle, the mental model that makes everything else click, then walk through the anatomy of a well-designed tool. Subsequent articles in this series cover the quantitative evidence, description quality, and the anti-patterns that cause most failures.

What Is MCP? (The 30-Second Version)

The Model Context Protocol (MCP) is an open protocol that connects AI models to external tools and data sources. The simplest way to think about it: websites and mobile apps are the interface between humans and online services. MCP is the interface between AI and those same services. Over decades, we've invested heavily in improving human interfaces, including the iPhone's gesture language, years of UX research, accessibility standards, and usability testing. AI needs the same investment in its interface to online services. MCP is that interface, and tool design is its UX discipline.

One of our clients came to us with exactly this gap. They wanted AI agents to operate their web forms: filling in fields, clicking buttons, navigating multi-step workflows through a browser. They asked us to run tests evaluating how well browser-based agents could complete their online forms, and to help "fix" the forms for agent compatibility. We explained that this was significant effort in the wrong direction. Their web forms were designed for humans, with visual layout, hover states, and drag-and-drop interactions. Instead, we showed them that adding an MCP server to the same API sitting behind those forms gave AI agents a native interface purpose-built for how they work: structured inputs, clear descriptions, typed responses. The agents went from struggling with form fields to completing tasks reliably. The lesson: don't retrofit human interfaces for AI. Build AI-native interfaces alongside them: MCP servers for your internal and external services.

The parallels between UX design and MCP tool design run deep. Decades of UX research have produced principles that transfer directly:

  • Affordance: a door handle should look pullable. This maps to tool names and parameter descriptions: if a field is named id but requires a UUID, the affordance is broken.

  • Recognition over recall: it's easier to pick from a list than to type from memory. This maps to using enums and example values in schemas so the LLM recognizes valid inputs instead of guessing.

  • Visibility of system status: users need feedback when something goes wrong. This maps to error messages that explain what happened and how to fix it, rather than a cryptic "invalid input."

These aren't metaphors. They're the same design discipline applied to a different kind of user.

The Capability Triangle: Three Parties, One Tool

Even if you've been building MCP servers for months, don't skip this section. The Capability Triangle reframes tool design around a party that most MCP discussions ignore entirely: the domain expert. Every MCP tool sits at the intersection of three parties, each with distinct strengths and weaknesses. Understanding this balance is the foundation of good tool design.

The LLM (MCP Client)

The large language model (LLM) is the reasoning engine inside each MCP client, such as ChatGPT, Claude Desktop, or a custom agent. It brings language understanding, reasoning, and tool-calling intelligence. It's good at interpreting ambiguous user requests ("where's my package?"), choosing between available tools, composing multi-step plans, and recovering gracefully from errors.

What it's bad at: domain knowledge and symbolic computation. An LLM doesn't know which API capabilities matter for your specific users, and it can't access your databases. It doesn't know that your customer support team needs order tracking but never touches inventory management. It doesn't know your compliance requirements or your business rules.

The MCP Server

The server provides symbolic computation, data access, and validated operations. It's good at precise calculations, database queries, API calls with proper authentication, input validation, and returning structured results. It runs deterministically, and it is more predictable and easier to validate than LLM reasoning.

What it's bad at: understanding user intent. A server can't interpret "check if we have enough widgets for the Johnson order" without a tool specifically designed for that workflow. It doesn't adapt to ambiguity. It does exactly what it's told, nothing more.

The Human (Domain Expert and Server Designer)

This is the party that's most often overlooked, and it's the one that matters most. The human can be the developer, the product manager, the domain expert who designs the MCP server. They bring knowledge that neither the LLM nor the server possesses. They know which 20% of an API serves 80% of their users' actual requests. They understand the user personas and their existing processes. They know the business context.

What they're bad at: being present at runtime. The domain expert's knowledge has to be encoded into the tool's name, description, schema, and error messages. Every design choice is a message to the LLM about how to use the tool.

But "not present at runtime" doesn't mean "design it and walk away." Tool design is iterative. Your first design is a hypothesis about what your users need, and like any hypothesis, it needs validation. Usage logs tell you which tools are called, which fail, which are never used, and which requests produce no tool match at all. The domain expert reviews these logs and refines: renaming tools that confuse the LLM, improving descriptions that lead to wrong selections, adding tools for workflows that users need but the initial design missed.

This iterative loop is where MCP shines compared to direct API integration. Changing a tool's name, description, or input schema is a server-side change: no client updates, no SDK version bumps, no breaking changes propagated to consumers. The MCP protocol decouples tool discovery from tool invocation, so the LLM rediscovers the improved schema on the next connection. This makes the feedback cycle fast: observe failures, update the tool design, deploy, and measure again. Teams that treat tool design as a one-time exercise miss the biggest advantage of having MCP in the middle. Invest the effort to get the initial design right ("you never get a second chance to make a first impression"), but keep monitoring the server's usage logs and adjust to the patterns of real users.
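The log review described above can start as a simple summary: calls and failures per tool, plus registered tools that are never called. Here's a minimal pure-Rust sketch; the ToolCall record, its fields, and the tool names are illustrative assumptions, not part of the PMCP SDK:

```rust
use std::collections::HashMap;

/// Hypothetical usage-log record: which tool was called and whether it succeeded.
struct ToolCall<'a> {
    tool: &'a str,
    ok: bool,
}

/// Per-tool (calls, failures), plus registered tools that never appear in the log.
fn summarize<'a>(
    registered: &[&'a str],
    log: &[ToolCall<'a>],
) -> (HashMap<&'a str, (u32, u32)>, Vec<&'a str>) {
    let mut stats: HashMap<&str, (u32, u32)> = HashMap::new();
    for call in log {
        let entry = stats.entry(call.tool).or_insert((0, 0));
        entry.0 += 1; // total calls
        if !call.ok {
            entry.1 += 1; // failures
        }
    }
    let unused = registered
        .iter()
        .copied()
        .filter(|t| !stats.contains_key(t))
        .collect();
    (stats, unused)
}

fn main() {
    let registered = ["check_inventory", "track_latest_order", "list_warehouses"];
    let log = [
        ToolCall { tool: "check_inventory", ok: true },
        ToolCall { tool: "check_inventory", ok: false },
        ToolCall { tool: "track_latest_order", ok: true },
    ];
    let (stats, unused) = summarize(&registered, &log);
    println!("per-tool (calls, failures): {stats:?}");
    println!("never called: {unused:?}");
}
```

A tool with a high failure ratio needs a better description or schema; a tool that's never called is a candidate for renaming or removal.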

Why the Triangle Matters

Each party compensates for the others' weaknesses. The LLM handles ambiguity that the server can't. The server provides precision that the LLM can't. The human provides domain context that neither possesses at runtime.

This has a practical consequence that trips up most teams: the same API should produce different MCP servers for different users.

Consider the London Transit API. A daily commuter wants trip planning: "fastest route from Paddington to Canary Wharf avoiding the Jubilee line." An event organizer wants logistics: "how many bus routes serve Wembley Stadium, and what's the last departure after a 10 PM concert?" A municipal planner wants construction impact analysis: "if we close three stations on the Northern line for six weeks, which bus routes need capacity increases?"

Same API. Three completely different MCP servers. Three different sets of tools, with different names, different descriptions, and different response shapes, because the domain expert for each server knows their users.

Here's the key insight: when you ask an LLM to auto-wrap an API, it lacks this domain context. It can't know which 20% matters because it doesn't know who the user is. Auto-generated MCP servers produce generic tool sets that serve no one well. The domain expert's judgment, which is encoded into tool selection, naming, and descriptions, is what makes an MCP server effective.

How do you know your triangle is balanced? Measure task completion across the specific requests your users actually make, and not only the three test cases you tried during development. If completion is low, one vertex of the triangle is weak. Either the LLM can't understand your tools (fix descriptions), the server can't handle the requests (add or redesign tools), or the domain expert chose the wrong tools to expose (talk to your users).

Tool Anatomy: What Makes an MCP Tool

An MCP tool has six components: a name, a description, an input schema, an output schema, a handler, and error handling. Each one is a communication channel to the LLM, and each one matters.

Here's a complete tool in Rust using the PMCP SDK. Don't worry if you're not fluent in Rust; the comments walk through every important line:

// -- Dependencies --
// pmcp: the PMCP SDK for building MCP servers
// serde: serialization/deserialization (parses JSON input, formats JSON output)
// schemars: generates JSON Schema from Rust types (so the LLM knows what to send)
use pmcp::server::typed_tool::TypedToolWithOutput;
use pmcp::RequestHandlerExtra;
use serde::{Deserialize, Serialize};
use schemars::JsonSchema;

// -- Input Schema --
// This struct defines what the LLM must send. Each field becomes a property
// in the JSON Schema that the LLM sees when it discovers this tool.
// The doc comments (///) become the schema descriptions automatically.
//
// Annotations on each field define constraints that flow into the
// JSON Schema. The LLM sees "maxLength": 16 on the SKU field and
// "minimum": 1 on quantity BEFORE it calls the tool. A well-behaved
// client respects these; the server enforces them at runtime too.
// deny_unknown_fields rejects any extra fields the LLM might add.
#[derive(Debug, Deserialize, JsonSchema)]
#[schemars(deny_unknown_fields)]
struct CheckInventoryInput {
    /// Product SKU to look up (e.g., "WIDGET-42", "BOLT-7")
    #[schemars(length(max = 16))]
    sku: String,

    /// Number of items needed. Defaults to 1 if not specified.
    /// Use this to check whether a specific quantity is available
    /// before quoting delivery dates.
    #[serde(default = "default_quantity")]
    #[schemars(range(min = 1, max = 10000))]
    quantity_needed: u32,
}

// Default value: if the LLM doesn't specify a quantity, assume 1
fn default_quantity() -> u32 { 1 }

// -- Output Schema --
// Defining the output shape serves two purposes:
// 1. The LLM knows exactly what fields to expect in the response
// 2. Downstream tools or MCP Apps can rely on this structure
#[derive(Debug, Serialize, JsonSchema)]
struct InventoryResult {
    /// The product SKU that was checked
    sku: String,
    /// Whether the requested quantity is currently in stock
    in_stock: bool,
    /// Total quantity available in warehouse
    available: u32,
    /// Whether the requested quantity can be fulfilled
    sufficient: bool,
}

// -- Register the tool with the server --
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let server = pmcp::ServerBuilder::new()
        .name("inventory-server")
        .version("1.0.0")
        // Register the tool: name, handler with typed input and output,
        // plus a description that tells the LLM WHAT it does, WHEN to
        // use it, and what it RETURNS.
        .tool(
            "check_inventory",
            TypedToolWithOutput::new(
                "check_inventory",
                |input: CheckInventoryInput, _extra: RequestHandlerExtra| {
                    Box::pin(async move {
                        // In production, this queries your inventory database.
                        // Here we return a mock response for clarity.
                        let available = 847_u32;
                        Ok(InventoryResult {
                            sku: input.sku,
                            in_stock: available > 0,
                            available,
                            sufficient: available >= input.quantity_needed,
                        })
                    })
                },
            )
            .with_description(
                "Check inventory levels for a product by SKU. Returns stock \
                 status, available quantity, and whether the requested amount \
                 can be fulfilled. Use this before quoting delivery dates \
                 to customers."
            ),
        )
        .build()?;

    // Start the server over Streamable HTTP -- the production transport.
    // This makes your server accessible to any MCP client over the network:
    // Claude Desktop, ChatGPT, custom agents, or browser-based tools.
    // Unlike stdio (which requires local installation), HTTP lets
    // non-technical users connect without touching a terminal.
    server.run_streamable_http("0.0.0.0:3000").await?;
    Ok(())
}

While we're using Rust and the PMCP SDK throughout this series, the design principles (typed schemas, descriptive names, structured output) apply to any MCP-compliant server, whether TypeScript, Python, or anything else that speaks the protocol. These are protocol-level concerns, not language-level ones.

Let's walk through each component.

Name ("check_inventory"): The name follows a verb_noun pattern. It's unambiguous, and the LLM won't confuse this with a tool for updating inventory or listing products. Avoid generic names like get_data or process_request. The name is the LLM's first signal about what a tool does.

Description: This is the LLM's primary decision surface. Notice it does three things: it says what the tool does ("check inventory levels"), what it returns ("stock status, available quantity, and whether the requested amount can be fulfilled"), and when to use it ("before quoting delivery dates to customers"). The returns clause helps the LLM judge whether the tool can answer the user's request. The when-to-use clause is critical: it encodes workflow context that the description's author, the domain expert, knows and the LLM doesn't.

Input schema: The CheckInventoryInput struct defines what the LLM must send. Each field has a type (the LLM can't accidentally pass a string where a number is expected), a doc comment that becomes the JSON Schema description (the LLM sees "Product SKU to look up" when it discovers the tool), and optional defaults (quantity_needed defaults to 1 if omitted). The #[schemars(...)] annotations are the single source of truth for constraints: length(max = 16) on the SKU field generates "maxLength": 16 in the JSON Schema, and range(min = 1, max = 10000) on quantity generates "minimum": 1, "maximum": 10000. The LLM sees these rules when it discovers the tool, before it ever makes a call. And #[schemars(deny_unknown_fields)] on the struct means the LLM can't sneak in extra fields, as anything outside sku and quantity_needed is rejected.

Output schema: The InventoryResult struct defines what the tool returns. This is optional in the MCP spec, but we strongly recommend it. A defined output schema serves two purposes: the LLM knows exactly what fields to expect (it won't hallucinate response fields that don't exist), and downstream consumers, whether another tool in a chain or an MCP App rendering a UI widget, can rely on the structure. The sufficient field is a good example: it does the comparison server-side rather than asking the LLM to compare available against quantity_needed and risk getting it wrong.

Handler: The async closure that does the actual work. In this example, it returns a mock response for clarity. In production, this would query your inventory database, call a warehouse API, or perform whatever computation the tool promises. Notice that the handler receives a typed CheckInventoryInput and not raw JSON. The parsing already happened. Your handler code focuses on business logic, not input validation. This is the server's contribution to the Capability Triangle: reliable, deterministic execution.

Validation: Notice that constraints are declared once, on the struct fields, using #[schemars(...)] annotations. The same annotation serves two purposes: it generates the JSON Schema that the LLM reads at discovery time, and it defines the contract the server enforces at runtime. No duplication between schema and validation logic, where the struct is the single source of truth.

Security in MCP servers works in layers, and schema constraints are one of the easiest layers to add. First, serde enforces type safety: sku must be a string, quantity_needed must be an unsigned integer, and type-level attacks are blocked at deserialization before your code runs. Second, #[schemars(length(max = 16))] constrains input shape: it won't prevent SQL injection on its own (that's the job of parameterized queries and safe query construction in your database layer), but it does reject obviously malformed or abusive input early, before it reaches any downstream system. Real SKUs are short; a 200-character string is either a mistake or a probe, and there's no reason to let it through. Third, deny_unknown_fields prevents unexpected fields from slipping past the schema entirely. Each layer is simple, but together they reduce the attack surface significantly. The deeper security story, such as parameterized queries, OAuth 2.1, Rust's memory safety guarantees, and the OWASP MCP threat model, gets its own article later in this series.

Error handling: If the LLM sends input that doesn't match CheckInventoryInput, such as passing "sku": 42 instead of "sku": "WIDGET-42", serde produces an error message explaining the type mismatch. If the SKU exceeds 16 characters, the schema constraint rejects it before the handler runs. For business logic errors inside the handler, use pmcp::Error::validation() with actionable messages following a three-part template: what went wrong, what was expected, and an example of correct input. Good error messages suggest one or two specific fixes, since multiple options force the LLM to guess, and guessing wastes tokens and user patience.

Notice this isn't a local development tool. This is a server designed for a specific user, who needs to quote delivery dates. The domain expert decided that inventory checks matter for their users, and they encoded that context into the description and the output shape. The sufficient field exists because the domain expert knows that customers ask "do you have enough?" not "how many do you have?" A different domain expert building for a warehouse manager might expose entirely different tools from the same inventory system.

Can an LLM discover this tool and call it correctly on the first try? If not, your name or description needs work. That's the simplest measurement of tool design quality, and it's one you can test in five minutes with any MCP client.

Outcomes, Not Operations

The domain expert in the Capability Triangle knows something the LLM never will: what outcome the user actually wants. When a customer asks "where's my order?", they don't want a customer ID, then a list of order IDs, then a status lookup. They want a tracking link and an ETA. The difference between those two experiences is the difference between operation-oriented and outcome-oriented tool design.

Here's the anti-pattern. A team with a REST background wraps their existing endpoints as MCP tools:

  • get_customer_by_email(email) returns a customer_id
  • list_customer_orders(customer_id) returns an array of order_id values
  • get_order_status(order_id) returns a status string

To answer "where's my order?", the LLM must chain all three calls in the correct sequence. The costs compound at every step:

  • More tokens. The LLM processes the full response from each tool call and generates the next call. Three round trips means three times the input and output tokens, which is cost that the user pays for without getting any additional value.
  • More latency. Each step requires a network round trip to the MCP server plus LLM processing time to interpret the result and formulate the next call. What could be a sub-second single call becomes a multi-second chain.
  • Growing risk of misstep. The probability of a correct sequence is the product of each step's success rate. If each tool call has a 95% chance of correct execution, three chained calls drop to 85.7%. At five steps, you're at 77.4%. The LLM must remember variable names and values from earlier calls, handle edge cases at each step, and maintain coherence across the full chain. Each step is another opportunity for the model to hallucinate a parameter, misinterpret a response, or lose track of its plan.

|             | Operation-Oriented (REST style)            | Outcome-Oriented (MCP style)    |
| ----------- | ------------------------------------------ | ------------------------------- |
| Tool count  | High (1 per endpoint)                      | Low (1 per user goal)           |
| LLM effort  | High (choreographing multi-step chains)    | Low (single-shot invocation)    |
| Token cost  | High (processing every intermediate result)| Low (one request, one response) |
| Latency     | High (N round trips + N LLM inferences)    | Low (single round trip)         |
| Reliability | Low (3+ compounding points of failure)     | High (deterministic server logic) |
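The compounding-reliability arithmetic above is easy to verify: the success rate of a chain is the product of the per-step success rates. A quick sketch:

```rust
/// Probability that an entire chain of tool calls succeeds, given a
/// per-step success rate. Each step multiplies in another chance to fail.
fn chain_success(per_step: f64, steps: u32) -> f64 {
    per_step.powi(steps as i32)
}

fn main() {
    // 95% per-step reliability degrades quickly as the chain grows.
    for steps in [1, 3, 5] {
        println!("{steps} step(s): {:.1}%", chain_success(0.95, steps) * 100.0);
    }
}
```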

Now consider the outcome-oriented alternative:

// -- Input: just the customer's email --
#[derive(Debug, Deserialize, JsonSchema)]
struct TrackOrderInput {
    /// Customer email address (e.g., "alice@company.com")
    #[schemars(length(max = 254))]
    email: String,
}

// -- Status enum: the LLM sees valid values in the schema --
// Instead of a free-form string, an enum lets the LLM "recognize"
// valid statuses rather than "recall" them from memory.
#[derive(Debug, Serialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
enum OrderStatus {
    Processing,
    Shipped,
    InTransit,
    Delivered,
}

// -- Output: everything the LLM needs to answer the question --
#[derive(Debug, Serialize, JsonSchema)]
struct OrderTrackingResult {
    /// Customer name for the greeting
    customer: String,
    /// Order identifier
    order_id: String,
    /// Current order status
    status: OrderStatus,
    /// Shipping carrier name
    carrier: String,
    /// Estimated delivery date (ISO 8601)
    eta: String,
    /// Direct tracking URL the customer can click
    tracking_url: String,
}

The tool registration follows the same pattern as check_inventory:

.tool(
    "track_latest_order",
    TypedToolWithOutput::new(
        "track_latest_order",
        |input: TrackOrderInput, _extra: RequestHandlerExtra| {
            Box::pin(async move {
                // Internally: resolve customer, find latest order, get status.
                // The server handles the entire chain -- three API calls
                // collapsed into one deterministic operation.
                Ok(OrderTrackingResult {
                    customer: "Alice Chen".into(),
                    order_id: "ORD-8834".into(),
                    status: OrderStatus::InTransit,
                    carrier: "FedEx".into(),
                    eta: "2026-03-20".into(),
                    tracking_url: "https://fedex.com/track/ABC123".into(),
                })
            })
        },
    )
    .with_description(
        "Track the most recent order for a customer using their email. \
         Returns order status, carrier info, and tracking link. Use this \
         when a customer asks 'where is my order?' or 'when will it arrive?'"
    ),
)

One tool. One user outcome. The output struct gives the LLM a rich, typed response, with customer name, status, carrier, ETA, and a clickable tracking URL, which is everything it needs to answer the question in a single turn. The server handles the chaining internally (resolve customer, find latest order, fetch status) because that's what servers are good at: deterministic, multi-step computation. In a production environment, your server handles requests from users who don't know MCP exists and don't care about your API structure. They just want answers. In the Capability Triangle, symbolic computation and data access are the server's strengths. Let the server do the work it's built for, and let the LLM do what it's built for: understanding the user's intent and presenting a clear answer.

This isn't a theoretical pattern. Block built 60+ production MCP servers. Their Linear integration started with 30+ tools mirroring GraphQL endpoints, with one tool per query, one tool per mutation. After three iterations, they were down to 2 tools. The tool count dropped because the team learned to design for outcomes. Each iteration moved complexity from the LLM (which had to choreograph multi-tool sequences) into the server (which could handle the orchestration deterministically).

Measurement point: Test this yourself. Give 10 users the same task ("find my latest order status"). With the 3-tool REST mapping, measure how many succeed on the first try. Now try the single outcome-oriented tool. The difference in task completion rate is your design quality signal.

Less Is More: The Evidence for Tool Reduction

Outcome-oriented design naturally reduces tool count. But how much does reduction actually matter? The research is unambiguous.

GitHub reduced their Copilot MCP integration from 40 built-in tools to 13 core tools. The result: a 2 to 5 percentage point improvement across the SWE-Lancer and SWE-bench Verified benchmarks, plus a 400ms latency reduction. Fewer tools meant the model spent less time on tool selection and more time on the actual task. The gains came not from adding capability, but from removing it.

The Speakeasy team ran a controlled experiment using a Pet Store API. At 107 tools, both large and small models failed completely, and task success collapsed. At 20 tools, large models scored 19 out of 20 correct. At 10 tools, performance was perfect. The failure wasn't gradual. It was a cliff: past a threshold, models don't degrade gracefully. They fall off.

Why does success collapse rather than degrade? Two mechanisms compound. First, context window bloat: every tool name, description, and parameter schema consumes tokens on every request. At 50+ tools, this can eat 5 to 7 percent of the model's context before a single user message arrives, thus crowding out conversation history, document content, and reasoning space. Second, and more insidious, is tool hallucination: when the LLM's attention is spread across too many similar-sounding tools, it starts inventing tool names that don't exist, conflating parameters between tools, or calling the right tool with arguments from a different tool's schema. This is the same "instruction following degradation" that causes LLMs to drift off-task in long prompts, except here, each hallucinated tool call is a hard failure, not a soft one. The model doesn't produce a slightly wrong answer. It produces no answer at all.
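The context-bloat figure is back-of-the-envelope arithmetic. A sketch with assumed numbers (roughly 150 tokens per tool definition and a 128k-token window; your real numbers will vary by model and schema size):

```rust
/// Percentage of the context window consumed by tool definitions alone,
/// before any user message arrives.
fn schema_overhead_pct(tools: u32, tokens_per_tool: u32, context_window: u32) -> f64 {
    (tools * tokens_per_tool) as f64 / context_window as f64 * 100.0
}

fn main() {
    // Assumed: ~150 tokens per tool (name + description + parameter schema).
    let pct = schema_overhead_pct(50, 150, 128_000);
    println!("50 tools consume ~{pct:.1}% of a 128k context on every request");
}
```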

In UX terms, this is information overload. Just as a human can't choose from a menu of 100 items without decision fatigue, an LLM's attention fragments across too many similar-sounding options. The threshold varies by model size. Small models (8B parameters) hit their sweet spot around 19 tools and fail at 46. Even the largest models struggle past 100 tools.

As Hugging Face's Phil Schmid puts it: "Curate ruthlessly. 5 to 15 tools per server. One server, one job."

This raises an obvious question: if you expose only 10 to 15 tools, aren't you leaving functionality on the table? Yes, deliberately, and that's the right choice. We'll see why shortly, when we look at how much of an API your users actually need.

Measurement point: Count your tools. If you have more than 15 per server, you're likely past the diminishing returns threshold. Benchmark your task completion rate before and after pruning, and the numbers will make the case for you.

The 97% Problem: Tool Description Quality

You can have the right number of tools, designed for the right outcomes, and still fail. A 2025 study analyzing MCP tool descriptions across the ecosystem found that 97.1% contain at least one quality issue. More than half (56%) have unclear purpose statements. Your tools might be well-designed, but if the LLM can't understand when to use them, that design is invisible.

Tool descriptions are not documentation. They are the LLM's primary decision surface. When the LLM sees 15 tools and must choose one, the description is the only signal it has. A vague description is like a restaurant menu that says "food" for every dish, which is technically accurate, practically useless.

The research identified six components of a quality tool description: Purpose (what the tool does), Guidelines (when and how to use it), Limitations (what it cannot do or when to use something else), Parameter Explanation (input format and constraints), Length (enough detail without overwhelming), and Examples (concrete usage scenarios). Most descriptions fail on multiple components simultaneously.

Here's what the improvement looks like in practice. Consider a flight search tool across three levels of description quality:

// LEVEL 1 -- Vague (56% of MCP tools have this problem)
.with_description("Search for flights")

// LEVEL 2 -- Better purpose, but missing guidelines and limitations
.with_description("Search for available flights between two airports on a given date")

// LEVEL 3 -- Full rubric: purpose + guidelines + limitations
.with_description(
    "Search for available flights between two airports on a specific date. \
     Returns up to 20 results sorted by price. Use 3-letter IATA airport \
     codes (e.g., 'LAX', 'JFK'). Only searches economy class. For business \
     or first class, use the premium_flight_search tool. Dates must be \
     within the next 330 days."
)

Level 1 tells the LLM nothing about parameters, constraints, or when to use an alternative tool. The LLM has to guess at everything: input format, result shape, scope. Level 2 adds purpose and the LLM knows it needs two airports and a date, but it doesn't know about airport code format, result limits, or class restrictions. It might pass "Los Angeles" instead of "LAX", or ask for business class flights and get wrong results. Level 3 gives the LLM everything it needs to (a) decide to use this tool, (b) provide correct inputs, and (c) know when NOT to use it, that last point being critical for multi-tool servers where the LLM must choose between similar options.

In the same study, augmented descriptions improved task success by 5.85 percentage points in controlled testing. That may sound modest, but at scale it's the difference between a tool that works most of the time and one that works almost all of the time. For a customer-facing agent handling thousands of requests per day, those percentage points represent real users getting real answers.
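To translate percentage points into daily impact, multiply by request volume. A sketch assuming a hypothetical volume of 10,000 requests per day:

```rust
/// Additional successful requests per day from a given improvement
/// in task success rate, expressed in percentage points.
fn extra_successes(daily_requests: u32, improvement_pp: f64) -> f64 {
    daily_requests as f64 * improvement_pp / 100.0
}

fn main() {
    // Assumed volume; 5.85pp is the improvement reported in the study.
    let extra = extra_successes(10_000, 5.85);
    println!("~{extra:.0} additional successful requests per day");
}
```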

Description quality extends to error messages. When a tool receives invalid input, the error message is the LLM's only guide for recovery. Compare these two approaches:

// BAD: LLM tries random fixes
return Err(pmcp::Error::validation("Invalid input"));

// GOOD: Problem + expectation + example
return Err(pmcp::Error::validation(
    "Invalid date format for 'departure': '15/04/2026'. \
     Use ISO 8601 format (YYYY-MM-DD). \
     Example: '2026-04-15'"
));

The first error forces the LLM to guess. It might try a different date format, or remove the date entirely, or change a different parameter. Each wrong guess wastes a round trip and user patience. The second error follows a three-part template: what went wrong ("invalid date format for 'departure'"), what was expected ("ISO 8601 format"), and an example of correct input ("2026-04-15"). Suggest one or two fixes maximum. Multiple options force the LLM to guess, and guessing is what we're trying to eliminate.
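The three-part template is easy to standardize with a small helper. This is a sketch in plain Rust; the `validation_error` function is our own name, not part of the PMCP SDK, and in real code you would wrap its output in `pmcp::Error::validation`:

```rust
/// Formats a validation error using the three-part template:
/// what went wrong, what was expected, and one example of correct input.
fn validation_error(field: &str, got: &str, expected: &str, example: &str) -> String {
    format!(
        "Invalid value for '{field}': '{got}'. Expected {expected}. Example: '{example}'"
    )
}

fn main() {
    let msg = validation_error(
        "departure",
        "15/04/2026",
        "ISO 8601 format (YYYY-MM-DD)",
        "2026-04-15",
    );
    // One problem, one expectation, one example -- no menu of guesses.
    println!("{msg}");
}
```

Funneling every validation failure through one formatter keeps the recovery instructions consistent across tools, so the LLM learns a single error shape instead of a different one per handler.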

This is where the typed struct pattern we saw in the Tool Anatomy section pays off. Remember how CheckInventoryInput used doc comments on each field to generate JSON Schema descriptions? The same pattern applies to every tool. When the domain expert writes /// Product SKU to look up (e.g., "WIDGET-42", "BOLT-7") on a struct field, that text becomes the LLM's guide for formatting its input. The type system enforces correctness at parse time, before the handler code ever runs. And the output schema tells the LLM exactly what fields to expect, so it won't hallucinate response fields that don't exist.

This connects back to the Capability Triangle. The domain expert writes these descriptions. The LLM reads them. The server validates against them. All three parties of the triangle are aligned, and the domain expert's knowledge, encoded at design time, guides the LLM's decisions at runtime and is enforced by the server's type system.

Measurement point: Test your description quality by presenting your tool list to an LLM and asking it to select the right tool for 10 different user requests. If tool selection accuracy is below 90%, your descriptions need work. This test takes five minutes and tells you more about your MCP server's real-world effectiveness than any benchmark.
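The scoring side of that five-minute test is trivial to automate. In this sketch the LLM's picks are hard-coded; in practice you would collect them by prompting your model with the tool list and each request (the function name and data are illustrative):

```rust
/// Computes tool-selection accuracy from (expected_tool, chosen_tool) pairs.
fn selection_accuracy(trials: &[(&str, &str)]) -> f64 {
    let correct = trials.iter().filter(|t| t.0 == t.1).count();
    correct as f64 / trials.len() as f64
}

fn main() {
    // Ten user requests: the tool we expected vs. the tool the LLM chose.
    let trials = [
        ("flight_search", "flight_search"),
        ("flight_search", "flight_search"),
        ("flight_search", "flight_search"),
        ("premium_flight_search", "premium_flight_search"),
        ("premium_flight_search", "premium_flight_search"),
        ("flight_search", "flight_search"),
        ("flight_search", "premium_flight_search"), // wrong pick
        ("premium_flight_search", "flight_search"), // wrong pick
        ("flight_search", "flight_search"),
        ("flight_search", "flight_search"),
    ];
    let acc = selection_accuracy(&trials);
    println!("selection accuracy: {:.0}%", acc * 100.0);
    if acc < 0.90 {
        println!("below 90% -- descriptions need work");
    }
}
```

Confusions between similar tools (here, `flight_search` vs `premium_flight_search`) are exactly the signal to act on: they point at the description pair whose boundaries need sharpening.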

The Full-API Trap (And the Pareto Escape)

We said that exposing only 10 to 15 tools means leaving functionality on the table. Now let's talk about why that's the right call.

The most tempting mistake in MCP server design is wrapping your entire API. You have 200 endpoints, so you generate 200 tools. The OpenAPI-to-MCP converter makes it easy. The result is a server that does everything and succeeds at nothing. The LLM sees 200 tool descriptions, burns through context window space parsing them, and still picks the wrong one, because 200 options is not a menu, it's a phone book.

The deeper problem is semantic noise. When you auto-wrap an API, you inject your backend's implementation details into the LLM's reasoning space. The LLM shouldn't have to understand your database normalization, your internal microservice boundaries, or your pagination cursor format. It should see tools that map cleanly to user intent. Auto-wrapping exposes tools like get_customer_by_internal_id and list_orders_with_cursor_pagination, which are concepts that exist because of how your backend is built, not because of what your users need. Every implementation-detail tool is noise that the LLM must parse, evaluate, and reject before it can find the tool that actually answers the user's question.

The Pareto Escape

The way out is the 80/20 rule. In practice, roughly 20% of an API's capabilities serve 80% of user requests. The domain expert, who is the third vertex of the Capability Triangle, is the person who knows which 20%.

Request Distribution: The 80/20 Rule

The left side of the curve is where your MCP tools live: the high-frequency request types that your users ask for every day. These are the 10 to 15 outcome-oriented tools you design carefully, with typed schemas, quality descriptions, and validation constraints. They handle the bulk of traffic reliably and fast.

The right side is the long tail: rare, unpredictable requests that don't justify a dedicated tool. Creating tools for every edge case pushes you back into the 50+ tool zone where LLM performance collapses. The Pareto line is where you stop adding tools, and start thinking about a different mechanism for everything to its right.
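If you have request logs, finding the Pareto line is a short script. This sketch (endpoint names and counts are made up for illustration) sorts endpoints by traffic and keeps the smallest set that covers a target share of requests:

```rust
/// Returns the smallest set of endpoints, highest-traffic first,
/// whose combined request counts reach the given coverage fraction.
fn pareto_cut(mut counts: Vec<(&str, u64)>, coverage: f64) -> Vec<&str> {
    counts.sort_by(|a, b| b.1.cmp(&a.1)); // highest traffic first
    let total: u64 = counts.iter().map(|(_, c)| c).sum();
    let target = (total as f64 * coverage).ceil() as u64;
    let mut covered = 0;
    let mut keep = Vec::new();
    for (name, c) in counts {
        if covered >= target {
            break;
        }
        covered += c;
        keep.push(name);
    }
    keep
}

fn main() {
    // Hypothetical request counts per API endpoint.
    let counts = vec![
        ("track_order", 500),
        ("search_products", 300),
        ("check_inventory", 120),
        ("update_address", 40),
        ("export_invoices", 25),
        ("merge_accounts", 15),
    ];
    // Candidates for dedicated MCP tools; everything else is long tail.
    let tools = pareto_cut(counts, 0.80);
    println!("endpoints covering 80% of traffic: {tools:?}");
}
```

The output is a starting point, not the final tool list: the surviving endpoints still need to be regrouped into outcome-oriented tools by the domain expert.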

Block has built over 60 production MCP servers, and one practice has held across all of them: generate, then prune. Generate tools from your API spec, then ruthlessly cut. Most teams end up keeping 10 to 15 percent of what they started with. The tools that survive are the ones that map to actual user outcomes, and identifying those outcomes requires the domain expert who knows the users.

Measurement point: Give your MCP server to 20 users representing your target persona. Track task completion rates across their actual requests, not your limited test cases. If completion is below 80%, you're either exposing too many tools (confusion) or the wrong tools (coverage gap). The Capability Triangle tells you which: if the LLM selects wrong tools, fix descriptions. If the right tool doesn't exist, you chose the wrong 20%. If the tool exists but returns unhelpful results, fix the server's implementation.

For the Other 80%: A Preview of Code Mode

So you've curated your tools to the critical 20%. But users will inevitably ask for something outside that set. What then?

This is where code mode enters. Instead of creating a tool for every possible request, you let the LLM write code that calls your API directly. Anthropic's engineering team found that code execution reduced token usage from 150,000 to 2,000 tokens (98.7% reduction) in a Google Drive to Salesforce workflow. The LLM writes a targeted script, executes it, and returns just the result. No 200-tool context window bloat. No multi-step tool chaining. Just precise, one-shot computation.

The playbook: design 10 to 15 outcome-oriented tools for the common 80% of requests. For the long tail, provide code mode access with appropriate guardrails. This gives you broad coverage without the tool count explosion that kills LLM performance. Your curated tools handle the predictable workflows fast. Code mode handles the unpredictable ones flexibly.

We'll cover code mode in depth in a later article in this series: how to set it up, how to secure it, and when it's the right (and wrong) choice. For now, the key insight is that tool reduction isn't about limiting your users. It's about choosing the right mechanism for each type of request.

Key Takeaways

  1. The Capability Triangle drives everything. Good tool design requires balancing what the LLM can do (interpret intent, select tools), what the server should do (precise computation, data access), and what the domain expert knows (which capabilities matter for which users). When one vertex is weak, task completion suffers.

  2. Design for outcomes, not operations. One tool per user goal, not one tool per API endpoint. The customer asking "where's my order?" wants a tracking link, not three chained API calls. Move orchestration complexity into the server, where it runs deterministically.

  3. Less is more. Keep servers to 5 to 15 tools. Evidence from GitHub Copilot, Speakeasy, and Block consistently shows that performance degrades sharply past 20 tools. The failure is not gradual; it's a cliff.

  4. Descriptions are the user interface. Use the six-component rubric: Purpose, Guidelines, Limitations, Parameters, Length, and Examples. With 97% of tool descriptions containing quality issues, this is a big opportunity for immediate improvement.

  5. Error messages are recovery instructions. Use the three-part template: what went wrong, what was expected, and an example of correct input. Suggest one or two fixes, not five. Ambiguity in error messages wastes round trips and user patience.

  6. Know your users. The domain expert vertex of the Capability Triangle determines which 20% of API capability to expose. The same API should produce different MCP servers for different user personas. Auto-wrapping skips this judgment and produces servers that serve no one well.

  7. Measure task completion across diverse user requests. Not your three test cases during development, but real requests from real users representing your target persona. If completion is low, the Capability Triangle tells you which vertex to fix.

Continue the Series

This article covered the foundation: how to design MCP tools that LLMs can actually use. The rest of the series goes deeper.

  • Want to add user-controlled workflows? Read our article on Prompts and Resources, where we cover MCP's underutilized primitives for guided interactions.
  • Ready to test your server? See Testing MCP Servers for unit testing, integration testing, and description quality validation.
  • Concerned about security? MCP Security covers OAuth 2.1, input validation, and the common vulnerabilities that affect 43% of MCP servers.
  • Building from an existing API spec? Schema-Driven MCP Servers shows the generate-then-prune workflow in detail, from OpenAPI spec to curated tool set.
  • Interested in code mode? Code Mode for MCP explores the long-tail strategy we previewed above: how to let the LLM write code safely against your API.

For hands-on practice with these patterns, the Advanced MCP course provides guided exercises building production MCP servers in Rust with the PMCP SDK.
