Wojciech Wentland

Posted on Jul 1 • Originally published at blog.wentland.io

Designing tools that AI can actually use


Here's a tool definition I see all the time:

@tool
def get_data(id: str):
    """Gets data."""
    ...

The model has no way to know when to call this. "Gets data" could match almost any request. And when an agent is choosing among thirty tools, the description does most of the routing. To the model, that two-word docstring is the interface; the code underneath barely enters the decision.

Most of the agent bugs I chase turn out to be tool-selection bugs. The model picks the wrong tool, or picks nothing when it should have acted. Sometimes it calls the right tool with the wrong arguments. The fix is almost always in the description, not the code.

Descriptions are prompts, not docstrings

A description has to answer two questions for the model: when should I call this, and what do I get back. Here's a search tool written for a human reader:

@tool
def search_entities(query: str):
    """Search for entities."""

And written for the model that has to route to it:

from typing import Annotated

@tool
def search_entities(
    query: Annotated[str, "Search term, matches name and domain. Use exact terms, NO wildcards"],
):
    """Find entities by name or domain.

    Use this when the user refers to an entity but you don't
    have its exact ID. Partial and approximate names work.
    Returns a ranked list of matches with IDs, names, and types.

    If you can't find an entity by name, try its domain
    (e.g., "example.com").

    Example questions:
        - "Find the entity for Acme Corp"
        - "Search for the vendor with domain example.com"
    """

The second one says when to reach for it, what comes back, and what to do when the obvious approach fails. The Example questions block is pulling more weight than it looks: it anchors the kinds of user requests that should route here, which is the exact decision the model is making.

Notice what's missing. My first draft said "find entities by name using fuzzy matching." The model doesn't care that the matching is fuzzy; that's an implementation detail. It cares that it can type something approximate and get results. So I describe the behavior the model can act on, and how the matching works underneath never makes it into the text.

One caution on those example questions: keep them varied. If they all look alike, the model decides the tool is narrower than it is and skips it when a question doesn't match the shape. I lean on examples mostly for parameters with non-obvious formats, and I keep the ones I do use deliberately different from each other.

Literal is the highest-value annotation you can add

A parameter called country could be a country name, a two-letter code, a three-letter code, or a UN numeric code. The model will guess, and it'll guess wrong often enough to matter.

from typing import Annotated, Literal
from pydantic import Field

Status = Literal["active", "inactive", "undetermined"]

@tool
def search_records(
    # Inline description, fine for simple params
    query: Annotated[str, "Search term. Use exact names, NO wildcards"],

    # Literal pins the value to a known set
    status: Annotated[list[Status] | None, "Filter by status"] = None,

    # Field adds examples and validation
    country_code: Annotated[
        str | None,
        Field(description="Two-letter ISO 3166-1 alpha-2 country code", examples=["DE", "US", "PL"]),
    ] = None,

    # ge=1 rejects page 0 or a negative number at validation time
    page: Annotated[int, Field(description="Page number, 1-indexed", ge=1)] = 1,
):
    ...

When the model sees Literal["active", "inactive", "undetermined"], it knows the exact set of valid values. No guessing, no sending "Active" with a capital A. For open-ended parameters you can't enumerate, the examples field gives the model a sense of the format without pinning it down, and the ge=1 constraint on page is a guardrail you get for free.
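To make the Literal point concrete, it helps to look at roughly what reaches the model. Pydantic-based tool frameworks typically render a Literal as an enum in the parameter's JSON Schema. Here is a minimal stdlib-only sketch of that translation; literal_to_schema and is_valid are hypothetical helpers written for illustration, not any framework's API, and a real framework produces a richer schema:

```python
from typing import Literal, get_args

Status = Literal["active", "inactive", "undetermined"]


def literal_to_schema(literal_type) -> dict:
    """Hypothetical helper: render a Literal of strings as a JSON Schema enum."""
    return {"type": "string", "enum": list(get_args(literal_type))}


def is_valid(value: str, literal_type) -> bool:
    """Exact, case-sensitive membership check, like enum validation."""
    return value in get_args(literal_type)


print(literal_to_schema(Status))
print(is_valid("active", Status), is_valid("Active", Status))
```

The enum list is the whole contract: the model can see every allowed value, and a capitalized "Active" fails the check instead of slipping through to the backend.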

The country code taught me where guidance actually belongs. The parameter said "two-letter code," and the model still passed "Germany" when the user said Germany. Tightening the parameter description didn't help. What helped was a line in the tool description: "If the user gives a country name, convert it to the ISO alpha-2 code before calling." The parameter description sets the format, but the model only reads it once it's already decided to send a country, and by then the user's "Germany" is sitting in the slot. The instruction to convert had to live where the model looks earlier, in the tool description.

Empty does not mean failed

A model queries something, gets back an empty list, and decides the tool failed. So it retries with different parameters, switches tools, or tells the user something went wrong when the real answer was "zero results matched." It took me a while to even see this one, because the tool was returning exactly what it should the whole time.

The reinforcement has to land in more than one place:

In the tool description: "Returns an empty list when nothing matches. An empty result is a valid answer, not an error."
In the system prompt or server instructions: "Empty results from these tools mean the query ran and found no matching data."
In the response itself. Instead of [], return {"results": [], "count": 0, "message": "No records matched"}. The structure carries enough context that the model reports the result instead of assuming the call broke.

After that, the "let me try another approach" false retries mostly stopped.
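The response-shape part is the easiest to show in code. Here is a minimal sketch of a wrapper that produces that envelope; build_response is a hypothetical name, and the wording of the non-empty message is my own assumption:

```python
from typing import Any


def build_response(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap results so an empty list reads as an answer, not a failure."""
    count = len(records)
    if count == 0:
        message = "No records matched"
    else:
        message = f"{count} record(s) matched"
    return {"results": records, "count": count, "message": message}


print(build_response([]))
print(build_response([{"id": "rec-1", "name": "Acme Corp"}]))
```

Returning the same keys in both cases means the model never has to tell a bare [] apart from a broken call.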

Where guidance lives

The model picks up tool behavior from four levels, and they do different jobs.

Level	Scope	What it's for
Tool description	one tool	when to call it and what comes back; this is where routing happens
Parameter description	one field	format, valid values, defaults
System prompt	the conversation	cross-tool behavior: date formats, empty-result semantics, rate-limit awareness
Server instructions	one MCP server	the same behavioral guidance, delivered over the protocol

The top two are specific to a single tool or field. The bottom two carry behavior that spans every tool from one source, and they get filled in last, if at all.

The system prompt is where I put cross-tool rules. You can inject them as structured context:

<plugins>
  <plugin name="analytics">
    This plugin provides tools for querying the analytics warehouse.
    All dates in this system are UTC.
    Empty results mean the query found nothing, not that it failed.
    Report them as "no data found" rather than retrying with different parameters.
  </plugin>
</plugins>

Keep that block behavioral, not referential: date formats, empty-result semantics, rate-limit awareness. It's the right home for the kind of guidance that would bloat every individual tool description if you repeated it.
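If the rules live in code next to each plugin, the block can be generated instead of hand-edited. Here is a small sketch, assuming each plugin contributes a name and a list of one-line rules; render_plugins is a hypothetical helper, not part of any framework:

```python
from xml.sax.saxutils import escape, quoteattr


def render_plugins(plugins: dict[str, list[str]]) -> str:
    """Render per-plugin behavioral rules as a <plugins> block for the system prompt."""
    lines = ["<plugins>"]
    for name, rules in plugins.items():
        lines.append(f"  <plugin name={quoteattr(name)}>")
        # escape() handles &, < and > so a rule cannot break the markup
        lines.extend(f"    {escape(rule)}" for rule in rules)
        lines.append("  </plugin>")
    lines.append("</plugins>")
    return "\n".join(lines)


print(render_plugins({
    "analytics": [
        "All dates in this system are UTC.",
        "Empty results mean the query found nothing, not that it failed.",
    ],
}))
```

Keeping the rules as data also makes it easy to drop a plugin's block when its tools are filtered out of a session.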

MCP server instructions carry the same kind of guidance, but they ship as protocol metadata, in the server's response to the initialize handshake, rather than in your prompt. Clients surface them inconsistently and some ignore them entirely, so I treat the instructions field (in FastMCP, an argument you pass when constructing the server) as a supplement to the system prompt, not a replacement.

I've seen setups where everything gets crammed into the tool descriptions. That holds up until you have fifteen tools on one server and each description runs two hundred words.

The fastest way to break tool selection

Mention another tool by name. The moment a description says "for full history use get_entity_details," you've created a dependency on that name. Filter the tool out or rename it and the model still tries to call it. I've watched models hallucinate calls to tools they only saw referenced in another tool's description. Even the soft version, "call the tool that returns full history," is fragile. So each description only talks about its own tool now; the moment one starts referencing another, I treat it as a bug.
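Since I treat cross-references as bugs, they can be checked like bugs. Here is a rough sketch of a lint that flags any description naming another registered tool; find_cross_references is a hypothetical helper, and the two descriptions are made up for the example:

```python
import re


def find_cross_references(descriptions: dict[str, str]) -> dict[str, list[str]]:
    """Map each tool name to the other tool names its description mentions."""
    problems: dict[str, list[str]] = {}
    for name, text in descriptions.items():
        mentioned = [
            other
            for other in descriptions
            if other != name and re.search(rf"\b{re.escape(other)}\b", text)
        ]
        if mentioned:
            problems[name] = mentioned
    return problems


descriptions = {
    "search_entities": "Find entities by name or domain. For full history use get_entity_details.",
    "get_entity_details": "Return the full record for one entity ID.",
}
print(find_cross_references(descriptions))
```

This only catches literal names. The soft version ("call the tool that returns full history") still needs a human read.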

Two smaller ones. Vague parameter names like q or data give the model nothing to reason from; the name itself is a signal, so search_query and date_range_start earn their length. And implementation details ("calls /api/v2/entities with pagination") are for whoever's reading the source, not the model deciding whether to call.

Sometimes the fix is one fewer tool

I had a tool that exposed a big reference dataset, millions of rows of geographic data. It worked. But its description was broad enough to match all sorts of questions it wasn't meant for, so the model kept reaching for it. Removing it, and routing those questions through a narrower tool, raised accuracy across the whole neighborhood of related queries.

That lines up with what Anthropic's tool use docs recommend: detailed descriptions, and fewer capable tools over many overlapping ones. Their tool search takes it further, deferring a tool's full schema until the model actually selects it, which is how Claude Code holds selection accuracy across a large tool count. I can't always defer like that, so the lever I'm left with is blunt: if a tool causes more confusion than it earns, it comes out.

The descriptions that work don't read like docstrings. They read like a handoff note to whoever's covering your shift: when you'd actually reach for this, and the input that always trips people up. How long that note can run before the model starts skimming it, I don't know. Somewhere past a couple hundred words, but I've never actually measured where.

Related: Why I only build read-only MCP servers and Your MCP server is not an API adapter.

Top comments (1)

Alex Shev • Jul 1

This is the heart of tool design for agents: the model should not have to infer the product manual from a vague function name. A good tool is almost boring: narrow verb, typed inputs, examples, failure cases, and an error message that tells the agent what to try next.