Alex Cloudstar

Posted on • Originally published at alexcloudstar.com

Designing Tools For AI Agents In 2026: Schemas, Descriptions, And The Pitfalls That Make LLMs Fail Silently

The first agent I shipped that got real usage failed in a way I did not expect. The model was fine. The prompt was fine. The traces showed the agent reaching for a tool called search_docs and confidently passing the user's entire question as the query, including the polite preamble and the trailing thanks. The tool was returning irrelevant results because nobody had told the model that the query parameter wanted a keyword phrase, not a sentence. I had written a one-line description that said "search the documentation" and called it good. The model did exactly what that description told it to do. It searched the documentation. With the wrong input. Because I never said what the input was supposed to look like.

That bug took me three days to find, because the agent looked like it was working. The traces had successful tool calls. The outputs were grammatical. The user was getting answers that sounded plausible and were quietly wrong. The fix was not to swap the model. The fix was to rewrite the tool description and tighten the schema. After that, the agent worked. The model had been trying to help me the whole time. I had been the limiting factor.

That was eighteen months ago. Since then I have shipped a dozen agents in production, debugged a few dozen more, and watched the failure modes converge. The boring truth is that most agent failures are tool design failures. The model is the easy part. The tool is where you make or break the thing. By 2026, the patterns for designing tools that LLMs can use without falling on their face have stabilized enough to write down. This is what I wish someone had handed me before I shipped that first agent.

The Mental Model: Tools Are A Public API For The Model

The frame that fixed my tool design was treating each tool as a public API consumed by a customer who has read the docs once, has no ability to ask questions, and gets one shot per call. That customer is the model. Every assumption you do not document, the model has to guess. Every overlap between two tools, the model has to disambiguate. Every error message that does not say what to do next, the model has to invent a recovery strategy from scratch. The same discipline that produces a usable REST API produces a usable tool surface for an agent.

The thing that makes tool design harder than REST design is that the consumer has no integration phase. The model does not write code against your tool, run it, see the error, fix the code, and try again. It calls the tool with whatever it inferred from the description, gets whatever it gets back, and either uses the result or tries again with another guess. The feedback loop is one shot, in the middle of a conversation, with the user watching. Tool descriptions and schemas have to be self-documenting in a way that human-consumed docs do not. There is no Stack Overflow for the model to fall back on.

The other thing that makes it harder is that tool surface compounds. Two tools is one "are these two the same thing?" decision. Twenty tools is a hundred and ninety pairs of that decision for the model to weigh on every call. The cost of an extra tool is not linear. It is roughly the cost of explaining how it differs from every other tool you already have. Most agents I have seen with twenty-plus tools were one redesign away from being agents with eight tools and a happier model.

Naming Is Most Of The Job

The single highest-leverage decision in tool design is the name of the tool. Names are what the model uses to retrieve the tool from its working memory when it is deciding what to call. A vague name puts the burden of disambiguation on the description. A precise name does most of the work for free.

The pattern that has worked for me is verb-object, written like a function signature in a codebase you would want to inherit. search_documentation, get_user_by_id, create_calendar_event, summarize_thread. Not docs, not user, not calendar, not summarize. Those bare names tell the model what domain the tool lives in, not what action it performs, and that matters because the model is choosing tools by action, not by domain.

Avoid names that overlap. If you have find_user and get_user, the model will pick one of them at random the first time it sees a request that fits both, and the choice will not be the one you wanted. If they do different things, name them differently enough that the difference is obvious from the name alone. search_users_by_name and get_user_by_id is a much better pair than find_user and get_user, because the names make the input shape part of the contract.
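
To make that concrete, here is roughly what that pair could look like as tool specs. This is a sketch in a generic JSON-Schema-style shape, not any particular framework's format, and every name and field in it is illustrative.

```python
# Illustrative specs -- the names make the input shape part of the contract.
search_users_by_name = {
    "name": "search_users_by_name",
    "description": "Searches users by display name. Returns up to ten candidate matches with name and user_id.",
    "parameters": {
        "type": "object",
        "properties": {"name": {"type": "string", "description": "Partial or full display name."}},
        "required": ["name"],
    },
}

get_user_by_id = {
    "name": "get_user_by_id",
    "description": "Fetches one user record by UUID. Use after search_users_by_name once you have the user_id.",
    "parameters": {
        "type": "object",
        "properties": {"user_id": {"type": "string", "format": "uuid", "description": "The user's UUID."}},
        "required": ["user_id"],
    },
}
```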

Avoid names that are too clever. The model is good with conventional names because it has seen millions of them in training. It is worse with names you invented for branding reasons. summon_compass is not a tool name. find_directions is. The tool description can carry the brand voice. The name should carry the function.

The last naming rule is the one I keep relearning: be willing to rename a tool when the agent starts using it for the wrong thing. If the model keeps reaching for search_docs when it should be reaching for lookup_pricing, the name search_docs is too broad, or the name lookup_pricing is too narrow, or the descriptions need work. The names are the first thing to fix. Renaming is cheap. Living with a confused agent is not.

Descriptions: Write Like You Are Onboarding A New Engineer

The description field on a tool schema is where most teams underspend their effort and pay for it in production. A one-line description is rarely enough. The model needs to know, in plain language, what the tool does, when to use it, when not to use it, and what to expect back. That is four things, not one, and squeezing them into a single sentence is the most common reason agents pick the wrong tool.

The structure that has worked is: lead with what the tool does, then say when to call it, then say when not to call it, then describe the shape of the response. The "when not to call it" line is the one that does the most work. It is the equivalent of disambiguating from neighboring tools. If the model knows that search_documentation is for finding article content and is not for looking up product pricing or user data, it will not reach for it when the user asks about pricing. Without that line, it might.

A worked example. The bad description: "Search the documentation." The good description: "Searches the product documentation for articles matching a keyword phrase. Use this when the user asks a how-to or conceptual question about the product. Do not use this for pricing lookups (use lookup_pricing) or for user account questions (use get_user_account). Returns up to five articles with title, snippet, and URL."

The bad version is three words. The good version is about fifty. The good version eliminates a class of bugs that the bad version invites. Descriptions are the cheapest debugging tool you have. Spend the words.

The other discipline is to write descriptions that match the schema. If a parameter is supposed to be a keyword phrase, the description should say so, and the parameter description should say so, and there should be an example. If the model is allowed to pass natural language, say that. If it is not, say it is not. The number of agents I have seen pass entire user questions into a parameter that wanted a SQL-safe identifier is more than I want to admit. The fix was always to rewrite the parameter description. The model had been doing what the description allowed.
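
Put together, the description and the parameter schema should tell the same story. Here is the search_documentation example as a sketch in a generic JSON-Schema-style definition; the envelope varies by framework, and the exact limits here are assumptions, but the shape is the point.

```python
search_documentation = {
    "name": "search_documentation",
    "description": (
        "Searches the product documentation for articles matching a keyword phrase. "
        "Use this when the user asks a how-to or conceptual question about the product. "
        "Do not use this for pricing lookups (use lookup_pricing) or for user account "
        "questions (use get_user_account). Returns up to five articles with title, "
        "snippet, and URL."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                # The parameter description repeats the constraint the tool description implies.
                "description": (
                    "Keyword phrase to search for, not a full sentence. "
                    "Two to ten words, lowercase. Example: 'deploy webhook'."
                ),
                "minLength": 3,
            }
        },
        "required": ["query"],
    },
}
```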

Parameter Schemas Are A Contract, Not A Hint

JSON Schema is the contract between the agent and the tool. If the schema is loose, the model will exploit the looseness. If the schema is tight, the model will work harder to produce valid input. The framing that made tool calls reliable for me was treating the schema as the place where I forbid wrong inputs, not the place where I describe what right inputs look like.

Use enums when the parameter has a small set of valid values. Do not let the model invent a status string when there are only four valid statuses. Put them in an enum. The model gets to pick from a list, the list constrains the output, and the runtime validator rejects anything else. The cost is one line of schema. The benefit is that the agent stops inventing statuses.
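
A sketch of what that one line buys you, with four illustrative statuses:

```python
# Property fragment from a parameters schema -- the statuses are illustrative.
status_property = {
    "status": {
        "type": "string",
        "enum": ["draft", "pending", "approved", "rejected"],  # the model picks from this list, nothing else
        "description": "Lifecycle status to filter by. Must be one of the four listed values.",
    }
}
```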

Use string formats when they exist. ISO 8601 dates, email addresses, UUIDs, URLs. Format hints are part of the contract the model is trained against. The model knows what a date in ISO 8601 looks like. Tell it that is what you want, and it will produce one. Leave the format ambiguous, and you will get "tomorrow" passed in as a date string.

Use min and max constraints. If a search query has to be at least three characters, say so. If a list parameter has a max length, say so. The model will respect the constraints if you state them. It will violate them if you do not, because the description said "search query" and the model interpreted that as "any string."

Use required vs optional deliberately. Every required parameter is one more thing the model has to figure out before it can call the tool. Every optional parameter is one more way the call can go subtly wrong. Default optional to off and required to on. Add optional parameters only when they meaningfully change the behavior. Do not add optional parameters as a way to expose every flag your function supports.

Use descriptions on every parameter. The schema description for the tool talks about the tool. The descriptions on each parameter talk about that parameter. The model reads both. A parameter description that says "the search query" is doing nothing the type does not already do. A parameter description that says "the search query as a keyword phrase, not a full sentence, two to ten words, lowercase" is doing real work.
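
Pulling those disciplines into one place, the parameters block for a hypothetical list_calendar_events tool might look like this. The property names and limits are assumptions, not a spec to copy.

```python
list_calendar_events_params = {
    "type": "object",
    "properties": {
        "start_date": {
            "type": "string",
            "format": "date",  # ISO 8601, e.g. "2026-03-01"
            "description": "First day of the range to list, inclusive, as an ISO 8601 date.",
        },
        "end_date": {
            "type": "string",
            "format": "date",
            "description": "Last day of the range to list, inclusive, as an ISO 8601 date.",
        },
        "attendee_emails": {
            "type": "array",
            "items": {"type": "string", "format": "email"},
            "maxItems": 10,
            "description": "Optional. Only return events that include all of these attendees.",
        },
        "max_results": {
            "type": "integer",
            "minimum": 1,
            "maximum": 50,
            "description": "Optional. Cap on the number of events returned. Defaults to 20.",
        },
    },
    "required": ["start_date", "end_date"],  # optional parameters stay optional deliberately
}
```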

The same rigor that goes into the structured outputs developer guide belongs in tool schemas. Structured outputs and tool schemas are the same problem dressed differently: how do you make the model produce something machine-readable that your code can rely on? Tight schemas are the answer in both cases.

Error Returns Are Where The Agent Recovers Or Spins

The hardest tool design problem is what the tool returns when it fails. A bad error message turns a recoverable mistake into a stuck agent. A good error message tells the agent exactly what to fix and try again with.

The pattern is to return errors as structured objects, not as string blobs. An error with a code field, a message field, and ideally a hint field is something the model can pattern-match on. An error that just says "invalid input" is something the model has to interpret. The interpretation is sometimes correct and sometimes a guess.

The hint field is the one that punches above its weight. When the tool rejects a call, say what the agent should do differently. "The user_id parameter must be a UUID. Try calling search_users_by_name first to get the UUID, then call this tool with that value." That is a hint that turns a stuck agent into a working one. Without it, the agent retries with another guess, then another, until it gives up or hits the iteration limit.
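
As a sketch, here is the difference that hint makes. The field names are assumptions; the structure is the point.

```python
# What the tool might return when it rejects a call. Field names are illustrative.
error_response = {
    "error": {
        "code": "INVALID_USER_ID",
        "message": "The user_id parameter must be a UUID.",
        "hint": (
            "Call search_users_by_name first to get the UUID, "
            "then call this tool again with that value."
        ),
    }
}
```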

Avoid errors that look like success. A tool that returns an empty array on a misspelled query is silently failing. The agent gets back zero results, assumes the query was correct and the answer is "nothing matched," and reports that to the user. The fix is to return an explicit "no results" object with a hint that the query may be wrong, not an empty array that looks the same as a successful empty result.
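
And the empty-result case, sketched the same way:

```python
# Looks like success, tells the agent nothing.
silent_failure = {"results": []}

# Explicit "no results" with a hint the agent can act on.
explicit_no_results = {
    "results": [],
    "status": "no_results",
    "hint": "No articles matched 'webook deploy'. Check the spelling of the keywords and retry with a shorter phrase.",
}
```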

Avoid errors that are hostile. "An error occurred." "Something went wrong." These are the worst possible response for an agent. The agent has nothing to act on. It will retry with the same input, fail again, and either give up or hallucinate an answer. Every error message is a chance to recover the run. Spend the words.

The same observability shape I covered in agent observability and debugging needs to extend to tool errors specifically. The tool error rate, segmented by tool and error code, is one of the most useful health signals an agent has. A spike in a specific error code on a specific tool is a fix you can ship. A spike in generic errors is a debugging session that will eat your week.

Tool Granularity: The Goldilocks Problem

How much should one tool do? The answer I landed on after breaking agents at both extremes is "neither too much nor too little," which is unhelpful on its own, so let me be more specific. The right granularity is the unit of work a competent human would think of as one step.

Too coarse: a single manage_calendar tool that takes an action parameter and dispatches to create, update, delete, or query depending on the value. The model has to decide which action it wants, then encode that into the parameter, then construct the right body for that action. The error surface is huge because the schema has to permit all possible action shapes. The descriptions have to cover four functions in one. The agent gets confused. Split it.

Too fine: separate tools for set_event_title, set_event_start, set_event_end, set_event_attendees, where each one mutates one field of a calendar event. The agent has to chain six calls to do what a human thinks of as "create the event." The token cost goes up. The latency goes up. The chance of one of the six calls failing goes up. Combine them.

The right grain is create_calendar_event, update_calendar_event, delete_calendar_event, list_calendar_events. Each is one verb-object pair. Each is one unit of work. Each has a focused schema. The agent picks one, calls it, and moves on. Four tools instead of one tool with four hidden modes, or twenty tools that are all the same thing.
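
Side by side, as a sketch with illustrative fields, the difference in schema complexity is hard to miss:

```python
# Too coarse: one tool, four hidden modes. The schema has to permit every action's shape.
manage_calendar = {
    "name": "manage_calendar",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["create", "update", "delete", "query"]},
            "event_id": {"type": "string", "description": "Required for update and delete, forbidden for create."},
            "title": {"type": "string", "description": "Required for create, optional for update, ignored otherwise."},
            # ...and so on for every field of every action
        },
        "required": ["action"],
    },
}

# The right grain: one verb-object pair, one focused schema.
create_calendar_event = {
    "name": "create_calendar_event",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Event title."},
            "start": {"type": "string", "format": "date-time", "description": "Start time, ISO 8601."},
            "end": {"type": "string", "format": "date-time", "description": "End time, ISO 8601."},
            "attendee_emails": {"type": "array", "items": {"type": "string", "format": "email"}},
        },
        "required": ["title", "start", "end"],
    },
}
```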

The exception is when the underlying API is genuinely composite and exposing the composite as one tool would force ugly schemas. In those cases, splitting is right. The test is whether the descriptions and schemas of the split tools are simpler than the description and schema of the combined tool. If they are, split. If they are not, combine.

Authentication, Idempotency, And The Boring Things That Save You

Tools that mutate state need to be safe to retry. The model will retry tools when it thinks the previous call did not work, and the previous call may have actually worked. If your create_invoice tool is not idempotent, the agent will create duplicate invoices when the network blips, and the user will be unhappy.

The pattern is to require an idempotency key on any state-mutating tool. The model can generate one and pass it. The tool stores the result of the first call against that key. The second call returns the same result without doing the work again. This is straight out of the payments world and it works just as well for agents. The same principle applies to anything that bills, sends notifications, or moves money.
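
A minimal sketch of the pattern, with an in-memory dict standing in for whatever your tool actually persists to, and every name in it illustrative:

```python
# In-memory stand-in for a persistent idempotency store.
_idempotency_store: dict[str, dict] = {}

def create_invoice(customer_id: str, amount_cents: int, idempotency_key: str) -> dict:
    """Create an invoice exactly once per idempotency key, even if the agent retries."""
    if idempotency_key in _idempotency_store:
        # The first call already did the work; return the same result without repeating it.
        return _idempotency_store[idempotency_key]

    invoice = {
        "invoice_id": f"inv_{idempotency_key[:8]}",
        "customer_id": customer_id,
        "amount_cents": amount_cents,
        "status": "created",
    }
    _idempotency_store[idempotency_key] = invoice
    return invoice
```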

Authentication should be invisible to the model. Do not expose API keys as parameters. Do not require the agent to construct authorization headers. The host application holds the credentials, attaches them at call time, and the model never sees them. Every tool spec I have ever seen that exposed authentication to the model leaked credentials into traces, logs, or model outputs. Treat auth as plumbing, not as part of the contract.

Permissions should be checked in the tool, not in the prompt. The prompt cannot enforce that the user is allowed to call the tool. The tool can. Pass the user identity into the tool, check the permission server-side, and reject the call if the user is not authorized. This is the same security model that I covered in securing AI agents in production, and it applies to every tool that touches user data.
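
A sketch of the server-side check, assuming the host resolves the user identity and passes it into the handler. The permission service and data access here are stand-ins, not real APIs.

```python
def is_authorized(user_id: str, permission: str, resource_id: str) -> bool:
    """Stand-in for a real permission-service lookup."""
    return False  # deny by default in this sketch

def fetch_account(user_id: str) -> dict:
    """Stand-in for the real data access."""
    return {"user_id": user_id, "plan": "pro"}

def get_user_account(requesting_user_id: str, target_user_id: str) -> dict:
    """Tool handler: the permission check lives server-side, not in the prompt."""
    if not is_authorized(requesting_user_id, "accounts:read", target_user_id):
        return {
            "error": {
                "code": "FORBIDDEN",
                "message": "The requesting user is not allowed to view this account.",
                "hint": "Do not retry. Tell the user you cannot access that account.",
            }
        }
    return fetch_account(target_user_id)
```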

Documentation By Example

The single most effective addition to a tool spec is a worked example. One example, in the description, showing what a typical call looks like and what the response looks like. The model will pattern-match on the example more strongly than on the prose description. If the example is a keyword search, the model will produce keyword searches. If the example is a sentence, the model will produce sentences.

Examples in the description belong on the tool itself, not in the prompt. The prompt is shared across the conversation. The tool description is loaded with the tool. Putting examples on the tool means every agent that uses the tool sees the examples, including agents you have not built yet. It is the cheapest way to ship a tool that other people on your team will use correctly.

The format that has worked is: a one-line input example, a one-line output example, and a one-line gotcha if there is one. "Example: search_documentation with query='deploy webhook' returns up to five articles. Note: queries longer than ten words tend to underperform; prefer keyword phrases."

That is three lines. Those three lines have prevented more bugs than any other piece of documentation I have written for an agent.
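
Wired into a spec, that looks something like this. The wording is the example above; the variable is just a sketch of where it lives.

```python
# The example lines live on the tool itself, so every agent that loads the tool sees them.
search_documentation_description = (
    "Searches the product documentation for articles matching a keyword phrase. "
    "Use this for how-to and conceptual questions; not for pricing or account lookups. "
    "Example: query='deploy webhook' returns up to five articles with title, snippet, and URL. "
    "Note: queries longer than ten words tend to underperform; prefer keyword phrases."
)
```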

Versioning And Change Management

Tool surfaces change. New parameters get added. Old ones get deprecated. The agent does not know that the v2 of your tool no longer accepts the legacy_id parameter, because the agent was prompted with the v1 schema and you redeployed with v2 yesterday.

The discipline is to version tool specs the way you version APIs. Major changes get a new tool name, not a silent breaking change. Minor changes that add optional parameters or relax constraints can go in place. Removing a parameter or changing its meaning is a major change. The agent's prompt cache should be invalidated when major changes ship, because the agent's instinct is going to be tuned to the old shape.

The other piece is to monitor tool call rates per tool. A tool you deprecated should be at zero. A tool you launched should be ramping up. A tool whose call rate dropped to zero overnight is a tool the agent stopped reaching for, which usually means a description change made it look like a worse fit for the requests it was handling. The metric that catches this is per-tool call volume over time. The metric is boring. The bugs it catches are not.
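
The aggregation itself is a few lines. A sketch, assuming your traces are a list of event dicts with a type, a tool_name, and a datetime timestamp:

```python
from collections import Counter
from datetime import date

def tool_call_volume(trace_events: list[dict], day: date) -> Counter:
    """Count tool calls per tool name for one day of trace events."""
    return Counter(
        event["tool_name"]
        for event in trace_events
        if event["type"] == "tool_call" and event["timestamp"].date() == day
    )

# A deprecated tool that is not at zero, or a live tool that dropped to zero overnight,
# is the signal to go reread the descriptions you changed yesterday.
```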

What This Looks Like When It Works

A well-designed tool surface for an agent has the shape of a small, focused API. Eight to fifteen tools is a comfortable range for a domain-specific agent. Each tool has a verb-object name. Each tool has a description that says what it does, when to use it, when not to use it, and what it returns. Each parameter is typed, constrained, and described. Each error is structured, coded, and hinted. The mutating tools are idempotent. The auth and permissions are server-side. There is at least one example per tool.

The agent built on top of that surface picks the right tool the first time on the requests you have anticipated. It picks a reasonable tool on the requests you have not. When it picks wrong, the error tells it what to do next, and it recovers. The traces show short tool-call chains because each call does its job. The token cost is low because the schemas are tight. The latency is low because the calls are not retried.

That is the agent you can ship. That is the agent that does not embarrass you in a customer call. The model is the same model everyone else is calling. The prompt is the same prompt everyone else is writing. The tool surface is the part you control, and it is the part that makes the agent yours.

Where This Is Going

The frontier models are getting better at handling messy tool specs. They are also getting better at composing tools that should never have been split. Both of those mean the cost of bad tool design is dropping, but it is not zero, and the gap between an agent built on tight tools and an agent built on loose tools is still wider than the gap between any two recent model versions. Investing in tool design pays off across model upgrades. Investing in prompt tricks does not always.

The other shift is that tool specs are starting to be shared across agents the way SDKs are shared across applications. The MCP protocol I covered in the MCP developer guide is one expression of this. A tool you design well can be reused. A tool you design badly is a liability that ships with every agent that imports it. The half-life of a tool spec is now longer than the half-life of a model version, and that is a good reason to spend more time on the spec.

The thing that is not changing is that the model can only work with what you hand it. The whole job of tool design is making sure what you hand it is something a competent agent can use. The discipline is the discipline of any good API. The reward is an agent that works in production, on the first try, on requests you did not write the prompt for. That, more than any model upgrade, is the difference between an agent demo and an agent product.

If your agent is failing in ways that look like the model is the problem, look at the tools first. The model is almost never the limiting factor. The tools almost always are.
