Here's the number that should worry you more than it does: an agent that calls the right tool with the right arguments 95% of the time completes an eight-step task correctly only about 66% of the time (0.95^8 ≈ 0.66). Reliability doesn't fail in one dramatic crash. It leaks. Every step is a coin that lands heads 19 times out of 20, and you're flipping it eight times in a row.
The good news is that most of that leak isn't the model being dumb. It traces to two things you control completely: the JSON schema you hand the model, and whether you let it guess when it shouldn't. Fix those two and the per-call rate climbs — and because it compounds, small gains pay off hugely.
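To see how quickly this compounds, here is a quick back-of-the-envelope calculation. It assumes every step succeeds or fails independently at the same per-call rate, which is a simplification, and the function name is only illustrative.

```python
def task_success_rate(per_call_rate: float, steps: int) -> float:
    """Probability that every one of `steps` independent calls succeeds."""
    return per_call_rate ** steps


if __name__ == "__main__":
    # Compare a few per-call rates over the same eight-step task.
    for rate in (0.95, 0.98, 0.99):
        overall = task_success_rate(rate, 8)
        print(f"{rate:.0%} per call -> {overall:.1%} over 8 steps")
```

Under that assumption, moving from 95% to 98% per call lifts the eight-step rate from roughly 66% to roughly 85%.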
Your schema is a prompt, not documentation
This is the reframe that fixes everything downstream. When you define a tool, the description fields aren't docs for your teammates. They are the only instructions the model gets about when and how to use that tool. The model never sees your implementation. It sees the schema. That's it.
So a schema like this is not "good enough":
{
  "name": "send_email",
  "description": "Sends an email",
  "parameters": {
    "type": "object",
    "properties": {
      "to": { "type": "string" },
      "body": { "type": "string" }
    }
  }
}
Read it the way the model does. When should it send an email versus draft one? Is to an address or a contact name? Can body be HTML? Is anything required? You know the answers. The model is guessing — and guessing is exactly where the 5% comes from.
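For contrast, here is one way the same tool could be tightened up, written as a Python dict so it can be checked and serialized. The answers baked in (a full email address, a plain-text body, a separate draft_email tool) are assumptions invented for illustration, not facts about any real system; substitute whatever is true of yours.

```python
import json

# Hypothetical answers to the questions above; adjust to your real tool.
SEND_EMAIL_TOOL = {
    "name": "send_email",
    "description": (
        "Sends an email immediately to exactly one recipient. "
        "Use only when the user has explicitly asked to send. "
        "Do NOT use this to save or preview a message; "
        "use draft_email (a hypothetical sibling tool) for that."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "to": {
                "type": "string",
                "description": (
                    "Recipient's full email address, not a contact name. "
                    "If only a name is given, ask the user for the address."
                ),
            },
            "body": {
                "type": "string",
                "description": "Plain-text message body. HTML is not supported.",
            },
        },
        # Every name listed here also appears in properties.
        "required": ["to", "body"],
        "additionalProperties": False,
    },
}

if __name__ == "__main__":
    # Emit the JSON you would actually hand to the model.
    print(json.dumps(SEND_EMAIL_TOOL, indent=2))
```

Each question the bare schema left open now has an answer in a place the model can read.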
The four schema bugs that cause wrong calls
After staring at a lot of broken tool definitions, the same four keep showing up:
1. Vague or missing descriptions. "Sends an email," "Gets data," "Handles the request." When two tools have thin descriptions, the model can't tell them apart, so it picks the wrong one. The fix is to write the description like you're explaining the tool to a new hire who will be fired for using it at the wrong time: when to call it, when not to, and what each argument means.
2. Untyped or loosely typed params. A string where you meant an ISO date. A string where you meant one of four statuses. If the type doesn't constrain the value, the model invents a plausible-looking one ("next Tuesday", "done-ish") and your executor chokes. Use enum for fixed sets. Use format and explicit types. Every constraint you encode is one the model is far less likely to violate, and one your own validator can check mechanically.
3. The silent killer: required naming a property that doesn't exist. This one is brutal because nothing yells at you. Your required array lists "recipient", but the property in properties is called to. The schema is still valid JSON. The model now thinks a field is mandatory that it has no slot to fill — so every single call comes out malformed, and you spend an afternoon blaming the model. Always check that every name in required actually exists in properties.
4. Free-text where you meant a choice. "priority": { "type": "string" } invites "high", "High", "urgent", "P0", and "pretty important tbh". Make it "enum": ["low", "medium", "high"] and the ambiguity is gone before the model can create it.
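Three of these four can be checked mechanically. The sketch below is a rough illustration of such a check, assuming tool definitions shaped like the send_email example (a "parameters" object with "properties" and "required"); the word-count threshold is arbitrary, and bug 4 is left out because spotting free text that should be an enum takes judgment or heuristics.

```python
from typing import Any

VAGUE_MIN_WORDS = 5  # arbitrary threshold chosen for this sketch

# The thin schema from earlier, plus a required/properties mismatch.
BROKEN_TOOL: dict[str, Any] = {
    "name": "send_email",
    "description": "Sends an email",
    "parameters": {
        "type": "object",
        "properties": {"to": {"type": "string"}, "body": {"type": "string"}},
        "required": ["recipient"],
    },
}


def lint_tool(tool: dict[str, Any]) -> list[str]:
    """Return human-readable warnings for one tool definition."""
    warnings: list[str] = []
    name = tool.get("name", "<unnamed>")
    params = tool.get("parameters", {})
    properties = params.get("properties", {})

    # Bug 1: vague or missing tool description.
    if len(tool.get("description", "").split()) < VAGUE_MIN_WORDS:
        warnings.append(f"{name}: description is missing or very short")

    for prop, spec in properties.items():
        # Bug 2: parameter with neither a type nor an enum.
        if "type" not in spec and "enum" not in spec:
            warnings.append(f"{name}.{prop}: no type declared")
        # Bug 1 again, at the parameter level.
        if not spec.get("description"):
            warnings.append(f"{name}.{prop}: no description")

    # Bug 3: required names a property that does not exist.
    for req in params.get("required", []):
        if req not in properties:
            warnings.append(f"{name}: required field '{req}' is not in properties")

    return warnings


if __name__ == "__main__":
    for warning in lint_tool(BROKEN_TOOL):
        print(warning)
```

Running a check like this in CI turns the silent required mismatch into a failure you see before deploying.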
The other half: stop letting it guess
The single most common production failure isn't a malformed call — it's the model confidently filling in a blank it should have asked about. User says "schedule a meeting with Sarah next week." Which Sarah? Which timezone? Which 30-minute slot on which day? A model optimizing to be helpful will pick one. Sometimes it's right. Sometimes it books a 7 a.m. call with the wrong Sarah.
The rule I'd tattoo on a junior agent: if a missing field affects money, publishing, deletion, or customer communication, ask — don't guess. A clarifying question costs one turn. A wrong write operation costs a refund, a deleted record, or an apology email. Don't optimize for fewer turns at the price of wrong actions.
You can encode a lot of this in the schema itself: don't mark fields required that the model can't reasonably infer, and say so in the description — "If the user has not specified a timezone, ask; do not assume." The schema is where you set the defaults for the model's judgment.
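The ask-don't-guess rule can also be backed up in code, so a call with a blank never reaches a risky tool. This is a minimal sketch that assumes the schema lets the model omit these fields; the tool names, field names, and the registry itself are hypothetical.

```python
from typing import Any, Optional

# Hypothetical registry: tools that touch money, publishing, deletion, or
# customer communication, mapped to fields the user must have supplied.
MUST_CONFIRM: dict[str, list[str]] = {
    "schedule_meeting": ["attendee_email", "timezone", "start_time"],
    "issue_refund": ["order_id", "amount"],
}


def clarifying_question(tool_name: str, args: dict[str, Any]) -> Optional[str]:
    """Return a question to ask the user, or None if the call may proceed."""
    missing = [
        field
        for field in MUST_CONFIRM.get(tool_name, [])
        if args.get(field) in (None, "")  # absent, null, or empty string
    ]
    if not missing:
        return None
    return f"Before I call {tool_name}, I need: {', '.join(missing)}."


if __name__ == "__main__":
    # The model left the timezone blank instead of guessing one.
    print(clarifying_question(
        "schedule_meeting",
        {"attendee_email": "sarah@example.com", "start_time": "2025-01-06T10:00"},
    ))
```

The guard only helps if the model leaves unknown fields empty instead of inventing values, which is why the instruction in the description still matters.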
Treat model output as untrusted input
Even when the provider guarantees well-formed JSON, well-formed is not the same as correct. Structured-output modes stop the model from emitting broken JSON; they do nothing to stop it from passing a valid-looking but wrong argument. So validate on your side, every time, before you execute: check the values against your real constraints (does this user ID exist? is this amount within range?), and on failure, return a clear error the model can read and recover from rather than crashing the run. Model output is input. You wouldn't trust raw input from a form field. Don't trust this one either.
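As a rough sketch of that validate-then-execute step, here is a hypothetical refund tool. The user IDs, the refund limit, and the function names are stand-ins invented for illustration; a real version would query your own data store and business rules.

```python
from typing import Any

# Hypothetical stand-ins for real lookups and business limits.
KNOWN_USER_IDS = {"u_1001", "u_1002"}
MAX_REFUND_CENTS = 50_000


def validate_refund_args(args: dict[str, Any]) -> list[str]:
    """Check model-supplied arguments against real constraints."""
    errors: list[str] = []
    user_id = args.get("user_id")
    amount = args.get("amount_cents")

    if not isinstance(user_id, str) or user_id not in KNOWN_USER_IDS:
        errors.append(f"user_id {user_id!r} does not exist")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool):
        errors.append("amount_cents must be an integer")
    elif not 0 < amount <= MAX_REFUND_CENTS:
        errors.append(f"amount_cents must be between 1 and {MAX_REFUND_CENTS}")
    return errors


def run_tool_call(args: dict[str, Any]) -> dict[str, Any]:
    """Validate first; return a readable error instead of raising."""
    errors = validate_refund_args(args)
    if errors:
        # This payload goes back to the model so it can correct itself.
        return {"ok": False, "error": "; ".join(errors)}
    # The real side effect would happen here.
    return {"ok": True, "result": f"refunded {args['amount_cents']} cents"}


if __name__ == "__main__":
    print(run_tool_call({"user_id": "u_9999", "amount_cents": 10**9}))
```

Returning the error as data rather than raising keeps the run alive and gives the model something specific to fix on its next attempt.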
How to actually catch this
Reading your own schemas for these bugs is hard — the required-references-a-missing-property one in particular is invisible until it's breaking every call in prod. So I wrote a tiny zero-dependency linter for exactly this: tool-schema-lint (npx tool-schema-lint your-tools.json). It flags vague descriptions, untyped params, free-text-where-you-meant-enum, and the silent required/properties mismatch — for both Anthropic and OpenAI tool formats. It's free and MIT-licensed; point it at your tool definitions and see what falls out.
If you want the bigger picture — the tool-patterns that keep multi-step agents on the rails, plus a runnable eval rubric for scoring "did it call the right tool with the right args in the right number of steps" — that's the Agent Builder's Toolkit. And if you're earlier on the curve, the free field guide covers seven reliability rules and three paste-able guardrails, no email required.
The one-paragraph version
Tool-calling reliability compounds: 95% per call is ~66% over eight steps, so small per-call gains matter enormously. Most misses come from two controllable things. First, the schema — it's the only instruction the model gets, so write real descriptions, type and enum your params, and make sure every name in required actually exists in properties (that last bug silently breaks every call). Second, guessing — if a missing field touches money, publishing, deletion, or customer communication, make the agent ask instead of inventing a value. Then validate the model's output as untrusted input before you execute. Schema plus judgment, not a smarter model, is where the reliability lives.
What's the worst wrong-tool call you've shipped? Reply and tell me — I collect these.
Top comments (1)
The compounding failure math is the thing most teams don't internalize until they measure it end to end. One thing that helped us in a LangGraph pipeline was adding an explicit "when NOT to use this tool" block to each schema description alongside the normal usage instructions; the model's mis-routing rate dropped more from knowing what to avoid than from better descriptions of what the tool does.