Tool Calling Is Where Agents Break: A Reliability Guide

#ai #agents #backend #api

If you instrument a production agent and watch where it actually goes wrong, most of the wreckage is at the tool boundary, not in the model's reasoning. The model decided correctly; the call did the wrong thing, ran twice, or quietly half-succeeded. Tool calling is the primary failure surface of agentic systems, and it's also the most fixable, because the fixes are ordinary backend engineering.

I build AI agents for a living, and this is the layer I tell new engineers to obsess over before they touch a prompt.

The four ways tool calls actually fail

Strip away the hype and production tool failures cluster into four shapes:

Wrong argument types that slip past validation, so the tool runs on garbage.
Timeouts that aren't retried, so a transient blip looks like a permanent failure.
Partial successes the agent reads as full completion (the email queued but didn't send; the record was created but the link step failed).
Non-idempotent writes that execute twice because the agent retried, so you get duplicate tickets, duplicate charges, duplicate everything.

Three of those four are not model problems. They're contract problems between the agent and your tools. Fix the contract and the agent gets dramatically more reliable without changing the model at all.
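One way to make partial success part of the contract is to have every multi-step tool report each step explicitly, so the agent never has to infer completion from a bare "200 OK". The sketch below is illustrative only: `StepResult`, `ToolResult`, and the step names are hypothetical, not taken from any framework.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(str, Enum):
    OK = "ok"
    PARTIAL = "partial"
    FAILED = "failed"


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str = ""


@dataclass
class ToolResult:
    steps: list[StepResult] = field(default_factory=list)

    @property
    def status(self) -> Status:
        succeeded = sum(step.ok for step in self.steps)
        if self.steps and succeeded == len(self.steps):
            return Status.OK
        if succeeded == 0:
            return Status.FAILED  # also covers a result with no steps at all
        return Status.PARTIAL

    def to_model_payload(self) -> dict:
        # This dict is what gets serialized back to the model.
        return {
            "status": self.status.value,
            "steps": [
                {"name": s.name, "ok": s.ok, "detail": s.detail}
                for s in self.steps
            ],
        }


# The "record created but link step failed" case from the list above.
result = ToolResult(steps=[
    StepResult("create_record", ok=True),
    StepResult("link_record", ok=False, detail="parent record not found"),
])
payload = result.to_model_payload()
```

Here `payload["status"]` is `"partial"`, and the model can see which step to repair instead of assuming the whole call succeeded.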

Idempotency is non-negotiable for writes

This is the hill I'll die on. Every tool that performs a write (creates a record, sends a message, starts a transaction) must accept an idempotency key and enforce idempotent behavior at the tool layer.

The reason is simple: agents retry. Models stutter, networks blip, a timeout fires and the orchestrator tries again. Without an idempotency guard, the second create_ticket call produces a second ticket. With one, the tool sees a key it's already processed and returns the cached result from the first run instead of re-executing.

def create_ticket(payload, idempotency_key):
    # `store` (a key-value cache with TTL support) and `db` are assumed to exist.
    cached = store.get(idempotency_key)   # check before doing anything
    if cached is not None:                # a falsy cached result still counts
        return cached                     # same key in window -> same result
    result = db.tickets.create(payload)
    store.put(idempotency_key, result, ttl=86400)  # 24h window, in seconds
    return result

The key comes from the caller, the window is configurable (24 hours is a sane default), and a duplicate key returns the first execution's result. This single pattern eliminates an entire category of "the agent charged the customer twice" incidents. If a tool writes, it gets an idempotency key. No exceptions.
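A check-then-write guard can still double-execute if two retries arrive at the same moment, since both may miss the cache before either stores a result. The self-contained sketch below closes that gap with a lock around the whole operation. Everything here is an assumption for illustration: `IdempotencyStore` is a hypothetical in-memory stand-in for Redis or a database table with a unique constraint, and the global lock is far cruder than anything you would ship.

```python
import threading
import time
from typing import Any, Callable


class IdempotencyStore:
    """In-memory dedup store; a stand-in for Redis or a database table."""

    def __init__(
        self,
        ttl_seconds: float = 86400.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        # key -> (expires_at, result)
        self._entries: dict[str, tuple[float, Any]] = {}

    def run_once(self, key: str, action: Callable[[], Any]) -> Any:
        # Holding the lock across the action serializes every write. That is
        # crude, but it removes the race between two concurrent retries.
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > self._clock():
                return entry[1]  # duplicate key inside the window
            result = action()  # if this raises, nothing is recorded
            self._entries[key] = (self._clock() + self._ttl, result)
            return result


store = IdempotencyStore()
created = []  # stands in for the tickets table


def create_ticket(payload: dict, idempotency_key: str) -> dict:
    def write() -> dict:
        created.append(payload)
        return {"ticket_id": len(created), **payload}

    return store.run_once(idempotency_key, write)


first = create_ticket({"title": "Login broken"}, "run-42:create_ticket")
retry = create_ticket({"title": "Login broken"}, "run-42:create_ticket")
```

After both calls, `created` holds one ticket and `retry` equals `first`. Because a failed action records nothing, a retry after an exception re-executes the write, which is usually what you want for a call that never completed.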

Make error messages readable by the model

When a tool call fails, what you send back to the model matters as much as the failure itself. A raw stack trace is noise. A structured error with field-level descriptions and an example of the correct shape is something the model can act on.

Compare two responses to a bad argument: a raw traceback, and a structured error that names the field, states the problem, and gives an example of correct input. The first makes the model guess; the second lets it self-correct on the next turn. Treat your tool errors as a prompt you're writing for the model, because that's exactly what they are.
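To make the contrast concrete, here is a rough sketch of both kinds of response for a hypothetical `create_ticket` tool. The field names (`title`, `priority`), the 1 to 4 priority range, and the error shape are assumptions made up for this example, not a standard.

```python
from typing import Optional

# The unhelpful response: the model has to guess which argument was wrong.
unhelpful = "ValueError: invalid literal for int() with base 10: 'high'"


def validate_create_ticket(args: dict) -> Optional[dict]:
    """Return a structured error for bad arguments, or None if they are valid."""
    problems = []

    priority = args.get("priority")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        isinstance(priority, bool)
        or not isinstance(priority, int)
        or not 1 <= priority <= 4
    ):
        problems.append({
            "field": "priority",
            "problem": f"expected an integer from 1 to 4, got {priority!r}",
            "example": 2,
        })

    title = args.get("title")
    if not isinstance(title, str) or not title.strip():
        problems.append({
            "field": "title",
            "problem": "expected a non-empty string",
            "example": "Login page returns 500",
        })

    if not problems:
        return None
    return {"error": "invalid_arguments", "fields": problems}


helpful = validate_create_ticket({"title": "Login broken", "priority": "high"})
```

`helpful` names the `priority` field, says what was wrong with `'high'`, and shows a valid value, which is enough for the model to fix the call without another round of guessing.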

Retry like you mean it

Naive retry makes things worse. Immediate retries with no backoff turn a rate-limited API into a thundering herd, which gets you rate-limited harder. The rules that actually work:

Back off with jitter on transient errors (timeouts, 429s, 503s). Never retry in a tight loop.
Don't retry non-idempotent writes unless the tool enforces idempotency. If it doesn't, a retry is a duplicate.
Cap retries and surface the failure to the orchestrator rather than looping forever. An agent stuck retrying is an agent burning tokens and going nowhere.
Distinguish retryable from terminal. A 400 won't fix itself on retry; a 503 might.
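The four rules above fit in one small wrapper. This is a minimal sketch under stated assumptions: `ToolError` and `call_with_retry` are hypothetical names, the tool is assumed to signal HTTP-style status codes, and the delay values are placeholders rather than recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = {429, 503}


class ToolError(Exception):
    def __init__(self, status: int, message: str = ""):
        super().__init__(message or f"tool call failed with status {status}")
        self.status = status


def call_with_retry(
    call: Callable[[], T],
    *,
    idempotent: bool,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (ToolError, TimeoutError) as exc:
            transient = (
                isinstance(exc, TimeoutError)
                or exc.status in RETRYABLE_STATUS
            )
            # Re-raise terminal errors, unguarded writes, and the final
            # attempt so the orchestrator sees the failure.
            if not (transient and idempotent) or attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to an exponential cap.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise RuntimeError("unreachable")


# Usage: a tool that fails twice with 503, then succeeds.
attempts = {"count": 0}
delays = []


def flaky_lookup() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ToolError(503)
    return "ok"


outcome = call_with_retry(flaky_lookup, idempotent=True, sleep=delays.append)
```

Passing `idempotent=False` for a write that has no idempotency guard makes the wrapper raise on the first failure instead of risking a duplicate. The `sleep` parameter exists only so the example can record delays instead of actually waiting.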

MCP standardizes the wire, not the hard parts

The Model Context Protocol is genuinely useful. It standardizes how tools are described and called, which kills a lot of glue code and makes tools portable across agents. But be clear about what it doesn't do: MCP doesn't handle OAuth, rate-limit strategy, idempotency, partial-failure semantics, or compliance logging. Those still live in your tool implementation. MCP gets the request to your tool cleanly; everything that makes the tool reliable is still on you.

This maps directly to a lesson we learned running multi-step agent workflows: the failures cluster at the handoffs and the I/O, not in the agents themselves. We wrote that up in "I ran a company on AI-agent departments, here's what actually broke."

Key takeaways

Tool calling, not reasoning, is where most production agents fail.
Every write tool needs a caller-supplied idempotency key and a dedup window. This kills double-execution bugs.
Return structured, example-bearing errors so the model can self-correct in one turn.
Retry with backoff and jitter, never retry unguarded writes, and cap the loop.
MCP standardizes the interface; OAuth, rate limits, and idempotency are still your job.

FAQ

Where should idempotency keys come from?
The caller (your orchestrator) generates them, typically derived from the logical action so a retry of the same action reuses the same key. The tool enforces dedup against that key.
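One possible derivation, sketched under assumptions: the orchestrator tracks a run ID and a step index for each logical action, and hashes them together with the tool name and arguments. The function name and its parameters are hypothetical.

```python
import hashlib
import json


def idempotency_key(run_id: str, step: int, tool: str, args: dict) -> str:
    # Canonical JSON so the order of keys in args does not change the key.
    canonical = json.dumps(
        {"run": run_id, "step": step, "tool": tool, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_a = idempotency_key(
    "run-42", 3, "create_ticket", {"title": "Login broken", "priority": 2}
)
key_b = idempotency_key(
    "run-42", 3, "create_ticket", {"priority": 2, "title": "Login broken"}
)
key_c = idempotency_key(
    "run-42", 4, "create_ticket", {"title": "Login broken", "priority": 2}
)
```

`key_a` and `key_b` match because they describe the same step with the same arguments, while `key_c` differs because it belongs to a later step. Including `args` in the hash is a design choice: it means a retry in which the model rewords an argument gets a fresh key, so drop `args` if the step index alone identifies the logical action in your system.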

Should I retry every failed tool call?
Only retryable ones (timeouts, 429, 503) and only with backoff. Terminal errors like a 400 won't improve on retry, and unguarded writes shouldn't be retried at all.

Does using a better model reduce tool-call failures?
A bit, since better models pick arguments more accurately. But timeouts, partial successes, and double-writes are infrastructure problems no model fixes.

If your agents are flaky and you suspect the tool layer, that's usually a good bet. We're happy to compare tool-contract patterns with anyone building in this space; find us at Shanti Infosoft.