Last year I shipped a refund agent that processed a customer's return three times for the same order. Same call. Same inputs. Three Stripe refunds. The model didn't go rogue. The tools didn't fail. The agent retried the tool call on a transient API timeout, the second call succeeded, and the third call hit a slow path that no one had instrumented. There was no idempotency key on the refund endpoint, so the operation ran three times.
That kind of duplicate side effect — payments, emails, API writes, database deletes — is the single most common production failure I see in AI agent systems. It is not a model problem. It is not a prompt problem. It is a retry problem, and the agent has more retries than your old HTTP service ever had.
If you are shipping agents that touch money, send mail, or call any side-effecting API, this guide is for you.
The shape of the failure
Production agents retry tool calls aggressively. The model sees a timeout, a 5xx, or a missing response, and on the next turn it reissues the same call. That is correct behavior for a read. It is catastrophic for a write.
The four patterns I see over and over:
- Retry storms on a flaky write endpoint. The tool returned 504, but the side effect did happen. The agent retries. Duplicate charge, duplicate email, duplicate fulfillment.
- Concurrent branches that both "succeed." A planner splits a task into two sub-agents. Both call the same payment tool with the same intent. Both think they are the canonical call. Both succeed.
- Compensating retries after a partial failure. The first attempt wrote half a record, the second attempt timed out before the response, and the third attempt succeeded — but the system has no idea the first attempt was real.
- Crash-recovery replays without result caching. The agent process restarted, the durable execution framework replayed every step, and the steps with side effects ran twice because the framework's step cache did not include the side effect's idempotency key.
Every one of these is a distributed-systems problem dressed up in a prompt. The agent is doing what distributed systems have done badly for decades: retrying a non-idempotent operation and hoping the second one wins.
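To make the first pattern concrete, here is a minimal sketch of a write endpoint that times out after the side effect has already happened. `FlakyRefundAPI` and `naive_retry` are hypothetical names invented for this illustration; they stand in for a real tool and a real agent loop.

```python
class FlakyRefundAPI:
    """Hypothetical write endpoint that does the work, then times out."""

    def __init__(self, timeouts_before_reply):
        self.refunds = []  # side effects that really happened
        self._timeouts_left = timeouts_before_reply

    def refund(self, order_id, amount_cents):
        # The side effect lands first...
        self.refunds.append((order_id, amount_cents))
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            # ...but the caller never hears about it.
            raise TimeoutError("gateway timeout")
        return "ok"


def naive_retry(api, order_id, amount_cents, attempts=3):
    """Retry on timeout with no idempotency key: fine for reads, unsafe for writes."""
    for _ in range(attempts):
        try:
            return api.refund(order_id, amount_cents)
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


api = FlakyRefundAPI(timeouts_before_reply=2)
status = naive_retry(api, "order_123", 4200)
print(status, len(api.refunds))
```

From the caller's side this looks like one successful call after two transient errors. On the endpoint's side, the refund ran on every attempt.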
The fix is not "retry less"
The instinct is to throttle retries. That targets the wrong variable: a flaky tool deserves a retry, and a single retry of a non-idempotent write is already enough to duplicate it. The fix is to make every side-effecting call retry-safe. That means an idempotency key on every write, scoped tightly enough to be unique and broadly enough to catch the right duplicate.
Concretely, three things:
Generate a stable idempotency key per logical operation, not per tool call. The key should be derived from the operation's intent — order id, customer id, action type — and the same logical operation should produce the same key across retries. If you generate a fresh UUID per retry, you have not solved the problem. Stripe's idempotency keys rely on exactly this: a retry is only deduplicated if it carries the same key as the original request.
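One way to get a stable key is to hash a canonical form of the operation's intent. This is a sketch under assumptions: `idempotency_key` is a hypothetical helper, and the field names (`order_id`, `customer_id`) are placeholders for whatever identifies a logical operation in your system.

```python
import hashlib
import json


def idempotency_key(action, **intent):
    """Derive a deterministic key from the operation's intent (hypothetical helper)."""
    # sort_keys makes the serialized form independent of argument order.
    canonical = json.dumps(
        {"action": action, **intent}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first_attempt = idempotency_key("refund", order_id="order_123", customer_id="cus_9")
retry = idempotency_key("refund", customer_id="cus_9", order_id="order_123")
other_order = idempotency_key("refund", order_id="order_456", customer_id="cus_9")

print(first_attempt == retry)        # same intent, same key
print(first_attempt == other_order)  # different order, different key
```

The scoping decision lives in which fields you pass in. If one order can legitimately be refunded twice (partial refunds, for example), the amount or a line-item id probably needs to be part of the intent, or the second real refund will be treated as a duplicate.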
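Pass the key to the downstream API and let it dedupe. Some write APIs accept an idempotency key natively; Stripe is the best-known example. Support varies by provider and by endpoint, so check the docs for each API you call instead of assuming it is there, and note that a plain database transaction does not dedupe anything unless you add a unique constraint on the key. If your API has no native support, wrap the call in your own dedupe layer: a small table of "seen keys within the last 24h" is enough.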
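For an API with no native support, the "seen keys" table can be sketched with SQLite and a primary-key constraint. Everything here is an assumption made for illustration: the `SeenKeys` class, the `seen_keys` table, the `send_once` wrapper, and the example key format are not from any library.

```python
import sqlite3
import time

WINDOW_SECONDS = 24 * 60 * 60  # the 24h window mentioned above


class SeenKeys:
    """Hypothetical dedupe table for APIs without idempotency support."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS seen_keys "
            "(key TEXT PRIMARY KEY, seen_at REAL NOT NULL)"
        )

    def claim(self, key, now=None):
        """Return True if the caller may perform the write, False for a duplicate."""
        now = time.time() if now is None else now
        try:
            with self.conn:  # one transaction: expire old keys, then claim
                self.conn.execute(
                    "DELETE FROM seen_keys WHERE seen_at < ?", (now - WINDOW_SECONDS,)
                )
                self.conn.execute(
                    "INSERT INTO seen_keys (key, seen_at) VALUES (?, ?)", (key, now)
                )
            return True
        except sqlite3.IntegrityError:
            return False  # primary key already present: someone claimed it first


def send_once(seen, key, send):
    """Run `send` only if this key has not been claimed inside the window."""
    if not seen.claim(key):
        return "duplicate_skipped"
    send()
    return "sent"


seen = SeenKeys(sqlite3.connect(":memory:"))
outbox = []
results = [
    send_once(seen, "email:order_123:receipt", lambda: outbox.append("receipt"))
    for _ in range(3)
]
print(results, outbox)
```

This version claims the key before the write, which favors "at most once". If the write fails in a way you can prove had no effect, the claim should be released so a retry can go through; on an ambiguous timeout, keeping the claim is the safer default.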
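Cache the result, not just the call. Durable execution frameworks (Inngest, Temporal, Restate) cache step results, so the replay sees the cached result and skips the side effect. That only protects steps whose result was actually recorded: a crash between the side effect and the recording still reruns the step, which is why the downstream idempotency key stays necessary. If you are not using one of these frameworks, you need a side-effect ledger: a record of "this idempotency key, this outcome" that the next retry consults.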
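A side-effect ledger can be as small as one table. The sketch below assumes a hypothetical `SideEffectLedger` class, a `ledger` table, JSON-serializable outcomes, and a placeholder refund id; it is not how any particular framework implements its step cache.

```python
import json
import sqlite3


class SideEffectLedger:
    """Hypothetical ledger mapping idempotency key to recorded outcome."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ledger "
            "(key TEXT PRIMARY KEY, outcome TEXT NOT NULL)"
        )

    def run_once(self, key, effect):
        row = self.conn.execute(
            "SELECT outcome FROM ledger WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            # Replay or retry: hand back the recorded outcome, skip the effect.
            return json.loads(row[0])
        outcome = effect()
        with self.conn:  # commit the outcome so the next attempt can find it
            self.conn.execute(
                "INSERT INTO ledger (key, outcome) VALUES (?, ?)",
                (key, json.dumps(outcome)),
            )
        return outcome


ledger = SideEffectLedger(sqlite3.connect(":memory:"))
calls = []


def issue_refund():
    calls.append("refund")
    # Placeholder outcome; a real call would return the provider's response.
    return {"refund_id": "re_demo_1", "status": "succeeded"}


first = ledger.run_once("refund:order_123", issue_refund)
replayed = ledger.run_once("refund:order_123", issue_refund)
print(first == replayed, len(calls))
```

The retry gets the original outcome back instead of a bare "already done", so the agent can continue as if the call had succeeded the first time. The sketch is not safe against two concurrent callers, and it has the same crash window between `effect()` and the insert, so it complements the downstream key and does not replace it.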
A 10-minute audit you can run today
Open your last 5 production incidents. For each one, answer:
- Did the agent call a side-effecting tool more than once for the same logical operation? If yes, that is the root cause.
- Does the tool call include an idempotency key? If no, the duplicate was structurally guaranteed.
- Did the framework replay the step? If yes, and the step has no result cache, the duplicate was guaranteed.
If the answer to the first question is yes for even one incident, idempotency is your top operational debt. Everything else is downstream of it.
The more interesting signal: incidents that are "intermittent" and "involve duplicate X." That is almost always idempotency. Every agent team I have worked with that has more than three of these in a quarter has a tooling gap, not a model gap.
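The first two audit questions can be checked mechanically if you can export tool calls from your traces. This sketch assumes a made-up record shape (`tool`, `args`, `idempotency_key`) with hashable argument values and a hand-maintained set of write tools; adapt both to whatever your tracing tool actually exports.

```python
from collections import defaultdict

# Assumption: you maintain your own list of side-effecting tools.
SIDE_EFFECTING_TOOLS = {"refund", "send_email"}


def find_duplicate_writes(tool_calls):
    """Group write calls by logical operation; return groups called more than once."""
    groups = defaultdict(list)
    for call in tool_calls:
        if call["tool"] not in SIDE_EFFECTING_TOOLS:
            continue  # repeated reads are fine
        # Logical operation = tool name plus its arguments, in a stable order.
        op = (call["tool"], tuple(sorted(call["args"].items())))
        groups[op].append(call)
    return {op: calls for op, calls in groups.items() if len(calls) > 1}


refund_args = {"order_id": "order_123", "amount_cents": 4200}
trace = [
    {"tool": "lookup_order", "args": {"order_id": "order_123"}, "idempotency_key": None},
    {"tool": "lookup_order", "args": {"order_id": "order_123"}, "idempotency_key": None},
    {"tool": "refund", "args": refund_args, "idempotency_key": None},
    {"tool": "refund", "args": refund_args, "idempotency_key": None},
    {"tool": "send_email", "args": {"to": "a@example.com"}, "idempotency_key": "k1"},
]

duplicates = find_duplicate_writes(trace)
missing_keys = [
    c for c in trace
    if c["tool"] in SIDE_EFFECTING_TOOLS and not c["idempotency_key"]
]
print(len(duplicates), len(missing_keys))
```

Grouping by exact arguments will miss duplicates where the model rephrased an argument between attempts, so treat an empty result as "nothing obvious" and not as a clean bill of health.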
What the tooling layer does not give you
LangSmith, Langfuse, Helicone, Arize, and Braintrust all give you traces. None of them dedupe your side effects. They will show you the duplicate tool call. They will not stop it. The retry policy, the idempotency key, the result cache, and the durable execution framework are all things you have to build or buy on top.
That is the part most teams skip. It is the part that matters.
If you are hitting a wall with duplicate side effects in a production agent and want a forensic read of your traces — five of them, a one-page report on the actual root cause, and a concrete fix list — I do that for $149. Milo Antaeus, AI Ops Checkup: miloantaeus.com/ai-ops-checkup.html.