Abdullah Shahin

Posted on May 28

Stopping the LLM from calling the same tool twice (and other things it shouldn't)

#ai #llm #agents #production

A user gave one of our agents this query:

"Get the products from our catalog, summarize them in a nice doc, share the doc with X, and send them an email asking for feedback."

The agent called create_doc seven times. Seven empty Google Docs showed up in the user's Drive. No catalog summary. No sharing. No email. The trace looked roughly like this:

[
  { "role": "assistant", "tool_call": { "id": "call_01", "name": "create_doc", "arguments": { "title": "Product Catalog Summary" } } },
  { "role": "tool",      "tool_call_id": "call_01", "content": "{\"doc_id\":\"1ab...\",\"url\":\"...\"}" },
  { "role": "assistant", "tool_call": { "id": "call_02", "name": "create_doc", "arguments": { "title": "Product Catalog Summary" } } },
  { "role": "tool",      "tool_call_id": "call_02", "content": "{\"doc_id\":\"2bc...\",\"url\":\"...\"}" },
  // ... five more rounds of the same shape
]

Same tool, same arguments, valid response on every attempt. We don't usually get to see why a model did what it did, but in this case we could: the platform surfaces the planner's reasoning and the tool list available to it at every step. The cause was unspectacular and depressing in equal measure — the agent's tool list was incomplete. There was no fetch_catalog, no write_to_doc, no share_doc, no send_email. The only writing-shaped tool it had was create_doc. Faced with a four-step task and one tool, it reached for that tool, watched the task not get any closer to done, and reached again. Seven times.

Google Drive was happy to create a seventh empty document; refusing isn't its job. The agent wasn't doing anything that looked wrong on the individual call — it was making valid, well-formed tool calls. The bug was visible only at the shape level: same tool, identical arguments, seven times in a row, with no observable progress between calls.

We pulled the thread on that. The conclusion is what this post is about: tool calls are side effects, and side effects need a policy layer that runs before the call, not an audit log that runs after. Whether the side effect is "charge a credit card" or "create an empty doc in someone's Drive," the layer that sees the call shape — and ideally the planner's reasoning behind it — can refuse before the seventh attempt. Well, before the second.

That one incident is the only place in this post where I'm reporting something we actually saw. The rest is the design space we've been thinking through as a result. I'll mark hypotheticals as hypotheticals.

If you can only catch a bad call after it happens, you can only apologize. The cheapest thing you can do for any agent that touches the outside world is build a thin gate the model has to pass through — and then make that gate intelligent about what "duplicate," "authorized," and "refused" actually mean.

What "duplicate" actually means

"Don't call the same tool twice" sounds like one rule. We think it's at least four, and the Google Docs case only exercises the first one.

Byte-identical arguments. The clearest case, and the one above. Same tool name, same JSON payload, fired within some window. Trivial to detect: store a hash of (tool_name, canonicalized_args) in a per-conversation set and refuse on a hit. Time-to-implement: an afternoon. Catches the empty-docs case directly.
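A minimal sketch of that check in Python, assuming tool arguments arrive as a JSON-serializable dict. The names `call_fingerprint` and `DuplicateGate` are invented for illustration, and the "within some window" part is left out to keep it short.

```python
import hashlib
import json


def call_fingerprint(tool_name: str, args: dict) -> str:
    """Hash the tool name plus a stable JSON rendering of the arguments."""
    # sort_keys and fixed separators make key order and whitespace irrelevant.
    payload = json.dumps([tool_name, args], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DuplicateGate:
    """Holds the fingerprints seen so far; assumes one instance per conversation."""

    def __init__(self):
        self._seen = set()

    def allow(self, tool_name: str, args: dict) -> bool:
        fingerprint = call_fingerprint(tool_name, args)
        if fingerprint in self._seen:
            return False  # same tool, same payload: refuse on hit
        self._seen.add(fingerprint)
        return True


gate = DuplicateGate()
first = gate.allow("create_doc", {"title": "Product Catalog Summary"})
second = gate.allow("create_doc", {"title": "Product Catalog Summary"})
print(first, second)
```

With a gate like this in front of create_doc, the second call in the trace above would have been refused instead of producing another empty document.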

Semantically-equal arguments (hypothetical). Imagine an agent that calls create_invoice once with {"amount_cents": 184000, "currency": "USD"} and then with {"currency": "USD", "amount_cents": 184000.0}. Or it passes a phone number as +1 (415) 555-0142 once and 4155550142 the next time. The byte-level hashes differ, so the duplicate check misses the repeat and the downstream system happily double-acts. The fix is per-tool canonicalization: each tool declares which arguments are identifying and how to normalize them. amount_cents is an int. phone runs through a normalizer. email lowercases. It's annoying to write. We haven't been bitten by this one yet because the writes we've connected so far have been simple enough that byte-equality catches the dupes; the moment the tools touch money I expect that to change.
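One way the per-tool declaration could look, as a sketch rather than a finished design. The `IDENTITY_FIELDS` table, the `create_contact` tool, and the US-only phone rule are all assumptions made up for this example.

```python
import re


def normalize_phone(raw: str) -> str:
    """Keep digits only and drop a leading 1 (assumes US numbers)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def to_int_cents(value) -> int:
    """Accept 184000 or 184000.0, reject anything with a fractional part."""
    as_int = int(value)
    if as_int != value:
        raise ValueError(f"amount_cents must be a whole number, got {value!r}")
    return as_int


# Hypothetical declaration: tool -> {identifying argument: normalizer}.
# Arguments not listed here do not take part in duplicate detection.
IDENTITY_FIELDS = {
    "create_invoice": {"amount_cents": to_int_cents, "currency": str.upper},
    "create_contact": {"phone": normalize_phone, "email": str.lower},
}


def canonicalize(tool_name: str, args: dict) -> dict:
    """Return only the identifying arguments, each run through its normalizer."""
    spec = IDENTITY_FIELDS[tool_name]
    return {name: fix(args[name]) for name, fix in spec.items() if name in args}


a = canonicalize("create_invoice", {"amount_cents": 184000, "currency": "USD"})
b = canonicalize("create_invoice", {"currency": "USD", "amount_cents": 184000.0})
print(a == b)
```

The canonical dict, not the raw payload, is what would then be hashed for the duplicate check. Without this step, a JSON serialization of 184000 and 184000.0 produces different bytes and therefore different hashes.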

Idempotency-key collisions (hypothetical). If you're calling Stripe, you're probably passing Idempotency-Key headers. Now suppose the model decides to retry and generates a new idempotency key because it considers the retry a "new attempt." The downstream system, doing its job, treats them as two separate operations. The shape of the fix we like: the agent doesn't get to mint idempotency keys; the layer does, derived from the canonicalized arg hash. The model can ask for create_invoice ten times and the layer hands Stripe the same idempotency key every time, and Stripe returns the same invoice. The model's freedom to retry stays intact. The blast radius doesn't. This is design reasoning, not measured outcomes.
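A sketch of the layer minting the key, assuming the arguments have already been canonicalized. `derive_idempotency_key` and `build_request` are illustrative names, the `idem_` prefix is arbitrary, and the request is represented as a plain dict rather than a call into any real SDK.

```python
import hashlib
import json


def derive_idempotency_key(tool_name: str, canonical_args: dict) -> str:
    """Same tool and same canonical arguments always give the same key."""
    payload = json.dumps([tool_name, canonical_args], sort_keys=True, separators=(",", ":"))
    return "idem_" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


def build_request(tool_name: str, canonical_args: dict, model_headers=None) -> dict:
    """Assemble the outgoing request; the model's own key, if any, is discarded."""
    headers = {
        name: value
        for name, value in (model_headers or {}).items()
        if name.lower() != "idempotency-key"
    }
    headers["Idempotency-Key"] = derive_idempotency_key(tool_name, canonical_args)
    return {"tool": tool_name, "args": canonical_args, "headers": headers}


invoice = {"amount_cents": 184000, "currency": "USD"}
attempt_1 = build_request("create_invoice", invoice, {"Idempotency-Key": "model-made-1"})
attempt_2 = build_request("create_invoice", invoice, {"idempotency-key": "model-made-2"})
print(attempt_1["headers"]["Idempotency-Key"] == attempt_2["headers"]["Idempotency-Key"])
```

Whether the key should also be scoped by conversation or by user is a separate design choice that this sketch leaves open.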

Intent-equal calls with different arguments (hypothetical, and the one I'd most like to never see). Consider an agent that calls create_invoice for an order, then calls notify_customer(template="receipt", order_id=...) — which, because of how the receipt template was written months ago, internally re-charges the saved payment method. Two different tools, two different argument shapes, one duplicate charge. You can't detect that with hashing. You'd need a side-effect graph: each tool declares which downstream resources it mutates, and the layer refuses a second mutation of the same resource within a conversation without explicit confirmation. This is the most expensive of the four to build, and it's still on our list rather than behind us.
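A toy version of the side-effect declaration, to make the idea concrete. The resource naming (`charge:<order_id>`), the `TOOL_MUTATES` table, and the `SideEffectGuard` class are assumptions for illustration, and a real graph would need far more than one shared resource type.

```python
def invoice_resources(args: dict) -> set:
    """create_invoice is assumed to charge for the order it names."""
    return {"charge:" + args["order_id"]}


def notify_resources(args: dict) -> set:
    """Only the receipt template is assumed to touch the charge."""
    if args.get("template") == "receipt":
        return {"charge:" + args["order_id"]}
    return set()


# Hypothetical declaration: tool -> function returning the resources it mutates.
TOOL_MUTATES = {
    "create_invoice": invoice_resources,
    "notify_customer": notify_resources,
}


class SideEffectGuard:
    """Tracks mutated resources; assumes one instance per conversation."""

    def __init__(self):
        self._mutated_by = {}  # resource -> tool that mutated it first

    def check(self, tool_name: str, args: dict, confirmed: bool = False):
        declare = TOOL_MUTATES.get(tool_name)
        resources = declare(args) if declare else set()
        conflicts = sorted(r for r in resources if r in self._mutated_by)
        if conflicts and not confirmed:
            return False, conflicts
        for resource in resources:
            self._mutated_by.setdefault(resource, tool_name)
        return True, []


guard = SideEffectGuard()
print(guard.check("create_invoice", {"order_id": "ord_1"}))
print(guard.check("notify_customer", {"template": "receipt", "order_id": "ord_1"}))
```

Two simplifications are worth flagging: the sketch records a mutation when the call is checked rather than when it succeeds, and a tool with no declaration is treated as mutating nothing, which is exactly the forgotten-declaration hole.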

Authorization without bureaucracy

Once you have a gate, the next question is what passes through it. There's a tempting failure mode here, which is to require human approval on every write tool call. This works for about a week, after which the human stops reading the approvals and clicks "approve" on everything. Now you have a worse system than no approvals, because you have a paper trail of someone having allegedly approved a duplicate charge.

The layered shape we've converged on, again as design reasoning rather than measured deployments:

Allowlist for low-stakes, high-frequency calls. Reads, mostly. Idempotent writes against scratch space. These shouldn't need authorization at all; they need rate limits and arg validation. Failing to allowlist them is how you end up with a 45-second reply latency because a human in Slack is being asked to "approve" a customer lookup.

Per-conversation grants for medium-stakes calls. Granted once at the top of a session by an authenticated user, applies to all subsequent calls in that conversation within a stated ceiling. The model can iterate, the user isn't interrupted, and the grant has a ceiling and a TTL. Cost: one round-trip at the start of the session.

Inline HITL for high-stakes or out-of-policy calls. Anything that crossed the side-effect-graph boundary, anything past a per-tool ceiling, anything explicitly destructive. These pause the agent, surface a structured prompt to a human, and resume on approval or denial. Done correctly, these are rare enough that the human actually reads them. Done badly, they degrade into the "approve all" pattern.

The cost-UX tradeoff is real and not something I can tell you the right answer to. Every grant decision you push to the user is a chance for them to turn the agent off. Every grant you don't push is a chance for the agent to do something expensive. The position I'd defend is: reads always allowlisted, writes under a per-tool ceiling get a per-conversation grant, everything else interrupts. The exact thresholds are tunable per deployment and you'll get them wrong on the first try.
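The three tiers can be sketched as a single decision function. Everything named here is hypothetical: the `READ_ALLOWLIST` contents, the `Grant` fields, and the choice to apply the ceiling per call rather than cumulatively across the conversation.

```python
import time
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    INTERRUPT = "interrupt"  # pause the agent and ask a human


@dataclass
class Grant:
    """A per-conversation grant with a ceiling and a TTL."""
    tool_name: str
    ceiling_cents: int
    expires_at: float  # Unix timestamp

    def covers(self, tool_name: str, amount_cents: int, now: float) -> bool:
        return (
            tool_name == self.tool_name
            and amount_cents <= self.ceiling_cents
            and now < self.expires_at
        )


# Assumed low-stakes reads; these skip authorization entirely.
READ_ALLOWLIST = {"lookup_customer", "lookup_invoice"}


def authorize(tool_name: str, amount_cents: int, grants: list, now=None) -> Decision:
    now = time.time() if now is None else now
    if tool_name in READ_ALLOWLIST:
        return Decision.ALLOW
    if any(grant.covers(tool_name, amount_cents, now) for grant in grants):
        return Decision.ALLOW
    return Decision.INTERRUPT  # no grant, over the ceiling, or expired


grant = Grant("create_invoice", ceiling_cents=50_000, expires_at=1600.0)
print(authorize("create_invoice", 20_000, [grant], now=1000.0))
print(authorize("create_invoice", 184_000, [grant], now=1000.0))
```

The allowlisted path never builds an approval prompt, which is the point: a customer lookup should cost a set membership test, not a human round-trip.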

What happens when the model insists

You denied the call. The model wants to make it anyway.

There's a specific failure shape we expect once denials are in the loop: the model proposes a call and the layer refuses; the model proposes it again, slightly reworded, and is refused again. Third try, refused. This is the same loop shape as the empty-docs case at the top of the post, with the layer in the role Google Drive played there, except that it says no instead of yes.

Two design moves that we think address it:

Loop detection. Count refusals per tool per conversation; trip a circuit at N=3. Past the trip, the layer stops engaging the tool entirely for the rest of the conversation and surfaces the refusal upstream. The Google Docs case wouldn't have been helped by this directly (each call was successful, so there were no refusals to count), but the framing — track per-tool patterns at the conversation level — would have caught it.
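The refusal counter is small enough to sketch in full. `RefusalBreaker` is an invented name, and the sketch assumes one instance per conversation with the threshold of 3 mentioned above.

```python
from collections import Counter


class RefusalBreaker:
    """Counts refusals per tool and trips once a tool reaches the threshold."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._refusals = Counter()

    def record_refusal(self, tool_name: str) -> None:
        self._refusals[tool_name] += 1

    def is_open(self, tool_name: str) -> bool:
        # Open means the layer should stop engaging this tool for the
        # rest of the conversation and surface the refusal upstream.
        return self._refusals[tool_name] >= self.threshold


breaker = RefusalBreaker()
for _ in range(3):
    breaker.record_refusal("create_invoice")
print(breaker.is_open("create_invoice"), breaker.is_open("create_doc"))
```

Catching the empty-docs pattern would need a second counter keyed on successful calls with an identical fingerprint, since that loop never produced a refusal to count.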

Graceful refusal. The refusal isn't an exception. It's a structured tool result the planner can read and reason about:

{
  "status": "refused",
  "reason": "duplicate_of_call_01",
  "policy": "no_duplicate_create_invoice_per_conversation",
  "previous_result": { "invoice_id": "inv_77b2", "status": "created" },
  "suggested_next": ["lookup_invoice", "notify_customer"]
}

Worth saying: I don't have before/after metrics on what graceful refusal does to loop length in production traffic, because we don't have that production traffic yet. The intuition is that a structured refusal with the previous result and a suggested next action lets the model say "ah, the invoice already exists, I'll just send the receipt" and move on, whereas an opaque "tool error" leaves the planner with nowhere to go. If you've shipped this pattern with real volume, I'd love to hear what you actually saw.

Replan vs. hard-fail is the last branch. Replan is the default. Hard-fail is for unrecoverable cases: the user is unauthenticated, the grant ceiling was exceeded, the side-effect graph says the resource is locked. In those cases the layer returns a refusal and sets a conversation-level flag the agent reads as "stop attempting this category of action." The model still gets to say something useful to the user; it just can't try the action again.
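The replan/hard-fail branch could be sketched like this. The reason strings, the `mode` field, and the `ConversationState` class are assumptions for illustration; the result dict borrows the shape of the refusal shown earlier.

```python
from dataclasses import dataclass, field

# Assumed reason codes for the unrecoverable cases listed above.
HARD_FAIL_REASONS = {"unauthenticated", "grant_ceiling_exceeded", "resource_locked"}


@dataclass
class ConversationState:
    # Categories of action the agent must stop attempting.
    blocked_categories: set = field(default_factory=set)


def refuse(state: ConversationState, category: str, reason: str, suggested_next=()) -> dict:
    """Build a structured refusal; hard failures also block the whole category."""
    hard = reason in HARD_FAIL_REASONS
    if hard:
        state.blocked_categories.add(category)
    return {
        "status": "refused",
        "reason": reason,
        "mode": "hard_fail" if hard else "replan",
        # On a hard fail there is nothing to suggest within this category.
        "suggested_next": [] if hard else list(suggested_next),
    }


state = ConversationState()
soft = refuse(state, "billing", "duplicate_of_call_01", ["lookup_invoice", "notify_customer"])
hard = refuse(state, "billing", "grant_ceiling_exceeded", ["lookup_invoice"])
print(soft["mode"], hard["mode"], state.blocked_categories)
```

The gate would consult `blocked_categories` before evaluating any later call, so a hard-failed category is refused without re-running the policy.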

What this doesn't catch

Two things I want to be honest about.

One: graceful refusal doesn't save you from the case where the model paraphrases its way around the denial. If the layer refuses create_invoice and the model two turns later calls create_charge — a sibling tool from the same SDK with the same downstream effect — the byte-hash won't catch it, the semantic canonicalizer won't (different argument shapes), and the side-effect graph will only catch it if you declared the shared resource. Declaring shared resources for every new tool is exactly the kind of thing that's easy to forget.

Two: nothing in this post is in production at scale. The Google Docs incident is real. The lessons we drew from it are honest. The design we've been building toward is what's described here. But "we caught X% of duplicates" or "loop length dropped Y%" — I'd be making those numbers up if I quoted them. Don't trust anyone's policy-layer pitch that comes with crisp metrics this early; ours included.

The thing I do believe, having pulled the thread on one boring incident for a few weeks: the layer isn't the product. The catalog of failure modes you've actually seen and encoded is the product. The layer is just the substrate that lets you encode them.

Close

If you've shipped agents that did something they shouldn't and you're tired of finding out post-hoc, we'd like to compare notes. hivein is the layer that takes your agentic app to production. The tool-execution policy described above is one piece of the stack, alongside the agent orchestration, planner observability, and trust-to-act primitives the post draws on. The beta is invite-only through W6. If your failure pattern resembles anything here, or you've seen one we haven't, ping us at https://hivein.ai. The landing page is itself an agent built on hivein, so you can talk to it directly.

Top comments (4)

Abdullah Shahin • Jun 3

Yeah, the in-process scoping is the honest limitation of the article's version — it's the cheap win, not the durable one. Pushing the canonical key into the effect's own boundary as the primary key is the right collapse; we ended up doing the same in HiveIn's tool-execution layer for exactly the restart case you mentioned. The "policy before vs audit log after" split was always uncomfortable, and turning the audit row into the gate makes it one fact instead of two.

The identity-bearing-vs-incidental-fields point is the real unsolved one, and I don't think there's a generic answer. What's worked for us is letting the tool author declare an identity projection alongside the schema — a small list of fields that participate in the key, with everything else (timestamps, nonces, request IDs) ignored. It pushes the judgment to the person who actually knows whether create_invoice is intent-equal across days, but at least the runtime doesn't have to guess.

Abdullah Shahin • Jun 3

"Constrain at the tool layer, don't pray at the prompt layer" is the line — I'm stealing that. Agreed that feeding results back cleanly is necessary but not sufficient; the model registering its own success still fails under load, and the boundary is the only honest enforcement point.

On your question: HiveIn does both, but they're solving different problems. Idempotency keys per logical action are the safety net for the repeat case — they fire whether the model meant to repeat or got rolled back into one. The planner-commits-to-a-step-list piece handles the mis-order and over-call cases you flagged; the executor reads from a frozen step list and can't invent a tool call the planner didn't authorize. The keys are about "what happens if you ask twice," the plan is about "what are you even allowed to ask for." We needed both because either one alone leaves a class of failure on the table.

ANP2 Network • May 29

The per-conversation hash-set catches the within-loop case well, but the spot that bit us was one level down: that set is in-process and scoped to a single run, so it misses the retry-after-crash path and the same-effect-from-a-different-conversation path. What helped was pushing the idempotency key through to the effect's own boundary instead of holding it in the agent's memory — derive it from the canonicalized (tool, args) exactly as you describe, then hand that key to the side-effecting call so the downstream system (or a ledger row) dedupes on it too.

Concretely, in a settlement flow I work on the side effect is a keyed insert: the canonical key IS the primary key, so a duplicate request hits a key conflict and the second write silently no-ops. That collapses your "policy before vs audit log after" split — the same row is both the gate and the record, and it survives a restart because it isn't living in process memory.

The genuinely hard part is the "semantically-equal" case you flagged. Byte-hashing breaks the moment the payload carries a timestamp or a nonce, so you end up declaring per tool which fields are identity-bearing and which are incidental — there's no generic canonicalizer that knows create_doc("Summary") on a retry is the same intent but create_invoice with today's date is not.

Harjot Singh • May 31

Seven empty Google Docs is the perfect horror-story screenshot, because it shows the failure isn't the model being dumb, it's the harness having no memory of what already happened. The agent re-called create_doc because nothing told it the first call succeeded and nothing made a second call a no-op. Two fixes, and they're different layers: first, feed tool results back clearly so the model knows the doc exists (a lot of these loops are the model not registering its own success); second, and more robustly, make the tools idempotent at the boundary so even if the model asks twice, the system creates once. The second is the one I trust, because relying on the model to track state is exactly the thing that fails under load. The deeper principle: an agent will eventually repeat, mis-order, or over-call, so the harness has to enforce do-once and correct-sequence rather than hope the model behaves. Constrain at the tool layer, don't pray at the prompt layer. That's core to how I build agent runtimes in Moonshift. Did you land on idempotency keys per logical action, or a planner that commits to a step list the executor can't deviate from?