AI agents are starting to call real production tools: Stripe, CRMs, databases, email, internal APIs.
The part that scares me most is not the model reasoning. It’s the boring failure mode after the model decides what to do:
An agent calls stripe.charge_customer.
Stripe times out.
Did the charge fail? Or did it succeed and the response got lost?
Most agent systems treat that as a normal failure and retry. That is how you get duplicate charges, duplicate refunds, duplicate emails, duplicate database writes, etc.
I’m building Invoke as an execution layer that sits between agents and tools.
Instead of letting agents call tools directly, Invoke wraps each action with:
idempotency keys
policy checks
approval gates
execution receipts
outcome reconciliation
retry blocking when the action already happened
audit logs for every tool call
Example flow:
Agent calls stripe.charge_customer
Stripe times out
Invoke marks the execution as UNKNOWN, not failed
Invoke reconciles against live Stripe state
Stripe says the charge already succeeded
Invoke blocks the retry
Agent receives an execution receipt and continues safely
The goal is not “AI governance” as a buzzword. It’s more like Stripe-style execution infrastructure for agents: make every real-world action visible, scoped, idempotent, reviewable, and auditable.
We also added an MCP/API surface so agents and MCP clients can query context, simulate policies, inspect approvals, and read execution receipts through Invoke.
Curious if other people building agents have hit this exact timeout/retry problem yet, or if this is still mostly theoretical for your use cases.
For further actions, you may consider blocking this person and/or reporting abuse
Top comments (1)
The hardest part of the UNKNOWN state isn't blocking the retry — it's that reconciliation only works against tools that expose queryable canonical state. Stripe gives you that (idempotency keys plus a GET-able charge object), but a lot of what agents actually call — fire-and-forget email, POST-only internal webhooks, third-party APIs with no read-back — has nothing to reconcile against. For those, the guarantee silently degrades to the weakest downstream's observability, so it's worth classifying tools as "reconcilable vs not" up front instead of handing back one uniform receipt that implies the same confidence either way.
Two things that bit us running something similar: (1) the idempotency key has to be derived from semantic intent (customer + amount + purpose-window), not from the call site — otherwise an agent that retries by re-deciding emits a fresh key and walks straight past the dedup. (2) the moment more than one instance of the execution layer can process the same UNKNOWN, the key store has to be linearizable (compare-and-set on the key), or two reconcilers race against live Stripe state and you reintroduce the exact duplicate you were preventing. The receipt is only as trustworthy as the consensus behind that one write.