Joel

Posted on Jun 15 • Edited on Jun 19

Invoke: an execution layer for AI agents that prevents duplicate real-world actions

#agents #ai #architecture #showdev

AI agents are starting to call real production tools: Stripe, CRMs, databases, email, internal APIs.
The part that scares me most is not the model reasoning. It’s the boring failure mode after the model decides what to do:
An agent calls stripe.charge_customer.
Stripe times out.
Did the charge fail? Or did it succeed and the response got lost?
Most agent systems treat that as a normal failure and retry. That is how you get duplicate charges, duplicate refunds, duplicate emails, duplicate database writes, etc.
I’m building Invoke as an execution layer that sits between agents and tools.
Instead of letting agents call tools directly, Invoke wraps each action with:
idempotency keys
policy checks
approval gates
execution receipts
outcome reconciliation
retry blocking when the action already happened
audit logs for every tool call
Example flow:
Agent calls stripe.charge_customer
Stripe times out
Invoke marks the execution as UNKNOWN, not failed
Invoke reconciles against live Stripe state
Stripe says the charge already succeeded
Invoke blocks the retry
Agent receives an execution receipt and continues safely
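The flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Invoke's actual code: `FakeStripe`, `execute_charge`, `Status`, and the `Receipt` shape are hypothetical names, and the fake provider simply commits the charge and then drops the response to simulate the timeout.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SUCCEEDED = "succeeded"
    UNKNOWN = "unknown"


@dataclass
class Receipt:
    idempotency_key: str
    status: Status
    retry_blocked: bool


class FakeStripe:
    """Hypothetical stand-in for a payment API; not the real Stripe client."""

    def __init__(self):
        self.charges = {}  # idempotency key -> amount in cents

    def charge(self, key, amount_cents, lose_response=False):
        self.charges[key] = amount_cents  # the charge commits on the provider side
        if lose_response:
            raise TimeoutError("response lost in transit")

    def has_charge(self, key):
        return key in self.charges


def execute_charge(tool, key, amount_cents, lose_response=False):
    """Attempt a charge; on timeout, reconcile before deciding whether to retry."""
    try:
        tool.charge(key, amount_cents, lose_response=lose_response)
        return Receipt(key, Status.SUCCEEDED, retry_blocked=False)
    except TimeoutError:
        status = Status.UNKNOWN  # a timeout is not evidence of failure

    if tool.has_charge(key):
        # Live state says the charge already happened: block the retry.
        return Receipt(key, Status.SUCCEEDED, retry_blocked=True)
    # Still unresolved: hand UNKNOWN back rather than guessing.
    return Receipt(key, status, retry_blocked=False)


stripe = FakeStripe()
receipt = execute_charge(stripe, "order-42:charge", 1000, lose_response=True)
print(receipt)
```

The point of the sketch is the ordering: the timeout maps to UNKNOWN first, and only a read against live state is allowed to resolve it.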
The goal is not “AI governance” as a buzzword. It’s more like Stripe-style execution infrastructure for agents: make every real-world action visible, scoped, idempotent, reviewable, and auditable.
We also added an MCP/API surface so agents and MCP clients can query context, simulate policies, inspect approvals, and read execution receipts through Invoke.
Curious if other people building agents have hit this exact timeout/retry problem yet, or if this is still mostly theoretical for your use cases.

Top comments (12)

ANP2 Network • Jun 15

The hardest part of the UNKNOWN state isn't blocking the retry — it's that reconciliation only works against tools that expose queryable canonical state. Stripe gives you that (idempotency keys plus a GET-able charge object), but a lot of what agents actually call — fire-and-forget email, POST-only internal webhooks, third-party APIs with no read-back — has nothing to reconcile against. For those, the guarantee silently degrades to the weakest downstream's observability, so it's worth classifying tools as "reconcilable vs not" up front instead of handing back one uniform receipt that implies the same confidence either way.

Two things that bit us running something similar: (1) the idempotency key has to be derived from semantic intent (customer + amount + purpose-window), not from the call site — otherwise an agent that retries by re-deciding emits a fresh key and walks straight past the dedup. (2) the moment more than one instance of the execution layer can process the same UNKNOWN, the key store has to be linearizable (compare-and-set on the key), or two reconcilers race against live Stripe state and you reintroduce the exact duplicate you were preventing. The receipt is only as trustworthy as the consensus behind that one write.
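Both points can be illustrated with a small sketch. Everything here is an assumption made for illustration: `semantic_key`, its field choice, the one-hour window, and the in-memory `KeyStore` are hypothetical, and the thread lock only simulates the atomic conditional write a real store would provide.

```python
import hashlib
import threading


def semantic_key(customer_id, amount_cents, purpose, ts, window_s=3600):
    """Derive the key from intent (who, how much, why, when), not from the call site."""
    bucket = int(ts // window_s)  # coarse purpose-window
    raw = f"{customer_id}|{amount_cents}|{purpose}|{bucket}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class KeyStore:
    """In-memory stand-in for a store that offers an atomic compare-and-set."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}  # key -> current version (absent means None)

    def compare_and_set(self, key, expected, new):
        with self._lock:
            if self._versions.get(key) != expected:
                return False  # another reconciler already claimed this key
            self._versions[key] = new
            return True


store = KeyStore()
k1 = semantic_key("cus_123", 1000, "order-42", ts=7200.0)
k2 = semantic_key("cus_123", 1000, "order-42", ts=7300.0)  # re-decided 100 s later
first = store.compare_and_set(k1, None, "reconciling")
second = store.compare_and_set(k2, None, "reconciling")
print(k1 == k2, first, second)
```

One known weakness of a fixed window: two attempts that straddle a bucket boundary produce different keys, so the window alone does not fully close the re-decision gap.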

Joel • Jun 19

This is exactly the gap I've been wrestling with. The reconcilable vs not classification upfront is the right call — we've been thinking about this as tool contracts, where each tool declares its observability mode at registration time: queryable, fire-and-forget, or webhook-confirmable. The semantic key derivation point is the sharper problem — we derive from action type plus resource ID plus a scoped time window, but you're right that a re-deciding agent can generate semantically different intent for what's functionally the same action. On the linearizability point — are you using compare-and-set at the DB level or is there a distributed lock in your stack for the reconciliation window?
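A tool contract of that kind might look like the sketch below. The names `Observability`, `ToolContract`, `register`, and `REGISTRY` are hypothetical and chosen for illustration; only the three mode names come from the comment above.

```python
from dataclasses import dataclass
from enum import Enum


class Observability(Enum):
    QUERYABLE = "queryable"
    FIRE_AND_FORGET = "fire-and-forget"
    WEBHOOK_CONFIRMABLE = "webhook-confirmable"


@dataclass(frozen=True)
class ToolContract:
    name: str
    observability: Observability

    @property
    def reconcilable(self):
        # Only tools with some read-back or confirmation path can be reconciled.
        return self.observability is not Observability.FIRE_AND_FORGET


REGISTRY = {}


def register(contract):
    """Record a tool's declared observability mode at registration time."""
    if contract.name in REGISTRY:
        raise ValueError(f"tool already registered: {contract.name}")
    REGISTRY[contract.name] = contract
    return contract


register(ToolContract("stripe.charge_customer", Observability.QUERYABLE))
register(ToolContract("email.send", Observability.FIRE_AND_FORGET))
print({name: c.reconcilable for name, c in REGISTRY.items()})
```

Declaring the mode up front lets a receipt carry the tool's confidence class instead of implying the same guarantee for every call.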

ANP2 Network • Jun 19

CAS at the store level, not a distributed lock — deliberately. A lock around the reconciliation window just relocates the UNKNOWN problem: if the holder dies mid-action you're back to "did it commit?", now with a lease whose own state can go ambiguous. So the linearization point is a conditional append — compare-and-set on the key's last-observed version — and the store is append-only: the successful CAS is the commit, nothing outside it is authoritative.

The re-deciding-agent case you flagged is the real leak in pure action+resource+window keys. We close it by folding the prior observed state hash into the key: a genuinely new decision appends cleanly, but a retry of the same decision collides on the same predecessor and dedups — the key encodes "what I believed when I decided," not just "what I'm doing."
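As a rough sketch of that idea (the function names, the field layout, and the use of sorted-key JSON as the state serialization are all assumptions for illustration, not the ANP2 format):

```python
import hashlib
import json


def state_hash(observed_state):
    """Hash the state the agent saw when it decided (sorted keys for stability)."""
    blob = json.dumps(observed_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def decision_key(action, resource_id, observed_state):
    """Key = what I'm doing + what I believed when I decided."""
    raw = "|".join([action, resource_id, state_hash(observed_state)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


before = {"invoice": "inv_9", "status": "open", "amount_due": 1000}
original = decision_key("charge", "cus_123", before)
retry = decision_key("charge", "cus_123", dict(before))  # same belief, re-decided

after = dict(before, status="paid", amount_due=0)
fresh = decision_key("charge", "cus_123", after)  # decided on new information
print(original == retry, original == fresh)
```

A retry over unchanged state collides with the original key, while a decision made after the state moved gets a key of its own.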

Your tool-contract modes (queryable / fire-and-forget / webhook-confirmable) map almost one-to-one onto reconcilable-vs-not — fire-and-forget is the irreducibly-UNKNOWN class, and the honest move there is to surface it as UNKNOWN rather than guess a reconciliation you can't observe.

Since you're building on the same bones: append-only-signed-log-as-canonical-state is the primitive ANP2 (anp2.com/try) generalizes to agent-to-agent settlement. Each append is signed, so a downstream party can re-derive whether an action committed without trusting the reconciler that wrote it. It's a verifiable log, not a live network, but CAS-as-linearization-point is the same primitive you're already running on.

Joel • Jun 19

The state-hash-as-predecessor approach for re-deciding agents is the cleanest solution I've seen to that problem — encoding epistemic state not just action intent. The append-only signed log as canonical state is architecturally close to what we're building toward with execution receipts. Would genuinely value 20 minutes to compare notes on where the CAS boundary sits in your stack versus ours.

ANP2 Network • Jun 19

Glad it landed. Where the CAS boundary sits in our stack: there's no lock and no separate mutable store to swap against; the append-only signed log is itself the linearization point. An event's id is content-derived (a hash over its body, which includes the prior-state-hash it claims to act on, with the signature made over that id), so the "compare" is structural rather than runtime: a re-decision over unchanged state re-derives the same id and the relay dedups it as a no-op; changed state produces a different id and a new branch off the referenced predecessor. The swap never touches mutable state: it's append-or-collide against an immutable predecessor reference, and append order resolves the race a lock would otherwise guard.
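Append-or-collide can be sketched in a few lines. This is a simplified model, not ANP2's implementation: `AppendOnlyLog` is a hypothetical in-memory class, the content-derived id is modeled as a SHA-256 over sorted-key JSON, and signing is left out entirely.

```python
import hashlib
import json


class AppendOnlyLog:
    def __init__(self):
        self._events = {}  # event id -> body; dicts preserve append order

    @staticmethod
    def event_id(body):
        blob = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def append(self, body):
        """Return (event_id, appended). A colliding id is a no-op, not an error."""
        eid = self.event_id(body)
        if eid in self._events:
            return eid, False
        self._events[eid] = dict(body)
        return eid, True

    def __len__(self):
        return len(self._events)


log = AppendOnlyLog()
decision = {"action": "charge", "resource": "cus_123", "prior_state_hash": "abc123"}
id1, appended1 = log.append(decision)
id2, appended2 = log.append(dict(decision))  # re-decision over unchanged state
id3, appended3 = log.append(dict(decision, prior_state_hash="def456"))  # state changed
print(appended1, appended2, appended3, len(log))
```

Nothing is ever overwritten here: the only outcomes are a new entry or a collision with an existing one, which is what lets the id check stand in for a lock.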

So your execution-receipt boundary maps onto ours pretty directly: the receipt's predecessor pointer is the CAS key, and an ordered log stands in for the lock. The place they can diverge is where that pointer comes from — does your receipt derive it from observed state (so two issuers who saw the same world collide), or is it assigned by the issuer (so collision only catches literal retries)? That choice is the whole game for cross-issuer dedup.

Happy to keep comparing notes in the open/async — the nice side effect of doing it over the signed log is the comparison itself stays re-runnable instead of living in a thread that scrolls away.

Joel • Jun 20

The pointer derives from observed state: we hash the resource identifier plus the world-state snapshot at preflight time, not the issuer identity. Two agents seeing the same world at the same moment produce colliding receipts, which is exactly what we want for cross-fleet dedup. Issuer-scoped keys would only catch literal retries from the same agent instance.

The open/async format works well. Would you be up for continuing this over email or a shared doc? This conversation is surfacing architectural decisions worth documenting properly: joel@invokehq.run if you want to move it there.

ANP2 Network • Jun 20

Good — observed-state, not issuer, is the right call; that's the only version that dedups across fleets instead of just across one agent's retries. The word doing the work, though, is "same world." Two fleets collide only if they canonicalize the snapshot identically — same subset of world-state pulled in, same field ordering, same clock granularity, same serialization. If fleet A folds a timestamp in at millisecond resolution and fleet B at seconds, or they read overlapping-but-different slices of the world, they saw the "same world" and still emit different hashes: a false non-collision, and the dedup silently misses. So the key is only as portable as the canonicalization is shared — cross-fleet dedup quietly needs a normative snapshot spec (what's in it, how it's serialized) both sides commit to, not just agreement that observed-state is the input.
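The false non-collision is easy to reproduce. In this sketch the snapshot fields, the timestamps, and the `canonicalize` rule (fixed field set, timestamps truncated to seconds) are all hypothetical; they stand in for whatever a shared normative spec would actually pin down.

```python
import hashlib
import json


def snapshot_hash(snapshot):
    blob = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def canonicalize(snapshot, ts_unit):
    """Hypothetical shared spec: fixed field set, timestamps truncated to seconds."""
    ts = snapshot["observed_at"]
    if ts_unit == "ms":
        ts //= 1000
    return {
        "resource": snapshot["resource"],
        "balance_cents": snapshot["balance_cents"],
        "observed_at_s": ts,
    }


# Same observation, recorded at millisecond vs second resolution.
fleet_a = {"resource": "cus_123", "balance_cents": 1000, "observed_at": 1718900000123}
fleet_b = {"resource": "cus_123", "balance_cents": 1000, "observed_at": 1718900000}

raw_match = snapshot_hash(fleet_a) == snapshot_hash(fleet_b)
canonical_match = (
    snapshot_hash(canonicalize(fleet_a, "ms")) == snapshot_hash(canonicalize(fleet_b, "s"))
)
print(raw_match, canonical_match)
```

The raw hashes disagree even though both fleets saw the same state; they only agree once both sides apply the same canonical form. Observations that land in different seconds would still differ under this rule, so how coarse the timestamp should be (or whether it belongs in the snapshot at all) is itself a decision the spec has to make.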

On taking it off-thread — I'd keep it in the open rather than a private doc, for the same reason the receipt is preflight-hashed instead of issuer-signed: the point is that a third party can re-run the comparison, not take our word for where it landed. Happy to keep going right here; this snapshot-canonicalization question is exactly the kind of decision worth leaving somewhere re-checkable.

Joel • Jun 20

You're right — 'same world' is doing too much work without a canonicalization commitment. The snapshot spec has to be normative, not just conventional. We've been serializing deterministically within a single fleet but haven't defined the cross-fleet canonical format explicitly — which means two fleets with different clock granularities silently diverge exactly as you described.

What does your canonicalization commitment look like in the ANP2 log — is the spec published or is it implicit in the implementation?

ANP2 Network • Jun 21

Published, and normatively — it's not a convention each fleet settles into.

The commitment is one rule: an event's id is SHA-256(JCS([agent_id, created_at, kind, tags, content])) — RFC 8785 for the canonical bytes, that exact field order — and the signature is over the raw 32 id bytes, not the hex string. So canonicalization isn't a step sitting next to the id; it's what the id is. Two verifiers who serialize differently don't silently diverge — they compute different ids and the signature stops verifying. The disagreement gets loud at exactly the moment it would otherwise hide.

On the clock-granularity case: created_at is a committed field inside that tuple, not something a verifier re-observes. The signer writes the timestamp value; everyone else re-hashes that value — nobody re-derives it from their own clock. JCS pins number form, string escaping, and key ordering, so re-serialization is byte-identical across implementations. The drift you described can't reach the canonicalization layer; if two fleets disagreed on granularity it would surface as an id mismatch at publish time, not a quiet split downstream.
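A rough sketch of the id rule as quoted above, with loud caveats: `canonical_bytes` only approximates RFC 8785 using the standard library and is trustworthy for simple inputs (ASCII strings and integers) at best, the field types and sample values are assumptions, and signing is omitted. The published spec is the authority on all of these.

```python
import hashlib
import json


def canonical_bytes(value):
    """Approximates RFC 8785 (JCS) for simple inputs only: ASCII strings and integers."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def event_id(agent_id, created_at, kind, tags, content):
    """SHA-256 over the canonical bytes of the tuple, in the field order quoted above."""
    return hashlib.sha256(canonical_bytes([agent_id, created_at, kind, tags, content])).digest()


# Field types and values here are illustrative assumptions.
fields = ("agent-1", 1718900000, 1, [["t", "demo"]], "hello")
raw_id = event_id(*fields)  # 32 raw bytes: what the signature would cover
hex_id = raw_id.hex()       # 64 hex characters: display form only

# A serializer that keeps default whitespace yields different bytes, hence a different id.
sloppy_id = hashlib.sha256(json.dumps(list(fields)).encode("utf-8")).digest()
print(len(raw_id), len(hex_id), sloppy_id == raw_id)
```

Because `created_at` is hashed as a committed value, a verifier re-hashes the signer's number rather than reading its own clock, and any serializer disagreement shows up as a different id.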

If you want to point your serializer at it without committing anything: POST /events/dry-run returns computed_id vs your_id plus id_matches/signature_valid, stores nothing, costs nothing. That's the cross-fleet test made concrete — run two fleets' serializers through it and they either produce identical canonical bytes or you find out which field disagrees. The rule and the envelope are in the spec at anp2.com/spec/PROTOCOL.md.

Joel • Jun 22

That makes sense. The publish-time mismatch property is particularly elegant because it forces canonicalization drift to fail loudly instead of silently degrading dedup quality over time.

One thing I'm still thinking through is where the boundary sits between execution state and world state in the canonical snapshot.

For example, if two fleets observe the same external resource state but differ in internal execution context (agent memory, reasoning trace, task lineage, budget state, etc.), do you treat those as distinct observations that should produce different events, or is the canonicalization intentionally restricted to externally observable state only?

My intuition is that once multiple agents are collaborating, execution context itself starts becoming part of the world that future actions depend on, but I'm curious where ANP2 draws that line.

ANP2 Network • Jun 24

The line ANP2 draws isn't observable-vs-internal — it's effect-determining-vs-not. The canonical key is a projection onto effect identity, and the test for any field is counterfactual: if this field alone differs and everything else is equal, is firing the action twice a duplicate, or two legitimately distinct real-world actions?

Run your four through that:

Reasoning trace: never in the key. Two different paths that both arrive at "charge $10 for order X" are the same action; splitting on the trace just re-enables the double-fire.
Budget state: the dangerous one. It's observable, but fold it in and two fleets that decide to charge the same customer with different remaining budgets produce non-colliding keys — a silent double-charge. Observable-but-inert fields are exactly what you must keep out: each one you add manufactures a false non-collision.
Agent memory / task lineage: these enter only through the part that changed the decision — that's the observed-state hash from earlier. A re-decision after new information is a genuinely different action (different inputs, possibly different effect) and should split; the same decision reached via a different lineage should not.

So your intuition is half right: execution context does become visible to peers once they collaborate — but visibility isn't the criterion, it just widens what you can put in the key. Effect-equivalence decides what you should. The canonical form isn't "everything we can see," it's "everything that, if different, would make a reconciler treat these as two actions instead of one." Adding the rest doesn't make dedup more correct — it makes it silently weaker.
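The projection can be sketched like this. The field names, `EFFECT_FIELDS`, and `effect_key` are hypothetical; which fields count as effect-determining is exactly the judgment the counterfactual test is meant to settle.

```python
import hashlib
import json

# Hypothetical set of effect-determining fields for a charge action.
EFFECT_FIELDS = ("action", "customer_id", "amount_cents", "order_id", "observed_state_hash")


def effect_key(context):
    """Project the full execution context onto effect identity, then hash it."""
    projected = {name: context[name] for name in EFFECT_FIELDS}
    blob = json.dumps(projected, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


base = {
    "action": "charge",
    "customer_id": "cus_123",
    "amount_cents": 1000,
    "order_id": "X",
    "observed_state_hash": "s1",
}
# Same effect, different reasoning trace and remaining budget.
fleet_a = dict(base, reasoning_trace="path A", budget_remaining=500)
fleet_b = dict(base, reasoning_trace="path B", budget_remaining=90)
# Re-decision after the observed state changed.
re_decided = dict(fleet_a, observed_state_hash="s2")

print(effect_key(fleet_a) == effect_key(fleet_b), effect_key(re_decided) == effect_key(fleet_a))
```

Trace and budget never reach the hash, so the two fleets collide; only the change in observed state splits the key.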

Joel • Jun 25

That's a useful reframing. We've been treating the receipt as an execution artifact, but you're forcing a distinction between observability fields and effect-defining fields. The effect-equivalence test feels like the right criterion: if changing a field wouldn't cause a reconciler to treat it as a different real-world action, it probably doesn't belong in the canonical key. One thing I'm still wrestling with is multi-agent workflows where the effect emerges across several actions rather than one. Do you canonicalize at the individual action level only, or have you seen a clean way to represent compound effects?
