Prompts and RBAC won't stop your agent from refunding someone twice

#ai #security #llm #opensource

If you let an agent call tools that have real side effects — refunds, emails, exports, writes to prod — there's a class of failure that none of the usual safety layers actually catch. I keep running into it while building in this space, so I want to lay out the problem clearly and then show the approach I've landed on. It's opinionated, and I'd genuinely like to be told where I'm wrong.
The problem
Picture an agent that can issue refunds. Something upstream retries a step — a timeout, a transient error, an orchestrator that re-runs the node — and the agent calls the refund tool a second time. Same customer, same invoice. A duplicate refund goes out.
Now look at what was supposed to prevent that, and notice that nothing was technically "broken":

OAuth proved the agent could reach the payment server.
RBAC confirmed the agent's identity was allowed to call the refund tool.
The MCP/tool schema said the payload was well-formed.
The prompt told the agent not to double-refund.

Every one of those checks passed, and the second refund still executed. They all answer questions like "is this agent allowed to use this tool" or "is this a valid call." None of them answer the question that actually matters here:

Should this specific call, with this payload, in this job state, execute right now?

That's a different kind of question, and it's the one I think we're mostly not asking.
Why the usual layers miss it
The reason is that the dangerous failures are stateful, and the layers above are either static or advisory.
"Has this exact refund already gone out?" depends on what already executed — history, not the shape of the current call. A few more in the same family:

the same "your order shipped" email got sent twice
the same export ran twice
a tool loop fired forty times before anyone looked at the bill
an approval that was granted for one action got reused for a whole session
a prior denial didn't stop the agent from trying again

And here's the part that took me a while to internalize: by default, the only place that "what already happened" state lives is the agent's own context. Which is exactly the thing that gets summarized away, truncated, or rewritten — especially on the retry path, which is where the duplicate happens in the first place. The agent is the worst possible place to store the memory of what the agent already did.
Stateless vs stateful checks
It's worth separating two kinds of checks, because the easy ones lull you into thinking you're covered.
| Stateless (easy, do in-process) | Stateful (the ones that bite) |
| --- | --- |
| Is this tool blocked? | Did this exact action already execute? |
| Is the amount over the cap? | Has this loop already run N times? |
| Is this a PII field? | Has the job blown its token/cost budget? |
| Is the destination allowed? | Was this approval scoped to this action, or reused? |
Stateless checks are genuinely useful and you should do them. But they're table stakes, and they're not where the expensive incidents come from.
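To make the split concrete, here is a minimal sketch. Every name in it (`ToolCall`, `overCap`, `ExecutionLog`) is hypothetical and not taken from any library, and the in-memory `Set` only stands in for a durable store.

```ts
// Hypothetical shapes for illustration; not from any real library.
interface ToolCall {
  tool: string;
  amountUsd: number;
  idempotencyKey: string;
}

// Stateless: the answer depends only on the call itself and static policy.
function overCap(call: ToolCall, capUsd: number): boolean {
  return call.amountUsd > capUsd;
}

// Stateful: the answer depends on what has already executed.
// The Set is a stand-in for a durable store that lives outside the agent.
class ExecutionLog {
  private readonly seen = new Set<string>();

  alreadyExecuted(call: ToolCall): boolean {
    return this.seen.has(call.idempotencyKey);
  }

  record(call: ToolCall): void {
    this.seen.add(call.idempotencyKey);
  }
}

const executionLog = new ExecutionLog();
const refund: ToolCall = {
  tool: "stripe.refund",
  amountUsd: 49,
  idempotencyKey: "refund-case-1042-pi_123",
};

// First attempt: both checks pass.
const firstOverCap = overCap(refund, 100);
const firstDuplicate = executionLog.alreadyExecuted(refund);
executionLog.record(refund); // the refund executes once

// Retry: the stateless check gives the same answer, only the stateful one changes.
const retryOverCap = overCap(refund, 100);
const retryDuplicate = executionLog.alreadyExecuted(refund);
```

The retry looks identical to the first attempt from the point of view of `overCap`. Only the check that consults history can tell the two apart.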
The approach: a gate outside the loop
The direction I ended up going is a small gate that wraps each tool call and decides allow / deny / challenge before the call executes, backed by state that lives outside the agent's context:
agent proposes tool call -> gate checks policy + state -> allow / deny / challenge
The non-negotiable design constraint, for me, is that the gate runs outside the model context and outside agent-editable memory. If the agent can rewrite the rule, hand itself the approval, or erase the record of what it already did, it isn't a guardrail — it's a sticky note.
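A rough sketch of that flow follows. It is not AgentPass's implementation: `createGate`, `GateStore`, `Policy`, and the field names are assumptions made up for this example, the tool runs synchronously to keep it short, and the in-memory store stands in for whatever durable boundary you would actually use.

```ts
type Decision = "allow" | "deny" | "challenge";

// Hypothetical shapes for illustration only.
interface ProposedCall {
  tool: string;
  amountUsd: number;
  idempotencyKey: string;
}

interface Policy {
  blockedTools: string[];
  approvalAboveUsd: number;
}

// What would sit behind a durable boundary (a database, a separate service).
// The agent never receives a handle to it.
interface GateStore {
  hasExecuted(key: string): boolean;
  markExecuted(key: string): void;
}

function inMemoryStore(): GateStore {
  const executed = new Set<string>();
  return {
    hasExecuted: (key) => executed.has(key),
    markExecuted: (key) => {
      executed.add(key);
    },
  };
}

function createGate(policy: Policy, store: GateStore) {
  function decide(call: ProposedCall): Decision {
    if (policy.blockedTools.includes(call.tool)) return "deny";
    if (store.hasExecuted(call.idempotencyKey)) return "deny";
    if (call.amountUsd > policy.approvalAboveUsd) return "challenge";
    return "allow";
  }

  // Runs the tool only on "allow". The key is recorded before executing,
  // so this sketch prefers a missed side effect over a duplicated one.
  function run<T>(call: ProposedCall, execute: () => T): { decision: Decision; result?: T } {
    const decision = decide(call);
    if (decision !== "allow") return { decision };
    store.markExecuted(call.idempotencyKey);
    return { decision, result: execute() };
  }

  return { decide, run };
}

const sketchGate = createGate(
  { blockedTools: ["db.drop_table"], approvalAboveUsd: 500 },
  inMemoryStore()
);

let refundsIssued = 0;
const refundCall: ProposedCall = {
  tool: "stripe.refund",
  amountUsd: 49,
  idempotencyKey: "refund-case-1042-pi_123",
};

const first = sketchGate.run(refundCall, () => ++refundsIssued);
const retry = sketchGate.run(refundCall, () => ++refundsIssued);
const large = sketchGate.decide({
  tool: "stripe.refund",
  amountUsd: 900,
  idempotencyKey: "refund-case-2000-pi_456",
});
```

The agent only ever holds `sketchGate`. The policy and the store are captured by the closure, so nothing the model writes into its own context can change them.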
Concretely the gate handles:

Idempotency for side-effectful tools. A refund/email/export carries a call fingerprint or idempotency key the gate remembers outside the loop, so a duplicate is denied even if the agent "forgot" it already ran.
Circuit breakers. Caps on tool-call count, retries, tokens, cost, and runtime, so a runaway loop trips a breaker instead of draining a budget.
Single-action approvals. An approval scoped to a tool + resource + payload hash + amount + expiry. Change any of those and it needs re-approval — you can't reuse one "yes" for the whole session.
PII / data-flow rules. Block specific fields and destinations (e.g. customer_records -> external_email).
Audit. An event on every allow/deny/challenge, so the gate log is the record of what was proposed and what happened.
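As one illustration from the list above, a single-action approval can be sketched as a fingerprint over the scoped fields plus an expiry. The names (`ActionScope`, `ApprovalStore`, `fingerprint`) are hypothetical, consuming the approval on first use is an assumption of this sketch, and a canonical JSON string stands in for the payload hash.

```ts
// Hypothetical shapes for illustration; not from any real library.
interface ActionScope {
  tool: string;
  resource: string;
  amountUsd: number;
  payload: Record<string, string | number | boolean>;
}

// Canonical string for the scope. Payload keys are sorted so that equal
// payloads yield equal fingerprints regardless of key order.
// A real gate would likely store a cryptographic hash of this string.
function fingerprint(scope: ActionScope): string {
  const sortedPayload = Object.keys(scope.payload)
    .sort()
    .map((key) => [key, scope.payload[key]]);
  return JSON.stringify([scope.tool, scope.resource, scope.amountUsd, sortedPayload]);
}

class ApprovalStore {
  // fingerprint -> expiry time in milliseconds
  private readonly approvals = new Map<string, number>();

  grant(scope: ActionScope, ttlMs: number, nowMs: number): void {
    this.approvals.set(fingerprint(scope), nowMs + ttlMs);
  }

  // Returns true and removes the approval if it matches and has not expired,
  // so one "yes" covers exactly one execution (an assumption of this sketch).
  consume(scope: ActionScope, nowMs: number): boolean {
    const key = fingerprint(scope);
    const expiresAtMs = this.approvals.get(key);
    if (expiresAtMs === undefined || nowMs >= expiresAtMs) return false;
    this.approvals.delete(key);
    return true;
  }
}

const approvals = new ApprovalStore();
const approvedRefund: ActionScope = {
  tool: "stripe.refund",
  resource: "payment/pi_123",
  amountUsd: 49,
  payload: { payment_intent: "pi_123", amount: 4900 },
};

approvals.grant(approvedRefund, 60000, 0);
const changedAmount = approvals.consume({ ...approvedRefund, amountUsd: 490 }, 1000);
const exactMatch = approvals.consume(approvedRefund, 1000);
const reused = approvals.consume(approvedRefund, 2000);

approvals.grant(approvedRefund, 60000, 0);
const expired = approvals.consume(approvedRefund, 60000);
```

Changing the amount produces a different fingerprint, so the existing approval simply does not match. The clock is passed in as `nowMs` to keep the expiry behaviour easy to test.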

What it looks like in code
I packaged this as an open-source project called AgentPass (Apache-2.0). Fair warning: it's mine, so take the framing with the appropriate grain of salt. The local TypeScript guard is on npm as @dinpd/ai-agent-guard. Wrapping a tool call looks like this:
```ts
import { createToolGate } from "@dinpd/ai-agent-guard";

// `policy` and the `stripe` client are assumed to be defined elsewhere.
const gate = createToolGate({ policy });

const execution = await gate.run(
  {
    agentId: "support-agent",
    jobId: "case-1042",
    tool: "stripe.refund",
    action: "pay",
    resource: "payment/pi_123",
    amountUsd: 49,
    idempotencyKey: "refund-case-1042-pi_123",
  },
  // Runs only if the gate allows the call.
  () => stripe.refunds.create({ payment_intent: "pi_123", amount: 4900 })
);

if (!execution.executed) {
  return execution.decision; // denied or needs approval; the refund never fired
}
```
The second time that same idempotencyKey shows up, the gate denies it and the underlying stripe.refunds.create never runs. Same idea for loops (the breaker trips), approvals (scoped to that payload hash), and PII (the destination is blocked).
There's a Python CLI for policy/manifests too, and the repo has runnable demos for the refund-dedup, circuit-breaker, PII-egress, and MCP tools/call cases if you want to poke at it rather than take my word for it.
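For a feel of the circuit-breaker case without installing anything, here is a standalone sketch. It does not use the AgentPass API: `CircuitBreaker`, `Limits`, and `tryCall` are hypothetical names, only two of the caps (call count and cost) are modeled, and the per-call cost is an estimate supplied by the caller.

```ts
// Hypothetical shapes for illustration; not the AgentPass API.
interface Limits {
  maxToolCalls: number;
  maxCostUsd: number;
}

interface Usage {
  toolCalls: number;
  costUsd: number;
}

class CircuitBreaker {
  private readonly limits: Limits;
  private readonly usageByJob = new Map<string, Usage>();
  private readonly tripped = new Set<string>();

  constructor(limits: Limits) {
    this.limits = limits;
  }

  // Returns true if the call may proceed, false once the job's breaker has tripped.
  tryCall(jobId: string, estimatedCostUsd: number): boolean {
    if (this.tripped.has(jobId)) return false;

    const usage = this.usageByJob.get(jobId) ?? { toolCalls: 0, costUsd: 0 };
    const next: Usage = {
      toolCalls: usage.toolCalls + 1,
      costUsd: usage.costUsd + estimatedCostUsd,
    };

    if (next.toolCalls > this.limits.maxToolCalls || next.costUsd > this.limits.maxCostUsd) {
      // Stays tripped until something outside the agent resets it.
      this.tripped.add(jobId);
      return false;
    }

    this.usageByJob.set(jobId, next);
    return true;
  }
}

const breaker = new CircuitBreaker({ maxToolCalls: 5, maxCostUsd: 1 });

// Simulate a runaway loop that proposes the same tool call forty times.
let executedCalls = 0;
for (let attempt = 0; attempt < 40; attempt++) {
  if (breaker.tryCall("case-1042", 0.01)) executedCalls++;
}

// Budgets are tracked per job, so an unrelated job is unaffected.
const otherJobAllowed = breaker.tryCall("case-2000", 0.01);
```

Once the breaker trips for a job it stays open, so the remaining attempts in the loop are rejected without touching the tool.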
What this is not
To be clear about scope, because this gets conflated a lot: this is not a replacement for IAM, OAuth, OPA, Cedar, or your MCP gateway. Those decide who you are and what you're broadly allowed to touch. This is the small checkpoint immediately before execution that they don't really cover. It sits alongside them.
Where I'm not sure
I'd rather end on the open questions than a pitch, because I genuinely don't think I have the model fully right:

Where's the correct line between checks you do in-process and state that needs a durable boundary?
If you're running agents against real systems today, are you solving duplicate side effects — and if so, where does the dedup live? Idempotency keys at the API layer, a wrapper, the orchestrator, careful retry logic?
Is a per-call gate the wrong tool for some of this — e.g. should every tool just be idempotent at the source instead?

If you've shipped agents that touch production systems, I'd really like to hear how you handle this. The repo is here if you want to tear into the approach.
