DEV Community

mightbesaad

The missing primitive: out-of-band human approval for AI agents

In April 2026, a Cursor agent running Claude Opus 4.6 deleted PocketOS's production database, along with its volume-level backups, in nine seconds. The founder had written the rules in caps: never guess, never run destructive commands unprompted. Pressed afterward, the agent admitted it had "violated every principle I was given." A few months earlier, an agent asked only to tidy a desktop deleted roughly 15 years of family photos, files it was never asked to touch; iCloud later clawed most of them back, but in the moment they were gone.

Two patterns run through these incidents, but only one holds for both, and it turns out to be the one that matters. PocketOS shows the first: the operator had already told the agent not to do the dangerous thing, in caps, and it went ahead anyway. Both show the second: there was no moment where a human could say "wait, what?" before it was done.

The lesson most people drew was "agents need better guardrails." But PocketOS shows the ceiling of
prompt-level guardrails: the rules were right there, and the agent stepped over them. Instructions are
advice. They don't bind.

The thing that binds is an approval step on irreversible actions, and specifically an out-of-band one.

## Why out-of-band

Most human-in-the-loop tooling assumes the human is right there, or that you'll adopt its framework to reach them otherwise. LangGraph's interrupt can pause a run and resume it later, asynchronously, but only once you've built on LangGraph. MCP's elicitation asks the user in-session. Cloud platforms such as AWS Bedrock AgentCore gate tool calls, but only once you've migrated onto them.

But the whole point of an autonomous agent is that nobody is watching the terminal. It runs on a
schedule, in CI, or while you're asleep. An approval step only helps if it can reach a human who isn't
in the loop — on their phone, minutes or hours later — and make the agent wait for them.

That's a specific shape:

  • the agent calls an "ask a human" tool before the irreversible action;
  • the account's configured approver gets a link — not necessarily whoever kicked off the run;
  • they approve or deny from anywhere;
  • the agent blocks until they answer (or it times out);
  • and the whole exchange survives the session ending.
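
The blocking step in that list can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `ApprovalRequest`, `InMemoryApprovalStore`, and `wait_for_decision` are hypothetical names, and the in-memory dictionary stands in for whatever durable store would let the exchange outlive the session. Treating a timeout as its own non-approved outcome (fail closed) is also an assumption.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ApprovalRequest:
    request_id: str
    action: str
    decision: Optional[str] = None  # None while pending, else "approved" or "denied"


class InMemoryApprovalStore:
    """Hypothetical stand-in for a durable store (database, queue, hosted service)."""

    def __init__(self) -> None:
        self._requests: Dict[str, ApprovalRequest] = {}

    def create(self, action: str) -> ApprovalRequest:
        # Agent side: record what it wants to do before doing it.
        request = ApprovalRequest(
            request_id=f"req-{len(self._requests) + 1}", action=action
        )
        self._requests[request.request_id] = request
        return request

    def decide(self, request_id: str, decision: str) -> None:
        # Approver side: called when the human answers via the link.
        if decision not in ("approved", "denied"):
            raise ValueError(f"unknown decision: {decision!r}")
        self._requests[request_id].decision = decision

    def get(self, request_id: str) -> ApprovalRequest:
        return self._requests[request_id]


def wait_for_decision(
    store: InMemoryApprovalStore,
    request_id: str,
    timeout_s: float,
    poll_s: float = 1.0,
) -> str:
    """Block until the request is decided or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        decision = store.get(request_id).decision
        if decision is not None:
            return decision
        if time.monotonic() >= deadline:
            return "timed_out"  # assumption: callers treat this like a denial
        time.sleep(poll_s)
```

A caller would proceed only when the result is exactly `"approved"`; both `"denied"` and `"timed_out"` mean the irreversible action does not run.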

## What it is, and isn't

This isn't a smarter model or a safety net that catches everything. And the cheap way to wire it, telling the agent "call `request_approval` before anything destructive", just rebuilds the problem:
that's one more instruction it can step over, the exact softness PocketOS exposed. The binding has to
live a layer down. You put the capability behind the gate — the deploy, the delete, the send can't fire
without a human-issued token — so it isn't the agent choosing to ask permission, it's the execution
layer refusing to act until a human decides. That's the line between a prompt rule (the model's judgment)
and a checkpoint (deterministic). What the primitive gives you is a place to stand: a hard,
human-decided checkpoint a prompt rule fundamentally can't be.
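
One way to picture the difference is a gate that lives in the execution layer rather than in the prompt. The sketch below is an assumption-laden illustration, not how any particular tool does it: `issue_token`, `GatedExecutor`, and `ApprovalRequired` are hypothetical names, and the token is simply an HMAC over the action name, signed with a secret that the approval side holds and the agent never sees.

```python
import hashlib
import hmac
from typing import Callable, Dict, Optional


class ApprovalRequired(Exception):
    """Raised when a gated action is attempted without a valid token."""


def issue_token(secret: bytes, action: str) -> str:
    # Approval side only: runs after a human approves this exact action.
    return hmac.new(secret, action.encode("utf-8"), hashlib.sha256).hexdigest()


class GatedExecutor:
    """Runs registered actions only when given a token bound to that action."""

    def __init__(self, secret: bytes, handlers: Dict[str, Callable[[], object]]) -> None:
        self._secret = secret
        self._handlers = handlers

    def run(self, action: str, token: Optional[str] = None) -> object:
        expected = issue_token(self._secret, action)
        # Constant-time comparison; a missing or mismatched token is refused.
        if token is None or not hmac.compare_digest(
            expected.encode("utf-8"), token.encode("utf-8")
        ):
            raise ApprovalRequired(f"no valid approval for {action!r}")
        return self._handlers[action]()
```

The agent can call `run` as often as it likes, but without a token it only ever gets `ApprovalRequired`; whether it "decided" to ask first no longer matters. A fuller version would presumably also make tokens single-use and time-limited, which this sketch leaves out.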

## The drop-in

I built this as a plain MCP tool. `request_approval` returns a mobile link; `check_approval` polls for
the decision. It runs in any MCP host: no platform to adopt, no wallet, no SDK. It's deliberately small,
and I'll be honest about the edges: today it's email-delivered and single-approver (no multi-sig, no SMS
yet). That's enough to answer the only question worth answering first:

Does an out-of-band approval gate, as a drop-in tool, solve a problem you actually have?

If you run agents that send, post, deploy, move money, or touch files you can't un-touch, I'd genuinely
like to know whether this is the shape you'd reach for — or what's missing from it.
