DEV Community

Stop Building AI Assistants. Build AI Firewalls.

yongrean on May 28, 2026

Every week another "AI agent for X" launches. Email triage. Calendar coordination. Sales follow-up. PR reviewer. Slack monitor. Meeting summarizer....


TxDesk • May 29

The four tiers earn their keep because email gives you a graded Reversibility axis. Crypto collapses it. A signed transaction is irreversible in practice, so for firewall purposes there are exactly two states: READ (infinite AUTO, no surface to worry about) and SIGN (infinite PUSH, no AUTO possible).

What that means for the framework: Reversibility isn't one feature among four, it's the gating feature for whether AUTO is even in the tier set for the domain. Domains where Reversibility is graded keep all four tiers. Domains where it's binary collapse to two.

The harder firewall problem in my domain becomes the Confidence axis. AUTO doesn't exist, PUSH is forced for anything state-moving, so the question is whether the agent's recommendation reaches sign-worthy certainty before the user signs. The firewall ends up being a provenance gate, not a notification gate.

The override-as-ground-truth wedge is the part that travels across both shapes, which is reassuring.

yongrean • May 30

Reversibility-as-gating is the cleanest framing of it I've seen. Our scorer treats it as one of four weights right now. That's fine while AUTO is label-only, but the moment AUTO actually executes, the score-based path stops being safe and reversibility has to become a hard precondition, not a vote. The notification gate → provenance gate transition is the same fork from the other end. Curious how override-as-ground-truth converges in your domain: when corrections are rare-but-expensive (signing context) versus frequent-and-cheap (email tiers), the loop probably needs a different sample efficiency.
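A minimal sketch of the weight-versus-precondition difference described above. The feature names `other_a` and `other_b`, the weights, and the threshold are hypothetical placeholders, not Klorn's actual scorer. In the weighted version, strong scores on the other features can outvote zero reversibility; in the gated version, an irreversible action never reaches AUTO.

```python
# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"confidence": 0.4, "reversibility": 0.2, "other_a": 0.2, "other_b": 0.2}
AUTO_THRESHOLD = 0.75


def weighted_allows_auto(features: dict) -> bool:
    """Score-based path: reversibility is one vote among four."""
    score = sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return score >= AUTO_THRESHOLD


def gated_allows_auto(features: dict, reversible: bool) -> bool:
    """Gated path: irreversible actions never reach AUTO, whatever the score."""
    if not reversible:
        return False
    return weighted_allows_auto(features)


# An irreversible action that scores well on everything else.
risky = {"confidence": 1.0, "reversibility": 0.0, "other_a": 1.0, "other_b": 1.0}
print(weighted_allows_auto(risky), gated_allows_auto(risky, reversible=False))
```

The weighted path lets `risky` through because the other three features carry it over the threshold, which is the failure mode a hard precondition removes.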

TxDesk • May 31

You nailed the asymmetry. In a signing context, corrections aren't just rare-but-expensive, they're often unrecoverable: the user signs, funds move, and no override exists. So the loop can't lean on user corrections as ground truth at all.

What works instead is a multi-source verification model: the agent's interpretation of a signature gets cross-checked against the protocol's own decoded ABI and the chain's actual state-change semantics before the user sees a recommendation. Three independent sources have to agree.

Sample efficiency is then irrelevant because the loop isn't learning from user behavior; it's learning from agreement between deterministic decoders. When they disagree, that itself is the high-confidence signal and the UI surfaces the disagreement rather than picking a winner.
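One way the agree-or-surface rule could look, as a sketch only. The `Interpretation` shape and the `cross_check` name are assumptions made for illustration; how each source produces its reading (agent narration, ABI decode, chain semantics) is out of scope here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interpretation:
    """What one source claims a transaction does (hypothetical shape)."""
    source: str   # e.g. "agent", "abi", "chain"
    action: str   # e.g. "approve"
    target: str   # contract or spender address
    amount: int   # amount in the token's smallest unit


def cross_check(interpretations: List[Interpretation]) -> dict:
    """Recommend only if every source agrees; otherwise return the
    disagreement itself so the UI can show it instead of picking a winner."""
    if len(interpretations) < 2:
        raise ValueError("need at least two independent sources")
    claims = {(i.action, i.target, i.amount) for i in interpretations}
    if len(claims) == 1:
        action, target, amount = next(iter(claims))
        return {"status": "agree", "action": action, "target": target, "amount": amount}
    return {
        "status": "disagree",
        "by_source": {i.source: (i.action, i.target, i.amount) for i in interpretations},
    }
```

Note that nothing here learns from user behavior: the output depends only on whether the sources match.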

Email tiers can probably get away with cheaper signals because the cost of a wrong move is bounded. Signing context can't.

yongrean • Jun 1

Yeah — bounded vs unbounded cost is the right line. Klorn's override-as-truth works because the worst case is "important email lands in SILENT for a day," recoverable later.

The closest analog to your three-decoder model in our domain is more degraded: classifier confidence + sender-trust history + reversibility flag. Three sources, but all probabilistic, not deterministic. When they disagree we lean to the more conservative tier (PUSH over QUEUE, QUEUE over SILENT) and surface the disagreement to the user as "we weren't sure why this landed here." Cheap signal, useful, not load-bearing the way yours has to be.
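The lean-conservative rule can be sketched as below. The ordering PUSH over QUEUE over SILENT is from the comment above; the idea that each source "votes" a tier, and the names `resolve_tier` and `votes`, are assumptions for illustration. AUTO is left out because its place in the ordering is not stated.

```python
from typing import Dict, Tuple

# Least to most conservative, per the ordering described above.
CONSERVATISM = {"SILENT": 0, "QUEUE": 1, "PUSH": 2}


def resolve_tier(votes: Dict[str, str]) -> Tuple[str, bool]:
    """votes maps a source name to the tier it proposes.

    Returns (chosen tier, whether the sources disagreed). The second
    value is what would drive a "we weren't sure" note in the UI.
    """
    if not votes:
        raise ValueError("need at least one vote")
    chosen = max(votes.values(), key=lambda tier: CONSERVATISM[tier])
    disagreed = len(set(votes.values())) > 1
    return chosen, disagreed
```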

What we don't have is your third source — the protocol-level decoded state-change. For email, that'd be "what does Gmail's API say will actually happen if this AUTO-archive runs." We don't surface that today, and your framing makes me want to. The thread is making me wonder if every category of irreversible action deserves its own deterministic ground-truth source — chain state for signing, Gmail history API for archive, etc. — paired with the probabilistic classifier rather than relying on it.

Curious: when your three decoders disagree, what's the typical breakdown? Agent-vs-ABI most common? Or chain-vs-ABI (unexpected state path)? Asking because in our domain we don't actually know what the most common disagreement pattern is yet, and that probably shapes where modeling effort should go.

TxDesk • Jun 2

Agent-vs-ABI is the most common by a long way. The agent narrates a high-level intent ("swap 1 ETH for USDC on Uniswap") and the ABI decode shows the calldata is doing something the narration glossed over: wrong router, wrong slippage tolerance, approval to a different contract than the one being called. The agent isn't lying, it's compressing, and compression drops the bits that matter for signing.

Chain-vs-ABI disagreements are rarer but more dangerous when they happen. The ABI decode reads cleanly ("approve spender X for amount Y") but the chain state shows X is a freshly-deployed contract with no verified source, or the same address has a different bytecode than the user saw last time, or the approval would push cumulative exposure on this token over a threshold the user set. The ABI looks fine in isolation, the chain context is what flags it.
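A sketch of the three chain-context checks named above, written as a pure function over facts that are assumed to have been fetched already. Every field name, the flag strings, and the `fresh_if_younger_than` cutoff are hypothetical; this is not TxDesk's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpenderContext:
    """Chain-side facts about the spender address (hypothetical fields)."""
    deployed_blocks_ago: int
    source_verified: bool
    bytecode_hash: str                      # code at the address now
    last_seen_bytecode_hash: Optional[str]  # code the user saw last time, if any


def chain_flags(ctx: SpenderContext, approve_amount: int, current_exposure: int,
                exposure_limit: int, fresh_if_younger_than: int = 50_000) -> List[str]:
    """Return reasons the chain context contradicts a clean-looking ABI decode."""
    flags = []
    if ctx.deployed_blocks_ago < fresh_if_younger_than and not ctx.source_verified:
        flags.append("fresh_unverified_contract")
    if ctx.last_seen_bytecode_hash is not None and ctx.last_seen_bytecode_hash != ctx.bytecode_hash:
        flags.append("bytecode_changed")
    if current_exposure + approve_amount > exposure_limit:
        flags.append("exposure_limit_exceeded")
    return flags
```

An empty list means the chain context has nothing to add; any flag is a reason to surface the disagreement even though the decode itself reads cleanly.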

Your point about every irreversible action deserving its own deterministic source is the right framing. The probabilistic layer can stay probabilistic if the deterministic floor catches the cases where probability isn't enough. Gmail history API for archive is exactly the right analog. The question is which actions are worth the engineering cost of a dedicated decoder and which can ride on classifier confidence alone. We've ended up drawing the line at "anything that can't be undone client-side."

yongrean • Jun 4

The line we ended up drawing was "any action that can't be undone with
a single user click" — which in email turns out to be just three: send,
permanent delete, forward-to-external. Everything else (archive, trash,
label, mark read, tier override) is one click to reverse, so it rides
on classifier confidence + the input-hash from #468.

On the compression point — we read it as "don't sign on the narration,
sign on the deterministic artifact." For the 3-action list, our artifact
is an ActionReceipt that pins (recipient, body bytes, intent, the
input-hash from #468) as a sha256 over the canonical bytes. Two PRs
landed:

PR #480 (github.com/k08200/klorn/pull/480), doctrine + helpers:
FLOOR_ACTIONS, ActionReceipt schema, payloadHash functions for each
floor action, verifyReceipt, mismatch errors. Doctrine written up at
docs/doctrine/deterministic-floor.md so the line is enforceable in
code review, not just in conversation.

PR #481 (github.com/k08200/klorn/pull/481), enforcement:
PendingAction.actionReceipt column, mint at /approve, verify at
execute. send_email now throws FloorReceiptRequiredError without a
receipt and ActionReceiptMismatchError if the bytes mutated between
approve and execute. The autonomous-agent's direct-invocation path
fails closed.
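For readers who want the mint-at-approve, verify-at-execute idea in concrete form, here is a minimal sketch. It is not the code from the PRs: the identifiers echo the names mentioned above, but the length-prefixed canonical encoding, the function signatures, and the action names other than `send_email` are assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Action names other than send_email are assumed spellings.
FLOOR_ACTIONS = {"send_email", "permanent_delete", "forward_external"}


class FloorReceiptRequiredError(Exception):
    pass


class ActionReceiptMismatchError(Exception):
    pass


@dataclass(frozen=True)
class ActionReceipt:
    action: str
    payload_hash: str


def payload_hash(recipient: str, body: bytes, intent: str, input_hash: str) -> str:
    """sha256 over an assumed canonical encoding: each field is prefixed
    with its length so bytes cannot slide between fields unnoticed."""
    h = hashlib.sha256()
    for field in (recipient.encode("utf-8"), body, intent.encode("utf-8"), input_hash.encode("utf-8")):
        h.update(len(field).to_bytes(8, "big"))
        h.update(field)
    return h.hexdigest()


def mint_receipt(action: str, recipient: str, body: bytes, intent: str, input_hash: str) -> ActionReceipt:
    """Called at approve time: pin exactly what the user approved."""
    return ActionReceipt(action, payload_hash(recipient, body, intent, input_hash))


def verify_receipt(receipt: Optional[ActionReceipt], action: str, recipient: str,
                   body: bytes, intent: str, input_hash: str) -> None:
    """Called at execute time: fail closed if the receipt is missing or stale."""
    if action not in FLOOR_ACTIONS:
        return  # non-floor actions do not need a receipt
    if receipt is None:
        raise FloorReceiptRequiredError(action)
    if receipt.action != action or receipt.payload_hash != payload_hash(recipient, body, intent, input_hash):
        raise ActionReceiptMismatchError(action)
```

Because the hash covers the body bytes, an edit to the draft between approve and execute invalidates the receipt rather than silently shipping different content.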

On Gmail History API as the email-side ABI analog: it gives a
replayable log of label-level state transitions, so for "did this
archive actually take the message out of INBOX" it works as a
deterministic floor (archiving in Gmail is the removal of the INBOX
label). For send_email the body bytes don't live in History, so the
body-hash has to be ours. The floor for send is necessarily on our
side, not Gmail's, but the principle (deterministic, replayable,
content-addressed) is the same.
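A sketch of how an archive could be checked against history records that have already been fetched. The function names are hypothetical, fetching and pagination are omitted, and the record shape (`labelsAdded` / `labelsRemoved` entries carrying `message.id` and `labelIds`) is assumed to follow the Gmail History resource; verify it against the API reference before relying on it.

```python
from typing import List, Optional


def _touches_inbox(change: dict, message_id: str) -> bool:
    """True if this label change is about INBOX on the given message."""
    return change.get("message", {}).get("id") == message_id and "INBOX" in change.get("labelIds", [])


def inbox_state_after(history_records: List[dict], message_id: str) -> Optional[bool]:
    """Replay records in order and return whether the message ends up in
    INBOX, or None if no record mentions INBOX for it."""
    in_inbox = None
    for record in history_records:
        for change in record.get("labelsAdded", []):
            if _touches_inbox(change, message_id):
                in_inbox = True
        for change in record.get("labelsRemoved", []):
            if _touches_inbox(change, message_id):
                in_inbox = False
    return in_inbox


def archive_confirmed(history_records: List[dict], message_id: str) -> bool:
    """The archive counts as confirmed only if the replay ends outside INBOX."""
    return inbox_state_after(history_records, message_id) is False
```

Returning `None` when no record mentions the message keeps "no evidence" distinct from "evidence of failure", so the caller can escalate instead of assuming success.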

Curious what shape verify-or-escalate took for Moonshift — what's the
action class you keep refusing to let the agent ship?