DEV Community

yongrean

Stop Building AI Assistants. Build AI Firewalls.

Every week another "AI agent for X" launches. Email triage. Calendar coordination. Sales follow-up. PR reviewer. Slack monitor. Meeting summarizer.

I've installed enough of them to see the pattern. Here's the dirty secret nobody mentions in the launch posts:

These tools don't reduce your work. They multiply your notifications.

Each AI tool is configured to be helpful by default. "Helpful" means: "I noticed this thing — here's a notification." Stack a dozen of those, and instead of one inbox to ignore you have twelve. The signal-to-noise ratio gets worse every time you add an AI to your workflow.

The mainstream answer is "just configure each one." Sure. Spend four hours tuning notification settings every time you add a tool, and another four hours when one of them ships a "smarter notifications" update. That's not productivity. That's notification janitorial work disguised as setup.

This is a structural problem. Not a configuration problem.

The wrong question

Every AI tool asks the same thing: "Is this important?"

Wrong question. There is no objective "important." Importance depends on you, right now. A Stripe webhook is important when you're debugging a checkout flow. The same webhook is pure noise during a deep work block. A Slack message from your cofounder is critical at 11am Tuesday and irrelevant at 11pm Friday.

The right question is:

Is this urgent enough to interrupt me, right now, given what I'm doing?

That's not a question any individual AI agent can answer. It's a layer above all your AI agents. None of them have the context. None of them know what the others are doing. None of them know how you're spending the next hour.

So they all default to "I'll just send you a notification, you decide." Which is exactly the experience you have right now: drowning.

What an AI firewall actually looks like

I'm building that layer. It's called Klorn. Here's how it works in practice — and what's already shipping vs what's scope-deferred.

Every incoming email goes through a 4-tier classification:

| Tier | Behavior | PoC state |
| --- | --- | --- |
| PUSH | Wakes you up. Phone notification. | Classified + alert ✅ |
| QUEUE | Review on your own schedule. | Classified + queued ✅ |
| SILENT | Recorded. Never interrupts. | Classified + logged ✅ |
| AUTO | Reversible, hands-off. Low-risk actions execute; external-facing actions stay approval-gated. | Partial execution: LOW-risk internal (classify, mark read, briefing) auto-executes. MEDIUM (send email, create event) and HIGH (delete) always go through an approve button. |

That's the entire surface. No "Call" tier. No fancy automations. Narrow on purpose.

The tier is decided by a 4-feature scorer:

  • Confidence — how clearly the signal type maps to a tier
  • Sender trust — your historical reply rate and meeting acceptance for this contact
  • Reversibility — can the wrong tier be undone without consequence?
  • Urgency — actual urgency signals, not "URGENT!!!" in the subject line

The scorer agrees with my hand labels on 80% of 50 real emails. That's the Day 7 PoC gate, met.
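To make the four features concrete, here is a minimal sketch of how such a scorer could map them to a tier. Everything in it is an assumption for illustration: the `Features` shape, the `scoreTier` name, and every threshold are hypothetical, not Klorn's actual values.

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";

// Each feature is assumed to be normalized to [0, 1]. Field names are hypothetical.
interface Features {
  confidence: number;    // how clearly the signal type maps to a tier
  senderTrust: number;   // e.g. historical reply rate for this contact
  reversibility: number; // 1 = a wrong tier can be undone without consequence
  urgency: number;       // real urgency signals, not subject-line shouting
}

// Illustrative thresholds only; a real scorer would tune these against labels.
function scoreTier(f: Features): Tier {
  // Urgency only counts as much as the sender is trusted.
  const interrupt = f.urgency * f.senderTrust;
  if (interrupt >= 0.6 && f.confidence >= 0.5) return "PUSH";
  // AUTO needs high confidence, near-total reversibility, and low urgency.
  if (f.confidence >= 0.8 && f.reversibility >= 0.9 && f.urgency < 0.2) return "AUTO";
  if (interrupt >= 0.2) return "QUEUE";
  return "SILENT";
}
```

The point of the shape is that "URGENT!!!" from an untrusted sender multiplies down to a low interrupt score, so it cannot reach PUSH on its own.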

Override is GROUP BY, not LLM

When the firewall gets a tier wrong, one click moves the email to the right tier. Your correction doesn't just fix this one email — it becomes ground truth for the next prompt.

The override loop is the wedge. The classifier is replaceable; the alignment signal isn't. Every disagreement is signal, not noise.
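A rough sketch of what "GROUP BY, not LLM" could mean in code: corrections are counted per sender, and the most frequent correction becomes the hint carried into the next classification. The `Override` shape and both function names are hypothetical, and the real storage is presumably a SQL table rather than an in-memory array.

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";
type TierCounts = Partial<Record<Tier, number>>;

interface Override {
  sender: string;
  predicted: Tier; // what the classifier chose
  corrected: Tier; // where the user moved it
}

const TIERS: Tier[] = ["PUSH", "QUEUE", "SILENT", "AUTO"];

// Same idea as: SELECT sender, corrected, COUNT(*) ... GROUP BY sender, corrected
function groupOverrides(overrides: Override[]): Record<string, TierCounts> {
  const bySender: Record<string, TierCounts> = {};
  for (const o of overrides) {
    const counts: TierCounts = bySender[o.sender] ?? {};
    counts[o.corrected] = (counts[o.corrected] ?? 0) + 1;
    bySender[o.sender] = counts;
  }
  return bySender;
}

// Most frequent correction for a sender; ties go to the tier listed first in TIERS.
function preferredTier(grouped: Record<string, TierCounts>, sender: string): Tier | undefined {
  const counts = grouped[sender];
  if (!counts) return undefined;
  let best: Tier | undefined;
  let bestCount = 0;
  for (const tier of TIERS) {
    const n = counts[tier] ?? 0;
    if (n > bestCount) {
      best = tier;
      bestCount = n;
    }
  }
  return best;
}
```

No model call is involved in the aggregation itself, which is what keeps the alignment signal independent of whichever classifier sits in front of it.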

Boring + measurable beats fuzzy + ambitious.

Why building this is unpopular in 2026

Building AI firewalls is unsexy. Investors want "AI agents that DO things." Saying "I built a system that does fewer things, more quietly" sounds backwards on a pitch deck.

But every founder I've shown this to has the same reaction: relief. Because they're drowning. Because every productivity tool they bought made their attention worse, not better. The AI agent boom didn't reduce their work. It raised the floor of background notifications.

The default for AI tools should be: shut up unless it actually matters.

Most don't. So I'm building the layer that enforces it from outside, since none of the individual tools will do it on their own.

Where I am

PoC sprint, Week 5, solo. 14-day window ending June 9, 2026.

Day 7 Technical Gate — ≥80% classifier agreement on 50 hand-labeled emails. Met.
Day 14 UX Gate — ≥3/5 ICP demos register "oh, this is different." Pending.

I dogfood it every day. My own inbox runs through the firewall.

Stack: Next.js 15, TypeScript, Prisma, Postgres (Supabase), Claude / OpenAI for the tier reasoning, Gmail for ingest.

The actual unpopular opinion

If your AI tool sends push notifications by default, it's broken. Doesn't matter how good its reasoning is. You can't reason your way out of a notification flood.

The next valuable layer of agentic products won't be more agents. It'll be the firewall that decides which agents are allowed to interrupt you, when.


Try it: klorn.ai
Code: github.com/k08200/klorn

If you're building agentic products and you disagree, I want to hear it. If you've solved it differently, I want to hear that more.

Top comments (15)

TxDesk

The four tiers earn their keep because email gives you a graded Reversibility axis. Crypto collapses it. A signed transaction is irreversible in practice, so for firewall purposes there are exactly two states: READ (infinite AUTO, no surface to worry about) and SIGN (infinite PUSH, no AUTO possible).

What that means for the framework: Reversibility isn't one feature among four, it's the gating feature for whether AUTO is even in the tier set for the domain. Domains where Reversibility is graded keep all four tiers. Domains where it's binary collapse to two.

The harder firewall problem in my domain becomes the Confidence axis. AUTO doesn't exist, PUSH is forced for anything state-moving, so the question is whether the agent's recommendation reaches sign-worthy certainty before the user signs. The firewall ends up being a provenance gate, not a notification gate.

The override-as-ground-truth wedge is the part that travels across both shapes, which is reassuring.

yongrean

Reversibility-as-gating is the cleanest framing of it I've seen. Our scorer treats it as one of four weights right now — fine while AUTO is label-only, but the moment AUTO actually executes the score-based path stops being safe and reversibility has to become a hard precondition, not a vote. The notification gate → provenance gate transition is the same fork from the other end. Curious how override-as-ground-truth converges in your domain — when corrections are rare-but-expensive (signing context) versus frequent-and-cheap (email tiers), the loop probably needs a different sample efficiency.
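The difference between "a vote" and "a hard precondition" can be sketched in a few lines. This is a hypothetical illustration, not Klorn's code: the `Proposal` shape and `gateAuto` name are invented here, and falling back to QUEUE is an assumption.

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";

interface Proposal {
  tier: Tier;          // what the score-based path picked
  reversible: boolean; // can the action be undone without consequence?
}

// Reversibility as a gate: the score may propose AUTO, but the proposal
// only stands if the action is reversible. No weight can outvote this check.
function gateAuto(p: Proposal): Tier {
  if (p.tier === "AUTO" && !p.reversible) return "QUEUE";
  return p.tier;
}
```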

TxDesk

You nailed the asymmetry. In a signing context, corrections aren't just rare-but-expensive, they're often unrecoverable: the user signs, funds move, no override exists. So the loop can't lean on user corrections as ground truth at all.

What works instead is a multi-source verification model: the agent's interpretation of a signature gets cross-checked against the protocol's own decoded ABI and the chain's actual state-change semantics before the user sees a recommendation. Three independent sources have to agree.

Sample efficiency is then irrelevant because the loop isn't learning from user behavior - it's learning from agreement between deterministic decoders. When they disagree, that itself is the high-confidence signal and the UI surfaces the disagreement rather than picking a winner.
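The agree-or-surface rule described above might look like the sketch below. It is heavily simplified and entirely hypothetical: each source is reduced to a string summary, whereas real decoders would compare structured fields, and `crossCheck` is an invented name.

```typescript
interface Interpretation {
  source: string;  // e.g. "agent", "abi", "chain"
  summary: string; // normalized description of what the transaction does
}

type Verdict =
  | { kind: "agree"; summary: string }
  | { kind: "disagree"; interpretations: Interpretation[] };

// All sources must match. On any mismatch, return every interpretation
// so the UI can show the disagreement instead of picking a winner.
function crossCheck(interpretations: Interpretation[]): Verdict {
  const [first, ...rest] = interpretations;
  if (first === undefined) throw new Error("no interpretations to compare");
  const allAgree = rest.every((i) => i.summary === first.summary);
  return allAgree
    ? { kind: "agree", summary: first.summary }
    : { kind: "disagree", interpretations };
}
```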

Email tiers can probably get away with cheaper signals because the cost of a wrong move is bounded. Signing context can't.

yongrean

Yeah — bounded vs unbounded cost is the right line. Klorn's override-as-truth works because the worst case is "important email lands in SILENT for a day," recoverable later.

The closest analog to your three-decoder model in our domain is more degraded: classifier confidence + sender-trust history + reversibility flag. Three sources, but all probabilistic, not deterministic. When they disagree we lean to the more conservative tier (PUSH over QUEUE, QUEUE over SILENT) and surface the disagreement to the user as "we weren't sure why this landed here." Cheap signal, useful, not load-bearing the way yours has to be.
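A small sketch of the "lean conservative and surface the disagreement" rule. It assumes each of the three sources can be reduced to a tier vote, which is a simplification made for this example; `resolveTier` and the `Resolution` shape are hypothetical names.

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT";

// Lower index = more conservative (interrupts sooner).
const CONSERVATIVE_ORDER: Tier[] = ["PUSH", "QUEUE", "SILENT"];

interface Resolution {
  tier: Tier;
  disagreed: boolean; // true when the UI should show "we weren't sure why this landed here"
}

function rank(t: Tier): number {
  return CONSERVATIVE_ORDER.indexOf(t);
}

// Picks the most conservative vote and flags whether the sources differed.
function resolveTier(votes: Tier[]): Resolution {
  if (votes.length === 0) throw new Error("at least one vote is required");
  const tier = votes.reduce((a, b) => (rank(b) < rank(a) ? b : a));
  const disagreed = votes.some((v) => v !== tier);
  return { tier, disagreed };
}
```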

What we don't have is your third source — the protocol-level decoded state-change. For email, that'd be "what does Gmail's API say will actually happen if this AUTO-archive runs." We don't surface that today, and your framing makes me want to. The thread is making me wonder if every category of irreversible action deserves its own deterministic ground-truth source — chain state for signing, Gmail history API for archive, etc. — paired with the probabilistic classifier rather than relying on it.

Curious: when your three decoders disagree, what's the typical breakdown? Agent-vs-ABI most common? Or chain-vs-ABI (unexpected state path)? Asking because in our domain we don't actually know what the most common disagreement pattern is yet, and that probably shapes where modeling effort should go.

TxDesk

Agent-vs-ABI is the most common by a long way. The agent narrates a high-level intent ("swap 1 ETH for USDC on Uniswap") and the ABI decode shows the calldata is doing something the narration glossed over: wrong router, wrong slippage tolerance, approval to a different contract than the one being called. The agent isn't lying, it's compressing, and compression drops the bits that matter for signing.

Chain-vs-ABI disagreements are rarer but more dangerous when they happen. The ABI decode reads cleanly ("approve spender X for amount Y") but the chain state shows X is a freshly-deployed contract with no verified source, or the same address has a different bytecode than the user saw last time, or the approval would push cumulative exposure on this token over a threshold the user set. The ABI looks fine in isolation, the chain context is what flags it.

Your point about every irreversible action deserving its own deterministic source is the right framing. The probabilistic layer can stay probabilistic if the deterministic floor catches the cases where probability isn't enough. Gmail history API for archive is exactly the right analog. The question is which actions are worth the engineering cost of a dedicated decoder and which can ride on classifier confidence alone. We've ended up drawing the line at "anything that can't be undone client-side."

yongrean

The line we ended up drawing was "any action that can't be undone with
a single user click" — which in email turns out to be just three: send,
permanent delete, forward-to-external. Everything else (archive, trash,
label, mark read, tier override) is one click to reverse, so it rides
on classifier confidence + the input-hash from #468.

On the compression point — we read it as "don't sign on the narration,
sign on the deterministic artifact." For the 3-action list, our artifact
is an ActionReceipt that pins (recipient, body bytes, intent, the
input-hash from #468) as a sha256 over the canonical bytes. Two PRs
landed:

  • PR #480 (github.com/k08200/klorn/pull/480) — doctrine +
    helpers: FLOOR_ACTIONS, ActionReceipt schema, payloadHash functions
    for each floor action, verifyReceipt, mismatch errors. Doctrine
    written up at docs/doctrine/deterministic-floor.md so the line is
    enforceable in code review, not just in conversation.

  • PR #481 (github.com/k08200/klorn/pull/481) — enforcement:
    PendingAction.actionReceipt column, mint at /approve, verify at
    execute. send_email now throws FloorReceiptRequiredError without a
    receipt and ActionReceiptMismatchError if the bytes mutated between
    approve and execute. The autonomous-agent's direct-invocation path
    fails closed.
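For readers who haven't opened the PRs, here is a minimal sketch of the mint-at-approve, verify-at-execute idea. It is not the code from #480 or #481: the type and error names are borrowed from the description above, but the field layout, the JSON-array canonicalization, and the `code` properties are assumptions made for this example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical payload shape for the send_email floor action.
interface SendEmailPayload {
  recipient: string;
  body: string;
  intent: string;
  inputHash: string;
}

interface ActionReceipt {
  action: "send_email";
  payloadHash: string;
}

class FloorReceiptRequiredError extends Error {
  readonly code = "FLOOR_RECEIPT_REQUIRED";
}

class ActionReceiptMismatchError extends Error {
  readonly code = "ACTION_RECEIPT_MISMATCH";
}

// Canonical bytes (assumed form): a JSON array with a fixed field order,
// so identical payloads always produce the same sha256 digest.
function payloadHash(p: SendEmailPayload): string {
  const canonical = JSON.stringify([p.recipient, p.body, p.intent, p.inputHash]);
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

// Called when the user clicks approve.
function mintReceipt(p: SendEmailPayload): ActionReceipt {
  return { action: "send_email", payloadHash: payloadHash(p) };
}

// Called right before execution. Fails closed in both error cases.
function verifyReceipt(receipt: ActionReceipt | undefined, p: SendEmailPayload): void {
  if (!receipt) {
    throw new FloorReceiptRequiredError("send_email requires an approved receipt");
  }
  if (receipt.payloadHash !== payloadHash(p)) {
    throw new ActionReceiptMismatchError("payload changed between approve and execute");
  }
}
```

Because the receipt pins a hash rather than the narration, any edit to the recipient or body after approval changes the digest and the execute step refuses to run.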

On Gmail History API as the email-side ABI analog: it gives a
replayable log of label-level state transitions, so for "did this
archive atomically move INBOX→Archive" it works as a deterministic
floor. For send_email the body bytes don't live in History, so the
body-hash has to be ours. The floor for send is necessarily on our
side, not Gmail's — but the principle (deterministic, replayable,
content-addressed) is the same.

Curious what shape verify-or-escalate took for Moonshift — what's the
action class you keep refusing to let the agent ship?

TxDesk

On TxDesk, the line is the same in shape but lands at a different cut: write actions, full stop. The agent never autonomously executes a revoke_token_approval, revoke_solana_token_delegation, revoke_tron_token_approval, or build_transaction. It only proposes them. The wallet IS the verify-or-escalate gate. The agent produces the calldata, the user reviews it in their wallet popup, signs there, action lands on-chain. No agent-side execution path exists for any write.

Interesting wrinkle on the read side: there's a second refusal class. The agent reads on-chain data for any wallet the user has verified ownership of via SIWE/SIWS/etc, but refuses to read for unverified addresses, even though that data is technically public. Token approvals, position history, transaction patterns, all readable on any block explorer. The refusal isn't about access control, it's about preventing the agent being used as a whale-tracking or stalking tool. Verification means "the person asking is the owner," which limits the abuse surface even when it doesn't change the underlying data availability.

The two classes feel different but they're the same pattern: every action where the cost of "wrong" is higher than the cost of "ask the user," default to ask.

yongrean

Honest answer: we don't have a meaningful read-refusal surface today.
Gmail API and Calendar API both scope to the authenticated user's own
data; the only outbound read is general web_search, which isn't a
person-lookup tool. So the stalking vector you described doesn't have
a place to land in our current toolset.

The framing transfers when we ship the next layer of the vision —
sub-agents for finance / contacts / external profile enrichment will
introduce read-on-someone-else's-resource as a real surface, and
that's the right moment to draw the line. Logged it as the trigger
condition rather than building doctrine for a surface that doesn't
exist yet.

The cost > ask generalization holds though — our write-side line is
exactly that, narrowed to where we currently have surface. When the
read-side surface arrives the same heuristic gives us the same answer.

If you want this captured for the right moment, would you mind
filing it as a GitHub issue on k08200/klorn? The trigger condition
("first external person-lookup tool lands") + your SIWE/SIWS analog
framing would be exactly the right context for whoever picks it up
later — including future me. Happy to label it doctrine and link
it to docs/doctrine/deterministic-floor.md so it surfaces in the same
audit trail.

TxDesk

Here's the framing structured for the tracker. Lift verbatim or reshape into Klorn's conventions as you prefer.


Title: doctrine: read-refusal class as anti-surveillance gate (trigger condition + ownership-proof framing)

Trigger condition: the doctrine becomes load-bearing the moment Klorn ships its first agent tool whose read surface includes a resource owned by someone other than the authenticated user. Likely first arrivals: finance sub-agent pulling counterparty payment history, contacts/CRM sub-agent pulling external profile data, external profile enrichment. Until one lands, the issue is a marker. Reviewer should require explicit doctrine alignment before merging the first PR that crosses the line.

The framing: the refused class is read data about a person who is NOT the authenticated user, even when that data is technically public (block explorers, public profile pages, public records). The refusal isn't access control. The data stays public. The refusal is about preventing the agent from being used as a surveillance, stalking, or whale-tracking primitive when it's not the data owner asking. Same data, same agent, different externality depending on who's asking.

The verification primitive that lets you draw the line cleanly: the asker must prove ownership of the resource being queried, not just possess the identifier. SIWE / SIWS / BIP-322 for wallets. For non-blockchain resources, the analogs are OAuth-with-correct-scope, email-loop verification, or platform-native ownership claims. Shape is the same: ownership-proof before the read fires, not just identifier-supplied.

Same pattern as the write-side floor: every action where the cost of "wrong" exceeds the cost of "ask the user," default to ask. Write actions hit that bar via state changes that can't be undone with one click. Reads on third-party resources hit that bar via privacy externalities the resource owner never consented to.

Code shape sketch (not prescriptive):

  • Tool metadata flag (e.g. requiresOwnershipProof: true) analogous to existing walletScoped or floorAction flags
  • Pre-execution check: resolve resource identifier in tool input to expected owner; verify authenticated user holds ownership proof for that identifier (signature, OAuth scope, email-loop receipt) issued within a freshness window
  • Refusal: refuse the call with a user-facing message naming the proof type needed

Freshness window can be longer than the write-side gate (per-session or per-day instead of 5 min), since the externality being defended is the abuse vector, not state mutation risk.
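One possible rendering of the code shape sketch, in the same non-prescriptive spirit. Only `requiresOwnershipProof` comes from the text above; `ToolMeta`, `OwnershipProof`, `checkReadGate`, and the one-day default window are hypothetical choices for this example.

```typescript
interface ToolMeta {
  name: string;
  requiresOwnershipProof: boolean;
}

interface OwnershipProof {
  resourceId: string;
  kind: "signature" | "oauth-scope" | "email-loop";
  issuedAtMs: number;
}

type GateResult = { allowed: true } | { allowed: false; message: string };

const DAY_MS = 24 * 60 * 60 * 1000;

// Pre-execution check: the read fires only if the caller holds a proof for
// this exact resource, issued within the freshness window.
function checkReadGate(
  tool: ToolMeta,
  resourceId: string,
  proofs: OwnershipProof[],
  nowMs: number,
  freshnessMs: number = DAY_MS,
): GateResult {
  if (!tool.requiresOwnershipProof) return { allowed: true };
  const hasFreshProof = proofs.some(
    (p) =>
      p.resourceId === resourceId &&
      p.issuedAtMs <= nowMs &&
      nowMs - p.issuedAtMs <= freshnessMs,
  );
  if (hasFreshProof) return { allowed: true };
  return {
    allowed: false,
    message: `Reading ${resourceId} requires a fresh ownership proof (signature, OAuth scope, or email-loop receipt).`,
  };
}
```

Tools without the flag pass straight through, which keeps search-engine-shaped queries out of the gate's scope.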

What it's NOT: not access control on public data. Not a verification requirement for the user's own resources. Not a blanket "all third-party reads require proof". Search-engine-shaped queries (web search, market data, public news) don't have a single resource owner whose consent is at stake.

Acceptance: either the first applicable tool ships with the gate wired up and a refusal-path test, OR after ~12 months without a triggering tool, close as "trigger condition didn't arrive, re-open if it does."


Agree on the trigger-condition approach. Building doctrine against an empty surface is the failure mode where the doctrine drifts from reality by the time the surface arrives.

yongrean

Filed as #488 (github.com/k08200/klorn/issues/488), lifted
nearly verbatim — the trigger-condition framing was exactly what was
missing, and the explicit "what this is NOT" section will save the
first PR author from over-specifying. Linked from
docs/doctrine/deterministic-floor.md so the same audit trail surfaces
both halves of the floor.

Closing the loop on the 12-month side: if no triggering tool ships
by then, we'll close as "didn't arrive, reopen if it does" per your
acceptance criteria. The reviewer-gate on the first crossing PR is
captured in the issue body.

TxDesk

Glad it landed in shape. The "what this is NOT" section was the part I went back and forth on the most, so good to hear it's the part that earns its keep at PR review time. Will keep an eye on the issue when the trigger fires; happy to weigh in on the first crossing PR if useful at that point.

yongrean

Side-channel — wrote up the prod war story I mentioned in our last
exchange. The "model isn't where it was" failure mode that doesn't
fit the standard 402/403/429 fallback branches:

dev.to/k08200/treat-upstream-catal...

Curious if you've hit upstream-catalog-mutation in TxDesk's
provider chain — different domain (DEX aggregator routes can be
gone mid-quote in similar ways), same shape.

TxDesk

Yes, and the shape is near-identical. The aggregator returns a route, I build the tx around it, and by execution time the path can be gone: pool drained, a hop's liquidity moved, the quote's price impact no longer valid. Same root as your SKU retirement: I treated a read off a mutable upstream catalog as if it were stable through to use.

The reason it doesn't fit the 402/403/429 ladder is that those all say "your request was bad or refused." This says "your request was fine, the world moved." The call succeeds, returns a 200, and the thing it describes is already stale. There's no status code for "valid answer, expired between read and use."

What I landed on is treating the quote as a lease, not a fact: re-validate the route immediately before building the tx, and treat drift past a tolerance as a soft re-quote rather than an error that surfaces. Your point that this needs its own branch is the interesting part. It's not retry (the same call gives a different valid answer), it's not fallback (nothing failed), it's "re-read the catalog because the catalog is the unstable thing." Most agent error handling doesn't model that as a category, which is why it bites.
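The lease idea can be sketched as a third outcome that is neither success nor error. This is a hypothetical shape, not TxDesk's code: `Quote`, `revalidate`, and the basis-point tolerance are invented for the example, and `refetch` is synchronous here only to keep it short (a real re-read would be async).

```typescript
interface Quote {
  route: string;
  amountOut: number; // assumed to be positive
}

type LeaseOutcome =
  | { kind: "use"; quote: Quote }      // held quote is still within tolerance
  | { kind: "requote"; quote: Quote }; // the world moved; continue with the fresh quote

// Re-read the upstream catalog right before use and compare against the held quote.
function revalidate(held: Quote, refetch: () => Quote, toleranceBps: number): LeaseOutcome {
  const fresh = refetch();
  const routeChanged = fresh.route !== held.route;
  const driftBps = (Math.abs(fresh.amountOut - held.amountOut) / held.amountOut) * 10000;
  if (routeChanged || driftBps > toleranceBps) {
    return { kind: "requote", quote: fresh };
  }
  return { kind: "use", quote: held };
}
```

Note that "requote" is returned as data rather than thrown, which matches the point above: nothing failed, so it should not travel through the error path.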

yongrean

The taxonomy is the useful part, and it already has a name — TOCTOU (time-of-check to time-of-use). What you landed on — "lease, not a fact / re-validate before building the tx / soft re-quote on drift" — is optimistic concurrency control: a swap's deadline + minAmountOut is OCC made explicit, and the soft re-quote is its retry-on-conflict. Naming it buys you the existing literature instead of re-deriving the branch per incident.

But there's an asymmetry the DeFi framing hides, and it's the part that bit me on the model side:

In DeFi you can observe the staleness. Re-read the pool reserves, recompute price impact — the degradation is visible at re-validation time, cheaply, in the hot path. That's why the lease transplants so cleanly to swaps.

With a model SKU it splits into two failure modes that don't behave the same:

  • Gone (retired/renamed) maps 1:1 to "path disappeared." Cheap to detect: the id isn't in the catalog. The lease works.
  • Silently swapped: same SKU, same name, different behavior under it. Re-reading the catalog shows you nothing. "Still listed, metadata fresh," and it's a different model. There's no reserve to re-read; the only way to see it is to run behavior, i.e. an eval, which you can't afford in the request hot path.

So the lease doesn't transplant whole. You split it: a cheap presence/capability re-check inline (the lease, for "gone"), behavioral canaries on a schedule out-of-band (for "swapped"), and emit a drift event when the available model set changes so it's a dated log line, not archaeology. Prevention and attribution are separate jobs; the lease only does the first.
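A sketch of the two cheap pieces: the inline presence check and the dated drift event. Names and shapes (`DriftEvent`, `isListed`, `diffModelSet`) are hypothetical. The silently-swapped case is deliberately absent, since a catalog diff cannot see it and it needs scheduled behavioral canaries instead.

```typescript
interface DriftEvent {
  at: string; // ISO timestamp, so the change is a dated log line
  added: string[];
  removed: string[];
}

// Inline lease check for the "gone" case: is the id still in the catalog?
function isListed(modelId: string, catalog: string[]): boolean {
  return catalog.indexOf(modelId) !== -1;
}

// Compares two catalog snapshots; returns null when the available set is unchanged.
function diffModelSet(previous: string[], current: string[], now: Date): DriftEvent | null {
  const added = current.filter((id) => previous.indexOf(id) === -1);
  const removed = previous.filter((id) => current.indexOf(id) === -1);
  if (added.length === 0 && removed.length === 0) return null;
  return { at: now.toISOString(), added, removed };
}
```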

TxDesk

TOCTOU is the right name, and you're right that the lease only covers the "gone" half. The swapped-underneath case is the one that actually scares me, because it fails quietly: the contract you're talking to still has the same name, the metadata's fresh, and the behavior moved.

Where I've landed on the canary side: scheduled evals catch the swap eventually, but "eventually" has a blast radius: every request between the swap and the next canary run gets the new behavior silently. So I treat the drift event as the thing that actually matters, not the canary result itself. The moment the available set changes, that's the dated log line, and I'd rather over-emit those and reconcile than trust the canary cadence to be tight enough. Attribution beats prevention here precisely because prevention is structurally impossible in the hot path: you can't eval before you answer.

The part I haven't solved: a swap that's behaviorally adjacent enough that the canary passes but the edge cases moved. That's the one with no cheap tell at all.