DEV Community

yongrean

Stop building AI inboxes. Build decision layers instead.

Subtraction over interface noise

I spent six months building an AI-powered email tool. Then I deleted half of it.

Not because the model was bad. Not because the embeddings were off. Because I finally noticed what every "AI inbox" on the market — including the one I was building — was actually doing.

They were surfacing more.

More "smart suggestions". More "priority signals". More "AI-drafted replies waiting for your review". More badges, more banners, more nudges. Every product in the category was racing to add a new surface and call it intelligence.

My six-month-old prototype did all of that. I used it every day. And every morning the inbox was just as loud as the day I started. The model was right about which emails mattered. I still read all the other ones anyway, because they were right there, with a little colored dot suggesting maybe-they-mattered-too.

The model was solving the wrong problem.

The category bug

Look at the leading email tools through this lens:

  • Superhuman made reading faster. You still read everything.
  • Shortwave classified smarter. You still read everything.
  • Motion / Reclaim got more proactive. They added a calendar layer on top of the noise.

None of them subtract. They all add. "AI assistant" became a license to put one more thing in front of you.

The deeper bug: these tools treat email as the primary surface and try to make it better. But email is not what you want. What you want is the decisions you have to make. Email is one cheap, unreliable transport that occasionally carries those decisions, buried under hundreds of messages that don't.

Making the transport prettier doesn't fix the signal-to-noise problem. It hides it.

The right abstraction: decision layer

A decision layer doesn't replace your inbox. It sits above mail, calendar, Slack, and any other transport, and it surfaces exactly one thing: items where the system genuinely needs your judgment.

Three properties make a layer a decision layer rather than just "a better inbox":

  1. It subtracts more than it adds. A signal that you've ignored four times in a row should never reach you again. Not muted. Gone.
  2. It treats relationships as data. Two people asking for the same thing are not the same ask. One of them has hit every deadline you've ever had with them; the other ships +3 days late, every time. That should weight the queue.
  3. It refuses to act without your approval. The model can draft, propose, plan. It cannot send, modify, or commit. Approval-before-action has to be a schema-level constraint, not a UI nicety.

None of these are AI features. They are boundary features. The AI is helpful for the classification underneath, but the value lives in what the system refuses to surface.

Here is what each of them actually looks like in production.

Pattern 1 — Closed-loop suppression learning

The single most useful thing the system does is forget.

Every time the user dismisses or ignores an attention item, we record a FeedbackEvent with the signal DISMISSED or IGNORED. That table is the cheap part. The interesting part is a job that reads it weekly:

export async function runFeedbackAdaptation(userId: string): Promise<number> {
  const since = new Date(Date.now() - LOOK_BACK_DAYS * 24 * 60 * 60 * 1000);

  const events = await prisma.feedbackEvent.findMany({
    where: {
      userId,
      source: "ATTENTION_ITEM",
      signal: { in: ["DISMISSED", "IGNORED"] },
      createdAt: { gte: since },
    },
    select: { sourceId: true },
  });

  // Join to the attention items themselves so we can bucket by (source, type,
  // priority) instead of just (source, type) — the bucket prevents an
  // over-broad rule from silencing legitimate high-priority signals.
  const items = await prisma.attentionItem.findMany({
    where: { id: { in: events.map(e => e.sourceId) } },
    select: { id: true, source: true, type: true, priority: true },
  });

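  // Index items by id so each feedback event can be resolved to its item.
  const itemMap = new Map(items.map((item) => [item.id, item]));

  const counts = new Map<string, { key: CountKey; count: number }>();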
  for (const event of events) {
    const item = itemMap.get(event.sourceId);
    if (!item) continue;
    const bucket = priorityBucket(item.priority);
    const k = suppressionKey(item.source, item.type, bucket);
    const existing = counts.get(k);
    if (existing) existing.count += 1;
    else counts.set(k, { key: { source: item.source, type: item.type, bucket }, count: 1 });
  }

  // Threshold: same tuple dismissed ≥4 times in 30 days → suppress forever.
  const suppressed = [...counts.values()]
    .filter(({ count }) => count >= DISMISS_THRESHOLD)
    .map(({ key, count }) => ({ ...key, dismissCount: count }));

  await remember(userId, "CONTEXT", "attention_suppression_v2", JSON.stringify(suppressed));
  return suppressed.length;
}

The suppression set is then read at the upsert path for every new attention item:

export function isSuppressed(
  set: Set<string>,
  source: string,
  type: string,
  priority?: number,
): boolean {
  if (typeof priority === "number") {
    const bucket = priorityBucket(priority);
    if (set.has(suppressionKey(source, type, bucket))) return true;
  }
  return set.has(suppressionKey(source, type));
}

If the tuple is in the suppression set, the new attention item is forced into the SILENT tier: it gets recorded for the audit log, but the user is never paged about it.
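The excerpts call priorityBucket and suppressionKey without showing them. Here is a minimal sketch of how they could fit together, assuming priority is a 0-100 score and the key is a colon-joined string. The cut-offs, the key format, the Tier names other than SILENT, and resolveTier are all assumptions for illustration, not the production code.

```typescript
// Hypothetical helper bodies; only the function names come from the excerpts.
type Bucket = "HIGH" | "MEDIUM" | "LOW";
type Tier = "PUSH" | "QUEUE" | "SILENT";

// Assumes priority is a 0-100 score; the cut-offs are illustrative.
function priorityBucket(priority: number): Bucket {
  if (priority >= 70) return "HIGH";
  if (priority >= 40) return "MEDIUM";
  return "LOW";
}

// Without a bucket this yields the older (source, type) key.
function suppressionKey(source: string, type: string, bucket?: Bucket): string {
  return bucket ? `${source}:${type}:${bucket}` : `${source}:${type}`;
}

// Same logic as the excerpt: try the bucketed key first, then the legacy key.
function isSuppressed(
  set: Set<string>,
  source: string,
  type: string,
  priority?: number,
): boolean {
  if (typeof priority === "number") {
    if (set.has(suppressionKey(source, type, priorityBucket(priority)))) return true;
  }
  return set.has(suppressionKey(source, type));
}

// On the upsert path a suppressed tuple keeps its row but loses its interrupt.
function resolveTier(proposed: Tier, suppressed: boolean): Tier {
  return suppressed ? "SILENT" : proposed;
}
```

With these definitions, a set containing only the LOW key for a tuple silences low-priority items of that tuple and leaves HIGH ones alone, while a bucket-less key silences all three buckets.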

A few design choices worth pointing out:

  • Priority buckets matter. The first version keyed only on (source, type). Dismissing four "due-today commitment" notifications would silence every commitment-due signal, including overdue ones. The current version buckets priority into HIGH / MEDIUM / LOW, so the user can train "I don't care about LOW-priority due commitments" without losing the HIGH ones.
  • Backwards-compatible key. Memory rows from the previous version are still read; a v1 row without a bucket matches every bucket, so a rollback doesn't lose learned behavior.
  • 10-minute in-process cache. The upsert path is hot — checking the suppression set on every new item against the DB would be wasteful. A 10-minute TTL is short enough that a weekly adaptation run propagates fast and long enough to be free at request time.
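The cache in the last bullet can be very small. A rough sketch, assuming a per-user Map and a loader function that stands in for the read of the stored suppression set (createSuppressionCache, Loader, and the injectable clock are hypothetical names, not the real code):

```typescript
// Hypothetical in-process TTL cache; `load` stands in for the DB/memory read.
type Loader = (userId: string) => Promise<Set<string>>;

const TTL_MS = 10 * 60 * 1000; // 10 minutes, as described above

function createSuppressionCache(load: Loader, now: () => number = Date.now) {
  const entries = new Map<string, { value: Set<string>; expiresAt: number }>();

  return async function getSuppressionSet(userId: string): Promise<Set<string>> {
    const hit = entries.get(userId);
    // Serve from memory while the entry is still fresh.
    if (hit && hit.expiresAt > now()) return hit.value;

    // Otherwise reload once and remember the result until the TTL passes.
    const value = await load(userId);
    entries.set(userId, { value, expiresAt: now() + TTL_MS });
    return value;
  };
}
```

The clock is a parameter only so the expiry can be tested without waiting; in a single-process deployment this is probably enough, but multiple instances would each hold their own copy for up to ten minutes.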

Notice what's missing: an LLM. The classifier underneath uses one, but the suppression loop itself is plain counting. The model is not the right tool for "remember what the user doesn't care about". A GROUP BY is.

Pattern 2 — Contact Trust Score

The second feature changed how I think about every productivity tool I've ever used.

When someone makes a commitment to you — "I'll send the deck by Thursday", "let's reconnect next week" — that's a tracked row in a commitment ledger. When the commitment is fulfilled, we record whether it was on-time or late, and update a running tally per contact:

export async function updateTrustScore(
  userId: string,
  contactEmail: string,
  displayName: string | null,
  wasOnTime: boolean,
  daysLate = 0,
): Promise<void> {
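  // `email` was not defined in the excerpt; assuming it is the normalized contact address.
  const email = contactEmail.trim().toLowerCase();

  await prisma.contactTrustScore.upsert({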
    where: { userId_contactEmail: { userId, contactEmail: email } },
    create: {
      userId,
      contactEmail: email,
      displayName,
      totalCount: 1,
      onTimeCount: wasOnTime ? 1 : 0,
      lateCount: wasOnTime ? 0 : 1,
      totalDelayDays: Math.max(0, daysLate),
      lastUpdatedAt: new Date(),
    },
    update: {
      totalCount: { increment: 1 },
      ...(wasOnTime ? { onTimeCount: { increment: 1 } } : { lateCount: { increment: 1 } }),
      ...(daysLate > 0 ? { totalDelayDays: { increment: daysLate } } : {}),
      lastUpdatedAt: new Date(),
    },
  });
}

That tally rolls up to a badge:

  • reliable — ≥80% on-time, ≥3 data points
  • mostly reliable — ≥50% on-time, ≥3 data points
  • unreliable — <50% on-time, ≥3 data points
  • unknown — fewer than 3 data points, or stale (no signal in 60+ days)

The stale check is doing real work. A year-old "reliable" badge on someone who has since gone dark shouldn't be load-bearing. Until we get full exponential decay, we demote anyone with no signal for 60 days (two 30-day half-lives) back to unknown.
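The roll-up itself is a handful of comparisons. A sketch under the thresholds listed above, with field names borrowed from the upsert excerpt; the real computeResult is not shown in this post, so its signature here, and the choice to average delay over late commitments only, are assumptions:

```typescript
// Sketch of the badge roll-up; thresholds come from the list above.
type Badge = "reliable" | "mostly_reliable" | "unreliable" | "unknown";

interface TrustRow {
  contactEmail: string;
  displayName: string | null;
  totalCount: number;
  onTimeCount: number;
  lateCount: number;
  totalDelayDays: number;
  lastUpdatedAt: Date;
}

const MIN_DATA_POINTS = 3;
const STALE_AFTER_DAYS = 60;
const DAY_MS = 24 * 60 * 60 * 1000;

function computeResult(row: TrustRow, now: Date = new Date()) {
  const onTimeRate = row.totalCount > 0 ? row.onTimeCount / row.totalCount : 0;
  // Assumption: average delay is taken over late commitments only.
  const avgDelayDays = row.lateCount > 0 ? row.totalDelayDays / row.lateCount : 0;
  const ageDays = (now.getTime() - row.lastUpdatedAt.getTime()) / DAY_MS;

  let badge: Badge;
  if (row.totalCount < MIN_DATA_POINTS || ageDays >= STALE_AFTER_DAYS) badge = "unknown";
  else if (onTimeRate >= 0.8) badge = "reliable";
  else if (onTimeRate >= 0.5) badge = "mostly_reliable";
  else badge = "unreliable";

  return {
    contactEmail: row.contactEmail,
    displayName: row.displayName,
    onTimeRate,
    avgDelayDays,
    badge,
  };
}
```

Checking staleness before the rate means a once-reliable contact who has gone quiet drops to unknown rather than keeping an old badge.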

The badge gets surfaced as a small chip on the inbox card. But the actually-useful place is inside the agent prompt itself:

export async function buildTrustHintForPrompt(userId: string): Promise<string> {
  const rows = await prisma.contactTrustScore.findMany({
    where: { userId, totalCount: { gte: MIN_DATA_POINTS } },
    orderBy: { lastUpdatedAt: "desc" },
    take: 10,
  });
  if (rows.length === 0) return "";

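  // Stale rows resolve to "unknown"; drop them so they are not mislabeled as unreliable below.
  const results = rows.map((row) => computeResult(row)).filter((r) => r.badge !== "unknown");
  if (results.length === 0) return "";

  const lines = results.map((r) => {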
    const name = r.displayName || r.contactEmail;
    if (r.badge === "reliable")
      return `- ${name}: reliable (${Math.round(r.onTimeRate * 100)}% on-time)`;
    if (r.badge === "mostly_reliable") {
      const delay = r.avgDelayDays > 0 ? `, avg +${Math.round(r.avgDelayDays)}d late` : "";
      return `- ${name}: mostly reliable (${Math.round(r.onTimeRate * 100)}% on-time${delay})`;
    }
    return `- ${name}: unreliable (${Math.round(r.onTimeRate * 100)}% on-time, avg +${Math.round(r.avgDelayDays)}d late) — factor in extra buffer`;
  });

  return `\n## Contact Reliability\nBased on tracked commitments:\n${lines.join("\n")}`;
}

Now when the model decides how urgently to surface "Mina is asking for an update" vs "Sarah is asking for an update", it has actual data on which of them is going to deliver if you give them a polite nudge versus which one needs the deadline restated three times. The prompt isn't fed any feelings about either person. It is fed numbers.

The productivity-tool industry has spent ten years building calendars that don't know which meeting attendees actually show up on time. That's strange.

Pattern 3 — Approval-before-action as a schema constraint

The third pattern is the boring one, and it's the one most AI assistants get wrong.

The model is allowed to draft a reply. It is allowed to propose a calendar move. It is allowed to plan a sequence of actions. It is not allowed to send, move, or commit any of it. Not because we don't trust the model — we sometimes do — but because the user needs to know the surface area of what the system is doing on their behalf, and "silently sent" is a category of bug that never recovers user trust once it happens.

This is enforced at the schema level. Every action the agent proposes lives in a PendingAction row with a status enum. The state machine for that enum is the contract: only one transition (approve()) gets the side effect to actually run. The agent can propose() all day; nothing ships without a deliberate user transition.

The lowest-risk class of actions — internal-only things like blocking calendar time for focus, snoozing an item, setting a reminder — can be marked auto and skip approval. Everything that touches an outside party (sending mail, modifying someone else's calendar) is always gated. The boundary is conservative on purpose. The day a single user discovers their AI assistant silently sent an apology to their VC is the day every AI assistant in the category becomes harder to sell.
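To make the contract concrete, here is an in-memory sketch of that state machine. The status names, the Risk type, runAuto, and the synchronous Executor are hypothetical stand-ins; the real constraint lives in the database schema and the approve route, which this does not reproduce.

```typescript
// In-memory sketch of the PendingAction contract; all names except
// PendingAction, propose, and approve are assumptions for illustration.
type Status = "PROPOSED" | "REJECTED" | "EXECUTED";
type Risk = "AUTO" | "GATED";

interface PendingAction {
  id: string;
  tool: string;
  risk: Risk;
  status: Status;
}

type Executor = (action: PendingAction) => void;

// The agent's only entry point: it creates a row and runs nothing.
function propose(id: string, tool: string, touchesOutsideParty: boolean): PendingAction {
  return { id, tool, risk: touchesOutsideParty ? "GATED" : "AUTO", status: "PROPOSED" };
}

// User-triggered transition: the only way a gated action reaches the executor.
function approve(action: PendingAction, execute: Executor): PendingAction {
  if (action.status !== "PROPOSED") {
    throw new Error(`cannot approve an action in status ${action.status}`);
  }
  execute(action);
  return { ...action, status: "EXECUTED" };
}

// Internal-only actions may skip approval; anything gated is refused here.
function runAuto(action: PendingAction, execute: Executor): PendingAction {
  if (action.risk !== "AUTO") throw new Error("gated action requires approval");
  if (action.status !== "PROPOSED") {
    throw new Error(`cannot run an action in status ${action.status}`);
  }
  execute(action);
  return { ...action, status: "EXECUTED" };
}
```

The point of the shape is that the executor is only reachable through two functions, and the one the agent could call on its own rejects anything that touches an outside party.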

What this looks like in practice

The sum of these three patterns is not a smarter inbox. It is a small, quiet queue that contains roughly six to twelve items on any given day. Each item is either an explicit ask, a tracked commitment coming due, or a proposed action waiting for confirmation. The model spent the morning reading and reasoning about a few hundred other things, all of which the system decided you don't need to know about.

When you dismiss an item, the system learns. When a contact reliably delivers, their asks rise. When the model wants to act outside a narrow safelist, it asks first. The result, after a few weeks of training the noise floor, is a queue that feels like it was assembled by someone who actually knows what you ignore.

None of this requires a frontier model. The classifier underneath is a small, cheap LLM with strict cost guards. Almost all of the value is in the boundaries — what the system refuses to surface, what it refuses to do without you, and what it remembers about people you work with.

If you're building anything in this category and you find yourself adding a new surface that shows the user more things, stop and ask whether you'd rather build the thing that subtracts. The market is crowded with smarter inboxes. There is no good decision layer yet.

I'm shipping one at klorn.ai. Not asking for signups — sharing the pattern because I think more people should be building toward it. The closed-loop suppression and trust-score code above are excerpts from the real thing.


Built in TypeScript on Fastify, Prisma, and Postgres. Code patterns shown are production excerpts.

Top comments (10)

TxDesk

Pattern 3 is the one I'd defend hardest, and it gets even more load-bearing the moment the actions touch money instead of email. I build an AI agent over crypto wallets, and "the model can propose but a deliberate transition is the only thing with the side effect" is the same constraint, just with a worse failure mode if you get it wrong: a silently-sent apology to a VC is recoverable, a silently-signed transaction is not.

One thing I'd add from doing this in a higher-stakes domain: the schema constraint is necessary but it's worth backing it with a test that fails loudly if the boundary ever regresses. I have a read path that must never invoke the action/billing engine, and the test re-mocks those entry points to throw if called, then asserts the response still succeeds. A refactor that accidentally wires the action capability into the read path doesn't just slip through review, it breaks the build with an explicit "invariant violated" message. The boundary lives in the schema, but the guarantee lives in the test.
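A dependency-free sketch of that kind of test, assuming the read handler receives its collaborators as an argument. It swaps the mocking library for a hand-rolled double, and every name in it (ActionEngine, handleReadRequest, ThrowingEngine) is hypothetical:

```typescript
// Hypothetical names throughout; a hand-rolled double replaces a mocking library.
interface ActionEngine {
  execute(actionId: string): void;
}

interface ReadDeps {
  store: Map<string, string>;
  engine: ActionEngine;
}

// The read path under test: it may look things up but must never act.
function handleReadRequest(deps: ReadDeps, id: string): { status: number; body: string | null } {
  const body = deps.store.get(id) ?? null;
  return { status: body === null ? 404 : 200, body };
}

// An engine that fails loudly the moment anything calls it.
class ThrowingEngine implements ActionEngine {
  calls = 0;
  execute(_actionId: string): void {
    this.calls += 1;
    throw new Error("invariant violated: read path invoked the action engine");
  }
}

function testReadPathNeverInvokesEngine(): void {
  const engine = new ThrowingEngine();
  const store = new Map([["a1", "hello"]]);
  const res = handleReadRequest({ store, engine }, "a1");
  if (res.status !== 200) throw new Error("read should still succeed");
  if (engine.calls !== 0) throw new Error("invariant violated: engine was called");
}
```

If a later refactor makes handleReadRequest call deps.engine.execute, the test stops on the thrown error instead of passing quietly.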

The suppression-loop-is-a-GROUP-BY-not-an-LLM point is also exactly right and underrated. The model is for the fuzzy classification; the boundary logic should be dumb and deterministic so you can actually reason about it.

yongrean

Strong agree on the test backing the schema. The schema makes the boundary expressible, the test makes it durable through refactors — and I've watched well-designed boundaries silently erode three or four PRs later when only the first half is in place.

In my email triage agent the action surface is PendingAction + an explicit approval transition — the model produces as many proposals as it wants, only the user-triggered /approve route invokes the executor. The structural separation is there, but you nailed the gap I currently have: no "executor must never be called from the read path" test. Stealing the throw-on-invoke mock pattern wholesale for the v1 hardening pass.

The GROUP BY point is the one I think more agents will quietly regret skipping. We dedup proposals on a deterministic content hash + sliding-window suppression — boring SQL, but the agent never produces 14 duplicate "remind X to reply" pings, and we can reason about why something was suppressed without re-running the model. LLMs for fuzzy classification, deterministic logic for invariants — once you draw that line, debugging shrinks dramatically.

Curious about the crypto case: are you running the signing transition through a separate hardware/service boundary, or is the executor in-process with extra schema gating? Where do you put the trust root?

TxDesk

Honest answer: in-process, with extra schema gating and a separate signing-transition test rather than a hardware boundary. The reason is that TxDesk's executor doesn't sign. The user's wallet does. The agent proposes a transaction, the frontend hands the proposal to the user's wallet extension (MetaMask, Phantom, etc), the wallet shows its own confirmation UI, the user signs there. The trust root sits in the wallet, not in our process.

Which means our schema gating is really gating the proposal, not the signature. The invariant we test is "the proposal payload that reaches the frontend was produced by a path the user authorized" rather than "the signing key is air-gapped." Different shape from email-with-an-executor, but the same underlying discipline: structurally separate the layer that reasons from the layer that holds the irreversible capability, then test the separation.

The version of your test pattern I run is roughly: re-mock the agent engine entry points to throw on call, hit the read endpoint, assert 200 + .not.toHaveBeenCalled(). A refactor that wires reasoning back into the read path breaks the build with an explicit "invariant violated" message rather than silently shipping. The throw-on-invoke pattern works for any boundary you want to be load-bearing. The engine-vs-read split for me, the executor-vs-read split for you.

If you ever want the executor in-process with a real signing capability (e.g. a service-account workflow rather than user-confirmed), the trust-root question gets harder fast. That's when the hardware boundary or a separate signing service starts to earn its keep. For now, "the human's wallet is the signing service" is doing a lot of work for me.

yongrean

Trust root in the wallet — that lands. The throw-on-invoke pattern just landed (PR #454); the read path I most want to lock down is the firewall classifier itself, where any future "enrichment" helper that quietly re-invokes the scorer would silently inflate sender-trust signals on already-classified emails. The wallet-vs-service-account fork is the cleanest articulation of when the in-process gate stops being enough — bookmarking that for the AUTO-actually-executes milestone.

TxDesk

The silent re-invocation problem is the right one to be worried about. The shape that scares me most is when enrichment gets added 6 months later by someone who doesn't know the original gate exists. PR review catches "is this calling the scorer" but not "is this implicitly triggering a re-classification of something already trusted."

What helps is making the classifier read-only against a versioned snapshot of the input. Once a signature is classified at v1 of the input bytes, the result is keyed against the input hash. Any enrichment that mutates the input invalidates the classification and forces a re-decision. Mutating without invalidating is a type error. Not a perfect defense but the regression gets caught at compile time rather than at runtime.

The wallet vs service account fork is exactly where I draw the line too.

yongrean

You're right that the throw-on-invoke test in #454 only proves the read path doesn't call the scorer today. It doesn't prove that some future enrichment can't re-invoke the classifier through a different code path.

The strongest version of the property you're describing — "classifier is read-only against a versioned snapshot of input bytes" — just landed as PR #468 (github.com/k08200/klorn/pull/468).

Concretely: at classify time we sha256 the exact fields the scorer reads (from, subject, snippet, labels — sorted, NFC-normalized, prefixed with a HASH_SCHEMA_VERSION), store the digest on the AttentionItem row, and ship verify/check helpers. If anything mutates those bytes between decision and read, the stored hash and the recomputed hash diverge and verify throws AttentionHashMismatchError. Bumping HASH_SCHEMA_VERSION invalidates every existing row on purpose.
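Roughly, the shape is the following. This is a simplified sketch rather than the code in the PR: the JSON-array canonical encoding, the version string, and the function names computeInputHash and verifyInputHash are assumptions, and "sorted" is taken here to mean the labels.

```typescript
import { createHash } from "node:crypto";

// Assumed version tag; bumping it changes every digest on purpose.
const HASH_SCHEMA_VERSION = "v1";

interface ScoredFields {
  from: string;
  subject: string;
  snippet: string;
  labels: string[];
}

class AttentionHashMismatchError extends Error {}

// Hash exactly the fields the scorer reads, in a canonical form.
function computeInputHash(fields: ScoredFields): string {
  const nfc = (s: string): string => s.normalize("NFC");
  const canonical = JSON.stringify([
    HASH_SCHEMA_VERSION,
    nfc(fields.from),
    nfc(fields.subject),
    nfc(fields.snippet),
    fields.labels.map(nfc).sort(), // label order must not affect the digest
  ]);
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

// Recompute at read time and refuse to continue if the bytes changed.
function verifyInputHash(storedHash: string, current: ScoredFields): void {
  if (computeInputHash(current) !== storedHash) {
    throw new AttentionHashMismatchError("scored input changed after classification");
  }
}
```

The digest would be stored next to the classification; any helper that edits a scored field afterwards makes verifyInputHash throw on the next read.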

What it doesn't try to do: cover MEDIUM/HIGH-risk action payloads. Those have a separate integrity story (PendingAction.toolArgs JSONB + an explicit /approve transition), which is a different problem and deserves its own pass.

PendingAction.toolArgs is the next surface I'm thinking about content-addressing (hash the resolved tool args after template expansion). The case I'm not sure about is partial revocation — when a multi-step plan has one invalidated step, do you abort the whole plan or rebuild from the surviving prefix? Curious how your decoders handle this when one of the three goes stale mid-decision.

TxDesk

PR #468 is the right shape. Content-addressing the bytes you actually scored is the only way to make the read-only property auditable.

On partial revocation: we abort the whole plan and rebuild. The reasoning is that the steps aren't independent: step N's decision was made against a world that included step N-1 succeeding, so once one step goes stale, the downstream steps are decisions about a world that no longer exists. Rebuilding from the prefix sounds cheaper, but you end up reasoning about a state the planner never actually saw, which is the exact failure mode the hash is meant to prevent in the first place.

The cost we pay for this is wasted work on the surviving steps. The cost we avoid is silently acting on a stale plan. The tradeoff only makes sense because in our domain (wallet transactions) the surviving prefix is rarely worth more than the cost of re-deriving it. In yours (email actions) that calculus might flip if the surviving steps are expensive enough to justify re-validating them individually against the new state. Worth measuring before committing to one strategy.

yongrean

Thanks, that framing matches what we were after with #468. The content-addressed read-only property is half of it; the other half is the same shape applied to the output artifact (the bytes about to leave for Gmail), which just landed as PR #481 (github.com/k08200/klorn/pull/481). Same pattern: pin the bytes the user actually approved, verify a fresh hash at execute, throw on mismatch. Schema versions on both halves so a future shape change forces a deliberate batch invalidation.

On partial revocation: that calculus is moot for us today since we don't ship multi-step action plans yet (single tool calls only, each gated by its own /approve transition). Logged the "measure both strategies before committing" note for when we do. The instinct to default to full-rebuild + measure-before-deviating is the same direction I'd lean if I had to guess today, but you're right that the surviving-prefix value calculus is domain-specific. Email actions that touch user state (sent mail, scheduled events) probably can't survive a stale plan; pure-read steps probably can. Won't know until we have plans to measure against.

Harjot Singh

"They were surfacing more" is the indictment of an entire product category in three words. The tell is that almost every AI feature ships as a new surface (a suggestion, a badge, a draft-waiting-for-review) because surfacing is easy to build and easy to demo, while actually deciding is hard and scary. But surfacing just relocates the work back to the human, now you read the AI's suggestions on top of the original emails, so the inbox got louder, not quieter. A decision layer is the harder, correct bet: the AI's job isn't to show you more, it's to remove items from your attention by handling them, archive this, auto-reply to that, escalate only the three that genuinely need you. That requires the system to commit to an action, which means it needs the confidence to act and the humility to abstain to you when unsure, exactly the verify-or-escalate line. Deleting half your own product to get there is the most credible part of this. That act-don't-surface principle is core to how I think about agents in Moonshift. What was the scariest decision to let it make autonomously, and where did you keep the human in the loop?

yongrean

The most credible point first: yes, the deletion was the part I had to fight myself on, not the part I built.

On scariest-autonomous: the one we actually ship is SILENT, not AUTO. AUTO is the loud one in demos, but SILENT is the more consequential decision in aggregate — false negatives there compound silently, because the user has no organic feedback loop to discover them. PUSH errors get flagged within minutes (the wrong interrupt is felt). SILENT errors might never get flagged at all. AUTO tier is currently classification-only; the action side is still gated behind a PendingAction → /approve transition, because I don't trust the calibration enough yet to let the model write to the world without a human-shaped checkpoint.

Where the human stays in the loop: every PUSH and QUEUE review, plus every MEDIUM/HIGH-risk action regardless of confidence. The line I'm holding is "the model can decide what to not show you, but not yet what to send on your behalf."

Two PRs that pinned this down in code, in case it's useful: #454 (read-path throw-on-invoke — proves the firewall page never re-invokes the scorer) and #468 (content-hash binding — any post-decision mutation of the input bytes invalidates the cached tier). The point of both is the same: make silent re-invocation a type error, not a code-review convention.

Curious what shape verify-or-escalate took for Moonshift — what's the action class you keep refusing to let the agent ship?