I spent six months building an AI-powered email tool. Then I deleted half of it.
Not because the model was bad. Not because the embeddings were off. Because I finally noticed what every "AI inbox" on the market — including the one I was building — was actually doing.
They were surfacing more.
More "smart suggestions". More "priority signals". More "AI-drafted replies waiting for your review". More badges, more banners, more nudges. Every product in the category was racing to add a new surface and call it intelligence.
My six-month-old prototype did all of that. I used it every day. And every morning the inbox was just as loud as the day I started. The model was right about which emails mattered. I still read all the other ones anyway, because they were right there, with a little colored dot suggesting maybe-they-mattered-too.
The model was solving the wrong problem.
The category bug
Look at the leading email tools through this lens:
- Superhuman made reading faster. You still read everything.
- Shortwave classified smarter. You still read everything.
- Motion / Reclaim got more proactive. They added a calendar layer on top of the noise.
None of them subtract. They all add. "AI assistant" became a license to put one more thing in front of you.
The deeper bug: these tools treat email as the primary surface and try to make it better. But email is not what you want. What you want is the decisions you have to make. Email is one cheap, unreliable transport that occasionally contains those decisions, buried under hundreds of messages that don't.
Making the transport prettier doesn't fix the signal-to-noise problem. It hides it.
The right abstraction: decision layer
A decision layer doesn't replace your inbox. It sits above mail, calendar, Slack, and any other transport, and it surfaces exactly one thing: items where the system genuinely needs your judgment.
Three properties make a layer a decision layer rather than just "a better inbox":
- It subtracts more than it adds. A signal that you've ignored four times in a row should never reach you again. Not muted. Gone.
- It treats relationships as data. Two people asking for the same thing are not the same ask. One of them has hit every deadline you've ever had with them; the other ships +3 days late, every time. That should weight the queue.
- It refuses to act without your approval. The model can draft, propose, plan. It cannot send, modify, or commit. Approval-before-action has to be a schema-level constraint, not a UI nicety.
None of these are AI features. They are boundary features. The AI is helpful for the classification underneath, but the value lives in what the system refuses to surface.
Here is what each of them actually looks like in production.
Pattern 1 — Closed-loop suppression learning
The single most useful thing the system does is forget.
Every time the user dismisses an attention item, we record a FeedbackEvent with the signal DISMISSED or IGNORED. That table is the cheap part. The interesting part is a job that reads it weekly:
// Excerpt: prisma, remember, priorityBucket, suppressionKey, and the CountKey
// type are defined elsewhere in the codebase.
const LOOK_BACK_DAYS = 30;   // window, per the threshold comment below
const DISMISS_THRESHOLD = 4; // dismissals of one tuple before it is suppressed

export async function runFeedbackAdaptation(userId: string): Promise<number> {
  const since = new Date(Date.now() - LOOK_BACK_DAYS * 24 * 60 * 60 * 1000);
  const events = await prisma.feedbackEvent.findMany({
    where: {
      userId,
      source: "ATTENTION_ITEM",
      signal: { in: ["DISMISSED", "IGNORED"] },
      createdAt: { gte: since },
    },
    select: { sourceId: true },
  });

  // Join to the attention items themselves so we can bucket by (source, type,
  // priority) instead of just (source, type). The bucket prevents an
  // over-broad rule from silencing legitimate high-priority signals.
  const items = await prisma.attentionItem.findMany({
    where: { id: { in: events.map(e => e.sourceId) } },
    select: { id: true, source: true, type: true, priority: true },
  });
  // Index the items by id so each feedback event can find its item.
  const itemMap = new Map(items.map(item => [item.id, item]));

  // Count dismissals per (source, type, bucket) tuple.
  const counts = new Map<string, { key: CountKey; count: number }>();
  for (const event of events) {
    const item = itemMap.get(event.sourceId);
    if (!item) continue;
    const bucket = priorityBucket(item.priority);
    const k = suppressionKey(item.source, item.type, bucket);
    const existing = counts.get(k);
    if (existing) existing.count += 1;
    else counts.set(k, { key: { source: item.source, type: item.type, bucket }, count: 1 });
  }

  // Threshold: same tuple dismissed ≥4 times in 30 days → suppress forever.
  const suppressed = [...counts.values()]
    .filter(({ count }) => count >= DISMISS_THRESHOLD)
    .map(({ key, count }) => ({ ...key, dismissCount: count }));

  await remember(userId, "CONTEXT", "attention_suppression_v2", JSON.stringify(suppressed));
  return suppressed.length;
}
The suppression set is then read at the upsert path for every new attention item:
export function isSuppressed(
  set: Set<string>,
  source: string,
  type: string,
  priority?: number,
): boolean {
  if (typeof priority === "number") {
    const bucket = priorityBucket(priority);
    if (set.has(suppressionKey(source, type, bucket))) return true;
  }
  return set.has(suppressionKey(source, type));
}
If the tuple is in the suppression set, the new attention item is forced into SILENT tier — it gets recorded for the audit log, but the user is never paged about it.
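The excerpts call priorityBucket and suppressionKey without showing them, so here is a minimal sketch of how the pieces could fit together. Everything beyond the names already used above is an assumption: the 0-100 priority scale, the 70 / 40 cutoffs, the colon-delimited key format, the "NORMAL" tier name, and the resolveTier helper are all hypothetical.

```typescript
// Hypothetical helpers; the cutoffs and key format are assumptions.
type PriorityBucket = "HIGH" | "MEDIUM" | "LOW";
type Tier = "SILENT" | "NORMAL"; // "NORMAL" is a placeholder tier name

// Assumed cutoffs on an assumed 0-100 priority scale.
function priorityBucket(priority: number): PriorityBucket {
  if (priority >= 70) return "HIGH";
  if (priority >= 40) return "MEDIUM";
  return "LOW";
}

// With a bucket: "source:type:BUCKET". Without one: the v1-style
// "source:type" key, which matches every bucket.
function suppressionKey(source: string, type: string, bucket?: PriorityBucket): string {
  return bucket ? `${source}:${type}:${bucket}` : `${source}:${type}`;
}

// Same logic as isSuppressed above, repeated so the sketch runs on its own.
function isSuppressed(set: Set<string>, source: string, type: string, priority?: number): boolean {
  if (typeof priority === "number") {
    if (set.has(suppressionKey(source, type, priorityBucket(priority)))) return true;
  }
  return set.has(suppressionKey(source, type));
}

// Hypothetical upsert-path check: suppressed tuples are forced to SILENT.
function resolveTier(
  set: Set<string>,
  item: { source: string; type: string; priority: number },
): Tier {
  return isSuppressed(set, item.source, item.type, item.priority) ? "SILENT" : "NORMAL";
}

// The user has trained away LOW-priority due-today commitments only.
const learned = new Set([suppressionKey("COMMITMENT", "DUE_TODAY", "LOW")]);
const lowTier = resolveTier(learned, { source: "COMMITMENT", type: "DUE_TODAY", priority: 10 });
const highTier = resolveTier(learned, { source: "COMMITMENT", type: "DUE_TODAY", priority: 90 });
```

Under these assumptions the LOW-priority item is silenced while the HIGH-priority item with the same source and type still reaches the user.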
A few design choices worth pointing out:
- Priority buckets matter. The first version keyed only on (source, type). Dismissing four "due-today commitment" notifications would silence every commitment-due signal, including overdue ones. The current version buckets priority into HIGH / MEDIUM / LOW, so the user can train "I don't care about LOW-priority due commitments" without losing the HIGH ones.
- Backwards-compatible key. Memory rows from the previous version are still read; a v1 row without a bucket matches every bucket, so a rollback doesn't lose learned behavior.
- 10-minute in-process cache. The upsert path is hot — checking the suppression set on every new item against the DB would be wasteful. A 10-minute TTL is short enough that a weekly adaptation run propagates fast and long enough to be free at request time.
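The in-process cache from the last bullet can be sketched in a few lines. This is an illustration under assumptions, not the production code: the TtlCache class, the getSuppressionSet wrapper, and the load callback (a stand-in for the real memory read) are hypothetical names, and the clock is injectable only so expiry is easy to exercise.

```typescript
// Hypothetical TTL cache; names and shape are assumptions for illustration.
const TEN_MINUTES_MS = 10 * 60 * 1000;

class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

const suppressionCache = new TtlCache<Set<string>>(TEN_MINUTES_MS);

// Read-through wrapper: only hits the store when the entry is cold.
async function getSuppressionSet(
  userId: string,
  load: (userId: string) => Promise<Set<string>>, // stand-in for the DB read
): Promise<Set<string>> {
  const cached = suppressionCache.get(userId);
  if (cached) return cached;
  const fresh = await load(userId);
  suppressionCache.set(userId, fresh);
  return fresh;
}
```

A cache like this is per process, so after a weekly adaptation run each instance can serve the old set for up to ten minutes. For a suppression list that changes weekly, that staleness seems like an acceptable trade.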
Notice what's missing: an LLM. The classifier underneath uses one, but the suppression loop itself is plain counting. The model is not the right tool for "remember what the user doesn't care about". A GROUP BY is.
Pattern 2 — Contact Trust Score
The second feature changed how I think about every productivity tool I've ever used.
When someone makes a commitment to you — "I'll send the deck by Thursday", "let's reconnect next week" — that's a tracked row in a commitment ledger. When the commitment is fulfilled, we record whether it was on-time or late, and update a running tally per contact:
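export async function updateTrustScore(
  userId: string,
  contactEmail: string,
  displayName: string | null,
  wasOnTime: boolean,
  daysLate = 0,
): Promise<void> {
  // The excerpt referenced `email` without defining it. Trimming and
  // lowercasing here is an assumed normalization so one contact maps to one row.
  const email = contactEmail.trim().toLowerCase();

  await prisma.contactTrustScore.upsert({
    where: { userId_contactEmail: { userId, contactEmail: email } },
    // First tracked commitment for this contact: start the tally.
    create: {
      userId,
      contactEmail: email,
      displayName,
      totalCount: 1,
      onTimeCount: wasOnTime ? 1 : 0,
      lateCount: wasOnTime ? 0 : 1,
      totalDelayDays: Math.max(0, daysLate),
      lastUpdatedAt: new Date(),
    },
    // Existing contact: atomic increments, so concurrent updates don't clobber each other.
    update: {
      totalCount: { increment: 1 },
      ...(wasOnTime ? { onTimeCount: { increment: 1 } } : { lateCount: { increment: 1 } }),
      ...(daysLate > 0 ? { totalDelayDays: { increment: daysLate } } : {}),
      lastUpdatedAt: new Date(),
    },
  });
}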
That tally rolls up to a badge:
- reliable — ≥80% on-time, ≥3 data points
- mostly reliable — ≥50% on-time, ≥3 data points
- unreliable — <50% on-time, ≥3 data points
- unknown — fewer than 3 data points, or stale (no signal in 60+ days)
The stale check is doing real work. A year-old "reliable" badge on someone who has since gone dark shouldn't be load-bearing. Until we get full exponential decay, we demote anyone with no signal in 60+ days (two half-lives of the planned decay) back to unknown.
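The badge rules above reduce to a small pure function. This is a minimal sketch, not the real computeResult: the computeBadge name and the TrustRow shape are hypothetical (the fields mirror the tally columns shown earlier), and averaging delay over late commitments only is an assumption. The thresholds and the 60-day stale window come from the list above.

```typescript
// Hypothetical stand-in for the rollup; thresholds come from the badge list.
type Badge = "reliable" | "mostly_reliable" | "unreliable" | "unknown";

interface TrustRow {
  totalCount: number;
  onTimeCount: number;
  lateCount: number;
  totalDelayDays: number;
  lastUpdatedAt: Date;
}

const MIN_DATA_POINTS = 3;
const STALE_AFTER_DAYS = 60;
const DAY_MS = 24 * 60 * 60 * 1000;

function computeBadge(row: TrustRow, now: Date = new Date()) {
  const onTimeRate = row.totalCount > 0 ? row.onTimeCount / row.totalCount : 0;
  // Assumption: average delay is taken over late commitments only.
  const avgDelayDays = row.lateCount > 0 ? row.totalDelayDays / row.lateCount : 0;
  const ageDays = (now.getTime() - row.lastUpdatedAt.getTime()) / DAY_MS;

  let badge: Badge;
  if (row.totalCount < MIN_DATA_POINTS || ageDays >= STALE_AFTER_DAYS) {
    badge = "unknown"; // too little data, or no signal in 60+ days
  } else if (onTimeRate >= 0.8) {
    badge = "reliable";
  } else if (onTimeRate >= 0.5) {
    badge = "mostly_reliable";
  } else {
    badge = "unreliable";
  }
  return { badge, onTimeRate, avgDelayDays };
}
```

Checking staleness before the rate thresholds means a contact with a perfect record still drops to unknown once the data is old, which is the demotion behavior described above.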
The badge gets surfaced as a small chip on the inbox card. But the actually-useful place is inside the agent prompt itself:
export async function buildTrustHintForPrompt(userId: string): Promise<string> {
  const rows = await prisma.contactTrustScore.findMany({
    where: { userId, totalCount: { gte: MIN_DATA_POINTS } },
    orderBy: { lastUpdatedAt: "desc" },
    take: 10,
  });
  if (rows.length === 0) return "";

  const lines = rows.map((row) => {
    const r = computeResult(row);
    const name = r.displayName || r.contactEmail;
    if (r.badge === "reliable")
      return `- ${name}: reliable (${Math.round(r.onTimeRate * 100)}% on-time)`;
    if (r.badge === "mostly_reliable") {
      const delay = r.avgDelayDays > 0 ? `, avg +${Math.round(r.avgDelayDays)}d late` : "";
      return `- ${name}: mostly reliable (${Math.round(r.onTimeRate * 100)}% on-time${delay})`;
    }
    return `- ${name}: unreliable (${Math.round(r.onTimeRate * 100)}% on-time, avg +${Math.round(r.avgDelayDays)}d late) — factor in extra buffer`;
  });

  return `\n## Contact Reliability\nBased on tracked commitments:\n${lines.join("\n")}`;
}
Now when the model decides how urgently to surface "Mina is asking for an update" vs "Sarah is asking for an update", it has actual data on which of them is going to deliver if you give them a polite nudge versus which one needs the deadline restated three times. The prompt isn't fed any feelings about either person. It is fed numbers.
The productivity-tool industry has spent ten years building calendars that don't know which meeting attendees actually show up on time. That's strange.
Pattern 3 — Approval-before-action as a schema constraint
The third pattern is the boring one, and it's the one most AI assistants get wrong.
The model is allowed to draft a reply. It is allowed to propose a calendar move. It is allowed to plan a sequence of actions. It is not allowed to send, move, or commit any of it. Not because we don't trust the model — we sometimes do — but because the user needs to know the surface area of what the system is doing on their behalf, and "silently sent" is a category of bug that never recovers user trust once it happens.
This is enforced at the schema level. Every action the agent proposes lives in a PendingAction row with a status enum. The state machine for that enum is the contract: only one transition (approve()) gets the side effect to actually run. The agent can propose() all day; nothing ships without a deliberate user transition.
The lowest-risk class of actions — internal-only things like blocking calendar time for focus, snoozing an item, setting a reminder — can be marked auto and skip approval. Everything that touches an outside party (sending mail, modifying someone else's calendar) is always gated. The boundary is conservative on purpose. The day a single user discovers their AI assistant silently sent an apology to their VC is the day every AI assistant in the category becomes harder to sell.
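To make the gating concrete, here is a minimal in-memory sketch of such a state machine. The status names, the INTERNAL / EXTERNAL scope field, and the propose / approve / execute signatures are assumptions for illustration. In the real system the constraint lives in the database schema, which this sketch does not attempt to model.

```typescript
// Hypothetical in-memory model; names and statuses are assumptions.
type ActionStatus = "PROPOSED" | "APPROVED" | "REJECTED" | "EXECUTED";
type ActionScope = "INTERNAL" | "EXTERNAL"; // EXTERNAL touches an outside party

interface PendingAction {
  id: string;
  kind: string;
  scope: ActionScope;
  status: ActionStatus;
}

// The agent can call this freely; it never triggers a side effect.
function propose(id: string, kind: string, scope: ActionScope): PendingAction {
  return { id, kind, scope, status: "PROPOSED" };
}

// The deliberate user transition.
function approve(action: PendingAction): PendingAction {
  if (action.status !== "PROPOSED") {
    throw new Error(`cannot approve action in status ${action.status}`);
  }
  return { ...action, status: "APPROVED" };
}

// The only place a side effect runs. EXTERNAL actions must be APPROVED;
// INTERNAL actions on the auto safelist may run straight from PROPOSED.
function execute(action: PendingAction, sideEffect: () => void): PendingAction {
  const isAuto = action.scope === "INTERNAL" && action.status === "PROPOSED";
  if (action.status !== "APPROVED" && !isAuto) {
    throw new Error(`action ${action.id} is not approved`);
  }
  sideEffect();
  return { ...action, status: "EXECUTED" };
}
```

Routing every side effect through a single execute function leaves one place to audit. A proposed email that nobody approved cannot be sent, because no other code path runs the side effect.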
What this looks like in practice
The sum of these three patterns is not a smarter inbox. It is a small, quiet queue that contains roughly six to twelve items on any given day. Each item is either an explicit ask, a tracked commitment coming due, or a proposed action waiting for confirmation. The model spent the morning reading and reasoning about a few hundred other things, all of which the system decided you don't need to know about.
When you dismiss an item, the system learns. When a contact reliably delivers, their asks rise. When the model wants to act outside a narrow safelist, it asks first. The result, after a few weeks of training the noise floor, is a queue that feels like it was assembled by someone who actually knows what you ignore.
None of this requires a frontier model. The classifier underneath is a small, cheap LLM with strict cost guards. Almost all of the value is in the boundaries — what the system refuses to surface, what it refuses to do without you, and what it remembers about people you work with.
If you're building anything in this category and you find yourself adding a new surface that shows the user more things, stop and ask whether you'd rather build the thing that subtracts. The market is crowded with smarter inboxes. There is no good decision layer yet.
I'm shipping one at klorn.ai. Not asking for signups — sharing the pattern because I think more people should be building toward it. The closed-loop suppression and trust-score code above are excerpts from the real thing.
Built in TypeScript on Fastify, Prisma, and Postgres. Code patterns shown are production excerpts.