<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yongrean</title>
    <description>The latest articles on DEV Community by yongrean (@k08200).</description>
    <link>https://dev.to/k08200</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936365%2Fdfb1cda1-b92e-496b-a2be-019de80d764f.png</url>
      <title>DEV Community: yongrean</title>
      <link>https://dev.to/k08200</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/k08200"/>
    <language>en</language>
    <item>
      <title>Confidence is enough to decide. It's not enough to do.</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Thu, 25 Jun 2026 10:18:46 +0000</pubDate>
      <link>https://dev.to/k08200/confidence-is-enough-to-decide-its-not-enough-to-do-8ck</link>
      <guid>https://dev.to/k08200/confidence-is-enough-to-decide-its-not-enough-to-do-8ck</guid>
      <description>&lt;p&gt;A classifier confidence of 0.99 is enough to decide a tier. It is not enough to send an email you can't unsend.&lt;/p&gt;

&lt;p&gt;Those are two different bars, and most "autonomous" systems use the first one to clear the second. That's the bug.&lt;/p&gt;

&lt;p&gt;This is the third post in a series that started as a cheap-model brag and turned into an architecture argument. &lt;a href="https://dev.to/k08200/i-let-gpt-4o-and-a-cheaper-model-fight-over-my-inbox-gpt-4o-lost-fkj"&gt;Post one&lt;/a&gt;: a cheap model beat GPT-4o on email triage. &lt;a href="https://dev.to/k08200/i-dont-trust-the-llm-to-classify-my-email-so-i-dont-let-it-55d9"&gt;Post two&lt;/a&gt;: the model only scores four features, and a deterministic rule picks the tier. A commenter, &lt;a href="https://dev.to/hannune"&gt;@hannune&lt;/a&gt;, pointed at one of those four features:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your reversibility signal is something I have not seen named explicitly before but it is exactly the right axis for anything that touches irreversible state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He's right, and it's the cleanest way into the last piece of the design. So: what &lt;code&gt;reversibility&lt;/code&gt; actually routes.&lt;/p&gt;

&lt;h2&gt;The line: can the user undo it with one click?&lt;/h2&gt;

&lt;p&gt;Most of what a mail agent does is reversible. Archive, un-archive. Trash, restore. Apply a label, remove it. Mark read, mark unread. Re-tier. Snooze. Every one of those is a single click away from undone, so every one of those rides on exactly what post two described — classifier confidence plus a hash of the input bytes that drove the decision. If the model's confident and the inputs are pinned, ship it.&lt;/p&gt;

&lt;p&gt;Three actions are not like the others:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;FLOOR_ACTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;send_email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;delete_permanent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;forward_external&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send (Gmail's undo-send window is at most 30 seconds, then it's gone). Permanent delete (skips Trash, no recovery path). Forward to an external party (same outcome as send: once it leaves, it's out). For these, &lt;code&gt;reversibility&lt;/code&gt; scores near zero, and near-zero reversibility is the signal that says: confidence is necessary but no longer sufficient. You need something the probabilistic layer can't give you.&lt;/p&gt;
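&lt;p&gt;A minimal sketch of how that list can double as a type and a runtime check. The &lt;code&gt;FloorAction&lt;/code&gt; name is used by the receipt code in this post; &lt;code&gt;isFloorAction&lt;/code&gt; is a hypothetical helper added for illustration and may not match what the repo exposes.&lt;/p&gt;

```typescript
// The three irreversible actions named in the post.
const FLOOR_ACTIONS = ["send_email", "delete_permanent", "forward_external"] as const;

// Union type derived from the tuple:
// "send_email" | "delete_permanent" | "forward_external".
type FloorAction = (typeof FLOOR_ACTIONS)[number];

// Hypothetical helper: narrows an arbitrary tool name to a FloorAction.
function isFloorAction(name: string): name is FloorAction {
  return (FLOOR_ACTIONS as readonly string[]).includes(name);
}

console.log(isFloorAction("send_email")); // true
console.log(isFloorAction("archive")); // false
```

&lt;p&gt;Deriving the type from the array keeps the list and the type from drifting apart: adding a fourth action to the array widens the type automatically.&lt;/p&gt;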

&lt;h2&gt;Why confidence isn't enough: sign the artifact, not the narration&lt;/h2&gt;

&lt;p&gt;The failure mode here has a name I borrowed from people doing this in crypto: &lt;strong&gt;agent-vs-ABI mismatch&lt;/strong&gt;. The agent narrates a high-level intent — "I sent a polite follow-up to Alice" — and the thing that actually executed did something the narration glossed over: wrong recipient, an edited body, a different attachment. The agent isn't lying. Natural language is lossy by definition; the description and the bytes are allowed to drift.&lt;/p&gt;

&lt;p&gt;The cure isn't to verify the narration harder. It's to stop signing the narration and sign the deterministic artifact instead: the actual bytes that will travel to Gmail.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Node's built-in hashing module.&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:crypto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendEmailPayloadHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RECEIPT_SCHEMA_VERSION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;send_email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;NFC&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;NFC&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;NFC&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you approve a send, the system mints an &lt;code&gt;ActionReceipt&lt;/code&gt; that pins this hash: the bytes you actually approved, normalized so a cosmetic edit (&lt;code&gt;Alice@Example.com&lt;/code&gt; vs &lt;code&gt;alice@example.com&lt;/code&gt;) doesn't raise a false alarm, and NFC-normalized so composed and decomposed Unicode hash identically (this matters the moment a body has Korean in it). At execute time it recomputes the hash from the about-to-send bytes and checks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;verifyReceipt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receipt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ActionReceipt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;FloorAction&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;currentPayloadHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receipt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;RECEIPT_SCHEMA_VERSION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ActionReceiptSchemaError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receipt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receipt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ActionReceiptMismatchError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receipt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentPayloadHash&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receipt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;payloadHash&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentPayloadHash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ActionReceiptMismatchError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receipt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentPayloadHash&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any drift between approve and execute throws and the action is refused. Reusing a &lt;code&gt;send_email&lt;/code&gt; receipt to authorize a &lt;code&gt;delete_permanent&lt;/code&gt; throws on the action check. Bumping the schema version deliberately invalidates every pending receipt and forces a re-approve under the new shape. The autonomous path fails closed: no valid receipt, no irreversible action.&lt;/p&gt;
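&lt;p&gt;To make the approve-then-execute round trip concrete, here is a self-contained sketch. It reuses the canonicalization shown above, but the schema version value, the simplified &lt;code&gt;Receipt&lt;/code&gt; shape, and the &lt;code&gt;approve&lt;/code&gt; and &lt;code&gt;canSend&lt;/code&gt; helpers are assumptions made for illustration, and it returns a boolean where the real code throws.&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Assumed value; the post only says that bumping it invalidates pending receipts.
const RECEIPT_SCHEMA_VERSION = 1;

type SendInput = { to: string; subject: string; body: string };
// Simplified receipt shape for this sketch; the real ActionReceipt may carry more fields.
type Receipt = { v: number; action: string; payloadHash: string };

// Same canonicalization as the post: NFC everywhere, recipient trimmed and lowercased.
function hashSend(input: SendInput): string {
  const canonical = {
    v: RECEIPT_SCHEMA_VERSION,
    action: "send_email",
    to: input.to.normalize("NFC").trim().toLowerCase(),
    subject: input.subject.normalize("NFC"),
    body: input.body.normalize("NFC"),
  };
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}

// Approval time: pin the hash of the exact payload the user saw.
function approve(input: SendInput): Receipt {
  return { v: RECEIPT_SCHEMA_VERSION, action: "send_email", payloadHash: hashSend(input) };
}

// Execution time: recompute from the about-to-send payload and refuse on any drift.
function canSend(receipt: Receipt, outgoing: SendInput): boolean {
  if (receipt.v !== RECEIPT_SCHEMA_VERSION) return false;
  if (receipt.action !== "send_email") return false;
  return receipt.payloadHash === hashSend(outgoing);
}

const approved = { to: "Alice@Example.com", subject: "Follow-up", body: "안녕하세요" };
const receipt = approve(approved);

// Cosmetic differences survive: recipient case, and decomposed (NFD) Hangul in the body.
const cosmetic = { ...approved, to: "alice@example.com", body: approved.body.normalize("NFD") };
console.log(canSend(receipt, cosmetic)); // true

// A real change does not: a different recipient is refused.
console.log(canSend(receipt, { ...approved, to: "mallory@example.com" })); // false
```

&lt;p&gt;The object literal's key order is fixed in the source, so &lt;code&gt;JSON.stringify&lt;/code&gt; produces the same string for the same values on every run, which is what makes the hash comparable across the two steps.&lt;/p&gt;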

&lt;h2&gt;So reversibility is the router&lt;/h2&gt;

&lt;p&gt;That's the whole point of naming &lt;code&gt;reversibility&lt;/code&gt; as a first-class feature instead of folding it into "risk." It's not decoration on the tier decision — it's the axis that decides &lt;em&gt;which trust model an action even gets&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High reversibility → the probabilistic layer is enough. Confidence + input hash, done.&lt;/li&gt;
&lt;li&gt;Near-zero reversibility → drop to the deterministic floor. Confidence got you to "this is worth doing"; the signed artifact is what gets you to "and the bytes are exactly the ones approved."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two layers, and the feature score picks which one applies. The probabilistic layer is allowed to stay probabilistic precisely because the floor catches the cases where probability isn't enough.&lt;/p&gt;
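&lt;p&gt;As a rough sketch, the routing step can be as small as one comparison. The cutoff value and the &lt;code&gt;trustLayerFor&lt;/code&gt; name are hypothetical; the post says "near zero" without giving a number.&lt;/p&gt;

```typescript
type TrustLayer = "probabilistic" | "deterministic_floor";

// Hypothetical cutoff for "near-zero"; the post does not state a number.
const NEAR_ZERO_REVERSIBILITY = 0.1;

function trustLayerFor(reversibility: number): TrustLayer {
  // Scores at or below the cutoff, and NaN, drop to the floor.
  return reversibility > NEAR_ZERO_REVERSIBILITY ? "probabilistic" : "deterministic_floor";
}

console.log(trustLayerFor(0.95)); // probabilistic (archive, label, snooze)
console.log(trustLayerFor(0.02)); // deterministic_floor (send, permanent delete)
```

&lt;p&gt;Writing the comparison so that a malformed score falls through to the floor keeps the default on the cautious side.&lt;/p&gt;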

&lt;h2&gt;The honest part&lt;/h2&gt;

&lt;p&gt;Fair question: is this actually wired, or just a module with a TODO? It's wired. The receipt is minted at &lt;code&gt;/approve&lt;/code&gt; from the exact bytes you clicked on, &lt;code&gt;executeToolCall&lt;/code&gt; refuses any floor action that arrives without a verified receipt (&lt;code&gt;FloorReceiptRequiredError&lt;/code&gt;), and &lt;code&gt;send_email&lt;/code&gt; re-checks the payload hash in its own path before anything leaves.&lt;/p&gt;

&lt;p&gt;The honest edges, because there always are some: of the three floor actions, only &lt;code&gt;send_email&lt;/code&gt; is a callable tool today. &lt;code&gt;delete_permanent&lt;/code&gt; and &lt;code&gt;forward_external&lt;/code&gt; aren't wired as tool cases yet, but the central guard already fails them closed, so a future case can't ship a side effect without a receipt. And the autonomous agent runs in SUGGEST mode by default: read-only tools plus propose-only, with no mutating power until you opt into AUTO, and even then the floor stands in front of the irreversible three. The brake went in before the autonomous engine is switched on, which is the only order that isn't reckless. I'd rather show you the guard and its TODOs than claim more than the code does.&lt;/p&gt;
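&lt;p&gt;A sketch of what a central fail-closed guard of this kind might look like. The error name comes from the post; the class body and the &lt;code&gt;assertFloorCleared&lt;/code&gt; function are stand-ins, not the code in &lt;code&gt;executeToolCall&lt;/code&gt;.&lt;/p&gt;

```typescript
const FLOOR_ACTIONS = ["send_email", "delete_permanent", "forward_external"];

// Error name is from the post; this class body is a stand-in.
class FloorReceiptRequiredError extends Error {
  constructor(action: string) {
    super("floor action requires a verified receipt: " + action);
    this.name = "FloorReceiptRequiredError";
  }
}

// Hypothetical central guard, run before any tool-specific code.
function assertFloorCleared(action: string, receiptVerified: boolean): void {
  if (FLOOR_ACTIONS.includes(action)) {
    if (!receiptVerified) throw new FloorReceiptRequiredError(action);
  }
}

assertFloorCleared("archive", false); // reversible: no receipt needed
assertFloorCleared("send_email", true); // verified receipt: allowed

try {
  assertFloorCleared("delete_permanent", false); // no tool case yet, still refused
} catch (err) {
  console.log((err as Error).name); // FloorReceiptRequiredError
}
```

&lt;p&gt;Because the check keys on the action name rather than on a tool implementation, an action with no tool case yet is refused the same way as one that has one.&lt;/p&gt;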

&lt;h2&gt;The portable version&lt;/h2&gt;

&lt;p&gt;Separate "confident enough to decide" from "verified enough to do." For anything your system can't undo with one user click, don't trust the model's description of what it's about to do — hash the deterministic artifact at approval, verify it at execution, and fail closed on any drift. Confidence is a fine reason to &lt;em&gt;decide&lt;/em&gt;. It is never, by itself, a reason to &lt;em&gt;do&lt;/em&gt; something you can't take back.&lt;/p&gt;

&lt;p&gt;The whole floor is ~210 readable lines in the open, AGPLv3: &lt;strong&gt;&lt;a href="https://github.com/k08200/klorn" rel="noopener noreferrer"&gt;github.com/k08200/klorn&lt;/a&gt;&lt;/strong&gt; — &lt;code&gt;packages/api/src/attention-floor.ts&lt;/code&gt;. Three posts, one idea: keep the model in the perception layer, and put everything you actually stand behind in code you can read.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>security</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I don't trust the LLM to classify my email. So I don't let it.</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Thu, 25 Jun 2026 05:57:48 +0000</pubDate>
      <link>https://dev.to/k08200/i-dont-trust-the-llm-to-classify-my-email-so-i-dont-let-it-55d9</link>
      <guid>https://dev.to/k08200/i-dont-trust-the-llm-to-classify-my-email-so-i-dont-let-it-55d9</guid>
      <description>&lt;p&gt;My classifier calls an LLM on every single email. The LLM is not allowed to classify the email.&lt;/p&gt;

&lt;p&gt;That sounds like a contradiction. It's the most important design decision in the thing.&lt;/p&gt;

&lt;p&gt;A reader named &lt;a href="https://dev.to/nazar_boyko"&gt;@nazar_boyko&lt;/a&gt; left a comment on my &lt;a href="https://dev.to/k08200/i-let-gpt-4o-and-a-cheaper-model-fight-over-my-inbox-gpt-4o-lost-fkj"&gt;last post&lt;/a&gt; — the one where a cheap model beat GPT-4o on email triage — and put it better than I did:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once the LLM is a feature scorer and not the decider, "consistency over genius" falls right out of it, and a cheap fast model is exactly what you want for reading the same four signals the same way every time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The price upset was the fun headline. This is the actual thesis. So here it is on its own.&lt;/p&gt;

&lt;h2&gt;The model scores four numbers. That's all it does.&lt;/h2&gt;

&lt;p&gt;Every inbound email goes to the LLM with one job: read the message and return four scores between 0 and 1.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;confidence&lt;/strong&gt; — how sure you are the other three scores are right&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;senderTrust&lt;/strong&gt; — 1.0 a known, important human; 0.3 an automated transactional notice you signed up for; 0.0 anonymous bulk marketing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reversibility&lt;/strong&gt; — if this got auto-handled and that was wrong, how easy is the recovery? 1.0 trivial undo; 0.0 irreversible ("lost an investor")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;urgency&lt;/strong&gt; — needs attention within hours (1.0) down to informational, no clock (0.0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The response schema is literally:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"senderTrust"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reversibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"urgency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"short phrase"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No tier. The model never sees the words PUSH, QUEUE, SILENT, AUTO in its output contract. It reads an email and describes it along four axes. It does not get a vote on what happens next.&lt;/p&gt;
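&lt;p&gt;One way a caller might enforce that contract before anything reaches the rule. This is a hedged sketch: &lt;code&gt;parseFeatures&lt;/code&gt; is a hypothetical name, and the post does not say how the real code validates the response.&lt;/p&gt;

```typescript
type Features = {
  confidence: number;
  senderTrust: number;
  reversibility: number;
  urgency: number;
  reason: string;
};

const SCORE_KEYS = ["confidence", "senderTrust", "reversibility", "urgency"] as const;

// Hypothetical validator: returns null for anything that does not match the schema.
function parseFeatures(raw: string): Features | null {
  let data: any;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  for (const key of SCORE_KEYS) {
    const value = data[key];
    // Rejects non-numbers, NaN, and anything outside [0, 1].
    if (typeof value !== "number" || !(value >= 0) || value > 1) return null;
  }
  if (typeof data.reason !== "string") return null;
  return {
    confidence: data.confidence,
    senderTrust: data.senderTrust,
    reversibility: data.reversibility,
    urgency: data.urgency,
    reason: data.reason,
  };
}

const ok = parseFeatures(
  '{"confidence":0.9,"senderTrust":0.3,"reversibility":1,"urgency":0.1,"reason":"receipt"}'
);
console.log(ok !== null); // true
console.log(parseFeatures('{"tier":"PUSH"}')); // null
```

&lt;p&gt;Copying only the five known fields into the result means an extra key such as a tier, if the model ever volunteers one, never travels further.&lt;/p&gt;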

&lt;h2&gt;A rule I can read decides the tier&lt;/h2&gt;

&lt;p&gt;What happens next lives in one file, &lt;code&gt;tier-policy.ts&lt;/code&gt;, in a function with no model in it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1. Very low confidence → QUEUE. Hiding uncertain mail is the worst failure.&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;QUEUE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Urgent AND sure → wake the user.&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;urgency&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PUSH&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Anonymous, no clock, trivially reversible → SILENT (narrow: marketing only).&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;senderTrust&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;urgency&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reversibility&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SILENT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 4. Reversible, very sure, not urgent, trusted → AUTO.&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reversibility&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;urgency&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;senderTrust&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;AUTO&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 5. Default → QUEUE. "I'll look at it on my own schedule" is the dominant bucket.&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;QUEUE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole decider. Every threshold is a named constant in one object above it, not a magic number sprinkled through a prompt. Order matters — earlier branches win. I can read this in thirty seconds, write a unit test for each branch, and change the policy without touching the model or re-running an eval.&lt;/p&gt;
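&lt;p&gt;To show what "a unit test for each branch" can look like, here is a self-contained sketch of the same five branches with one case per branch. The threshold values are the ones in the snippet above; the constant names, the &lt;code&gt;decideTier&lt;/code&gt; wrapper, and the &lt;code&gt;all&lt;/code&gt; helper are assumptions for illustration.&lt;/p&gt;

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";
type Features = { confidence: number; senderTrust: number; reversibility: number; urgency: number };

// Values are the ones shown in the post; the constant names are made up for this sketch.
const T = {
  queueBelowConfidence: 0.5,
  pushUrgency: 0.7,
  pushConfidence: 0.7,
  silentMaxTrust: 0.2,
  silentMaxUrgency: 0.2,
  silentMinReversibility: 0.9,
  autoMinReversibility: 0.85,
  autoMinConfidence: 0.85,
  autoMaxUrgency: 0.5,
  autoMinTrust: 0.5,
} as const;

// Plain conjunction helper so each branch reads as one condition list.
const all = (...conds: boolean[]): boolean => conds.every(Boolean);

function decideTier(f: Features): Tier {
  // 1. Very low confidence.
  if (T.queueBelowConfidence > f.confidence) return "QUEUE";
  // 2. Urgent and sure.
  if (all(f.urgency >= T.pushUrgency, f.confidence >= T.pushConfidence)) return "PUSH";
  // 3. Anonymous, no clock, trivially reversible.
  if (
    all(
      T.silentMaxTrust > f.senderTrust,
      T.silentMaxUrgency > f.urgency,
      f.reversibility > T.silentMinReversibility
    )
  ) return "SILENT";
  // 4. Reversible, very sure, not urgent, trusted.
  if (
    all(
      f.reversibility >= T.autoMinReversibility,
      f.confidence >= T.autoMinConfidence,
      T.autoMaxUrgency > f.urgency,
      f.senderTrust >= T.autoMinTrust
    )
  ) return "AUTO";
  // 5. Default.
  return "QUEUE";
}

// One case per branch, in branch order.
const cases: [Features, Tier][] = [
  [{ confidence: 0.4, senderTrust: 0.9, reversibility: 0.9, urgency: 0.9 }, "QUEUE"],
  [{ confidence: 0.8, senderTrust: 0.9, reversibility: 0.2, urgency: 0.9 }, "PUSH"],
  [{ confidence: 0.8, senderTrust: 0.0, reversibility: 1.0, urgency: 0.0 }, "SILENT"],
  [{ confidence: 0.9, senderTrust: 0.6, reversibility: 0.9, urgency: 0.1 }, "AUTO"],
  [{ confidence: 0.6, senderTrust: 0.6, reversibility: 0.5, urgency: 0.3 }, "QUEUE"],
];
for (const [features, expected] of cases) {
  console.log(decideTier(features) === expected); // true for every case
}
```

&lt;p&gt;The first case is the one that demonstrates ordering: its urgency would qualify for PUSH, but the low-confidence branch runs first and wins.&lt;/p&gt;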

&lt;p&gt;Try doing any of that to "I asked GPT-4o to pick a tier and it picked QUEUE." You can't test it. You can't diff it. You can't explain to yourself why message #4,012 got hidden. The decision isn't anywhere — it's smeared across a weight matrix and a paragraph of prompt.&lt;/p&gt;

&lt;h2&gt;Why "consistency over genius" falls right out of it&lt;/h2&gt;

&lt;p&gt;Once the model's only job is to score four signals, the question stops being "which model reasons best about email policy?" and becomes "which model reads the same four signals the same way every time?"&lt;/p&gt;

&lt;p&gt;Those are different questions with different answers. The first one points you at the biggest, most expensive model. The second one points you at a cheap, fast, low-variance one — because a frontier model's extra reasoning, applied to a 30-word email, mostly buys you &lt;em&gt;more ways to have an opinion&lt;/em&gt;, which is variance, which is the enemy when you've already moved the judgment into a rule. That's why the cheap model won the last post. It wasn't a cost compromise. Splitting scorer from decider is what made the cheap model the &lt;em&gt;correct&lt;/em&gt; choice, not just the affordable one.&lt;/p&gt;

&lt;p&gt;And because the contract is "four features → tier" and nothing else, the model isn't load-bearing for correctness — it's load-bearing for &lt;em&gt;perception&lt;/em&gt;. Proof: when the LLM is down or rate-limited, a keyword fallback produces the same four features with zero model calls, and the exact same rule runs on top. The plumbing doesn't change. The only thing a better model buys you is sharper feature scores on the genuinely ambiguous mail — which is why the one place I'll spend a frontier model is a dial that escalates &lt;em&gt;only&lt;/em&gt; the low-confidence tail, and nowhere else.&lt;/p&gt;

&lt;h2&gt;What this buys you that "let the model decide" can't&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Auditability.&lt;/strong&gt; The policy is a file. Code review covers it. A regression test pins every branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable learning.&lt;/strong&gt; When I correct a misclassification, the correction doesn't go fight the model for control of the answer. It becomes an example that nudges the &lt;em&gt;feature scores&lt;/em&gt; toward the right values, and the rule — the spine — stays fixed. The thing that learns and the thing that decides are separated on purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A blast radius you chose.&lt;/strong&gt; AUTO's thresholds sit deliberately high (reversibility ≥ 0.85, confidence ≥ 0.85, trusted sender) so the system structurally cannot auto-handle a destructive or low-trust action. That floor is a number I can point at, not a behavior I'm hoping the model keeps exhibiting.&lt;/p&gt;

&lt;h2&gt;The honest part&lt;/h2&gt;

&lt;p&gt;This doesn't make the model's judgment good — it makes the &lt;em&gt;decision layer&lt;/em&gt; honest. Garbage feature scores still produce garbage tiers; the rule only guarantees that identical scores always map to the identical tier, and that I can see why. The thresholds were hand-tuned against 50 emails, and calibrating them from accumulated real corrections is still ahead of me, not behind. The keyword fallback, by design, can't emit PUSH — so a total LLM outage degrades urgent mail to "visible in the queue," never "silently hidden," but it does degrade. I'd rather write that down than pretend the split is free.&lt;/p&gt;
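&lt;p&gt;One way a fallback could guarantee it never emits PUSH is to cap the urgency it reports below the 0.7 that the PUSH branch requires. The sketch below assumes exactly that; the keyword list, the fixed scores, and the cap are all hypothetical, since the post does not show the real fallback's rules.&lt;/p&gt;

```typescript
type Features = { confidence: number; senderTrust: number; reversibility: number; urgency: number };

// Hypothetical keyword list; the real fallback's rules are not shown in the post.
const TIME_SENSITIVE = ["urgent", "asap", "deadline"];

// Assumed cap: stays under the 0.7 urgency that the PUSH branch requires.
const FALLBACK_MAX_URGENCY = 0.6;

// Produces the same four features as the model would, with no model call.
function keywordFeatures(text: string): Features {
  const lower = text.toLowerCase();
  const timeSensitive = TIME_SENSITIVE.some((word) => lower.includes(word));
  return {
    confidence: 0.6, // assumed: enough to clear the low-confidence branch, no more
    senderTrust: 0.5, // assumed neutral value
    reversibility: 0.5, // assumed neutral value
    urgency: timeSensitive ? FALLBACK_MAX_URGENCY : 0.3,
  };
}

const f = keywordFeatures("URGENT: contract deadline is today");
console.log(f.urgency); // 0.6
console.log(f.urgency >= 0.7); // false, so the PUSH branch cannot fire
```

&lt;p&gt;With these assumed scores the SILENT and AUTO branches cannot fire either, so every message handled by this fallback lands in QUEUE, which matches the "visible in the queue" degradation described above.&lt;/p&gt;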

&lt;h2&gt;The takeaway is portable&lt;/h2&gt;

&lt;p&gt;This isn't really about email. Any time you're handing an LLM a decision with consequences, you can ask the same question: does the model need to &lt;em&gt;decide&lt;/em&gt;, or does it need to &lt;em&gt;read&lt;/em&gt;? Separate "what the model perceives" from "what the system does about it." Put the second half in code you can read, test, and stand behind. You get auditability, you get to use a cheaper model, and you stop being surprised by your own product.&lt;/p&gt;

&lt;p&gt;The judge, the rule, and the thresholds are all in the open — AGPLv3: &lt;strong&gt;&lt;a href="https://github.com/k08200/klorn" rel="noopener noreferrer"&gt;github.com/k08200/klorn&lt;/a&gt;&lt;/strong&gt;. The decider is &lt;code&gt;packages/api/src/tier-policy.ts&lt;/code&gt;, about sixty readable lines. Go see how few of them there are.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I let GPT-4o and a cheaper model fight over my inbox. GPT-4o lost.</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Wed, 24 Jun 2026 16:59:28 +0000</pubDate>
      <link>https://dev.to/k08200/i-let-gpt-4o-and-a-cheaper-model-fight-over-my-inbox-gpt-4o-lost-fkj</link>
      <guid>https://dev.to/k08200/i-let-gpt-4o-and-a-cheaper-model-fight-over-my-inbox-gpt-4o-lost-fkj</guid>
      <description>&lt;p&gt;Here's the scoreboard. Same 50 emails, same prompt, same 4-tier task:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemini-2.5-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100% recall on urgent mail — never missed one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemini-2.5-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;the "smarter" sibling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openai/gpt-4o&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;the reflex pick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;anything cheaper than flash&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 80%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;failed my floor, didn't ship&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cheap model didn't tie the expensive ones. It beat them by six points and never missed an email that should have woken me up. The two models I'd have reached for on instinct — the obviously-smarter ones, the ones that cost several times more per token — both came second.&lt;/p&gt;

&lt;p&gt;I almost didn't run this comparison at all. That's the part worth your time.&lt;/p&gt;

&lt;h2&gt;The task&lt;/h2&gt;

&lt;p&gt;I'm building an email firewall. Every inbound message gets exactly one of four tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SILENT&lt;/strong&gt; — recorded, never shown (marketing, receipts, FYI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QUEUE&lt;/strong&gt; — visible when I choose to look, no notification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PUSH&lt;/strong&gt; — actually interrupt me&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AUTO&lt;/strong&gt; — reversible, hands-off (classified only, for now)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the entire output surface. No suggestion cards, no "AI thinks you should reply" badges. One label per email.&lt;/p&gt;

&lt;p&gt;The thing doing the labeling is what I call the judge. It's the part I'd assumed needed a good model — reading an email and deciding whether it's allowed to ring your phone feels like a judgment call, and judgment calls are what you buy a frontier model for.&lt;/p&gt;

&lt;p&gt;So I had &lt;code&gt;gpt-4o&lt;/code&gt; wired in and I was ready to leave it there. Then I did the boring thing and measured it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "measured" means here
&lt;/h2&gt;

&lt;p&gt;The eval set is committed to the repo: &lt;code&gt;packages/api/eval/judge-eval-set.json&lt;/code&gt;. Fifty emails, synthetic and PII-free, hand-labeled to encode one specific person's policy — QUEUE is the default, SILENT is narrow (clear marketing only), PUSH is urgent &lt;em&gt;and&lt;/em&gt; confident, AUTO is reversible &lt;em&gt;and&lt;/em&gt; not urgent. The tier mix is 21 QUEUE / 13 PUSH / 12 SILENT / 4 AUTO.&lt;/p&gt;

&lt;p&gt;One command runs a model against it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm &lt;span class="nb"&gt;eval&lt;/span&gt;:judge   &lt;span class="c"&gt;# tsx scripts/poc-accuracy.ts --in=eval/judge-eval-set.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran it across the models I was actually choosing between. Flash won. Not "won on cost, tied on quality": it won on quality, and it happens to be the cheap one. I pinned production to it and wrote the result into the commit message so future-me can't quietly pretend the cheap model was a sacrifice I made for the budget:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;flash is not a compromise — it scores 88% / 100% PUSH recall, beating gemini-2.5-pro and gpt-4o (both 82%).&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
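
&lt;p&gt;The two numbers in that commit message are different measurements: accuracy counts every email, recall counts only the emails that should have landed in one tier. A minimal sketch of both, assuming each eval row carries an &lt;code&gt;expected&lt;/code&gt; and a &lt;code&gt;predicted&lt;/code&gt; tier. Those field names are hypothetical and may not match the eval set or the script's real output:&lt;/p&gt;

```typescript
// One scored eval row. Field names are assumptions for this sketch.
interface Labeled {
  expected: string;
  predicted: string;
}

// Share of all rows where the predicted tier equals the hand label.
function accuracy(rows: Labeled[]): number {
  const hits = rows.filter((row) => row.predicted === row.expected).length;
  return hits / rows.length;
}

// Share of rows hand-labeled `tier` that the model also put in `tier`.
function recall(rows: Labeled[], tier: string): number {
  const relevant = rows.filter((row) => row.expected === tier);
  if (relevant.length === 0) return NaN; // tier absent from the set
  const hits = relevant.filter((row) => row.predicted === tier).length;
  return hits / relevant.length;
}
```

&lt;p&gt;With this shape, "100% PUSH recall" is simply &lt;code&gt;recall(rows, "PUSH") === 1&lt;/code&gt;, and it can hold even when overall accuracy is below 100%.&lt;/p&gt;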

&lt;h2&gt;
  
  
  Why cheap wins &lt;em&gt;here&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;A frontier model sells you reasoning depth. Long chains, hard problems, hold-ten-things-in-your-head problems. Email triage is none of that. It's a short, repetitive, read-four-signals-and-be-consistent problem. You're not paying for capability that moves this needle — you're paying for capability you never touch, and a bigger model's extra "thinking" mostly buys you more chances to overthink a 30-word email.&lt;/p&gt;

&lt;p&gt;There's an architecture reason too, and it's the load-bearing one. &lt;strong&gt;The LLM never picks the tier.&lt;/strong&gt; It scores four features per email — confidence, sender trust, reversibility, urgency — and a ~20-line deterministic rule maps those four numbers to PUSH/QUEUE/SILENT/AUTO. The model is a feature-scorer, not a decider. So I don't need a model that reasons brilliantly about email policy. I need one that reads four signals the same way every time. Consistency, not genius. That's exactly the job a cheap fast model is good at — and exactly the job where a bigger model's cleverness becomes variance you don't want.&lt;/p&gt;

&lt;p&gt;It also means the policy is auditable without the model in the loop. I can read the rule. I can test it. If the model's down, a keyword fallback produces the same four features so urgent mail still gets through. None of that works if you let the LLM free-hand the answer.&lt;/p&gt;
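
&lt;p&gt;To make "a deterministic rule maps four numbers to a tier" concrete, here is a minimal sketch in TypeScript. The 0.8 and 0.2 thresholds, the &lt;code&gt;Features&lt;/code&gt; field names, and the branch order are assumptions for illustration, not the rule in the repo. Only the four features and the four tiers come from the setup described above.&lt;/p&gt;

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";

// Scores the model returns for one email, each assumed to be in the range 0..1.
interface Features {
  confidence: number;
  senderTrust: number;
  reversibility: number;
  urgency: number;
}

// Hypothetical thresholds; the real rule may use different numbers.
const HIGH = 0.8;
const LOW = 0.2;

function pickTier(f: Features): Tier {
  const confident = f.confidence >= HIGH;

  // Without confidence nothing gets promoted or hidden: QUEUE is the default.
  if (!confident) return "QUEUE";

  // PUSH is urgent and confident.
  if (f.urgency >= HIGH) return "PUSH";

  // AUTO is reversible and not urgent (urgent mail already returned above).
  if (f.reversibility >= HIGH) return "AUTO";

  // SILENT is narrow: here, clearly non-urgent mail from an untrusted sender.
  const clearlyNoise = [f.urgency, f.senderTrust].every((score) => LOW >= score);
  if (clearlyNoise) return "SILENT";

  return "QUEUE";
}
```

&lt;p&gt;Because the function is pure, it can be unit-tested with hand-written feature vectors and no model call at all, which is what makes the policy auditable on its own.&lt;/p&gt;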

&lt;h2&gt;
  
  
  Before you @ me
&lt;/h2&gt;

&lt;p&gt;This is 50 emails. It's small on purpose — it's one person's mental model written down, not a benchmark, and I'm not going to dress it up as one. The set is synthetic, so it tests whether the model applies &lt;em&gt;my policy&lt;/em&gt; consistently, not whether it can read the real world. A different inbox with a different owner would draw the lines somewhere else and might rank the models differently.&lt;/p&gt;

&lt;p&gt;I'm also only claiming what I measured. Flash hit 100% recall on PUSH — it never sent an urgent email to a quiet tier. I'm not going to invent per-tier numbers for the models that lost; I have their headline accuracy and that's what I'm putting my name on.&lt;/p&gt;

&lt;p&gt;What I'm &lt;em&gt;not&lt;/em&gt; walking back: on the one task I actually care about, on the set that's sitting in the public repo for you to open, the expensive models lost. That result was stable enough to bet production on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lesson is annoyingly cheap
&lt;/h2&gt;

&lt;p&gt;Most of us never run this comparison. "Use the best model" is the default, the best model is the expensive one, and the leaderboard agrees, so why would you waste an afternoon proving the obvious?&lt;/p&gt;

&lt;p&gt;Because the leaderboard has never seen your task. MMLU doesn't know what &lt;em&gt;you&lt;/em&gt; mean by "urgent." The only eval that ranks models on your problem is the one you write — and when you write it, the ranking stops matching the price tag surprisingly often. The frontier model isn't smarter at your job. It's just smarter at the jobs in the press release.&lt;/p&gt;

&lt;p&gt;Write the small eval. Run the cheap model against it before you reach for the expensive one. Worst case you confirm the obvious. Best case you cut your bill and your accuracy goes &lt;em&gt;up&lt;/em&gt;, which is a sentence I didn't expect to type either.&lt;/p&gt;

&lt;p&gt;The judge, the eval set, and the deterministic rule are all in the open — AGPLv3, OpenAI-compatible, point it at Ollama or vLLM and keep your mail on your own box: &lt;strong&gt;&lt;a href="https://github.com/k08200/klorn" rel="noopener noreferrer"&gt;github.com/k08200/klorn&lt;/a&gt;&lt;/strong&gt;. The eval set is &lt;code&gt;packages/api/eval/judge-eval-set.json&lt;/code&gt;. Open it, label it your way, and go find out which model actually wins on &lt;em&gt;your&lt;/em&gt; inbox.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Treat upstream catalogs as mutable: how a free-tier model SKU retirement broke my AI agent</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Thu, 11 Jun 2026 15:24:22 +0000</pubDate>
      <link>https://dev.to/k08200/treat-upstream-catalogs-as-mutable-how-a-free-tier-model-sku-retirement-broke-my-ai-agent-159l</link>
      <guid>https://dev.to/k08200/treat-upstream-catalogs-as-mutable-how-a-free-tier-model-sku-retirement-broke-my-ai-agent-159l</guid>
      <description>&lt;p&gt;Tuesday afternoon, every autonomous cycle in my agent started returning the same error:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[AGENT] Cycle failed: 404 No endpoints found for model: google/gemma-2-9b-it:free&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The model hadn't changed in my config. The provider hadn't gone down. The endpoint just... wasn't there anymore. OpenRouter had retired the &lt;code&gt;:free&lt;/code&gt; SKU mid-week — no notification, no deprecation window, just gone. Every background classification, every briefing generation, every proactive scan started failing in the same way.&lt;/p&gt;

&lt;p&gt;I had a fallback. That was the embarrassing part.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fallback that didn't fall back
&lt;/h2&gt;

&lt;p&gt;My &lt;code&gt;createCompletion()&lt;/code&gt; wrapper had been catching the documented provider failure modes for months:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;402 insufficient_credits&lt;/code&gt; → walk to next provider&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;403 daily_quota_exceeded&lt;/code&gt; → walk to next provider&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;429 rate_limited&lt;/code&gt; → backoff + retry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it didn't catch: "the model you asked for doesn't exist anymore." A &lt;code&gt;404 No endpoints found&lt;/code&gt; propagated as a generic error and killed the cycle. The fallback chain never even got consulted because nothing in the existing branches matched.&lt;/p&gt;

&lt;p&gt;The mental model was wrong. I'd been treating the model catalog as &lt;strong&gt;fixed configuration&lt;/strong&gt; — something you set once and forget. In reality it's &lt;strong&gt;upstream state&lt;/strong&gt; that can mutate at any moment, just like any other dependency. The retirement was a feature of the provider's catalog management, not a bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: walk the free-model chain on retirement signals
&lt;/h2&gt;

&lt;p&gt;The actual patch was short. Two PRs:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Before: only walked on credit/quota/rate failures
if (isCreditError(err) || isKeyLimitError(err)) {
  return walkFallbackChain(...);
}

// After: also walk when the model itself is gone
if (isModelUnavailableError(err)) {
  markModelUnavailable(model);
  return walkFallbackChain(...);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;isModelUnavailableError&lt;/code&gt; matches on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP 404 with &lt;code&gt;No endpoints found&lt;/code&gt; in body&lt;/li&gt;
&lt;li&gt;HTTP 400 with &lt;code&gt;model_not_found&lt;/code&gt; code&lt;/li&gt;
&lt;li&gt;Anything else the provider emits when the SKU is gone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;markModelUnavailable&lt;/code&gt; puts the model on a 24h cooldown so the next cycle doesn't try it again immediately. When the catalog refreshes (providers add new SKUs all the time too), the cooldown expires and we retry.&lt;/p&gt;

&lt;p&gt;The fallback chain itself is per-provider:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const OPENROUTER_FALLBACK_CHAIN = [
  'meta-llama/llama-3.3-70b-instruct:free',
  'google/gemma-2-9b-it:free',
  'mistralai/mistral-7b-instruct:free',
  'qwen/qwen-2.5-7b-instruct:free',
];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When one entry 404s, we walk to the next. When all of them fail, we fail over to the secondary provider (Gemini direct), which has its own chain. Only when every chain across every provider has been exhausted does the agent give up and surface &lt;code&gt;AllProvidersExhaustedError&lt;/code&gt; to the user.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I should have done from day 1
&lt;/h2&gt;

&lt;p&gt;Three rules I'm internalizing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The upstream catalog is mutable.&lt;/strong&gt; Hardcoding a single model ID is the same antipattern as hardcoding a single CDN URL. Always have a list. Always make the list cheap to rotate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distinguish "this model is unavailable" from "the provider is unavailable."&lt;/strong&gt; They're different failures with different recovery paths. Treating them the same way means you either over-rotate (give up the provider when only one model is gone) or under-rotate (give up entirely when the provider is fine).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cooldowns, not blacklists.&lt;/strong&gt; When a model disappears, don't kill it forever. Put it on a window. Providers add models back, or you might be hitting a transient 404. A 24h cooldown is much friendlier than a permanent deny-list that requires a code change to undo.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why this matters beyond one provider
&lt;/h2&gt;

&lt;p&gt;If you're running an agent in production, your model isn't your only upstream dependency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vendor's catalog can change&lt;/li&gt;
&lt;li&gt;Pricing can change (&lt;code&gt;:free&lt;/code&gt; → &lt;code&gt;:paid&lt;/code&gt; is a real failure mode)&lt;/li&gt;
&lt;li&gt;Rate-limit policies can change&lt;/li&gt;
&lt;li&gt;Authentication schemes can change (Google's AQ.-prefix keys rejected by their own OpenAI-compat endpoint is a fun one; I had to write a native adapter for it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is the same: treat every assumption about the upstream as a potential dynamic value, and make the recovery path the default, not the exception.&lt;/p&gt;

&lt;p&gt;Agents that survive in prod have failover chains, cooldown windows, and degraded modes built in from the start. Not because the upstream is unreliable, but because the upstream is alive, and alive things change.&lt;/p&gt;

&lt;p&gt;I've been writing about Klorn, an open-source attention firewall for Gmail, where this kind of failure mode hits constantly because the agent runs continuously. Repo: &lt;a href="https://github.com/k08200/klorn" rel="noopener noreferrer"&gt;github.com/k08200/klorn&lt;/a&gt; · Doctrine: &lt;code&gt;deterministic-floor.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you've shipped agents to prod, what other upstream-mutation failure modes have caught you off-guard?&lt;/p&gt;
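
&lt;p&gt;For anyone who wants a starting point, here is a minimal sketch of the retirement check and the 24h cooldown from the fix above. The names &lt;code&gt;isModelUnavailableError&lt;/code&gt; and &lt;code&gt;markModelUnavailable&lt;/code&gt; come from the patch; the &lt;code&gt;ProviderError&lt;/code&gt; shape, the in-memory cooldown table, and &lt;code&gt;nextModel&lt;/code&gt; are assumptions for illustration, not Klorn's actual code.&lt;/p&gt;

```typescript
// Hypothetical normalized error shape; real provider SDKs differ.
interface ProviderError {
  status: number;
  body: string;
  code?: string;
}

// True when the provider signals that the requested model no longer exists.
function isModelUnavailableError(err: ProviderError): boolean {
  if (err.status === 404) return err.body.includes("No endpoints found");
  if (err.status === 400) return err.code === "model_not_found";
  return false;
}

const COOLDOWN_MS = 24 * 60 * 60 * 1000;

// Model ID mapped to the timestamp (ms) when it may be tried again.
const cooldownUntil: { [model: string]: number } = {};

function markModelUnavailable(model: string, now: number): void {
  cooldownUntil[model] = now + COOLDOWN_MS;
}

function isAvailable(model: string, now: number): boolean {
  const until = cooldownUntil[model];
  return until === undefined || now >= until;
}

// First model in the chain that is not cooling down, or undefined if none.
function nextModel(chain: string[], now: number): string | undefined {
  return chain.find((model) => isAvailable(model, now));
}
```

&lt;p&gt;Passing &lt;code&gt;now&lt;/code&gt; in explicitly keeps the cooldown logic testable without waiting a day. When &lt;code&gt;nextModel&lt;/code&gt; returns &lt;code&gt;undefined&lt;/code&gt;, that is the point to fail over to the next provider's chain.&lt;/p&gt;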

</description>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>MCP CI gates need retry receipts for flaky downstreams</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Mon, 08 Jun 2026 04:43:52 +0000</pubDate>
      <link>https://dev.to/k08200/mcp-ci-gates-need-retry-receipts-for-flaky-downstreams-2akb</link>
      <guid>https://dev.to/k08200/mcp-ci-gates-need-retry-receipts-for-flaky-downstreams-2akb</guid>
      <description>&lt;p&gt;MCP CI gates need to distinguish two very different failures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the server is actually broken&lt;/li&gt;
&lt;li&gt;the downstream dependency is temporarily flaky&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If both become hard failures, CI gets noisy.&lt;br&gt;
If both are ignored, the gate stops meaning anything.&lt;/p&gt;

&lt;p&gt;So I shipped &lt;code&gt;@k08200/mcp-probe@1.12.0&lt;/code&gt; with explicit sidecar retry policy for tool-call dry-runs.&lt;/p&gt;
&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;A readiness gate that calls real MCP tools can hit transient downstream failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;503 Service Unavailable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;502 Bad Gateway&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;504 Gateway Timeout&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;rate limits&lt;/li&gt;
&lt;li&gt;short network timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But auth and permission failures are different. A &lt;code&gt;401&lt;/code&gt; or &lt;code&gt;403&lt;/code&gt; usually means the agent will fail in production too.&lt;/p&gt;

&lt;p&gt;Those should stay visible unless the contract explicitly says otherwise.&lt;/p&gt;
&lt;h2&gt;
  
  
  Retry is opt-in per tool
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;mcp-probe&lt;/code&gt; now lets a sidecar contract define retry behavior per tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"logs_query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service:web status:error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"timeframe"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1h"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"retry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"attempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"delayMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"retryOn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rate limit"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part: retry is not global magic.&lt;/p&gt;

&lt;p&gt;It only happens when the sidecar explicitly opts in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Receipts still show the flake
&lt;/h2&gt;

&lt;p&gt;If a call fails once and passes on retry, the final result can pass, but the receipt still records every attempt.&lt;/p&gt;

&lt;p&gt;That means CI can tolerate a transient downstream blip without pretending the run was clean.&lt;/p&gt;

&lt;p&gt;Example shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"flaky_read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sidecar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"attempt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fail"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"503 Service Unavailable: transient downstream"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"attempt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the distinction I want MCP CI gates to preserve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hard failures should block&lt;/li&gt;
&lt;li&gt;transient failures can be retried&lt;/li&gt;
&lt;li&gt;pass-after-retry should still leave a receipt&lt;/li&gt;
&lt;/ul&gt;
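
&lt;p&gt;A minimal sketch of that behavior, to show how little it takes to keep the flake visible. This is not the &lt;code&gt;mcp-probe&lt;/code&gt; implementation: &lt;code&gt;runWithReceipt&lt;/code&gt; and &lt;code&gt;isRetryable&lt;/code&gt; are hypothetical names, the call is synchronous, the delay between attempts is left out, and matching numeric codes against the start of the error message is an assumption.&lt;/p&gt;

```typescript
interface Attempt {
  attempt: number;
  status: "pass" | "fail";
  error?: string;
}

interface Receipt {
  tool: string;
  status: "pass" | "fail";
  attempts: Attempt[];
}

interface RetryPolicy {
  attempts: number;
  retryOn: (number | string)[];
}

// A failure is retryable only if the policy lists its status code or phrase.
function isRetryable(message: string, policy: RetryPolicy): boolean {
  const lower = message.toLowerCase();
  return policy.retryOn.some((pattern) =>
    typeof pattern === "number"
      ? message.startsWith(String(pattern))
      : lower.includes(pattern.toLowerCase())
  );
}

// Runs `call` up to policy.attempts times and records every attempt.
function runWithReceipt(tool: string, call: () => void, policy: RetryPolicy): Receipt {
  const attempts: Attempt[] = [];
  for (let n = 1; policy.attempts >= n; n++) {
    try {
      call();
      attempts.push({ attempt: n, status: "pass" });
      return { tool, status: "pass", attempts };
    } catch (e) {
      const error = e instanceof Error ? e.message : String(e);
      attempts.push({ attempt: n, status: "fail", error });
      // Hard failures (for example 401 or 403) stop here and stay visible.
      if (!isRetryable(error, policy)) break;
    }
  }
  return { tool, status: "fail", attempts };
}
```

&lt;p&gt;A call that throws &lt;code&gt;503&lt;/code&gt; once and then succeeds yields a passing receipt with two recorded attempts, while a &lt;code&gt;401&lt;/code&gt; fails after a single attempt because it is not in &lt;code&gt;retryOn&lt;/code&gt;.&lt;/p&gt;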

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-D&lt;/span&gt; @k08200/mcp-probe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or run directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @k08200/mcp-probe@latest &lt;span class="nt"&gt;--config&lt;/span&gt; mcp-probe.config.json &lt;span class="nt"&gt;--github-summary&lt;/span&gt; &lt;span class="nt"&gt;--receipt-file&lt;/span&gt; mcp-probe.receipt.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub release: &lt;a href="https://github.com/k08200/mcp-probe/releases/tag/v1.12.0" rel="noopener noreferrer"&gt;https://github.com/k08200/mcp-probe/releases/tag/v1.12.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;npm: &lt;a href="https://www.npmjs.com/package/@k08200/mcp-probe" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@k08200/mcp-probe&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>devops</category>
      <category>testing</category>
      <category>ai</category>
    </item>
    <item>
      <title>Every "autonomous AI agent" is a customer-support ticket waiting to happen</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Sun, 07 Jun 2026 16:23:09 +0000</pubDate>
      <link>https://dev.to/k08200/klorn-the-approval-layer-for-ai-agents-builder-log-1o8m</link>
      <guid>https://dev.to/k08200/klorn-the-approval-layer-for-ai-agents-builder-log-1o8m</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/NbmQJG-kd7c"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;I'm tired of writing apology emails for my own AI.&lt;/p&gt;

&lt;p&gt;Last month an agent I was dogfooding cancelled a calendar event I actually cared about. Two weeks before that, a different one auto-replied to an investor with what read like a hostage note from a Slack bot. Both companies have raised more money than I'll see in five years.&lt;/p&gt;

&lt;p&gt;The pattern across every "agentic AI" demo on my timeline is the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent does a thing&lt;/li&gt;
&lt;li&gt;Agent emails the user that it did the thing&lt;/li&gt;
&lt;li&gt;The thing was wrong&lt;/li&gt;
&lt;li&gt;The company ships a fix the following Tuesday&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I stopped trusting them. Then I built one that &lt;strong&gt;can't do this&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wedge: agents that wait
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://klorn.ai" rel="noopener noreferrer"&gt;Klorn&lt;/a&gt; is an approval layer between AI agents and your Gmail / Calendar. The agent does the thinking — reads the email, checks your calendar, drafts the reply, creates the event proposal. Then it stops. Nothing fires until you click approve.&lt;/p&gt;

&lt;p&gt;Sounds boring. The constraint is what makes it real.&lt;/p&gt;

&lt;h2&gt;
  
  
  The constraint that kills "act first, apologize later"
&lt;/h2&gt;

&lt;p&gt;Every meaningful action in Klorn is signed with a payload hash &lt;em&gt;before&lt;/em&gt; it fires. &lt;code&gt;send_email&lt;/code&gt; literally cannot execute without an &lt;code&gt;ActionReceipt&lt;/code&gt; that matches the hash of what was shown to you.&lt;/p&gt;

&lt;p&gt;There's an invariant test in the repo that fails the build if anyone — me, a future contributor, an AI agent (the irony) — tries to bypass it. Remove the approval check, the test fails, the build fails, the deploy fails.&lt;/p&gt;

&lt;p&gt;You &lt;strong&gt;cannot ship&lt;/strong&gt; a Klorn version that sends emails silently. It's architecturally impossible.&lt;/p&gt;
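
&lt;p&gt;For readers who want to picture the check, here is a minimal sketch of a hash-gated send. It is an illustration under assumptions, not Klorn's code: the &lt;code&gt;EmailPayload&lt;/code&gt; fields, &lt;code&gt;hashPayload&lt;/code&gt;, and &lt;code&gt;approve&lt;/code&gt; are hypothetical, and SHA-256 over a fixed field order stands in for whatever hashing the repo actually uses.&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; the field names are not taken from the Klorn repo.
interface EmailPayload {
  to: string;
  subject: string;
  body: string;
}

interface ActionReceipt {
  payloadHash: string;
}

function hashPayload(payload: EmailPayload): string {
  // Fixed field order so the same content always produces the same hash.
  const canonical = JSON.stringify([payload.to, payload.subject, payload.body]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Called when the user clicks approve on the payload they were shown.
function approve(shown: EmailPayload): ActionReceipt {
  return { payloadHash: hashPayload(shown) };
}

// Refuses to run unless a receipt exists and matches what is about to be sent.
function sendEmail(payload: EmailPayload, receipt?: ActionReceipt): string {
  if (receipt === undefined) {
    throw new Error("send_email blocked: no ActionReceipt");
  }
  if (receipt.payloadHash !== hashPayload(payload)) {
    throw new Error("send_email blocked: payload differs from what was approved");
  }
  return "sent"; // stand-in for the real Gmail call
}
```

&lt;p&gt;An invariant test in this style would call &lt;code&gt;sendEmail&lt;/code&gt; with no receipt, and again with a payload edited after approval, and assert that both throw.&lt;/p&gt;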

&lt;p&gt;This is the part nobody is building. Every "autonomous agent" demo on my timeline is one feature flag away from the next apology email.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I shipped this week
&lt;/h2&gt;

&lt;p&gt;The agent loop now runs end-to-end:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Meeting request hits inbox → tier-classified (PUSH / QUEUE / SILENT / AUTO)&lt;/li&gt;
&lt;li&gt;Klorn reads the email, checks the calendar for conflicts&lt;/li&gt;
&lt;li&gt;Drafts the reply &lt;em&gt;and&lt;/em&gt; the calendar event proposal&lt;/li&gt;
&lt;li&gt;Both wait as PendingActions in your decision queue&lt;/li&gt;
&lt;li&gt;One click → fires&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus a production bug that would have killed a less paranoid agent: OpenRouter retired a &lt;code&gt;:free&lt;/code&gt; model SKU mid-week. Every autonomous cycle died with &lt;code&gt;404 No endpoints found&lt;/code&gt;. The existing failover only covered 402 / 403 / 429 — not "the model is gone." Shipped a multi-model fallback chain on the same provider so losing one upstream SKU never kills the agent.&lt;/p&gt;

&lt;p&gt;That fix is the kind of thing you only ship when you trust the boundary the agent runs inside.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop hype-cycling, start gating
&lt;/h2&gt;

&lt;p&gt;If you're shipping an "autonomous AI agent" in 2026, three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Can a user prove what was approved is what was sent?&lt;/li&gt;
&lt;li&gt;Can a future contributor bypass your approval check?&lt;/li&gt;
&lt;li&gt;What is your invariant test?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answers are "no", "yes", and "we don't have one" — you're building the next apology email. Stop.&lt;/p&gt;

&lt;p&gt;I'd rather build the firewall.&lt;/p&gt;




&lt;p&gt;60-second walkthrough above (&lt;a href="https://youtu.be/NbmQJG-kd7c" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; · &lt;a href="https://youtu.be/RdxF3zcFhGo" rel="noopener noreferrer"&gt;Shorts cut&lt;/a&gt;).&lt;br&gt;
Try it free: &lt;a href="https://klorn.ai" rel="noopener noreferrer"&gt;klorn.ai&lt;/a&gt;. PRO auto-applied during private beta.&lt;/p&gt;

&lt;p&gt;If you've actually been thinking about where agents should and shouldn't act on their own, I'd love your honest take — even one-line replies. Disagreement especially welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>buildinpublic</category>
      <category>agents</category>
    </item>
    <item>
      <title>tools/list is not a readiness check for MCP servers</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Mon, 01 Jun 2026 06:48:53 +0000</pubDate>
      <link>https://dev.to/k08200/toolslist-is-not-a-readiness-check-for-mcp-servers-13j5</link>
      <guid>https://dev.to/k08200/toolslist-is-not-a-readiness-check-for-mcp-servers-13j5</guid>
      <description>&lt;p&gt;The first version of &lt;code&gt;mcp-probe&lt;/code&gt; checked the obvious things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;can the MCP server initialize?&lt;/li&gt;
&lt;li&gt;does &lt;code&gt;tools/list&lt;/code&gt; work?&lt;/li&gt;
&lt;li&gt;are tool schemas present?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was useful, but not enough.&lt;/p&gt;

&lt;p&gt;The more I tested real MCP workflows, the clearer the problem became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;tools/list&lt;/code&gt; is self-report. CI needs a receipt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An MCP server can advertise a clean tool catalog and still fail every real call because OAuth handoff, scopes, downstream credentials, row limits, tenant boundaries, or response shapes are broken.&lt;/p&gt;

&lt;p&gt;So the latest release of &lt;strong&gt;mcp-probe&lt;/strong&gt; focuses less on "does the process start?" and more on "is CI enforcing the contract an agent actually depends on?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The new bootstrap flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @k08200/mcp-probe@latest init &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; @your-org/your-mcp-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--discover&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--lock-tools&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--github-actions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;mcp-probe.config.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.mcp-probe.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.github/workflows/mcp-probe.yml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important part is what happens during &lt;code&gt;--discover&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mcp-probe&lt;/code&gt; connects to the server, reads the live &lt;code&gt;tools/list&lt;/code&gt; catalog, and generates a starting contract from the observed tool schemas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema-aware sidecar samples
&lt;/h2&gt;

&lt;p&gt;Older generated samples were too naive. If a schema said:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"enum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Chicago"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"New York"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minimum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the old fallback might produce empty strings or zero values. Those values often failed input validation, so the probe never exercised the real call path.&lt;/p&gt;

&lt;p&gt;v1.11.0 now uses schema hints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;default&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;enum&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;numeric &lt;code&gt;minimum&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;string &lt;code&gt;minLength&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;nested objects&lt;/li&gt;
&lt;li&gt;array &lt;code&gt;minItems&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the generated sample becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chicago"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is still only a starting point. You should review generated samples before running them with production credentials, especially for mutating, admin, export, or environment-inspection tools.&lt;/p&gt;
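&lt;p&gt;To make the hint order concrete, here is a minimal sketch of how a sample could be derived from those hints. It is not mcp-probe's actual implementation: the &lt;code&gt;Schema&lt;/code&gt; type and &lt;code&gt;sampleFromSchema&lt;/code&gt; are hypothetical names, and the precedence (default, then enum, then type-specific minimums) is an assumption.&lt;/p&gt;

```typescript
// Hypothetical, simplified JSON Schema shape; not mcp-probe's internal types.
type Schema = {
  type?: string;
  default?: unknown;
  enum?: unknown[];
  minimum?: number;
  minLength?: number;
  minItems?: number;
  items?: Schema;
  required?: string[];
  properties?: { [name: string]: Schema };
};

// Build one plausible sample value from the schema hints.
// Assumed precedence: default, then first enum value, then type-specific minimums.
function sampleFromSchema(schema: Schema): unknown {
  if (schema.default !== undefined) return schema.default;
  if (schema.enum !== undefined) {
    if (schema.enum.length > 0) return schema.enum[0];
  }
  switch (schema.type) {
    case "integer":
    case "number":
      return schema.minimum ?? 0;
    case "string":
      // Pad to minLength so the value passes length validation.
      return "x".repeat(schema.minLength ?? 0);
    case "boolean":
      return false;
    case "array": {
      const count = schema.minItems ?? 0;
      const item = schema.items ?? {};
      return Array.from({ length: count }, () => sampleFromSchema(item));
    }
    case "object": {
      // Only required properties are filled in; nested objects recurse.
      const out: { [name: string]: unknown } = {};
      const props = schema.properties ?? {};
      for (const name of schema.required ?? []) {
        out[name] = sampleFromSchema(props[name] ?? {});
      }
      return out;
    }
    default:
      return null;
  }
}
```

&lt;p&gt;Fed the schema from the start of this section, the sketch picks the first &lt;code&gt;enum&lt;/code&gt; value for &lt;code&gt;location&lt;/code&gt; and the &lt;code&gt;minimum&lt;/code&gt; for &lt;code&gt;count&lt;/code&gt;.&lt;/p&gt;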

&lt;h2&gt;
  
  
  Catalog locking
&lt;/h2&gt;

&lt;p&gt;The other new piece is &lt;code&gt;--lock-tools&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;--discover&lt;/code&gt;, mcp-probe now writes the observed tool names into &lt;code&gt;expectedTools&lt;/code&gt;, so CI fails if a required tool disappears.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;--lock-tools&lt;/code&gt;, it also writes &lt;code&gt;allowedTools&lt;/code&gt;, so CI fails if unexpected tools appear.&lt;/p&gt;

&lt;p&gt;That matters for low-trust agent surfaces. If a server suddenly exposes &lt;code&gt;delete_user&lt;/code&gt;, &lt;code&gt;export_all&lt;/code&gt;, or &lt;code&gt;rotate_api_key&lt;/code&gt;, I do not want that to silently become available to an agent just because &lt;code&gt;tools/list&lt;/code&gt; still returns valid JSON.&lt;/p&gt;

&lt;p&gt;Example config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timeoutMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@your-org/your-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"probeTools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"toolsFile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".mcp-probe.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expectedTools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"read_record"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"allowedTools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"read_record"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
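&lt;p&gt;The two lists answer different questions, which a small sketch may make clearer. This is an illustration of the idea rather than mcp-probe's code: &lt;code&gt;checkCatalog&lt;/code&gt; and its return shape are hypothetical, and only the &lt;code&gt;expectedTools&lt;/code&gt; and &lt;code&gt;allowedTools&lt;/code&gt; field names come from the config above.&lt;/p&gt;

```typescript
// Field names mirror the config above; everything else is hypothetical.
type CatalogPolicy = {
  expectedTools?: string[];
  allowedTools?: string[];
};

type CatalogDrift = { missing: string[]; unexpected: string[] };

// Compare the tool names observed via tools/list against the locked catalog.
function checkCatalog(observed: string[], policy: CatalogPolicy): CatalogDrift {
  const seen = new Set(observed);
  // expectedTools: a required tool that disappeared.
  const missing = (policy.expectedTools ?? []).filter((name) => !seen.has(name));
  // allowedTools: a tool that appeared but was never approved.
  // Without allowedTools the catalog is not locked, so nothing counts as unexpected.
  let unexpected: string[] = [];
  if (policy.allowedTools !== undefined) {
    const allowed = new Set(policy.allowedTools);
    unexpected = observed.filter((name) => !allowed.has(name));
  }
  return { missing, unexpected };
}
```

&lt;p&gt;With the example config, a server that starts advertising &lt;code&gt;delete_user&lt;/code&gt; would show up under &lt;code&gt;unexpected&lt;/code&gt; even though both expected tools are still present.&lt;/p&gt;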



&lt;h2&gt;
  
  
  Receipts
&lt;/h2&gt;

&lt;p&gt;For CI, the workflow can also persist a redacted receipt artifact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @k08200/mcp-probe@latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; mcp-probe.config.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--github-summary&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--fail-on-warn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--receipt-file&lt;/span&gt; mcp-probe.receipt.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That receipt is the thing I want CI to trust: not the server claiming it has tools, and not an agent claiming what happened later, but an independent probe that actually ran against the boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @k08200/mcp-probe@latest @modelcontextprotocol/server-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/k08200/mcp-probe" rel="noopener noreferrer"&gt;k08200/mcp-probe&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Release: &lt;a href="https://github.com/k08200/mcp-probe/releases/tag/v1.11.0" rel="noopener noreferrer"&gt;v1.11.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am especially looking for real Datadog, Supabase, and Gmail MCP recipes. The public fixtures are useful, but the real value is catching auth handoff, permission, tenant-scope, and response-contract failures in CI.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>typescript</category>
      <category>cli</category>
      <category>ai</category>
    </item>
    <item>
      <title>Stop Building AI Assistants. Build AI Firewalls.</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Thu, 28 May 2026 15:40:23 +0000</pubDate>
      <link>https://dev.to/k08200/stop-building-ai-assistants-build-ai-firewalls-1mh0</link>
      <guid>https://dev.to/k08200/stop-building-ai-assistants-build-ai-firewalls-1mh0</guid>
      <description>&lt;p&gt;Every week another "AI agent for X" launches. Email triage. Calendar coordination. Sales follow-up. PR reviewer. Slack monitor. Meeting summarizer.&lt;/p&gt;

&lt;p&gt;I've installed enough of them to see the pattern. Here's the dirty secret nobody mentions in the launch posts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;These tools don't reduce your work. They multiply your notifications.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each AI tool is configured to be helpful by default. "Helpful" means: "I noticed this thing — here's a notification." Stack a dozen of those, and instead of one inbox to ignore you have twelve. The signal-to-noise ratio gets &lt;em&gt;worse&lt;/em&gt; every time you add an AI to your workflow.&lt;/p&gt;

&lt;p&gt;The mainstream answer is &lt;em&gt;"just configure each one."&lt;/em&gt; Sure. Spend four hours tuning notification settings every time you add a tool, and another four hours when one of them ships a "smarter notifications" update. That's not productivity. That's notification janitorial work disguised as setup.&lt;/p&gt;

&lt;p&gt;This is a structural problem. Not a configuration problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  60-second walkthrough
&lt;/h2&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2060688051920314608-647" src="https://platform.twitter.com/embed/Tweet.html?id=2060688051920314608"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrong question
&lt;/h2&gt;

&lt;p&gt;Every AI tool asks the same thing: &lt;strong&gt;"Is this important?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Wrong question. There is no objective "important." Importance depends on you, right now. A Stripe webhook is important when you're debugging a checkout flow. The same webhook is pure noise during a deep work block. A Slack message from your cofounder is critical at 11am Tuesday and irrelevant at 11pm Friday.&lt;/p&gt;

&lt;p&gt;The right question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is this urgent enough to interrupt me, right now, given what I'm doing?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's not a question any individual AI agent can answer. It's a layer &lt;strong&gt;above&lt;/strong&gt; all your AI agents. None of them have the context. None of them know what the others are doing. None of them know how you're spending the next hour.&lt;/p&gt;

&lt;p&gt;So they all default to "I'll just send you a notification, you decide." Which is exactly the experience you have right now: drowning.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an AI firewall actually looks like
&lt;/h2&gt;

&lt;p&gt;I'm building that layer. It's called &lt;a href="https://klorn.ai" rel="noopener noreferrer"&gt;Klorn&lt;/a&gt;. Here's how it works in practice — and what's already shipping vs what's scope-deferred.&lt;/p&gt;

&lt;p&gt;Every incoming email goes through a &lt;strong&gt;4-tier classification&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;PoC state&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PUSH&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wakes you up. Phone notification.&lt;/td&gt;
&lt;td&gt;Classified + alert ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QUEUE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Review on your own schedule.&lt;/td&gt;
&lt;td&gt;Classified + queued ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SILENT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recorded. Never interrupts.&lt;/td&gt;
&lt;td&gt;Classified + logged ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AUTO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reversible, hands-off. Low-risk actions execute; external-facing actions stay approval-gated.&lt;/td&gt;
&lt;td&gt;Partial execution: LOW-risk internal (classify, mark read, briefing) auto-executes. MEDIUM (send email, create event) and HIGH (delete) always go through an approve button.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
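&lt;p&gt;The AUTO row is the only one with a branch in it, so here is a rough sketch of that gate. It is not Klorn's code: the action identifiers and the &lt;code&gt;gateAction&lt;/code&gt; function are made up for illustration, and treating unknown actions as HIGH is my assumption. Only the LOW / MEDIUM / HIGH split comes from the table.&lt;/p&gt;

```typescript
type Risk = "LOW" | "MEDIUM" | "HIGH";
type Decision = "auto-execute" | "needs-approval";

// Risk levels follow the table above; the action identifiers are hypothetical.
const ACTION_RISK: { [action: string]: Risk } = {
  classify: "LOW",
  mark_read: "LOW",
  briefing: "LOW",
  send_email: "MEDIUM",
  create_event: "MEDIUM",
  delete: "HIGH",
};

// Only LOW-risk actions run unattended; everything else waits for an approve button.
function gateAction(action: string): Decision {
  // Assumption of this sketch: an unknown action is treated as HIGH risk.
  const risk: Risk = ACTION_RISK[action] ?? "HIGH";
  return risk === "LOW" ? "auto-execute" : "needs-approval";
}
```

&lt;p&gt;The point of writing it as a lookup is that the gate stays deterministic: no model output can promote &lt;code&gt;send_email&lt;/code&gt; to auto-execute.&lt;/p&gt;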

&lt;p&gt;That's the entire surface. No "Call" tier. No fancy automations. Narrow on purpose.&lt;/p&gt;

&lt;p&gt;The tier is decided by a &lt;strong&gt;4-feature scorer&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confidence&lt;/strong&gt; — how clearly the signal type maps to a tier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sender trust&lt;/strong&gt; — your historical reply rate and meeting acceptance for this contact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reversibility&lt;/strong&gt; — can the wrong tier be undone without consequence?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Urgency&lt;/strong&gt; — actual urgency signals, not "URGENT!!!" in the subject line&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;80% agreement with my hand-labels on 50 real emails.&lt;/strong&gt; That's the Day 7 PoC gate, met.&lt;/p&gt;

&lt;h2&gt;
  
  
  Override is GROUP BY, not LLM
&lt;/h2&gt;

&lt;p&gt;When the firewall gets a tier wrong, one click moves the email to the right tier. Your correction doesn't just fix this one email — it becomes ground truth for the next prompt.&lt;/p&gt;

&lt;p&gt;The override loop is the wedge. The classifier is replaceable; the alignment signal isn't. Every disagreement is signal, not noise.&lt;/p&gt;

&lt;p&gt;Boring + measurable beats fuzzy + ambitious.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why building this is unpopular in 2026
&lt;/h2&gt;

&lt;p&gt;Building AI firewalls is unsexy. Investors want &lt;strong&gt;"AI agents that DO things."&lt;/strong&gt; Saying "I built a system that does fewer things, more quietly" sounds backwards on a pitch deck.&lt;/p&gt;

&lt;p&gt;But every founder I've shown this to has the same reaction: relief. Because they're drowning. Because every productivity tool they bought made their attention worse, not better. The AI agent boom didn't reduce their work. It raised the floor of background notifications.&lt;/p&gt;

&lt;p&gt;The default for AI tools should be: &lt;strong&gt;shut up unless it actually matters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most don't. So I'm building the layer that enforces it from outside, since none of the individual tools will do it on their own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I am
&lt;/h2&gt;

&lt;p&gt;PoC sprint, Week 5, solo. 14-day window ending June 9, 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 7 Technical Gate&lt;/strong&gt; — ≥80% classifier agreement on 50 hand-labeled emails. &lt;strong&gt;Met.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Day 14 UX Gate&lt;/strong&gt; — ≥3/5 ICP demos register "oh, this is different." Pending.&lt;/p&gt;

&lt;p&gt;I dogfood it every day. My own inbox runs through the firewall.&lt;/p&gt;

&lt;p&gt;Stack: Next.js 15, TypeScript, Prisma, Postgres (Supabase), Claude / OpenAI for the tier reasoning, Gmail for ingest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual unpopular opinion
&lt;/h2&gt;

&lt;p&gt;If your AI tool sends push notifications by default, it's broken. Doesn't matter how good its reasoning is. You can't reason your way out of a notification flood.&lt;/p&gt;

&lt;p&gt;The next valuable layer of agentic products won't be more agents. It'll be the firewall that decides which agents are allowed to interrupt you, when.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it&lt;/strong&gt;: &lt;a href="https://klorn.ai" rel="noopener noreferrer"&gt;klorn.ai&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt;: &lt;a href="https://github.com/k08200/klorn" rel="noopener noreferrer"&gt;github.com/k08200/klorn&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're building agentic products and you disagree, I want to hear it. If you've solved it differently, I want to hear that more.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>startup</category>
      <category>indiehackers</category>
    </item>
    <item>
      <title>MCP CI gates need receipts: tools/list is not enough</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Thu, 28 May 2026 11:44:32 +0000</pubDate>
      <link>https://dev.to/k08200/mcp-ci-gates-need-receipts-toolslist-is-not-enough-29o4</link>
      <guid>https://dev.to/k08200/mcp-ci-gates-need-receipts-toolslist-is-not-enough-29o4</guid>
      <description>&lt;p&gt;MCP servers are starting to look like normal infrastructure.&lt;/p&gt;

&lt;p&gt;That means they need boring infrastructure checks.&lt;/p&gt;

&lt;p&gt;The mistake I kept seeing is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The server starts, and &lt;code&gt;tools/list&lt;/code&gt; returns a clean schema. Therefore it works."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not enough.&lt;/p&gt;

&lt;p&gt;An MCP server can pass &lt;code&gt;initialize&lt;/code&gt;, advertise every expected tool, and still fail every real call because auth, scopes, tenant boundaries, environment variables, downstream permissions, or read-only roles are broken.&lt;/p&gt;

&lt;p&gt;So I pushed &lt;code&gt;mcp-probe@1.8.0&lt;/code&gt; further toward being a real CI readiness gate for MCP servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @k08200/mcp-probe@latest &lt;span class="nt"&gt;--config&lt;/span&gt; mcp-probe.config.json &lt;span class="nt"&gt;--github-summary&lt;/span&gt; &lt;span class="nt"&gt;--fail-on-warn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Warnings can now fail CI
&lt;/h3&gt;

&lt;p&gt;By default, warnings still exit &lt;code&gt;0&lt;/code&gt;. That keeps existing users from getting surprise CI failures.&lt;/p&gt;

&lt;p&gt;But production gates often need stricter behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mcp-probe &lt;span class="nt"&gt;--config&lt;/span&gt; mcp-probe.config.json &lt;span class="nt"&gt;--fail-on-warn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;--fail-on-warn&lt;/code&gt;, auth handoff issues, permission warnings, or incomplete readiness receipts can block the workflow.&lt;/p&gt;

&lt;p&gt;That matters because many MCP failures are not hard crashes. They are degraded states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth flow requires a browser redirect the agent cannot complete&lt;/li&gt;
&lt;li&gt;a server starts but every tool call returns &lt;code&gt;401&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;a database tool works with admin credentials but fails with the intended read-only role&lt;/li&gt;
&lt;li&gt;the workflow mentions a probe but does not actually run the production boundary check&lt;/li&gt;
&lt;/ul&gt;
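&lt;p&gt;The exit-code rule is small enough to sketch. This is an illustration under assumptions, not the CLI's source: the &lt;code&gt;Status&lt;/code&gt; values, the &lt;code&gt;exitCode&lt;/code&gt; function, and the use of &lt;code&gt;1&lt;/code&gt; as the failing code are hypothetical.&lt;/p&gt;

```typescript
// Hypothetical per-check outcome; the real tool may track more states.
type Status = "pass" | "warn" | "fail";

// Map a set of check outcomes to a process exit code.
function exitCode(statuses: Status[], failOnWarn: boolean): number {
  // A hard failure always fails the run.
  if (statuses.includes("fail")) return 1;
  // Warnings only fail the run when strict mode is requested.
  if (failOnWarn) {
    if (statuses.includes("warn")) return 1;
  }
  return 0;
}
```

&lt;p&gt;A run with one warning and no failures exits &lt;code&gt;0&lt;/code&gt; by default and non-zero once &lt;code&gt;failOnWarn&lt;/code&gt; is set, which is the behavior described above.&lt;/p&gt;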

&lt;h3&gt;
  
  
  2. Doctor now checks the actual workflow receipt
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;mcp-probe doctor&lt;/code&gt; already checked whether a GitHub Actions workflow existed.&lt;/p&gt;

&lt;p&gt;But that is not enough either.&lt;/p&gt;

&lt;p&gt;The new behavior is stricter: all of the required flags must appear on a single &lt;code&gt;mcp-probe&lt;/code&gt; run step.&lt;/p&gt;

&lt;p&gt;This should pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --fail-on-warn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should not count as a complete gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx @k08200/mcp-probe --config mcp-probe.config.json&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx @k08200/mcp-probe ./server.js --github-summary --fail-on-warn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flags are present somewhere in the workflow, but no single run step proves the intended config is actually being checked with CI summaries and strict warning handling.&lt;/p&gt;

&lt;p&gt;That is the difference between "we have a gate" and "the gate is enforcing the thing we trust."&lt;/p&gt;
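&lt;p&gt;The same-step rule can be sketched in a few lines. This is a simplified illustration, not how &lt;code&gt;doctor&lt;/code&gt; is actually implemented: it assumes the workflow's &lt;code&gt;run&lt;/code&gt; strings have already been extracted from the YAML, and &lt;code&gt;isCompleteGateStep&lt;/code&gt; and &lt;code&gt;hasCompleteGate&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```typescript
// Flags taken from the passing example above.
const REQUIRED_FLAGS = ["--config", "--github-summary", "--fail-on-warn"];

// A step counts only if it runs mcp-probe and carries every required flag itself.
function isCompleteGateStep(run: string): boolean {
  if (!run.includes("mcp-probe")) return false;
  // Simplification: flags are matched as whitespace-separated tokens.
  const tokens = run.split(/\s+/);
  return REQUIRED_FLAGS.every((flag) => tokens.includes(flag));
}

// The workflow has a gate if at least one single step is complete.
function hasCompleteGate(runSteps: string[]): boolean {
  return runSteps.some(isCompleteGateStep);
}
```

&lt;p&gt;Checking per step rather than per file is what rejects the second workflow: its flags are spread across two steps, so neither step passes on its own.&lt;/p&gt;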

&lt;h3&gt;
  
  
  3. Tool call coverage is now tied to expected tools
&lt;/h3&gt;

&lt;p&gt;For config-based checks, you can declare the expected tool catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.example.com/mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"transport"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bearer ${DATADOG_MCP_TOKEN}"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expectedTools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"logs_query"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"forbiddenTools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"delete_dashboard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rotate_api_key"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"toolsFile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./datadog.tools.json"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;expectedTools&lt;/code&gt; and &lt;code&gt;toolsFile&lt;/code&gt; are both set, every expected tool needs a sidecar sample input.&lt;/p&gt;

&lt;p&gt;That means CI checks not just "is the tool advertised?" but "did we actually provide a meaningful dry-run sample for the tool an agent depends on?"&lt;/p&gt;
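&lt;p&gt;As a rough sketch of that coverage rule (hypothetical code, not the tool's internals): given the &lt;code&gt;expectedTools&lt;/code&gt; list and a parsed sidecar file shaped like &lt;code&gt;tools.*.input&lt;/code&gt;, report every expected tool that has no sample input. The &lt;code&gt;uncoveredTools&lt;/code&gt; name is made up.&lt;/p&gt;

```typescript
// Sidecar layout assumed from the tools.*.input convention.
type Sidecar = { tools: { [name: string]: { input?: unknown } } };

// Return the expected tools that have no sidecar sample input.
function uncoveredTools(expectedTools: string[], sidecar: Sidecar): string[] {
  return expectedTools.filter((name) => {
    const entry = sidecar.tools[name];
    return entry === undefined || entry.input === undefined;
  });
}
```

&lt;p&gt;An empty result means every tool the agent depends on has a dry-run sample. Anything else is a coverage gap worth failing on.&lt;/p&gt;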

&lt;h3&gt;
  
  
  4. Sidecar inputs are the real contract
&lt;/h3&gt;

&lt;p&gt;Auto-generated inputs are useful for smoke tests, but they mostly exercise schema validation rather than the real call path.&lt;/p&gt;

&lt;p&gt;Real readiness checks need meaningful inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"logs_query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service:web status:error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"timeframe"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1h"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"not_error_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"requiredFields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"freshness"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maxRows"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For database-backed MCP servers, these assertions are the interesting part:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does the read-only role work?&lt;/li&gt;
&lt;li&gt;are row limits enforced?&lt;/li&gt;
&lt;li&gt;are broad exports/admin actions absent or gated?&lt;/li&gt;
&lt;li&gt;are denied writes structured enough for agents to recover?&lt;/li&gt;
&lt;li&gt;do results include provenance fields like source and freshness?&lt;/li&gt;
&lt;li&gt;does the response avoid leaking secrets, stack traces, or raw internals?&lt;/li&gt;
&lt;/ul&gt;
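&lt;p&gt;To show how a few of those assertions could be evaluated, here is a sketch under stated assumptions. The &lt;code&gt;not_error_code&lt;/code&gt;, &lt;code&gt;requiredFields&lt;/code&gt;, and &lt;code&gt;maxRows&lt;/code&gt; keys come from the sidecar example above. The &lt;code&gt;ToolResult&lt;/code&gt; shape, the &lt;code&gt;checkExpectations&lt;/code&gt; function, and the choice to check &lt;code&gt;requiredFields&lt;/code&gt; on every row are hypothetical, not mcp-probe's documented behavior.&lt;/p&gt;

```typescript
// Keys mirror the "expect" block in the sidecar example.
type Expect = {
  not_error_code?: number[];
  requiredFields?: string[];
  maxRows?: number;
};

// Hypothetical normalized tool result; a real MCP response would need mapping into this shape.
type ToolResult = {
  errorCode?: number;
  rows: { [field: string]: unknown }[];
};

// Return a list of human-readable problems; an empty list means the contract held.
function checkExpectations(result: ToolResult, expect: Expect): string[] {
  const problems: string[] = [];
  const code = result.errorCode;
  if (code !== undefined) {
    if ((expect.not_error_code ?? []).includes(code)) {
      problems.push(`forbidden error code ${code}`);
    }
  }
  if (expect.maxRows !== undefined) {
    if (result.rows.length > expect.maxRows) {
      problems.push(`returned ${result.rows.length} rows, limit is ${expect.maxRows}`);
    }
  }
  // Assumption: provenance fields must be present on every returned row.
  for (const field of expect.requiredFields ?? []) {
    if (result.rows.some((row) => !(field in row))) {
      problems.push(`missing required field "${field}"`);
    }
  }
  return problems;
}
```

&lt;p&gt;Returning a list of problems instead of a boolean keeps the failure readable in a CI summary, which matters more than the check itself when someone has to debug a 403 at 2am.&lt;/p&gt;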

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-D&lt;/span&gt; @k08200/mcp-probe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or run directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @k08200/mcp-probe@latest doctor
npx @k08200/mcp-probe@latest &lt;span class="nt"&gt;--config&lt;/span&gt; mcp-probe.config.json &lt;span class="nt"&gt;--github-summary&lt;/span&gt; &lt;span class="nt"&gt;--fail-on-warn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/k08200/mcp-probe" rel="noopener noreferrer"&gt;https://github.com/k08200/mcp-probe&lt;/a&gt;&lt;br&gt;
npm: &lt;a href="https://www.npmjs.com/package/@k08200/mcp-probe" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@k08200/mcp-probe&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is simple: CI for MCP should test the contract an agent will actually depend on, not just whether the process starts.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>mcp-probe v1.6.0: Stricter GitHub Actions checks for MCP CI gates</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Tue, 26 May 2026 04:35:59 +0000</pubDate>
      <link>https://dev.to/k08200/mcp-probe-v160-stricter-github-actions-checks-for-mcp-ci-gates-52k9</link>
      <guid>https://dev.to/k08200/mcp-probe-v160-stricter-github-actions-checks-for-mcp-ci-gates-52k9</guid>
      <description>&lt;p&gt;I shipped &lt;strong&gt;mcp-probe v1.6.0&lt;/strong&gt; with a small but useful improvement to &lt;code&gt;mcp-probe doctor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Previous behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;check whether &lt;code&gt;.github/workflows&lt;/code&gt; exists&lt;/li&gt;
&lt;li&gt;check whether any workflow mentions &lt;code&gt;mcp-probe&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was useful, but too shallow. A workflow can mention &lt;code&gt;mcp-probe&lt;/code&gt; and still not run the actual CI gate correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;mcp-probe doctor&lt;/code&gt; now warns when the matching GitHub Actions workflow is missing any of these pieces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;actions/checkout@v6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--config &amp;lt;config-file&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--github-summary&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @k08200/mcp-probe@latest doctor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your workflow calls &lt;code&gt;mcp-probe&lt;/code&gt; directly but does not use the configured fleet gate, doctor now tells you what is missing before you trust the CI result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;The larger goal of mcp-probe is to make MCP servers testable like normal infrastructure. That means checking more than process startup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP initialize handshake&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tools/list&lt;/code&gt; discovery&lt;/li&gt;
&lt;li&gt;real &lt;code&gt;tools/call&lt;/code&gt; dry-runs&lt;/li&gt;
&lt;li&gt;sidecar sample inputs&lt;/li&gt;
&lt;li&gt;contract assertions for row limits, stable error codes, and leak checks&lt;/li&gt;
&lt;li&gt;and now, whether the CI workflow itself is wired correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A readiness gate is only useful if the gate is actually installed correctly.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/k08200/mcp-probe" rel="noopener noreferrer"&gt;https://github.com/k08200/mcp-probe&lt;/a&gt;&lt;br&gt;
npm: &lt;a href="https://www.npmjs.com/package/@k08200/mcp-probe" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@k08200/mcp-probe&lt;/a&gt;&lt;br&gt;
Release: &lt;a href="https://github.com/k08200/mcp-probe/releases/tag/v1.6.0" rel="noopener noreferrer"&gt;https://github.com/k08200/mcp-probe/releases/tag/v1.6.0&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>devops</category>
      <category>githubactions</category>
      <category>ai</category>
    </item>
    <item>
      <title>mcp-probe v1.5.0: Doctor checks for MCP CI readiness</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Mon, 25 May 2026 15:40:20 +0000</pubDate>
      <link>https://dev.to/k08200/mcp-probe-v150-doctor-checks-for-mcp-ci-readiness-49nc</link>
      <guid>https://dev.to/k08200/mcp-probe-v150-doctor-checks-for-mcp-ci-readiness-49nc</guid>
      <description>&lt;p&gt;MCP servers are starting to look like infrastructure. That means the tooling around them needs boring preflight checks, not just optimistic smoke tests.&lt;/p&gt;

&lt;p&gt;I just shipped &lt;strong&gt;mcp-probe v1.5.0&lt;/strong&gt; with a new command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @k08200/mcp-probe@latest doctor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mcp-probe doctor&lt;/code&gt; checks whether the current repository is ready to run MCP readiness checks in CI before you even probe an external server.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it checks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Node.js runtime satisfies mcp-probe requirements&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mcp-probe.config.json&lt;/code&gt; exists and parses&lt;/li&gt;
&lt;li&gt;configured sidecar files exist and have valid &lt;code&gt;tools.*.input&lt;/code&gt; objects&lt;/li&gt;
&lt;li&gt;GitHub Actions workflows are present and mention &lt;code&gt;mcp-probe&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mcp-probe doctor &lt;span class="nt"&gt;--config-file&lt;/span&gt; examples/self-check.config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp-probe doctor
────────────────────────────────────────────────────
  ✓  Node.js version
     Node 24.13.0 satisfies &amp;gt;=20.19.0
  ✓  Config file
     examples/self-check.config.json contains 1 server
  ✓  Sidecar examples/self-check.tools.json
     Found 4 tool entries
  ✓  GitHub Actions workflow
     Found 1 workflow file mentioning mcp-probe
────────────────────────────────────────────────────
  PASS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
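
&lt;p&gt;As a rough illustration of the sidecar line in that output, a shape check could look like the sketch below. The expected shape is inferred only from the &lt;code&gt;tools.*.input&lt;/code&gt; wording above; &lt;code&gt;checkSidecar&lt;/code&gt;, &lt;code&gt;isPlainObject&lt;/code&gt;, and the sample tool names are hypothetical and not mcp-probe's actual validator.&lt;/p&gt;

```typescript
// Hypothetical sketch of a sidecar shape check. The shape is inferred from
// the "tools.*.input" wording; this is not mcp-probe's actual validator.
function isPlainObject(value: unknown): value is { [key: string]: unknown } {
  if (typeof value !== "object") return false;
  if (value === null) return false;
  return !Array.isArray(value);
}

// Returns one message per problem; an empty array means the shape looks valid.
function checkSidecar(sidecar: unknown): string[] {
  if (!isPlainObject(sidecar)) return ["sidecar must be a JSON object"];
  const tools = sidecar["tools"];
  if (!isPlainObject(tools)) return ["sidecar.tools must be an object"];
  const problems: string[] = [];
  for (const name of Object.keys(tools)) {
    const entry = tools[name];
    if (!isPlainObject(entry)) {
      problems.push(`tools.${name} must be an object`);
    } else if (!isPlainObject(entry["input"])) {
      problems.push(`tools.${name}.input must be an object`);
    }
  }
  return problems;
}

// Sample data invented for the demo: "search" is valid, "export" is not.
const sidecar = JSON.parse(
  '{"tools":{"search":{"input":{"query":"demo"}},"export":{"input":[]}}}',
);

console.log(checkSidecar(sidecar)); // [ 'tools.export.input must be an object' ]
```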



&lt;p&gt;For automation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mcp-probe doctor &lt;span class="nt"&gt;--output&lt;/span&gt; json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;The earlier releases focused on the MCP server itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;initialize&lt;/code&gt; handshake&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tools/list&lt;/code&gt; discovery&lt;/li&gt;
&lt;li&gt;real &lt;code&gt;tools/call&lt;/code&gt; dry-runs&lt;/li&gt;
&lt;li&gt;sidecar sample inputs&lt;/li&gt;
&lt;li&gt;contract assertions for row limits, metadata, stable error codes, and leak checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But teams still need to know whether their own probe setup is sane. A broken config file, missing sidecar, or workflow that never invokes the probe should fail early and loudly.&lt;/p&gt;

&lt;p&gt;This release is a small step, but an important one: before testing the MCP contract an agent depends on, test that your CI gate is actually wired correctly.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/k08200/mcp-probe" rel="noopener noreferrer"&gt;https://github.com/k08200/mcp-probe&lt;/a&gt;&lt;br&gt;
npm: &lt;a href="https://www.npmjs.com/package/@k08200/mcp-probe" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@k08200/mcp-probe&lt;/a&gt;&lt;br&gt;
Release: &lt;a href="https://github.com/k08200/mcp-probe/releases/tag/v1.5.0" rel="noopener noreferrer"&gt;https://github.com/k08200/mcp-probe/releases/tag/v1.5.0&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>devops</category>
      <category>node</category>
    </item>
    <item>
      <title>Stop building AI inboxes. Build decision layers instead.</title>
      <dc:creator>yongrean</dc:creator>
      <pubDate>Mon, 25 May 2026 13:40:43 +0000</pubDate>
      <link>https://dev.to/k08200/stop-building-ai-inboxes-build-decision-layers-instead-3id7</link>
      <guid>https://dev.to/k08200/stop-building-ai-inboxes-build-decision-layers-instead-3id7</guid>
      <description>&lt;p&gt;I spent six months building an AI-powered email tool. Then I deleted half of it.&lt;/p&gt;

&lt;p&gt;Not because the model was bad. Not because the embeddings were off. Because I finally noticed what every "AI inbox" on the market — including the one I was building — was actually doing.&lt;/p&gt;

&lt;p&gt;They were surfacing more.&lt;/p&gt;

&lt;p&gt;More "smart suggestions". More "priority signals". More "AI-drafted replies waiting for your review". More badges, more banners, more nudges. Every product in the category was racing to add a new surface and call it intelligence.&lt;/p&gt;

&lt;p&gt;My six-month-old prototype did all of that. I used it every day. And every morning the inbox was just as loud as the day I started. The model was right about which emails mattered. I still read all the other ones anyway, because they were &lt;em&gt;right there&lt;/em&gt;, with a little colored dot suggesting maybe-they-mattered-too.&lt;/p&gt;

&lt;p&gt;The model was solving the wrong problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The category bug
&lt;/h2&gt;

&lt;p&gt;Look at the leading email tools through this lens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Superhuman&lt;/strong&gt; made reading faster. You still read everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortwave&lt;/strong&gt; classified smarter. You still read everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Motion / Reclaim&lt;/strong&gt; got more proactive. They added a calendar layer on top of the noise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of them subtract. They all add. "AI assistant" became a license to put one more thing in front of you.&lt;/p&gt;

&lt;p&gt;The deeper bug: these tools treat email as the &lt;em&gt;primary&lt;/em&gt; surface and try to make it better. But email is not what you want. What you want is &lt;em&gt;decisions you have to make&lt;/em&gt;. Email is one cheap, unreliable transport that occasionally contains those decisions, buried under hundreds that don't.&lt;/p&gt;

&lt;p&gt;Making the transport prettier doesn't fix the signal-to-noise problem. It hides it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The right abstraction: decision layer
&lt;/h2&gt;

&lt;p&gt;A decision layer doesn't replace your inbox. It sits &lt;em&gt;above&lt;/em&gt; mail, calendar, Slack, and any other transport, and it surfaces exactly one thing: items where the system genuinely needs your judgment.&lt;/p&gt;

&lt;p&gt;Three properties make a layer a decision layer rather than just "a better inbox":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It subtracts more than it adds.&lt;/strong&gt; A signal that you've ignored four times in a row should never reach you again. Not muted. Gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It treats relationships as data.&lt;/strong&gt; Two people asking for the same thing are not the same ask. One of them has hit every deadline you've ever had with them; the other ships +3 days late, every time. That should weight the queue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It refuses to act without your approval.&lt;/strong&gt; The model can draft, propose, plan. It cannot send, modify, or commit. Approval-before-action has to be a schema-level constraint, not a UI nicety.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are AI features. They are &lt;em&gt;boundary&lt;/em&gt; features. The AI is helpful for the classification underneath, but the value lives in what the system refuses to surface.&lt;/p&gt;

&lt;p&gt;Here is what each of them actually looks like in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1 — Closed-loop suppression learning
&lt;/h2&gt;

&lt;p&gt;The single most useful thing the system does is forget.&lt;/p&gt;

&lt;p&gt;Every time the user dismisses an attention item, we record a &lt;code&gt;FeedbackEvent&lt;/code&gt; with the signal &lt;code&gt;DISMISSED&lt;/code&gt; or &lt;code&gt;IGNORED&lt;/code&gt;. That table is the cheap part. The interesting part is a job that reads it weekly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;runFeedbackAdaptation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;since&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;LOOK_BACK_DAYS&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;feedbackEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ATTENTION_ITEM&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;DISMISSED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IGNORED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;since&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sourceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Join to the attention items themselves so we can bucket by (source, type,&lt;/span&gt;
  &lt;span class="c1"&gt;// priority) instead of just (source, type) — the bucket prevents an&lt;/span&gt;
  &lt;span class="c1"&gt;// over-broad rule from silencing legitimate high-priority signals.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attentionItem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sourceId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
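  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Index the items by id so each event below can be matched to its item.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;itemMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;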

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CountKey&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;itemMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sourceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;priorityBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;suppressionKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nx"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Threshold: same tuple dismissed ≥4 times in 30 days → suppress forever.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;suppressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;DISMISS_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;dismissCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;remember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CONTEXT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;attention_suppression_v2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;suppressed&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;suppressed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The suppression set is then read at the upsert path for every new attention item:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isSuppressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;priority&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;priorityBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;suppressionKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;suppressionKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the tuple is in the suppression set, the new attention item is forced into &lt;code&gt;SILENT&lt;/code&gt; tier — it gets recorded for the audit log, but the user is never paged about it.&lt;/p&gt;
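
&lt;p&gt;The snippets above call &lt;code&gt;priorityBucket&lt;/code&gt; and &lt;code&gt;suppressionKey&lt;/code&gt; without showing them. A minimal sketch of what they might look like follows. The numeric cut-offs (70 and 40), the 0 to 100 priority scale, the &lt;code&gt;source:type:bucket&lt;/code&gt; key format, the &lt;code&gt;resolveTier&lt;/code&gt; helper, and the &lt;code&gt;NOTIFY&lt;/code&gt; tier name are all assumptions for illustration, not the production values. &lt;code&gt;isSuppressed&lt;/code&gt; is repeated only so the sketch runs on its own.&lt;/p&gt;

```typescript
// Hypothetical helpers: cut-offs, the 0-100 scale, and the key format are
// assumptions for illustration, not the production values.
type Bucket = "HIGH" | "MEDIUM" | "LOW";

function priorityBucket(priority: number): Bucket {
  if (priority >= 70) return "HIGH";
  if (priority >= 40) return "MEDIUM";
  return "LOW";
}

// With a bucket this builds a (source, type, bucket) key; without one it
// builds the older bucket-less (source, type) key.
function suppressionKey(source: string, type: string, bucket?: Bucket): string {
  return bucket ? `${source}:${type}:${bucket}` : `${source}:${type}`;
}

// Same lookup order as the post's isSuppressed: exact bucket first, then the
// bucket-less key, which therefore matches every bucket.
function isSuppressed(
  set: { has(key: string): boolean },
  source: string,
  type: string,
  priority?: number,
): boolean {
  if (typeof priority === "number") {
    if (set.has(suppressionKey(source, type, priorityBucket(priority)))) return true;
  }
  return set.has(suppressionKey(source, type));
}

// SILENT is the only tier name taken from the post; "NOTIFY" is a placeholder.
function resolveTier(proposedTier: string, suppressed: boolean): string {
  return suppressed ? "SILENT" : proposedTier;
}

// Sample suppression set invented for the demo.
const learned = new Set(["COMMITMENT:DUE_TODAY:LOW", "CALENDAR:REMINDER"]);

console.log(isSuppressed(learned, "COMMITMENT", "DUE_TODAY", 10)); // true
console.log(isSuppressed(learned, "COMMITMENT", "DUE_TODAY", 90)); // false
console.log(isSuppressed(learned, "CALENDAR", "REMINDER", 90)); // true
console.log(resolveTier("NOTIFY", isSuppressed(learned, "COMMITMENT", "DUE_TODAY", 10))); // SILENT
```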

&lt;p&gt;A few design choices worth pointing out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Priority buckets matter.&lt;/strong&gt; The first version keyed only on &lt;code&gt;(source, type)&lt;/code&gt;. Dismissing four "due-today commitment" notifications would silence &lt;em&gt;every&lt;/em&gt; commitment-due signal, including overdue ones. The current version buckets priority into HIGH / MEDIUM / LOW, so the user can train "I don't care about LOW-priority due commitments" without losing the HIGH ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backwards-compatible key.&lt;/strong&gt; Memory rows from the previous version are still read; a v1 row without a bucket matches every bucket, so a rollback doesn't lose learned behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-minute in-process cache.&lt;/strong&gt; The upsert path is hot, so checking the suppression set against the DB on every new item would be wasteful. A 10-minute TTL is short enough that a weekly adaptation run takes effect within minutes, and long enough that almost every check is served from memory.&lt;/li&gt;
&lt;/ul&gt;
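
&lt;p&gt;The cache from the last bullet can be sketched in a few lines. This is an illustration under assumptions, not the production code: &lt;code&gt;createSuppressionCache&lt;/code&gt; is a hypothetical name, the &lt;code&gt;load&lt;/code&gt; callback stands in for the real database read, and it is synchronous here for brevity where the real read would be async.&lt;/p&gt;

```typescript
// Hypothetical sketch of a 10-minute in-process cache. The load callback
// stands in for the real database read; its name and shape are assumptions.
const TTL_MS = 10 * 60 * 1000; // 10 minutes, as described above

interface KeySet { has(key: string): boolean }
interface Entry { keys: KeySet; expiresAt: number }

function createSuppressionCache(
  load: (userId: string) => string[],
  now: () => number = Date.now,
) {
  // Prototype-less object so a userId can never collide with Object built-ins.
  const entries: { [userId: string]: Entry } = Object.create(null);

  return function getSuppressionSet(userId: string): KeySet {
    const cached = entries[userId];
    if (cached !== undefined) {
      if (cached.expiresAt > now()) return cached.keys; // still fresh: no DB read
    }
    const keys = new Set(load(userId));
    entries[userId] = { keys, expiresAt: now() + TTL_MS };
    return keys;
  };
}

// Demo with a fake clock and a loader that counts how often it is called.
let clock = 0;
let loads = 0;
const getSet = createSuppressionCache(
  () => { loads += 1; return ["COMMITMENT:DUE_TODAY:LOW"]; },
  () => clock,
);

getSet("user-1"); // first call loads
getSet("user-1"); // served from memory
clock += TTL_MS;  // the entry has now expired
getSet("user-1"); // reloaded
console.log(loads); // 2
```

&lt;p&gt;Injecting the clock keeps the expiry logic testable without waiting ten minutes.&lt;/p&gt;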

&lt;p&gt;Notice what's missing: an LLM. The classifier underneath uses one, but the suppression loop itself is plain counting. The model is not the right tool for "remember what the user doesn't care about". A &lt;code&gt;GROUP BY&lt;/code&gt; is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 2 — Contact Trust Score
&lt;/h2&gt;

&lt;p&gt;The second feature changed how I think about every productivity tool I've ever used.&lt;/p&gt;

&lt;p&gt;When someone makes a commitment to you — "I'll send the deck by Thursday", "let's reconnect next week" — that's a tracked row in a commitment ledger. When the commitment is fulfilled, we record whether it was on-time or late, and update a running tally per contact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;updateTrustScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;contactEmail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;displayName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;wasOnTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;daysLate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contactTrustScore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;userId_contactEmail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;contactEmail&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;contactEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;displayName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;totalCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;onTimeCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;wasOnTime&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;lateCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;wasOnTime&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;totalDelayDays&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;daysLate&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;lastUpdatedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;update&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;totalCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;...(&lt;/span&gt;&lt;span class="nx"&gt;wasOnTime&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;onTimeCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;lateCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;...(&lt;/span&gt;&lt;span class="nx"&gt;daysLate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;totalDelayDays&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;daysLate&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}),&lt;/span&gt;
      &lt;span class="na"&gt;lastUpdatedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That tally rolls up to a badge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;reliable&lt;/strong&gt; — ≥80% on-time, ≥3 data points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mostly reliable&lt;/strong&gt; — ≥50% on-time, ≥3 data points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;unreliable&lt;/strong&gt; — &amp;lt;50% on-time, ≥3 data points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;unknown&lt;/strong&gt; — fewer than 3 data points, or stale (no signal in 60+ days)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stale check is doing real work. A year-old "reliable" badge on someone who has since gone dark shouldn't be load-bearing. Until we get full exponential decay, we demote anyone with no signal in 60 days (two half-lives of the planned decay) back to unknown.&lt;/p&gt;
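
&lt;p&gt;As a rough illustration, the badge rules above could be written as one pure function. This is a minimal sketch, not the production &lt;code&gt;computeResult&lt;/code&gt;: the &lt;code&gt;TrustRow&lt;/code&gt; shape, the &lt;code&gt;computeBadge&lt;/code&gt; name, the &lt;code&gt;STALE_AFTER_DAYS&lt;/code&gt; constant, and the "unreliable" and "unknown" badge strings are assumptions made for the example.&lt;/p&gt;

```typescript
// Hypothetical sketch: names and shapes here are assumptions, not the real computeResult.
type Badge = "reliable" | "mostly_reliable" | "unreliable" | "unknown";

interface TrustRow {
  totalCount: number;
  onTimeCount: number;
  lastUpdatedAt: Date;
}

const MIN_DATA_POINTS = 3; // "at least 3 data points" from the badge list
const STALE_AFTER_DAYS = 60; // "no signal in 60+ days" from the badge list
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// `now` is a parameter so the stale check can be exercised with fixed dates.
function computeBadge(row: TrustRow, now: Date = new Date()): Badge {
  const ageDays = (now.getTime() - row.lastUpdatedAt.getTime()) / MS_PER_DAY;

  // Too little data, or data that is too old: refuse to claim anything.
  if (MIN_DATA_POINTS > row.totalCount || ageDays >= STALE_AFTER_DAYS) {
    return "unknown";
  }

  const onTimeRate = row.onTimeCount / row.totalCount;
  if (onTimeRate >= 0.8) return "reliable";
  if (onTimeRate >= 0.5) return "mostly_reliable";
  return "unreliable";
}
```

&lt;p&gt;Checking the sample-size and staleness conditions first means the on-time rate is only computed for rows that have earned a badge, and it also avoids dividing by zero on an empty tally.&lt;/p&gt;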

&lt;p&gt;The badge is surfaced as a small chip on the inbox card, but the more useful place for it is inside the agent prompt itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildTrustHintForPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contactTrustScore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;totalCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MIN_DATA_POINTS&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;lastUpdatedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;desc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;take&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;computeResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;row&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;displayName&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contactEmail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;badge&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;reliable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`- &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: reliable (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onTimeRate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;% on-time)`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;badge&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mostly_reliable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;avgDelayDays&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="s2"&gt;`, avg +&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;avgDelayDays&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;d late`&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`- &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: mostly reliable (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onTimeRate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;% on-time&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`- &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: unreliable (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onTimeRate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;% on-time, avg +&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;avgDelayDays&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;d late) — factor in extra buffer`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`\n## Contact Reliability\nBased on tracked commitments:\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when the model decides how urgently to surface "Mina is asking for an update" vs "Sarah is asking for an update", it has each contact's on-time rate and average delay to go on. That is a usable proxy for who delivers after one polite nudge and who needs the deadline restated three times. The prompt isn't fed any feelings about either person. It is fed numbers.&lt;/p&gt;

&lt;p&gt;The productivity-tool industry has spent ten years building calendars that don't know which meeting attendees actually show up on time. That's strange.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 3 — Approval-before-action as a schema constraint
&lt;/h2&gt;

&lt;p&gt;The third pattern is the boring one, and it's the one most AI assistants get wrong.&lt;/p&gt;

&lt;p&gt;The model is allowed to draft a reply. It is allowed to propose a calendar move. It is allowed to plan a sequence of actions. It is &lt;em&gt;not&lt;/em&gt; allowed to send, move, or commit any of it. Not because we don't trust the model (we sometimes do), but because &lt;em&gt;the user&lt;/em&gt; needs to know the surface area of what the system is doing on their behalf, and "silently sent" is the kind of bug that user trust never recovers from.&lt;/p&gt;

&lt;p&gt;This is enforced at the schema level. Every action the agent proposes lives in a &lt;code&gt;PendingAction&lt;/code&gt; row with a status enum. The state machine for that enum is the contract: exactly one transition (&lt;code&gt;approve()&lt;/code&gt;) allows the side effect to run. The agent can &lt;code&gt;propose()&lt;/code&gt; all day; nothing ships without a deliberate user transition.&lt;/p&gt;

&lt;p&gt;The lowest-risk class of actions — internal-only things like blocking calendar time for focus, snoozing an item, setting a reminder — can be marked &lt;code&gt;auto&lt;/code&gt; and skip approval. Everything that touches an outside party (sending mail, modifying someone else's calendar) is always gated. The boundary is conservative on purpose. The day a single user discovers their AI assistant silently sent an apology to their VC is the day every AI assistant in the category becomes harder to sell.&lt;/p&gt;
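
&lt;p&gt;One way to picture that contract is a few small functions over the row. This is a sketch under assumptions, and the real schema may differ: the status values, the action kinds, the &lt;code&gt;AUTO_KINDS&lt;/code&gt; safelist, &lt;code&gt;canExecute()&lt;/code&gt;, and &lt;code&gt;execute()&lt;/code&gt; are hypothetical names chosen for the example. Only &lt;code&gt;PendingAction&lt;/code&gt;, &lt;code&gt;propose()&lt;/code&gt;, and &lt;code&gt;approve()&lt;/code&gt; come from the text.&lt;/p&gt;

```typescript
// Hypothetical sketch: statuses, kinds, and the safelist are illustrative assumptions.
type Status = "proposed" | "approved" | "rejected" | "executed";
type ActionKind =
  | "send_email"
  | "move_external_event"
  | "block_focus_time"
  | "snooze"
  | "set_reminder";

interface PendingAction {
  id: string;
  kind: ActionKind;
  status: Status;
}

// Internal-only kinds that may skip approval. Anything touching an outside party stays off this list.
const AUTO_KINDS: ActionKind[] = ["block_focus_time", "snooze", "set_reminder"];

// The agent's only entry point: it creates a row and triggers nothing.
function propose(id: string, kind: ActionKind): PendingAction {
  return { id, kind, status: "proposed" };
}

// The user's entry point: the single transition that unlocks a gated side effect.
function approve(action: PendingAction): PendingAction {
  if (action.status !== "proposed") {
    throw new Error(`cannot approve an action in status ${action.status}`);
  }
  return { ...action, status: "approved" };
}

// Approved rows may run; safelisted kinds may run straight from "proposed".
function canExecute(action: PendingAction): boolean {
  if (action.status === "approved") return true;
  return action.status === "proposed" ? AUTO_KINDS.indexOf(action.kind) >= 0 : false;
}

// The only place a side effect is invoked, so the gate cannot be bypassed by accident.
function execute(action: PendingAction, sideEffect: () => void): PendingAction {
  if (!canExecute(action)) {
    throw new Error(`action ${action.id} (${action.kind}) needs user approval first`);
  }
  sideEffect();
  return { ...action, status: "executed" };
}
```

&lt;p&gt;Routing every side effect through one function is the design choice that matters here: adding a new action kind defaults to gated unless someone deliberately adds it to the safelist.&lt;/p&gt;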

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;The sum of these three patterns is not a smarter inbox. It is a small, quiet queue that contains roughly six to twelve items on any given day. Each item is either an explicit ask, a tracked commitment coming due, or a proposed action waiting for confirmation. The model spent the morning reading and reasoning about a few hundred other things, all of which the system decided you don't need to know about.&lt;/p&gt;

&lt;p&gt;When you dismiss an item, the system learns. When a contact reliably delivers, their asks rise. When the model wants to act outside a narrow safelist, it asks first. The result, after a few weeks of training the noise floor, is a queue that feels like it was assembled by someone who actually knows what you ignore.&lt;/p&gt;

&lt;p&gt;None of this requires a frontier model. The classifier underneath is a small, cheap LLM with strict cost guards. Almost all of the value is in the boundaries — what the system refuses to surface, what it refuses to do without you, and what it remembers about people you work with.&lt;/p&gt;

&lt;p&gt;If you're building anything in this category and you find yourself adding a &lt;em&gt;new surface that shows the user more things&lt;/em&gt;, stop and ask whether you'd rather build the thing that subtracts. The market is crowded with smarter inboxes. There is no good decision layer yet.&lt;/p&gt;

&lt;p&gt;I'm shipping one at &lt;a href="https://klorn.ai" rel="noopener noreferrer"&gt;klorn.ai&lt;/a&gt;. Not asking for signups — sharing the pattern because I think more people should be building toward it. The closed-loop suppression and trust-score code above are excerpts from the real thing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built in TypeScript on Fastify, Prisma, and Postgres. Code patterns shown are production excerpts.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>ai</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
