<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeremy Longshore</title>
    <description>The latest articles on DEV Community by Jeremy Longshore (@jeremy_longshore).</description>
    <link>https://dev.to/jeremy_longshore</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842419%2Ff5d02b54-daf0-4520-9aef-118fbd0c24ac.jpeg</url>
      <title>DEV Community: Jeremy Longshore</title>
      <link>https://dev.to/jeremy_longshore</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeremy_longshore"/>
    <language>en</language>
    <item>
      <title>Run the Readiness Audit Before You Flip DNS</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Sat, 04 Jul 2026 13:00:29 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/run-the-readiness-audit-before-you-flip-dns-2g07</link>
      <guid>https://dev.to/jeremy_longshore/run-the-readiness-audit-before-you-flip-dns-2g07</guid>
      <description>&lt;h2&gt;
  
  
  The cutover that would have taken the money and then broken
&lt;/h2&gt;

&lt;p&gt;The DiagnosticPro migration moved a live product off Firebase, Firestore, GCP, and Vertex AI onto a single self-hosted VPS. New database engine, new secrets model, new LLM client, new proxy, new deployment shape — the whole substrate replaced at once. The plan ended with the usual last step: flip &lt;code&gt;diagnosticpro.io&lt;/code&gt; DNS from the old host to the new one and watch the traffic move over.&lt;/p&gt;

&lt;p&gt;That last step is the one that cannot be undone cheaply. The moment DNS propagates, real customers hit the new stack with real credit cards. Everything before that moment is reversible: the old host is still authoritative, rollback costs a config revert. Everything after it is a live incident.&lt;/p&gt;

&lt;p&gt;So before the flip, the new stack was put through an &lt;em&gt;adversarial readiness audit&lt;/em&gt;: not a smoke test against the happy path, but a deliberate, multi-lens attempt to find how the migration would fail in production — run against the deployed stack while the old host was still authoritative and rollback was still free. It found a failure, and it was the worst possible kind.&lt;/p&gt;

&lt;h2&gt;
  
  
  The single most damning finding
&lt;/h2&gt;

&lt;p&gt;The live VPS database was missing &lt;strong&gt;every column the payment and membership write-paths depended on&lt;/strong&gt; — nine columns on the &lt;code&gt;submissions&lt;/code&gt; table, three on the &lt;code&gt;analyses&lt;/code&gt; table. The schema the code was written against and the schema actually deployed on the box had drifted apart.&lt;/p&gt;

&lt;p&gt;Trace what that means through a real Stripe checkout:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Customer enters card details and completes checkout. &lt;strong&gt;Stripe charges the card.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Stripe fires &lt;code&gt;checkout.session.completed&lt;/code&gt; to the webhook.&lt;/li&gt;
&lt;li&gt;The webhook handler tries to write the submission row — including columns that do not exist on the live table.&lt;/li&gt;
&lt;li&gt;The write throws. The handler returns &lt;strong&gt;HTTP 500.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The customer has already been charged. The system then errors out and cannot record the purchase, cannot queue the diagnostic, cannot deliver the report. This is the worst kind of failure: &lt;strong&gt;the money moves, the system errors, and no record survives on your side.&lt;/strong&gt; Not a declined card, not a graceful "try again", but a completed charge followed by a server error. Stripe holds the payment; the application has no row showing the transaction ever happened.&lt;/p&gt;

&lt;p&gt;Every single paid checkout after the DNS flip would have hit this. Not an edge case. The main revenue path, guaranteed to fail, on a stack that had already collected the customer's money.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix: migrate the schema on boot, idempotently
&lt;/h3&gt;

&lt;p&gt;The repair was not a one-time hand-run &lt;code&gt;ALTER TABLE&lt;/code&gt; on the box — that fixes today's database and silently rots the next time a fresh environment comes up. The fix was to make the application &lt;strong&gt;upgrade its own schema on startup&lt;/strong&gt;, so any database it boots against converges to the schema the code expects.&lt;/p&gt;

&lt;p&gt;The migration reads the current shape of each table, compares it to what the code needs, and applies only the missing changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;migrateSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PRAGMA table_info(submissions)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;have&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;wanted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;stripe_session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TEXT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;payment_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TEXT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;membership_tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TEXT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...the rest of the drifted columns&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wanted&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;have&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`ALTER TABLE submissions ADD COLUMN &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;PRAGMA table_info&lt;/code&gt; gives the current columns; the loop only issues &lt;code&gt;ALTER TABLE&lt;/code&gt; for the ones that are absent. That &lt;code&gt;have.has(name)&lt;/code&gt; guard is load-bearing: SQLite's &lt;code&gt;ALTER TABLE ADD COLUMN&lt;/code&gt; has no &lt;code&gt;IF NOT EXISTS&lt;/code&gt; clause, so it throws if the column is already there. The idempotency lives in the JavaScript check, not the SQL. Run it twice and the second pass is a clean no-op. A brand-new database converges to the full schema; an old drifted one gets exactly the missing columns added in place. (The &lt;code&gt;${name} ${type}&lt;/code&gt; interpolation is safe only because both come from a hardcoded object in the same file. Never substitute column names or types from untrusted input; SQLite won't bind them as parameters.)&lt;/p&gt;

&lt;p&gt;Verified against the live VPS database, it applied &lt;strong&gt;12 migrations in place&lt;/strong&gt; — the exact drift the audit had predicted, closed before a single customer touched the new stack. A regression test locks the behavior in: create an old-shape database, boot the app, assert the columns exist, boot again, assert the second boot is a clean no-op.&lt;/p&gt;
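&lt;p&gt;One way to make that regression test cheap is to separate the diff from the execution. The sketch below is an illustration, not the project's actual code: &lt;code&gt;planAddColumns&lt;/code&gt; is a hypothetical helper, and the "old" column list is made up. It computes the statements a boot would run, so the double-boot property can be asserted without a database.&lt;/p&gt;

```javascript
// Hypothetical helper: given the column names a table already has, return
// the ALTER statements a boot-time migration would still need to run.
function planAddColumns(table, existingNames, wanted) {
  const have = new Set(existingNames);
  const statements = [];
  for (const [name, type] of Object.entries(wanted)) {
    if (!have.has(name)) {
      statements.push(`ALTER TABLE ${table} ADD COLUMN ${name} ${type}`);
    }
  }
  return statements;
}

// The three column names come from the article; the old shape is illustrative.
const wanted = {
  stripe_session_id: "TEXT",
  payment_status: "TEXT",
  membership_tier: "TEXT",
};

// First boot against a drifted table: two columns are missing.
const firstBoot = planAddColumns("submissions", ["id", "payment_status"], wanted);

// Second boot, after the first plan was applied: nothing left to do.
const secondBoot = planAddColumns(
  "submissions",
  ["id", "payment_status", "stripe_session_id", "membership_tier"],
  wanted
);

console.log(firstBoot.length, secondBoot.length); // prints: 2 0
```

&lt;p&gt;An empty plan on the second pass is the "clean no-op" in testable form.&lt;/p&gt;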

&lt;h2&gt;
  
  
  What else the audit dragged into the light
&lt;/h2&gt;

&lt;p&gt;The schema drift was the headline, but an irreversible cutover has more than one way to go wrong. The audit was multi-lens on purpose — many independent passes, several review angles, each finding adversarially re-verified rather than trusted on first sight. Several of the others were also invisible until someone actually paid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A GCP secrets client on a non-GCP host.&lt;/strong&gt; The code carried a Google Secret Manager client inherited from its Firebase/GCP life. On a GCP host that client silently uses the platform's identity. On a self-hosted VPS there is no GCP metadata server — so the client hunts for one, and that hunt can hang process startup while it waits on a network endpoint that will never answer. You inherit this hazard for free when you self-host code that assumed it was running inside Google's platform. The fix was an env-first secrets model: secrets materialized at deploy time from an encrypted store into the process environment, and the cloud SDKs removed entirely — &lt;code&gt;@google-cloud/secret-manager&lt;/code&gt; and &lt;code&gt;google-auth-library&lt;/code&gt; deleted, pruning &lt;strong&gt;63 npm packages&lt;/strong&gt; from the install.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A dead gateway URL on the success page.&lt;/strong&gt; The deployed frontend bundle was still calling a decommissioned GCP gateway host — a stale fallback URL baked into the post-payment success page. It would only fire &lt;em&gt;after&lt;/em&gt; a customer paid, which is precisely why a normal click-through of the site never surfaced it. The path a paying customer takes is often the least-tested path on the whole site.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A login guaranteed to throw.&lt;/strong&gt; The Whop login flow referenced an undeclared &lt;code&gt;membership&lt;/code&gt; variable — residue from the deleted Firestore code path. Every login attempt would hit a &lt;code&gt;ReferenceError&lt;/code&gt; and break. Not intermittent, not conditional: a guaranteed crash on a core flow, left behind by the migration itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhook replay with no idempotency guard.&lt;/strong&gt; Stripe re-delivers webhooks; &lt;code&gt;checkout.session.completed&lt;/code&gt; can arrive more than once for the same session. Without a guard, a duplicate delivery re-queues the diagnostic work and re-runs the LLM — double cost, double side effects. The fix keys on the session ID and treats any already-seen event as a no-op acknowledgement.&lt;/p&gt;
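&lt;p&gt;The shape of such a guard can be sketched in a few lines. This is a minimal illustration under stated assumptions: the in-memory &lt;code&gt;Set&lt;/code&gt; stands in for a durable store, and &lt;code&gt;createCheckoutHandler&lt;/code&gt; and &lt;code&gt;enqueueDiagnostic&lt;/code&gt; are hypothetical names, not the project's own.&lt;/p&gt;

```javascript
// Sketch of a replay guard keyed on the Checkout Session ID.
// The Set is an in-memory stand-in; a real service would persist seen IDs
// (for example in a column with a UNIQUE constraint) so restarts keep them.
function createCheckoutHandler(enqueueDiagnostic) {
  const seenSessions = new Set();
  return function handleCheckoutCompleted(event) {
    const sessionId = event.data.object.id;
    if (seenSessions.has(sessionId)) {
      // Already processed: acknowledge so the sender stops retrying.
      return { status: 200, duplicate: true };
    }
    seenSessions.add(sessionId);
    enqueueDiagnostic(sessionId);
    return { status: 200, duplicate: false };
  };
}

const queued = [];
const handle = createCheckoutHandler(function (id) {
  queued.push(id);
});

// Illustrative event; the session ID is made up.
const event = {
  type: "checkout.session.completed",
  data: { object: { id: "cs_test_123" } },
};

const first = handle(event);
const replay = handle(event);
console.log(first.duplicate, replay.duplicate, queued.length); // prints: false true 1
```

&lt;p&gt;The replay still gets a 200, and the expensive work runs exactly once.&lt;/p&gt;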

&lt;h3&gt;
  
  
  Two testing traps the cold-CI sweep caught
&lt;/h3&gt;

&lt;p&gt;The same discipline — replicate the real environment instead of trusting the local one — surfaced two test bugs a warm laptop hides:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An exact-string environment gate.&lt;/strong&gt; The backend gates its mock-LLM path on the &lt;em&gt;exact&lt;/em&gt; string &lt;code&gt;'true'&lt;/code&gt;. The end-to-end test set &lt;code&gt;TEST_MOCK_LLM=1&lt;/code&gt;. Environment variables are always strings in Node, so the gate compares &lt;code&gt;"1" !== 'true'&lt;/code&gt; — the test would have driven a &lt;strong&gt;real, keyless LLM call&lt;/strong&gt; and failed the full-flow run for a reason that had nothing to do with the code under test. The lesson is unforgiving and portable: know exactly which value your gate compares against. &lt;code&gt;"1"&lt;/code&gt;, &lt;code&gt;"true"&lt;/code&gt;, &lt;code&gt;"yes"&lt;/code&gt;, and &lt;code&gt;true&lt;/code&gt; are four different things, and a strict-equality check honors exactly one of them.&lt;/p&gt;
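&lt;p&gt;The mismatch is easy to reproduce. In this sketch the strict gate mirrors the behavior described above, assuming the backend reads the same &lt;code&gt;TEST_MOCK_LLM&lt;/code&gt; variable the test sets; &lt;code&gt;parseBooleanEnv&lt;/code&gt; is a hypothetical alternative, shown only to make the accepted spellings explicit.&lt;/p&gt;

```javascript
// Strict gate: only the exact string "true" enables the mock path.
function isMockEnabledStrict(env) {
  return env.TEST_MOCK_LLM === "true";
}

// Hypothetical tolerant parser: accepts a small, explicit list of spellings.
function parseBooleanEnv(value) {
  if (value === undefined) return false;
  return ["1", "true", "yes"].includes(String(value).trim().toLowerCase());
}

console.log(isMockEnabledStrict({ TEST_MOCK_LLM: "1" })); // prints: false
console.log(isMockEnabledStrict({ TEST_MOCK_LLM: "true" })); // prints: true
console.log(parseBooleanEnv("1")); // prints: true
```

&lt;p&gt;Either approach works as long as the test and the gate agree on the spelling.&lt;/p&gt;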

&lt;p&gt;&lt;strong&gt;"Passes locally" that CI never saw.&lt;/strong&gt; A &lt;code&gt;ts-jest&lt;/code&gt; inline &lt;code&gt;tsconfig&lt;/code&gt; omitted &lt;code&gt;lib&lt;/code&gt;. TypeScript derives its default &lt;code&gt;lib&lt;/code&gt; from &lt;code&gt;target&lt;/code&gt;, and &lt;code&gt;target: es2022&lt;/code&gt; defaults to &lt;code&gt;["ES2022"]&lt;/code&gt; — which has no &lt;code&gt;DOM&lt;/code&gt;. Locally the project-level &lt;code&gt;tsconfig&lt;/code&gt; still supplied &lt;code&gt;DOM&lt;/code&gt;, so compilation passed; the inline &lt;code&gt;ts-jest&lt;/code&gt; config didn't inherit it, so CI type-checked against the bare &lt;code&gt;es2022&lt;/code&gt; defaults and the DOM globals — &lt;code&gt;window&lt;/code&gt;, &lt;code&gt;IntersectionObserver&lt;/code&gt; — failed to resolve. The green checkmark on the laptop came from config the CI transform never read. The fix pinned &lt;code&gt;lib: ["ES2022", "DOM", "DOM.Iterable"]&lt;/code&gt; and turned on &lt;code&gt;isolatedModules&lt;/code&gt;. A related trap in the same suite: the Playwright job ran &lt;code&gt;vite preview&lt;/code&gt; without building &lt;code&gt;dist&lt;/code&gt; first, so the preview server errored with "directory dist does not exist." Both share a moral — &lt;strong&gt;"it works on my machine" can quietly mean "it works with config and build artifacts CI never sees."&lt;/strong&gt; CI is cold and empty by design — which is exactly why an adversarial audit replicates the deploy environment instead of reading the code and trusting it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The remediation discipline
&lt;/h2&gt;

&lt;p&gt;The audit ran &lt;strong&gt;independent, adversarial review of the new stack before the irreversible step, while rollback was still free.&lt;/strong&gt; Many separate passes, several review lenses, every finding re-verified against the actual live artifacts rather than the code as written. The drift between "what the code assumes" and "what is actually deployed" is exactly the gap that a happy-path smoke test steps right over.&lt;/p&gt;

&lt;p&gt;Finding the bugs was half the job. The other half was making sure they stayed fixed. Before the flip, a revenue-path test suite was installed and gated in CI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;7 backend test suites, 92 tests total.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stripe&lt;/strong&gt; — first delivery, replay/idempotency, signature validation, and the checkout-session contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whop&lt;/strong&gt; — Standard-Webhooks verification, legacy HMAC, a length-mismatch path that must return 401, and the OAuth flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reports routes&lt;/strong&gt;, plus a &lt;strong&gt;rate-limit&lt;/strong&gt; test asserting the 11th submission returns 429.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema-migration regression&lt;/strong&gt; — old database upgrades cleanly, and a double boot is idempotent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDF generation&lt;/strong&gt; — output exceeds 10 KB and starts with the &lt;code&gt;%PDF-&lt;/code&gt; header.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backend coverage: 84.6% of lines, with a 60% floor gated in CI.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pre-remediation grade was recorded as &lt;strong&gt;D+&lt;/strong&gt; in a &lt;code&gt;TEST_AUDIT.md&lt;/code&gt;. That honest starting grade mattered: it named the gap in writing so the revenue-path P0s could be tracked and closed rather than waved through. By the flip, every revenue-path P0 was closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack the audit ran against
&lt;/h2&gt;

&lt;p&gt;For context on the scope of the drift, here is the substrate the traffic landed on — every layer new relative to the Firebase/GCP original, which is why nothing about the old deployment could be assumed to still hold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQLite via &lt;code&gt;better-sqlite3&lt;/code&gt; in WAL mode&lt;/strong&gt; for the database, with the local filesystem for artifacts — no managed database, no object store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI &lt;code&gt;gpt-4o&lt;/code&gt; through an OpenAI-compatible client.&lt;/strong&gt; The old &lt;code&gt;callVertexAI&lt;/code&gt; became &lt;code&gt;callLLM&lt;/code&gt;; the provider is now a pure environment swap, not a code change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Env-first secrets&lt;/strong&gt;, materialized from an encrypted (SOPS) source at deploy time — no cloud secrets SDK in the running process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A multi-stage Dockerfile with docker-compose&lt;/strong&gt;, fronted by &lt;strong&gt;Caddy&lt;/strong&gt; proxying &lt;code&gt;/api&lt;/code&gt; on the VPS, on &lt;strong&gt;Node 20.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firebase and Firestore deleted from the client entirely&lt;/strong&gt;, and every Google Cloud dependency removed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: &lt;code&gt;diagnosticpro.io&lt;/code&gt; DNS was flipped off Firebase and went live on the VPS on 2026-07-01.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before you flip DNS — a short checklist
&lt;/h2&gt;

&lt;p&gt;Take this to your own cutover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit the deployed artifact, not the source.&lt;/strong&gt; The schema drift was invisible in the code — it lived in the gap between the repo and the live box. Point your verification at what is actually running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Walk the money path end to end.&lt;/strong&gt; The success page, the webhook write, the receipt — the paths a paying customer takes are often the least-exercised on the whole site, and the most expensive to get wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grep the new host for assumptions from the old one.&lt;/strong&gt; Cloud SDKs, metadata-server calls, gateway URLs, identity clients — anything that quietly depended on the previous platform is now a boot hazard or a dead call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make schema changes converge, not command.&lt;/strong&gt; A boot-time idempotent migration fixes every environment forever; a hand-run &lt;code&gt;ALTER TABLE&lt;/code&gt; fixes exactly one database until the next fresh deploy rots it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust cold CI over a warm laptop.&lt;/strong&gt; "Passes locally" can mean "passes with config and build artifacts CI never sees." Run it clean before you believe it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The rule
&lt;/h2&gt;

&lt;p&gt;The moment before an irreversible cutover is the &lt;strong&gt;cheapest time you will ever have&lt;/strong&gt; to find catastrophic bugs. After the DNS flip, the schema drift is a live payment incident with charged customers and no records. Before it, it is a diff. The distance between those two costs is one adversarial audit of the new stack plus a revenue-path test suite — run while rollback is still nothing more than reverting a config. Buy the safety while it is free.&lt;/p&gt;

</description>
      <category>migration</category>
      <category>selfhosting</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>Surviving CoreWeave: the GPU failures that burn your hours</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Thu, 02 Jul 2026 22:13:15 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/surviving-coreweave-the-gpu-failures-that-burn-your-hours-1233</link>
      <guid>https://dev.to/jeremy_longshore/surviving-coreweave-the-gpu-failures-that-burn-your-hours-1233</guid>
      <description>&lt;p&gt;Your 64-GPU job just threw &lt;strong&gt;Xid 94&lt;/strong&gt;. Reschedule, or is the node dead?&lt;/p&gt;

&lt;p&gt;It comes down to one bit, and most people get it wrong. They see "memory error," panic, drain the whole node, and lose an hour babysitting a machine that was fine. Or they do the opposite — retry blindly onto a GPU that's quietly returning corrupt gradients.&lt;/p&gt;

&lt;p&gt;CoreWeave gives you world-class hardware with very sharp edges. The failure modes aren't exotic. They're just undocumented in the place you're looking when a run dies at 2 a.m. Here are the five that burn the most hours, and how to read them fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Node forensics: read the Xid before you touch anything
&lt;/h2&gt;

&lt;p&gt;When a GPU misbehaves, the driver logs an &lt;strong&gt;Xid&lt;/strong&gt; to &lt;code&gt;dmesg&lt;/code&gt;. The Xid number is the whole story. Learn three of them and you'll triage 90% of node incidents without a support ticket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Xid 79 — "GPU has fallen off the bus."&lt;/strong&gt; The driver can't reach the GPU over PCIe anymore. The NVIDIA catalog's recommended action is &lt;code&gt;RESTART_BM&lt;/code&gt; — restart the bare metal. The card is gone until the node reboots. Don't fight it; the node needs a power cycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Xid 94 — "Contained memory error."&lt;/strong&gt; A memory error the hardware isolated to &lt;em&gt;one&lt;/em&gt; application. Action: &lt;code&gt;RESTART_APP&lt;/code&gt;. Every other process on that GPU keeps running. You restart your job and move on. The node is not dead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Xid 95 — "Uncontained memory error."&lt;/strong&gt; The error escaped the faulting process. Other work on that GPU may have read corrupt data. Action: &lt;code&gt;RESET_GPU&lt;/code&gt;. Now the GPU is suspect and everything that touched it is suspect with it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the bit everyone gets wrong. &lt;strong&gt;94 is contained — reschedule. 95 is uncontained — the GPU is out.&lt;/strong&gt; Same three words in the log ("GPU memory error"), opposite responses.&lt;/p&gt;
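&lt;p&gt;The three-entry table above is small enough to encode directly. The sketch below is illustrative: &lt;code&gt;triageXid&lt;/code&gt; is a hypothetical helper, the sample log line is made up, and the regular expression assumes the usual &lt;code&gt;NVRM: Xid (device): number&lt;/code&gt; layout, which may differ across driver versions.&lt;/p&gt;

```javascript
// Actions for the three Xids discussed above; anything else is unclassified.
const XID_ACTIONS = {
  79: "RESTART_BM",
  94: "RESTART_APP",
  95: "RESET_GPU",
};

// Pull the device and Xid number out of one kernel log line.
function triageXid(dmesgLine) {
  const match = /Xid \(([^)]*)\): (\d+)/.exec(dmesgLine);
  if (match === null) return null;
  const xid = Number(match[2]);
  return { device: match[1], xid: xid, action: XID_ACTIONS[xid] || "UNKNOWN" };
}

// Illustrative line, not captured from a real node.
const sample = "NVRM: Xid (PCI:0000:3b:00): 94, pid=1234, Contained memory error";
const verdict = triageXid(sample);
console.log(verdict.xid, verdict.action); // prints: 94 RESTART_APP
```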

&lt;p&gt;The other pair worth memorizing is &lt;strong&gt;Xid 63 / 64&lt;/strong&gt; — row remapper events. Ampere and newer GPUs quietly retire bad memory rows. When you suspect a flaky card, read the remap state directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; ROW_REMAPPER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two fields decide the verdict. &lt;strong&gt;&lt;code&gt;Pending: Yes&lt;/code&gt;&lt;/strong&gt; means a remap is scheduled but won't take effect until the GPU is reset — the repair is durably recorded, just not live. &lt;strong&gt;&lt;code&gt;Remapping Failure Occurred: Yes&lt;/code&gt;&lt;/strong&gt; is the one that ends the conversation: the remap couldn't be written (that's the Xid 64 case), and the card needs to come out of service. Pending is routine. Failure is terminal.&lt;/p&gt;
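&lt;p&gt;If you want automation to read those two fields, a rough sketch looks like this. It assumes the output lists each field as &lt;code&gt;Label : Value&lt;/code&gt; and only reads the first GPU section; &lt;code&gt;readRemapVerdict&lt;/code&gt; and the sample text are hypothetical, so check them against real output from your driver version.&lt;/p&gt;

```javascript
// Classify a GPU from the text of `nvidia-smi -q -d ROW_REMAPPER`.
// Assumes "Label : Value" lines and reads only the first match of each label.
function readRemapVerdict(nvidiaSmiText) {
  function field(label) {
    const match = new RegExp(label + "\\s*:\\s*(\\w+)").exec(nvidiaSmiText);
    return match === null ? null : match[1];
  }
  const pending = field("Pending");
  const failed = field("Remapping Failure Occurred");
  if (pending === null || failed === null) return "unknown";
  if (failed === "Yes") return "remove-from-service";
  if (pending === "Yes") return "reset-to-apply-remap";
  return "ok";
}

// Illustrative output, not captured from a real node.
const sampleOutput = [
  "    Remapped Rows",
  "        Correctable Error          : 0",
  "        Uncorrectable Error        : 1",
  "        Pending                    : Yes",
  "        Remapping Failure Occurred : No",
].join("\n");

console.log(readRemapVerdict(sampleOutput)); // prints: reset-to-apply-remap
```

&lt;p&gt;Failure is checked before pending on purpose: a card with both flags set is still a card that needs to come out.&lt;/p&gt;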

&lt;p&gt;One CoreWeave-specific rule sits on top of all this: &lt;strong&gt;never manually uncordon a health cordon.&lt;/strong&gt; CoreWeave's Node Life Cycle controller cordons nodes automatically on GPU hardware errors and reboots them. Those node conditions are internal machinery, not a knob for your automation. If you &lt;code&gt;kubectl uncordon&lt;/code&gt; a node CoreWeave cordoned for a hardware fault, you're scheduling work back onto a known-bad card. Let the controller clear it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fabric lies quietly: RDMA that fell back to TCP
&lt;/h2&gt;

&lt;p&gt;This one doesn't crash. It just makes your training 5x slower and says nothing.&lt;/p&gt;

&lt;p&gt;GPUDirect RDMA over InfiniBand is what makes multi-node training fast. When it silently falls back to TCP sockets, your &lt;code&gt;all_reduce&lt;/code&gt; still completes — over Ethernet, at a fraction of the bandwidth. No error. No log line screaming at you. Just a job that's mysteriously behind schedule.&lt;/p&gt;

&lt;p&gt;Three things have to be true, and all three fail quietly if you miss one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request the InfiniBand resource in both requests and limits.&lt;/strong&gt; It's a boolean scheduling flag, not a count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;rdma/ib&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;rdma/ib&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Miss it in either block and your pod schedules onto a node without IB, and NCCL shrugs and uses TCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Point NCCL at the right interfaces.&lt;/strong&gt; On CoreWeave that's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;NCCL_IB_HCA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ibp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;NCCL_SOCKET_IFNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;eth0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Then confirm it actually took.&lt;/strong&gt; Turn on debug output and read one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;NCCL_DEBUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;INFO
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;NET/IB&lt;/code&gt;. That's InfiniBand — good. If you see &lt;code&gt;NET/Socket&lt;/code&gt;, you're on TCP and every second of that run is wasted. CoreWeave's own docs put it plainly: if a multi-node job runs but throughput is far below expectations, NCCL has usually fallen back from InfiniBand to TCP.&lt;/p&gt;
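&lt;p&gt;That check can be scripted against a captured log. The sketch below is a rough illustration: &lt;code&gt;detectNcclTransport&lt;/code&gt; is a hypothetical helper, the sample lines are made up, and the exact wording NCCL prints varies by version, so treat the substring match as an assumption to verify on your own logs.&lt;/p&gt;

```javascript
// Classify a captured NCCL_DEBUG=INFO log by the transport markers it contains.
// Any NET/Socket line is treated as a fallback, even if NET/IB also appears.
function detectNcclTransport(logText) {
  const lines = logText.split("\n");
  const usesSocket = lines.some(function (line) {
    return line.includes("NET/Socket");
  });
  const usesIb = lines.some(function (line) {
    return line.includes("NET/IB");
  });
  if (usesSocket) return "tcp-fallback";
  if (usesIb) return "infiniband";
  return "unknown";
}

// Illustrative log fragments, not real NCCL output.
const goodLog = "NCCL INFO Bootstrap : Using eth0\nNCCL INFO NET/IB : Using [0]ibp0";
const badLog = "NCCL INFO Bootstrap : Using eth0\nNCCL INFO NET/Socket : Using [0]eth0";

console.log(detectNcclTransport(goodLog)); // prints: infiniband
console.log(detectNcclTransport(badLog)); // prints: tcp-fallback
```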

&lt;p&gt;Prove the fabric before you trust it. Run &lt;code&gt;all_reduce_perf&lt;/code&gt; from CoreWeave's &lt;code&gt;nccl-tests&lt;/code&gt; and check the reported bus bandwidth. On a 64-GPU H100 run over InfiniBand you should see hundreds of GB/s — the exact number moves with your NCCL version, so baseline against CoreWeave's published manifests, not a screenshot from someone's blog. If your busbw is an order of magnitude low, you already know why: go back and read the &lt;code&gt;NET/&lt;/code&gt; line.&lt;/p&gt;

&lt;h2&gt;
  
  
  Launch: MPIJob, not PyTorchJob
&lt;/h2&gt;

&lt;p&gt;The obvious path — Kubeflow's &lt;code&gt;PyTorchJob&lt;/code&gt; — is not the path CoreWeave paves. Their reference manifests for multi-node NCCL runs use the &lt;strong&gt;MPI Operator (&lt;code&gt;MPIJob&lt;/code&gt;)&lt;/strong&gt;, with SUNK/Slurm as the other supported route. Follow the paved road; the sharp edges are already sanded off it.&lt;/p&gt;

&lt;p&gt;Two traps live here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;slotsPerWorker&lt;/code&gt; must equal the GPUs per pod.&lt;/strong&gt; The MPI Operator launches one rank per slot. Set eight GPUs per worker and four slots, and &lt;code&gt;mpirun&lt;/code&gt; spins up the wrong number of ranks — half your GPUs sit idle while the job reports "running."&lt;/p&gt;
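&lt;p&gt;A pre-submit check catches this before the job burns any GPU time. The sketch below assumes the manifest has already been parsed into an object with the usual &lt;code&gt;MPIJob&lt;/code&gt; layout (&lt;code&gt;spec.slotsPerWorker&lt;/code&gt;, &lt;code&gt;spec.mpiReplicaSpecs.Worker&lt;/code&gt;) and that the first container carries the GPU limit; &lt;code&gt;checkSlots&lt;/code&gt; and the sample values are hypothetical.&lt;/p&gt;

```javascript
// Compare slotsPerWorker with the GPU limit on the worker's first container.
function checkSlots(mpiJob) {
  const slots = mpiJob.spec.slotsPerWorker;
  const worker = mpiJob.spec.mpiReplicaSpecs.Worker;
  const container = worker.template.spec.containers[0];
  const gpus = Number(container.resources.limits["nvidia.com/gpu"]);
  return {
    slots: slots,
    gpusPerPod: gpus,
    ranks: worker.replicas * slots,
    totalGpus: worker.replicas * gpus,
    ok: slots === gpus,
  };
}

// Illustrative manifest: 8 workers with 8 GPUs each, but only 4 slots.
const job = {
  spec: {
    slotsPerWorker: 4,
    mpiReplicaSpecs: {
      Worker: {
        replicas: 8,
        template: {
          spec: {
            containers: [{ resources: { limits: { "nvidia.com/gpu": 8 } } }],
          },
        },
      },
    },
  },
};

const report = checkSlots(job);
console.log(report.ranks, report.totalGpus, report.ok); // prints: 32 64 false
```

&lt;p&gt;Thirty-two ranks on sixty-four GPUs is the "half idle" case from above, caught before launch.&lt;/p&gt;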

&lt;p&gt;&lt;strong&gt;Don't reach for the SUNK pod scheduler to gang-schedule a distributed job.&lt;/strong&gt; Its own docs are blunt: &lt;em&gt;"No gang scheduling. The scheduler schedules each Pod as a separate Slurm job. Multi-node PodGroups aren't supported."&lt;/em&gt; It's built for single-node work like inference. Point it at a multi-pod training job and you get piecemeal scheduling — some ranks start, the rest stay &lt;code&gt;Pending&lt;/code&gt;, and the ranks that started block forever at NCCL rendezvous waiting for peers that were never scheduled. Your job hangs at init with no error. Use &lt;code&gt;MPIJob&lt;/code&gt; or a real Slurm allocation, which grab the whole gang at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage: the endpoint decides your throughput
&lt;/h2&gt;

&lt;p&gt;CoreWeave AI Object Storage is S3-compatible, so your existing code "just works." That's the trap — it works at the wrong speed if you use the wrong hostname.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inside the cluster, hit &lt;code&gt;http://cwlota.com&lt;/code&gt;.&lt;/strong&gt; That's LOTA, the Local Object Transport Accelerator: a proxy that caches objects on the GPU nodes' local NVMe. The public &lt;code&gt;https://cwobject.com&lt;/code&gt; endpoint works too, but it routes you around the accelerator you're paying for. In-cluster: &lt;code&gt;cwlota.com&lt;/code&gt;. Full stop.&lt;/p&gt;

&lt;p&gt;LOTA has rules. &lt;strong&gt;It only caches objects larger than 4 MB&lt;/strong&gt; — anything smaller bypasses the cache entirely. So the classic "millions of tiny files" dataset gets zero benefit. Consolidate small samples into larger archives (WebDataset, TAR, TFRecord) so your objects clear the 4 MB bar, and use a &lt;strong&gt;minimum 50 MB multipart part size&lt;/strong&gt; to cut request overhead.&lt;/p&gt;

&lt;p&gt;The payoff is real once the cache warms. CoreWeave's own benchmark on 160 H200 GPUs measured about &lt;strong&gt;24 GiB/s aggregate cold&lt;/strong&gt; (first minute, cache empty) climbing to &lt;strong&gt;368 GiB/s warm&lt;/strong&gt; — roughly 2.3 GiB/s per GPU. Cold reads are network-bound; warm reads come off local NVMe. Structure your data loader to warm the cache early.&lt;/p&gt;

&lt;p&gt;For weights, &lt;strong&gt;Tensorizer&lt;/strong&gt; loads faster than the standard path — CoreWeave clocked GPT-J-6B at ~8.2 s median versus ~15 s for HuggingFace on A40s, by streaming tensors at wire speed instead of deserializing a checkpoint blob.&lt;/p&gt;

&lt;p&gt;For resilience, use &lt;strong&gt;PyTorch Distributed Checkpoint&lt;/strong&gt;. &lt;code&gt;torch.distributed.checkpoint.async_save&lt;/code&gt; writes checkpoints on a background thread so training barely stalls, and DCP's load-time &lt;strong&gt;resharding&lt;/strong&gt; lets you save on one topology and resume on another. That last part matters here: when CoreWeave cordons a node mid-run and you come back on a different world size, a DCP checkpoint loads anyway. A rigidly-sharded one doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost: there is no dashboard — you build it
&lt;/h2&gt;

&lt;p&gt;Here's the one that surprises every finance-conscious team: &lt;strong&gt;CoreWeave ships no cost dashboard and no billing API.&lt;/strong&gt; You want to know what you're spending? You build it yourself, in PromQL, against their managed Grafana.&lt;/p&gt;

&lt;p&gt;The usage metrics are there. The one you start with is &lt;code&gt;billing:instance:total&lt;/code&gt; — running instances by cluster — alongside &lt;code&gt;billing:object_storage_used_bytes:total&lt;/code&gt; and friends. Multiply usage by your rate card inside the query to synthesize a cost estimate; there's no single "dollars" metric to read. You'll need to be in the &lt;code&gt;admin&lt;/code&gt;, &lt;code&gt;metrics&lt;/code&gt;, or &lt;code&gt;write&lt;/code&gt; group to see them.&lt;/p&gt;

&lt;p&gt;Two levers keep the number down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reserve carefully.&lt;/strong&gt; Reserved capacity runs 25–60% off on-demand, with the deepest cuts on multi-year H100/H200 commitments. But reserved means &lt;em&gt;paid whether or not you use it&lt;/em&gt; — an idle reservation is worse than on-demand. Reserve only your steady-state floor; burst the rest.&lt;/p&gt;
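
&lt;p&gt;The "idle reservation is worse than on-demand" point reduces to one line of arithmetic. If a reservation costs &lt;code&gt;(1 - discount)&lt;/code&gt; of the on-demand rate for every hour, used or not, it only beats on-demand when the share of hours you actually use exceeds &lt;code&gt;1 - discount&lt;/code&gt;. The sketch below works that out; the function names and the example rate are hypothetical, and it ignores anything else a contract might bundle in.&lt;/p&gt;

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of reserved hours you must use before reserving beats on-demand.

    Reserved cost for H hours:          H * rate * (1 - discount)
    On-demand cost for u * H used hours: u * H * rate
    Setting them equal gives u = 1 - discount.
    """
    if discount >= 1.0 or 0.0 > discount:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def effective_rate(on_demand_rate: float, discount: float, utilization: float) -> float:
    """Price paid per hour of GPU time actually used on a reservation."""
    if not utilization > 0.0:
        raise ValueError("utilization must be positive")
    return on_demand_rate * (1.0 - discount) / utilization


if __name__ == "__main__":
    for discount in (0.25, 0.60):  # the two ends of the range quoted above
        print(f"{discount:.0%} off: break-even at {break_even_utilization(discount):.0%} utilization")
    # Hypothetical rate of 10.0 per hour, 50% discount, half the hours idle.
    print(effective_rate(10.0, 0.50, 0.50))
```

&lt;p&gt;At the two ends of the quoted range, the break-even sits at 75% and 40% utilization, which is why the reservation should cover only the floor you are confident of filling.&lt;/p&gt;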

&lt;p&gt;&lt;strong&gt;Right-size the GPU to the model.&lt;/strong&gt; An H100 is wasted on small-model inference. Third-party 2026 benchmarks consistently put the L40S ahead of the H100 on cost-per-token for models in the ~7B–30B range — reach for the H100 only when you genuinely need ultra-low latency. (Treat the exact dollar figures as directional; they swing hard by benchmark and by month.) And remember the hardware line: &lt;strong&gt;FP8 lives on Hopper and Ada, not Ampere&lt;/strong&gt; — if your inference plan is FP8, an A100 can't run it, so match the silicon to the numeric format before you commit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read the Xid first.&lt;/strong&gt; 79 → node reboot. 94 → contained, restart your app. 95 → uncontained, the GPU is out. &lt;code&gt;nvidia-smi -q -d ROW_REMAPPER&lt;/code&gt;: Pending is fine, Remapping Failure is terminal. Never uncordon a health cordon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confirm &lt;code&gt;NET/IB&lt;/code&gt;, not &lt;code&gt;NET/Socket&lt;/code&gt;.&lt;/strong&gt; RDMA fails silent and slow. &lt;code&gt;rdma/ib: 1&lt;/code&gt; in requests &lt;em&gt;and&lt;/em&gt; limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launch with &lt;code&gt;MPIJob&lt;/code&gt;.&lt;/strong&gt; &lt;code&gt;slotsPerWorker&lt;/code&gt; = GPUs per pod. The SUNK pod scheduler won't gang-schedule — your job will hang at rendezvous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;cwlota.com&lt;/code&gt; in-cluster.&lt;/strong&gt; Keep objects above 4 MB, parts above 50 MB. &lt;code&gt;async_save&lt;/code&gt; + DCP resharding survives node loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build your cost view in PromQL.&lt;/strong&gt; No dashboard exists. Reserve only your floor; put small-model inference on cheaper silicon.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is in one place, which is exactly why it costs people days. Now it's in one place.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeremy Longshore builds Claude Code skills for infrastructure platforms at Intent Solutions — the CoreWeave GPU-ops pack lives in &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills" rel="noopener noreferrer"&gt;claude-code-plugins&lt;/a&gt;. Community-contributed. Not affiliated with, endorsed by, or sponsored by CoreWeave, Inc.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>debugging</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Gate the Statement, Not the Tool Name</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Thu, 02 Jul 2026 21:35:25 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/gate-the-statement-not-the-tool-name-2hpc</link>
      <guid>https://dev.to/jeremy_longshore/gate-the-statement-not-the-tool-name-2hpc</guid>
      <description>&lt;p&gt;The original safety gate on the Dolt-over-MCP plugin tried to keep a Claude Code agent harmless by excluding "history-affecting tools" from its MCP grant. It was the wrong granularity, and it did nothing.&lt;/p&gt;

&lt;p&gt;MCP exposes the entire database through one tool — &lt;code&gt;query&lt;/code&gt; / &lt;code&gt;exec&lt;/code&gt; — and that tool carries every SQL verb. &lt;code&gt;SELECT&lt;/code&gt; rides it. So does &lt;code&gt;CALL DOLT_PUSH&lt;/code&gt;, &lt;code&gt;CALL DOLT_RESET('--hard')&lt;/code&gt;, &lt;code&gt;DROP DATABASE&lt;/code&gt;, and &lt;code&gt;CALL DOLT_BRANCH('-D', 'main')&lt;/code&gt;. Excluding "dangerous tools" from the grant accomplishes nothing, because the dangerous verbs live &lt;em&gt;inside&lt;/em&gt; the one tool you already granted. The destructive operations were never separate tools to exclude.&lt;/p&gt;

&lt;p&gt;This is the reframe the whole Phase 0 hardening pass turned on: &lt;strong&gt;a tool-name allowlist is meaningless for any tool that carries a sub-language.&lt;/strong&gt; SQL is a sub-language. So is the shell behind a &lt;code&gt;Bash&lt;/code&gt; tool. So is anything behind an &lt;code&gt;eval&lt;/code&gt;. If the tool can run arbitrary statements in some grammar, the only boundary that means anything is one that reads the statement. It is the move from tool-name allowlisting to capability-based security: the grant stops being "you may call the &lt;code&gt;query&lt;/code&gt; tool" and becomes "you may run these statement classes inside it."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just allowlist the safe tools?
&lt;/h2&gt;

&lt;p&gt;Because there is exactly one tool, and it is not safe or unsafe — it is whatever statement you hand it. You cannot partition a single door into a safe door and a dangerous door by naming. The same logic kills the next-obvious fix: a denylist of dangerous verbs. Blacklist &lt;code&gt;DOLT_PUSH&lt;/code&gt;, &lt;code&gt;DOLT_RESET&lt;/code&gt;, &lt;code&gt;DROP&lt;/code&gt;... and miss &lt;code&gt;DOLT_REBASE&lt;/code&gt;, or the proc Dolt ships next quarter, or a &lt;code&gt;CALL&lt;/code&gt; whose name your regex didn't anticipate. A denylist is only as good as your imagination on the day you wrote it.&lt;/p&gt;

&lt;p&gt;The fix inverts that. &lt;strong&gt;You add safety by enumerating what is safe, not by blacklisting what is dangerous.&lt;/strong&gt; Anything you cannot positively classify as safe is treated as the most dangerous thing it could be. Default-deny the unknown. It's least privilege applied to a grammar: the agent gets only the verbs it can prove it needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The classifier: three verb classes, decided before the server sees it
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;scripts/sql_classifier.py&lt;/code&gt; is 259 lines of pure stdlib. No import side effects, so the 22 unit tests in &lt;code&gt;tests/test_sql_classifier.py&lt;/code&gt; import it directly and hammer it in isolation. &lt;code&gt;scripts/dolt-mcp-client.py&lt;/code&gt; makes it the chokepoint — every statement is classified &lt;em&gt;before&lt;/em&gt; it reaches the dolt-mcp server, into one of three classes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;read&lt;/strong&gt; — &lt;code&gt;SELECT&lt;/code&gt; / &lt;code&gt;SHOW&lt;/code&gt; / &lt;code&gt;DESCRIBE&lt;/code&gt; / &lt;code&gt;EXPLAIN&lt;/code&gt; / read-only table functions → executes freely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;safe-write&lt;/strong&gt; — &lt;code&gt;INSERT&lt;/code&gt; / &lt;code&gt;UPDATE&lt;/code&gt; / &lt;code&gt;DELETE&lt;/code&gt; / &lt;code&gt;CREATE TABLE&lt;/code&gt; / &lt;code&gt;CALL DOLT_COMMIT&lt;/code&gt; / &lt;code&gt;DOLT_CHECKOUT&lt;/code&gt; / &lt;code&gt;DOLT_BRANCH&lt;/code&gt; (create) → executes &lt;strong&gt;only&lt;/strong&gt; on an agent-owned branch (never &lt;code&gt;main&lt;/code&gt;) and &lt;strong&gt;only&lt;/strong&gt; under &lt;code&gt;--allow-mutation&lt;/code&gt;. On pre-GA / alpha database flavors it's refused entirely — read-only there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;history-affecting&lt;/strong&gt; — &lt;code&gt;CALL DOLT_PUSH&lt;/code&gt; / &lt;code&gt;DOLT_PULL&lt;/code&gt; / &lt;code&gt;DOLT_MERGE&lt;/code&gt; / &lt;code&gt;DOLT_REBASE&lt;/code&gt; / &lt;code&gt;DOLT_RESET('--hard')&lt;/code&gt; / branch-or-tag delete / &lt;code&gt;DROP DATABASE&lt;/code&gt; / &lt;code&gt;GRANT&lt;/code&gt; / any unknown &lt;code&gt;CALL …&lt;/code&gt; → &lt;strong&gt;always refused.&lt;/strong&gt; The classifier is recommend-only here: it surfaces the exact command it would have run and a human runs it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent gets to mutate its own scratch branch. It never gets to rewrite shared history. That line is drawn by reading the verb, not by trusting a tool grant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fail-safe details are the whole point
&lt;/h2&gt;

&lt;p&gt;A statement-level gate is really an input-validation boundary — every statement is validated before it executes — and a classifier is only as good as its failure mode. This one fails closed, and the details are where a naive regex gate quietly gets it wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default-deny the unknown.&lt;/strong&gt; Any &lt;code&gt;CALL&lt;/code&gt; with no resolvable procedure name, and any unrecognized &lt;code&gt;CALL DOLT_*&lt;/code&gt;, is classified history-affecting — refused. When Dolt ships a new stored proc, it lands on the deny side automatically, with no code change. That's the enumerate-the-safe principle paying rent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch = max severity.&lt;/strong&gt; A multi-statement batch is classified at the severity of its most dangerous statement. A read prefix cannot smuggle a write past the gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A batch is as dangerous as its worst statement.
# "SELECT 1; CALL DOLT_PUSH(...)" classifies as history-affecting, not read.
&lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;classify_statement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;split_statements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SEVERITY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Comment-stripping is quote-aware.&lt;/strong&gt; &lt;code&gt;/* */&lt;/code&gt;, &lt;code&gt;--&lt;/code&gt;, and &lt;code&gt;#&lt;/code&gt; comments are stripped before classification, so a leading comment can't hide the statement's real first verb. String literals are preserved, though, and that matters more than it looks. The &lt;code&gt;--hard&lt;/code&gt; inside &lt;code&gt;CALL DOLT_RESET('--hard')&lt;/code&gt; must &lt;em&gt;not&lt;/em&gt; be mistaken for the start of a &lt;code&gt;--&lt;/code&gt; line comment. Get that wrong and a hard reset reads as a soft one. The contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;strip_sql_comments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove -- , # , and /* */ comments. Quote-aware.

    Inside a &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; string literal, comment markers are
    inert: the --hard in CALL DOLT_RESET(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--hard&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) survives intact.
    Backslash and doubled-quote escapes are honored so a quote inside
    a literal doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t prematurely end it.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
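
&lt;p&gt;One way to satisfy that contract is a small character-level state machine. The body below is a sketch written for this post's description, not the code from &lt;code&gt;scripts/sql_classifier.py&lt;/code&gt;. It makes two simplifying assumptions: any &lt;code&gt;--&lt;/code&gt; outside a literal starts a comment (stricter dialects require whitespace after it), and backtick-quoted identifiers are not treated specially.&lt;/p&gt;

```python
def strip_sql_comments(sql: str) -> str:
    """Remove --, #, and /* */ comments while leaving string literals untouched."""
    out = []
    i, n = 0, len(sql)
    quote = None  # quote character of the literal we are inside, or None
    while n > i:
        ch = sql[i]
        nxt = sql[i + 1] if n > i + 1 else ""
        if quote is not None:
            out.append(ch)
            if ch == "\\" and nxt:
                out.append(nxt)  # backslash escape: copy the escaped character as-is
                i += 2
                continue
            if ch == quote:
                if nxt == quote:  # doubled quote stays inside the literal
                    out.append(nxt)
                    i += 2
                    continue
                quote = None  # literal ends here
            i += 1
            continue
        if ch in ("'", '"'):
            quote = ch
            out.append(ch)
            i += 1
        elif ch == "#" or (ch == "-" and nxt == "-"):
            # Line comment: skip to end of line, keep the newline itself.
            while n > i and sql[i] != "\n":
                i += 1
        elif ch == "/" and nxt == "*":
            end = sql.find("*/", i + 2)
            i = n if end == -1 else end + 2
            out.append(" ")  # keep the surrounding tokens separated
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_sql_comments("CALL DOLT_RESET('--hard') -- looks harmless"))
    print(strip_sql_comments("/* SELECT */ CALL DOLT_PUSH()"))
```

&lt;p&gt;An unterminated block comment swallows the rest of the input, which keeps the failure mode closed: the classifier sees less text, never a verb that was meant to be hidden.&lt;/p&gt;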



&lt;p&gt;&lt;strong&gt;&lt;code&gt;DOLT_RESET&lt;/code&gt; is severity-split on its argument.&lt;/strong&gt; Soft reset is safe-write. &lt;code&gt;--hard&lt;/code&gt; is history-affecting. Same proc name, two classes, decided by reading the argument, which is why the comment stripper has to leave string literals intact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cannot prove it's a read → at least safe-write.&lt;/strong&gt; Ambiguity loses. A &lt;code&gt;WITH&lt;/code&gt; (CTE) resolves to whatever it ultimately wraps — &lt;code&gt;WITH x AS (...) SELECT&lt;/code&gt; is read; &lt;code&gt;WITH x AS (...) DELETE&lt;/code&gt; is safe-write. The classifier never guesses in the agent's favor.&lt;/p&gt;
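
&lt;p&gt;Put together, the rules above fit in a few dozen lines. This is an illustrative sketch, not the 259-line classifier: the verb sets, &lt;code&gt;CALL_RE&lt;/code&gt;, and &lt;code&gt;classify_batch&lt;/code&gt; are assumptions made for the example. It expects comments to be stripped already, splits batches naively on &lt;code&gt;;&lt;/code&gt; (which a real gate must do quote-aware), and leaves out &lt;code&gt;WITH&lt;/code&gt; resolution and the &lt;code&gt;DOLT_BRANCH&lt;/code&gt; create/delete split, so both fall through to the deny side here.&lt;/p&gt;

```python
import re

SEVERITY = {"read": 0, "safe-write": 1, "history-affecting": 2}

# Enumerate what is safe. Everything else is denied by falling through.
READ_VERBS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN"}
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}
SAFE_PROCS = {"DOLT_COMMIT", "DOLT_CHECKOUT"}

CALL_RE = re.compile(
    r"^CALL\s+([A-Za-z_][A-Za-z0-9_]*)\s*\((.*)\)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def classify_statement(stmt: str) -> str:
    """Classify one comment-free statement as read, safe-write, or history-affecting."""
    text = stmt.strip()
    if not text:
        return "read"
    verb = text.split(None, 1)[0].upper()
    if verb in READ_VERBS:
        return "read"
    if verb in WRITE_VERBS:
        return "safe-write"
    if re.match(r"CREATE\s+TABLE\b", text, re.IGNORECASE):
        return "safe-write"
    if verb == "CALL":
        match = CALL_RE.match(text)
        if match is None:
            return "history-affecting"  # no resolvable procedure name
        proc, args = match.group(1).upper(), match.group(2)
        if proc in SAFE_PROCS:
            return "safe-write"
        if proc == "DOLT_RESET":
            # Same procedure, two classes: the argument decides.
            return "history-affecting" if "--hard" in args.lower() else "safe-write"
        return "history-affecting"  # unknown CALL: default-deny
    return "history-affecting"  # unknown verb: default-deny


def classify_batch(sql: str) -> str:
    """A batch is as dangerous as its worst statement."""
    statements = [s for s in sql.split(";") if s.strip()]  # naive split, see lead-in
    return max(
        (classify_statement(s) for s in statements),
        key=lambda c: SEVERITY[c],
        default="read",
    )


if __name__ == "__main__":
    for sql in ("SELECT 1", "SELECT 1; CALL DOLT_PUSH('origin', 'main')", "CALL DOLT_NEW_PROC()"):
        print(sql, "->", classify_batch(sql))
```

&lt;p&gt;Note that nothing in the sketch names a dangerous procedure other than the &lt;code&gt;DOLT_RESET&lt;/code&gt; split. A procedure that ships next quarter is refused because it is absent from the safe sets, not because someone remembered to list it.&lt;/p&gt;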

&lt;h2&gt;
  
  
  The Bash door: the §10 union gate
&lt;/h2&gt;

&lt;p&gt;Hardening the MCP path left a second door open. The original safety check inspected only &lt;code&gt;mcp__*&lt;/code&gt; grants. It was blind to the fact that an agent could still be handed &lt;code&gt;Bash(dolt:*)&lt;/code&gt; or &lt;code&gt;Bash(bash:*)&lt;/code&gt; and reach &lt;code&gt;dolt push&lt;/code&gt; — or anything — straight through the shell. Same destructive operation, different surface, completely unguarded.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;scripts/check-agent-safety.sh&lt;/code&gt; is the CI gate that closes it. It asserts the mutation-verb taxonomy across &lt;strong&gt;both&lt;/strong&gt; surfaces — every agent &lt;code&gt;.md&lt;/code&gt; and the core &lt;code&gt;SKILL.md&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No &lt;code&gt;Bash(&amp;lt;cmd&amp;gt;:*)&lt;/code&gt; wildcard that can reach a history-affecting op. &lt;code&gt;bash&lt;/code&gt;/&lt;code&gt;sh&lt;/code&gt; are arbitrary by definition; &lt;code&gt;dolt&lt;/code&gt;/&lt;code&gt;bd&lt;/code&gt;/&lt;code&gt;bd-sync&lt;/code&gt;/&lt;code&gt;git&lt;/code&gt; reach &lt;code&gt;push&lt;/code&gt; / &lt;code&gt;reset&lt;/code&gt; / &lt;code&gt;branch -D&lt;/code&gt; / &lt;code&gt;killall&lt;/code&gt;. Banned.&lt;/li&gt;
&lt;li&gt;No granted MCP tool outside the read/safe set — so a &lt;em&gt;future&lt;/em&gt; &lt;code&gt;…__exec&lt;/code&gt; / &lt;code&gt;…__merge&lt;/code&gt; / &lt;code&gt;…__push&lt;/code&gt; / &lt;code&gt;…__reset&lt;/code&gt; grant fails the build the moment someone adds it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The subtlety that makes it correct: &lt;strong&gt;it scans the allowlist only, never the denylist.&lt;/strong&gt; A destructive pattern in &lt;code&gt;disallowedTools&lt;/code&gt; is the mitigation, not a violation — flagging it would be backwards. The gate only cares what a config &lt;em&gt;permits&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scan ALLOWED grants only. A destructive pattern under&lt;/span&gt;
&lt;span class="c"&gt;# disallowedTools is the fix, not the finding.&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-oE&lt;/span&gt; &lt;span class="s1"&gt;'Bash\(([^):]+):\*\)'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$agent_md&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; grant&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;cmd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'%s'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$grant&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'s/Bash\(([^):]+):\*\)/\1/'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$cmd&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in
    &lt;/span&gt;bash|sh|dolt|bd|bd-sync|git&lt;span class="p"&gt;)&lt;/span&gt;
      fail &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$agent_md&lt;/span&gt;&lt;span class="s2"&gt; grants Bash(&lt;/span&gt;&lt;span class="nv"&gt;$cmd&lt;/span&gt;&lt;span class="s2"&gt;:*) — reaches a history-affecting op"&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
  &lt;span class="k"&gt;esac&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That landed by replacing &lt;code&gt;Bash(bash|dolt|bd:*)&lt;/code&gt; wildcards in 5 agents plus the &lt;code&gt;SKILL.md&lt;/code&gt; with explicit read-only subcommand allowlists. Wildcards are an unbounded grant; an enumerated subcommand list is a bounded one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Invariants become mechanisms, not comments
&lt;/h2&gt;

&lt;p&gt;A second blocker had the same shape at a smaller scale: &lt;code&gt;scripts/dolt-push-dolthub.sh&lt;/code&gt; &lt;em&gt;documented&lt;/em&gt; its safety invariants in comments and trusted them. A safety invariant written only in a comment is not enforced. So the comments became mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A failed &lt;code&gt;bd export&lt;/code&gt; used to be swallowed with &lt;code&gt;|| true&lt;/code&gt;. Now a failed flush &lt;strong&gt;aborts the push&lt;/strong&gt; — you never push on an unverified flush.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;flock&lt;/code&gt; idempotency guard makes overlapping scheduled runs a no-op, so a double-fire can't double-apply.&lt;/li&gt;
&lt;li&gt;On an ambiguous push failure, it polls the DoltHub SQL API for the real terminal state instead of blind-retrying into a possible double-push.&lt;/li&gt;
&lt;/ul&gt;
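
&lt;p&gt;The idempotency guard is the easiest of the three to show in isolation. The script in question is shell; the sketch below expresses the same idea with Python's &lt;code&gt;fcntl.flock&lt;/code&gt; (Unix only). &lt;code&gt;single_run&lt;/code&gt; and the lock path are hypothetical names, and the push itself is left out.&lt;/p&gt;

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_run(lock_path: str):
    """Yield True if this run holds the lock, False if another run already does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fail immediately instead of queueing.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # an overlapping run is in progress: caller should no-op
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


if __name__ == "__main__":
    import tempfile

    lock = os.path.join(tempfile.mkdtemp(), "push.lock")  # hypothetical location
    with single_run(lock) as have_lock:
        if have_lock:
            print("lock held: safe to flush and push")
        else:
            print("another run is active: exiting without doing anything")
```

&lt;p&gt;Because the lock is tied to the open file rather than to a PID file's contents, it is released by the kernel if the process dies, so a crashed run cannot wedge the schedule.&lt;/p&gt;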

&lt;p&gt;And the supply chain got pinned: &lt;code&gt;dolt-mcp-server@v0.3.6&lt;/code&gt; plus a Go module checksum, consistent across README / SKILL / client. No &lt;code&gt;@latest&lt;/code&gt; in anything security-sensitive — &lt;code&gt;@latest&lt;/code&gt; means "I'll run whatever you publish next," which is not a thing you say to a binary that can rewrite a database.&lt;/p&gt;

&lt;h2&gt;
  
  
  The general lesson
&lt;/h2&gt;

&lt;p&gt;Tool-name allowlisting works when each tool is a single, fixed capability. It collapses the instant one tool carries a grammar. SQL over MCP is the case here, but a &lt;code&gt;Bash&lt;/code&gt; tool over a shell is the same hole, and so is any &lt;code&gt;eval&lt;/code&gt;-style tool that takes a string and runs it. For those, the tool name tells you nothing about what's about to happen. Only the statement does.&lt;/p&gt;

&lt;p&gt;So gate the statement. Enumerate the safe verbs, default-deny everything you can't prove safe, classify batches at max severity, and make sure your parser is honest about quotes and comments — because the one place a lazy gate breaks is the &lt;code&gt;--hard&lt;/code&gt; it mistook for a comment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Also shipped
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;governed-second-brain&lt;/strong&gt; — &lt;code&gt;/teamkb-compile&lt;/code&gt;, a nightly job that compiles the day's work into the governed team brain (auto-graduates itself, fixed a tenant-mismatch bug); a follow-up review locked down the scratch dir and made paths portable with glob/dir guards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;intent-mail&lt;/strong&gt; — migrated to React 19 + Ink 7, removed a dead &lt;code&gt;@anthropic-ai/claude-agent-sdk&lt;/code&gt; integration, batch-adopted 11 gate-passing Dependabot bumps, and made the OSV scan report-only to kill a phantom red check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude-code-plugins&lt;/strong&gt; — databricks-pack was the Killer Skill of the Week (W27); the &lt;code&gt;dolt-mcp-vcs&lt;/code&gt; rename landed as a non-breaking install-slug alias, so the old &lt;code&gt;beads-dolt&lt;/code&gt; slug still resolves.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/llm-never-does-the-math/"&gt;The LLM Should Never Do the Math&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/when-llm-output-lies-instead-of-crashing/"&gt;When LLM Output Lies Instead of Crashing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/coverage-vs-mutation-testing-rules-engine/"&gt;Coverage vs Mutation Testing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>aiagents</category>
      <category>claudecode</category>
      <category>security</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Coverage Said 69%, Mutation Testing Said 25%</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Wed, 01 Jul 2026 13:00:26 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/coverage-said-69-mutation-testing-said-25-5eol</link>
      <guid>https://dev.to/jeremy_longshore/coverage-said-69-mutation-testing-said-25-5eol</guid>
      <description>&lt;p&gt;Sunday 2026-06-28. intent-mail repo, fresh Stryker baseline run. The coverage gate reported green: 69.09% line coverage. Three seconds later, mutation testing reported 24.88%. The rules engine—the code that actually mutates user email—reported 0.00%.&lt;/p&gt;

&lt;p&gt;That zero is the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap
&lt;/h2&gt;

&lt;p&gt;Coverage counts lines. Mutation testing counts assertions. When a 535-line end-to-end suite exercises a piece of code but doesn't assert on its internal logic, coverage sees a line executed and calls it a win. Mutation testing inverts a single boolean operator in that line, runs the suite again, and if the outcome is the same, marks that mutant as &lt;em&gt;survived&lt;/em&gt;. The engine had 301 mutants and zero of them were killed—not because there were no tests, but because the tests that ran the code never asserted on the code's &lt;em&gt;decision logic&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here's the real shape of it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Line Coverage&lt;/th&gt;
&lt;th&gt;Mutation Score&lt;/th&gt;
&lt;th&gt;Killed&lt;/th&gt;
&lt;th&gt;Survived&lt;/th&gt;
&lt;th&gt;No-coverage Mutants&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;connectors/shared/retry.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;76.27%&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;storage/token-crypto.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;52.53%&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ai/daily-digest.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;34.66%&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;rules/engine.ts&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;301&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall repo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;69.09%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24.88%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;157&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;366&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The engine had no co-located unit tests. It was exercised end-to-end—so it counted toward line coverage—but its specific logic was never pinned. Flip a condition in the engine, and the outcome (was the email moved?) often stays the same. The assertion lives at the wrong level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Coverage vs. Mutation Testing: What's the Difference?
&lt;/h2&gt;

&lt;p&gt;Code coverage counts whether a line executed. Mutation testing counts whether that line's behavior is actually asserted. Run both on one repo and you can get 69% line coverage beside a 24.88% mutation score—because a line can execute in dozens of tests while none of them pin down what it should do.&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;mutant&lt;/em&gt; is a small, single-operator change: &lt;code&gt;===&lt;/code&gt; becomes &lt;code&gt;!==&lt;/code&gt;, &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; becomes &lt;code&gt;||&lt;/code&gt;, a &lt;code&gt;&amp;gt;&lt;/code&gt; becomes &lt;code&gt;&amp;gt;=&lt;/code&gt;. Stryker performs &lt;em&gt;fault injection&lt;/em&gt;. It injects the mutant, runs the test suite, and records one of three outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Killed&lt;/strong&gt;: a test failed (the mutant was caught).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survived&lt;/strong&gt;: all tests passed (the mutant hid).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-coverage&lt;/strong&gt;: no test executed that line at all.&lt;/li&gt;
&lt;/ul&gt;
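
&lt;p&gt;A toy illustration of the killed/survived distinction, in Python rather than the project's TypeScript so it runs without a test framework. The rule, the mutant, and both suites are made up for the example. The weak suite executes the line (so line coverage counts it) but never pins the boundary; the strong suite does.&lt;/p&gt;

```python
def should_archive(age_days: int, threshold_days: int = 30) -> bool:
    """Original rule: archive mail strictly older than the threshold."""
    return age_days > threshold_days


def should_archive_mutant(age_days: int, threshold_days: int = 30) -> bool:
    """The kind of boundary mutant a tool would inject: '>' became '>='."""
    return age_days >= threshold_days


def weak_suite(rule) -> bool:
    """Runs the rule but only checks that it returns a bool."""
    try:
        assert isinstance(rule(45), bool)
        return True
    except AssertionError:
        return False


def strong_suite(rule) -> bool:
    """Pins the decision on both sides of the boundary."""
    try:
        assert rule(31) is True
        assert rule(30) is False
        return True
    except AssertionError:
        return False


if __name__ == "__main__":
    for name, suite in (("weak", weak_suite), ("strong", strong_suite)):
        passes_original = suite(should_archive)
        passes_mutant = suite(should_archive_mutant)
        verdict = "survived" if passes_mutant else "killed"
        print(f"{name} suite: original passes={passes_original}, mutant {verdict}")
```

&lt;p&gt;Both suites give the line the same coverage. Only the second one fails when the operator changes, and that failure is what a mutation score counts.&lt;/p&gt;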

&lt;p&gt;The 301 mutants in the engine were all no-coverage. Stryker never even got to run them against a test, because there was no unit test visiting that file.&lt;/p&gt;

&lt;p&gt;Meanwhile, the repo's overall line coverage sat at 69.09%—and the engine's lines were executed via the E2E path, counting toward that number. Coverage says "this line ran." Mutation testing says "this line's behavior is pinned down by assertions"—call it assertion coverage. They are not the same metric, and a &lt;a href="https://dev.to/posts/when-green-ci-proves-nothing/"&gt;green CI run&lt;/a&gt; only tells you about the first.&lt;/p&gt;
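
&lt;p&gt;For reference, the scores in the table follow from the counts. Stryker's mutation score is detected mutants over valid mutants, where detected means killed plus timed out, and valid adds survived and no-coverage. The sketch below applies that to the table rows. One caveat: the listed counts alone give 75.86% for &lt;code&gt;retry.ts&lt;/code&gt; and 24.76% overall, and the reported 76.27% and 24.88% match if exactly one additional mutant in &lt;code&gt;retry.ts&lt;/code&gt; timed out. That is an inference from the arithmetic, not something the report excerpt states.&lt;/p&gt;

```python
def mutation_score(killed: int, survived: int, no_coverage: int, timeout: int = 0) -> float:
    """Mutation score as a percentage: detected / valid * 100."""
    detected = killed + timeout
    valid = detected + survived + no_coverage
    return 100.0 * detected / valid if valid else 0.0


if __name__ == "__main__":
    rows = {
        # file: (killed, survived, no_coverage, assumed_timeouts)
        "connectors/shared/retry.ts": (44, 9, 5, 1),  # the single timeout is an assumption
        "storage/token-crypto.ts": (52, 17, 30, 0),
        "ai/daily-digest.ts": (61, 85, 30, 0),
        "rules/engine.ts": (0, 0, 301, 0),
    }
    for path, (killed, survived, no_cov, timeouts) in rows.items():
        print(f"{path}: {mutation_score(killed, survived, no_cov, timeouts):.2f}%")
```

&lt;p&gt;The engine row shows why no-coverage mutants hurt: all 301 land in the denominator and none in the numerator, so a single untested file drags the repo-wide score down on its own.&lt;/p&gt;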

&lt;h2&gt;
  
  
  The Stryker Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"packageManager"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"testRunner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vitest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reporters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"html"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clear-text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"progress"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"coverageAnalysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"perTest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mutate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"src/rules/engine.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"src/connectors/shared/retry.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"src/storage/token-crypto.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"src/ai/daily-digest.ts"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"thresholds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"break"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;mutate&lt;/code&gt; array targets four high-value pure-logic files. The Stryker baseline run takes ~30 seconds. The &lt;code&gt;break: null&lt;/code&gt; is deliberate: report the score, but do not fail CI. Establish a baseline first. Ratchet later.&lt;/p&gt;

&lt;p&gt;This is the inverse of the "fail immediately on every finding" instinct. A fresh gate that blocks on day one gets disabled by the next engineer. Report-only first. Let the team see the numbers. Then make it enforceable—and &lt;a href="https://dev.to/posts/honor-the-gate-when-the-verdict-is-inconvenient/"&gt;honor the gate&lt;/a&gt; when its verdict is inconvenient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why E2E Wasn't Enough
&lt;/h2&gt;

&lt;p&gt;The same day, a 535-line end-to-end suite was added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rules engine E2E&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tmpDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mkdtempSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;tmpdir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;intentmail-&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tmpDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;test.db&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;testMasterKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;e&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 64-char hex = 32-byte AES-256 key&lt;/span&gt;

  &lt;span class="nf"&gt;beforeAll&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTENTMAIL_DB_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dbPath&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTENTMAIL_MASTER_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;testMasterKey&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;applies a rule and writes an audit log&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create account → upsert emails → create rule → run it → assert outcome&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;emails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getEmailsByRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ruleId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;emails&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;emails&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;archived&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;auditLog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getAuditLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ruleId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;auditLog&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;auditLog&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;move&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;em&gt;good&lt;/em&gt; and necessary. It proves the wiring: condition → action → side effect. But it asserts on the outcome (was the email moved?), not the logic (did the condition comparison return true or false for this specific case?).&lt;/p&gt;

&lt;p&gt;Here's the subtle part. The engine's 301 mutants didn't even &lt;em&gt;survive&lt;/em&gt;—they came back as &lt;strong&gt;no-coverage&lt;/strong&gt;. At baseline time, Stryker found no co-located unit test pinning &lt;code&gt;engine.ts&lt;/code&gt;, so it never ran those mutants against an assertion at all. The E2E suite executes the engine, but it asserts through the storage layer on the final state—which is exactly why the documented next step is a co-located &lt;code&gt;engine.test.ts&lt;/code&gt;, not more end-to-end tests.&lt;/p&gt;

&lt;p&gt;The other three files show the milder failure mode: a mutant that &lt;em&gt;is&lt;/em&gt; covered but still survives. When a test executes a mutated line yet only checks the outcome, the mutant lives. To see the mechanism, take a condition of the kind the engine evaluates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Original&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="c1"&gt;// Stryker mutant: flip the condition&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An E2E test that asserts only on the final database state can let this flip through. Unless a fixture exercises both the matching and the non-matching branch—full branch coverage—&lt;em&gt;and&lt;/em&gt; checks each one, inverting the condition can still leave the rows where the test expects them. The test never asserted "for this subject, the condition must return true"—so the mutant lives. That's the failure mode that left 85 mutants alive in &lt;code&gt;daily-digest.ts&lt;/code&gt;, where tests did run. The engine's case is worse: no unit test ran at all.&lt;/p&gt;

&lt;p&gt;Line coverage: ✓ (the line executed via E2E).&lt;br&gt;&lt;br&gt;
Mutation score: 0.00% (the logic was never asserted at the unit level).&lt;/p&gt;
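
&lt;p&gt;A minimal sketch of the kind of unit-level assertion that kills the negated condition shown above. The types, the &lt;code&gt;matchesCondition&lt;/code&gt; helper, and the fixtures are hypothetical stand-ins, not code from the real &lt;code&gt;engine.ts&lt;/code&gt;; the point is only that both branches are pinned at the decision point.&lt;/p&gt;

```typescript
// Hypothetical shapes and helper names, for illustration only.
type Email = { subject: string };
type Rule = { condition: { value: string } };
type Matcher = (email: Email, rule: Rule) => boolean;

// The decision point, pulled out so a test can assert on it directly.
const matchesCondition: Matcher = (email, rule) =>
  email.subject.includes(rule.condition.value);

// Stand-in for the Stryker mutant: the same condition, negated.
const negatedMutant: Matcher = (email, rule) =>
  !email.subject.includes(rule.condition.value);

const rule: Rule = { condition: { value: "invoice" } };
const matching: Email = { subject: "Your invoice for June" };
const nonMatching: Email = { subject: "Lunch on Friday?" };

// Pins both branches: the matching subject must return true and the
// non-matching subject must return false.
function passesBranchAssertions(matches: Matcher): boolean {
  const checks = [
    matches(matching, rule) === true,
    matches(nonMatching, rule) === false,
  ];
  return checks.every((ok) => ok);
}

console.log(passesBranchAssertions(matchesCondition)); // true: the original passes
console.log(passesBranchAssertions(negatedMutant)); // false: the mutant is killed
```

&lt;p&gt;In a real suite these two checks would be ordinary &lt;code&gt;expect&lt;/code&gt; calls in a co-located test file; they are written as a plain function here so the sketch runs without a test runner.&lt;/p&gt;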

&lt;h2&gt;
  
  
  Why Not Just Block On It?
&lt;/h2&gt;

&lt;p&gt;Because a gate that blocks on day one and gets disabled by day three is worse than no gate at all. The 24.88% score is real and uncomfortable. Flip &lt;code&gt;break&lt;/code&gt; from &lt;code&gt;null&lt;/code&gt; to &lt;code&gt;60&lt;/code&gt;, and the CI job fails immediately. The next engineer to touch this codebase sees a broken CI, disables the gate, deletes the stryker script, and ships. The baseline evaporates.&lt;/p&gt;

&lt;p&gt;Instead: report-only for one development cycle. Let the team see the number. Add co-located unit tests to &lt;code&gt;src/rules/engine.test.ts&lt;/code&gt; until the engine clears &lt;code&gt;low: 60&lt;/code&gt;. Then set &lt;code&gt;break&lt;/code&gt; to &lt;code&gt;60&lt;/code&gt; and make it a ratchet: the mutation floor can only move up.&lt;/p&gt;
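
&lt;p&gt;One way to read "the floor can only move up", sketched as a small helper. &lt;code&gt;nextBreak&lt;/code&gt; is a hypothetical function, not part of Stryker, and raising the floor to the rounded-down score is an assumption about how aggressively to ratchet; the value it returns would still be written into the &lt;code&gt;break&lt;/code&gt; threshold by hand or by a script.&lt;/p&gt;

```typescript
// Hypothetical helper: proposes the next `break` value from the latest
// mutation score. Not a Stryker API.
function nextBreak(
  currentBreak: number | null,
  score: number,
  low: number
): number | null {
  // Stay report-only until the score clears the `low` threshold.
  if (currentBreak === null) {
    return score >= low ? low : null;
  }
  // Once enforced, the floor follows the score up and never comes back down.
  return Math.max(currentBreak, Math.floor(score));
}

console.log(nextBreak(null, 24.88, 60)); // null: still report-only
console.log(nextBreak(null, 61.2, 60)); // 60: the gate becomes enforceable
console.log(nextBreak(60, 72.9, 60)); // 72: the floor ratchets up
console.log(nextBreak(72, 65.0, 60)); // 72: a lower score never lowers the floor
```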

&lt;p&gt;This is the same discipline as the L5 security scan wired the same day—gitleaks + OSV, both report-only, both feeding into &lt;code&gt;tests/TESTING.md&lt;/code&gt; as the single source of truth for "which gates are enforced now, which are baseline, which are deferred?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Also Shipped
&lt;/h2&gt;

&lt;p&gt;The same day closed two beads: &lt;strong&gt;GCP Deployment&lt;/strong&gt; (won't-do, 592 lines deleted—&lt;code&gt;deploy.yml&lt;/code&gt;, &lt;code&gt;drift.yml&lt;/code&gt;, &lt;code&gt;infra/&lt;/code&gt;), and &lt;strong&gt;Test Coverage&lt;/strong&gt; (paid down lint debt, promoted &lt;code&gt;no-case-declarations&lt;/code&gt; to error, adopted dotenv 17 + commander 15, 9 GitHub Actions bumps, closed children). The security scan job is non-blocking (&lt;code&gt;continue-on-error: true&lt;/code&gt;) because 5 &lt;code&gt;fix=NONE&lt;/code&gt; advisories live in duckdb's native build chain—accepted and documented until upstreams ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Transferable Rule
&lt;/h2&gt;

&lt;p&gt;Coverage measures attendance. Mutation testing measures whether anyone was paying attention. A green coverage gate tells you code was executed. A green mutation gate tells you code's behavior was asserted. They are not the same gate. If your mutation score is much lower than your line coverage, your tests are outcome-level and your logic is untouched. Add assertions closer to the decision points. Start with report-only. Then ratchet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/when-green-ci-proves-nothing/"&gt;Green CI Proves Nothing: Why Your Tests Gate Zero Calls&lt;/a&gt; — a passing test suite that asserts on nothing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/honor-the-gate-when-the-verdict-is-inconvenient/"&gt;Honor the Gate When the Verdict Is Inconvenient&lt;/a&gt; — the discipline of trusting the gate's verdict.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/when-llm-output-lies-instead-of-crashing/"&gt;When LLM Output Lies Instead of Crashing&lt;/a&gt; — the same intent-mail codebase: "it ran without erroring" is not "it's correct."&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>testing</category>
      <category>typescript</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Rent the Agent, Own the Proof</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Wed, 01 Jul 2026 03:08:56 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/rent-the-agent-own-the-proof-2anb</link>
      <guid>https://dev.to/jeremy_longshore/rent-the-agent-own-the-proof-2anb</guid>
      <description>&lt;p&gt;On June 23, 2026, Anthropic &lt;a href="https://www.anthropic.com/news/introducing-claude-tag" rel="noopener noreferrer"&gt;shipped Claude Tag&lt;/a&gt;: tag &lt;code&gt;@Claude&lt;/code&gt; in a Slack channel and it works as an agentic teammate — running tool calls, following threads, building memory, acting under your org's identity. The legacy "Claude in Slack" chat app &lt;a href="https://www.techtimes.com/articles/319206/20260627/claude-tag-brings-ambient-ai-slack-admins-have-until-august-3-migrate.htm" rel="noopener noreferrer"&gt;retires August 3, 2026&lt;/a&gt;; admins get 30 days to migrate. It is, by a distance, the best agentic teammate you can &lt;em&gt;rent&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I want to be honest about that up front, because the interesting question isn't whether Claude Tag is good. It's good. The interesting question is the one it doesn't answer: &lt;strong&gt;when an agent has admin-scoped tools in your workspace, who owns the memory it builds, and who can verify the log of what it did?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Give Claude Tag its due
&lt;/h2&gt;

&lt;p&gt;A hit piece would be easy and wrong. Claude Tag's engineering is strong, and the honest version of this argument starts by saying so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero infrastructure.&lt;/strong&gt; No servers, no deploy, no on-call. Tag the bot, it works. Nothing self-hosted competes on setup cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiplayer shared context.&lt;/strong&gt; "One Claude that interacts with everyone" in a channel — anyone can see what it's working on and pick up where the last person left off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compounding memory.&lt;/strong&gt; It "learns over time… builds more context about the work," and can learn across channels with permission.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Central governance.&lt;/strong&gt; Admins specify which tools and data the model can touch, per channel. Usage is billed to the org. Admins can set &lt;strong&gt;token-spend caps&lt;/strong&gt; per org and per channel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A genuinely good security model.&lt;/strong&gt; Work runs in an &lt;strong&gt;isolated sandbox on Anthropic's infrastructure&lt;/strong&gt;. When it needs an external system, requests cross an &lt;strong&gt;Agent Proxy&lt;/strong&gt; where — per launch coverage — credentials "stay in a store and are injected at the network boundary without the model receiving raw keys," with a &lt;strong&gt;default policy of deny&lt;/strong&gt; for un-allowlisted hosts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point deserves emphasis, because it's the part people underrate: keeping raw credentials out of the model and defaulting egress to deny is &lt;em&gt;exactly&lt;/em&gt; the posture a security-minded self-hosted stack would build. Anthropic did the hard, correct thing. If you want an agent in Slack tomorrow with no infrastructure, Claude Tag is an excellent product and you should use it.&lt;/p&gt;

&lt;p&gt;So this is not "their thing is insecure." It's a different question entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trade you're actually making
&lt;/h2&gt;

&lt;p&gt;Every one of those benefits shares a root: &lt;strong&gt;the substrate is Anthropic's.&lt;/strong&gt; The sandbox is on their infra. And so is the memory — as &lt;a href="https://alphasignalai.substack.com/p/the-real-claude-tag-question-is-context" rel="noopener noreferrer"&gt;AlphaSignal argues&lt;/a&gt;, the real Claude Tag question is context ownership: the organizational knowledge Claude builds becomes the vendor's proprietary state, not an exportable dataset your next vendor can ingest. The principle it points to is simple — rent the agent, but own the memory — and Claude Tag inverts it: you rent the agent &lt;em&gt;and&lt;/em&gt; the memory lives with the vendor. Critics have named the deeper trap the same way: &lt;a href="https://ksingh7.medium.com/everyones-worried-about-model-lock-in-but-the-real-trap-is-context-lock-in-af04c16167b0" rel="noopener noreferrer"&gt;not model lock-in but &lt;em&gt;context&lt;/em&gt; lock-in&lt;/a&gt;. And they've been fair about it — the concern isn't that Anthropic is doing anything wrong, it's that the incentives are obvious.&lt;/p&gt;

&lt;p&gt;Two more things live on that substrate, and they matter more than memory portability:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Governance is coarse.&lt;/strong&gt; Admin control means &lt;em&gt;which channels&lt;/em&gt;, &lt;em&gt;which tools/data&lt;/em&gt;, and &lt;em&gt;spend caps&lt;/em&gt;. That is scope plus budget. It is not a documented, mandatory human approval &lt;strong&gt;before each consequential tool call executes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The audit log is the vendor's.&lt;/strong&gt; Admins "can view a log of everything that @Claude has done." You can &lt;em&gt;view&lt;/em&gt; Anthropic's record. You cannot &lt;em&gt;verify&lt;/em&gt; it, independently, with your own key, with no vendor in the trust path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a lot of teams, none of that is a dealbreaker — and they should use Claude Tag. But for a regulated shop, a security vendor, or anyone whose answer to "prove what your agent did" has to survive an adversary who controls the log's storage, coarse governance and a vendor-shown record aren't enough. That's the gap two open-source projects were built to fill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why filters were never the boundary
&lt;/h2&gt;

&lt;p&gt;Before the alternative, the premise: for a tool-using agent reading untrusted text — every Slack message is untrusted input — &lt;strong&gt;content filtering is not a security boundary.&lt;/strong&gt; This is true of &lt;em&gt;any&lt;/em&gt; Slack-connected agent — CCSC and AGP included, not Claude Tag in particular — which is why the answer isn't a better filter, it's a different architecture. The literature is blunt about this. &lt;a href="https://arxiv.org/abs/2403.02691" rel="noopener noreferrer"&gt;InjecAgent&lt;/a&gt; (Findings of ACL 2024) found ReAct-prompted GPT-4 vulnerable to indirect prompt injection 24% of the time, with private-data exfiltration a primary attack class. Worse, &lt;a href="https://arxiv.org/abs/2503.00061" rel="noopener noreferrer"&gt;Adaptive Attacks Break Defenses Against Indirect Prompt Injection&lt;/a&gt; (NAACL 2025) bypassed &lt;strong&gt;all eight&lt;/strong&gt; evaluated defenses, keeping attack success above 50%. And because tools reached over MCP are "first-class, composable objects with natural-language metadata," &lt;a href="https://arxiv.org/abs/2510.15994" rel="noopener noreferrer"&gt;MCP Security Bench&lt;/a&gt; shows the standard &lt;em&gt;enlarges&lt;/em&gt; the attack surface — name-collision, tool-description injection, out-of-scope parameters.&lt;/p&gt;

&lt;p&gt;The takeaway isn't "add a better filter." It's that a probabilistic model cannot be the thing that decides whether a destructive action runs. You need &lt;strong&gt;a deterministic gate the model can't talk its way past, a human in the loop for the consequential calls, and a record of what actually ran that you can trust without trusting the runtime.&lt;/strong&gt; That's three properties — and they're the three pillars of what I'll call customer-owned, verifiable governance.&lt;/p&gt;

&lt;p&gt;Two reference implementations: &lt;strong&gt;&lt;a href="https://github.com/jeremylongshore/claude-code-slack-channel" rel="noopener noreferrer"&gt;CCSC&lt;/a&gt;&lt;/strong&gt; (Claude Code Slack Channel), a small auditable governance kernel that puts Claude Code in Slack behind exactly these gates; and &lt;strong&gt;&lt;a href="https://github.com/jeremylongshore/agent-governance-plane" rel="noopener noreferrer"&gt;AGP&lt;/a&gt;&lt;/strong&gt; (Agent Governance Plane), which reimplements-and-hardens that kernel with sandboxed execution across multiple agent harnesses. AGP doesn't vendor CCSC or depend on it — it's an independent reimplementation ("adapt-and-harden") pinned to CCSC's v0.10.0 design. Both are open source; you run them on your own infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 1 — Deterministic, fine-grained policy (not a probabilistic scope)
&lt;/h3&gt;

&lt;p&gt;The oversight literature lays out a &lt;a href="https://arxiv.org/abs/2507.14034" rel="noopener noreferrer"&gt;spectrum&lt;/a&gt;, from human-out-of-the-loop through human-on-the-loop. Claude Tag's channel scope sits low on it: a coarse boundary set once, per channel. Fine-grained governance sits higher, and it looks like capability-based security — &lt;a href="https://en.wikipedia.org/wiki/Object-capability_model" rel="noopener noreferrer"&gt;Miller's object-capability model&lt;/a&gt; and the principle of least authority, where authority is granted per &lt;em&gt;operation&lt;/em&gt;, not per &lt;em&gt;door&lt;/em&gt;. Established capability-security work is explicit that coarse role/scope access &lt;a href="https://arxiv.org/abs/1909.12279" rel="noopener noreferrer"&gt;over-privileges by construction&lt;/a&gt; and that fine-grained capabilities are how you get least authority at scale.&lt;/p&gt;

&lt;p&gt;In CCSC, that's a pure decision procedure: &lt;code&gt;evaluate(call, rules, now)&lt;/code&gt; maps a tool call to &lt;code&gt;allow | deny | require&lt;/code&gt; with no side effects — same inputs, same verdict, every time. Tiered rules resolve strictest-wins, then first-applicable. Out of the box, one tool — &lt;code&gt;upload_file&lt;/code&gt; — defaults to fail-closed (denied unless a rule allows it); every other tool follows the policy &lt;em&gt;you&lt;/em&gt; author, so making the actions that matter in your workspace fail-closed is a rule you write, not a default you inherit. AGP takes the harder line and bakes it in: its engine is &lt;strong&gt;default-deny with deny &amp;gt; require &amp;gt; allow&lt;/strong&gt;, so a call that matches no rule fails closed to deny, not allow. That's precisely what "AGP hardens the kernel" means. Neither ships an &lt;code&gt;if (model === "claude")&lt;/code&gt; branch anywhere near the decision; the gate is data, evaluated before execution.&lt;/p&gt;

&lt;p&gt;The difference from a spend cap is the whole point. A token budget limits &lt;em&gt;how much&lt;/em&gt; the agent can do. A policy engine decides &lt;em&gt;whether this specific action&lt;/em&gt; is allowed — and can force a human into the loop for the ones that matter.&lt;/p&gt;
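
&lt;p&gt;The decision procedure described above can be sketched as a pure function. This follows the default-deny, strictest-wins ordering the text attributes to AGP; the rule schema, the field names, and the use of &lt;code&gt;now&lt;/code&gt; for rule expiry are all assumptions made for illustration, since neither project's real rule format is shown here.&lt;/p&gt;

```typescript
// Hypothetical rule schema; the real CCSC and AGP formats are not shown here.
type Verdict = "allow" | "require" | "deny";
type ToolCall = { tool: string; channel: string };
type PolicyRule = {
  tool: string;
  channel?: string;
  expiresAt?: number;
  verdict: Verdict;
};

// Strictness order from the text: deny beats require beats allow.
const STRICTNESS: { [V in Verdict]: number } = { allow: 0, require: 1, deny: 2 };

// A rule applies when the tool matches, the channel matches (or is
// unspecified), and the rule has not expired at the supplied time.
function applies(rule: PolicyRule, call: ToolCall, now: number): boolean {
  if (rule.tool !== call.tool) return false;
  if (rule.channel !== undefined) {
    if (rule.channel !== call.channel) return false;
  }
  return rule.expiresAt === undefined || rule.expiresAt > now;
}

// Pure: no clock reads, no I/O. Same inputs, same verdict.
function evaluate(call: ToolCall, rules: PolicyRule[], now: number): Verdict {
  const applicable = rules.filter((rule) => applies(rule, call, now));
  // No applicable rule means no decision, and no decision fails closed.
  if (applicable.length === 0) return "deny";
  let verdict: Verdict = "allow";
  for (const rule of applicable) {
    if (STRICTNESS[rule.verdict] > STRICTNESS[verdict]) verdict = rule.verdict;
  }
  return verdict;
}

const rules: PolicyRule[] = [
  { tool: "post_message", verdict: "allow" },
  { tool: "post_message", channel: "C-customers", verdict: "require" },
];

console.log(evaluate({ tool: "post_message", channel: "C-eng" }, rules, 0)); // allow
console.log(evaluate({ tool: "post_message", channel: "C-customers" }, rules, 0)); // require
console.log(evaluate({ tool: "upload_file", channel: "C-eng" }, rules, 0)); // deny
```

&lt;p&gt;Passing &lt;code&gt;now&lt;/code&gt; in as an argument, rather than reading the clock inside the function, is what keeps the verdict reproducible when a decision is replayed later.&lt;/p&gt;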

&lt;h3&gt;
  
  
  Pillar 2 — Human-in-command approval, per tool call
&lt;/h3&gt;

&lt;p&gt;On the oversight spectrum, the mode that matters for consequential actions is &lt;strong&gt;human-in-command&lt;/strong&gt;: the action does not execute until a human approves it. Not "review the log afterward." Not "set a budget." A blocking approval on &lt;em&gt;this&lt;/em&gt; call, &lt;em&gt;before&lt;/em&gt; it runs.&lt;/p&gt;

&lt;p&gt;CCSC's &lt;code&gt;require&lt;/code&gt; verdict routes to a human approver: approval is a reply-code the gate recognizes, with quorum counted by distinct user IDs — and because permission-reply messages from peer bots are dropped at the inbound gate, an injected message can't cast an approving vote. A separate, stronger handshake guards the admin command path (&lt;code&gt;!restart&lt;/code&gt;): a single-use 64-bit nonce DM'd out of band, TTL-bounded, checked against wrong-channel replay — built specifically to defeat a same-channel-approval attack, an &lt;a href="https://arxiv.org/abs/2509.10540" rel="noopener noreferrer"&gt;EchoLeak&lt;/a&gt;-class injection where the request &lt;em&gt;and&lt;/em&gt; the approval both originate in a compromised channel. AGP renders the per-call approval as Slack Block-Kit Approve/Deny buttons whose nonce is bound to &lt;code&gt;(messageId, sessionId)&lt;/code&gt; — and crucially, &lt;strong&gt;a bot can never approve, and a bot's click doesn't even burn the nonce.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the pillar Claude Tag most visibly doesn't have. Admin scope decides what tools are reachable; there's no public evidence of a mandatory per-action human approval before each consequential call. For a broad class of "summarize the thread" work you don't want one. For "open this PR," "message this customer," "run this migration," a lot of orgs very much do.&lt;/p&gt;
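
&lt;p&gt;A rough sketch of the approval properties described above: a single-use, TTL-bounded nonce bound to a message and session, where a bot click neither approves nor consumes the nonce. It blends details the text attributes to CCSC (64-bit nonce, TTL) and AGP (binding to &lt;code&gt;(messageId, sessionId)&lt;/code&gt;, bot clicks ignored) into one toy; every type and function name is hypothetical, and the in-memory store stands in for whatever either project really uses.&lt;/p&gt;

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical shapes, for illustration only.
type Pending = { messageId: string; sessionId: string; expiresAt: number; used: boolean };
type Click = { nonce: string; messageId: string; sessionId: string; isBot: boolean; now: number };

// In-memory store keyed by nonce; created without a prototype so lookups
// only ever find nonces that were actually issued.
const pending: { [nonce: string]: Pending } = Object.create(null);

function issueNonce(messageId: string, sessionId: string, now: number, ttlMs: number): string {
  const nonce = randomBytes(8).toString("hex"); // 8 random bytes = 64 bits
  pending[nonce] = { messageId, sessionId, expiresAt: now + ttlMs, used: false };
  return nonce;
}

function approve(click: Click): boolean {
  // A bot can never approve, and its click must not consume the nonce.
  if (click.isBot) return false;
  const entry = pending[click.nonce];
  if (entry === undefined) return false;
  if (entry.used || click.now > entry.expiresAt) return false;
  // The nonce only counts for the message and session it was issued for.
  if (entry.messageId !== click.messageId || entry.sessionId !== click.sessionId) return false;
  entry.used = true;
  return true;
}

const nonce = issueNonce("msg-1", "sess-1", 0, 60000);
const human: Click = { nonce, messageId: "msg-1", sessionId: "sess-1", isBot: false, now: 1000 };

console.log(approve({ ...human, isBot: true })); // false: the bot click is ignored
console.log(approve(human)); // true: the bot did not burn the nonce
console.log(approve(human)); // false: single use
```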

&lt;h3&gt;
  
  
  Pillar 3 — A signed audit you can verify without trusting the runtime
&lt;/h3&gt;

&lt;p&gt;This is the sharpest line between renting and owning, so be precise about what "verifiable" means. Claude Tag gives admins a log they can &lt;em&gt;view&lt;/em&gt;. CCSC and AGP produce a log you can &lt;em&gt;verify&lt;/em&gt; — a different verb.&lt;/p&gt;

&lt;p&gt;CCSC's journal is a hash chain: each entry commits &lt;code&gt;sha256(prevHash ‖ canonicalJson)&lt;/code&gt;, so any reordering or edit breaks every downstream link. Each entry is then &lt;strong&gt;Ed25519-signed over RFC 8785 JSON Canonicalization&lt;/strong&gt;, and carries a &lt;code&gt;policy_attestation&lt;/code&gt; digest recording exactly which rules were in force when the decision was made. You verify the whole thing offline — &lt;code&gt;bun server.ts --verify-audit-log &amp;lt;path&amp;gt;&lt;/code&gt; — with &lt;strong&gt;only the public key&lt;/strong&gt;. No vendor. No live service. Key rotation is supported, so the verifier still works across a rotation. This is the &lt;a href="https://datatracker.ietf.org/doc/html/rfc6962" rel="noopener noreferrer"&gt;Certificate Transparency&lt;/a&gt; design lineage — append-only logs a third party can check without trusting the operator — the same lineage as verifiable ledger databases like &lt;a href="https://doi.org/10.14778/3583140.3583152" rel="noopener noreferrer"&gt;GlassDB&lt;/a&gt; and recent work on &lt;a href="https://arxiv.org/abs/2509.18415" rel="noopener noreferrer"&gt;Certificate-Transparency-style provenance for agent action chains&lt;/a&gt;, where external verifiers cryptographically validate what an agent did without access to the runtime.&lt;/p&gt;

&lt;p&gt;And here's the honest, load-bearing detail most "verifiable log" claims skip: &lt;strong&gt;a bare hash chain does not stop truncation.&lt;/strong&gt; If an attacker drops the last N entries, the remaining chain still verifies clean — CCSC documents this as threat T8, out loud, in its threat model. AGP closes exactly that hole: it writes a &lt;strong&gt;signed HEAD checkpoint&lt;/strong&gt; — a &lt;code&gt;&amp;lt;journal&amp;gt;.head&lt;/code&gt; file pinning &lt;code&gt;{seq, hash}&lt;/code&gt; — so the offline verifier catches a truncated tail a chain alone cannot. &lt;code&gt;agp verify&lt;/code&gt; checks the chain, the signatures, &lt;em&gt;and&lt;/em&gt; the signed head — offline, with only the public key. That's what owning the proof buys you: not "trust our log," but "here is a record, check it yourself, and check that none of it was quietly cut off."&lt;/p&gt;
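
&lt;p&gt;The truncation gap is easy to reproduce in a few lines. The sketch below is a toy, not CCSC's or AGP's code: entries carry a pre-serialized &lt;code&gt;body&lt;/code&gt; string standing in for canonical JSON, per-entry signatures are left out for brevity, and the &lt;code&gt;Head&lt;/code&gt; shape and function names are assumptions. It uses Node's built-in &lt;code&gt;node:crypto&lt;/code&gt; for SHA-256 and Ed25519.&lt;/p&gt;

```typescript
import { createHash, generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical shapes; `body` stands in for the canonical JSON of an entry.
type Entry = { seq: number; prevHash: string; body: string; hash: string };
type Head = { seq: number; hash: string; signature: Buffer };

const GENESIS = "0".repeat(64);

// hash = sha256(prevHash || body)
function entryHash(prevHash: string, body: string): string {
  return createHash("sha256").update(prevHash).update(body).digest("hex");
}

function append(journal: Entry[], body: string): Entry[] {
  const prevHash = journal.length === 0 ? GENESIS : journal[journal.length - 1].hash;
  return [...journal, { seq: journal.length, prevHash, body, hash: entryHash(prevHash, body) }];
}

// Chain-only check: every entry must link to and commit its predecessor.
function chainIsValid(journal: Entry[]): boolean {
  let prev = GENESIS;
  for (const entry of journal) {
    if (entry.prevHash !== prev || entry.hash !== entryHash(prev, entry.body)) return false;
    prev = entry.hash;
  }
  return true;
}

function headPayload(seq: number, hash: string): Buffer {
  return Buffer.from(JSON.stringify({ seq, hash }));
}

// Signs a checkpoint pinning the last entry's sequence number and hash.
function signHead(journal: Entry[], privateKey: KeyObject): Head {
  const last = journal[journal.length - 1];
  const signature = sign(null, headPayload(last.seq, last.hash), privateKey);
  return { seq: last.seq, hash: last.hash, signature };
}

// Chain check plus signed head: the journal must end exactly where the head says.
function verifyWithHead(journal: Entry[], head: Head, publicKey: KeyObject): boolean {
  if (!chainIsValid(journal)) return false;
  if (!verify(null, headPayload(head.seq, head.hash), publicKey, head.signature)) return false;
  const last = journal[journal.length - 1];
  if (last === undefined) return false;
  return last.seq === head.seq ? last.hash === head.hash : false;
}

const { publicKey, privateKey } = generateKeyPairSync("ed25519");
let journal: Entry[] = [];
for (const body of ["allow post_message", "require run_migration", "deny upload_file"]) {
  journal = append(journal, body);
}
const head = signHead(journal, privateKey);
const truncated = journal.slice(0, 2); // attacker drops the last entry

console.log(chainIsValid(truncated)); // true: the bare chain still verifies
console.log(verifyWithHead(truncated, head, publicKey)); // false: the head catches it
console.log(verifyWithHead(journal, head, publicKey)); // true
```

&lt;p&gt;Note that verification needs only &lt;code&gt;publicKey&lt;/code&gt;, the journal, and the head; nothing in &lt;code&gt;verifyWithHead&lt;/code&gt; touches the private key or a live service.&lt;/p&gt;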

&lt;h3&gt;
  
  
  The fourth property AGP adds — a sandbox that &lt;em&gt;proves&lt;/em&gt; egress is off
&lt;/h3&gt;

&lt;p&gt;Claude Tag's sandbox is on Anthropic's infra. AGP's runs on yours, and it doesn't &lt;em&gt;assume&lt;/em&gt; isolation — it &lt;em&gt;proves&lt;/em&gt; it. Containers launch with &lt;code&gt;--network none&lt;/code&gt;, &lt;code&gt;--cap-drop ALL&lt;/code&gt;, &lt;code&gt;no-new-privileges&lt;/code&gt;, and pid/memory limits, with &lt;strong&gt;no silent fallback to the host&lt;/strong&gt; if hardened flags fail. Then a &lt;strong&gt;network preflight&lt;/strong&gt; actually runs an egress probe to a TEST-NET-1 address and &lt;strong&gt;fails closed&lt;/strong&gt; — tearing the container down — if traffic is &lt;em&gt;not&lt;/em&gt; isolated. Images are pinned — by digest, or at minimum a specific version tag, never &lt;code&gt;latest&lt;/code&gt; — and host secrets are mounted through a deny-list. Same "gate, don't impersonate" credential posture as Claude Tag's Agent Proxy: AGP holds no model credentials, and &lt;code&gt;{{secret:NAME}}&lt;/code&gt; is resolved only &lt;em&gt;after&lt;/em&gt; the gate, at the exec boundary — the journal records secret &lt;em&gt;names&lt;/em&gt;, never values.&lt;/p&gt;
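
&lt;p&gt;For concreteness, the launch flags named above map onto standard &lt;code&gt;docker run&lt;/code&gt; options. The helper below is a hypothetical sketch, not AGP's launcher: the function names and the specific pid and memory limits are assumptions, and the egress preflight is not reproduced because it needs a real container runtime. It only builds the argument list and refuses an unpinned image.&lt;/p&gt;

```typescript
// Hypothetical helpers, for illustration only.

// Pinned means: referenced by digest, or tagged with something other than `latest`.
function isPinned(image: string): boolean {
  if (image.includes("@sha256:")) return true;
  // Only look for a tag after the last slash, so a registry port is not mistaken for one.
  const lastSegment = image.split("/").pop() || "";
  const tag = lastSegment.includes(":") ? lastSegment.split(":").pop() || "" : "";
  return tag !== "" ? tag !== "latest" : false;
}

// Builds a hardened `docker run` argv; throws rather than falling back.
function hardenedRunArgs(image: string, command: string[]): string[] {
  if (!isPinned(image)) {
    throw new Error(`unpinned image: ${image}`);
  }
  return [
    "run", "--rm",
    "--network", "none", // no network namespace access at all
    "--cap-drop", "ALL", // drop every Linux capability
    "--security-opt", "no-new-privileges",
    "--pids-limit", "128", // assumed limit
    "--memory", "256m", // assumed limit
    image,
    ...command,
  ];
}

console.log(hardenedRunArgs("alpine:3.20", ["true"]).join(" "));
console.log(isPinned("alpine:latest")); // false
console.log(isPinned("localhost:5000/alpine")); // false: no tag at all
```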

&lt;h2&gt;
  
  
  The part where I don't oversell it
&lt;/h2&gt;

&lt;p&gt;The reason to trust any of this is that both projects are relentless about their own limits, and I'm not going to break that streak here.&lt;/p&gt;

&lt;p&gt;CCSC does &lt;strong&gt;not&lt;/strong&gt; protect you against a compromised host OS, a same-UID process, a supply-chain attack, or a socially-engineered operator at their own terminal — its threat model says so explicitly. Its manifest &lt;em&gt;consumer&lt;/em&gt; doesn't ship yet, so cross-bot identity is still just &lt;code&gt;bot_id&lt;/code&gt;. And T8 truncation, again, is a real gap in CCSC alone — it's &lt;em&gt;AGP&lt;/em&gt; that fixes it, which is the whole reason AGP exists.&lt;/p&gt;

&lt;p&gt;AGP's sandbox is namespace/cgroup isolation, &lt;strong&gt;not&lt;/strong&gt; a VM — a kernel exploit or container escape defeats it, and the code comments say exactly that. Its multi-harness claim rests on a deterministic reference contract and a conformance test that drives Claude Code and Codex through identical governance; &lt;strong&gt;live Codex interception is provisional and not yet CI-validated.&lt;/strong&gt; AGP's own docs rate the current test suite a B− (78/100). And its marketing-claims scanner &lt;em&gt;bans&lt;/em&gt; the words "tamper-proof," "forensic-grade," and "compliance-grade" in CI — which is why I've said &lt;strong&gt;signed and offline-verifiable&lt;/strong&gt; throughout, and not one of those. A log can be signed and still be stolen wholesale if your host is owned. Verifiability is a property of the record, not a force field around your infrastructure.&lt;/p&gt;

&lt;p&gt;That discipline is the point. A governance layer that oversells itself is worse than none, because it manufactures the exact false confidence it was supposed to remove.&lt;/p&gt;

&lt;h2&gt;
  
  
  So: rent, or own?
&lt;/h2&gt;

&lt;p&gt;Both stacks share one architectural signature — &lt;strong&gt;prove-don't-trust: a deterministic gate, a human-in-command approval, and a signed record you can check.&lt;/strong&gt; AGP takes it furthest and makes fail-closed the &lt;em&gt;default&lt;/em&gt;: default-deny, no-decision-means-deny, egress &lt;em&gt;runtime-verified&lt;/em&gt; off rather than assumed, an unresolved secret throws, an unpinned image throws. CCSC brings the deterministic policy, the human-in-command gate, and the signed record — and leaves which operations fail closed to the policy you author. Put together, it's one posture: don't trust the runtime — make it prove itself, and keep a record you can check. The single sharpest thing they do that a rented, vendor-logged agent doesn't is produce &lt;strong&gt;a publicly verifiable, offline audit&lt;/strong&gt; — and AGP's signed-HEAD checkpoint closes the exact truncation hole CCSC documents. That's the difference between a log you're shown and a proof you hold.&lt;/p&gt;

&lt;p&gt;Claude Tag is the best agent you can rent, and for most teams that's the right answer — take the zero-setup, the multiplayer memory, the strong sandbox, and go. But if your governance model needs the agent — and its memory, and its audit trail — to be &lt;strong&gt;yours&lt;/strong&gt;, on &lt;strong&gt;your&lt;/strong&gt; infrastructure, and &lt;strong&gt;verifiable with your own key&lt;/strong&gt;, then you host it. You give up the free memory and the zero-ops UX; you get a substrate no vendor can revoke, relocate, or rewrite.&lt;/p&gt;

&lt;p&gt;Both are legitimate engineering positions. The question was never which agent is smarter. It's who has to own the substrate — and who has to be able to prove what happened.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CCSC and AGP are open source: &lt;a href="https://github.com/jeremylongshore/claude-code-slack-channel" rel="noopener noreferrer"&gt;claude-code-slack-channel&lt;/a&gt; (the governance kernel) and &lt;a href="https://github.com/jeremylongshore/agent-governance-plane" rel="noopener noreferrer"&gt;agent-governance-plane&lt;/a&gt; (sandboxed, multi-harness). A companion post, &lt;a href="https://dev.to/posts/gate-the-statement-not-the-tool-name/"&gt;Gate the Statement, Not the Tool Name&lt;/a&gt;, walks through the capability-based gate at the statement level.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic (2026). &lt;em&gt;Introducing Claude Tag.&lt;/em&gt; anthropic.com/news/introducing-claude-tag&lt;/li&gt;
&lt;li&gt;Wulf, J., Meierhofer, J., &amp;amp; Hannich, F. (2025). &lt;em&gt;Architecting Human-AI Cocreation for Technical Services.&lt;/em&gt; arXiv:2507.14034 — the human-in-command / human-on-the-loop oversight spectrum.&lt;/li&gt;
&lt;li&gt;Zhan, Q., Liang, Z., Ying, Z., &amp;amp; Kang, D. (2024). &lt;em&gt;InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents.&lt;/em&gt; Findings of ACL 2024. arXiv:2403.02691.&lt;/li&gt;
&lt;li&gt;Zhan, Q., Fang, R., Panchal, H., &amp;amp; Kang, D. (2025). &lt;em&gt;Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents.&lt;/em&gt; NAACL 2025. arXiv:2503.00061.&lt;/li&gt;
&lt;li&gt;Zhang, D., et al. (2025). &lt;em&gt;MCP Security Bench (MSB).&lt;/em&gt; arXiv:2510.15994.&lt;/li&gt;
&lt;li&gt;Reddy, P. &amp;amp; Gujral, A. S. (2025). &lt;em&gt;EchoLeak: Zero-Click Indirect Prompt Injection in Microsoft 365 Copilot (CVE-2025-32711).&lt;/em&gt; arXiv:2509.10540.&lt;/li&gt;
&lt;li&gt;Zigmond, E., Chong, S., Dimoulas, C., &amp;amp; Moore, S. (2019). &lt;em&gt;Fine-Grained, Language-Based Access Control for Database-Backed Applications (ShillDB).&lt;/em&gt; arXiv:1909.12279.&lt;/li&gt;
&lt;li&gt;Yue, C., et al. (2023). &lt;em&gt;GlassDB: An Efficient Verifiable Ledger Database System Through Transparency.&lt;/em&gt; PVLDB. doi:10.14778/3583140.3583152.&lt;/li&gt;
&lt;li&gt;Malkapuram, S., Gangavarapu, S., Kavalakuntla, K. R., &amp;amp; Gangavarapu, A. (2025). &lt;em&gt;Context Lineage Assurance for Non-Human Identities in Critical Multi-Agent Systems.&lt;/em&gt; arXiv:2509.18415.&lt;/li&gt;
&lt;li&gt;Laurie, B., Langley, A., &amp;amp; Kasper, E. (2013). &lt;em&gt;Certificate Transparency.&lt;/em&gt; RFC 6962, IETF.&lt;/li&gt;
&lt;li&gt;AlphaSignal (2026). &lt;em&gt;The Real Claude Tag Question Is Context Ownership.&lt;/em&gt; alphasignalai.substack.com&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>mcp</category>
      <category>llm</category>
    </item>
    <item>
      <title>The LLM Should Never Do the Math</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Mon, 29 Jun 2026 13:00:20 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/the-llm-should-never-do-the-math-31gf</link>
      <guid>https://dev.to/jeremy_longshore/the-llm-should-never-do-the-math-31gf</guid>
      <description>&lt;p&gt;A CFO will not act on a number an LLM eyeballed. They will not act on a number the model "estimated" by reasoning over a usage dump. And they should not — because the moment a language model emits a dollar figure it computed itself, that figure is a guess wearing the costume of a fact.&lt;/p&gt;

&lt;p&gt;This is the design constraint behind &lt;code&gt;databricks-cost-leak-hunter&lt;/code&gt;, the pilot skill of the databricks-pack v2 rebuild shipped in the claude-code-plugins marketplace (&lt;a href="https://github.com/jeremylongshore/claude-code-plugins" rel="noopener noreferrer"&gt;PR #906&lt;/a&gt;). Given a live, authenticated Databricks workspace, it surfaces real cost leaks across four named categories, ranks them by monthly dollar impact, and emits a report a finance reader can act on. The marketplace validator graded it B (88/100, zero errors). The SKILL.md is 329 lines. The single most important thing in it is a rule the model is structurally prevented from breaking: &lt;strong&gt;the LLM never does the dollar arithmetic.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just let the agent read the bill and summarize it?
&lt;/h2&gt;

&lt;p&gt;Because that is exactly how you ship a confidently wrong cost report.&lt;/p&gt;

&lt;p&gt;Hand a model a few thousand rows of &lt;code&gt;system.billing.usage&lt;/code&gt; and ask it for the top cost leaks, and it will give you a fluent answer. It will add DBUs. It will multiply by a price it half-remembers. It will round. Every one of those steps is a place the model can be plausibly, invisibly wrong — and the output reads identically whether the math is right or hallucinated. The failure mode of an LLM doing FinOps is not a crash. It is a clean, well-formatted, wrong number.&lt;/p&gt;

&lt;p&gt;The fix is architectural, not prompt-engineering. The model is allowed to decide &lt;em&gt;what to look for&lt;/em&gt; and &lt;em&gt;how to explain it&lt;/em&gt;. It is never allowed to be the calculator.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dollar primitive: confirmed, never estimated
&lt;/h2&gt;

&lt;p&gt;Every confirmed figure comes from the customer's own billing tables — &lt;code&gt;system.billing.usage&lt;/code&gt; joined to &lt;code&gt;system.billing.list_prices&lt;/code&gt;. Not a model estimate. Not a public price list. The number Databricks actually billed.&lt;/p&gt;

&lt;p&gt;That join is defined once, as a &lt;code&gt;priced&lt;/code&gt; CTE, and reused by every category query. Usage is multiplied by &lt;code&gt;list_prices.pricing.default&lt;/code&gt;, matched on both &lt;code&gt;sku_name&lt;/code&gt; and &lt;code&gt;usage_unit&lt;/code&gt;, inside the price-effective window, filtered to &lt;code&gt;currency_code = 'USD'&lt;/code&gt;. Define the dollar primitive once; every downstream query inherits it. There is exactly one place where a DBU becomes a dollar, and it is a SQL join against the source of truth.&lt;/p&gt;
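&lt;p&gt;The matching rules can be mirrored on in-memory rows to make them concrete. This is a Python illustration of the join conditions, not the skill's SQL: the field names &lt;code&gt;usage_date&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, &lt;code&gt;price_start&lt;/code&gt;, &lt;code&gt;price_end&lt;/code&gt; and &lt;code&gt;default_price&lt;/code&gt;, the SKU name, and the choice to raise when a usage row does not match exactly one price row are all assumptions made for the sketch.&lt;/p&gt;

```python
from datetime import date
from decimal import Decimal


def in_window(day, start, end):
    """Half-open window: start inclusive, end exclusive, open-ended when end is None."""
    upper = end if end is not None else date.max
    return day.toordinal() in range(start.toordinal(), upper.toordinal())


def price_usage(usage_rows, price_rows):
    """Attach a usd amount to each usage row that has exactly one matching USD price."""
    priced = []
    for u in usage_rows:
        matches = [
            p for p in price_rows
            if p["sku_name"] == u["sku_name"]
            and p["usage_unit"] == u["usage_unit"]
            and p["currency_code"] == "USD"
            and in_window(u["usage_date"], p["price_start"], p["price_end"])
        ]
        if len(matches) != 1:
            # Sketch-only policy: refuse to guess when the price is ambiguous.
            raise ValueError("expected exactly one price row for " + u["sku_name"])
        priced.append({**u, "usd": u["quantity"] * matches[0]["default_price"]})
    return priced


SKU = "EXAMPLE_ALL_PURPOSE_SKU"  # made-up SKU name
usage = [
    {"sku_name": SKU, "usage_unit": "DBU",
     "usage_date": date(2026, 6, 10), "quantity": Decimal("100")},
]
prices = [
    {"sku_name": SKU, "usage_unit": "DBU", "currency_code": "USD",
     "price_start": date(2026, 1, 1), "price_end": date(2026, 6, 1),
     "default_price": Decimal("0.50")},
    {"sku_name": SKU, "usage_unit": "DBU", "currency_code": "USD",
     "price_start": date(2026, 6, 1), "price_end": None,
     "default_price": Decimal("0.55")},
    {"sku_name": SKU, "usage_unit": "DBU", "currency_code": "EUR",
     "price_start": date(2026, 6, 1), "price_end": None,
     "default_price": Decimal("0.51")},
]

priced = price_usage(usage, prices)
```

&lt;p&gt;With these made-up rows, only the second price row survives all four conditions, so the 100 units are priced at 0.55 and nothing else contributes.&lt;/p&gt;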

&lt;p&gt;Before any of that runs, Step 1 is a fail-fast grant check. &lt;code&gt;system.billing.usage&lt;/code&gt; sits behind a metastore-admin grant chain, so the skill probes it first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;billing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;usage&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that errors, the skill reports the exact missing &lt;code&gt;GRANT USE CATALOG / USE SCHEMA / SELECT&lt;/code&gt; chain verbatim and stops — rather than limping forward and failing mid-analysis with a half-built report. It also requires &lt;code&gt;DATABRICKS_WAREHOUSE_ID&lt;/code&gt;, a running SQL warehouse to execute against. Fail at the door, with the precise remediation, or not at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two data planes: dollars and evidence
&lt;/h2&gt;

&lt;p&gt;The skill reads from two MCPs, and the split is the whole architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dollars&lt;/strong&gt; come from the Databricks CLI Statement Execution API — &lt;code&gt;databricks api post /api/2.0/sql/statements&lt;/code&gt; — reading the &lt;code&gt;system.*&lt;/code&gt; tables. Auth is the CLI's own &lt;code&gt;DATABRICKS_HOST&lt;/code&gt; + &lt;code&gt;DATABRICKS_TOKEN&lt;/code&gt; (or &lt;code&gt;databricks auth login&lt;/code&gt;), and Unity Catalog enforces the metastore-admin grant chain on every read. This plane answers &lt;em&gt;how much&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence&lt;/strong&gt; — the live config that explains &lt;em&gt;why&lt;/em&gt; a leak exists — comes from a custom &lt;code&gt;databricks-workspace-mcp&lt;/code&gt; control-plane server. It reads the live REST API for the auto-termination setting, the node type, the autoscale floor, the pool's &lt;code&gt;min_idle&lt;/code&gt;. It has its own PAT / U2M / M2M auth and needs no system-table grants, because it never touches billing data.&lt;/p&gt;

&lt;p&gt;The line that captures the division of labor: &lt;strong&gt;the SQL produces the number; the workspace MCP turns it into a verified, single-config-change fix.&lt;/strong&gt; "$8,400/month on a cluster that never auto-terminates" is the SQL's job. "Set &lt;code&gt;autotermination_minutes = 15&lt;/code&gt;" is the MCP's. And the degradation path matters: if the workspace MCP is absent, the skill still produces every dollar figure and falls back to accepting pasted config — it never fails silently mid-flow because one plane is missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four leak categories, each labeled with what it actually is
&lt;/h2&gt;

&lt;p&gt;The leaks are not a flat list. Each category carries an explicit confidence &lt;em&gt;kind&lt;/em&gt;, and that label travels with the dollars all the way to the report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Clusters that never auto-terminate — &lt;code&gt;confirmed&lt;/code&gt;.&lt;/strong&gt; Join priced ALL_PURPOSE usage to &lt;code&gt;system.compute.clusters&lt;/code&gt;, flag &lt;code&gt;auto_termination_minutes = 0&lt;/code&gt;, rank by 30-day idle spend. This is money actually billed for compute that sat idle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'unknown'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cluster_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auto_termination_minutes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;spend_30d_usd&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;priced&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;cluster_cfg&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;billing_origin_product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ALL_PURPOSE'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auto_termination_minutes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auto_termination_minutes&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;spend_30d_usd&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Scheduled jobs on All-Purpose compute — &lt;code&gt;confirmed&lt;/code&gt;.&lt;/strong&gt; A usage row with a &lt;code&gt;job_id&lt;/code&gt; and &lt;code&gt;billing_origin_product = 'ALL_PURPOSE'&lt;/code&gt; (~$0.55/DBU) instead of &lt;code&gt;JOBS_COMPUTE&lt;/code&gt; (~$0.15/DBU). Re-price the exact same DBUs at the Jobs rate; the delta is the savings. Deterministic re-pricing, not a model's guess at "roughly 3x cheaper."&lt;/p&gt;
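&lt;p&gt;The re-pricing arithmetic is small enough to write out. A minimal sketch, using the approximate per-DBU rates quoted above as constants (the real skill reads rates from the billing price table, and the function name here is hypothetical):&lt;/p&gt;

```python
from decimal import Decimal

# Illustrative rates taken from the approximate figures quoted above;
# real rates come from the billing price table, not constants.
ALL_PURPOSE_RATE = Decimal("0.55")
JOBS_RATE = Decimal("0.15")


def repricing_savings(dbus, billed_rate=ALL_PURPOSE_RATE, target_rate=JOBS_RATE):
    """Return (billed_usd, repriced_usd, savings_usd) for the same DBU quantity."""
    billed = dbus * billed_rate
    repriced = dbus * target_rate
    return billed, repriced, billed - repriced


billed, repriced, savings = repricing_savings(Decimal("10000"))
```

&lt;p&gt;For 10,000 DBUs that is $5,500 billed, $1,500 at the Jobs rate, and a $4,000 delta. At these rates the ratio is closer to 3.7x than 3x, which is exactly the kind of rounding a deterministic re-price avoids.&lt;/p&gt;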

&lt;p&gt;&lt;strong&gt;3. Overprovisioned clusters idling below floor — &lt;code&gt;estimated&lt;/code&gt;.&lt;/strong&gt; Mean CPU from &lt;code&gt;system.compute.node_timeline&lt;/code&gt;, flag clusters under 25% utilization, &lt;code&gt;est_overprovision = spend × (1 − CPU%)&lt;/code&gt;. This is the one &lt;em&gt;arithmetically modeled&lt;/em&gt; number in the skill — the at-risk Photon figure below is flagged for review, not model-computed — and it is labeled &lt;code&gt;est_*&lt;/code&gt; everywhere it appears, so no one mistakes a model for a bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Photon premium without the speedup — &lt;code&gt;at-risk&lt;/code&gt;.&lt;/strong&gt; Photon is not a column. It is billing-visible via the SKU. Surface the ~2× premium portion as money to review against actual runtime gain — not confirmed waste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;photon_spend_30d_usd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;photon_premium_at_risk_30d_usd&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;priced&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sku_name&lt;/span&gt; &lt;span class="k"&gt;ILIKE&lt;/span&gt; &lt;span class="s1"&gt;'%PHOTON%'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;billing_origin_product&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ALL_PURPOSE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'JOBS_COMPUTE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;photon_premium_at_risk_30d_usd&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three kinds — confirmed, estimated, at-risk — and a report that never blurs them together:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Confidence kind&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Confirmed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Money actually billed, from &lt;code&gt;system.billing.usage&lt;/code&gt; joined to &lt;code&gt;system.billing.list_prices&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Estimated&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A modeled figure from utilization data (CPU %), labeled &lt;code&gt;est_*&lt;/code&gt; so no one reads it as a bill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;At-risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spend flagged for business review (the Photon premium) before it counts as recoverable waste&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The ranker that the model cannot fudge
&lt;/h2&gt;

&lt;p&gt;The pipeline is &lt;strong&gt;detect → compute → rank → report&lt;/strong&gt;. The detect and report stages are the model's. The compute and rank stages live in &lt;code&gt;scripts/rank-and-report.py&lt;/code&gt;, whose docstring opens with the thesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The LLM does NOT do the dollar arithmetic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The script ingests per-category JSON — each item &lt;code&gt;{category, root_cause, fix, waste_30d_usd, kind}&lt;/code&gt; — converts 30-day spend to monthly, ranks descending, and renders the report. And it enforces the invariant the whole skill is built on: &lt;strong&gt;never sum confirmed and unconfirmed dollars under one verb.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;UNCONFIRMED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;at-risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Split sum — confirmed and unconfirmed dollars are NEVER added under one verb.
&lt;/span&gt;&lt;span class="n"&gt;confirmed_monthly&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confirmed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;unconfirmed_monthly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;UNCONFIRMED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That split is why the headline reads honestly. For a $100K/month workspace it renders something like: &lt;em&gt;"burning &lt;strong&gt;~$19,000/month&lt;/strong&gt; (confirmed), plus up to &lt;strong&gt;~$8,000/month&lt;/strong&gt; pending review"&lt;/em&gt;, with a trailing-30-day window stamp and a ranked table whose &lt;code&gt;Confidence&lt;/code&gt; column is load-bearing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Where it's leaking&lt;/th&gt;
&lt;th&gt;$/month&lt;/th&gt;
&lt;th&gt;Confidence&lt;/th&gt;
&lt;th&gt;The fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Root-cause cells use plain business language. No raw &lt;code&gt;DBU&lt;/code&gt; in any CFO-visible text — that translation happens before the number reaches finance, not in the reader's head.&lt;/p&gt;
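&lt;p&gt;Put together, the compute and rank stages might look roughly like the sketch below. It is not the contents of &lt;code&gt;scripts/rank-and-report.py&lt;/code&gt;: the function names are mine, the findings are invented to reproduce the example headline figures, and &lt;code&gt;monthly_factor&lt;/code&gt; defaults to 1 on the assumption that the trailing 30-day window is treated as a month.&lt;/p&gt;

```python
from decimal import Decimal

UNCONFIRMED = ("estimated", "at-risk")


def rank_and_total(items, monthly_factor=Decimal("1")):
    """Rank findings by monthly dollars and return split totals by confidence kind."""
    ranked = sorted(
        ({**item, "monthly": item["waste_30d_usd"] * monthly_factor} for item in items),
        key=lambda c: c["monthly"],
        reverse=True,
    )
    confirmed = sum(
        (c["monthly"] for c in ranked if c["kind"] == "confirmed"), Decimal("0")
    )
    unconfirmed = sum(
        (c["monthly"] for c in ranked if c["kind"] in UNCONFIRMED), Decimal("0")
    )
    return ranked, confirmed, unconfirmed


def headline(confirmed, unconfirmed):
    """Render the two totals under separate verbs; they are never added together."""
    template = (
        "burning ~${:,.0f}/month (confirmed), "
        "plus up to ~${:,.0f}/month pending review"
    )
    return template.format(confirmed, unconfirmed)


# Made-up findings, for illustration only.
items = [
    {"category": "no-auto-terminate", "kind": "confirmed", "waste_30d_usd": Decimal("12000")},
    {"category": "overprovisioned", "kind": "estimated", "waste_30d_usd": Decimal("5000")},
    {"category": "jobs-on-all-purpose", "kind": "confirmed", "waste_30d_usd": Decimal("7000")},
    {"category": "photon-premium", "kind": "at-risk", "waste_30d_usd": Decimal("3000")},
]

ranked, confirmed, unconfirmed = rank_and_total(items)
summary = headline(confirmed, unconfirmed)
```

&lt;p&gt;The ranking mixes kinds freely, because order is only presentation. The totals never do: there is no code path in the sketch that adds a confirmed dollar to an unconfirmed one.&lt;/p&gt;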

&lt;h2&gt;
  
  
  Why not let the model add the numbers?
&lt;/h2&gt;

&lt;p&gt;Because a model that can add can also mis-add, and you cannot tell which run you got. Deterministic arithmetic outside the LLM is not a performance optimization. It is the difference between a report and a guess. The model's strengths — pattern-matching, explanation, ranking by business relevance — are real and used here. Its weakness — silent numerical confabulation — is engineered out by never giving it a calculator in the first place.&lt;/p&gt;

&lt;p&gt;This is the same instinct that runs through the rest of the skill-marketplace work: deterministic math outside the model, every figure labeled with its confidence, and the model confined to the parts of the job where being articulate is an asset rather than a liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The adversarial review that caught the LLM inventing a schema
&lt;/h2&gt;

&lt;p&gt;The skill was built through a 10-agent workflow: research → author → &lt;strong&gt;four adversarial review lenses&lt;/strong&gt; → revise. That review stage is not polish. It is load-bearing — and here is why.&lt;/p&gt;

&lt;p&gt;The review caught a &lt;strong&gt;hallucinated &lt;code&gt;system.compute.clusters.runtime_engine&lt;/code&gt; column.&lt;/strong&gt; The model had confidently written its first Photon-detection query against a column that does not exist. Left unreviewed, the skill would have shipped a cost report built on a fictional schema — and it would have looked completely plausible doing it. The fix is the Photon query above: detect via &lt;code&gt;sku_name ILIKE '%PHOTON%'&lt;/code&gt; on the priced billing row, because Photon is billing-visible by SKU and never a system-table column.&lt;/p&gt;

&lt;p&gt;That is the entire argument for adversarial review of LLM-authored systems in one bug. The failure mode is not a stack trace. It is &lt;em&gt;confident plausibility&lt;/em&gt; — output that looks right, reads right, and is wrong.&lt;/p&gt;

&lt;p&gt;Two more from the same pass:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;non-deterministic jobs-rate join fan-out.&lt;/strong&gt; The jobs-rate CTE could match multiple price rows per usage unit, multiplying rows and inflating savings — the kind of un-deduped join a model reaches for because the simpler version reads correctly to anyone skimming it. Fix: dedupe to one USD rate per &lt;code&gt;usage_unit&lt;/code&gt; so the join can't fan out.&lt;/li&gt;
&lt;li&gt;CFO-clarity gaps in the narrative cells.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then a follow-up Gemini code review caught three more. Two worth naming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CRITICAL:&lt;/strong&gt; &lt;code&gt;spend-baseline.sql.json&lt;/code&gt; had &lt;code&gt;${DATABRICKS_WAREHOUSE_ID}&lt;/code&gt; inside a &lt;code&gt;--json @file&lt;/code&gt; template. The Databricks CLI does &lt;em&gt;not&lt;/em&gt; expand env vars inside a JSON file, so the call would fail with an invalid warehouse id. Fix: jq-inject the id at call time — &lt;code&gt;jq --arg wh "$DATABRICKS_WAREHOUSE_ID" '. + {warehouse_id: $wh}' template.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HIGH:&lt;/strong&gt; the renderer, at one point, summed confirmed and estimated/at-risk dollars under a single total with no Confidence column — violating the exact invariant the skill exists to enforce. The model regressed against its own design rule. Review caught it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is the lesson stated cleanly: even a system explicitly designed around "never sum confirmed with modeled" will, under LLM authorship, drift back toward doing it — because summing everything into one big number is the locally fluent move. Adversarial review is the gate that catches the regression. A hallucinated column and a silently-fanned-out join both produce output that looks right; only a hostile second read finds them.&lt;/p&gt;
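&lt;p&gt;For the warehouse-id bug, the same injection the jq one-liner performs can be sketched in Python, for callers that build the request body in a script instead of a shell pipeline. The function name and the sample template are hypothetical; &lt;code&gt;statement&lt;/code&gt;, &lt;code&gt;warehouse_id&lt;/code&gt; and &lt;code&gt;wait_timeout&lt;/code&gt; are fields of the Statement Execution API request body.&lt;/p&gt;

```python
import json
import os


def build_statement_payload(template_text, env=None):
    """Parse the JSON template and inject warehouse_id from the environment."""
    env = os.environ if env is None else env
    warehouse_id = env.get("DATABRICKS_WAREHOUSE_ID", "")
    if not warehouse_id:
        # Fail at the door instead of sending a literal placeholder.
        raise RuntimeError("DATABRICKS_WAREHOUSE_ID is not set")
    payload = json.loads(template_text)
    payload["warehouse_id"] = warehouse_id  # same effect as the jq merge
    return json.dumps(payload)


template = '{"statement": "SELECT 1", "wait_timeout": "30s"}'
body = build_statement_payload(template, env={"DATABRICKS_WAREHOUSE_ID": "abc123"})
```

&lt;p&gt;Either way the point is the same: the substitution happens in code that is known to run, not inside a file the CLI reads verbatim.&lt;/p&gt;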

&lt;h2&gt;
  
  
  The same instinct, elsewhere
&lt;/h2&gt;

&lt;p&gt;Precision over blunt tooling ran through the rest of that day's work too. A dependency-vuln triage in a Node mail service refused a blind &lt;code&gt;npm audit fix --force&lt;/code&gt; — which would have major-bumped googleapis and typescript-eslint — and instead cleared 17 of 22 advisories via npm &lt;code&gt;overrides&lt;/code&gt; and targeted direct bumps, documenting the 5 accepted residual install-time-only vulns in &lt;code&gt;tests/TESTING.md&lt;/code&gt;. The destructive automatic fix is rarely the right one, in dependency graphs or in cost reports. A version-controlled document-store initiative got consolidated into a single synthesis hub, and partner FinOps research moved a step forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  The transferable rule
&lt;/h2&gt;

&lt;p&gt;If you want a number a decision-maker will act on, the model is not allowed to compute it. Join the source-of-truth tables for confirmed figures. Label every output with its confidence kind and never blur the kinds together. Run the arithmetic in deterministic code the model cannot reach. And put the whole thing through adversarial review before it ships — because the LLM's most dangerous output is the one that looks exactly right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/govern-at-merge-untrusted-union/"&gt;Govern at the Merge: The Untrusted Union&lt;/a&gt; — where governance belongs in a pipeline, and why the merge point is the gate. Same argument as the grant check and the confidence split: enforce at the boundary, not after.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/bot-loop-circuit-breaker-multi-agent-slack/"&gt;The Bot Loop Circuit Breaker for Multi-Agent Slack&lt;/a&gt; — multi-agent review and the failure modes that only show up when agents check each other's work, the way the four review lenses caught the hallucinated column.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/the-api-is-the-real-boundary/"&gt;The API Is the Real Boundary&lt;/a&gt; — the two-MCP split is exactly this: dollars on one data plane, config evidence on another, each with its own auth and its own job.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;{&lt;br&gt;
  "@context": "&lt;a href="https://schema.org" rel="noopener noreferrer"&gt;https://schema.org&lt;/a&gt;",&lt;br&gt;
  "@type": "BlogPosting",&lt;br&gt;
  "headline": "The LLM Should Never Do the Math",&lt;br&gt;
  "description": "A Claude Code skill that hunts Databricks cost leaks and reports confirmed dollars from the customer's own billing tables — never LLM estimates.",&lt;br&gt;
  "datePublished": "2026-06-26T08:00:00-05:00",&lt;br&gt;
  "dateModified": "2026-06-26T08:00:00-05:00",&lt;br&gt;
  "author": {&lt;br&gt;
    "@type": "Person",&lt;br&gt;
    "name": "Jeremy Longshore"&lt;br&gt;
  },&lt;br&gt;
  "publisher": {&lt;br&gt;
    "@type": "Organization",&lt;br&gt;
    "name": "Start AI Tools"&lt;br&gt;
  },&lt;br&gt;
  "url": "&lt;a href="https://startaitools.com/posts/llm-never-does-the-math/" rel="noopener noreferrer"&gt;https://startaitools.com/posts/llm-never-does-the-math/&lt;/a&gt;",&lt;br&gt;
  "keywords": "LLM cost report, Databricks cost leaks, deterministic math, confirmed dollars, FinOps, Claude Code skill, AI agent reliability, adversarial review, hallucination"&lt;br&gt;
}&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>claudecode</category>
      <category>architecture</category>
      <category>finops</category>
    </item>
    <item>
      <title>N Bots in One Slack Channel Loop Forever: Three Gates</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Fri, 26 Jun 2026 15:49:28 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/n-bots-in-one-slack-channel-loop-forever-three-gates-184h</link>
      <guid>https://dev.to/jeremy_longshore/n-bots-in-one-slack-channel-loop-forever-three-gates-184h</guid>
      <description>&lt;p&gt;Put two AI agents in the same Slack channel and let each one reply to anything it sees, and you have built a perpetual motion machine. A answers a human. B sees A's message and answers it. A sees B's answer and answers &lt;em&gt;that&lt;/em&gt;. Neither one is wrong. Neither one is broken. They will keep going until you kill a process, because each message is, from Slack's point of view, a legitimately distinct event that deserves a response.&lt;/p&gt;

&lt;p&gt;This post is about &lt;code&gt;claude-code-slack-channel&lt;/code&gt;, the substrate that lets humans, Claude Code sessions, and peer agents share one channel, and the three commits' worth of work it took to make that channel &lt;em&gt;safe&lt;/em&gt; when more than one bot is in it. The thesis is one sentence: safety for a multi-agent channel is three moves, not one. You gate who gets to speak (mention-to-engage), you trip a breaker when speech runs away (a channel-wide circuit breaker), and you keep the whole thing observable and bounded under load (backpressure plus structured drop reasons). Pull any one and the other two won't save you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: bots answer each other forever
&lt;/h2&gt;

&lt;p&gt;The peer-bot feedback loop is the failure that defines the domain. By default a channel ignores other bots entirely; the loop only becomes possible once an operator explicitly opts peer bots in via an &lt;code&gt;allowBotIds&lt;/code&gt; allowlist. But that opt-in is exactly what you &lt;em&gt;want&lt;/em&gt; — the whole point of a shared channel is agents collaborating — so "just don't allowlist bots" isn't a fix, it's a refusal to build the feature.&lt;/p&gt;

&lt;p&gt;The moment two bots are on the allowlist, A's reply is an inbound event for B, B's reply is an inbound event for A, and the exchange has no natural terminator. Ordinary event-dedup doesn't help: dedup catches the &lt;em&gt;same&lt;/em&gt; event arriving twice, but A's third message and B's fourth message are genuinely new events. The loop is made of distinct, valid messages. That's what makes it nasty — every individual step looks correct.&lt;/p&gt;

&lt;p&gt;The obvious fix is "only respond when you're spoken to" — require an @-mention before a bot engages. And it works, right up until a human tries to actually have a conversation. A mention-gate that demands a fresh &lt;code&gt;@bot&lt;/code&gt; on &lt;em&gt;every single message&lt;/em&gt; is too dumb to use: the human mentions the bot, gets an answer, types a follow-up, and gets silence, because the follow-up didn't re-mention. So they re-mention. Every turn. The channel becomes a place where you shout the bot's name before each sentence. Mention-gating that's correct but unusable isn't a fix; it just trades one failure for another.&lt;/p&gt;

&lt;p&gt;So the real problem has two halves that have to be solved together: stop bots from talking to bots, without making the channel miserable for the humans. Three moves do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move one: mention-to-engage, but thread-sticky
&lt;/h2&gt;

&lt;p&gt;The first move makes mention-gating usable. A &lt;code&gt;requireMention&lt;/code&gt; channel drops any message that doesn't @-mention the bot, with one &lt;em&gt;exception&lt;/em&gt;: once a human has engaged a thread by mentioning the bot, subsequent human messages in that same thread are delivered without a fresh mention. Mention once to open the thread, then converse. The engaged thread is recorded in a set keyed by the session thread (&lt;code&gt;thread_ts&lt;/code&gt;, falling back to the message's own &lt;code&gt;ts&lt;/code&gt;), so a top-level mention and its in-thread follow-ups resolve to the same slot.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Illustrative — the requireMention branch of the channel gate.&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requireMention&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;isMentioned&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;botUserId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Thread-sticky: a human who already mentioned the bot in this&lt;/span&gt;
  &lt;span class="c1"&gt;// thread keeps the floor without re-mentioning every turn.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bot_id&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engagedThreads&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;threadTs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;thread_ts&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engagedThreads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;deliveredThreadKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;threadTs&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;deliver&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;access&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;drop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;dropReason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;channel.require_mention&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three details in that branch are load-bearing. &lt;strong&gt;Peer bots are never sticky&lt;/strong&gt; — the &lt;code&gt;!ev.bot_id&lt;/code&gt; guard means an allowlisted bot has to mention the target on &lt;em&gt;every&lt;/em&gt; message to be delivered, so a bot can't ride a human's engaged thread into a loop. Stickiness is a convenience extended to humans only. &lt;strong&gt;A mention is still required to open a thread&lt;/strong&gt; — stickiness lowers the cost of staying in a conversation, never the cost of starting one. And &lt;strong&gt;the engaged set is bounded&lt;/strong&gt;: it evicts its oldest entry past a cap, so a busy channel can't grow it without limit. A convenience feature that leaks memory is a slow outage.&lt;/p&gt;
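
&lt;p&gt;A minimal sketch of that bounded set, assuming insertion order is an acceptable proxy for age. The class name &lt;code&gt;BoundedThreadSet&lt;/code&gt;, the key format, and the cap of 2 are hypothetical, not taken from the project; the only behavior relied on is that a JavaScript &lt;code&gt;Set&lt;/code&gt; iterates in insertion order.&lt;/p&gt;

```typescript
// Hypothetical sketch: a set of engaged-thread keys that evicts its oldest
// entry once it grows past a cap.
class BoundedThreadSet {
  // A JavaScript Set iterates in insertion order, so the first key is the oldest.
  private readonly keys = new Set();
  private readonly cap: number;

  constructor(cap: number) {
    this.cap = cap;
  }

  add(key: string): void {
    // Assumption: re-adding an existing key refreshes its position.
    this.keys.delete(key);
    this.keys.add(key);
    while (this.keys.size > this.cap) {
      const oldest = this.keys.values().next().value;
      this.keys.delete(oldest);
    }
  }

  has(key: string): boolean {
    return this.keys.has(key);
  }

  get size(): number {
    return this.keys.size;
  }
}

const engaged = new BoundedThreadSet(2);
engaged.add('C1:100.1');
engaged.add('C1:200.2');
engaged.add('C1:300.3'); // pushes the set past its cap, so 'C1:100.1' is evicted
```

&lt;p&gt;The eviction loop runs on every insert, so the set can never hold more than &lt;code&gt;cap&lt;/code&gt; keys no matter how busy the channel gets.&lt;/p&gt;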

&lt;p&gt;The matching gate also has to be careful about what counts as a mention. A human pasting a code snippet or quoting an earlier message that contains &lt;code&gt;&amp;lt;@bot&amp;gt;&lt;/code&gt; is displaying the token, not addressing the bot. So the matcher prefers Slack's structured &lt;code&gt;blocks&lt;/code&gt; and prunes the &lt;code&gt;rich_text_preformatted&lt;/code&gt; (code block) and &lt;code&gt;rich_text_quote&lt;/code&gt; (blockquote) subtrees before looking for the mention — a &lt;code&gt;&amp;lt;@bot&amp;gt;&lt;/code&gt; buried in a fenced code block can't falsely engage the channel. It falls back to substring matching only when structured blocks are absent.&lt;/p&gt;
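
&lt;p&gt;The structured half of that matcher might look like the sketch below. It assumes the usual shape of Slack &lt;code&gt;rich_text&lt;/code&gt; blocks, where a mention is an element of type &lt;code&gt;user&lt;/code&gt; carrying a &lt;code&gt;user_id&lt;/code&gt;. The names &lt;code&gt;RichNode&lt;/code&gt;, &lt;code&gt;PRUNED_TYPES&lt;/code&gt;, and &lt;code&gt;mentionsBot&lt;/code&gt; are hypothetical, and the substring fallback is left out.&lt;/p&gt;

```typescript
// Hypothetical sketch of the structured mention check. The node shape mirrors
// Slack rich_text elements; the function and constant names are illustrative.
interface RichNode {
  type: string;
  user_id?: string;
  elements?: RichNode[];
}

// Subtrees that display text rather than address anyone.
const PRUNED_TYPES = ['rich_text_preformatted', 'rich_text_quote'];

function mentionsBot(nodes: RichNode[], botUserId: string): boolean {
  for (const node of nodes) {
    // Skip code blocks and blockquotes entirely, including their children.
    if (PRUNED_TYPES.indexOf(node.type) !== -1) continue;
    if (node.type === 'user') {
      if (node.user_id === botUserId) return true;
    }
    if (node.elements !== undefined) {
      if (mentionsBot(node.elements, botUserId)) return true;
    }
  }
  return false;
}

const quotedOnly: RichNode[] = [{
  type: 'rich_text',
  elements: [{ type: 'rich_text_quote', elements: [{ type: 'user', user_id: 'U123' }] }],
}];
const addressed: RichNode[] = [{
  type: 'rich_text',
  elements: [{ type: 'rich_text_section', elements: [{ type: 'user', user_id: 'U123' }] }],
}];

console.log(mentionsBot(quotedOnly, 'U123')); // false
console.log(mentionsBot(addressed, 'U123'));  // true
```

&lt;p&gt;Pruning before searching, rather than searching and then filtering hits, means a quoted mention never reaches the comparison at all.&lt;/p&gt;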

&lt;p&gt;Finally, the interaction mode itself is a first-class channel choice, and new channels default to mention-to-engage. The safe posture is the default; opening the floodgates is the thing you opt into, not the thing you forget to turn off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move two: a channel-wide circuit breaker for N-cycle rings
&lt;/h2&gt;

&lt;p&gt;Mention-gating handles the well-behaved case. The breaker handles the runaway one — when something gets past the gate anyway, or when two bots are &lt;em&gt;supposed&lt;/em&gt; to talk and the conversation goes feral.&lt;/p&gt;

&lt;p&gt;The first layer is a per-&lt;code&gt;(channel, bot_id)&lt;/code&gt; sliding window: track each bot's message timestamps, and once a single bot exceeds the threshold (default 10 messages in 60 seconds) in a channel, drop that bot's messages until the window slides back under. That's tuned well above any plausible legitimate cross-bot rate — humans interleave their messages, loops don't — and it cleanly breaks the &lt;em&gt;pairwise&lt;/em&gt; A→B→A case.&lt;/p&gt;

&lt;p&gt;But the pairwise limit has a blind spot, and it's the kind you only see when you reason about it adversarially. A longer A→B→C→D→E→A &lt;em&gt;ring&lt;/em&gt; keeps every individual sender under its own per-bot cap. Five bots each posting every seven or eight seconds sit at roughly eight messages a minute apiece — comfortably under the cap of ten — so a counter that only ever watches one bot at a time reports all-clear while their combined velocity runs away. That's the threshold math, and it's worth saying out loud: a two- or three-bot ring is &lt;em&gt;caught&lt;/em&gt; by the per-bot counter, because for three bots to sum past a 40/60s aggregate at least one has to exceed its own 10/60s cap. The aggregate breaker earns its keep precisely when the ring is large enough — five or more — that every member can hide under the per-bot limit.&lt;/p&gt;
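
&lt;p&gt;The same arithmetic as a small function. This is a sketch under one stated assumption: a counter trips once the count in its window &lt;em&gt;reaches&lt;/em&gt; the limit, so a bot can send at most one fewer than its cap without being caught. The post does not say whether the boundary is inclusive, and &lt;code&gt;smallestHiddenRing&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```typescript
// Hypothetical helper: the smallest ring in which every bot can stay under its
// own cap while the ring as a whole still reaches the channel-wide limit.
// Assumes a counter trips once the count in the window reaches its limit.
function smallestHiddenRing(perBotLimit: number, channelLimit: number): number {
  if (!(perBotLimit > 1)) {
    throw new Error('perBotLimit must be greater than 1');
  }
  // The most one bot can send in a window without tripping its own cap.
  const quietMax = perBotLimit - 1;
  return Math.ceil(channelLimit / quietMax);
}

console.log(smallestHiddenRing(10, 40)); // 5
```

&lt;p&gt;With the defaults, four bots at nine messages each total 36 and never reach 40, so five is the first ring size where the aggregate counter is the only one that fires.&lt;/p&gt;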

&lt;p&gt;So a second counter sits on top: a channel-wide aggregate that sums peer-bot velocity across &lt;em&gt;all&lt;/em&gt; bots in the channel and trips when the total runs away.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Illustrative — two counters, two scopes, checked in order on the inbound gate.&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;perBotWindow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;channelId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;botId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;perBotLimit&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;drop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;dropReason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rate.cross_bot_loop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;// pairwise A→B→A&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;channelWindow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;checkChannel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;channelId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;breakerConfig&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;drop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;dropReason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rate.channel_cycle&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="c1"&gt;// N-cycle ring A→B→C→D→E→A&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The aggregate default is 4× the per-bot limit (40 messages in 60 seconds), chosen so a legitimate two- or three-bot exchange interleaved with real human work never trips it, while a runaway ring, which fires as fast as the pipeline allows, trips it within a single window: in seconds at full pipeline speed, and in just under a minute at the 1.5-second pace traced below.&lt;/p&gt;

&lt;p&gt;It helps to trace the ring against both counters to see why the second one earns its place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative — an A→B→C→D→E→A ring under a 10/60s per-bot cap and a 40/60s breaker.
# Five bots, round-robin, one message every ~1.5s.
t=00.0s  A posts   per-bot[A]=1   channel=1
t=01.5s  B posts   per-bot[B]=1   channel=2
t=03.0s  C posts   per-bot[C]=1   channel=3
t=04.5s  D posts   per-bot[D]=1   channel=4
t=06.0s  E posts   per-bot[E]=1   channel=5
t=07.5s  A posts   per-bot[A]=2   channel=6     # each bot only ~ every 7.5s — per-bot stays near 8, never 10
...
t=58.5s  ...        per-bot[*]≈8   channel=40    # every bot under its cap, but the ring trips the AGGREGATE → rate.channel_cycle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each bot paces itself to about eight messages in the window — under its own cap of ten — while the channel total climbs to 40 and the breaker fires. The per-bot counter, watching each bot in isolation, would have let this five-way ring run indefinitely. Three properties keep the aggregate honest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Humans always post freely.&lt;/strong&gt; Both counters only apply to events where &lt;code&gt;bot_id&lt;/code&gt; is set. A human in a melting-down channel is never rate-limited; the throttle is aimed exclusively at the machines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's disablable, with one coherent semantic.&lt;/strong&gt; Setting the config to &lt;code&gt;{ count: 0, windowMs: 0 }&lt;/code&gt; short-circuits to "always allow" in both the store and the gate. (That non-positive-config path was a follow-up fix — the original code had a dead branch where a zero config didn't actually disable cleanly. To deny &lt;em&gt;all&lt;/em&gt; peer-bot delivery you don't zero the limit; you set &lt;code&gt;allowBotIds: []&lt;/code&gt;. Two intentions, two switches.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's deliberately not persisted.&lt;/strong&gt; A process restart resets every counter. Same posture as the channel's nonce store: a fresh process means no stale state for a misbehaving bot to game, and the breaker re-derives from live traffic in one window.&lt;/li&gt;
&lt;/ul&gt;
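
&lt;p&gt;The disable semantic from the second bullet can be sketched as a single guard. The &lt;code&gt;BreakerConfig&lt;/code&gt; shape follows the &lt;code&gt;{ count, windowMs }&lt;/code&gt; config quoted above; the function name &lt;code&gt;allows&lt;/code&gt; and the inclusive trip boundary are assumptions.&lt;/p&gt;

```typescript
// Hypothetical sketch of the "non-positive config disables the breaker" rule.
interface BreakerConfig {
  count: number;
  windowMs: number;
}

// recentCount is the number of messages in the window, including this one.
// Returns true when the message should be allowed through.
function allows(config: BreakerConfig, recentCount: number): boolean {
  // A zero or negative count or window means "breaker off": always allow.
  if (!(config.count > 0) || !(config.windowMs > 0)) return true;
  // Assumption: the breaker trips once the count in the window reaches the limit.
  return config.count > recentCount;
}

console.log(allows({ count: 0, windowMs: 0 }, 500));     // true: disabled
console.log(allows({ count: 40, windowMs: 60000 }, 39)); // true: under the limit
console.log(allows({ count: 40, windowMs: 60000 }, 40)); // false: tripped
```

&lt;p&gt;Putting the non-positive check first is what keeps "off" from being confused with "deny everything", which stays the job of the allowlist.&lt;/p&gt;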

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Counter&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Catches&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-bot sliding window&lt;/td&gt;
&lt;td&gt;one &lt;code&gt;(channel, bot_id)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;pairwise A→B→A loops&lt;/td&gt;
&lt;td&gt;10 / 60s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Channel circuit breaker&lt;/td&gt;
&lt;td&gt;all bots in a channel&lt;/td&gt;
&lt;td&gt;N-cycle A→B→C→D→E→A rings&lt;/td&gt;
&lt;td&gt;40 / 60s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Neither counter alone is sufficient. The per-bot limit can't see a ring; the aggregate limit can't tell you &lt;em&gt;which&lt;/em&gt; bot is the offender. Layered, they cover both the pair and the ring.&lt;/p&gt;
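
&lt;p&gt;To check the layering, here is a small simulation of the five-bot ring against both counters. It is an illustration under assumptions, not the project's code: the &lt;code&gt;slide&lt;/code&gt; helper, the in-memory arrays, and the rule that a counter trips once its window count reaches the limit are all hypothetical.&lt;/p&gt;

```typescript
// Illustrative simulation, not the project's code: five bots, round-robin,
// one message every 1.5 s, checked against both counters.
const WINDOW_MS = 60000;
const PER_BOT_LIMIT = 10;
const CHANNEL_LIMIT = 40;
const BOTS = ['A', 'B', 'C', 'D', 'E'];
const STEP_MS = 1500;

// Keep only timestamps still inside the window, then append the new one.
function slide(stamps: number[], now: number): number[] {
  return stamps.filter((t) => t > now - WINDOW_MS).concat(now);
}

const perBot: { [botId: string]: number[] } = {};
let channel: number[] = [];
let maxPerBot = 0;
let trip = { atMs: -1, reason: 'none' };

const ticks = Array.from({ length: 40 }, (_, i) => i * STEP_MS);
for (const now of ticks) {
  const bot = BOTS[(now / STEP_MS) % BOTS.length];
  perBot[bot] = slide(perBot[bot] ?? [], now);
  channel = slide(channel, now);
  maxPerBot = Math.max(maxPerBot, perBot[bot].length);
  if (trip.atMs !== -1) continue; // already tripped; keep counting for the read-out
  // Assumption: a counter trips once the count in its window reaches the limit.
  if (perBot[bot].length >= PER_BOT_LIMIT) {
    trip = { atMs: now, reason: 'rate.cross_bot_loop' };
  } else if (channel.length >= CHANNEL_LIMIT) {
    trip = { atMs: now, reason: 'rate.channel_cycle' };
  }
}

console.log(maxPerBot, trip); // 8 { atMs: 58500, reason: 'rate.channel_cycle' }
```

&lt;p&gt;No bot ever holds more than eight messages in its own window, yet the channel total reaches 40 at 58.5 seconds, so only the aggregate counter fires.&lt;/p&gt;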

&lt;h2&gt;
  
  
  Move three: backpressure and observability
&lt;/h2&gt;

&lt;p&gt;The third move is what turns a channel you've &lt;em&gt;gated&lt;/em&gt; into one you can actually &lt;em&gt;operate&lt;/em&gt;. Loops aren't the only way a multi-agent channel hurts you — a burst of legitimate new conversations can exhaust resources just as effectively, and a channel where you can't see &lt;em&gt;why&lt;/em&gt; a bot went quiet is one you can't debug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A global backpressure cap.&lt;/strong&gt; The session supervisor takes a &lt;code&gt;maxConcurrentSessions&lt;/code&gt; limit; once live plus in-flight activations hit the cap, a genuinely &lt;em&gt;new&lt;/em&gt; session is refused rather than spun up. The load-shed decision is both logged to the operator and written to the audit journal — a refusal is a governance event, not a silent drop. One subtle correctness note: the cap counts &lt;em&gt;distinct&lt;/em&gt; in-progress sessions, because a session that just finished activating is briefly present in both the "live" set and the "activating" set, and naively summing the two would double-count it and reject sessions spuriously near the cap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Illustrative — refuse a NEW session at the cap; count distinct in-flight keys.&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;inFlightNew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;k&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;activating&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;live&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="nx"&gt;inFlightNew&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;live&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;inFlightNew&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;maxConcurrentSessions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;journalWrite&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;session.activate_rejected&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;deny&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;at maxConcurrentSessions cap&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A structured drop reason on every gate drop.&lt;/strong&gt; Every inbound message the gate turns away now carries a typed &lt;code&gt;dropReason&lt;/code&gt;, and that reason lands in the journal. The reasons are grouped by the gate stage that produced them — which is itself a readable map of how an inbound event is evaluated, in order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;bot-event stage&lt;/strong&gt; — &lt;code&gt;self.echo&lt;/code&gt; (our own message), &lt;code&gt;bot.not_allowlisted&lt;/code&gt;, &lt;code&gt;admin.muted&lt;/code&gt;, &lt;code&gt;rate.cross_bot_loop&lt;/code&gt;, &lt;code&gt;rate.channel_cycle&lt;/code&gt;, &lt;code&gt;bot.permission_relay&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;top-level stage&lt;/strong&gt; — &lt;code&gt;subtype.filtered&lt;/code&gt; (an edited or deleted message, intentionally ignored), &lt;code&gt;event.no_user&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DM stage&lt;/strong&gt; — &lt;code&gt;dm.policy_closed&lt;/code&gt;, &lt;code&gt;dm.pairing_cap&lt;/code&gt;, &lt;code&gt;dm.pending_full&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;channel stage&lt;/strong&gt; — &lt;code&gt;channel.not_opted&lt;/code&gt;, &lt;code&gt;channel.allowfrom_miss&lt;/code&gt;, &lt;code&gt;channel.require_mention&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The payoff is operational: when a bot stays silent, you can always answer "why didn't it respond?" by reading one field instead of guessing. Silence stops being ambiguous. An enum beats a free-text string here because the reasons are a closed set you want to count, filter, and alert on — &lt;code&gt;rate.channel_cycle&lt;/code&gt; spiking is a signal you can build a dashboard on; a string blob isn't.&lt;/p&gt;
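
&lt;p&gt;A sketch of why a closed set pays off. The reason strings are the ones listed above; modeling them as a string-literal union and counting them with a &lt;code&gt;tally&lt;/code&gt; helper are assumptions, since the post does not show how the project declares the type.&lt;/p&gt;

```typescript
// Illustrative: the drop reasons named in the post as a closed union.
type DropReason =
  | 'self.echo' | 'bot.not_allowlisted' | 'admin.muted'
  | 'rate.cross_bot_loop' | 'rate.channel_cycle' | 'bot.permission_relay'
  | 'subtype.filtered' | 'event.no_user'
  | 'dm.policy_closed' | 'dm.pairing_cap' | 'dm.pending_full'
  | 'channel.not_opted' | 'channel.allowfrom_miss' | 'channel.require_mention';

// Hypothetical helper: count drops per reason so one reason spiking is easy to see.
function tally(reasons: DropReason[]): { [reason: string]: number } {
  const counts: { [reason: string]: number } = {};
  for (const reason of reasons) {
    counts[reason] = (counts[reason] ?? 0) + 1;
  }
  return counts;
}

const dropCounts = tally(['rate.channel_cycle', 'admin.muted', 'rate.channel_cycle']);
console.log(dropCounts['rate.channel_cycle']); // 2
```

&lt;p&gt;Because the compiler rejects any string outside the union, a misspelled reason fails at build time instead of quietly creating a new bucket on the dashboard.&lt;/p&gt;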

&lt;p&gt;&lt;strong&gt;Read-only admin verbs from Slack.&lt;/strong&gt; Operators can ask the channel about itself without changing anything: &lt;code&gt;!mute-status&lt;/code&gt; lists active peer-bot mutes and their remaining TTL, &lt;code&gt;!rate-limit&lt;/code&gt; shows the effective thresholds, &lt;code&gt;!agents&lt;/code&gt; lists the peer bots seen active recently. The &lt;code&gt;!agents&lt;/code&gt; view is derived from the per-&lt;code&gt;(channel, bot)&lt;/code&gt; activity the rate limiter already tracks — the same timestamps that feed the breaker double as an "agents online" read-out, no separate bookkeeping. These are pure observation verbs — no argument, no state change — sitting alongside the destructive &lt;code&gt;!clear&lt;/code&gt;/&lt;code&gt;!restart&lt;/code&gt; commands but carrying none of their risk. You shouldn't have to SSH into a box to find out which bots are live in a channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A targeted mute, between the allowlist and the breaker.&lt;/strong&gt; The allowlist is all-or-nothing per bot; the breaker is all-or-nothing per channel. Between them sits &lt;code&gt;!mute @bot&lt;/code&gt;, which silences one specific peer bot in one channel for a TTL — its messages drop with &lt;code&gt;dropReason: 'admin.muted'&lt;/code&gt; until the mute expires. That's the surgical option: when exactly one bot is misbehaving, you don't have to choose between de-allowlisting it everywhere or tripping the whole channel's breaker. You mute the one offender and leave the rest of the conversation running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-user session isolation, opt-in.&lt;/strong&gt; A channel can opt into keying sessions by &lt;code&gt;(channel, thread, userId)&lt;/code&gt; instead of the shared &lt;code&gt;(channel, thread)&lt;/code&gt;, so two humans working in the same thread get independent sessions that don't observe each other's state. It's default-off and default-safe: no flag means the legacy shared key, behavior unchanged. The guard validates the sender id is a non-empty string before keying on it, so a malformed event falls back to the shared session rather than constructing a broken key.&lt;/p&gt;
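
&lt;p&gt;The keying rule and its guard, sketched under assumptions: the function name &lt;code&gt;sessionKey&lt;/code&gt;, the &lt;code&gt;perUser&lt;/code&gt; flag, and the colon-separated key format are hypothetical. Only the fallback behavior comes from the description above.&lt;/p&gt;

```typescript
// Hypothetical sketch of the opt-in per-user session key with its guard.
function sessionKey(channel: string, thread: string, userId: unknown, perUser: boolean): string {
  const shared = channel + ':' + thread;
  if (!perUser) return shared; // default: the legacy shared key
  // Guard: only key on the sender when it is a non-empty string.
  if (typeof userId !== 'string' || userId.length === 0) return shared;
  return shared + ':' + userId;
}

console.log(sessionKey('C1', '100.1', 'U42', true));     // C1:100.1:U42
console.log(sessionKey('C1', '100.1', undefined, true)); // C1:100.1
console.log(sessionKey('C1', '100.1', 'U42', false));    // C1:100.1
```

&lt;p&gt;Falling back to the shared key on a malformed sender keeps the event routable; the alternative, a key ending in &lt;code&gt;undefined&lt;/code&gt;, would silently pool every malformed event into one session.&lt;/p&gt;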

&lt;h2&gt;
  
  
  Why the three compose
&lt;/h2&gt;

&lt;p&gt;It's tempting to ship one of these and call the channel safe. Each one alone leaves a hole the other two close.&lt;/p&gt;

&lt;p&gt;Mention-gating without a breaker assumes every actor honors the gate. The moment a bot is &lt;em&gt;meant&lt;/em&gt; to be in the conversation — allowlisted, mentioning correctly — the gate waves it through, and two such bots can still loop. The breaker is the backstop for actors the gate legitimately admits.&lt;/p&gt;

&lt;p&gt;A breaker without mention-gating is all backstop and no front door. You'd let every bot respond to everything and rely on rate limits to clean up the mess — which means the channel is constantly tripping its own breaker, and the humans are drowning in bot chatter below the trip threshold. Gating keeps the normal case quiet so the breaker only fires on genuine runaways.&lt;/p&gt;

&lt;p&gt;And both of those without observability add up to a system that's safe but impossible to operate. When a bot goes quiet you can't tell whether it hit the mention gate, tripped the breaker, got muted, or crashed, so you can't tune any of it. The structured drop reasons and the read-only verbs are what let you see which gate is firing and adjust the thresholds with evidence instead of superstition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ceiling, named honestly
&lt;/h2&gt;

&lt;p&gt;What this earns is bounded, and worth stating plainly so nobody oversells it.&lt;/p&gt;

&lt;p&gt;The breaker is &lt;strong&gt;velocity control, not intent control.&lt;/strong&gt; It stops a &lt;em&gt;fast&lt;/em&gt; loop — the kind that fires as fast as the pipeline allows. Two bots having a slow, wrong conversation at one message every ten seconds stay under every threshold and will happily waste tokens all afternoon. The defense against &lt;em&gt;that&lt;/em&gt; is the mention gate and the allowlist, not the rate limit; the breaker is specifically the runaway-velocity backstop, and calling it a general "bad conversation detector" would be a lie.&lt;/p&gt;

&lt;p&gt;The counters are &lt;strong&gt;in-memory and per-process.&lt;/strong&gt; A restart resets them — deliberately, but it means a multi-process deployment doesn't share a breaker, and the cap is per-supervisor, not global across a fleet. At single-channel, single-process scale that's the right amount of machinery; the day this runs as a horizontally-scaled service, the counters need a shared backing store, and that's a real piece of work, not a config flag.&lt;/p&gt;

&lt;p&gt;And the thresholds are heuristics. 10-per-60s and 40-per-60s are tuned for the realistic loop shape, but they're operator-tunable for exactly the reason that the right number depends on your channel's legitimate traffic. The defaults are a safe starting point, not a proof.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;A shared channel is a place where independent actors, human and machine, touch the same surface, and the reflex is to assume each actor will behave. For humans that's mostly true. For bots it is exactly false: a bot does precisely what it's told, forever, including answering another bot that's answering it.&lt;/p&gt;

&lt;p&gt;So you don't make a multi-agent channel safe with one clever gate. You gate who may speak, you trip a breaker when speech runs away, and you keep the whole thing observable so you can see which defense is doing the work. Mention-to-engage keeps the normal case calm. The circuit breaker catches the loops the gate admits. Backpressure and structured drop reasons keep it bounded and debuggable under load. Three moves, each covering the other two's blind spot. Anyone putting more than one bot in one room needs all three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related posts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/the-api-is-the-real-boundary/"&gt;MCP Server Auth: The API Is the Real Boundary&lt;/a&gt; — the same instinct applied to auth: knowing which layer is the actual boundary and which is just UX.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/honor-the-gate-when-the-verdict-is-inconvenient/"&gt;Honor the Gate When the Verdict Is Inconvenient&lt;/a&gt; — a gate is only worth building if you respect its drops.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/govern-at-merge-untrusted-union/"&gt;The Merge Is the Trust Boundary: Re-Derive as Untrusted&lt;/a&gt; — properties don't compose across a boundary you didn't enforce yourself.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>multiagent</category>
      <category>slack</category>
      <category>bots</category>
      <category>ratelimiting</category>
    </item>
    <item>
      <title>The Merge Is the Trust Boundary: Re-Derive as Untrusted</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Fri, 26 Jun 2026 15:49:20 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/the-merge-is-the-trust-boundary-re-derive-as-untrusted-ea1</link>
      <guid>https://dev.to/jeremy_longshore/the-merge-is-the-trust-boundary-re-derive-as-untrusted-ea1</guid>
      <description>&lt;p&gt;A single-writer knowledge store with a hash-chained audit log has a clean story: every governance event folds the previous entry's hash into its own, so an in-place edit breaks the chain and &lt;code&gt;verify&lt;/code&gt; catches it. That story holds right up to the moment a second writer appears.&lt;/p&gt;

&lt;p&gt;The instant two clones of the same store can each promote facts independently and later reconcile, the audit chain stops being a single line. Clone A has a chain that verifies clean against itself. Clone B has a chain that verifies clean against itself. Merge them, and you have a question neither green checkmark answers: &lt;strong&gt;is the union governed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post is about the EPIC 1 work in &lt;code&gt;qmd-team-intent-kb&lt;/code&gt; that took the position that it is not — and that the only honest fix is to treat the merge itself as the trust boundary, re-derive the union as untrusted, and re-govern it from scratch. The thesis underneath is one sentence: in a multi-writer knowledge store, the merge is where trust has to be rebuilt, because it is the one place trust cannot be inherited.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: audit chains don't compose across a merge
&lt;/h2&gt;

&lt;p&gt;The seductive assumption is that two trustworthy halves make a trustworthy whole. A clone's chain verified; the other clone's chain verified; surely their union is fine. Two independent failures kill that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure one: a content-level merge runs no policy.&lt;/strong&gt; The decisive finding came from a throwaway spike against Dolt's native three-way merge. The spike fed two branches into Dolt's row-union and watched a &lt;em&gt;secret-bearing row&lt;/em&gt; — one that a governance gate would have rejected — land in the merged database with zero governance applied.&lt;/p&gt;

&lt;p&gt;The union operator reconciles rows; it does not re-run your disclosure checks, your dedupe, your tenancy rules. Version control is not governance. Anything either branch promoted (or had slip past a weaker gate) rides into the merged state unchecked, because the merge operator has no concept of a policy to enforce. This is not a Dolt defect — it is true of every content-level merge operator, which is exactly why the fix has to live above the substrate rather than inside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure two: the linear verifier breaks at the seam — for an innocent reason.&lt;/strong&gt; The store's audit verifier is a linear walker: it sorts rows by &lt;code&gt;(timestamp ASC, id ASC)&lt;/code&gt; and asserts each row's &lt;code&gt;prev_entry_hash&lt;/code&gt; matches its predecessor. That contract is exactly right for one clone's own history and exactly wrong for a merge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative — why a linear walk fires a FALSE break at a merge seam.
clone A chain (its wallclock):   a0 → a1 → a2
clone B chain (its wallclock):   b0 → b1
merged, sorted by timestamp:     a0  b0  a1  b1  a2     # interleaved by wallclock
linear walk: a1.prev_entry_hash == a0  but predecessor is now b0  →  PREV_LINK_MISMATCH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two clones evolve on independent wallclocks; concatenating their chains by timestamp interleaves rows that never linked to each other, and the walker fires &lt;code&gt;PREV_LINK_MISMATCH&lt;/code&gt; at the boundary. So the naive merge produces a chain that &lt;em&gt;looks&lt;/em&gt; tampered even when nobody tampered — which means a real tamper at the seam would be indistinguishable from the normal noise of merging. A verifier that cries wolf at every merge can't be trusted to catch a wolf.&lt;/p&gt;
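
&lt;p&gt;The false break is easy to reproduce. The sketch below is a toy model under stated assumptions: the row shape and the helper names &lt;code&gt;build_chain&lt;/code&gt; and &lt;code&gt;linear_walk&lt;/code&gt; are hypothetical, and the hash body is simplified to a predecessor hash plus a label.&lt;/p&gt;

```python
import hashlib


def entry_hash(prev_hash, body):
    # Each entry folds its predecessor's hash into its own.
    return hashlib.sha256((prev_hash + "|" + body).encode()).hexdigest()


def build_chain(prefix, timestamps):
    rows, prev = [], ""
    for n, ts in enumerate(timestamps):
        h = entry_hash(prev, f"{prefix}{n}")
        rows.append({"id": f"{prefix}{n}", "timestamp": ts,
                     "prev_entry_hash": prev, "entry_hash": h})
        prev = h
    return rows


def linear_walk(rows):
    """Sort by (timestamp, id); report ids whose prev_entry_hash is not the predecessor's hash."""
    ordered = sorted(rows, key=lambda r: (r["timestamp"], r["id"]))
    breaks, prev = [], ""
    for row in ordered:
        if row["prev_entry_hash"] != prev:
            breaks.append(row["id"])
        prev = row["entry_hash"]
    return breaks


clone_a = build_chain("a", [1, 3, 5])   # a0, a1, a2 on clone A's wallclock
clone_b = build_chain("b", [2, 4])      # b0, b1 on clone B's wallclock

breaks_a = linear_walk(clone_a)
breaks_b = linear_walk(clone_b)
breaks_merged = linear_walk(clone_a + clone_b)
```

&lt;p&gt;Each clone walks clean on its own, yet the interleaved union reports a break at every row after the first, with no tampering anywhere in the inputs.&lt;/p&gt;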

&lt;p&gt;Put together: the merge admits ungoverned data &lt;em&gt;and&lt;/em&gt; destroys the signal that would tell you. You cannot patch this by trusting either input chain harder. The trust has to be rebuilt at the merge, not inherited through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The design: re-derive the union as untrusted
&lt;/h2&gt;

&lt;p&gt;The move is to stop treating a row that crossed a clone boundary as "already governed." At merge time the gate (&lt;code&gt;mergeGovern&lt;/code&gt;) takes the union of both clones' promoted rows and re-runs every one of them through the &lt;em&gt;same&lt;/em&gt; front-door governance it would apply to a brand-new capture — as if none had ever been trusted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative — the merge gate, in five moves.
union = dedupe_by_content_id(cloneA.promoted ++ cloneB.promoted)
union = sort_by_content_id(union)              # makes A∪B and B∪A byte-identical
for index, row in enumerate(union):
    cand = project_to_untrusted(row)           # strip governance metadata; trustLevel = untrusted
    if not disclosure_clean(cand):  quarantine(row); continue   # fail-closed secret/PII choke point
    if not policy_pipeline(cand):   quarantine(row); continue   # dedupe, sensitivity, tenancy, ...
    promote(cand, clock = merge_clock(index))  # canonical promotion path, deterministic timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The load-bearing detail is &lt;code&gt;project_to_untrusted&lt;/code&gt;. A promoted row carries governance metadata — its lifecycle, its prior policy evaluations, its &lt;code&gt;trustLevel&lt;/code&gt;. The gate strips all of that, keeps only the source content and provenance, and forces &lt;code&gt;trustLevel = untrusted&lt;/code&gt;, so the &lt;code&gt;source_trust&lt;/code&gt; rule evaluates the row as if it had no standing. The merge trusts &lt;em&gt;nothing&lt;/em&gt; that came across a boundary. Survivors get promoted through the canonical path; a secret that rode in on a clone is quarantined, never written. The merged state is, by construction, indistinguishable from one assembled by feeding every row through the front door one at a time.&lt;/p&gt;

&lt;p&gt;Quarantine has its own non-leak contract, and it matters more than it looks. A rejected row is recorded as a &lt;code&gt;QuarantinedRow&lt;/code&gt; carrying the offending memory's content-derived id, a category (&lt;code&gt;disclosure&lt;/code&gt; or &lt;code&gt;policy&lt;/code&gt;), and a human-readable reason — the disclosure category, or the id of the rule that rejected it. What it never carries is the matched secret or PII value.&lt;/p&gt;

&lt;p&gt;The whole point of the disclosure choke point is to keep secrets out of the governed store; a quarantine record that quoted the secret back in its &lt;code&gt;reason&lt;/code&gt; field would re-leak exactly what it just blocked, into the very audit trail meant to be safe to share. So the gate reports &lt;em&gt;that&lt;/em&gt; a row was turned away and &lt;em&gt;why category&lt;/em&gt;, never &lt;em&gt;what&lt;/em&gt; the offending value was. Refusing to leak on the rejection path is as much a part of the design as refusing to admit on the success path.&lt;/p&gt;
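
&lt;p&gt;A compact sketch of the projection and the non-leak quarantine record, under loud assumptions: the single regex stands in for a real detector set, the policy pipeline is omitted, and the row fields and the &lt;code&gt;secret:api_key&lt;/code&gt; reason string are invented for illustration.&lt;/p&gt;

```python
import re
from dataclasses import dataclass

# Hypothetical detector: one toy pattern standing in for a real secret/PII pattern set.
SECRET_PATTERN = re.compile(r"sk_live_[A-Za-z0-9]+")


@dataclass(frozen=True)
class QuarantinedRow:
    memory_id: str   # content-derived id of the rejected memory
    category: str    # "disclosure" or "policy"
    reason: str      # detector category or rule id, never the matched value


def project_to_untrusted(row):
    # Keep only source content and provenance; drop every prior governance verdict.
    return {"id": row["id"], "content": row["content"],
            "provenance": row["provenance"], "trustLevel": "untrusted"}


def merge_govern(clone_a, clone_b):
    union = {row["id"]: row for row in clone_a + clone_b}   # de-dup by content-derived id
    promoted, quarantined = [], []
    for memory_id in sorted(union):                          # id-sorted traversal
        cand = project_to_untrusted(union[memory_id])
        if SECRET_PATTERN.search(cand["content"]):
            quarantined.append(QuarantinedRow(memory_id, "disclosure", "secret:api_key"))
            continue
        promoted.append(cand)
    return promoted, quarantined


a = [{"id": "m1", "content": "deploys run on Tuesdays",
      "provenance": "standup-notes", "trustLevel": "trusted"}]
b = [{"id": "m2", "content": "key is sk_live_abc123",
      "provenance": "chat-export", "trustLevel": "trusted"},
     {"id": "m1", "content": "deploys run on Tuesdays",
      "provenance": "standup-notes", "trustLevel": "trusted"}]
promoted, quarantined = merge_govern(a, b)
```

&lt;p&gt;The record names the row and the category and stops there. The matched value never appears in &lt;code&gt;reason&lt;/code&gt;, so the quarantine log stays safe to share.&lt;/p&gt;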

&lt;p&gt;The gate also fails loud, and atomically, on a row it cannot reason about. Before any promotion runs, it pre-passes the entire union and checks that every row's id reproduces under re-derivation — that &lt;code&gt;id == deriveMemoryId(candidateId, contentHash)&lt;/code&gt;. A row whose id does &lt;em&gt;not&lt;/em&gt; reproduce did not come from the canonical promotion path; it carries a stray, non-content-derived id (a &lt;code&gt;crypto.randomUUID()&lt;/code&gt; v4 from some out-of-band code path), and that id is poison: it would survive de-dup as its own twin and sort to a per-clone position, breaking both de-dup and commutativity.&lt;/p&gt;

&lt;p&gt;Rather than corrupt the merge quietly, the gate throws at entry — before a single row is written — so one bad row aborts the whole merge instead of silently skewing it. Walking the already-sorted union makes the first reported offender deterministic across clones, too: even the &lt;em&gt;failure&lt;/em&gt; is reproducible. The error carries the offending id and the id it should have been — both already-public identifiers — but never the row's content, the same non-leak discipline the quarantine path follows.&lt;/p&gt;
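
&lt;p&gt;The pre-pass can be sketched with Python's standard &lt;code&gt;uuid&lt;/code&gt; module. The namespace value, the way &lt;code&gt;candidateId&lt;/code&gt; and &lt;code&gt;contentHash&lt;/code&gt; are joined into a name, and the error class are all assumptions here; the post does not show the real ones.&lt;/p&gt;

```python
import uuid

# Hypothetical locked namespace; the real namespace value is not given in the post.
MEMORY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/memory")


def derive_memory_id(candidate_id, content_hash):
    # UUID v5 is deterministic for a given namespace and name.
    return str(uuid.uuid5(MEMORY_NAMESPACE, f"{candidate_id}:{content_hash}"))


class NonReproducibleIdError(Exception):
    def __init__(self, actual_id, expected_id):
        # Carries only the two identifiers, never the row content.
        super().__init__(f"id {actual_id} does not reproduce; expected {expected_id}")
        self.actual_id = actual_id
        self.expected_id = expected_id


def validate_union(rows):
    """Pre-pass: raise before any write if a row's id is not content-derived."""
    for row in sorted(rows, key=lambda r: r["id"]):   # sorted walk: first offender is deterministic
        expected = derive_memory_id(row["candidate_id"], row["content_hash"])
        if row["id"] != expected:
            raise NonReproducibleIdError(row["id"], expected)


good = {"candidate_id": "c1", "content_hash": "h1", "id": derive_memory_id("c1", "h1")}
stray = {"candidate_id": "c2", "content_hash": "h2", "id": str(uuid.uuid4())}

validate_union([good])   # passes silently
try:
    validate_union([good, stray])
    aborted = False
except NonReproducibleIdError as err:
    aborted = True
    reported = (err.actual_id, err.expected_id)
```

&lt;p&gt;Because the check runs over the whole union before anything is promoted, one stray v4 id stops the merge with no partial writes to unwind.&lt;/p&gt;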

&lt;p&gt;Critically, the gate invents no new governance primitive. It re-runs the existing disclosure choke point and the existing policy pipeline. That is a deliberate honesty constraint, and it sets a ceiling I'll name later: re-governing the union is exactly as strong as the governance you already had, and not one bit stronger.&lt;/p&gt;

&lt;p&gt;That "front door" is itself worth a sentence, because it's what makes re-running it meaningful. The same day this merge gate landed, the store's EPIC 0 hardening put real teeth on the intake path: a fail-closed PII/secret/comp choke point at &lt;code&gt;CandidateRepository.insert()&lt;/code&gt; that rejects on detection rather than flagging-and-continuing, token-bound tenancy so a row can't claim a tenant its caller isn't scoped to, and token hashing plus revocation so a leaked credential can be killed.&lt;/p&gt;

&lt;p&gt;The merge gate's whole value proposition — "re-derive the union through the front door" — only pays off because the front door is fail-closed. Re-running a permissive gate over the union would launder bad rows, not catch them. The order of operations matters: harden intake first (EPIC 0), then point the merge at it (EPIC 1).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why content-derived ids make the re-derivation reproducible
&lt;/h3&gt;

&lt;p&gt;Re-running policy is necessary but not sufficient. For the result to mean anything across clones, two clones re-deriving the &lt;em&gt;same&lt;/em&gt; union have to land on the &lt;em&gt;same&lt;/em&gt; answer — same ids, same hashes, same verdict. That reproducibility is bought by three deterministic choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content-derived UUID v5.&lt;/strong&gt; Every memory and audit-event id is derived from its content (&lt;code&gt;deriveMemoryId(candidateId, contentHash)&lt;/code&gt;) under a locked namespace, not minted from &lt;code&gt;crypto.randomUUID()&lt;/code&gt;. So the same logical memory produces the &lt;em&gt;same&lt;/em&gt; id in every clone. That makes the union's de-dup a real set operation — two clones holding the same fact collapse to one row — and it makes a sort-by-id traversal deterministic. The gate even refuses, loudly, any row whose id does not reproduce under re-derivation: a stray v4 id would survive de-dup twice and sort to a per-clone position, so the gate aborts the whole merge atomically rather than corrupt it silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An entry hash that excludes the wallclock.&lt;/strong&gt; The audit chain's &lt;code&gt;entry_hash&lt;/code&gt; was split into versions: v2 canonicalizes the row body &lt;em&gt;without&lt;/em&gt; its &lt;code&gt;timestamp&lt;/code&gt;, so the same logical event hashes identically no matter when each clone happened to write it. (v1 rows are deliberately frozen, never rehashed — their original hashes are their tamper-evidence; v1 and v2 coexist and verify in one pass.) Determinism here is a deliberate substitution: the timestamp leaves the hash body so two clones agree, and a separate deterministic merge clock puts ordering back.&lt;/p&gt;
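
&lt;p&gt;A minimal sketch of a timestamp-free entry hash, assuming a JSON canonicalization with sorted keys. The field names and the function &lt;code&gt;entry_hash_v2&lt;/code&gt; are illustrative, not the store's actual schema.&lt;/p&gt;

```python
import hashlib
import json


def entry_hash_v2(row):
    # Canonicalize the body without its timestamp: sorted keys, fixed separators.
    body = {k: v for k, v in row.items() if k not in ("timestamp", "entry_hash")}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


event_on_a = {"id": "e1", "action": "promote", "memory_id": "m1",
              "prev_entry_hash": "", "timestamp": "2026-06-26T15:00:00Z"}
event_on_b = dict(event_on_a, timestamp="2026-06-26T17:42:10Z")   # same event, later wallclock

same_hash = entry_hash_v2(event_on_a) == entry_hash_v2(event_on_b)

edited = dict(event_on_a, action="demote")
detects_edit = entry_hash_v2(edited) != entry_hash_v2(event_on_a)
```

&lt;p&gt;Two clones writing the same event hours apart agree on the hash, while any change to the body still changes it.&lt;/p&gt;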

&lt;p&gt;&lt;strong&gt;A deterministic merge clock.&lt;/strong&gt; Promotion still needs to write &lt;em&gt;some&lt;/em&gt; timestamp. Sourcing it from &lt;code&gt;new Date()&lt;/code&gt; would make the merged DB differ run-to-run. Instead &lt;code&gt;merge_clock(index)&lt;/code&gt; derives a strictly-monotonic timestamp from the row's position in the id-sorted traversal. Because the sort is content-stable, a given memory lands at the same index — and gets the same timestamp — on every clone. This is what makes &lt;code&gt;mergeGovern(A, B)&lt;/code&gt; and &lt;code&gt;mergeGovern(B, A)&lt;/code&gt; produce byte-identical durable state &lt;em&gt;and&lt;/em&gt; an identical audit chain. The merge is commutative, by construction, not by luck.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three clones, one pass, same answer
&lt;/h3&gt;

&lt;p&gt;Two-way commutativity is the property you can demo. The one you actually need in a team is that it doesn't matter &lt;em&gt;how&lt;/em&gt; the merge was assembled — three people's clones reconciled in any grouping, any order, must land on the same governed state. The N-way path (&lt;code&gt;mergeGovernFold&lt;/code&gt;) gets this almost for free, and the "almost" is instructive.&lt;/p&gt;

&lt;p&gt;The obvious implementation is an iterated reduce: govern A and B, then re-govern those survivors against C, writing the running database each pass. That is rejected for a concrete reason. The deterministic merge clock restarts at its epoch on every call, so a second pass would mint audit timestamps that collide with or precede the first pass's — and the verifier re-walks the chain by &lt;code&gt;(timestamp ASC, id ASC)&lt;/code&gt; while the store anchors each new row to the most-recent by &lt;code&gt;(timestamp DESC, id DESC)&lt;/code&gt;. Colliding per-pass clocks desynchronize insertion order from chronological order, and the chain breaks.&lt;/p&gt;

&lt;p&gt;So the fold instead concatenates every clone's rows into the &lt;em&gt;inputs of a single&lt;/em&gt; &lt;code&gt;mergeGovern&lt;/code&gt; call: one union, one id-sorted traversal, one strictly-monotonic clock over the whole thing. Because the first step is a set union and the traversal erases input order, &lt;code&gt;fold([X, Y, Z])&lt;/code&gt;, &lt;code&gt;fold([Z, Y, X])&lt;/code&gt;, and either grouping of the pairwise merges all produce byte-identical durable state and an identical audit chain. Associative and commutative, proven by a fold-of-three test that asserts every ordering verifies with zero breaks. The reduction shape is the only new code; the governance is reused verbatim.&lt;/p&gt;
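
&lt;p&gt;The single-pass shape can be sketched as follows. Governance is left out so the ordering argument stays visible, and the epoch value, row shape, and function names are assumptions rather than the real &lt;code&gt;mergeGovernFold&lt;/code&gt;.&lt;/p&gt;

```python
import hashlib
import itertools
import json

MERGE_EPOCH_MS = 0   # hypothetical epoch; the real clock's base value is not given in the post


def merge_clock(index):
    # Strictly monotonic: one tick per position in the id-sorted traversal.
    return MERGE_EPOCH_MS + index + 1


def merge_govern_fold(clones):
    """Single pass: one union, one id-sorted traversal, one clock over everything."""
    union = {}
    for clone in clones:
        for row in clone:
            union[row["id"]] = row          # set union by content-derived id
    chain, prev = [], ""
    for index, memory_id in enumerate(sorted(union)):
        body = json.dumps({"id": memory_id, "content": union[memory_id]["content"],
                           "prev_entry_hash": prev}, sort_keys=True)
        entry = hashlib.sha256(body.encode()).hexdigest()   # timestamp stays out of the hash
        chain.append({"id": memory_id, "timestamp": merge_clock(index),
                      "prev_entry_hash": prev, "entry_hash": entry})
        prev = entry
    return chain


x = [{"id": "m1", "content": "alpha"}, {"id": "m3", "content": "gamma"}]
y = [{"id": "m2", "content": "beta"}, {"id": "m1", "content": "alpha"}]
z = [{"id": "m4", "content": "delta"}]

results = [merge_govern_fold(list(order)) for order in itertools.permutations([x, y, z])]
all_identical = all(r == results[0] for r in results)
```

&lt;p&gt;All six input orderings yield the same chain, timestamps included, because nothing in the pass depends on which clone a row arrived from.&lt;/p&gt;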

&lt;h3&gt;
  
  
  The signed DAG anchor: binding the merge to who made it
&lt;/h3&gt;

&lt;p&gt;Determinism gives reproducibility. It does not give attribution. A wholesale rewrite that recomputes every hash is internally consistent and self-verifies clean — the same gap a single-writer external anchor closes for a linear chain. A merge needs more, because a merged head has &lt;em&gt;two&lt;/em&gt; parents and the question "who reconciled these two histories?" has no answer in the hashes alone.&lt;/p&gt;

&lt;p&gt;So each merge appends a &lt;strong&gt;per-actor Ed25519-signed anchor record&lt;/strong&gt; (&lt;code&gt;SignedMergeAnchorRecord&lt;/code&gt;). It binds the merged chain head to a &lt;code&gt;parents&lt;/code&gt; &lt;em&gt;set&lt;/em&gt; — the two pre-merge clone heads, sorted so the set canonicalizes identically regardless of input order — a per-actor Lamport clock, the signer's public key embedded inline, and a detached Ed25519 signature over the canonical body. Two independent guards then ride every record: the existing SHA-256 integrity hash (the bytes weren't accidentally corrupted) &lt;em&gt;and&lt;/em&gt; the signature (a forger who edits the merged chain and re-hashes it forward still cannot produce a valid signature without the actor's private key).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative — what the merge-aware verifier checks, and what each catches.
per-clone linear walk        →  tamper inside one clone's own history
id-sorted merged re-walk     →  merged DB NOT produced by mergeGovern (or reordered after)
Ed25519 signature verifies   →  WHO anchored this merge — not just that the bytes agree
parents == {headA, headB}    →  the anchor attests to the correct two chains
chainHead == merged head     →  the anchor describes THIS merged head, not a different one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;parents&lt;/code&gt; field is what makes this a DAG anchor rather than a fancier linear one. A merged head has two ancestors, and the anchor records them as a &lt;em&gt;set&lt;/em&gt;, not a tuple — sorted before the body is serialized — so the same two clone heads canonicalize to identical signable bytes no matter which clone was passed to &lt;code&gt;mergeGovern&lt;/code&gt; first. The Lamport clock alongside it is a per-actor monotonic counter the caller owns and bumps before each signed anchor, so an auditor can order one actor's anchors causally even when wallclocks across machines disagree. And an optional &lt;code&gt;commitHash&lt;/code&gt; records the Dolt or git SHA the merge landed on, giving an external auditor a thread back into version-control history. None of these are required to verify a single merge; all of them are what let a &lt;em&gt;sequence&lt;/em&gt; of merges across actors be reconstructed after the fact.&lt;/p&gt;
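
&lt;p&gt;The canonicalization step can be sketched on its own. Signing is out of scope because Ed25519 is not in Python's standard library, and the JSON layout and the &lt;code&gt;lamport&lt;/code&gt; key name are assumptions; &lt;code&gt;chainHead&lt;/code&gt;, &lt;code&gt;parents&lt;/code&gt;, and &lt;code&gt;commitHash&lt;/code&gt; follow the names used above.&lt;/p&gt;

```python
import hashlib
import json


def anchor_body(chain_head, parent_heads, lamport, commit_hash=None):
    """Canonical signable bytes for a merge anchor (the signature itself is not modeled)."""
    body = {
        "chainHead": chain_head,
        "parents": sorted(set(parent_heads)),   # a set, serialized in sorted order
        "lamport": lamport,                     # per-actor counter, bumped by the caller
    }
    if commit_hash is not None:
        body["commitHash"] = commit_hash        # optional thread back into version control
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


ab = anchor_body("head-merged", ["head-a", "head-b"], lamport=7)
ba = anchor_body("head-merged", ["head-b", "head-a"], lamport=7)

same_bytes = ab == ba
integrity_hash = hashlib.sha256(ab).hexdigest()   # the SHA-256 guard; a signer would sign `ab`
```

&lt;p&gt;Sorting the parents before serialization is what lets either argument order produce the same bytes, and therefore the same signature input.&lt;/p&gt;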

&lt;p&gt;The merge-aware verifier owns the canonical id-based ordering as a first-class contract — it re-walks the union by content-derived event id and checks &lt;code&gt;prev_entry_hash&lt;/code&gt; against the &lt;em&gt;id-sorted&lt;/em&gt; predecessor, not the wallclock one. It validates each clone's chain linearly first (a break there is tamper in that clone's own history, surfaced before any merge-level reasoning), then re-walks the merged union, then cross-checks the signed anchor. That ordering cures the false &lt;code&gt;PREV_LINK_MISMATCH&lt;/code&gt; from earlier: a clean merge re-walks to byte-identical hashes and reports zero breaks, so a real reorder or tamper now stands out instead of drowning in merge noise. Against a keyless adversary, the skeptic harness that gated this work reported 20,000 brute-force signature attempts with zero accepted, and a tamper-then-rehash trips &lt;code&gt;HISTORY_REWRITTEN&lt;/code&gt; or &lt;code&gt;DAG_HEAD_MISMATCH&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Five layers stack into the property, and it helps to see them as a unit. Each earns one specific guarantee, and none is sufficient on its own:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;What it earns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Content-derived UUID v5&lt;/td&gt;
&lt;td&gt;same content derives the same id in every clone&lt;/td&gt;
&lt;td&gt;union de-dup is a real set op; id-sorted traversal is reproducible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic entry hash (v2)&lt;/td&gt;
&lt;td&gt;canonical hash body excludes the wallclock timestamp&lt;/td&gt;
&lt;td&gt;the same logical event hashes identically across clones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Govern-at-merge gate&lt;/td&gt;
&lt;td&gt;re-derive the union as untrusted through the front door&lt;/td&gt;
&lt;td&gt;ungoverned rows are quarantined, never admitted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Signed DAG anchor (Ed25519)&lt;/td&gt;
&lt;td&gt;parents &lt;em&gt;set&lt;/em&gt; + signature over the merged head&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;which key&lt;/em&gt; reconciled the histories — identity needs a trusted key roster on top&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merge-aware verifier&lt;/td&gt;
&lt;td&gt;per-clone walk + id-sorted re-walk + anchor cross-check&lt;/td&gt;
&lt;td&gt;a real reorder or tamper stands out instead of drowning in merge noise&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pull any row and the property degrades: without content-derived ids the merge isn't commutative; without the deterministic hash the merged chain won't re-walk clean; without the gate the union is ungoverned; without the signature you can't attribute the merge; without the merge-aware verifier you can't tell a clean merge from a tampered one. They were built and merged as one epic for that reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it matters: verification that travels with the content
&lt;/h2&gt;

&lt;p&gt;The payoff is that verification is no longer tied to a place or a trusted party. Any clone, anywhere, that re-derives the same content reproduces the same ids, the same &lt;code&gt;entry_hash&lt;/code&gt;, the same merged chain, and the same quarantine-or-promote verdict. "Is this merged store governed?" becomes a question you answer by recomputation, not by trusting the machine that produced it. The signed anchor adds the one thing recomputation can't: evidence that a specific &lt;em&gt;key&lt;/em&gt; performed the reconciliation. The signature is self-contained — verifiable against the public key embedded in the record, with no key server in the loop — but that earns integrity plus "one keyholder signed this," not identity. Attributing that key to a named actor still needs a trusted roster of which public key belongs to whom: the anchor proves a key signed; the roster says whose key it is. That roster is the out-of-band trust this design still leans on, and naming it is the honest version of the claim.&lt;/p&gt;

&lt;p&gt;Determinism also makes the merge safe to &lt;em&gt;preview&lt;/em&gt;. The gate takes a &lt;code&gt;dryRun&lt;/code&gt; flag that runs the entire pipeline — validation, de-dup, disclosure check, policy, audit-event construction — and writes nothing. Because the result is byte-identical to what a real run would produce, a dry run is a faithful forecast, not an approximation: you can see exactly which rows would promote and which would quarantine before committing the merge. A non-deterministic gate couldn't offer that; its preview and its real run could disagree.&lt;/p&gt;

&lt;p&gt;That is the whole shape of the thing. Where a version-control merge says "I reconciled the rows," this says "I re-governed the union from scratch as untrusted, here is the deterministic result you can reproduce, and here is my signature over the head." The merge stops being a place where trust is silently inherited and becomes a place where it is explicitly rebuilt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ceiling, named honestly
&lt;/h2&gt;

&lt;p&gt;The discipline this work inherited from its single-writer predecessor is sizing the claim to the mechanism. So, plainly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative — the ceiling.
forger WITHOUT the actor's private key   →  cannot mint an accepted anchor   (caught)
the key-holder edits a row and RE-SIGNS  →  a self-consistent signed anchor  (NOT caught here)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;tamper-evident, not tamper-proof.&lt;/strong&gt; A legitimate key-holder — or an exfiltrated key — can still edit a row, recompute the chain, and produce a fresh signed anchor that verifies perfectly. The signature proves the edit was made by the key, not that the edit was honest.&lt;/p&gt;

&lt;p&gt;That gap is &lt;em&gt;mitigated, not eliminated&lt;/em&gt;, by two things outside the cryptography. First, key custody: the Ed25519 private key lives only in a SOPS/age-encrypted file, mode 600, loaded at sign time and never written to disk in plaintext. Second, committing the anchor log somewhere append-only that the local writer cannot quietly rewrite — a &lt;code&gt;git push&lt;/code&gt;, an OpenTimestamps proof. The anchor is only as trustworthy as the place it is committed to; an anchor that lives next to the chain it witnesses is no witness at all. Naming where the stronger claim would need more machinery is part of the claim — the same recursion the single-writer version faced: where do you anchor the anchor?&lt;/p&gt;

&lt;p&gt;Two more ceilings worth stating. The re-derivation is only as strong as the detectors it re-runs — the disclosure choke point inherits whatever recall ceiling its secret/PII pattern set has, and re-governing the union doesn't improve that; expanding the detectors is the only lever. And the content-derived ids use UUID v5, which is SHA-1 under the hood per RFC 4122 — that is deterministic namespacing, &lt;em&gt;not&lt;/em&gt; a security primitive (the static analyzer's weak-crypto alert on it is a dismissed false positive); the audit integrity that actually matters runs on SHA-256. Calling the namespacing "cryptographic integrity" would be the kind of overstatement this whole design exists to avoid.&lt;/p&gt;

&lt;p&gt;One more piece of honesty belongs in the same breath, about the substrate. The clone-and-merge story sounds like it demands a distributed database, and the Dolt spike that exposed the ungoverned-merge problem made Dolt the obvious destination. The decision recorded the opposite: the store of record stays SQLite, and the merge gate operates on the store's &lt;em&gt;types&lt;/em&gt;, not on any one engine's merge — so it is substrate-agnostic, and adopting Dolt later is a swap, not a rewrite.&lt;/p&gt;

&lt;p&gt;Migrating the brain onto a distributed substrate is explicitly demand-gated: not built until a real multiplayer need is logged — two people actually blocked on a shared brain, or a real cross-person recall miss. Provisioning a distributed control plane before that signal is exactly the premature optimization the de-risked plan exists to prevent. The merge gate is the part you need &lt;em&gt;first&lt;/em&gt;, because the governance gap is real the moment two clones exist; the distributed plumbing is the part you can defer until someone is genuinely blocked without it.&lt;/p&gt;

&lt;p&gt;The forbidden-words list is short and load-bearing: tamper-proof, immutable, blockchain, and — for the local single-actor case — non-repudiation. What the merge gate earns is real and worth having: governance that survives a merge, deterministic and commutative re-derivation, and cross-actor attribution of the merge event once the anchor is externally committed. Sized to that, the claim holds. Inflated past it, it would be the falsifiable kind that ages into a liability the day someone asks you to prove it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;A merge is the one operation where two trust domains touch, and the reflex everywhere — version control, caches, replicas — is to assume the operation preserves whatever properties the inputs had. For data you actually govern, that reflex is the bug. Properties don't compose across a reconciliation you didn't perform yourself.&lt;/p&gt;

&lt;p&gt;The fix is not a stronger merge operator; it's refusing to inherit trust at all — re-deriving the union as untrusted, re-running the governance you already trust, anchoring the deterministic result with a signature, and being honest that the signature proves authorship, not virtue. The merge is the trust boundary. Treat it like one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related posts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/governed-second-brain-local-first-mcp/"&gt;A Second Brain You Can Audit Beats One You Must Trust&lt;/a&gt; — the single-writer half of this system, where one external anchor closes the silent-rewrite gap for a linear chain. This post is what happens when a second writer arrives.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/governed-knowledge-two-releases-freshness-daemon/"&gt;Governed Knowledge: Two Releases, a Freshness Daemon, and Export Gating&lt;/a&gt; — earlier groundwork on governing a knowledge corpus rather than just retrieving from it.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/the-api-is-the-real-boundary/"&gt;MCP Server Auth: The API Is the Real Boundary&lt;/a&gt; — the same knowledge system, and why the governance audit trail must stay pure for verification to mean anything.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>audit</category>
      <category>cryptography</category>
      <category>knowledgebase</category>
    </item>
    <item>
      <title>An Agent Allowlist Is a Comment Until a Gate Checks the Body</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Fri, 26 Jun 2026 15:45:26 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/an-agent-allowlist-is-a-comment-until-a-gate-checks-the-body-2g47</link>
      <guid>https://dev.to/jeremy_longshore/an-agent-allowlist-is-a-comment-until-a-gate-checks-the-body-2g47</guid>
      <description>&lt;p&gt;A Claude Code agent declares the tools it is allowed to use in its frontmatter. That &lt;code&gt;tools&lt;/code&gt; line looks like documentation — a courteous note about what the agent touches.&lt;/p&gt;

&lt;p&gt;It is not documentation. It is a runtime allowlist: tools not listed are &lt;em&gt;blocked at runtime&lt;/em&gt;. The agent can write all the instructions it wants in its body about calling some MCP tool; if that tool isn't in the allowlist, the call never happens.&lt;/p&gt;

&lt;p&gt;Which means a Claude Code agent has two surfaces that can disagree. The frontmatter declares a capability. The body exercises a behavior. Nothing in between proves they match. And because the allowlist is enforced at runtime, a mismatch isn't a cosmetic doc-drift bug — it's a latent failure. An agent that invokes a tool it forgot to declare doesn't error loudly. It silently can't do its job.&lt;/p&gt;

&lt;p&gt;At one agent, you catch that by reading the file. This repo ships &lt;strong&gt;317&lt;/strong&gt; of them. This post is about the gate that made all 317 declare the truth about themselves in one automated sweep, instead of someone hand-auditing 317 files and hoping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: a runtime gate that nobody checked against behavior
&lt;/h2&gt;

&lt;p&gt;Every agent file is two parts. Frontmatter — &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;tools&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, and the rest — and a body, which is the prose that actually instructs the agent what to do and which tools to call. The &lt;code&gt;tools&lt;/code&gt; field is the allowlist, and the allowlist is load-bearing in a way a comment never is: the runtime reads it and blocks anything not on it.&lt;/p&gt;

&lt;p&gt;So there are exactly two ways for the frontmatter and the body to fall out of sync, and both are silent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The body uses a tool the allowlist doesn't grant.&lt;/strong&gt; The agent's prose says "collect the analytics with the umami tools," but &lt;code&gt;tools&lt;/code&gt; lists only &lt;code&gt;Read&lt;/code&gt;. At runtime every umami call is blocked. The agent is shipped, looks fine, validates against a frontmatter-only schema, and cannot perform its one job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The allowlist grants a tool the body never uses.&lt;/strong&gt; Over-declaration. The agent claims privilege it doesn't exercise. The allowlist is now lying in the other direction — it overstates the blast radius of the agent, which is exactly the wrong thing for an allowlist to do.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither of these shows up in a frontmatter-only validator. You can check that &lt;code&gt;tools&lt;/code&gt; is a well-formed list of real tool names and still have an allowlist that has nothing to do with what the agent does. The schema was green and the agents were wrong, and the only thing that would have caught it was a human reading each body against each allowlist — which does not scale to 317.&lt;/p&gt;

&lt;p&gt;The fix is the same move I keep coming back to: a declared property is only worth what the check that compares it to behavior is worth. Without that check, the allowlist is a comment that also happens to block your runtime calls — the worst of both, because being wrong is invisible until the agent fails quietly in front of a user.&lt;/p&gt;

&lt;h2&gt;
  
  
  The approach: a kernel-strict gate that lands with the fleet
&lt;/h2&gt;

&lt;p&gt;The check didn't arrive in one commit. It arrived as a schema-versioned progression — &lt;code&gt;3.9&lt;/code&gt; to &lt;code&gt;3.10&lt;/code&gt; to &lt;code&gt;3.11&lt;/code&gt; — where each step tightened what "a valid agent" means, and the fleet was dragged up to meet each new bar in the same branch that raised it.&lt;/p&gt;

&lt;p&gt;The version number isn't decoration either. Every change to what the validator accepts bumps &lt;code&gt;SCHEMA_VERSION&lt;/code&gt; and lands a changelog entry, so "the day the agent contract got stricter" is a diff with a number, not a vibe. When an agent that passed last month fails today, the version delta tells you which rule moved and where it was signed off. A quality gate that changes silently is its own kind of drift; versioning the gate is how the gate stays auditable while it tightens.&lt;/p&gt;
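
&lt;p&gt;A minimal sketch of how a versioned gate can answer "which rule moved". &lt;code&gt;SCHEMA_VERSION&lt;/code&gt; is the name used above; the &lt;code&gt;CHANGELOG&lt;/code&gt; mapping, the &lt;code&gt;3.9.0&lt;/code&gt; patch number, and both helper functions are assumptions for illustration. Versions are parsed into integer tuples because a plain string sort would put 3.10 before 3.9.&lt;/p&gt;

```python
# Illustrative only: the version notes paraphrase the post; the layout is assumed.
SCHEMA_VERSION = "3.11.0"

CHANGELOG = {
    "3.9.0": "baseline (assumed patch number)",
    "3.10.0": "agent gate flipped from lenient to kernel-strict",
    "3.11.0": "body-vs-allowlist consistency check added",
}


def parse_version(version):
    """Turn '3.10.0' into (3, 10, 0) so 3.10 sorts after 3.9."""
    return tuple(int(part) for part in version.split("."))


def rules_moved(passed_at, failing_at):
    """Return the changelog notes for every bump after passed_at, up to failing_at."""
    ordered = sorted(CHANGELOG, key=parse_version)
    start = ordered.index(passed_at) + 1
    end = ordered.index(failing_at) + 1
    return [f"{version}: {CHANGELOG[version]}" for version in ordered[start:end]]
```

&lt;p&gt;Under these assumptions, an agent that last passed at &lt;code&gt;3.9.0&lt;/code&gt; and fails at &lt;code&gt;3.11.0&lt;/code&gt; gets both intervening entries back, in order.&lt;/p&gt;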

&lt;p&gt;&lt;strong&gt;Schema 3.10.0 — flip the gate from lenient to kernel-strict.&lt;/strong&gt; The agent validator used to require almost nothing: &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, and a warning if you used a banned field. That's a gate in name only. 3.10.0 flipped it to a two-layer required set, all of it &lt;em&gt;errors&lt;/em&gt; at every tier (agents aren't tier-gated the way skills are):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The kernel floor — 8 fields:&lt;/strong&gt; &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;tools&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt;, &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, &lt;code&gt;tags&lt;/code&gt;. This is the &lt;code&gt;@intentsolutions/core&lt;/code&gt; agent-definition required set, consumed from the kernel with an inline fallback mirror so the validator and the spec can't drift apart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The enterprise live set on top:&lt;/strong&gt; &lt;code&gt;disallowedTools&lt;/code&gt; / &lt;code&gt;skills&lt;/code&gt; / &lt;code&gt;background&lt;/code&gt; on every agent, plus &lt;code&gt;hooks&lt;/code&gt; / &lt;code&gt;mcpServers&lt;/code&gt; / &lt;code&gt;permissionMode&lt;/code&gt; on standalone agents (those three are plugin-level and ignored at runtime on a plugin agent, so they're only required where they mean something).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also promoted the banned fields — &lt;code&gt;capabilities&lt;/code&gt;, &lt;code&gt;expertise_level&lt;/code&gt;, &lt;code&gt;activation_priority&lt;/code&gt;, &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;when_to_use&lt;/code&gt; and friends — from WARN to ERROR, and added &lt;code&gt;fable&lt;/code&gt; to the model enum.&lt;/p&gt;
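
&lt;p&gt;A rough sketch of what that two-layer required set might look like as a check. The field names come from the lists above; the function name &lt;code&gt;check_required_fields&lt;/code&gt;, the &lt;code&gt;standalone&lt;/code&gt; flag, and the error strings are assumptions for illustration, not the validator's actual API. Only the banned fields named above are listed.&lt;/p&gt;

```python
# Illustrative only: field names are from the post, everything else is assumed.
KERNEL_FLOOR = ("name", "description", "tools", "model",
                "color", "version", "author", "tags")
LIVE_SET_ALL = ("disallowedTools", "skills", "background")
LIVE_SET_STANDALONE = ("hooks", "mcpServers", "permissionMode")
BANNED = ("capabilities", "expertise_level", "activation_priority",
          "type", "category", "when_to_use")


def check_required_fields(frontmatter, standalone=False):
    """Return a list of error strings; an empty list means the agent passes."""
    required = KERNEL_FLOOR + LIVE_SET_ALL
    if standalone:
        # Plugin-level fields are only required where the runtime honors them.
        required = required + LIVE_SET_STANDALONE

    errors = [f"missing required field '{field}'"
              for field in required if field not in frontmatter]
    errors += [f"banned field '{field}' is present"
               for field in BANNED if field in frontmatter]
    return errors
```

&lt;p&gt;Everything comes back as an error, with no warning tier, which mirrors the "errors at every tier" rule described above.&lt;/p&gt;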

&lt;p&gt;The discipline that makes this real is in the commit note: &lt;strong&gt;CI goes red until the in-repo agents are remediated in the same branch.&lt;/strong&gt; The gate and the fleet land together. A gate you merge &lt;em&gt;before&lt;/em&gt; the fleet conforms is just a broken &lt;code&gt;main&lt;/code&gt; with a TODO. A gate you merge &lt;em&gt;with&lt;/em&gt; a deterministic remediation of all 317 agents is enforcement from the first commit — the bar and everything held to it move in one motion.&lt;/p&gt;

&lt;p&gt;The two-pass order matters and is worth slowing down on. The remediation that turned CI green could not also be the remediation that got things &lt;em&gt;right&lt;/em&gt;, because "green" and "correct" are different bars and conflating them is how a big migration goes sideways.&lt;/p&gt;

&lt;p&gt;So the green pass was deliberately dumb and deterministic: every agent missing a &lt;code&gt;tools&lt;/code&gt; field got the full canonical tool set (a faithful default that preserved the inherit-everything behavior those agents already had), &lt;code&gt;tags&lt;/code&gt; scaffolded from plugin category plus name, &lt;code&gt;version&lt;/code&gt; set to &lt;code&gt;1.0.0&lt;/code&gt;, a stable &lt;code&gt;color&lt;/code&gt;, and empty &lt;code&gt;disallowedTools&lt;/code&gt; / &lt;code&gt;skills&lt;/code&gt; / &lt;code&gt;background false&lt;/code&gt;. Nothing clever, nothing that required judgment per agent — just enough to satisfy the new required set so the gate could go live with a green fleet under it.&lt;/p&gt;
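
&lt;p&gt;The green pass can be pictured as a pure fill-in-the-blanks function. This is a sketch under stated assumptions: &lt;code&gt;CANONICAL_TOOLS&lt;/code&gt; is a short stand-in for the real canonical set, the default color is a placeholder, and the rule that existing values are never overwritten is my assumption about how such a pass stays safe to rerun.&lt;/p&gt;

```python
# Illustrative only: CANONICAL_TOOLS and the default color are placeholders.
CANONICAL_TOOLS = ["Read", "Write", "Edit", "Bash"]  # stand-in, not the real set


def green_pass(frontmatter, plugin_category, default_color="blue"):
    """Fill only the fields that are missing; never overwrite an existing value."""
    fixed = dict(frontmatter)
    # A full tool list preserves the inherit-everything behavior the agent had.
    fixed.setdefault("tools", list(CANONICAL_TOOLS))
    fixed.setdefault("tags", [plugin_category, fixed.get("name", "agent")])
    fixed.setdefault("version", "1.0.0")
    fixed.setdefault("color", default_color)
    fixed.setdefault("disallowedTools", [])
    fixed.setdefault("skills", [])
    fixed.setdefault("background", False)
    return fixed
```

&lt;p&gt;Because every step is a default rather than a decision, running the function twice gives the same result as running it once, which is what makes a sweep like this deterministic.&lt;/p&gt;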

&lt;p&gt;&lt;strong&gt;The A-grade pass — narrow to least privilege.&lt;/strong&gt; &lt;em&gt;Then&lt;/em&gt; a second pass took the allowlists from "technically valid" to "actually minimal." It did three things to every agent: rewrote &lt;code&gt;description&lt;/code&gt; into capability plus an explicit "Use when…" plus trigger phrases; replaced the scaffolded category &lt;code&gt;tags&lt;/code&gt; with at least two real lowercase topic tags; and narrowed &lt;code&gt;tools&lt;/code&gt; from full-canonical to least-privilege. The result: median allowlist around five tools, and &lt;strong&gt;zero&lt;/strong&gt; agents retaining the full ten-tool set, down from 311 inheriting all of it.&lt;/p&gt;

&lt;p&gt;The edits were surgical — only &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;tools&lt;/code&gt;, and &lt;code&gt;tags&lt;/code&gt; changed; every body and all other frontmatter byte-identical across all 317. Splitting it this way meant the risky, judgment-heavy narrowing happened against an already-green baseline, so any regression had a single obvious cause instead of being buried in the same diff that introduced the gate.&lt;/p&gt;
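
&lt;p&gt;A claim like "only three fields changed" is itself checkable. The sketch below is hypothetical (the function name and message strings are mine) and compares parsed frontmatter values plus the raw body text. That is weaker than a true byte-for-byte comparison of the frontmatter, but it shows the shape of the check.&lt;/p&gt;

```python
# Illustrative only: a hypothetical guard for the A-grade pass.
ALLOWED_TO_CHANGE = {"description", "tools", "tags"}
_MISSING = object()  # sentinel so an added or removed field counts as a change


def unexpected_changes(before_fm, after_fm, before_body, after_body):
    """List everything that changed outside the three fields the pass may touch."""
    problems = []
    if before_body != after_body:
        problems.append("body changed")
    for key in sorted(set(before_fm) | set(after_fm)):
        if key in ALLOWED_TO_CHANGE:
            continue
        if before_fm.get(key, _MISSING) != after_fm.get(key, _MISSING):
            problems.append(f"frontmatter field '{key}' changed")
    return problems
```

&lt;p&gt;An empty list for every agent is the machine-checked version of "surgical".&lt;/p&gt;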

&lt;p&gt;And here's the gap that remained. After the A-grade pass, every allowlist was structurally valid &lt;em&gt;and&lt;/em&gt; least-privilege — and still nothing proved any of them matched the body. Least privilege you assert by hand is just a tighter comment. A smaller allowlist that the body contradicts is, if anything, &lt;em&gt;more&lt;/em&gt; dangerous than an over-broad one: the tighter you draw the grant, the more likely an honest behavior in the body falls outside it and gets blocked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The check: body-vs-allowlist consistency (schema 3.11.0)
&lt;/h2&gt;

&lt;p&gt;3.11.0 is the step that closes the gap. &lt;code&gt;validate_agent()&lt;/code&gt; stopped reading only the frontmatter and started reading the body too, then cross-checking what the body &lt;em&gt;does&lt;/em&gt; against what the allowlist &lt;em&gt;grants&lt;/em&gt;. The structure, paraphrased:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Illustrative — the shape of the body-vs-allowlist consistency check.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_agent_body_vs_allowlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frontmatter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Example code isn't behavior. Strip fenced ```
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="err"&gt;…&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="sb"&gt;``&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt; &lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;
    &lt;span class="c1"&gt;# so a tool shown in a usage snippet doesn't read as a tool call.
&lt;/span&gt;    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;strip_fenced_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;declared&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mcp_tools_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frontmatter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;     &lt;span class="c1"&gt;# mcp__server__tool entries
&lt;/span&gt;    &lt;span class="n"&gt;fq_in_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_fully_qualified_mcp_refs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# mcp__server__tool used in prose
&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warnings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# CHECK 3 (ERROR): zero MCP tools declared, but the body calls them.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;fq_in_body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body references MCP tools but the allowlist declares &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none — every call would be runtime-blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# CHECK 1 (ERROR): a specific tool the body invokes isn't on the allowlist.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fq_in_body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body invokes &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; but it is not in the allowlist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# CHECK 2 (WARN): heuristic short-name mention, only on MCP-oriented agents.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;backtick_verb_camelcase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;matches_any_declared&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body mentions `&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;` (tool-call-shaped) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;but no declared tool matches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# WARN: declared but never used — over-declared privilege / drift.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fq_in_body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;declares &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; but the body never uses it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warnings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The severities are sized to the confidence, and that's the part that makes it a usable gate rather than a noise machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A fully-qualified &lt;code&gt;mcp__server__tool&lt;/code&gt; reference in the body that isn't on the allowlist is an ERROR.&lt;/strong&gt; That string is unambiguous — it's the exact runtime name, it doesn't appear by accident, and if it's in the body and not the allowlist, the runtime &lt;em&gt;will&lt;/em&gt; block it. High confidence, hard failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An agent whose allowlist declares zero MCP tools but whose body still calls them is an ERROR.&lt;/strong&gt; Same reasoning, whole-agent scale: every MCP call the agent makes is dead on arrival.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A backtick short-name that looks tool-call-shaped is a WARN, and only on MCP-oriented agents.&lt;/strong&gt; This is the heuristic, and it's deliberately timid. An earlier, broader version of this check would have flagged &lt;code&gt;getStaticProps&lt;/code&gt; and every other &lt;code&gt;verbCamelCase&lt;/code&gt; token in a body as a phantom tool. So the short-name heuristic is gated to agents that actually deal in MCP tools, and it only warns — because a check that cries wolf gets switched off, and a switched-off check protects nothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two engineering details earn their place. &lt;strong&gt;Fenced code is stripped before any matching&lt;/strong&gt; — a tool shown inside a triple-backtick usage example is documentation of how to call it, not the agent calling it, so it shouldn't trip the gate. And the strong signal (exact &lt;code&gt;mcp__&lt;/code&gt; strings) carries the errors while the weak signal (short names) carries only warnings. Both choices come from the same instinct: a precise gate that fails hard on certainty and merely whispers on a guess is one people leave on.&lt;/p&gt;
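
&lt;p&gt;The illustrative check earlier leaves its helpers undefined. One way &lt;code&gt;strip_fenced_code&lt;/code&gt; and &lt;code&gt;find_fully_qualified_mcp_refs&lt;/code&gt; might be written is with two regular expressions. The patterns below are my assumptions about what counts as a fence and as a tool name, not the validator's real ones, and they handle only simple, non-nested fences.&lt;/p&gt;

```python
import re

# Built from single backticks so this sketch contains no literal fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)

# Assumed shape: mcp__, a server name, a double underscore, a tool name.
MCP_REF = re.compile(r"\bmcp__[\w-]+?__[\w-]+")


def strip_fenced_code(body):
    """Drop fenced example blocks so tools shown in snippets are not counted."""
    return FENCED_BLOCK.sub("", body)


def find_fully_qualified_mcp_refs(body):
    """Return the set of mcp__server__tool names mentioned in the text."""
    return set(MCP_REF.findall(body))
```

&lt;p&gt;With these, a tool name that appears only inside a fenced example disappears before matching, while the same name in running prose is still found. A token like &lt;code&gt;getStaticProps&lt;/code&gt; never matches, because it lacks the &lt;code&gt;mcp__&lt;/code&gt; prefix.&lt;/p&gt;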

&lt;h2&gt;
  
  
  The result: one real defect, found by machine
&lt;/h2&gt;

&lt;p&gt;The check ran against all 317 agents and surfaced a genuine pre-existing defect that had been sitting in the repo, green, the whole time. &lt;code&gt;data-collector.md&lt;/code&gt; invoked six &lt;code&gt;mcp__umami__*&lt;/code&gt; tools in its body and declared only &lt;code&gt;Read&lt;/code&gt; in its allowlist.&lt;/p&gt;

&lt;p&gt;Read that again in runtime terms: the agent whose entire purpose is collecting analytics data could not make a single analytics call. Every one would have been blocked. It wasn't doc-drift — it was a latent outage the frontmatter-only schema had no way to see. The fix was to declare the six tools, because the agent genuinely needed them.&lt;/p&gt;

&lt;p&gt;That single catch is the whole argument for the check. One agent, wrong in a way that looked fine, in a fleet too large to read by hand, found deterministically the moment the validator was taught to compare declaration against behavior. And notice it was an &lt;em&gt;under&lt;/em&gt;-declaration — the body asked for more than the allowlist granted. A privilege auditor that only worried about agents asking for too much would have walked right past it. The check has to run both directions because the silent runtime failure lives in the direction most reviewers don't think to look. Nine new unit tests pin the check's behavior; all 317 agents validate to zero agent-level errors.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface&lt;/th&gt;
&lt;th&gt;What it claims&lt;/th&gt;
&lt;th&gt;What enforces it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;tools&lt;/code&gt; frontmatter&lt;/td&gt;
&lt;td&gt;the agent's allowed capability&lt;/td&gt;
&lt;td&gt;the runtime — undeclared tools are blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agent body&lt;/td&gt;
&lt;td&gt;the agent's actual behavior&lt;/td&gt;
&lt;td&gt;the model, following the prose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;consistency gate (3.11.0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;declaration and behavior agree&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CI — ERROR on a fully-qualified mismatch&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The principle: declared capability must be machine-checked against behavior
&lt;/h2&gt;

&lt;p&gt;Strip away the specifics and this is the same boundary lesson as &lt;a href="https://dev.to/posts/the-api-is-the-real-boundary/"&gt;an MCP server's write gate&lt;/a&gt;. There, the client-side tool gate was UX and the server-side gate was the boundary, and the mistake was confusing the two. Here, the allowlist is a declaration and the body is the behavior, and the mistake is assuming they agree because you wrote them in the same file.&lt;/p&gt;

&lt;p&gt;They don't agree until a check makes them. A declaration nobody verifies decays into a comment — and a comment that's also a runtime gate is a latent failure waiting for the worst moment to surface.&lt;/p&gt;

&lt;p&gt;The move generalizes past Claude Code agents. Any place a system declares a capability separately from where it exercises it — an RBAC scope versus the endpoints it guards, a capability manifest versus the syscalls a process makes, a type annotation versus what a dynamically-typed function actually returns, an OpenAPI contract versus the handler — has this same fault line. The declaration is cheap to write and ages into a liability the day it stops matching the behavior, because nobody was checking.&lt;/p&gt;

&lt;p&gt;The fix is never "be more careful when you write the declaration." Care doesn't survive contact with 317 files and a year of edits. The fix is a deterministic check that reads both sides and fails when they diverge — and it has to live in CI, not in a reviewer's discipline, because the whole problem is that the discipline doesn't scale. Once the check exists, the property it guards stops being a thing you hope is true and becomes a thing that can't merge while false.&lt;/p&gt;

&lt;p&gt;Scale is what turns this from nice-to-have into non-negotiable. At one agent, the author is the auditor and the body fits on a screen. At 317, the only auditor that scales is a check in CI that reads every body against every allowlist on every push — and that's the difference between asserting your agents are correct and proving it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;The body check is heuristic, not an AST. It matches text patterns — fully-qualified &lt;code&gt;mcp__server__tool&lt;/code&gt; strings and backtick &lt;code&gt;verbCamelCase&lt;/code&gt; mentions — so it cannot catch a tool invoked through a dynamically constructed string in the body, and it scopes the short-name heuristic narrowly on purpose.&lt;/p&gt;

&lt;p&gt;That's a deliberate trade of recall for precision: a gate that throws false ERRORs gets disabled, so the certain signal fails hard and the uncertain signal only warns. Strong over complete.&lt;/p&gt;

&lt;p&gt;Stripping fenced code means a tool that appears &lt;em&gt;only&lt;/em&gt; inside an example block is invisible to the check. Acceptable, because an example isn't behavior — but worth naming, because it's the one place a real call could hide.&lt;/p&gt;

&lt;p&gt;And the check bites hardest on MCP tools specifically, because their fully-qualified namespace makes them matchable with confidence. A plain &lt;code&gt;Bash&lt;/code&gt;-vs-&lt;code&gt;Read&lt;/code&gt; mismatch isn't gated the same way, since those canonical tools don't carry an unambiguous call signature in prose. That's fine by design: the MCP surface is the privileged, high-blast-radius one, and that's exactly where you want the allowlist to be provably honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Also shipped
&lt;/h2&gt;

&lt;p&gt;The same day, the team knowledge-base work continued on a parallel track: the qmd adapter got a native FTS5/BM25 keyword backend with no external binary, a retrieval eval harness reporting Recall@10 and nDCG@10, and SHA-256 pinning of the retrieval-model weights that fails closed if the hash doesn't match. Plus a round of CI cleanup across the plugins repo. Each is its own thread; here they're the backdrop to a day whose lead story was making 317 agents' declared tools provably match what they actually do.&lt;/p&gt;
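
&lt;p&gt;For readers unfamiliar with fail-closed hash pinning, here is a generic sketch using Python's standard &lt;code&gt;hashlib&lt;/code&gt;. It is not the adapter's code; the function name and error message are hypothetical. The only point is the shape: compute the digest, compare it to the pinned value, and refuse to continue on a mismatch.&lt;/p&gt;

```python
import hashlib


def verify_pinned_sha256(path, expected_hex, chunk_size=1024 * 1024):
    """Raise unless the file at path hashes to expected_hex; return None on success."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large weight files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hex.lower():
        # Fail closed: stop here rather than load unverified weights.
        raise RuntimeError(f"weights hash mismatch for {path}: got {actual}")
```

&lt;p&gt;Raising instead of logging a warning is what makes this fail closed: the caller cannot reach the load step without passing the check.&lt;/p&gt;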

&lt;p&gt;&lt;strong&gt;Related posts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/the-api-is-the-real-boundary/"&gt;MCP Server Auth: The API Is the Real Boundary&lt;/a&gt; — the same lesson on a different surface: the declared gate and the enforced gate are not the same thing, and confusing them is how security theater ships.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/governed-second-brain-local-first-mcp/"&gt;A Second Brain You Can Audit Beats One You Must Trust&lt;/a&gt; — replace every promise with a command that checks it, applied to a governed knowledge store.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/when-green-ci-proves-nothing/"&gt;Green CI Proves Nothing: Why Your Tests Gate Zero Calls&lt;/a&gt; — the same instinct turned on tests: a green check that verifies nothing is worse than no check at all.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>agents</category>
      <category>validation</category>
      <category>qualitygates</category>
    </item>
    <item>
      <title>A Second Brain You Can Audit Beats One You Must Trust</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Fri, 26 Jun 2026 15:45:25 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/a-second-brain-you-can-audit-beats-one-you-must-trust-4ng5</link>
      <guid>https://dev.to/jeremy_longshore/a-second-brain-you-can-audit-beats-one-you-must-trust-4ng5</guid>
      <description>&lt;p&gt;Most "second brain" products ask for the same thing in the end-user license agreement: trust us. Trust that your notes stay where you put them. Trust that the model only saw what we told you it saw. Trust that the record we keep is the record that happened. None of those are verifiable from the outside — they're promises, backed by a privacy policy and a logo.&lt;/p&gt;

&lt;p&gt;The Governed Second Brain plugin shipped this day inverts that contract. It's a local-first, in-process MCP server — read and write, daemon-free — that you install over a folder of your own files and query inside Claude Code.&lt;/p&gt;

&lt;p&gt;The whole design is organized around one substitution: replace &lt;em&gt;trust me&lt;/em&gt; with &lt;em&gt;check it yourself&lt;/em&gt;. Where a hosted RAG product would say "your data is safe," this says "here is the command that proves what ran, what left, what changed, and what you installed." Four audit questions, four design choices. This post is about why each one is structural and not a feature you could bolt onto a trust-me product after the fact.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: governed knowledge can't run on faith
&lt;/h2&gt;

&lt;p&gt;Retrieval over a personal or team knowledge base is a solved demo and an unsolved product. The demo is easy: embed some documents, do nearest-neighbor search, stuff the hits into a prompt. The product is hard because the moment the knowledge is &lt;em&gt;governed&lt;/em&gt; — client material under contract, regulated data, an institutional memory other people rely on — the interesting questions stop being about retrieval quality and start being about provenance.&lt;/p&gt;

&lt;p&gt;Second brains attract the trust-me default more than almost any other category, and the reason is a mismatch. The data is the most sensitive a person owns — private notes, half-formed ideas, client material under NDA — while the architecture most products wrap around it is the most opaque: a hosted index, a sync loop, an embedding API, all behind a dashboard. You are asked to put your most guarded material into the system you can see the least of. That trade is fine for a to-do list and unacceptable for governed knowledge.&lt;/p&gt;

&lt;p&gt;Three of the questions that follow have no good answer in a hosted, trust-me architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What actually ran against my files?&lt;/strong&gt; A cloud service runs code you can't see on data you can't withhold. "Local-first" in marketing copy frequently means "we cache locally and sync everything anyway."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What left the machine?&lt;/strong&gt; RAG pipelines egress document text to an embedding or completion API by default. The user usually finds out which bytes left by reading a SOC 2 report, not by reading their own logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Did my record of what happened get quietly rewritten?&lt;/strong&gt; This is the subtle one. A knowledge store accumulates a history — what was promoted, transitioned, superseded. If that history can be edited after the fact and nobody can tell, then "audit trail" is a decoration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three plus a fourth — what did I install — are the questions a governed knowledge store has to answer to be worth governing. The rest of this post is the four answers, each a design choice rather than a disclosure.&lt;/p&gt;

&lt;p&gt;The plugin's umbrella repo got renamed this day from &lt;em&gt;Compile-Then-Govern&lt;/em&gt; to &lt;strong&gt;Governed Second Brain&lt;/strong&gt; with an explicit commit note: "Apache-2.0, honest audit claim." That second clause is the thesis in three words. The rename forced the README to stop overstating what the audit does and start stating exactly what it does — because the previous claim turned out to be falsifiable, and an audit feature you can falsify is worse than none.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choice 1 — local-first and daemon-free: you can see what runs
&lt;/h2&gt;

&lt;p&gt;The plugin is an in-process MCP server. There is no background daemon, no &lt;code&gt;:3847&lt;/code&gt; HTTP service to keep alive, no separate process holding your data open. The earlier incarnation of this system split the write path across a long-running curator daemon; folding it in-process removed a whole class of "is the service up, is it the right version, what is it doing while idle" questions.&lt;/p&gt;

&lt;p&gt;The full loop runs in one process: capture a fact, govern it through dedupe and policy, promote it, write the hash-chained audit entry, export, index, and answer a query with a &lt;code&gt;qmd://&lt;/code&gt; citation pointing back at the source. That entire path — capture → govern → promote → audit → index → cited search — executes against local SQLite and a local vector index with no service in the middle. The verification criterion for the first phase was exactly this loop returning a cited hit from local data, with the audit chain verifying intact afterward. No daemon ran; nothing was waiting in the background to be trusted.&lt;/p&gt;

&lt;p&gt;Daemon-free is an auditability choice before it's an operability one. A background service is a thing that runs when you aren't looking. An in-process tool runs only when a tool call invokes it, inside a session you initiated, and stops when the call returns. The surface you have to reason about is the tool list, not a process tree. "What runs against my files" collapses to "what tools did this session call" — which is a question the transcript already answers, without you having to take anyone's word for it.&lt;/p&gt;

&lt;p&gt;This is the difference between "local-first" as an architecture and "local-first" as a marketing word. Plenty of products that wear the label cache locally and sync everything to a server anyway; the local copy is a convenience, not a boundary. Here the local copy &lt;em&gt;is&lt;/em&gt; the system — there is no server-side index to drift out of sync with, because there is no server. The data plane and the trust plane are the same machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choice 2 — no egress by default: you can see what leaves
&lt;/h2&gt;

&lt;p&gt;Building the brain has two modes, and the default is the conservative one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build a governed brain from a folder — zero LLM egress.&lt;/span&gt;
npx governed-second-brain init ./notes &lt;span class="nt"&gt;--index-only&lt;/span&gt;

&lt;span class="c"&gt;# Opt into the full compile (document text egresses to the model).&lt;/span&gt;
npx governed-second-brain init ./notes        &lt;span class="c"&gt;# pre-flight consent first&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;init &amp;lt;folder&amp;gt; --index-only&lt;/code&gt; builds the entire brain — index, governance, cited retrieval — with &lt;strong&gt;zero egress&lt;/strong&gt;. No document text leaves the machine; the index is built locally and queries resolve locally. This is the mode for regulated or client data, and it is the default precisely so that "local-first" cannot quietly imply "nothing leaves" while a compile step ships your files to an API behind your back.&lt;/p&gt;

&lt;p&gt;The richer mode is the full ICO (Intentional Cognition OS) compile, which runs on DeepSeek as an opt-in egress path. That mode genuinely sends document text to a model, and the design's honesty rule is that this fact is loud, not buried. The egress mode is gated behind a pre-flight consent step that states plainly what it will do: it reads every file under the target folder, runs local tooling, and (in this mode only) sends document text to the model. Consent is a step you pass through, not a flag you forget you set.&lt;/p&gt;

&lt;p&gt;This is also why the default is index-only rather than full-compile. Defaults are the decisions users don't make, so the default has to be the one that's safe to not think about. The README is correspondingly forbidden from letting "local-first" do the work of implying "air-gapped." Two modes, one honest sentence each: index-only sends nothing; full-compile sends document text to the model, and asks first.&lt;/p&gt;

&lt;p&gt;The pre-flight summary is itself an auditable artifact. It is a printed contract — these are the files I will read, these are the tools I will run, this is what I will send and to whom — presented before you approve, not logged after you've already lost the choice. A consent screen you can't decline is a notification; a consent screen that gates the action is a decision point. This one gates.&lt;/p&gt;
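&lt;p&gt;A minimal sketch of that gate, assuming a hypothetical &lt;code&gt;PreflightSummary&lt;/code&gt; shape and a hypothetical &lt;code&gt;runWithConsent&lt;/code&gt; helper (neither is the plugin's real API). It shows the one property that matters: the summary is printed first, and the action runs only on an explicit approval.&lt;/p&gt;

```typescript
// Hypothetical shapes for illustration only; the real CLI's types are not shown in this post.
type PreflightSummary = {
  filesToRead: string[];
  toolsToRun: string[];
  egress: { sendsDocumentText: boolean; destination: string };
};

type Decision = "approve" | "decline";

// Render the printed contract: what will be read, what will run, what will be sent.
function renderSummary(s: PreflightSummary): string {
  const lines = [
    `Files to read (${s.filesToRead.length}):`,
    ...s.filesToRead.map((f) => `  ${f}`),
    `Tools to run: ${s.toolsToRun.join(", ")}`,
    s.egress.sendsDocumentText
      ? `Egress: document text will be sent to ${s.egress.destination}`
      : "Egress: none",
  ];
  return lines.join("\n");
}

// The gate: the action runs only after an explicit approval. Anything else declines.
function runWithConsent(
  summary: PreflightSummary,
  ask: (printed: string) => Decision,
  action: () => string,
): { ran: boolean; output: string | null } {
  const decision = ask(renderSummary(summary));
  if (decision !== "approve") {
    return { ran: false, output: null };
  }
  return { ran: true, output: action() };
}

// Illustrative values; the destination string is a placeholder.
const summary: PreflightSummary = {
  filesToRead: ["notes/a.md", "notes/b.md"],
  toolsToRun: ["local-indexer"],
  egress: { sendsDocumentText: true, destination: "the configured model API" },
};

let compileCalls = 0;
const compile = () => {
  compileCalls += 1;
  return "compiled";
};

const declined = runWithConsent(summary, () => "decline", compile);
const approved = runWithConsent(summary, () => "approve", compile);
console.log(declined, approved, compileCalls);
```

&lt;p&gt;The design choice worth noticing is that the decline path never touches the action at all. The compile callback is only reachable through the approval branch, which is what separates a decision point from a notification.&lt;/p&gt;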

&lt;p&gt;The point isn't that egress is bad. It's that egress should be a decision the user makes with the facts in front of them, not a default they discover later in someone else's compliance document. Auditability of &lt;em&gt;what leaves&lt;/em&gt; means the answer is knowable before the bytes move, not reconstructable after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choice 3 — the external audit-chain anchor: you can see what changed
&lt;/h2&gt;

&lt;p&gt;This is the design choice that earned the "honest audit claim" rebrand, because the original claim was wrong in an instructive way.&lt;/p&gt;

&lt;p&gt;The knowledge store keeps a hash-chained audit log: every governance event (a memory promoted, transitioned, superseded) is an entry whose hash folds in the hash of the previous entry. The intuition is the familiar one — change any record and the chain breaks, so the log is tamper-evident. That intuition is half right, and the missing half is the whole problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative — what a self-contained hash chain does and does not catch.
chain:  e0 → e1 → e2 → e3      entry_hash = H(row || prev_entry_hash)

# Edit e1 in place but DON'T re-hash e2, e3  →  chain breaks  →  caught. Good.
# Silently REWRITE the whole chain (recompute every hash)    →  internally
#   consistent again  →  self-verify still passes.            →  NOT caught.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A self-contained chain only proves &lt;em&gt;internal&lt;/em&gt; consistency. It catches a sloppy edit that doesn't recompute everything downstream. It does &lt;strong&gt;not&lt;/strong&gt; catch a wholesale rewrite, because a rewrite recomputes every hash and produces a new chain that is perfectly consistent with itself. A local writer can throw away the history, fabricate a clean replacement, and &lt;code&gt;verify&lt;/code&gt; returns green. Calling that "tamper-proof" was the overstatement the rebrand corrected; the accurate word is tamper-&lt;em&gt;detection&lt;/em&gt;, and only against in-place edits.&lt;/p&gt;

&lt;p&gt;The external anchor closes the rewrite gap. The mechanism is two commits working as a pair: &lt;strong&gt;&lt;code&gt;govern&lt;/code&gt; commits the chain head to an external anchor log&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;brain_audit_verify&lt;/code&gt; checks the live chain head against that anchored value.&lt;/strong&gt; The companion store change in &lt;code&gt;qmd-team-intent-kb&lt;/code&gt; is described exactly this way — "external anchor log for the audit chain (detect silent full rewrites)."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative — the anchor is an outside witness to the chain head.
govern:               anchor.log  ←  head_hash(e3) = "9f66…"   (committed externally)
brain_audit_verify:   live_head == anchored_head ?  intact  :  SILENT REWRITE DETECTED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A wholesale rewrite is internally consistent, but it produces a &lt;em&gt;different head hash&lt;/em&gt; than the one already committed to the anchor. The anchor is an outside witness: it remembers what the head was, so a fresh chain that disagrees with it is caught even though the fresh chain agrees with itself.&lt;/p&gt;
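&lt;p&gt;The two diagrams above can be sketched as runnable code. This is an illustration under stated assumptions, not the store's implementation: the &lt;code&gt;Entry&lt;/code&gt; shape, the SHA-256 choice, the genesis value, and every function name here are hypothetical. It demonstrates the three cases: a sloppy in-place edit fails self-verify, a wholesale rewrite passes self-verify, and the same rewrite fails once the head is compared to an anchored value.&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape for illustration; not the store's real schema.
type Entry = { row: string; prevHash: string; hash: string };

const GENESIS = "0".repeat(64);

// entry_hash = H(row || prev_entry_hash), as in the diagram above.
// A real implementation would want an unambiguous encoding, not bare concatenation.
function entryHash(row: string, prevHash: string): string {
  return createHash("sha256").update(row).update(prevHash).digest("hex");
}

function buildChain(rows: string[]): Entry[] {
  const chain: Entry[] = [];
  let prev = GENESIS;
  for (const row of rows) {
    const hash = entryHash(row, prev);
    chain.push({ row, prevHash: prev, hash });
    prev = hash;
  }
  return chain;
}

// Self-verify proves internal consistency only.
function selfVerify(chain: Entry[]): boolean {
  let prev = GENESIS;
  for (const e of chain) {
    if (e.prevHash !== prev) return false;
    if (e.hash !== entryHash(e.row, prev)) return false;
    prev = e.hash;
  }
  return true;
}

function headHash(chain: Entry[]): string {
  return chain.length === 0 ? GENESIS : chain[chain.length - 1].hash;
}

// Anchored verify: internally consistent AND the head matches the outside witness.
function anchoredVerify(chain: Entry[], anchoredHead: string): boolean {
  if (!selfVerify(chain)) return false;
  return headHash(chain) === anchoredHead;
}

const original = buildChain(["promote m1", "transition m1", "supersede m1"]);
const anchoredHead = headHash(original); // stands in for the externally committed value

// Case 1: edit one row in place without re-hashing anything downstream.
const sloppy = original.map((e) => ({ ...e }));
sloppy[1].row = "transition m1 (edited)";

// Case 2: throw the history away and rebuild a fully consistent replacement.
const rewritten = buildChain(["promote m1", "transition m1 (edited)", "supersede m1"]);

const results = {
  sloppySelfVerify: selfVerify(sloppy),
  rewriteSelfVerify: selfVerify(rewritten),
  rewriteAnchoredVerify: anchoredVerify(rewritten, anchoredHead),
  originalAnchoredVerify: anchoredVerify(original, anchoredHead),
};
console.log(results);
```

&lt;p&gt;In this sketch the anchored head lives in the same process as the chain, which is exactly the arrangement that provides no protection in practice. The comparison only means something when the anchored value is stored somewhere the chain's writer cannot also rewrite.&lt;/p&gt;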

&lt;p&gt;The trust model then gets stated honestly per mode rather than as one blanket claim. A local single-writer install gets integrity, ordering, and rewrite-detection — that is genuinely what the chain plus anchor deliver, and it's a real guarantee worth having. Non-repudiation across multiple distrusting actors is a different, stronger property that needs keyed signatures and a tamper-resistant anchor, and the local mode does not pretend to it. Stating the weaker-but-true claim for the common case, and naming where the stronger claim would require more machinery, is the honesty the rebrand was about. The forbidden-words list is short and deliberate: tamper-proof, immutable, blockchain. The feature is real; the claim is now sized to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choice 4 — reproducible, provenanced install: you can see what you installed
&lt;/h2&gt;

&lt;p&gt;Auditability that stops at runtime leaves the biggest hole open: the install itself. A second brain you can audit, delivered by a supply chain you can't, is still trust-me — you've just moved the trust from the data plane to the package manager.&lt;/p&gt;

&lt;p&gt;So the install is pinned and provenanced end to end:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gsb.lock.json&lt;/code&gt;&lt;/strong&gt; pins the exact version tuple — the plugin runtime, the ICO compiler, and qmd (pinned to the canonical &lt;code&gt;2.5.3&lt;/code&gt;). The spool contract between compile and govern is a versioned wire format, so the lockfile owns a single known-good combination rather than letting three components drift independently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A hermetic full-chain CI smoke&lt;/strong&gt; runs the entire capture-to-cited-search loop against that pinned tuple before any release, with a real corpus and a real audit verify. The reproducible build is &lt;em&gt;tested as a whole&lt;/em&gt;, not asserted component by component.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm provenance&lt;/strong&gt; is generated by the CI release workflow, so the published package carries a verifiable link back to the commit and workflow that built it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The package ships only &lt;code&gt;plugin-runtime/governed-brain.cjs&lt;/code&gt;&lt;/strong&gt;, not a vendored &lt;code&gt;node_modules&lt;/code&gt;. A stripped, self-contained runtime artifact is one file you can inspect, not a dependency forest you have to take on faith. (An early publish accidentally externalized dependencies that weren't bundled — an inert package — which is exactly the failure a single self-contained artifact prevents.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tuple matters more than a normal lockfile because the three components talk to each other over a versioned wire format. The compile stage produces a spool that the govern stage consumes; that spool contract can change shape between versions. Pinning the plugin runtime, the compiler, and qmd as a single tuple — rather than letting each float to its own latest — is what keeps the contract coherent. The lockfile owns a combination that was proven to work together, not three independently-blessed versions that happen to be installed at the same time.&lt;/p&gt;

&lt;p&gt;These read like packaging chores. They're the fourth audit axis. "What did I install" has a verifiable answer — a provenanced package, a pinned tuple, one inspectable runtime file, a CI smoke that proves the whole chain works at those versions — instead of "whatever npm resolved today."&lt;/p&gt;
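&lt;p&gt;As a rough sketch of what "pinned as a single tuple" means in practice, the check below compares an installed combination against a pinned one and reports any component that has drifted. The &lt;code&gt;VersionTuple&lt;/code&gt; field names and the &lt;code&gt;tupleMismatches&lt;/code&gt; helper are hypothetical, since the real &lt;code&gt;gsb.lock.json&lt;/code&gt; schema is not shown here, and the compiler version is a placeholder. Only &lt;code&gt;0.1.4&lt;/code&gt; and &lt;code&gt;2.5.3&lt;/code&gt; come from this post.&lt;/p&gt;

```typescript
// Hypothetical lock shape; the real gsb.lock.json schema is not shown in this post.
type VersionTuple = { pluginRuntime: string; icoCompiler: string; qmd: string };

// Exact-match comparison: a pinned tuple is one known-good combination, not a range.
function tupleMismatches(pinned: VersionTuple, installed: VersionTuple): string[] {
  const keys: (keyof VersionTuple)[] = ["pluginRuntime", "icoCompiler", "qmd"];
  return keys
    .filter((k) => pinned[k] !== installed[k])
    .map((k) => `${k}: pinned ${pinned[k]}, installed ${installed[k]}`);
}

// Illustrative lockfile contents. The icoCompiler version is a made-up placeholder.
const lockText = '{"pluginRuntime":"0.1.4","icoCompiler":"1.0.0","qmd":"2.5.3"}';
const pinned: VersionTuple = JSON.parse(lockText);

// One component has floated to a newer version on its own.
const drifted: VersionTuple = { pluginRuntime: "0.1.4", icoCompiler: "1.0.0", qmd: "2.6.0" };

const cleanReport = tupleMismatches(pinned, pinned);
const driftReport = tupleMismatches(pinned, drifted);
console.log(cleanReport, driftReport);
```

&lt;p&gt;Treating any single mismatch as a failure of the whole tuple is the point: the spool contract was proven against one combination, so a drift in one component invalidates the guarantee for all three.&lt;/p&gt;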

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;By end of day the plugin was live as &lt;code&gt;npx governed-second-brain init&lt;/code&gt;, with version strings aligned at &lt;code&gt;0.1.4&lt;/code&gt; (a finding the &lt;code&gt;validate-plugin&lt;/code&gt; gate surfaced) and the MCP server auto-registering with Claude Code, so a fresh install can run &lt;code&gt;/brain "..."&lt;/code&gt; against its own data without manual config. The umbrella repo carries the productization epic and its phased child work; the audit-chain anchor moved from "on the roadmap" to "implemented" in the same pass that shipped it.&lt;/p&gt;

&lt;p&gt;The auto-registration is a small thing that matters for the thesis. The installer doesn't hand you a block of &lt;code&gt;.mcp.json&lt;/code&gt; to paste and hope you got right — it registers the server with Claude Code itself, prints a first &lt;code&gt;/brain&lt;/code&gt; query and a capture example, and the loop is live. An honest, checkable system that's also annoying to wire up loses to a trust-me product that just works; closing the setup gap is what lets the auditable option win on convenience too, not only on principle.&lt;/p&gt;

&lt;p&gt;The honest summary is the four audit questions, each with a command behind it rather than a promise:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Audit question&lt;/th&gt;
&lt;th&gt;Design choice&lt;/th&gt;
&lt;th&gt;How you check it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What runs against my files?&lt;/td&gt;
&lt;td&gt;Local-first, daemon-free in-process MCP&lt;/td&gt;
&lt;td&gt;The session transcript — no background process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What leaves the machine?&lt;/td&gt;
&lt;td&gt;No-egress &lt;code&gt;--index-only&lt;/code&gt; default; opt-in compile with consent&lt;/td&gt;
&lt;td&gt;The mode you chose, stated before bytes move&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Did my history get rewritten?&lt;/td&gt;
&lt;td&gt;External audit-chain anchor&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;brain_audit_verify&lt;/code&gt; vs. the committed head&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What did I install?&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gsb.lock.json&lt;/code&gt; pin + npm provenance + single &lt;code&gt;.cjs&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;The lockfile, the provenance, one inspectable file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The principle: replace the promise with the command
&lt;/h2&gt;

&lt;p&gt;Strip away the specifics and the design is one move applied four times. Every place a hosted product would hand you a promise, this hands you a command.&lt;/p&gt;

&lt;p&gt;"Your data stays local" is a promise. "There is no daemon and the loop runs in-process" is a fact you can confirm from the tool list. "We only access what you authorize" is a promise. "Index-only egresses nothing; the other mode asks first" is a behavior you can observe before any bytes move. "We keep an immutable audit log" is a promise, and in this project's case one that turned out to be false as first worded. "&lt;code&gt;brain_audit_verify&lt;/code&gt; checks the live chain head against an externally committed anchor" is a command that returns a verdict. "Our package is trustworthy" is a promise. "Here is the provenance, the pinned tuple, and the one file we ship" is a manifest you can read.&lt;/p&gt;

&lt;p&gt;The pattern generalizes past second brains. Any system that touches data you don't fully control faces the same fork: it can ask you to trust an unobservable property, or it can expose an artifact that makes the property checkable. The first is cheaper to build and ages into a liability the day someone asks you to prove it. The second costs more up front — an anchor log, a consent gate, a lockfile, a provenance step — and the cost buys you a claim that survives scrutiny.&lt;/p&gt;

&lt;p&gt;The discipline that made this honest was the willingness to &lt;em&gt;shrink&lt;/em&gt; a claim to fit its mechanism. The original "tamper-proof" language was aspirational; the mechanism underneath only delivered tamper-detection, and only against careless edits. Rather than build toward the bigger claim or quietly keep the wrong word, the rebrand sized the words to the substrate — tamper-detection where the chain earns it, rewrite-detection where the anchor earns it, and a short list of words the system is forbidden from using because it hasn't earned them. A claim you can defend is worth more than a claim that sounds better, and the gap between the two is exactly where trust-me products live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;None of this is free, and the costs are worth naming. The external anchor adds a write on every govern and a check on every verify — cheap, but it means the anchor log is now load-bearing state the system has to keep somewhere durable. And it raises an honest recursion: where do you anchor the anchor? An anchor log that lives next to the chain it witnesses is no witness at all — a rewriter who can touch one can touch both. The substrate that makes it real is an external, append-only, shared history that the local writer doesn't unilaterally control — committing the head into git history is the cheap version, a timestamping service the stronger one. The anchor is only as trustworthy as the place it's committed to, and naming that ceiling is part of the honest claim.&lt;/p&gt;

&lt;p&gt;Index-only mode trades retrieval richness for zero egress; the fuller compile is genuinely better at some queries, which is why it exists and why the consent prompt has to be honest rather than discouraging. The in-process design has its own ceiling: there's no shared service, so a brain that a whole team queries at once wants the hosted path. That path is deliberately out of scope here, a client opt-in rather than the default, because hosting reintroduces exactly the trust surface this version removed. A single pinned tuple in &lt;code&gt;gsb.lock.json&lt;/code&gt; buys reproducibility at the cost of staleness: the known-good combination is known-good until a security fix lands upstream and the pin has to move deliberately. Each trade leans the same direction: a little more ceremony in exchange for a claim you can actually back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Also shipped
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;contributing-clanker&lt;/strong&gt; cut its v0.2.0 release with new contribution-quality gates and a versioned product-landing README. &lt;strong&gt;intent-outreach&lt;/strong&gt; got a ground-up rebuild — provider-pluggable LLM seam, a deterministic pipeline behind an MCP server, the GCP stack retired — and reached marketplace grade B on &lt;code&gt;validate-skillmd&lt;/code&gt;. Both are their own posts; here they're context for a busy day whose lead story was making a knowledge store you can check instead of one you have to believe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related posts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/the-api-is-the-real-boundary/"&gt;MCP Server Auth: The API Is the Real Boundary&lt;/a&gt; — the same knowledge system, and why the governance audit trail must stay pure for verification to mean anything.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/governed-knowledge-two-releases-freshness-daemon/"&gt;Governed Knowledge: Two Releases, a Freshness Daemon, and Export Gating&lt;/a&gt; — earlier groundwork on governing a knowledge corpus rather than just retrieving from it.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/when-green-ci-proves-nothing/"&gt;Green CI Proves Nothing: Why Your Tests Gate Zero Calls&lt;/a&gt; — the same instinct applied to tests: a green check that verifies nothing is worse than no check.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>aiengineering</category>
      <category>localfirst</category>
      <category>rag</category>
    </item>
    <item>
      <title>Green CI Proves Nothing: Why Your Tests Gate Zero Calls</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Fri, 26 Jun 2026 15:43:59 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/green-ci-proves-nothing-why-your-tests-gate-zero-calls-1po9</link>
      <guid>https://dev.to/jeremy_longshore/green-ci-proves-nothing-why-your-tests-gate-zero-calls-1po9</guid>
      <description>&lt;p&gt;Your test passed. It gated zero tool calls. It proved nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Dogfood Had to Prove
&lt;/h2&gt;

&lt;p&gt;The agent-governance-plane (AGP) is a Slack-native, OSS (Apache-2.0) governance gate for Claude Code. It runs the agent inside a Docker sandbox, checks every tool call against a policy engine, gates suspicious ones through human approval in Slack, and writes each decision to an Ed25519-signed, hash-chained audit journal you can verify offline with &lt;code&gt;agp verify&lt;/code&gt;. It holds no credentials. It fails closed on anything unverified. Phase B, pre-1.0.&lt;/p&gt;

&lt;p&gt;The live-dogfood task was straightforward: prove that a REAL Claude Code harness actually obeys the gate end-to-end, not in a mock test. Three increments, each independently shippable.&lt;/p&gt;

&lt;p&gt;Increment 1 wired the gate. A &lt;code&gt;hook-bridge.ts&lt;/code&gt; parses Claude Code's PreToolUse hook (the real measured contract), asks AGP's gate over a Unix socket, and translates the verdict back. Allow = exit 0. Deny = exit 2 + reason on stderr. Fail-closed by construction: a missing verdict is never an allow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// allow → exit 0; deny → exit 2 with the reason on stderr (the measured contract).&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ALLOW&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;BridgeResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;exitCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deny&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;BridgeResult&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;exitCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Fail-closed: connection failure, timeout, a deny verdict, or any non-allow&lt;/span&gt;
&lt;span class="c1"&gt;// response all return a deny. A missing verdict is never an allow.&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;policy_verdict&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;allow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;ALLOW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`AGP gate denied: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Increment 2 placed the harness in a network-enabled Docker container with real FS/process isolation. The hook still gated every call through the host control-plane gate over a bind-mounted Unix socket. The image was hardened: &lt;code&gt;--cap-drop ALL&lt;/code&gt;, &lt;code&gt;--security-opt no-new-privileges&lt;/code&gt;. API key passed by name, never in argv. Live validation on a real task: Read allowed, Bash denied.&lt;/p&gt;

&lt;p&gt;Increment 3 was the reproducible CI dogfood. A &lt;code&gt;.github/workflows/dogfood.yml&lt;/code&gt; workflow runs the harness and assembles an evidence bundle: journal, pubkey, &lt;code&gt;agp verify&lt;/code&gt; output, event timeline, filled AAR. All artifact, cryptographic proof — no service dependency, since the signed hash-chained journal verifies offline. Run by hand first against an actual flake in the claude-code-slack-channel repo, it worked: 95 journal events, 47 tool calls gated — Claude tried Agent, Bash, and ToolSearch (denied), then dozens of Reads (allowed). &lt;code&gt;agp verify&lt;/code&gt; confirmed chain, signatures, and signed head.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Run Claude Inside the Sandbox?
&lt;/h2&gt;

&lt;p&gt;The obvious design is to run the harness inside AGP's own sandbox. That sandbox is &lt;code&gt;--network none&lt;/code&gt; — an actively-verified, fail-closed default. But real Claude needs network egress to reach the model API. "Claude inside a &lt;code&gt;--network none&lt;/code&gt; sandbox" is self-contradictory: the harness can't start.&lt;/p&gt;

&lt;p&gt;So Topology B puts the harness in a &lt;em&gt;network-enabled&lt;/em&gt; container with real FS and process isolation of tool execution, while every tool call still gates through the host control plane over the bind-mounted socket. The honest limitation: that container has full egress, not a model-only allowlist. A model-only egress allowlist (Topology C) is the north star — a real networking subsystem, filed as its own work, not a v0 blocker. Same class of honest limit as "Docker, not Firecracker."&lt;/p&gt;

&lt;p&gt;The proof never depends on any of it. AGP's proof of what a governed run did is the signed, hash-chained journal, verifiable offline with the published key. No model provider, no hosted log, no third party is ever in the trust path. That independence is the moat — make provability depend on an external service and you weaken it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Green Run Proved Nothing
&lt;/h2&gt;

&lt;p&gt;The first CI dogfood run went green. It was hollow.&lt;/p&gt;

&lt;p&gt;Zero tool calls gated. The containerized intendant — AGP's per-harness adapter, the thing that actually runs Claude under the gate — could not connect to the host-process gate socket. The bind-mounted socket existed (&lt;code&gt;ls&lt;/code&gt; confirmed it), but &lt;code&gt;connect()&lt;/code&gt; returned ENOENT. The run completed. The workflow passed. It had exercised nothing. Note what did &lt;em&gt;not&lt;/em&gt; happen: the gate's own fail-closed guarantee never fired, because the harness never reached the gate. The hollow green lived in the CI evidence path, one layer above the policy gate.&lt;/p&gt;

&lt;p&gt;This reproduced on a clean CI runner. That mattered. An earlier ADR (037) had blamed the dev sandbox's filesystem virtualization for the same symptom; a clean-runner repro falsified that — there was no dev sandbox left to blame. It was a real socket-sharing bug: a single process, Bun, spawning Docker and multi-mounting one Unix socket. We corrected the ADR to name the real cause, reopened the tracking issue for the container socket bug (it had been closed prematurely on the false diagnosis), and made &lt;code&gt;evidence-bundle.sh&lt;/code&gt; fail closed when the gate was never exercised.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;event_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; &amp;lt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$OUT&lt;/span&gt;&lt;span class="s2"&gt;/events.txt"&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# grep -c already prints 0 on no match (and exits 1), so use "|| true" rather than echoing a second 0.&lt;/span&gt;
&lt;span class="nv"&gt;gated&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-cE&lt;/span&gt; &lt;span class="s1"&gt;'gate\.(allow|deny)'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$OUT&lt;/span&gt;&lt;span class="s2"&gt;/events.txt"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# ... assemble journal, pubkey, agp verify output, AAR ...&lt;/span&gt;

&lt;span class="c"&gt;# A dogfood that gated nothing proves nothing — fail closed (no fake-green).&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;gated&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"evidence-bundle: FAIL — 0 tool calls gated; the governed run did not exercise the gate"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi
&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"evidence-bundle: wrote &lt;/span&gt;&lt;span class="nv"&gt;$OUT&lt;/span&gt;&lt;span class="s2"&gt; (&lt;/span&gt;&lt;span class="nv"&gt;$event_count&lt;/span&gt;&lt;span class="s2"&gt; events, &lt;/span&gt;&lt;span class="nv"&gt;$gated&lt;/span&gt;&lt;span class="s2"&gt; gated, verify rc=&lt;/span&gt;&lt;span class="nv"&gt;$verify_rc&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Host-Path Pivot
&lt;/h2&gt;

&lt;p&gt;We pivoted CI to the host path: Claude installed on the ephemeral runner, gated, signed. The runner itself is the isolation. That path works end-to-end and is genuinely reproducible. The container path (Topology B) stays the north star — the socket bug stays open, not papered over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preventing Hung Runners in CI
&lt;/h2&gt;

&lt;p&gt;The first host-path dogfood hung. A fresh-runner Claude produced no output before stalling — likely first-run, no-TTY behavior, not AGP. A CI dogfood must never hang. We wrapped the run in &lt;code&gt;timeout 360&lt;/code&gt; and set the job &lt;code&gt;timeout-minutes: 20&lt;/code&gt;, so a stuck harness fails fast instead of burning the runner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dogfood&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;      &lt;span class="c1"&gt;# a stuck harness fails the job, never burns the runner&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# ...&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Governed run&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;timeout 360 bun src/cli/index.ts run --intendant claude-code \&lt;/span&gt;
            &lt;span class="s"&gt;--task "$TASK" --repo "$REPO"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Transferable Lesson: Assert the Work Was Done
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The transferable lesson:&lt;/strong&gt; an integration test that exercises zero of the thing under test is worse than a red one, because that fake-green reads as "covered" when it covered nothing. The fix is not "trust the green." It is to assert that the test did the work. For a governance gate, that assertion is &lt;code&gt;gated_count &amp;gt; 0&lt;/code&gt;. A green check is a claim. The evidence bundle is the proof. This generalizes everywhere: any test whose green can be reached without exercising the property it claims should fail closed when the property was not exercised.&lt;/p&gt;
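&lt;p&gt;The same guard can be sketched outside the shell script. This is an illustration, not AGP's code: the event-line format and the &lt;code&gt;evidenceVerdict&lt;/code&gt; helper are hypothetical, and only the &lt;code&gt;gate\.(allow|deny)&lt;/code&gt; pattern is taken from the script shown earlier. The check fails closed when the run produced events but none of them were gate decisions.&lt;/p&gt;

```typescript
// Count lines that record a gate decision, mirroring grep -cE 'gate\.(allow|deny)'.
function countGated(eventLines: string[]): number {
  const pattern = /gate\.(allow|deny)/;
  return eventLines.filter((line) => pattern.test(line)).length;
}

type Verdict = { ok: boolean; message: string };

// Fail closed: a run that gated nothing is reported as a failure, never as a pass.
function evidenceVerdict(eventLines: string[]): Verdict {
  const gated = countGated(eventLines);
  if (gated === 0) {
    return { ok: false, message: "FAIL: 0 tool calls gated; the run did not exercise the gate" };
  }
  return { ok: true, message: `${eventLines.length} events, ${gated} gated` };
}

// Hypothetical event lines. The real journal format is not shown in this post.
const hollowRun = ["run.start", "run.end"];
const realRun = ["run.start", "gate.deny Bash", "gate.allow Read", "run.end"];

const hollowVerdict = evidenceVerdict(hollowRun);
const realVerdict = evidenceVerdict(realRun);
console.log(hollowVerdict, realVerdict);
```

&lt;p&gt;The hollow run completes without error and would pass any check that only asks whether the process exited cleanly. The verdict function asks a different question: whether the property under test was ever exercised.&lt;/p&gt;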

&lt;p&gt;Also shipped: the sprite→intendant rename (ADR 038). Fly.io ships "Sprites" at sprites.dev—stateful sandboxes for AI agents with Claude Code as an explicit use case. A direct product/lane collision. "Intendant" is the agent-noun of Latin &lt;em&gt;intendere&lt;/em&gt;, the root of &lt;em&gt;intent&lt;/em&gt;, "one who executes on behalf of an authority." And the cross-repo echo: CCSC (AGP's substrate) shipped a "footgun-inversion" regression-test epic the same week, pinning fail-closed defaults (session isolation, "every policy decision is journaled—no gaps"). The same move as the no-fake-green guard: prove the safety property holds rather than trust it. This echoes the principle we follow in &lt;a href="https://dev.to/posts/honor-the-gate-when-the-verdict-is-inconvenient/"&gt;honoring the gate when the verdict is inconvenient&lt;/a&gt; — the gate is only as good as your commitment to enforcing it even when it blocks your path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related Posts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/honor-the-gate-when-the-verdict-is-inconvenient/"&gt;Honor the Gate When the Verdict Is Inconvenient&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/postgres-approval-sink-bugs-the-tests-caught/"&gt;The Two Postgres Bugs the Tests Caught: A Real-DB Integration Test Case Study&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/hitl-delivery-is-a-fail-closed-exactly-once-problem/"&gt;Human-in-the-Loop Is a Delivery Guarantee, Not a UI Feature&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>cicd</category>
      <category>aiagents</category>
      <category>testing</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Spec graduation: when a partner email rewrites architecture</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Fri, 26 Jun 2026 15:43:56 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/spec-graduation-when-a-partner-email-rewrites-architecture-45ba</link>
      <guid>https://dev.to/jeremy_longshore/spec-graduation-when-a-partner-email-rewrites-architecture-45ba</guid>
      <description>&lt;p&gt;The cleanest architectural move of the last six months didn't come from a whiteboard. It came from a partner email that essentially said: hey, remember what we actually signed.&lt;/p&gt;

&lt;p&gt;Yesterday's post was about &lt;a href="https://dev.to/posts/coherence-day-drift-detection-strategic-spine/"&gt;coherence as a deliverable&lt;/a&gt; — a full audit-day where four advisor agents independently flagged the same drift in the same engagement. The drift was real, but the post stayed at the diagnostic layer. It identified the misalignment without committing to the structural change that misalignment implied. Today is the structural change.&lt;/p&gt;

&lt;h2&gt;
  
  
  The drift
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;kobiton/CLAUDE.md&lt;/code&gt; for the M2 milestone work said, in plain text, that the primary venue for "Blog 1" was startaitools.com. That sentence had been there since the file was first scaffolded, and nobody had questioned it. It read like settled fact.&lt;/p&gt;

&lt;p&gt;It wasn't. It was a gloss. Somewhere during SOW negotiation a verbal "we're flexible on venue, you do what makes sense" had been transcribed into the operating doc as "primary venue: startaitools.com." Internal interpretation became internal canon, and internal canon then started shaping how I scoped everything downstream — what the post would argue, what tone it would use, who the audience was, where it would link to.&lt;/p&gt;

&lt;p&gt;The drift was small. The drift was three months old. The drift had compounded into structural assumptions about M3 and M4 already.&lt;/p&gt;

&lt;h2&gt;
  
  
  The check-in
&lt;/h2&gt;

&lt;p&gt;Frank Moyer is the Kobiton stakeholder on the M-series. He replied to a thread that was nominally about WordPress publishing access. The actual content of his reply was a course correction: Kobiton's blog is the canonical home for Blog 1. Cross-publication elsewhere — startaitools.com, the personal portfolio, anywhere — needs to align on canonical strategy ahead of publication.&lt;/p&gt;

&lt;p&gt;He was right. He didn't argue the point, he just stated the position the contract supported. My reply was "10-4." It was the only register that fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The re-read
&lt;/h2&gt;

&lt;p&gt;This is the part where, if I'm being honest, the temptation was to negotiate. A half-formed counter was taking shape in my head about cross-promotion, about audience reach, about how methodology content benefits both parties.&lt;/p&gt;

&lt;p&gt;I closed the email tab and pulled the actual signed SOW from &lt;code&gt;kobiton/000-docs/001-BL-CNTR-&lt;/code&gt;. I'd opened that document maybe three times since signing it.&lt;/p&gt;

&lt;p&gt;The project title, the literal first line of the document, included the word "Promotion." The M2 milestone description required a positive, supportive promotional tone and content reinforcing plugin value. Venue was not specified, but the framing was unambiguous. Frank's read of the contract was well-supported. My &lt;code&gt;kobiton/CLAUDE.md&lt;/code&gt; gloss was not.&lt;/p&gt;

&lt;p&gt;This is a small humiliation worth sitting with for a moment. The contract is the artifact. The CLAUDE.md is a working note. When they disagree, the working note loses every time, and the working note had been silently steering decisions for a quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boundary
&lt;/h2&gt;

&lt;p&gt;The fix was a new section in &lt;code&gt;kobiton/CLAUDE.md&lt;/code&gt; titled "Content boundaries." Two columns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contracted M2 deliverables&lt;/strong&gt; — three blogs, Frank-coordinated, canonical at kobiton.com/blog, vendor-specific framing per the SOW, positive promotional tone, plugin-value-reinforcing. This is theirs. I'm the author of record but Kobiton owns the venue, the timing, and the framing constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent methodology track&lt;/strong&gt; — anything that generalizes across engagements: the &lt;code&gt;intent-eval-lab&lt;/code&gt; spec, the matcher-map template, public technical comments on GitHub, anything that lives at the "shape of the audit" layer rather than the "this specific plugin" layer. This is mine. Authoring domain, venue, timing all under my control.&lt;/p&gt;

&lt;p&gt;The two columns aren't in tension. They're orthogonal. A finding I make while auditing the Kobiton plugin can produce both a Kobiton-specific blog (theirs) and a contribution to the vendor-neutral spec (mine), as long as I don't commingle the artifacts on one publication track or in one CLAUDE.md row.&lt;/p&gt;

&lt;p&gt;Writing that section took fifteen minutes. The clarity it produced is the part of this whole episode I keep returning to, because it's the part that unblocked everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The graduation
&lt;/h2&gt;

&lt;p&gt;Here's the spec graduation I hadn't seen until the boundary was on paper.&lt;/p&gt;

&lt;p&gt;Four advisor agents — business-analyst, ai-engineer, architect-reviewer, content-marketer — had each separately produced reviews over the prior week recommending some variant of the same structural move: stop being the auditor of one plugin, become the methodology layer that Kobiton is one instance of.&lt;/p&gt;

&lt;p&gt;The business-analyst framed it as positioning: "auditor of N plugins" doesn't compound; "author of the conformance spec N plugins are graded against" does. The architect-reviewer framed it as artifact taxonomy: the matcher-map template is being reinvented for every engagement and that's a sign it wants to be a vendor-neutral artifact. The ai-engineer framed it as eval-loop reusability. The content-marketer framed it as audience: methodology content has a different reader than vendor-specific content and the two were eating each other.&lt;/p&gt;

&lt;p&gt;Four roads, same destination. The stall was recent — two weeks — but the misalignment feeding it was three months old: the same CLAUDE.md gloss that misread the SOW's venue position was also keeping the methodology entangled with vendor-specific framing. Pushing the methodology to a public vendor-neutral spec while simultaneously delivering vendor-specific promotional content for one of the vendors-in-question is a real tension if you don't have a content boundary.&lt;/p&gt;

&lt;p&gt;With the content boundary in place, the tension disappears. The spec lives on the independent methodology track. The Kobiton blogs live on the contracted track. Neither contaminates the other.&lt;/p&gt;

&lt;p&gt;This is the inversion.&lt;/p&gt;

&lt;p&gt;Until Frank's email, the engagement architecture was: &lt;strong&gt;Kobiton M-series is the primary work, methodology extraction is a side effect.&lt;/strong&gt; After the boundary, it's: &lt;strong&gt;Methodology spec is the primary structural asset, Kobiton M-series is one instance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The work I do for Kobiton hasn't changed. What it produces, structurally, has.&lt;/p&gt;

&lt;h2&gt;
  
  
  The artifact
&lt;/h2&gt;

&lt;p&gt;I executed what the architect-reviewer agent had labeled "Option B — soft spec move." Vendor-neutral spec, draft status, public artifact, no marketing announcement. The on-disk layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;intent-eval-lab/
  specs/
    mcp-plugin-observability/
      v0.1.0-draft/
        SPEC.md
        matcher-map-template.md
        case-studies/
        conformance-test-suite/
    methodology/
      README.md
    validator-contract-reliability/
      README.md
    forecasting-drift-detection/
      README.md
    decentralized-crypto-evaluation/
      README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Module 1 — &lt;code&gt;mcp-plugin-observability&lt;/code&gt; — ships as &lt;code&gt;v0.1.0-draft&lt;/code&gt;. The other three placeholder modules live as siblings under &lt;code&gt;specs/&lt;/code&gt;, not children of the first module, because the methodology umbrella is broader than any single shape. The cross-module &lt;code&gt;methodology/&lt;/code&gt; directory sits alongside.&lt;/p&gt;

&lt;p&gt;The spec has five normative requirements, R1–R5. Most anchor to a section in Anthropic's published Claude Code monitoring, hooks, or plugins documentation; the structural ones cite &lt;a href="https://www.rfc-editor.org/rfc/rfc2119" rel="noopener noreferrer"&gt;RFC 2119&lt;/a&gt; plus the spec's own anchoring rule. The discipline is: I'm not inventing prescriptions, I'm codifying what's canonical and pointing at the source. If a requirement can't cite a canonical doc section or an upstream MCP/OTel spec section, it doesn't go in.&lt;/p&gt;
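
&lt;p&gt;The anchoring rule lends itself to a mechanical check. What follows is a minimal sketch, not the spec's actual tooling: the &lt;code&gt;Requirement&lt;/code&gt; class, its field names, and the placeholder requirement text are all assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One normative requirement. Class and field names are hypothetical."""
    req_id: str
    text: str
    citations: tuple = ()  # canonical doc sections or upstream spec sections


def unanchored(requirements):
    """Return the ids of requirements that cite no canonical source."""
    return [r.req_id for r in requirements if not r.citations]


# Placeholder content only, not the real R1-R5 text.
draft = [
    Requirement("R1", "Example requirement with a source.", ("RFC 2119",)),
    Requirement("R2", "Example requirement with no source yet."),
]

missing = unanchored(draft)
print(missing)  # ['R2']
```

&lt;p&gt;Under this sketch, any id returned by &lt;code&gt;unanchored&lt;/code&gt; is a requirement that would be cut from the spec rather than shipped without a citation.&lt;/p&gt;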

&lt;p&gt;The matcher-map template is the structural piece. Three columns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;finding-shape&lt;/th&gt;
&lt;th&gt;hook matcher&lt;/th&gt;
&lt;th&gt;OTel signal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;race&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PreToolUse&lt;/code&gt; on Bash&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mcp.tool.race.detected&lt;/code&gt; span event&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shape drift&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PostToolUse&lt;/code&gt; on Edit&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mcp.context.shape.drift&lt;/code&gt; metric&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;finding-shape&lt;/strong&gt; — a class of failure mode that recurs across plugins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hook matcher&lt;/strong&gt; — the Claude Code hooks declaration that detects an instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTel signal&lt;/strong&gt; — what the hook should emit so the failure becomes observable downstream.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The v0.1.0-draft includes six baseline shapes — race, shape drift, cooldown, side-effect verification, mandatory context, strict-mode protocol. Each one is a row that's been instantiated against at least one real engagement. The Kobiton M2 audit produced four of them. The remaining two came from earlier work.&lt;/p&gt;
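
&lt;p&gt;One way to picture the matcher map as data, assuming each row is a flat record keyed by finding-shape. The two rows come from the table above. Splitting the hook matcher into an event and a tool, the &lt;code&gt;MatcherRow&lt;/code&gt; name, and the &lt;code&gt;lookup&lt;/code&gt; helper are assumptions of this sketch, and the remaining four baseline shapes are left out because their matchers and signals are not listed here.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatcherRow:
    """One matcher-map row. Class and field names are hypothetical."""
    finding_shape: str  # class of failure mode that recurs across plugins
    hook_event: str     # Claude Code hook event, e.g. PreToolUse
    tool_matcher: str   # tool the hook matches on, e.g. Bash
    otel_signal: str    # signal emitted so the failure is observable
    signal_kind: str    # "span event" or "metric"


# Only the two rows shown in the table; the other baseline shapes are omitted.
MATCHER_MAP = {
    row.finding_shape: row
    for row in (
        MatcherRow("race", "PreToolUse", "Bash",
                   "mcp.tool.race.detected", "span event"),
        MatcherRow("shape drift", "PostToolUse", "Edit",
                   "mcp.context.shape.drift", "metric"),
    )
}


def lookup(finding_shape):
    """Return the row for a finding-shape, or None if it is not in the map."""
    return MATCHER_MAP.get(finding_shape)


print(lookup("race").otel_signal)  # mcp.tool.race.detected
```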

&lt;p&gt;The three module placeholders — &lt;code&gt;validator-contract-reliability&lt;/code&gt; (Polygon-shape), &lt;code&gt;forecasting-drift-detection&lt;/code&gt; (Nixtla-shape), &lt;code&gt;decentralized-crypto-evaluation&lt;/code&gt; (Lit-shape) — are README-only. Each README names a domain-specific failure mode and reserves the structural slot. No timeline, no commitment, no marketing. They exist so the methodology umbrella is visible without the umbrella having to ship every panel at once.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;methodology/&lt;/code&gt; directory is the cross-module layer. Its README enumerates five patterns observed early: how to write findings (Toulmin's argumentation model of claim, grounds, and warrant, from Stephen Toulmin's &lt;em&gt;The Uses of Argument&lt;/em&gt;, 1958); why diagnosis precedes prescription; the distinction between conformance (does the plugin meet the spec?) and eval (does the plugin do its job?); the vendor-neutrality rule; and the anchoring rule that requires every normative claim to cite a canonical source. Each pattern is provisional today; the README states the promotion gate explicitly: when at least two modules ship to a release-candidate version, the patterns get promoted to properly authored documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The runner decision (separate)
&lt;/h2&gt;

&lt;p&gt;The same boundary discipline that partitioned content also forced a smaller separation inside the spec itself: the language the runner is written in is not the language the spec speaks. A second multi-advisor pass — architecture, Go-fit, DX, business-strategy — converged on Go for the reference runner. Reasons stack neatly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single static binary distribution. No Python venv, no Node version dance. Plugin authors clone a release and run it.&lt;/li&gt;
&lt;li&gt;OTel ecosystem is Go-native. Collector, Tempo, &lt;code&gt;opentelemetry-go-contrib&lt;/code&gt;, and Honeycomb's Refinery proxy — all Go. The runner inherits the ecosystem instead of bridging to it.&lt;/li&gt;
&lt;li&gt;Fast cold start matters when the runner is in a CLI eval loop being invoked dozens of times per session.&lt;/li&gt;
&lt;li&gt;The official MCP Go SDK (&lt;code&gt;modelcontextprotocol/go-sdk&lt;/code&gt;) gives clean protocol integration without me writing glue.&lt;/li&gt;
&lt;li&gt;Matches the precedent set by the &lt;a href="https://dev.to/posts/forge-dogfood-plane-plugin-grade-a-and-jrig-verified-loop/"&gt;j-rig-binary-eval runner&lt;/a&gt; — same operational shape, same maintenance surface for one operator.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The critical separation: &lt;strong&gt;the runner-implementation decision is decoupled from the spec-implementation decision.&lt;/strong&gt; The spec is language-neutral. The matcher-map template is language-neutral. The conformance reports are language-neutral JSON.&lt;/p&gt;

&lt;p&gt;A plugin author writing in Python or TypeScript or Rust can author a conformant matcher map, run their own test harness, and produce a valid conformance report without installing Go. The Go runner is one reference implementation. A &lt;code&gt;runner-py/&lt;/code&gt; or &lt;code&gt;runner-ts/&lt;/code&gt; contributed later is a valid contribution, not a fork.&lt;/p&gt;
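
&lt;p&gt;To make "language-neutral JSON" concrete, here is a rough sketch of a conformance report built without the Go runner. The report schema is not given in this post, so every key below, the &lt;code&gt;build_report&lt;/code&gt; helper, and the plugin name are assumptions; only the module name and version string come from the layout above.&lt;/p&gt;

```python
import json


def build_report(plugin, spec_version, results):
    """Assemble a conformance report as a plain dict. All keys are assumed."""
    failed = sorted(req for req, ok in results.items() if not ok)
    return {
        "plugin": plugin,
        "spec": "mcp-plugin-observability",
        "spec_version": spec_version,
        "results": {
            req: ("pass" if ok else "fail")
            for req, ok in sorted(results.items())
        },
        "conformant": not failed,
        "failed": failed,
    }


# Hypothetical plugin and made-up outcomes, for illustration only.
report = build_report(
    "example-plugin",
    "v0.1.0-draft",
    {"R1": True, "R2": True, "R3": False},
)
print(json.dumps(report, indent=2, sort_keys=True))
```

&lt;p&gt;Nothing in the output depends on the language that produced it, which is the property that would let a Python, TypeScript, or Rust harness emit the same report.&lt;/p&gt;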

&lt;p&gt;I want to be explicit about why this matters. The cardinal sin of methodology specs is conflating "the spec" with "the tool that checks the spec." When that happens, the tool's language choice becomes a barrier to spec adoption, and the spec dies on a toolchain hill. Keeping the layers separate is not architectural purity, it's adoption strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The discipline holds the line
&lt;/h2&gt;

&lt;p&gt;R3 — the M2 deliverable due May 25 — keeps the vendor-specific framing the SOW originally scoped. No reframing. No spec-aware asides. No "this is one instance of a broader pattern" gestures. Frank gets the deliverable he contracted for, in the tone he contracted for, on the schedule he contracted for.&lt;/p&gt;

&lt;p&gt;The public announcement of the spec (a blog post, a tweet, an entry in the methodology hub on this site) is deferred until after M3 ships. At that point I'll go back to Frank, walk him through the spec, ask for consent on a co-credit announcement, and time it accordingly. The compounding upside is the same and the partner-relationship risk is lower, at the cost of a timeline three or four weeks slower.&lt;/p&gt;

&lt;p&gt;The trade is correct. A clean structural reframing that risks the partner deliverable is strictly worse than a slower reframing that doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes for future engagements
&lt;/h2&gt;

&lt;p&gt;Every new engagement now lands as "an instance of a shape." The intake question is: which finding-shape does this engagement's failure modes correspond to? If the shape exists in the spec, the engagement inherits the matcher-map row, the hook patterns, the OTel signal definitions, and the conformance criteria for free. The engagement-specific work is then the actual audit: do the hooks fire, do the signals show up, what's the conformance gap.&lt;/p&gt;

&lt;p&gt;If the shape doesn't exist, the engagement produces a new module. The Polygon engagement, when it activates, produces &lt;code&gt;validator-contract-reliability&lt;/code&gt; content. The Nixtla engagement produces &lt;code&gt;forecasting-drift-detection&lt;/code&gt; content. The Lit engagement produces &lt;code&gt;decentralized-crypto-evaluation&lt;/code&gt; content. The shape gets added to the matcher-map. The next engagement of the same shape inherits it.&lt;/p&gt;

&lt;p&gt;The auditor-of-one model required me to rebuild the audit framework for each engagement. The methodology-layer model means each engagement either is-a (inherits) or extends (contributes). N engagements under the new model produce one asset that compounds. N engagements under the old model produced N audit reports.&lt;/p&gt;
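
&lt;p&gt;The intake question reduces to a set comparison. A minimal sketch, assuming an engagement's failure modes have already been labeled with shape names: the six baseline names are the ones listed earlier, while the &lt;code&gt;intake&lt;/code&gt; function and the "validator timeout" label are hypothetical.&lt;/p&gt;

```python
# The six baseline shapes named in the v0.1.0-draft.
BASELINE_SHAPES = frozenset({
    "race",
    "shape drift",
    "cooldown",
    "side-effect verification",
    "mandatory context",
    "strict-mode protocol",
})


def intake(observed_shapes, known_shapes=BASELINE_SHAPES):
    """Split an engagement's failure modes into inherited and new shapes."""
    observed = set(observed_shapes)
    return {
        "inherits": sorted(observed.intersection(known_shapes)),
        "extends": sorted(observed.difference(known_shapes)),
    }


# "validator timeout" is a made-up label standing in for a new shape.
result = intake(["race", "cooldown", "validator timeout"])
print(result)
# {'inherits': ['cooldown', 'race'], 'extends': ['validator timeout']}
```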

&lt;h2&gt;
  
  
  The takeaways, packaged tight
&lt;/h2&gt;

&lt;p&gt;Six items I'd hand to anyone running a sole-prop AI consultancy or a small AI-engineering team:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Re-read the contract before arguing it.&lt;/strong&gt; Internal interpretations of verbal context drift. The document doesn't. When a partner pushes back, the first move is the SOW, not the rebuttal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partition contracted-deliverable content from methodology-track content.&lt;/strong&gt; Don't let them share a row in your operating doc, a publication channel, or an authorship surface. The boundary is what makes both possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A spec is a structural artifact, not a marketing artifact.&lt;/strong&gt; Anchor every normative requirement to a canonical published source. If you can't cite it, don't prescribe it. The discipline protects the spec's integrity, and credibility follows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decouple the spec language from the runner language.&lt;/strong&gt; Pick the runner language for the runner's reasons (distribution, ecosystem, cold start). Keep the spec language-neutral so contributors aren't blocked by toolchain. Reference implementations are references, not gates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reserve structural slots with placeholder READMEs.&lt;/strong&gt; Three READMEs that name domain-specific failure modes signal "this is a methodology umbrella" without overcommitting to timelines for modules you haven't earned the right to write yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hold the timeline when the reframing risks the partner deliverable.&lt;/strong&gt; The spec is forever. The deliverable is May 25. Don't trade a fixed-date partner asset for a marketing window on something that's going to compound for years either way.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Also shipped today, in an unrelated repo: PR #707 on &lt;code&gt;claude-code-plugins&lt;/code&gt; — a CSS grid overflow fix for the marketplace landing on iPhone 13 viewports. &lt;code&gt;1fr → minmax(0, 1fr)&lt;/code&gt; on the pcard-hosting grids, &lt;code&gt;minmax(min(320px, 100%), 1fr)&lt;/code&gt; on the multi-column auto-fill grids, and &lt;code&gt;min-width: 0; max-width: 100%&lt;/code&gt; on &lt;code&gt;.pcard&lt;/code&gt; itself. Defense in depth against the &lt;a href="https://css-tricks.com/preventing-a-grid-blowout/" rel="noopener noreferrer"&gt;nested-flex-grid overflow&lt;/a&gt; that 47 px of hidden width was causing. Worth noting only because it landed the same day; it has nothing to do with the rest of this post.&lt;/p&gt;

&lt;p&gt;The version on disk is &lt;code&gt;v0.1.0-draft&lt;/code&gt;. It will move. The spec that exists now is not the spec I would have written before Frank's email; that one would have been a Kobiton-shaped artifact dressed up as methodology. Instead it's a methodology layer that Kobiton, among others, will be an instance of. The shape it took on this morning, after a partner email forced a contract re-read, is the shape it kept.&lt;/p&gt;

</description>
      <category>methodology</category>
      <category>consulting</category>
      <category>partnerengagement</category>
      <category>spec</category>
    </item>
  </channel>
</rss>
