<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeremy Longshore</title>
    <description>The latest articles on DEV Community by Jeremy Longshore (@jeremy_longshore).</description>
    <link>https://dev.to/jeremy_longshore</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842419%2Ff5d02b54-daf0-4520-9aef-118fbd0c24ac.jpeg</url>
      <title>DEV Community: Jeremy Longshore</title>
      <link>https://dev.to/jeremy_longshore</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeremy_longshore"/>
    <language>en</language>
    <item>
      <title>Five Tags, Zero Ships: How an Auto-Release Workflow Lied for a Whole Day</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Sun, 24 May 2026 13:00:27 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/five-tags-zero-ships-how-an-auto-release-workflow-lied-for-a-whole-day-3cii</link>
      <guid>https://dev.to/jeremy_longshore/five-tags-zero-ships-how-an-auto-release-workflow-lied-for-a-whole-day-3cii</guid>
      <description>&lt;p&gt;Five GitHub tags. v1.0.4 through v1.1.0. Five green checkmarks on the workflow. Five formatted release notes. The npm registry stayed at v1.0.5 the entire time.&lt;/p&gt;

&lt;p&gt;This is what it looks like when a release workflow ships tags without shipping code. Every observable surface said "done" except the one that mattered — the registry. The bug wasn't in one place; it was three independent failures that combined to make the lie convincing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Checkmarks Promised
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;gh release list&lt;/code&gt; showed all five tags with formatted changelogs. The workflow run logs were entirely green. If you ran &lt;code&gt;npm install -g intentional-cognition-os&lt;/code&gt;, you got v1.0.5. No error. No warning. Silently wrong for anyone relying on v1.0.5+, silently right for everyone else.&lt;/p&gt;

&lt;p&gt;The pattern repeated across the morning: commit → auto-release fires → tag appears → npm registry unchanged. The workflow was perfectly honest about tagging. It just wasn't releasing anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 1: Tests That Passed by Lying
&lt;/h2&gt;

&lt;p&gt;The "Verify readiness" step was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify readiness&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pnpm test || &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;|| true&lt;/code&gt; is the tell. Every test failed. &lt;code&gt;Failed to resolve entry for package @ico/types&lt;/code&gt; — the workspace packages hadn't been built yet, so &lt;code&gt;pnpm test&lt;/code&gt; resolved nothing, threw hard errors, and the &lt;code&gt;|| true&lt;/code&gt; swallowed them all. The workflow saw exit code 0 and kept going.&lt;/p&gt;

&lt;p&gt;In a monorepo, the build step is not optional ceremony. The test runner needs the workspace packages to be built first. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify readiness&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;set -e&lt;/span&gt;
    &lt;span class="s"&gt;pnpm build&lt;/span&gt;
    &lt;span class="s"&gt;pnpm test&lt;/span&gt;
    &lt;span class="s"&gt;pnpm lint&lt;/span&gt;
    &lt;span class="s"&gt;pnpm typecheck&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;set -e&lt;/code&gt; means any non-zero exit stops the workflow. If tests fail after the build, you find out. If the build fails, you stop. Lint and typecheck went into the same step because they were already in the local pre-push hook; the only reason to keep them out of the release gate is laziness or speed, and a release gate is the wrong place to optimize either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 2: Nine Version Sources, Six Ignored
&lt;/h2&gt;

&lt;p&gt;Nine surfaces emit a version string in this repo: root &lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;version.txt&lt;/code&gt;, &lt;code&gt;CHANGELOG.md&lt;/code&gt;, the five workspace &lt;code&gt;package.json&lt;/code&gt; files (&lt;code&gt;packages/cli&lt;/code&gt;, &lt;code&gt;packages/kernel&lt;/code&gt;, &lt;code&gt;packages/compiler&lt;/code&gt;, &lt;code&gt;packages/types&lt;/code&gt;, &lt;code&gt;packages/benchmarks&lt;/code&gt;), and the runtime constant at &lt;code&gt;packages/kernel/src/version.ts&lt;/code&gt;. The workflow bumped three of them — root, &lt;code&gt;version.txt&lt;/code&gt;, &lt;code&gt;CHANGELOG.md&lt;/code&gt; — and silently left the other six behind.&lt;/p&gt;

&lt;p&gt;Result: root said 1.0.4, workspace packages said 1.0.3. Root said 1.0.5, workspace said 1.0.4. Drift every run. &lt;code&gt;ico --version&lt;/code&gt; told users the workspace's number, not the tag's.&lt;/p&gt;

&lt;p&gt;Lock-step monorepos need single-source-of-truth version sync. A helper that picks up the six the workflow was missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bump_pkg_json&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;
  node &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"
    const fs = require('fs');
    const pkg = JSON.parse(fs.readFileSync('&lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;', 'utf8'));
    pkg.version = '&lt;/span&gt;&lt;span class="nv"&gt;$version&lt;/span&gt;&lt;span class="s2"&gt;';
    fs.writeFileSync('&lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;', JSON.stringify(pkg, null, 2) + '&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;');
  "&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

bump_pkg_json package.json &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VERSION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;pkg &lt;span class="k"&gt;in &lt;/span&gt;packages/&lt;span class="k"&gt;*&lt;/span&gt;/package.json&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;bump_pkg_json &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pkg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VERSION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"s/export const VERSION = '.*';/export const VERSION = '&lt;/span&gt;&lt;span class="nv"&gt;$VERSION&lt;/span&gt;&lt;span class="s2"&gt;';/"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  packages/kernel/src/version.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All nine sources now move together. &lt;code&gt;ico --version&lt;/code&gt; reports the truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 3: The Step That Wasn't There
&lt;/h2&gt;

&lt;p&gt;The workflow tagged releases. It never published to npm. There was no &lt;code&gt;npm publish&lt;/code&gt; step. That's not a typo — the workflow was complete without it. Every release ran. Every release skipped the one thing that makes it a release.&lt;/p&gt;

&lt;p&gt;Here's what belongs after "Create GitHub Release":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Publish to npm&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;NPM_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.NPM_TOKEN }}&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;set -e&lt;/span&gt;
    &lt;span class="s"&gt;if [ -z "$NPM_TOKEN" ]; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "NPM_TOKEN not set — skipping publish"&lt;/span&gt;
      &lt;span class="s"&gt;exit 0&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="s"&gt;if npm view "intentional-cognition-os@$VERSION" version 2&amp;gt;/dev/null; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "intentional-cognition-os@$VERSION already on npm — skipping"&lt;/span&gt;
      &lt;span class="s"&gt;exit 0&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="s"&gt;echo "//registry.npmjs.org/:_authToken=$NPM_TOKEN" &amp;gt; ~/.npmrc&lt;/span&gt;
    &lt;span class="s"&gt;pnpm --filter intentional-cognition-os publish --no-git-checks&lt;/span&gt;
    &lt;span class="s"&gt;sleep 5&lt;/span&gt;
    &lt;span class="s"&gt;npm view "intentional-cognition-os@$VERSION" version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three guards, all in the script — not in the step's &lt;code&gt;if:&lt;/code&gt; condition. (Step-level &lt;code&gt;env:&lt;/code&gt; isn't available to that step's own &lt;code&gt;if:&lt;/code&gt; in GitHub Actions, so &lt;code&gt;if: env.NPM_TOKEN != ''&lt;/code&gt; would always evaluate false. The check belongs inside &lt;code&gt;run:&lt;/code&gt;, where the env is real.) Token presence fails safe if it's missing. Idempotency skips if already published (covers manual publishes). Post-publish verification re-queries the registry to confirm it landed.&lt;/p&gt;

&lt;p&gt;A release workflow that doesn't end with a verifiable artifact in the registry isn't a release workflow. It's a tagging workflow with extra steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  The State Behind the Process
&lt;/h2&gt;

&lt;p&gt;Fixing the workflow forward didn't fix the present. When the workflow was corrected (commit &lt;code&gt;7681dd5&lt;/code&gt;), &lt;code&gt;main&lt;/code&gt; was drifted: root at 1.1.0, workspace at 1.0.5. Users running &lt;code&gt;ico --version&lt;/code&gt; got &lt;code&gt;1.0.5&lt;/code&gt;. One-time backfill in commit &lt;code&gt;c651de8&lt;/code&gt; aligned all nine version sources to 1.1.0. Then verified: &lt;code&gt;pnpm build&lt;/code&gt; succeeded, &lt;code&gt;pnpm test&lt;/code&gt; 1,210/1,210 passing, &lt;code&gt;ico --version&lt;/code&gt; → &lt;code&gt;1.1.0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Process bugs leave state behind. Fixing the process doesn't heal the damage. You clean it up separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Bug Pattern
&lt;/h2&gt;

&lt;p&gt;Every CI/CD pipeline that ships has these three failure modes available:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Quality gates that pass on failure (&lt;code&gt;|| true&lt;/code&gt;, swallowed errors). Fix: &lt;code&gt;set -e&lt;/code&gt; and explicit step order.&lt;/li&gt;
&lt;li&gt;Monorepo workspaces with distributed version state. Fix: single-source-of-truth version sync in the workflow.&lt;/li&gt;
&lt;li&gt;A release workflow that doesn't end with verification the artifact reached the registry. Fix: final step that queries the registry and confirms.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The icos release workflow had all three. The checkmarks lied because the workflow wasn't designed to catch itself lying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Also Shipped 2026-05-19
&lt;/h2&gt;

&lt;p&gt;Daily-log convention — the rest of the day, in one paragraph each. Not connected to the release-workflow thread; logged here because they happened on the same git day.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;claude-code-slack-channel v2 cluster&lt;/strong&gt; — 4 PRs merged with enterprise governance substrate framing. RFC 8785 JCS interop vectors (#175), cross-tier shadow detection (#176), journal v2 Ed25519 signing (#177), strip denied tool-call detail (#178).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kobiton R3 close-out&lt;/strong&gt; — deliverable final review, Blog 3 rewrite, 5 close-out PRs merged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude-code-plugins partner portal&lt;/strong&gt; — Kobiton and Nixtla brand integration, Killer Skill of the Week refresh.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;intentional-cognition-os test infra&lt;/strong&gt; — Intent Solutions Testing SOP layers L0-L7 installed (&lt;code&gt;.husky/&lt;/code&gt;, dependency-cruiser, stryker, RTM/PERSONAS/JOURNEYS docs). 3,447 insertions in commit &lt;code&gt;e0efdee&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/v1-release-gate-conditional-go/"&gt;v1.0.0: Conditional GO Through a Release Gate&lt;/a&gt; — The gate that flagged this path.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/honest-perf-benchmarks-paid-api-compiler/"&gt;Honest Performance Benchmarks for a Paid-API Compiler&lt;/a&gt; — Earlier icos work from this release cycle; same repo, different failure mode.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>releaseengineering</category>
      <category>cicd</category>
      <category>monorepo</category>
      <category>debugging</category>
    </item>
    <item>
      <title>A v1.0 Is a Gate, Not a Tag</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Thu, 21 May 2026 13:00:39 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/a-v10-is-a-gate-not-a-tag-3bc4</link>
      <guid>https://dev.to/jeremy_longshore/a-v10-is-a-gate-not-a-tag-3bc4</guid>
      <description>&lt;p&gt;Two beads were open at the start of 2026-05-18. E10-B11 was the v1.0 release-readiness gate. E10-B12 was the v1.0 release cut, blocked-by-design on B11. Epic 10 was the last epic in &lt;code&gt;intentional-cognition-os&lt;/code&gt; (ICO). The release pipeline was wired through &lt;code&gt;/release&lt;/code&gt;. Everything that mattered had to clear one ritual.&lt;/p&gt;

&lt;p&gt;Five npm releases shipped that day: v0.21.0 → v0.22.0 → v0.22.1 → v0.22.2 → &lt;strong&gt;v1.0.0&lt;/strong&gt; → v1.0.1. The interesting one is v1.0.0, because the gate said &lt;strong&gt;GO with conditions&lt;/strong&gt;, not GO. And the same-day v1.0.1 is the proof that "GO with conditions" is the correct verdict shape for a real release, not a binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3× degradation gate
&lt;/h2&gt;

&lt;p&gt;The release ran on top of fresh benchmark infrastructure. &lt;code&gt;625691e&lt;/code&gt; and &lt;code&gt;f7bd287&lt;/code&gt; closed out E10-B06 (performance profiling) with a 500-source large-corpus benchmark. The headline addition was a &lt;strong&gt;3× degradation gate&lt;/strong&gt; — a configurable cap (default 3.0) that fails the run if per-unit cost at large scale exceeds 3× the moderate-corpus baseline.&lt;/p&gt;

&lt;p&gt;The gate is intentionally narrow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// utils/degradation.ts — gate stays honest by NOT inferring per-unit costs&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;computeDegradation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;moderatePerUnitMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;largePerUnitMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;cap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;moderatePerUnitMs&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;Infinity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="c1"&gt;// catch degenerate samples loudly&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;largePerUnitMs&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;moderatePerUnitMs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;cap&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runner does per-unit derivation BEFORE calling the gate. Ingest's &lt;code&gt;perFile.medianMs&lt;/code&gt; is already per-unit (each iteration was one file). Lint's &lt;code&gt;result.medianMs&lt;/code&gt; is whole-workspace, so the runner divides by page count first. Putting that decision in the runner instead of the gate is the difference between "gate that knows what it's measuring" and "gate that guesses at the measurement units."&lt;/p&gt;

&lt;p&gt;Results at 500 sources: ingest 1.25× (PASS), lint 0.33× (PASS — got faster at scale, likely amortized constants). The gate had teeth and the system passed cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The release-readiness checklist (E10-B11, PR #73)
&lt;/h2&gt;

&lt;p&gt;Eight items, verified item-by-item, recorded honestly. No "looks good to me" entries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CI passes&lt;/strong&gt; — all 4 jobs green on last 3 main runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evals pass&lt;/strong&gt; — smoke eval clean; retrieval/citation/compilation handlers wired with 30+ unit tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage targets&lt;/strong&gt; — PARTIAL. Types 100%, kernel 84.6% (target 90%), compiler 62.3% (target 80%), CLI 45.2% (target 70%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs updated&lt;/strong&gt; — current per E10-B07/B08&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CHANGELOG complete&lt;/strong&gt; — auto-generated, current through v0.22.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No critical beads open&lt;/strong&gt; — only B11 (this) + B12 (release cut, blocked by design)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User journey walkthrough&lt;/strong&gt; — &lt;code&gt;ico init&lt;/code&gt; → status → 14-command CLI surface, live smoke-tested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance targets met&lt;/strong&gt; — ingest 200× headroom, lint 3000× headroom, 3× degradation gate PASS&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Verdict: &lt;strong&gt;GO with two conditions.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;C1:&lt;/strong&gt; &lt;code&gt;ico --version&lt;/code&gt; reported &lt;code&gt;0.1.0&lt;/code&gt; (a stale kernel constant) instead of the published &lt;code&gt;0.22.x&lt;/code&gt;. Fix in-cut.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C2:&lt;/strong&gt; Coverage shortfall on kernel/compiler/cli. Documented as post-v1, not blocking. 1,210 passing tests, zero known bugs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That verdict is the artifact. Most release rituals make GO/NO-GO a binary. The conditional verdict is honest: state the gap, decide if it blocks, ship if it doesn't, document the gap permanently if it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "GO with conditions" actually means
&lt;/h2&gt;

&lt;p&gt;A conditional release verdict is the three-state model: &lt;strong&gt;fix what's fixable in-cut, document what isn't, ship anyway.&lt;/strong&gt; Unlike a binary GO/NO-GO gate that forces a boolean choice, a conditional gate acknowledges that real releases ship with known imperfections. The conditions are documented forever in the release record — no lying about readiness, no pretending gaps don't exist, but no unnecessary delays waiting for the perfect threshold that never comes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not GO/NO-GO binary?
&lt;/h2&gt;

&lt;p&gt;Binary GO/NO-GO encourages two bad behaviors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavior one: lower the bar to ship.&lt;/strong&gt; "The version-string bug is fine, users will figure it out." The release ships, the operator-visible defect ships with it, and the next person debugging an environment ends up reading the wrong build into their incident postmortem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavior two: delay until the gate is perfect.&lt;/strong&gt; Coverage targets met on a Tuesday that never comes. Kernel at 84.6% is allegedly not 90%, so v1.0 slips. Then 90% becomes 95%, because some new code landed during the wait. The gate becomes a treadmill.&lt;/p&gt;

&lt;p&gt;Coverage at kernel 84.6% / compiler 62.3% / CLI 45.2% with 1,210 passing tests and zero known bugs &lt;strong&gt;is shippable&lt;/strong&gt;. Blocking v1.0 on coverage uplift would have been a bigger lie than shipping with documented shortfalls. The AAR opens C2 as a post-v1 bead for the next planning cycle. The truth is in the record.&lt;/p&gt;

&lt;p&gt;C1 is the inverse case — &lt;code&gt;ico --version&lt;/code&gt; reporting the wrong number is shippable but ugly, and the fix is small. So fix it in-cut, document it, move on. The gate didn't pretend C1 was fine; it just didn't pretend it was a v2.0-blocker either.&lt;/p&gt;

&lt;p&gt;The prescription is a three-part rule, not a two-part one: &lt;strong&gt;fix what's fixable in-cut, document what isn't, ship anyway.&lt;/strong&gt; Binary GO/NO-GO collapses three states into two and loses the most useful one — the "shippable with known imperfections" state where most real releases actually live.&lt;/p&gt;

&lt;h2&gt;
  
  
  C1 fix: read your own version (PR #74)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;packages/cli/src/index.ts&lt;/code&gt; had been importing &lt;code&gt;version&lt;/code&gt; from &lt;code&gt;@ico/kernel&lt;/code&gt;, which exported a hardcoded string. The kernel constant was never maintained in lock-step with the published CLI package — and &lt;strong&gt;shouldn't be&lt;/strong&gt;, since they are independent artifacts on independent release cadences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// packages/cli/src/index.ts — read from CLI's own package.json&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;readCliVersion&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pkgPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../package.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pkg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pkgPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;pkg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;version&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[ico] failed to read CLI package.json:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0.0.0-unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// sentinel — CLI keeps working, operator sees clear msg&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cliVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;readCliVersion&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The try/catch is load-bearing. &lt;code&gt;readCliVersion()&lt;/code&gt; runs at module load, BEFORE the process-level error handlers are installed further down the file. An uncaught throw here would surface as a raw Node stack trace and bypass the friendly &lt;code&gt;[ico]&lt;/code&gt;-prefixed message convention every other CLI error uses. The sentinel path is what makes this safe to call at import time — the CLI keeps working, the operator gets a legible message, and the bug is visible without crashing.&lt;/p&gt;

&lt;p&gt;The test was tightened in the same PR. &lt;code&gt;/^\d+\.\d+\.\d+/&lt;/code&gt; (no end anchor — would accept nonsense like &lt;code&gt;0.22.1.99&lt;/code&gt;) became:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cliVersion&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;-&lt;/span&gt;&lt;span class="se"&gt;[\w&lt;/span&gt;&lt;span class="sr"&gt;.-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)?&lt;/span&gt;&lt;span class="sr"&gt;$/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strict semver core plus optional pre-release tag. The previous regex was a one-character bug; the fix is one character plus an opt-in pre-release group.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cut itself (52fa7a4 → v1.0.0)
&lt;/h2&gt;

&lt;p&gt;The cut commit was tiny: 11 files, +54/-10 lines. It did one thing: aligned &lt;strong&gt;all 6 workspace &lt;code&gt;package.json&lt;/code&gt;&lt;/strong&gt; + &lt;code&gt;version.txt&lt;/code&gt; + &lt;code&gt;kernel/src/version.ts&lt;/code&gt; at 1.0.0.&lt;/p&gt;

&lt;p&gt;The auto-release workflow had been bumping the root &lt;code&gt;package.json&lt;/code&gt; and &lt;code&gt;version.txt&lt;/code&gt; only — internal packages had drifted to 0.1.0 or 0.22.1 depending on history. &lt;code&gt;/release&lt;/code&gt; Phase 3 caught the drift. Phase 5 required explicit SHA approval before any push (&lt;code&gt;f1a627b&lt;/code&gt;). Phases 6-8 ran atomically.&lt;/p&gt;

&lt;p&gt;Verified at v1.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,210 / 1,210 tests pass across 5 packages&lt;/li&gt;
&lt;li&gt;Lint + typecheck clean&lt;/li&gt;
&lt;li&gt;escape-scan REFUSE=0 CHALLENGE=0 FLAG=0&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ico --version&lt;/code&gt; reports &lt;code&gt;1.0.0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The tarball turned out incomplete (v1.0.1, same day)
&lt;/h2&gt;

&lt;p&gt;During the actual &lt;code&gt;npm publish&lt;/code&gt; flow, the pack dry-run reported &lt;strong&gt;7 files&lt;/strong&gt; when expected was 9: dist + package.json, no README, no LICENSE. The CLI's &lt;code&gt;package.json&lt;/code&gt; declared:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"files"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"dist"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"README.md"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LICENSE"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the CLI directory didn't OWN those files. The canonical &lt;code&gt;README.md&lt;/code&gt; and &lt;code&gt;LICENSE&lt;/code&gt; live at the monorepo root.&lt;/p&gt;

&lt;p&gt;Fix landed inline before the real publish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// packages/cli/tsup.config.ts — copy README + LICENSE at build time&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineConfig&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="c1"&gt;// ... entry, format, dts, sourcemap ...&lt;/span&gt;
  &lt;span class="na"&gt;onSuccess&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cp ../../README.md ../../LICENSE ./&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The copies are gitignored (their source of truth is the repo root). v1.0.0 on npm now includes both. No version bump for the build-infra fix itself, but the same day shipped v1.0.1 for the next user-visible change.&lt;/p&gt;

&lt;p&gt;This is the test of whether "GO with conditions" was the right shape. A binary GO/NO-GO ritual would have caught the version string (C1) and either fixed it before re-running the whole gate or punted to v1.0.1. The conditional model said: ship, here's what we know is imperfect. When the tarball turned out incomplete during the actual publish — a discovery that &lt;strong&gt;couldn't&lt;/strong&gt; have been made during gate verification, because it only surfaces in the publish pipeline itself — the answer was just: ship v1.0.1 the same day. No drama. No "release is broken" panic. The model already accepted that real releases generate follow-on releases.&lt;/p&gt;

&lt;h2&gt;
  
  
  AAR same day
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;d17e10e docs(aar): v1.0.0 release after-action report&lt;/code&gt; landed within hours. Three lessons-for-next-release, captured while they were still warm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Beads JSONL/Dolt sync flapping&lt;/strong&gt; during multi-PR sessions — repeated need to re-close beads after merges. Filed as a follow-up to investigate the sync ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-release workflow bumps root + &lt;code&gt;version.txt&lt;/code&gt; only&lt;/strong&gt; — should bump &lt;code&gt;packages/*/package.json&lt;/code&gt; in lock-step. The 11-file cut commit was entirely correcting drift the workflow could have prevented.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/release&lt;/code&gt; skill execution worked as designed&lt;/strong&gt; — Phase 0 surfaced no blockers, Phases 1-3 caught the version drift, Phase 5 required SHA approval, Phases 6-8 atomic.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Same-day AAR is non-negotiable. The version-drift issue, the tarball issue, the conditional-verdict pattern — all of them lose 80% of their teaching value if you write the AAR a week later, after the warm memory of "wait, why didn't the workflow catch that?" has faded into "yeah, we shipped, it was fine."&lt;/p&gt;

&lt;h2&gt;
  
  
  Also shipped
&lt;/h2&gt;

&lt;p&gt;The release gate constrained the v1.0.0 cut, not the working day. Three other repos kept moving in parallel — exactly the behavior the conditional-verdict model is designed to enable. A release that takes the whole org offline isn't a release ritual; it's an outage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;hustle:&lt;/strong&gt; Phase 3 auth landed in three commits — NextAuth + Drizzle/SQLite infrastructure, dashboard cutover, password reset flow. Coordinated migration from the previous auth stack on a single feature branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude-code-slack-channel:&lt;/strong&gt; ACP session/cancel boundary adapter extracted into a module, and JSON-RPC &lt;code&gt;id&lt;/code&gt; widened to nullable per spec §5.1 (#172, #173).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude-code-plugins:&lt;/strong&gt; Six PRs — repo quality audit, private vulnerability reporting enabled, validator discovers root-level &lt;code&gt;SKILL.md&lt;/code&gt; (Anthropic-spec layout), slack-channel mirror stopped stripping upstream tests, blog cross-post infra fix.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/honest-perf-benchmarks-paid-api-compiler/"&gt;Honest perf benchmarks for a paid-API compiler&lt;/a&gt; — yesterday's post on the benchmark infrastructure that fed this release gate&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/five-releases-fifteen-minutes-mandy-cutover-and-freeze-break/"&gt;Five releases in fifteen minutes: Mandy cutover and freeze break&lt;/a&gt; — earlier five-releases-in-a-day pattern&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/github-release-workflow-uncommitted-changes-semantic-versioning/"&gt;GitHub release workflow: uncommitted changes and semantic versioning&lt;/a&gt; — related release-engineering theme&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>releaseengineering</category>
      <category>cicd</category>
      <category>typescript</category>
      <category>testing</category>
    </item>
    <item>
      <title>Honest Perf Benchmarks for a Paid-API Compiler</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Wed, 20 May 2026 13:00:40 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/honest-perf-benchmarks-for-a-paid-api-compiler-56h4</link>
      <guid>https://dev.to/jeremy_longshore/honest-perf-benchmarks-for-a-paid-api-compiler-56h4</guid>
      <description>&lt;p&gt;&lt;code&gt;intentional-cognition-os&lt;/code&gt; is a TypeScript "compiler" — markdown sources go in one end, a structured artifact comes out the other, and several of the middle stages call paid Claude APIs to do the cognitive work. Up to today there were zero performance gates on any of it. No baseline, no regression alarm, no "did that refactor make ingest 4× slower" check.&lt;/p&gt;

&lt;p&gt;The benchmark suite that landed across four PRs answers two design questions that had to be settled before a single line of timing code got written:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How do you compare numbers across machines when half the corpus is randomly generated text?&lt;/li&gt;
&lt;li&gt;What do you do about the steps that cost real money on every run?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Get either answer wrong and the benchmark suite is worse than no benchmark suite — it produces numbers that look authoritative and aren't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The corpus has to be byte-identical
&lt;/h2&gt;

&lt;p&gt;The first scenario — &lt;code&gt;ingest&lt;/code&gt; — needs a corpus. Hand-curated fixtures committed to disk were considered and rejected: they don't scale, they go stale, and they encode whoever-wrote-them's idea of "representative." A generator is the right answer, but a generator has to be deterministic or before/after diffs are noise.&lt;/p&gt;

&lt;p&gt;The generator uses a seeded &lt;code&gt;mulberry32&lt;/code&gt; PRNG and pulls UUIDs from the same stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;mulberry32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mh"&gt;0x6d2b79f5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;^=&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;61&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4294967296&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;seededUuidV4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 16 bytes from the seeded stream, version + variant nibbles set per RFC 4122&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x0f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mh"&gt;0x40&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x3f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mh"&gt;0x80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;formatUuid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious trap is &lt;code&gt;crypto.randomUUID&lt;/code&gt;. It would have looked correct, passed every unit test, and silently produced different UUIDs on every run — so every "identical" corpus would have differed in the front-matter &lt;code&gt;id&lt;/code&gt; field. That breaks ingest's content-hash cache in different ways on different machines. Same seed, same count, same body-word count yields byte-identical output everywhere. That's the contract.&lt;/p&gt;

&lt;p&gt;One more gotcha worth a sentence: the corpus generator writes front matter through &lt;code&gt;gray-matter&lt;/code&gt;, which quotes string values. The compiler's wiki-page validator uses a hand-rolled YAML parser that does NOT strip quotes — so wiki fixtures emit all values unquoted. A quoted &lt;code&gt;compiled_at&lt;/code&gt; would arrive at Zod's datetime check with literal &lt;code&gt;"&lt;/code&gt; characters in it and fail. Two parsers, two rules, documented inline at the parser boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  An API key is not consent
&lt;/h2&gt;

&lt;p&gt;The render, compile, and ask scenarios call Claude. Running them on every CI pass would either drain a budget or quietly stop running when the budget hit zero. Neither is acceptable.&lt;/p&gt;

&lt;p&gt;The gate is two env vars, both required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-... &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;ICO_BENCH_INCLUDE_CLAUDE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
pnpm &lt;span class="nt"&gt;--filter&lt;/span&gt; @ico/benchmarks bench
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From PR #70's design notes, kept verbatim because the framing matters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The double gate is intentional. An API key alone is not consent — many developers have it set for normal CLI use. &lt;code&gt;ICO_BENCH_INCLUDE_CLAUDE&lt;/code&gt; is the explicit "yes, burn tokens on this benchmark run" signal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This pattern shows up elsewhere — &lt;code&gt;CI=true&lt;/code&gt; plus &lt;code&gt;RUN_E2E=1&lt;/code&gt;, prod credentials plus &lt;code&gt;--really-really-yes&lt;/code&gt;. The shape is the same: one signal proves capability, the second proves intent. A single-gate design fails open the first time someone forgets which shell they're in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skipped is not zero
&lt;/h2&gt;

&lt;p&gt;The interesting design call was what to do when the gate is closed. The wrong answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't run, don't record. Trend tooling then can't tell "we stopped running render" from "render still passes."&lt;/li&gt;
&lt;li&gt;Record a zero. Trend tooling thinks render got infinitely fast and stops alarming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right answer: record the scenario as &lt;code&gt;skipped: true&lt;/code&gt; with a stable &lt;code&gt;skipReason&lt;/code&gt;. &lt;code&gt;ScenarioRecord&lt;/code&gt; is &lt;code&gt;Partial&amp;lt;CommonTiming&amp;gt;&lt;/code&gt; so the timing fields legitimately don't exist on skipped records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"render"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skipped"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skipReason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ICO_BENCH_INCLUDE_CLAUDE not set"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"git_sha"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"9c14f02"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"node"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v22.21.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"linux-x64"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A baseline-comparison script can now answer three different questions instead of two: did this scenario regress, did it improve, or did it not run? Skipped runs stay visible in the JSON timeline. They don't pollute the histogram, but they prove the scenario still exists and the runner saw it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four PRs, briefly
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PR #68&lt;/strong&gt; scaffolded the &lt;code&gt;packages/benchmarks/&lt;/code&gt; workspace, the corpus generator, a &lt;code&gt;bench()&lt;/code&gt; timer with warmup + N-iteration median + RSS delta, and the runner that captures git SHA, Node version, and platform into &lt;code&gt;results/&amp;lt;iso&amp;gt;-&amp;lt;sha&amp;gt;.json&lt;/code&gt;. The &lt;code&gt;results/&lt;/code&gt; directory is gitignored except &lt;code&gt;.gitkeep&lt;/code&gt; — baselines get tracked explicitly, not by accident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #69&lt;/strong&gt; added the &lt;code&gt;lint&lt;/code&gt; scenario and moved &lt;code&gt;runLint&lt;/code&gt;, &lt;code&gt;scanWikiPages&lt;/code&gt;, &lt;code&gt;extractWikilinks&lt;/code&gt;, &lt;code&gt;detectOrphans&lt;/code&gt;, &lt;code&gt;LintResult&lt;/code&gt;, and &lt;code&gt;SchemaError&lt;/code&gt; out of &lt;code&gt;packages/cli/src/commands/lint.ts&lt;/code&gt; into a new &lt;code&gt;packages/compiler/src/lint.ts&lt;/code&gt;. The function only composes compiler + kernel primitives and has no CLI dependency — it belonged in the compiler the whole time. The CLI's lint command shrunk to a thin wrapper around commander wiring and &lt;code&gt;renderLintReport&lt;/code&gt;. Side fix: &lt;code&gt;extractWikilinks&lt;/code&gt; had a module-level &lt;code&gt;/g&lt;/code&gt; regex whose &lt;code&gt;lastIndex&lt;/code&gt; carried state between calls — the same class of bug that landed in PR #67 the day before. Fixed by constructing the regex per call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #70&lt;/strong&gt; added the &lt;code&gt;render&lt;/code&gt; scenario and the double-gate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #71&lt;/strong&gt; added &lt;code&gt;compile&lt;/code&gt; and &lt;code&gt;ask&lt;/code&gt;, each using the same gating pattern. Roughly 70 lines of additions across both files — the gate had already done the hard work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why not the obvious alternatives
&lt;/h2&gt;

&lt;p&gt;Vitest's built-in &lt;code&gt;bench&lt;/code&gt; was considered. It does microbenchmarks well and integrates with the existing test runner. It does not produce the JSON timeline shape needed for cross-run comparison, and bolting that on means owning the storage layer anyway. Build it once, build it right.&lt;/p&gt;

&lt;p&gt;Committing fixture corpora to disk was considered. They go stale, balloon the repo, and encode one author's idea of "moderate." The seeded generator is reproducible AND parameterizable — same determinism guarantee, no committed binary blobs.&lt;/p&gt;

&lt;p&gt;Running Claude scenarios always was considered for about a minute, then rejected on cost grounds. Even with caching, a benchmark suite that costs $2 per run on a busy day stops getting run.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the numbers say
&lt;/h2&gt;

&lt;p&gt;Three scenarios ran on the dev box this afternoon (Claude-gated ones skipped because the opt-in wasn't set):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Headroom&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ingest (per-file, 50 sources × 500 words)&lt;/td&gt;
&lt;td&gt;~9 ms&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 s&lt;/td&gt;
&lt;td&gt;220×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lint (50 sources + 30 wiki pages)&lt;/td&gt;
&lt;td&gt;~12 ms&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 s&lt;/td&gt;
&lt;td&gt;2400×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;render&lt;/td&gt;
&lt;td&gt;SKIPPED (no opt-in)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;recorded&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headroom isn't the point — those targets are deliberately generous because the goal is regression detection, not perf bragging. The point is that there are now numbers to regress &lt;em&gt;against&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Also shipped today
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;claude-code-plugins repo audit.&lt;/strong&gt; A 232-line audit landed at &lt;code&gt;266-RA-AUDT-repo-quality-audit-2026-05-17.md&lt;/code&gt; cataloguing a broken &lt;code&gt;/about&lt;/code&gt; route, missing 404 handling, 14 stale &lt;code&gt;MS-OLDV&lt;/code&gt; files still claiming v1.0.0 while the repo is at v4.30.0, and notebook content teaching the old 6-required-fields skill spec when the current spec requires 8. The first commit incorrectly flagged the wiki as empty, because &lt;code&gt;gh api repos/.../wiki&lt;/code&gt; returns 404 even when the wiki has content — that endpoint isn't a content probe, it's a metadata probe with bad error semantics. Followup commit cloned the wiki, found 23 pages, and refreshed all of them with current numbers. Lesson noted inline: don't use API existence probes as content probes. Clone and read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;claude-code-slack-channel threat model.&lt;/strong&gt; Added T11 (EchoLeak — instructions exfiltrated via legitimate-looking message replies) and invariant #7: admin verbs are not chat content. An operational key-management doc for the audit-signing key landed alongside the threat model update.&lt;/p&gt;

&lt;h2&gt;
  
  
  The transferable pattern
&lt;/h2&gt;

&lt;p&gt;Five scenarios in source tree, three actively measured, two gated behind explicit consent. The numbers that get reported are honest because the inputs are reproducible and the skipped runs are visible. Forget the opt-in flag and three scenarios show up as &lt;code&gt;skipped&lt;/code&gt; in the JSON — they don't disappear, and they don't pretend to be zero.&lt;/p&gt;

&lt;p&gt;Any benchmark suite that mixes deterministic and paid steps needs all three pieces: a deterministic corpus that survives machine swaps, an opt-in gate strong enough to mean something, and a record shape that distinguishes "didn't run" from "ran fast." Miss one and the suite will quietly lie to you the first time someone forgets which mode they're in. The lie is worse than the gap it filled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/five-silent-failures-one-day/"&gt;Five Silent Failures in One Day&lt;/a&gt; — the regex &lt;code&gt;lastIndex&lt;/code&gt; bug that re-appeared in PR #69 was one of these.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/deterministic-first-llm-advisory-ci/"&gt;Deterministic-First, LLM-Advisory CI&lt;/a&gt; — same principle: the deterministic gate decides, the paid gate informs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/transitive-cve-clearance-dual-layer-pattern/"&gt;Transitive CVE Clearance: A Dual-Layer Pattern&lt;/a&gt; — the double-gate is the same shape as that two-layer defense.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>typescript</category>
      <category>testing</category>
      <category>architecture</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Five Silent Failures in One Day</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Tue, 19 May 2026 13:00:41 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/five-silent-failures-in-one-day-4n7d</link>
      <guid>https://dev.to/jeremy_longshore/five-silent-failures-in-one-day-4n7d</guid>
      <description>&lt;p&gt;&lt;strong&gt;A silent failure is when a tool reports PASS without doing the work it was supposed to do — the legitimate empty-set case and the broken-but-silent case produce identical output, and nothing downstream can tell them apart.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A green check is not evidence of work. It is evidence that whatever ran did not raise an error. Those are different claims, and on 2026-05-16 the difference surfaced five times in five unrelated systems before lunch.&lt;/p&gt;

&lt;p&gt;The pattern is the same in all five: a tool reported PASS without doing the work it was supposed to do. Not a wrong answer — no answer, dressed up as a correct one. The legitimate empty-set case and the broken-but-silent case produced identical output. CI was green. Reviewers saw nothing to push back on. The signal that something was wrong came from downstream consumers noticing the work was missing.&lt;/p&gt;

&lt;p&gt;The five instances, in the order they were found:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A CI prescreen that ran on zero plugins and called itself green&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;.gitignore&lt;/code&gt; rule that silently dropped plugin configs from every commit&lt;/li&gt;
&lt;li&gt;Prettier that reformatted an 11,000-line catalog and exited 0&lt;/li&gt;
&lt;li&gt;An SSH deploy that succeeded by doing nothing&lt;/li&gt;
&lt;li&gt;A regex that quietly skipped matches because the &lt;code&gt;/g&lt;/code&gt; flag left state behind&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each one shipped past code review. Each one was caught by a downstream user, not by the gate that was supposed to catch it. Each one has now been re-armed with a guard whose job is to assert the work actually happened — not to assert that the command exited zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The prescreen that ran on zero plugins
&lt;/h2&gt;

&lt;p&gt;Repo: &lt;code&gt;claude-code-plugins&lt;/code&gt;, PR #730.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pr-prescreen.yml&lt;/code&gt; workflow's "Compute changed plugin paths" step combined &lt;code&gt;gh api --paginate&lt;/code&gt; with &lt;code&gt;--jq&lt;/code&gt; in a single pipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Compute changed plugin paths&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;gh api --paginate \&lt;/span&gt;
      &lt;span class="s"&gt;"/repos/${{ github.repository }}/pulls/${{ github.event.pull_request.number }}/files" \&lt;/span&gt;
      &lt;span class="s"&gt;--jq '.[].filename' \&lt;/span&gt;
      &lt;span class="s"&gt;| grep -E '^plugins/[^/]+/' \&lt;/span&gt;
      &lt;span class="s"&gt;| cut -d/ -f1-2 \&lt;/span&gt;
      &lt;span class="s"&gt;| sort -u &amp;gt; changed-plugins.txt || true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works on every local shell. On the GitHub Actions runner, the &lt;code&gt;--paginate&lt;/code&gt; + &lt;code&gt;--jq&lt;/code&gt; combination silently produced empty stdout. No error. No exit code. Just nothing on the pipe. The downstream &lt;code&gt;grep | cut | sort -u&lt;/code&gt; happily processed zero lines and wrote an empty file. The trailing &lt;code&gt;|| true&lt;/code&gt; swallowed any failure that might have escaped the pipeline.&lt;/p&gt;

&lt;p&gt;The classifier then read &lt;code&gt;changed-plugins.txt&lt;/code&gt;, saw zero entries, and emitted &lt;code&gt;PASS: no plugin paths matched the PR diff&lt;/code&gt;. Two external PRs — #726 and #728, the first contributions through the new pipeline — both landed false PASS verdicts on PRs that obviously added new plugin directories.&lt;/p&gt;

&lt;p&gt;The fix is two changes and a guard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Fetch changed files&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;gh api --paginate \&lt;/span&gt;
      &lt;span class="s"&gt;"/repos/${{ github.repository }}/pulls/${{ github.event.pull_request.number }}/files" \&lt;/span&gt;
      &lt;span class="s"&gt;&amp;gt; pr-files.json&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract plugin paths&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;jq -r '.[].filename' pr-files.json \&lt;/span&gt;
      &lt;span class="s"&gt;| grep -E '^plugins/[^/]+/' \&lt;/span&gt;
      &lt;span class="s"&gt;| cut -d/ -f1-2 \&lt;/span&gt;
      &lt;span class="s"&gt;| sort -u &amp;gt; changed-plugins.txt&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sanity guard&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;if jq -r '.[].filename' pr-files.json | grep -qE '^plugins/'; then&lt;/span&gt;
      &lt;span class="s"&gt;if [ ! -s changed-plugins.txt ]; then&lt;/span&gt;
        &lt;span class="s"&gt;echo "HARD_BLOCK: PR touches plugins/ but extraction produced zero dirs"&lt;/span&gt;
        &lt;span class="s"&gt;exit 1&lt;/span&gt;
      &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Splitting &lt;code&gt;gh api --paginate&lt;/code&gt; from &lt;code&gt;jq&lt;/code&gt; removes the pipe-buffering interaction that ate stdout. Dropping the blanket &lt;code&gt;|| true&lt;/code&gt; lets real errors propagate. The third step is the actual fix: it asserts that &lt;em&gt;if&lt;/em&gt; the PR diff touched any plugin path, the extraction must have produced at least one row. "I found nothing" becomes "I would have found something — fail loud."&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The gitignore that ate plugin configs
&lt;/h2&gt;

&lt;p&gt;Repo: &lt;code&gt;claude-code-plugins&lt;/code&gt;, PR #733.&lt;/p&gt;

&lt;p&gt;The root &lt;code&gt;.gitignore&lt;/code&gt; contained one line that was never meant to apply globally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original intent was dev-local — devs sometimes drop a &lt;code&gt;.mcp.json&lt;/code&gt; at the repo root for personal MCP servers. The pattern matched everywhere. Three plugins — &lt;code&gt;slack-channel&lt;/code&gt;, &lt;code&gt;pr-to-spec&lt;/code&gt;, &lt;code&gt;x-bug-triage&lt;/code&gt; — had a &lt;code&gt;.mcp.json&lt;/code&gt; on disk because the mirror sync wrote them, and git silently never tracked any of the three. The mirror produced the file. The working tree showed the file. &lt;code&gt;git status&lt;/code&gt; showed it as ignored. Nothing red anywhere.&lt;/p&gt;

&lt;p&gt;Plugins without their &lt;code&gt;.mcp.json&lt;/code&gt; fail the MCP handshake at install time. Claude Code can't determine how to spawn the server. The plugin loads, registers nothing, and the user sees commands that do nothing.&lt;/p&gt;

&lt;p&gt;A second silent failure lived in the same PR. The mirror's &lt;code&gt;sources.yaml&lt;/code&gt; listed source files explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;plugins/x-bug-triage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;server.ts&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;lib.ts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;server.ts&lt;/code&gt; imports &lt;code&gt;journal.ts&lt;/code&gt;, &lt;code&gt;manifest.ts&lt;/code&gt;, &lt;code&gt;policy.ts&lt;/code&gt;, &lt;code&gt;supervisor.ts&lt;/code&gt; — none of which were in the allow-list. The mirror shipped a non-functional server, not because anything errored, but because the include list silently skipped the missing files. No "file not in sources" warning. No diff check. Just a partial build that compiled because the imports themselves were valid module references at type-check time but missing at runtime.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .gitignore
.mcp.json
!plugins/**/.mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;plugins/x-bug-triage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.ts"&lt;/span&gt;
    &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.test.ts"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.spec.ts"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The negation rule re-tracks plugin configs. The glob-with-exclude replaces named-file allow-lists with a pattern that can't silently miss a new file. The three affected &lt;code&gt;.mcp.json&lt;/code&gt; files were force-added in the same commit.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Prettier that reformatted 11,000 lines and exited 0
&lt;/h2&gt;

&lt;p&gt;Repo: &lt;code&gt;claude-code-plugins&lt;/code&gt;, PR #730 (same PR as the prescreen failure).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.claude-plugin/marketplace.extended.json&lt;/code&gt; is the canonical plugin catalog — eleven thousand lines, hand-formatted with deliberate multi-line &lt;code&gt;keywords&lt;/code&gt; arrays for git-diff hygiene:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-plugin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"ci"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"validation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"marketplace"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A contributor's format-on-save action ran prettier across the catalog. Prettier collapsed every keyword array to a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-plugin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ci"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"validation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"marketplace"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The JSON was still valid. Prettier exited 0. The &lt;code&gt;validate-plugins.yml&lt;/code&gt; workflow loaded the catalog, parsed it, ran every entry through the schema — all green. The actual diff was +1 plugin entry, -1,200 lines of reformatted catalog. Every other in-flight PR's merge base was now unrecoverable without rebase-and-reformat.&lt;/p&gt;

&lt;p&gt;The fix has two parts. First, &lt;code&gt;.prettierignore&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude-plugin/marketplace.extended.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, an active line-budget guard at &lt;code&gt;scripts/check-catalog-format.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;expected_line_delta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_catalog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_catalog&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_catalog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head_catalog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;base_by_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plugins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="n"&gt;head_by_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plugins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;added&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head_by_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_by_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;removed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_by_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head_by_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;modified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;head_by_name&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;base_by_name&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;head_by_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;base_by_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="c1"&gt;# Average plugin block is ~30 lines.
&lt;/span&gt;    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;added&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;removed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modified&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;

&lt;span class="n"&gt;actual_delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;file_line_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;file_line_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expected_line_delta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;  &lt;span class="c1"&gt;# slack for inline edits
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual_delta&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL: catalog diff &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual_delta&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; lines, budget &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The guard parses both catalogs structurally, computes the expected line delta from the actual content changes, and rejects PRs where the file delta exceeds that by more than 300 lines. "The file is still valid" becomes "the diff is the size we expected from the work that was claimed."&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The SSH deploy that succeeded by doing nothing
&lt;/h2&gt;

&lt;p&gt;Repo: &lt;code&gt;hustle&lt;/code&gt;, PR #40. Documented in the &lt;code&gt;intentsolutions-vps-runbook&lt;/code&gt; AAR for Phase 2.5 of the VPS migration.&lt;/p&gt;

&lt;p&gt;The new Hustle VPS deploy workflow merged green. The first auto-deploy reported success. The container on the VPS was untouched.&lt;/p&gt;

&lt;p&gt;The canonical reusable VPS deploy workflow is one SSH call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ssh ${{ env.DEPLOY_USER }}@${{ env.DEPLOY_HOST }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no command argument. The whole architecture relies on a &lt;code&gt;command="..."&lt;/code&gt; force-command directive in &lt;code&gt;authorized_keys&lt;/code&gt; to bind the deploy key to a specific script. Connect with the key, the forced command runs, deploy happens, connection closes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;hustle-deploy&lt;/code&gt; user's &lt;code&gt;authorized_keys&lt;/code&gt; had no force-command. Plain &lt;code&gt;ssh user@host&lt;/code&gt; with no command and no force-command opens an interactive session. The runner has no TTY. The session sits idle for a moment, the server times out the silent connection, exit 0. From the runner's perspective: SSH connected, SSH closed cleanly, deploy step SUCCESS. From the VPS's perspective: a key authenticated, nothing happened, the session ended.&lt;/p&gt;

&lt;p&gt;The fix is a deploy script and a force-command lock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /usr/local/sbin/deploy-hustle&lt;/span&gt;
&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nb"&gt;cd&lt;/span&gt; /srv/hustle
git fetch origin
git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; origin/main
docker compose pull
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--remove-orphans&lt;/span&gt;
docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /home/hustle-deploy/.ssh/authorized_keys
command="/usr/local/sbin/deploy-hustle",no-port-forwarding,no-X11-forwarding,no-pty ssh-ed25519 AAAA... deploy@github
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now there is no path where the SSH channel can do nothing. The forced command runs or the key fails to authenticate. The second deploy ran the script end-to-end, recreated the container, and produced visible log output the runner could grep.&lt;/p&gt;

&lt;p&gt;The generalization matters more than the fix. Every Docker-variant deploy in the fleet that depends on a force-command and doesn't have one is silently broken in the same way. &lt;code&gt;lilly-75-holy&lt;/code&gt; and &lt;code&gt;braves-booth&lt;/code&gt; are flagged for audit; &lt;code&gt;partner-portals&lt;/code&gt; and &lt;code&gt;claude-code-plugins-plus-skills&lt;/code&gt; are safe — both have the force-command directive in place. The fleet sweep is tracked as a follow-up bead off the P7 Stage C epic, not folded into this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The regex that skipped matches because &lt;code&gt;/g&lt;/code&gt; left state behind
&lt;/h2&gt;

&lt;p&gt;Repo: &lt;code&gt;intentional-cognition-os&lt;/code&gt;, PR #67 (a Gemini review followup on E10-B03).&lt;/p&gt;

&lt;p&gt;Two module-level constants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SOURCE_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\[\^&lt;/span&gt;&lt;span class="sr"&gt;src:&lt;/span&gt;&lt;span class="se"&gt;([^\]]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)\]&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;WIKILINK_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\[\[([^\]]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)\]\]&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used in two back-to-back &lt;code&gt;RegExp.exec&lt;/code&gt; loops to iterate citation markers in a body of text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractCitations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;Citation&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Citation&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;SOURCE_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;source&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;WIKILINK_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wikilink&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;RegExp&lt;/code&gt; instances with the &lt;code&gt;/g&lt;/code&gt; flag carry a mutable &lt;code&gt;lastIndex&lt;/code&gt; between calls. The &lt;code&gt;exec&lt;/code&gt; loop is supposed to walk it to the end and let the final non-match reset it to 0 — but any code path that exits the loop early, throws mid-iteration, or runs concurrently on the same regex object leaves &lt;code&gt;lastIndex&lt;/code&gt; mid-string. The next call to &lt;code&gt;extractCitations&lt;/code&gt; starts searching from wherever the last one stopped.&lt;/p&gt;

&lt;p&gt;The citation handler kept reporting "verified" because the missed citations were not checked at all — not flagged as missing, not flagged as wrong. They were invisible. Whichever entries fell before the carried-over &lt;code&gt;lastIndex&lt;/code&gt; were skipped silently, every time.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractCitations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;Citation&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Required: SOURCE_RE and WIKILINK_RE are module-level /g regexes.&lt;/span&gt;
  &lt;span class="c1"&gt;// Reset lastIndex on entry so prior loop state cannot cause this call&lt;/span&gt;
  &lt;span class="c1"&gt;// to start mid-string and silently skip matches.&lt;/span&gt;
  &lt;span class="nx"&gt;SOURCE_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lastIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;WIKILINK_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lastIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Citation&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;SOURCE_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;source&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;WIKILINK_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wikilink&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The comment is load-bearing. Without it, the next refactor pulls the resets out as "redundant" and the silent skip comes back. Six regression tests pin the invariant: prebuilt-index honored, batch aggregation correct, 100 sequential calls return identical output, two interleaved bodies (one long, one short) stay independent of each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of a silent failure
&lt;/h2&gt;

&lt;p&gt;All five share the same anatomy. There exists a legitimate no-op outcome — no plugin paths matched, no files to include, no formatting changes needed, no command to run, no remaining matches in the string. The error path produces an observable state identical to the legitimate no-op. The downstream consumer cannot tell which one it got.&lt;/p&gt;

&lt;p&gt;The fixes are not better error handling. The fixes are active assertions about the work that was claimed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;prescreen:&lt;/strong&gt; if files matched the trigger, the extraction must have produced rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gitignore + allow-list:&lt;/strong&gt; plugin configs must reach the tree, not just the working directory — and source allow-lists must fail on missing imports, not silently ship a partial build&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;prettier:&lt;/strong&gt; the diff size must match the structural work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSH deploy:&lt;/strong&gt; bind the command to the key — make it impossible for the channel to do nothing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;regex:&lt;/strong&gt; reset state to a known precondition before every call, and pin that contract with a test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common verb in every fix is &lt;em&gt;assert&lt;/em&gt;, not &lt;em&gt;handle&lt;/em&gt;. The bug was not that errors weren't caught. The bug was that there was no point in the pipeline where the system stated, in code, what counted as the work actually being done.&lt;/p&gt;

&lt;p&gt;The hardest silent failures to catch are the ones where the tool's success state and its silent-failure state are observationally identical. That is the category. Once auditing for it begins, more keep surfacing — most CI pipelines have at least one step that exits 0 whether or not it did anything, and most of them are downstream of a step that &lt;em&gt;can&lt;/em&gt; legitimately produce empty output.&lt;/p&gt;

&lt;p&gt;Silent failures don't get worse over time. They get more confident. Each green check trains the audit instinct to skip them, and the audit instinct is the only thing standing between the build status and the truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/deterministic-first-llm-advisory-ci/"&gt;Deterministic-first, LLM-advisory CI&lt;/a&gt; — the broader argument for keeping reject/accept decisions in code that can be reasoned about, with model output as advisory signal&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/three-guards-against-shipping-slop/"&gt;Three guards against shipping slop&lt;/a&gt; — earlier examples of the same assert-the-work pattern in plugin merges&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/two-false-positive-fixes-same-root-cause/"&gt;Two false-positive fixes, same root cause&lt;/a&gt; — when two unrelated bugs share an underlying shape&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cicd</category>
      <category>debugging</category>
      <category>devops</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Deterministic First, LLM Second: An Advisory CI Pre-Screen</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Mon, 18 May 2026 13:00:26 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/deterministic-first-llm-second-an-advisory-ci-pre-screen-8o6</link>
      <guid>https://dev.to/jeremy_longshore/deterministic-first-llm-second-an-advisory-ci-pre-screen-8o6</guid>
      <description>&lt;p&gt;The old PR review system ran Gemini on every submission to the &lt;code&gt;claude-code-plugins&lt;/code&gt; repo. It broke every time — quota errors, timeout, malformed JSON, the works. On 2026-05-15 I shipped a replacement and deleted the original on the same day.&lt;/p&gt;

&lt;p&gt;The replacement is structured around two contracts. A deterministic classifier scores each submission against 12 rules and emits one of three verdicts. A Groq LLM bolted on top writes a 5-line summary as advisory polish. The deterministic layer is the product. The LLM never blocks.&lt;/p&gt;

&lt;p&gt;The first live invocation immediately caught two bugs in the new system. That's not failure. That's the design working exactly as intended.&lt;/p&gt;

&lt;p&gt;Five PRs in one day: &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/719" rel="noopener noreferrer"&gt;#719&lt;/a&gt;, &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/723" rel="noopener noreferrer"&gt;#723&lt;/a&gt;, &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/721" rel="noopener noreferrer"&gt;#721&lt;/a&gt;, &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/724" rel="noopener noreferrer"&gt;#724&lt;/a&gt;, &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/725" rel="noopener noreferrer"&gt;#725&lt;/a&gt;. Together they close the epic and demonstrate why the deterministic-first pattern lets you replace a live system without a transition period.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The never-block contract:&lt;/strong&gt; LLM outputs are advisory only. They never block the primary CI decision. If the LLM crashes, times out, or hallucinates, the deterministic verdict posts unchanged and the rest of the pipeline runs as if the LLM step never existed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The two contracts that matter
&lt;/h2&gt;

&lt;p&gt;The pre-screen workflow lives on two non-negotiable contracts that separate the product from the polish.&lt;/p&gt;

&lt;p&gt;The first: &lt;strong&gt;the deterministic classifier is the product.&lt;/strong&gt; It ingests validator JSON output, applies 12 rules to the changeset, and emits a verdict — &lt;code&gt;PASS&lt;/code&gt;, &lt;code&gt;CHANGES_REQUESTED&lt;/code&gt;, or &lt;code&gt;HARD_BLOCK&lt;/code&gt;. Three outcomes. No gray. No ambiguity.&lt;/p&gt;

&lt;p&gt;The classifier is a pure function. No I/O. No dependencies beyond the Python stdlib. Every test case maps to a rule. Every rule maps to observable, repeatable behavior. You can trace it from input to output without waiting for an API to respond or hoping an LLM doesn't hallucinate.&lt;/p&gt;

&lt;p&gt;PR &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/719" rel="noopener noreferrer"&gt;#719&lt;/a&gt; closed this layer in 579 new lines: &lt;code&gt;.github/workflows/pr-prescreen.yml&lt;/code&gt; (~270 lines of workflow), &lt;code&gt;scripts/pr-prescreen/classify.py&lt;/code&gt; (the classifier), and 12 unit tests covering every rule and edge case.&lt;/p&gt;

&lt;p&gt;The workflow pattern is fork-safe: &lt;code&gt;pull_request_target&lt;/code&gt;, SHA-pinned checkout, &lt;code&gt;persist-credentials: false&lt;/code&gt;, never executes PR-controlled code. I copied the security pattern verbatim from the broken Gemini workflow (lines 23-79 — those were the load-bearing security). The security model didn't change. The signal source did.&lt;/p&gt;

&lt;p&gt;The second contract: &lt;strong&gt;the LLM is advisory, never blocks.&lt;/strong&gt; When the deterministic layer says &lt;code&gt;PASS&lt;/code&gt;, the Groq LLM generates a 5-line human-readable summary. The summary is rendered as a GitHub comment. It carries zero veto power.&lt;/p&gt;

&lt;p&gt;If Groq times out, crashes, the API key leaks, or the model hallucinates — &lt;code&gt;continue-on-error: true&lt;/code&gt; on the workflow step ensures the pre-screen verdict still posts. The comment just doesn't appear. Slack doesn't ping a summary. The rest of the CI runs unchanged. The primary signal is independent of the advisory layer.&lt;/p&gt;

&lt;p&gt;The verdict table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;Deterministic rule&lt;/th&gt;
&lt;th&gt;Slack&lt;/th&gt;
&lt;th&gt;Comment&lt;/th&gt;
&lt;th&gt;Retry&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All skills ≥ C, no fatal errors&lt;/td&gt;
&lt;td&gt;Ping&lt;/td&gt;
&lt;td&gt;LLM summary (Groq)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CHANGES_REQUESTED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing fields or D/F grade&lt;/td&gt;
&lt;td&gt;Silent&lt;/td&gt;
&lt;td&gt;Deterministic details&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HARD_BLOCK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fatal validator error or missing impl&lt;/td&gt;
&lt;td&gt;Ping&lt;/td&gt;
&lt;td&gt;Deterministic details&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Groq runs only on &lt;code&gt;PASS&lt;/code&gt; verdicts. If Groq fails, the verdict still posts — just without the summary. The deterministic layer is the contract. The LLM is the enhancement that runs &lt;em&gt;inside&lt;/em&gt; the bounds of the contract.&lt;/p&gt;

&lt;p&gt;PR &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/723" rel="noopener noreferrer"&gt;#723&lt;/a&gt; added the Groq integration: 388 new lines, &lt;code&gt;scripts/pr-prescreen/summarize.py&lt;/code&gt;. It calls Groq directly via stdlib &lt;code&gt;urllib&lt;/code&gt; — no SDK, no dependency overhead, no transitive vulnerability surface. Model: &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt;. Wall-clock budget: 5 seconds. Single attempt. No retries.&lt;/p&gt;

&lt;p&gt;The function is dead simple: POST the verdict JSON plus the changeset summary to Groq, parse the response, format it, return. 11 unit tests including fixes from PR #720's review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;broad &lt;code&gt;except Exception&lt;/code&gt; for the never-block contract — catch literally everything&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OSError&lt;/code&gt; instead of &lt;code&gt;TimeoutError&lt;/code&gt; for broader I/O coverage (&lt;code&gt;socket.timeout&lt;/code&gt;, connection resets, the rest of the network failure surface)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;json.JSONDecodeError&lt;/code&gt; guard for malformed responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One tested invariant matters most: &lt;strong&gt;user-controlled PR content cannot override the fixed system prompt.&lt;/strong&gt; The system prompt is a string literal in the source. The PR body is data. They never meet in the same code path. The classifier output goes into the user-role message; the system role is hard-coded and unreachable from outside.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "never block" buys you
&lt;/h2&gt;

&lt;p&gt;The old Gemini system gave an LLM veto power over 3,000+ shipped artifacts. Every time it broke, you couldn't delete it — too much workflow depended on it staying alive. The downstream blast radius made retiring it a multi-week migration.&lt;/p&gt;

&lt;p&gt;You're stuck in maintenance mode: tuning prompts, chasing API changes, hoping the next model version handles the job the same way the last one did. You can't turn it off. You can't replace it. The veto power is a cage.&lt;/p&gt;

&lt;p&gt;The never-block contract changes the trade-off entirely. The LLM is an enhancement layered on top of a deterministic core, not the core itself. If it malfunctions, the workflow degrades gracefully to the deterministic verdict — which you already trust to be correct.&lt;/p&gt;

&lt;p&gt;You can replace the old system on the same day you deploy the new one. You're not hedging bets. You're not running both in parallel. You're not waiting for three weeks of production data to prove the new system is safe. You measure trust against the deterministic layer; the LLM is polish that can't revoke the decision.&lt;/p&gt;

&lt;p&gt;PR &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/719" rel="noopener noreferrer"&gt;#719&lt;/a&gt; merged. The next day PR &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/721" rel="noopener noreferrer"&gt;#721&lt;/a&gt; deleted &lt;code&gt;gemini-code-review.yml&lt;/code&gt; (179 lines of perpetually broken YAML) as a single breaking change.&lt;/p&gt;

&lt;p&gt;That PR removed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the workflow file itself&lt;/li&gt;
&lt;li&gt;the orphaned &lt;code&gt;ENABLE_GEMINI_REVIEW&lt;/code&gt; repo variable (operator deletes after merge)&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;--thorough&lt;/code&gt; flag on the validator (advertised in the README but with broken plumbing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also added two new surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scripts/pr-prescreen/audit.py&lt;/code&gt; — appends one row per pre-screen run to &lt;code&gt;freshie/inventory.sqlite&lt;/code&gt;, tracking the decision history for post-mortems and operator review. Inline &lt;code&gt;CREATE TABLE IF NOT EXISTS&lt;/code&gt; schema. &lt;code&gt;continue-on-error: true&lt;/code&gt; so DB failures don't mask the primary signal.&lt;/li&gt;
&lt;li&gt;a 265-line operator runbook at &lt;code&gt;000-docs/265-DR-GUID-pr-prescreen-system.md&lt;/code&gt; documenting the workflow, the verdicts, the audit schema, and the operator's playbook.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No transition period. No parallel runs. No "let's keep both for safety." The new system had been live for exactly one &lt;code&gt;workflow_dispatch&lt;/code&gt; manual invocation. That was enough to trust it — because the deterministic layer is the contract and it's testable end-to-end without the LLM in the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  First live invocation found two bugs
&lt;/h2&gt;

&lt;p&gt;The first production run was PR &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/722" rel="noopener noreferrer"&gt;#722&lt;/a&gt; — a hyperflow submission from an external contributor with 8 new skills. The run immediately surfaced two design flaws the test suite didn't catch, because the test suite ran against toy data and missed the production edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1: empty-changeset explosion.&lt;/strong&gt; PR #722 touched &lt;code&gt;sources.yaml&lt;/code&gt; only — no &lt;code&gt;plugins/&lt;/code&gt; paths. The changeset filter triggered a fallback I'd written without thinking: "no plugin paths matched" → pass through &lt;em&gt;all&lt;/em&gt; results → generate comment body for ~400 skills.&lt;/p&gt;

&lt;p&gt;GitHub's comment API caps bodies at 65,536 characters. The post failed silently. The deterministic verdict was correct, but the comment never landed and the Slack ping fired with an incomplete reference. Confusing signal to the operator. Real production bug caught by the first real-world input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 2: pointless comments.&lt;/strong&gt; Even after fixing Bug 1, every infrastructure or documentation PR would still get a "PASS: no plugin paths matched" comment. That's accurate — nothing happened. But it's signal without value: visible noise on every non-plugin PR. Noise erodes signal over time. After a week, the operator stops reading the comments.&lt;/p&gt;

&lt;p&gt;PR &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/724" rel="noopener noreferrer"&gt;#724&lt;/a&gt; fixed both in 50 lines net (+50/-163 after deleting the dead fallback). Three changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Empty changeset → emit empty filtered list (the classifier reports "no plugin paths," doesn't dump everything).&lt;/li&gt;
&lt;li&gt;Skip the Post Comment and Slack Notify steps entirely when &lt;code&gt;steps.diff.outputs.count == '0'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Cap the per-skill table at 100 rows and truncate the body to 65,000 characters with a clear marker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Empty changeset → no comment, no ping. The deterministic signal has preconditions. When those preconditions aren't met, the system stays silent. No noise. No confusion.&lt;/p&gt;

&lt;p&gt;The system found its own design flaws on first contact with reality. That's not a weakness of the never-block contract — that's the whole point. The deterministic classifier is safe enough to trust on its first invocation. The advisor runs under safe conditions. When reality violated the conditions, the system degraded gracefully and the operator fixed the precondition. Not the core logic. The core logic was never wrong. The assumptions feeding it were.&lt;/p&gt;

&lt;h2&gt;
  
  
  The spec was invisible — fix the surface
&lt;/h2&gt;

&lt;p&gt;PR #722 was thoughtful. The contributor read CONTRIBUTING.md, followed the issue template, and wrote skills that made technical sense. All 8 were structurally sound. And all 8 were missing every one of the 6 marketplace-required frontmatter fields plus every one of the 7 body sections.&lt;/p&gt;

&lt;p&gt;The expectations were buried in &lt;code&gt;000-docs/6767-b-SPEC-DR-STND-claude-skills-standard.md&lt;/code&gt; — the Global Master Standard for Claude Skills, v3.6.0, with the 100-point rubric and source citations against Anthropic + AgentSkills.io. Authoritative. Comprehensive. Invisible to contributors.&lt;/p&gt;

&lt;p&gt;No link from the PR template. No mention in CONTRIBUTING.md. No signpost in the plugin-submission issue template. The deterministic classifier caught it all as D and F grades and reported each one. That's correct — the validator is working. But the feedback loop was broken: the spec was invisible to contributors. Invisible requirements produce work that looks wrong until you read the fine print. By then you've already written it.&lt;/p&gt;

&lt;p&gt;PR &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/725" rel="noopener noreferrer"&gt;#725&lt;/a&gt; surfaced the spec on three contributor surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CONTRIBUTING.md&lt;/strong&gt; — new "Read the spec before you start" callout above "Before You Submit," with the distilled requirements and direct links.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.github/PULL_REQUEST_TEMPLATE.md&lt;/code&gt;&lt;/strong&gt; — top-of-template now points to CONTRIBUTING and to the spec. Also replaces stale "auto-review bot" phrasing that referred to the deleted Gemini workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.github/ISSUE_TEMPLATE/plugin-submission.yml&lt;/code&gt;&lt;/strong&gt; — adds a markdown description block with the spec callout and replaces 5 generic checkboxes with 7 spec-aware ones covering the real validator gates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also corrected two stale strings while I was there: "Gemini 2.5 Pro will post a review" → "the PR Pre-screen workflow," and the example switched from the deprecated &lt;code&gt;--enterprise&lt;/code&gt; flag to the current &lt;code&gt;--marketplace&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;The validator still grades the same way. The standard didn't move. But now the spec isn't a surprise buried 8 directories deep; it's the first thing you see when you open a pull request. The expected drop in D/F submissions isn't a change to the validator. It's a change to the surface contributors actually touch.&lt;/p&gt;

&lt;p&gt;The five-PR arc — &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/719" rel="noopener noreferrer"&gt;#719&lt;/a&gt;, &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/723" rel="noopener noreferrer"&gt;#723&lt;/a&gt;, &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/721" rel="noopener noreferrer"&gt;#721&lt;/a&gt;, &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/724" rel="noopener noreferrer"&gt;#724&lt;/a&gt;, &lt;a href="https://github.com/jeremylongshore/claude-code-plugins-plus-skills/pull/725" rel="noopener noreferrer"&gt;#725&lt;/a&gt; — is a case study in what never-block lets you do: ship something small, watch it collide with reality, and fix the collisions without unwinding the core. The deterministic classifier didn't change between Phase 1 and the hot-fix. The Groq advisory didn't change either. The preconditions and the surface visibility did, because reality demanded it.&lt;/p&gt;

&lt;p&gt;Deterministic first, LLM second, never-block contract always. That's the formula that lets you retire the old system on the same day and trust the replacement.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>githubactions</category>
      <category>llm</category>
      <category>deterministicsystems</category>
    </item>
    <item>
      <title>Transitive CVE Clearance: The Dual-Layer Pattern</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Sat, 16 May 2026 13:00:33 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/transitive-cve-clearance-the-dual-layer-pattern-22k4</link>
      <guid>https://dev.to/jeremy_longshore/transitive-cve-clearance-the-dual-layer-pattern-22k4</guid>
      <description>&lt;p&gt;You bump a direct dependency to pull in a patched transitive. &lt;code&gt;bun audit&lt;/code&gt; goes green. The lockfile is committed. Two weeks later, someone does a clean install on a fresh machine, and the vulnerable transitive comes back. This is the transitive CVE trap, and it catches teams with the first move alone.&lt;/p&gt;

&lt;p&gt;The v0.9.1 release of claude-code-slack-channel cleared 6 high-severity CVEs in axios and fast-uri. It required two distinct moves: first, bump the direct deps that pull the patched transitives. Second, pin those transitives at the top-level overrides block so the lockfile cannot regress on the next &lt;code&gt;bun install&lt;/code&gt;. Both moves are mandatory. Here's why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CVE Picture
&lt;/h2&gt;

&lt;p&gt;Six vulnerabilities came down from the audit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;axios&lt;/strong&gt; (multiple prototype-pollution and header-injection chains):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GHSA-q8qp-cvcw-x6jj — credential injection via prototype pollution&lt;/li&gt;
&lt;li&gt;GHSA-pmwg-cvhr-8vh7 — NO_PROXY bypass via 127.0.0.0/8&lt;/li&gt;
&lt;li&gt;GHSA-6chq-wfr3-2hj9 — header injection through polluted properties&lt;/li&gt;
&lt;li&gt;GHSA-pf86-5x62-jrwf — response-tampering gadgets in prototype chain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;fast-uri&lt;/strong&gt; (percent-encoding confusion):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GHSA-v39h-62p7-jpjc — host confusion via percent-encoded delimiters&lt;/li&gt;
&lt;li&gt;GHSA-q3j6-qgpj-74h6 — path traversal via percent-encoded dot segments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All were high severity. Axios was reachable through &lt;code&gt;@slack/web-api&lt;/code&gt;, and fast-uri through &lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move 1: Bump the Direct Deps
&lt;/h2&gt;

&lt;p&gt;The straightforward path: bump the deps that pull the patched versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@slack/web-api 7.15.0 → 7.15.2    (pulls axios ^1.13.5 → ^1.15.0, resolves to 1.16.1)
@modelcontextprotocol/sdk 1.27.1 → 1.29.0    (refreshes ajv → fast-uri 3.1.2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Commit this, run the lockfile lock, &lt;code&gt;bun audit&lt;/code&gt; shows green. Done, right?&lt;/p&gt;

&lt;p&gt;Not quite.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lockfile Trap
&lt;/h2&gt;

&lt;p&gt;Package managers use semantic versioning ranges. &lt;code&gt;@slack/web-api&lt;/code&gt; at 7.15.2 declares &lt;code&gt;axios ^1.15.0&lt;/code&gt;, which matches 1.15.x, 1.16.x, and newer. The first install on your CI or contributor's machine might pull 1.16.1 (the patched version). But six months later, when the MCP SDK maintainer releases a new version that also depends on axios &lt;em&gt;with a different range&lt;/em&gt; like &lt;code&gt;^1.13.0&lt;/code&gt;, and a contributor runs &lt;code&gt;bun install&lt;/code&gt; on a fresh checkout without the lockfile, the resolver has two legitimate paths to axios: one through Slack at 1.16.1 and one through MCP at 1.13.x. Package managers are free to choose — and if they pick the older one, the CVE is back.&lt;/p&gt;

&lt;p&gt;The lockfile prevents this &lt;em&gt;within a known tree&lt;/em&gt;, but it has a shelf life. Lockfiles can be ignored (clean install), overridden (manual dependency update), or corrupted (merge conflicts). The real guard is a top-level override that says: "No matter what ranges the transitives declare, axios stays at ^1.16.1 and fast-uri stays at ^3.1.2, always."&lt;/p&gt;

&lt;h2&gt;
  
  
  Move 2: Pin at the Top-Level Overrides Block
&lt;/h2&gt;

&lt;p&gt;In Bun (and npm/yarn with overrides support), you declare a top-level policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@slack/web-api"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7.15.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@modelcontextprotocol/sdk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.29.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"overrides"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"axios"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.16.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fast-uri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^3.1.2"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;overrides&lt;/code&gt; block forces every transitive reference to those packages to resolve through the pinned versions, &lt;em&gt;regardless of what ranges the direct deps declare&lt;/em&gt;. Now a future lockfile, a fresh install, a contributor on a different machine — all of them get the patched versions. The CVE cannot re-emerge through a range mismatch.&lt;/p&gt;

&lt;p&gt;Without the override, the next &lt;code&gt;bun install&lt;/code&gt; on a clean tree could legally pull axios 1.13.x (or whatever version a new transitive path declares) and the CVE is back. With the override, it cannot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Both Moves Matter
&lt;/h2&gt;

&lt;p&gt;Move 1 (the dep bump) gets the patched version into the lockfile the first time and signals intent to the dependency tree. Move 2 (the override) is the insurance policy — it says "this version is non-negotiable" to any future resolver, whether it's a clean install, a new team member, or a GitHub Actions runner months from now.&lt;/p&gt;

&lt;p&gt;Neither move alone is complete. Bump without override = fragile; override without bump = signals a different problem (the direct dep is stale and needs its own fix). Both together = the CVE cannot come back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evidence: The Full Gauntlet
&lt;/h2&gt;

&lt;p&gt;The release ran the Intent Solutions testing gauntlet on every change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;704/704 tests passing&lt;/strong&gt; (unit + integration + system + E2E)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;98.47% line coverage, 98.82% function coverage&lt;/strong&gt; (floor enforced by CI gate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cyclomatic complexity max = 28&lt;/strong&gt; (threshold = 30, no violations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harness-hash integrity verified&lt;/strong&gt; (test policy signatures unchanged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depcruise clean&lt;/strong&gt; (dependency graph validated, no cycles, no forbidden imports)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gherkin-lint clean&lt;/strong&gt; (all acceptance test syntax valid)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bun audit --audit-level=high clean&lt;/strong&gt; (excluding one known unpatched transitive marked safe by policy)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A version bump that doesn't clear the full gauntlet doesn't ship. This one did.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Work in the Same Release Window
&lt;/h2&gt;

&lt;p&gt;PR #162 (external contributor @PGMacDesign) fixed the file-upload extension bug — uploads were defaulting to &lt;code&gt;file.txt&lt;/code&gt; because the filename wasn't being passed to &lt;code&gt;filesUploadV2&lt;/code&gt;. That change rode the same release vehicle, showing the dual-layer pattern applies to all release-critical fixes, not just CVEs.&lt;/p&gt;

&lt;p&gt;PR #164 cleaned up documentation drift after the CVE work landed — updated CLAUDE.md cross-references, dropped the gemini-review workflow (now handled via GitHub App), refreshed the source file LoC table to match the 704-test count, and softened coverage claims to "~704 / ~4,035" with a note that the floor is the real gate, not the count.&lt;/p&gt;

&lt;p&gt;All three PRs (#162, #163, #164) merged into a single release tag with a 157-line AAR documenting the bump rationale, the CVE IDs, the test results, and the decision to include the external contribution in the same release window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Clearing a transitive CVE is not a one-move operation. Bump the direct dep, run the gauntlet, add the top-level override, and commit both. The override is the difference between a fix that sticks and a fix that waits for the next fresh install to fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/ccsc-five-releases-one-day-security-sprint/"&gt;CCSC: Five Releases in One Day — Security Sprint&lt;/a&gt; — the prior security sprint on the same repo, where the v0.8.x baseline got hardened before this v0.9.1 patch.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/slack-channel-security-hardening-v020-external-contributors/"&gt;Slack Channel Security Hardening v0.2.0 — External Contributors&lt;/a&gt; — earlier hardening pass plus an external-contributor merge story, parallel to today's #162.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/audit-harness-v010-enforcement-travels-with-code/"&gt;Audit Harness v0.1.0 — Enforcement Travels with the Code&lt;/a&gt; — the vendored gauntlet (&lt;code&gt;.audit-harness/&lt;/code&gt;) that produced the 704/704 + 98.47% evidence in this release.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>node</category>
      <category>dependencymanagement</category>
      <category>bunjs</category>
    </item>
    <item>
      <title>Three Guards Against Shipping Slop</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Fri, 15 May 2026 13:00:22 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/three-guards-against-shipping-slop-gd2</link>
      <guid>https://dev.to/jeremy_longshore/three-guards-against-shipping-slop-gd2</guid>
      <description>&lt;p&gt;Seven pull requests landed on a single partner fork in one day, alongside half a dozen upstream issue filings and the closeout of a prior audit round. That is a velocity that produces slop by default. The slop did not ship — not because the work was careful, but because three distinct guards were standing between the work and the partner, each catching a different class of failure the other two would have missed.&lt;/p&gt;

&lt;p&gt;This post is about those three guards. Not about the velocity. The velocity is the symptom. The guards are the system.&lt;/p&gt;

&lt;p&gt;The engagement is Kobiton, a mobile device cloud partner running an MCP server at &lt;code&gt;api.kobiton.com/mcp&lt;/code&gt;. The day's output included a hooks bundle, an agents addition, a server-side audit slate, and a consistency cleanup — PRs #39 through #45 on the fork. Any one of those, shipped wrong, would have cost partner credibility. None shipped wrong. Three guards caught the slop at three different moments in the workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guard 1: Adversarial pre-flight on the hooks bundle
&lt;/h2&gt;

&lt;p&gt;Before the hooks bundle PR went up, three specialist subagents ran in parallel against the raw artifact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;code-reviewer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;security-auditor&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;test-automator&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output was not gentle. Six BLOCKERs and eight HIGHs surfaced between the three reviewers. The PreToolUse envelope shape was wrong. The credential-handling strategy was unsafe — hooks were going to make authenticated API calls from inside a Claude Code session, with a credential surface that nobody had thought through. The &lt;code&gt;appId&lt;/code&gt; parameter was an SSRF vector. Error responses echoed PII. There was a ReDoS in input parsing. &lt;code&gt;CLAUDE_PROJECT_DIR&lt;/code&gt; vs &lt;code&gt;CLAUDE_PLUGIN_ROOT&lt;/code&gt; was confused throughout. The shell-vs-exec form choice was wrong for several handlers. TLS and timeout defaults were missing entirely.&lt;/p&gt;

&lt;p&gt;The bundle was BLOCKED from submission. Not "submit with caveats" — blocked, with a re-review date of 2026-05-21.&lt;/p&gt;

&lt;p&gt;The hooks PR that actually landed — #44 — was a redesigned advisory-only bundle. No API calls from hooks at all. The credential surface that produced half the BLOCKERs was eliminated by design, not patched. 28 new tests passed. The artifact that shipped was a different artifact than the one that was queued to ship.&lt;/p&gt;

&lt;p&gt;The transferable insight is about reviewer parallelism. A security reviewer reading after a code reviewer reads a different file than the one the code reviewer read. The code reviewer has already mentally cleared the surface; the security reviewer inherits that clearance silently. Running the three reviewers in parallel against the raw artifact — each one seeing the actual code, none of them inheriting another reviewer's frame — is what surfaced the BLOCKERs.&lt;/p&gt;

&lt;p&gt;Serial review with the same three personas would likely have caught fewer issues — this is a structural inference from how reviewer framing inherits, not a measured comparison. The parallelism is load-bearing. It is also adversarial by construction: each reviewer is graded on what they find, not on consensus with the others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guard 2: Empirical verification over inference on the server-side audit
&lt;/h2&gt;

&lt;p&gt;The R3 server-side audit slate for Kobiton — the set of findings about what the MCP server does and does not implement — started as a documentation review. Read the public docs, reason about the MCP protocol, file findings about apparent gaps.&lt;/p&gt;

&lt;p&gt;Several DRAFT findings carried inference-grade language. "Likely missing." "Probably not declared." "Appears to omit." That language is a tell. Inference-grade findings filed to a partner are slop with a hedge attached. The hedge does not protect anyone — the partner still has to spend cycles refuting wrong claims.&lt;/p&gt;

&lt;p&gt;The work shifted from inference to probe. Using the &lt;code&gt;getCredential&lt;/code&gt; MCP tool to obtain a real Kobiton API key, the audit executed raw authenticated probes against &lt;code&gt;api.kobiton.com/mcp&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;initialize&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;resources/list&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prompts/list&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;resources/templates/list&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The verbatim server response to &lt;code&gt;initialize&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;protocolVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2025-03-26&lt;/span&gt;
&lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}}&lt;/span&gt;
&lt;span class="na"&gt;serverInfo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kobiton"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0"&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resources, prompts, and templates list each returned JSON-RPC error &lt;code&gt;-32601&lt;/code&gt; (method not found).&lt;/p&gt;

&lt;p&gt;Six findings flipped from "likely missing" to verified against a server response: F36 (instructions field absent), F37 (resources capability absent), F38 (prompts capability absent), F42a (tools.listChanged not declared), F42b (resource subscriptions not declared), plus a newly-discovered protocol version lag — the server declares &lt;code&gt;2025-03-26&lt;/code&gt; against a current spec of &lt;code&gt;2025-11-25&lt;/code&gt;, two releases behind.&lt;/p&gt;

&lt;p&gt;The OAuth retraction is the load-bearing example for this guard, and it is worth describing in detail because the retraction is more valuable than the original finding would have been.&lt;/p&gt;

&lt;p&gt;Bundle 3 DRAFT claimed Kobiton was missing three things: RFC 9728 (Protected Resource Metadata), RFC 8414 (Authorization Server Metadata), and &lt;code&gt;WWW-Authenticate&lt;/code&gt; response headers entirely. Those claims were built from doc review. They were wrong.&lt;/p&gt;

&lt;p&gt;The empirical probe showed all three were already implemented. Kobiton's MCP server has OAuth 2.1 with PKCE S256 and dynamic client registration. The well-known metadata endpoints respond. &lt;code&gt;WWW-Authenticate&lt;/code&gt; is emitted. The original Bundle 3 finding — filed as a serious gap — was wrong on its central claims.&lt;/p&gt;

&lt;p&gt;The Bundle 3 issue body got rewritten. The wrong claims were withdrawn. The bundle was narrowed to two real, verified gaps: F41d (the &lt;code&gt;resource_indicators_supported&lt;/code&gt; field is undeclared) and F41e (the &lt;code&gt;WWW-Authenticate&lt;/code&gt; header is inconsistent on bad-token 401 responses). The issue body included an explicit sourcing-discipline paragraph: here is what was wrong, here is why it was wrong, here is the corrected scope.&lt;/p&gt;

&lt;p&gt;The transferable insight is about retraction economics. The credibility cost of an unwithdrawn wrong claim compounds — every future finding from the same audit gets read through the lens of "they got OAuth wrong, what else did they get wrong?" The credibility cost of an explicit retraction is small and decays fast. The partner reads the retraction, registers that the audit corrects itself, and the next finding gets evaluated on its merits.&lt;/p&gt;

&lt;p&gt;Inference-grade findings shipped to partners are not "drafts" or "starting points." They are slop with a hedge. If the system can produce a verbatim server response, the audit has to produce one before the finding ships.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guard 3: Post-delivery consistency sweep against fork main
&lt;/h2&gt;

&lt;p&gt;After PRs #39 through #44 landed, the work ran &lt;code&gt;/validate-consistency&lt;/code&gt; against the fork's &lt;code&gt;main&lt;/code&gt; branch. Each PR had been internally consistent. The sweep returned seven findings anyway — all of them cross-PR drift.&lt;/p&gt;

&lt;p&gt;Critical findings, two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AGENTS.md&lt;/code&gt; was missing from the fork root, but the agents and hooks PRs both referenced it as if it existed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;package.json&lt;/code&gt; was still at version &lt;code&gt;1.0.0&lt;/code&gt; while &lt;code&gt;plugin.json&lt;/code&gt; had been bumped to &lt;code&gt;1.0.2&lt;/code&gt;. The version bump happened on one surface but not the other.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Warning findings, four:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The README had no section for the new &lt;code&gt;agents/&lt;/code&gt; directory introduced by PR #41.&lt;/li&gt;
&lt;li&gt;The README had no section for the new &lt;code&gt;hooks/&lt;/code&gt; directory introduced by PR #44.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SKILL.md&lt;/code&gt; claimed Node &lt;code&gt;&amp;gt;=18&lt;/code&gt; while CI was already pinned to Node &lt;code&gt;20&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A fork-side issue reference read just &lt;code&gt;#28&lt;/code&gt; with no owner — ambiguous between upstream and fork.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Info finding, one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;package.json&lt;/code&gt; and &lt;code&gt;marketplace.json&lt;/code&gt; had divergent descriptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All seven resolved in PR #45, a single cleanup pass. &lt;code&gt;AGENTS.md&lt;/code&gt; got created at the fork root, 72 lines, every claim sourced. &lt;code&gt;package.json&lt;/code&gt; bumped to &lt;code&gt;1.0.2&lt;/code&gt; to match &lt;code&gt;plugin.json&lt;/code&gt;. README gained sections for both &lt;code&gt;agents/&lt;/code&gt; and &lt;code&gt;hooks/&lt;/code&gt;. &lt;code&gt;SKILL.md&lt;/code&gt; Node compatibility was updated to match CI. The bare &lt;code&gt;#28&lt;/code&gt; was disambiguated to &lt;code&gt;jeremylongshore/automate#28&lt;/code&gt;. The two manifest descriptions were aligned.&lt;/p&gt;

&lt;p&gt;None of those seven findings would have been caught by reviewing any individual PR. Each PR was internally consistent. The drift only existed in the relational space between the PRs — file A references file B that does not exist yet, version X on surface 1 lags version Y on surface 2, description in manifest M diverges from description in manifest N.&lt;/p&gt;

&lt;p&gt;The transferable insight is about review topology. Pre-submission review operates on one artifact at a time. Cross-artifact drift is structurally invisible to that frame. Running a consistency sweep as the closing move of the day catches a class of slop that pre-submission review cannot catch by design.&lt;/p&gt;

&lt;p&gt;The sweep is cheap. The cleanup PR is small. The slop it prevents is the kind partners notice quietly and never mention — the README that does not describe what the repo contains, the version numbers that disagree with themselves, the references to files that do not exist. Quiet slop is the most expensive kind because the partner does not file a bug; they just lower their estimate of the engagement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What three guards do not catch
&lt;/h2&gt;

&lt;p&gt;These three guards target three specific failure classes. The pre-flight guard catches surface flaws in the artifact being shipped. The empirical verification guard catches inference-grade claims about external systems. The post-delivery consistency guard catches cross-artifact drift.&lt;/p&gt;

&lt;p&gt;None of the three catches bad strategic choices. If the underlying decision to ship an advisory-only hooks bundle — rather than no hooks at all, or rather than blocking hooks with a serious credential design — was wrong, the guards would not flag it. They would clear a well-built version of the wrong thing.&lt;/p&gt;

&lt;p&gt;None of the three catches architectural drift over weeks. All three operate on a single day's window. A long arc of individually-consistent decisions adding up to a wrong system needs a different mechanism — typically a periodic architecture review or a deliberate retro, neither of which fits inside a daily ship cycle.&lt;/p&gt;

&lt;p&gt;None of the three catches bad communication with the partner. The guards catch wrong claims, not wrong tone, wrong cadence, or wrong escalation. A correctly-filed finding delivered with the wrong framing to the wrong person at the wrong moment is still a credibility hit. That problem lives outside the guards.&lt;/p&gt;

&lt;p&gt;The guards are necessary, not sufficient. They eliminate a category of public embarrassment; they do not produce good engineering. Good engineering happens upstream of the guards, in the choices about what to build and what to file. The guards make sure that the choices, once made, ship in a defensible form.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the day actually demonstrated
&lt;/h2&gt;

&lt;p&gt;Seven PRs in a day on a partner engagement — with upstream filings and a prior audit round closing out in parallel — is a velocity that produces slop by default. That is the baseline. Velocity without a system underneath it is the slop pattern.&lt;/p&gt;

&lt;p&gt;The slop did not ship today because three different mechanisms caught three different classes of error at three different moments. The pre-flight guard caught the hooks bundle before submission and forced a redesign that eliminated the credential surface entirely. The empirical verification guard caught the inference-grade OAuth claims before they shipped and converted the bundle into a narrower, defensible scope with an explicit retraction. The post-delivery consistency guard caught seven instances of cross-PR drift after the PRs landed and resolved them in a single cleanup pass.&lt;/p&gt;

&lt;p&gt;The retraction is worth a second mention. Withdrawing wrong claims with an explicit sourcing-discipline paragraph is the kind of artifact that builds long-term credibility with a partner more than a perfect first submission would. A perfect first submission demonstrates competence. A retraction demonstrates a working error-correction loop. Partners optimize for working error-correction loops because they assume errors will happen — what they care about is what happens after.&lt;/p&gt;

&lt;p&gt;This is the system. Not the velocity, the system underneath the velocity. The velocity is downstream of the system, not the other way around. Seven PRs in a day is safe if the three guards are running. Seven PRs in a day without the guards is a slop event waiting to be discovered by the partner — usually quietly, usually without a bug report, usually as a downward revision of trust that nobody articulates.&lt;/p&gt;

&lt;p&gt;The lesson is not "ship faster." The lesson is "build the guards first, then the velocity is allowed."&lt;/p&gt;

</description>
      <category>partnerengineering</category>
      <category>codereview</category>
      <category>qualitygates</category>
      <category>engineeringdiscipline</category>
    </item>
    <item>
      <title>Two False-Positive Fixes, Same Root Cause</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Thu, 14 May 2026 13:00:27 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/two-false-positive-fixes-same-root-cause-3387</link>
      <guid>https://dev.to/jeremy_longshore/two-false-positive-fixes-same-root-cause-3387</guid>
      <description>&lt;p&gt;Two separate monitoring failures on the same day, same root cause. Both fixed by answering a single question: "Am I testing for health, or am I testing for perfect conditions?" The distinction matters because perfect conditions are temporary, and health is structural. And once you see the pattern once, you see it everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context: production on a shared VPS
&lt;/h2&gt;

&lt;p&gt;The Braves stack runs on Contabo (24 GiB RAM, 6 CPUs). Five Docker stacks share that hardware: Braves (frontend, backend, pybaseball), Plane (13 containers), Twenty (5 containers), Umami (3 containers), and ntfy (1 container). 25 containers total. Single ingress: Caddy reverse proxy. Single disk. When one stack's load spikes, all five feel it.&lt;/p&gt;

&lt;p&gt;This architecture means healthchecks and deployment validators are sensitive to global state, not just stack-local state. A healthcheck that works under isolated test conditions can fail when the VPS is under collective load. A validator that passes in the afternoon can fail at 2 AM when a different stack is doing batch work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptom
&lt;/h2&gt;

&lt;p&gt;On May 11, two separate failure modes emerged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;False-positive container-unhealthy alerts firing ~10 times per day.&lt;/strong&gt; Each one triggered: manual inspection, "nope, it's fine," return to normal operations. Repeat. The notification log became noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every off-hours deploy auto-rolling back without an obvious cause.&lt;/strong&gt; Off-season deployments (which are mostly off-hours) all failed smoke checks and rolled back. The CI pipeline was effectively blocked for non-emergency pushes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both failures traced to monitoring expressions that mixed structural health signals with situational condition signals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix one: TCP over HTTP fetch
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The setup
&lt;/h3&gt;

&lt;p&gt;Healthchecks for the Braves containers ran every 10 seconds, invoking Node's global &lt;code&gt;fetch&lt;/code&gt; (or &lt;code&gt;urllib.request&lt;/code&gt; for the Python service) to make an HTTP round-trip to a local status endpoint. The logic was straightforward: open connection, validate response, exit on failure. The Docker healthcheck timeout was 5 seconds.&lt;/p&gt;

&lt;p&gt;Performance profile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Light load (loadavg &amp;lt; 2): fetch completed in 5–20 ms.&lt;/li&gt;
&lt;li&gt;Moderate load (loadavg 2–8): fetch completed in 100–500 ms.&lt;/li&gt;
&lt;li&gt;High load (loadavg &amp;gt; 10): fetch sometimes failed to complete within 5 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The failure cascade
&lt;/h3&gt;

&lt;p&gt;When the healthcheck timed out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Docker retried the check every 10 seconds.&lt;/li&gt;
&lt;li&gt;After 5 consecutive timeouts (50 seconds), Docker marked the container unhealthy.&lt;/li&gt;
&lt;li&gt;Netdata observed the state change and fired a &lt;code&gt;docker_container_unhealthy&lt;/code&gt; alert.&lt;/li&gt;
&lt;li&gt;The alert flowed through ntfy to mobile notifications: "scorecardecho is down."&lt;/li&gt;
&lt;li&gt;Manual inspection: the container was fine, the process was responding, load was just high.&lt;/li&gt;
&lt;li&gt;Clear the alert, wait for the cycle to repeat.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This happened ~10 times per day, every single day.&lt;/p&gt;

&lt;h3&gt;
  
  
  The assumption that bit
&lt;/h3&gt;

&lt;p&gt;Fetch-based healthchecks assume light load. They assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The event loop has microseconds to spare for I/O&lt;/li&gt;
&lt;li&gt;The network isn't congested&lt;/li&gt;
&lt;li&gt;The kernel isn't swapping&lt;/li&gt;
&lt;li&gt;No other workload is competing for scheduler time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All true most of the time. Not true on a shared VPS where 24 other production containers are running. Not true when pybaseball is churning through XML parsing. Not true when Plane is sync-checking its database. The healthcheck assumed the happy path—and the production VPS spends most of its time off the happy path.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix (commit &lt;code&gt;cbb4f6e&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Replace the HTTP fetch with a raw TCP connect. Verification moves from the application layer down to a single SYN/ACK exchange — the work the kernel was already doing to accept the connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;  healthcheck:
&lt;span class="gd"&gt;-   test: ["CMD-SHELL", "node -e \"fetch('http://localhost:3001/api/health').then(r=&amp;gt;{if(!r.ok)process.exit(1)}).catch(()=&amp;gt;process.exit(1))\""]
-   interval: 10s
&lt;/span&gt;&lt;span class="gi"&gt;+   test: ["CMD-SHELL", "node -e \"require('net').connect(3001,'localhost').on('connect',function(){this.end();process.exit(0)}).on('error',function(){process.exit(1)})\""]
+   interval: 30s
&lt;/span&gt;    timeout: 5s
&lt;span class="gd"&gt;-   retries: 5
&lt;/span&gt;&lt;span class="gi"&gt;+   retries: 3
+   start_period: 15s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python service got the equivalent treatment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- test: ["CMD-SHELL", "python3 -c \"import urllib.request; urllib.request.urlopen('http://localhost:8001/health')\" || exit 1"]
&lt;/span&gt;&lt;span class="gi"&gt;+ test: ["CMD-SHELL", "python3 -c \"import socket; s=socket.create_connection(('localhost',8001),2); s.close()\""]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both new checks open a TCP connection to the port, immediately close it, and exit. No HTTP parsing. No JSON. No event-loop work beyond the socket call itself. The kernel completes the SYN/ACK in microseconds even when the application thread is stalled. This pattern works in any container image that already has &lt;code&gt;node&lt;/code&gt; or &lt;code&gt;python3&lt;/code&gt; — no extra binaries to install.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning alongside the fix
&lt;/h3&gt;

&lt;p&gt;Three other changes shipped together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interval 10s → 30s:&lt;/strong&gt; Polling three times less frequently means 3× fewer state transitions, 3× fewer container-state callback executions, 3× fewer potential false positives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retries 5 → 3:&lt;/strong&gt; Before: unhealthy after 50 seconds. After: unhealthy after 90 seconds. Trades slightly earlier detection of real outages for dramatically lower false-positive noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;start_period: 15s&lt;/code&gt; added:&lt;/strong&gt; Containers no longer fail healthcheck during startup when they're still bootstrapping.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational pairing: Netdata hold-down
&lt;/h3&gt;

&lt;p&gt;The VPS runs Netdata for monitoring. A separate change added a 2-minute hold-down before alerting on &lt;code&gt;docker_container_unhealthy&lt;/code&gt;. A brief glitch—a 10-second spike in load, a temporary network hiccup—can't page anymore. It has to persist for 120 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;Unhealthy alerts dropped from ~10 per day to zero. The notification log went silent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix two: drop the mode signal from deployment validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The setup
&lt;/h3&gt;

&lt;p&gt;The deployment smoke check for the Braves backend used a jq filter applied to the app's status endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.status == "ok" and .gumbo.running == true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first part is a liveness signal: the app is responding and healthy. The second part is a mode signal: the gumbo processor (which handles game-update XML) is currently running. When this filter was written—probably during baseball season when games are daily—both conditions made intuitive sense. Both seemed permanent.&lt;/p&gt;

&lt;h3&gt;
  
  
  The failure cascade
&lt;/h3&gt;

&lt;p&gt;Most of the calendar is &lt;em&gt;between&lt;/em&gt; games:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Off-season (November–March)&lt;/li&gt;
&lt;li&gt;Post-game (after each game ends)&lt;/li&gt;
&lt;li&gt;Pre-game (before first pitch, morning hours)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During these windows, &lt;code&gt;gumbo.running&lt;/code&gt; is false. Most deployments happen off-hours. So most off-hours deployments triggered a smoke check that required &lt;code&gt;gumbo.running == true&lt;/code&gt;. The app was fine. The status was &lt;code&gt;"ok"&lt;/code&gt;. But the game processor was inactive. The filter conjunction failed. The deployment workflow interpreted the failure as "deployment is broken, roll back." Automatic rollback fired. Every single off-hours deploy. Without exception.&lt;/p&gt;

&lt;p&gt;This blocked the entire CI pipeline for off-season work. No off-hours deployments could land unless manually overridden.&lt;/p&gt;

&lt;h3&gt;
  
  
  The assumption that bit
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;gumbo.running&lt;/code&gt; is a &lt;em&gt;temporary&lt;/em&gt; signal. It's true when a game is in progress. False when there isn't one. During the offseason it's false for months straight.&lt;/p&gt;

&lt;p&gt;The smoke check mixed a permanent structural signal (&lt;code&gt;status == "ok"&lt;/code&gt; = the app is healthy) with a temporary situational signal (&lt;code&gt;gumbo.running == true&lt;/code&gt; = a game is active right now). It required both to be true, as if they were equivalent. They aren't. An app is healthy between games just as much as it's healthy during games. Health and game-processing mode are orthogonal.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix (commit &lt;code&gt;5b9fe26&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Remove the mode condition entirely. The filter now simply validates health:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;-.status == "ok" and .gumbo.running == true
&lt;/span&gt;&lt;span class="gi"&gt;+.status == "ok"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single question: "Is the app responding correctly?" Nothing about what it's processing. Nothing about external conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;Off-hours deployments stopped auto-rolling back. The CI pipeline unblocked. Every deploy now passes smoke validation as long as the app is actually healthy, regardless of whether a game is in progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shared lesson
&lt;/h2&gt;

&lt;p&gt;Both fixes follow the same pattern: a monitoring expression conjoined two signals where one was structural and the other was situational.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;th&gt;Structural Signal&lt;/th&gt;
&lt;th&gt;Situational Signal&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#1 (healthcheck)&lt;/td&gt;
&lt;td&gt;"Process is listening on port 3000"&lt;/td&gt;
&lt;td&gt;"Load is light enough for a 5-second fetch"&lt;/td&gt;
&lt;td&gt;Always true? No.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#2 (smoke check)&lt;/td&gt;
&lt;td&gt;"App responds with ok status"&lt;/td&gt;
&lt;td&gt;"Game processor is running"&lt;/td&gt;
&lt;td&gt;Always true? No.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When the situational signal became false—as situational signals do—the conjunction failed, and the alarm fired. The system was healthy. The alarm was noise.&lt;/p&gt;

&lt;p&gt;The pattern emerges because it &lt;em&gt;feels&lt;/em&gt; right when you write it. "The app should be healthy &lt;em&gt;and&lt;/em&gt; the load should be light." "The container should be healthy &lt;em&gt;and&lt;/em&gt; the game should be in progress." Both conditions seem like they should always be true. They're not. Situational conditions change. The moment you conjoin them with structural health signals, you've created a trap. The conjunction becomes true only under the narrow circumstances you happened to be testing in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three ways to break the trap
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Remove the situational condition
&lt;/h4&gt;

&lt;p&gt;Ask only the health question. Strip the conjunction down to the structural signal.&lt;/p&gt;

&lt;h4&gt;
  
  
  Move to a separate alert
&lt;/h4&gt;

&lt;p&gt;"Is the app healthy?" and "Is the game processor running?" are two questions. They should be two checks, not one. Alert on each independently.&lt;/p&gt;

&lt;h4&gt;
  
  
  Document the assumption
&lt;/h4&gt;

&lt;p&gt;If the check fails when a situational condition flips, say so in the alert message so responders know the system is fine without manual intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  The checklist before merging a monitoring expression
&lt;/h3&gt;

&lt;p&gt;List every condition it depends on staying true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"This healthcheck assumes load is under threshold N."&lt;/li&gt;
&lt;li&gt;"This smoke check assumes a game is in progress."&lt;/li&gt;
&lt;li&gt;"This alert assumes the cache is populated."&lt;/li&gt;
&lt;li&gt;"This validator assumes the external service is available."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any condition can become false — and most can — apply one of the three fixes above.&lt;/p&gt;

&lt;p&gt;A healthcheck should answer: "Is this process alive?" A deployment validator should answer: "Does the app respond correctly?" Neither should answer: "And is everything perfect?" Perfect is temporary. Healthy is structural.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Also shipped:&lt;/strong&gt; hubspot-pack v2.0.0 landed the same day, consolidating 30 templated skills into 10 production-engineering skills following the guidewire v2 pattern. Also: porkbun-dnssec-caa.sh script pinning DNSSEC/CAA on intentsolutions.io as a Rekor predicate precondition.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>healthchecks</category>
      <category>monitoring</category>
      <category>cicd</category>
    </item>
    <item>
      <title>AGENTS.md as a Cross-Tool Plugin Brief: A Case Study from kobiton/automate</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Tue, 12 May 2026 13:00:26 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/agentsmd-as-a-cross-tool-plugin-brief-a-case-study-from-kobitonautomate-36h</link>
      <guid>https://dev.to/jeremy_longshore/agentsmd-as-a-cross-tool-plugin-brief-a-case-study-from-kobitonautomate-36h</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Canonical home:&lt;/strong&gt; This post first appeared on Kobiton's blog at &lt;a href="https://kobiton.com/blog/agents-md-cross-tool-plugin-brief-case-study-kobiton-automate/" rel="noopener noreferrer"&gt;kobiton.com/blog/agents-md-cross-tool-plugin-brief-case-study-kobiton-automate&lt;/a&gt;. This page mirrors it; SEO authority consolidates to the Kobiton URL via &lt;code&gt;rel="canonical"&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  AGENTS.md as a Cross-Tool Plugin Brief: A Case Study from kobiton/automate
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — I ran a 5-device parity sweep against Kobiton's real-device cloud through the &lt;code&gt;kobiton/automate&lt;/code&gt; Claude Code plugin. iOS screenshot capture came in ~17% faster than Android in this run. The interesting part isn't the gap — it's that the plugin doesn't document the gap, or the post-&lt;code&gt;deleteSession&lt;/code&gt; cooldown, or which Appium log endpoints actually work. That's what an &lt;code&gt;AGENTS.md&lt;/code&gt; file is for, and PR #10 on the repo is starting to add one. This is a worked example of what should go in it.&lt;/p&gt;

&lt;p&gt;I spent last week poking at &lt;a href="https://github.com/kobiton/automate" rel="noopener noreferrer"&gt;&lt;code&gt;kobiton/automate&lt;/code&gt;&lt;/a&gt;, the Claude Code plugin that fronts Kobiton's real-device cloud. Five devices, two pools, both major mobile platforms, one small WebDriverIO harness. The numbers showed something plugin authors rarely publish: iOS screenshot capture was about 17% faster than Android across the sample.&lt;/p&gt;

&lt;p&gt;That gap isn't a bug. It's platform variance. But it's the kind of variance you want surfaced before your CI bill quietly compounds it — and surfacing things like this is exactly what a cross-tool agent brief like &lt;code&gt;AGENTS.md&lt;/code&gt; is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The plugin
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;kobiton/automate&lt;/code&gt; is a thin Claude Code plugin pointing at a remote MCP server (&lt;code&gt;https://api.kobiton.com/mcp&lt;/code&gt;). The repo holds manifests, one skill, schemas, and docs. Appium still runs the driver loop once a session opens. That's the right boundary. The plugin doesn't pretend to be Appium; it just helps the agent get into a working session and back out cleanly.&lt;/p&gt;

&lt;p&gt;The public repo currently exposes 12 MCP tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Devices&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;listDevices&lt;/code&gt;, &lt;code&gt;getDeviceStatus&lt;/code&gt;, &lt;code&gt;reserveDevice&lt;/code&gt;, &lt;code&gt;terminateReservation&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sessions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;listSessions&lt;/code&gt;, &lt;code&gt;getSession&lt;/code&gt;, &lt;code&gt;getSessionArtifacts&lt;/code&gt;, &lt;code&gt;terminateSession&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apps&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;listApps&lt;/code&gt;, &lt;code&gt;uploadAppToStore&lt;/code&gt;, &lt;code&gt;confirmAppUpload&lt;/code&gt;, &lt;code&gt;getApp&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Last week the team opened &lt;a href="https://github.com/kobiton/automate/pull/10" rel="noopener noreferrer"&gt;PR #10&lt;/a&gt;, which adds GitHub Copilot CLI support and an &lt;code&gt;AGENTS.md&lt;/code&gt; file. Five files changed, 75 lines added. As of writing it's open and marked in testing. Most of the diff is portability work — declaring skill and MCP paths, swapping Claude-specific phrasing for neutral language, and adding the agent-facing instructions file itself.&lt;/p&gt;

&lt;p&gt;That PR is what made me want to write this up. It's a real example of a plugin moving from "works in Claude Code" to "any reasonable coding agent can read this and behave."&lt;/p&gt;

&lt;h2&gt;
  
  
  The parity sweep
&lt;/h2&gt;

&lt;p&gt;The harness is small. Open an Appium session, take five screenshots, record boot wall-clock and per-screenshot p50, terminate cleanly. Five devices:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Device pool&lt;/th&gt;
&lt;th&gt;OS&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Boot ms&lt;/th&gt;
&lt;th&gt;Screenshot p50&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PRIVATE&lt;/td&gt;
&lt;td&gt;Android 13&lt;/td&gt;
&lt;td&gt;Galaxy A52s 5G&lt;/td&gt;
&lt;td&gt;4,206&lt;/td&gt;
&lt;td&gt;353&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLOUD&lt;/td&gt;
&lt;td&gt;Android 9&lt;/td&gt;
&lt;td&gt;moto g(7) play&lt;/td&gt;
&lt;td&gt;5,451&lt;/td&gt;
&lt;td&gt;297&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PRIVATE&lt;/td&gt;
&lt;td&gt;iOS 17.5.1&lt;/td&gt;
&lt;td&gt;iPhone XR&lt;/td&gt;
&lt;td&gt;5,091&lt;/td&gt;
&lt;td&gt;242&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLOUD&lt;/td&gt;
&lt;td&gt;iOS 18.6&lt;/td&gt;
&lt;td&gt;iPhone 14 Plus&lt;/td&gt;
&lt;td&gt;4,490&lt;/td&gt;
&lt;td&gt;306&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLOUD&lt;/td&gt;
&lt;td&gt;iOS 18.6.2&lt;/td&gt;
&lt;td&gt;iPad 9th Gen&lt;/td&gt;
&lt;td&gt;5,259&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boot times spread ~30%.&lt;/li&gt;
&lt;li&gt;Screenshot p50 spread ~46%.&lt;/li&gt;
&lt;li&gt;Android averaged ~325ms per screenshot.&lt;/li&gt;
&lt;li&gt;iOS averaged ~268ms — about 17% faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five devices is not a fleet study, so don't read this as "iOS wins." What's worth noticing is that platform mattered more than pixel count. The fastest screenshot in the run came off an iPhone XR at 828×1792; the slowest came off a Galaxy A52s 5G at 1080×2400. Resolution alone didn't predict the spread.&lt;/p&gt;

&lt;p&gt;That gap matters in CI. A 57ms screenshot delta sounds trivial until you compound it. At 100 tests × 50 runs/day × 3 screenshots per test, you've spent ~855 seconds a day, or ~7 hours a month, on the slower path. Push that to five screenshots per test and you're at ~12 hours/month. Not a redesign-the-suite number. But it's real queue time — enough that a routing decision ("send the screenshot-heavy suite to iOS first") starts paying for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two findings an AGENTS.md would close
&lt;/h2&gt;

&lt;p&gt;Two things came up that an agent-facing brief would have closed before I started.&lt;/p&gt;

&lt;h3&gt;
  
  
  Endpoint compatibility
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;driver.getLogs('logcat')&lt;/code&gt; didn't return usable data through the endpoint my client tried. Appium's docs distinguish between &lt;code&gt;/session/:sessionId/log&lt;/code&gt; and &lt;code&gt;/session/:sessionId/se/log&lt;/code&gt;, and which one works depends on the driver and server. A plugin like this should just say up front which log endpoints it supports, which it rejects, and what the agent should do when log retrieval fails.&lt;/p&gt;

&lt;p&gt;Without that, a test ported in from a vanilla Appium setup can silently lose its logs. The test still passes. The evidence is just gone. Worst kind of failure — the kind that smiles and waves while stealing your evidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lifecycle invisibility
&lt;/h3&gt;

&lt;p&gt;After &lt;code&gt;deleteSession&lt;/code&gt;, devices entered a brief cooldown. During the window &lt;code&gt;getDeviceStatus&lt;/code&gt; reported them as &lt;code&gt;ACTIVATED&lt;/code&gt; with &lt;code&gt;is_online=true&lt;/code&gt; — but they couldn't actually accept a new session yet. A naive scheduler sees "ready," queues the next job, and waits.&lt;/p&gt;

&lt;p&gt;The fix is a documented lifecycle. Names like &lt;code&gt;ready&lt;/code&gt; / &lt;code&gt;reserved&lt;/code&gt; / &lt;code&gt;active&lt;/code&gt; / &lt;code&gt;cleanup-required&lt;/code&gt; / &lt;code&gt;cooldown-required&lt;/code&gt; / &lt;code&gt;offline&lt;/code&gt; / &lt;code&gt;unknown&lt;/code&gt;. The wording matters less than having one. If &lt;code&gt;is_online=true&lt;/code&gt; doesn't mean session-ready, the plugin needs to say that out loud.&lt;/p&gt;

&lt;p&gt;Both gaps are documentation, not code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Claude Code conventions meet AGENTS.md
&lt;/h2&gt;

&lt;p&gt;If you've authored a Claude Code plugin you already know about &lt;code&gt;CLAUDE.md&lt;/code&gt; (Claude-specific repo guidance) and &lt;code&gt;SKILL.md&lt;/code&gt; (skill frontmatter and workflow). Neither replaces &lt;code&gt;AGENTS.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; is the tool-agnostic instruction file. A briefing packet any coding agent can read: setup, conventions, testing rules, operational caveats. &lt;code&gt;SKILL.md&lt;/code&gt; belongs to a different model entirely — the open AgentSkills.io spec defines its structure for reusable skills. Related, not interchangeable.&lt;/p&gt;

&lt;p&gt;The four files compose:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;For humans — overview and install&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Claude Code-specific guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SKILL.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skill trigger and workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cross-tool operational guidance for any agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A strong &lt;code&gt;AGENTS.md&lt;/code&gt; for an MCP-backed testing plugin should cover capabilities (what it does), costs and latency (p50/p95, screenshot timing, upload constraints, platform variance), lifecycle states (what "ready" actually means), compatibility boundaries (which Appium endpoints work, when to fall back to artifact APIs), and orchestrator requirements (what CI systems and agent runtimes need to know).&lt;/p&gt;

&lt;p&gt;When a plugin documents that, a cost-conscious agent can make decisions instead of guessing. "This suite goes to the faster capture path." "This device needs cooldown." "This log endpoint isn't available, use artifacts." Without the spec you're guessing. With it, you're routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kobiton/automate got right
&lt;/h2&gt;

&lt;p&gt;The plugin is a clean implementation of the thin-plugin / remote-MCP pattern that the AI agent ecosystem is converging on. MCP server config points to Kobiton's hosted endpoint. OAuth 2.1 is the default; API keys exist for headless CI. App uploads go through pre-signed storage URLs rather than routing binaries through the assistant. Tool schemas live as reference YAML. The &lt;code&gt;run-automation-suite&lt;/code&gt; skill stays focused on guided Appium execution and doesn't try to become a test framework.&lt;/p&gt;

&lt;p&gt;That's the right scope. A Claude Code plugin shouldn't pretend to be Appium. It should help the agent pick a target, prepare inputs, run the test, collect evidence, and report out.&lt;/p&gt;

&lt;p&gt;PR #10 adds the cross-tool layer on top of that. It isn't a complete operational spec yet, but it's pointed in the right direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's still open
&lt;/h2&gt;

&lt;p&gt;The gaps the parity sweep exposed are exactly what I'd document next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supported and unsupported Appium log endpoints.&lt;/li&gt;
&lt;li&gt;Platform-specific log retrieval guidance.&lt;/li&gt;
&lt;li&gt;Device lifecycle states between "online" and "session-ready."&lt;/li&gt;
&lt;li&gt;Cooldown behavior after &lt;code&gt;deleteSession&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Retry/backoff rules for schedulers.&lt;/li&gt;
&lt;li&gt;Error shapes for partial success, timeout cleanup, and artifact failures.&lt;/li&gt;
&lt;li&gt;Latency expectations for screenshot capture and session boot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file doesn't have to be exhaustive on day one. It has to be honest — the operational facts an agent would otherwise learn the expensive way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method note
&lt;/h2&gt;

&lt;p&gt;The matrix wasn't a vibe check. Before any device touched the harness, I had three Claude sub-agents review the script in parallel — &lt;code&gt;code-reviewer&lt;/code&gt;, &lt;code&gt;test-automator&lt;/code&gt;, &lt;code&gt;security-auditor&lt;/code&gt;. They caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orphaned cleanup on timeout.&lt;/li&gt;
&lt;li&gt;Partial success counted as full success in the fallback chain.&lt;/li&gt;
&lt;li&gt;A timing bug where a 30-second log capture window could skid by ~1.5 seconds per device under load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any one of those would have polluted the measurement. The cadence is reusable: specify the experiment, multi-review it, fix the harness, run the sweep, publish with caveats. Skipping the review step is how a 10-minute validation turns into a two-hour bug archaeology dig.&lt;/p&gt;

&lt;h2&gt;
  
  
  A test you can run this week
&lt;/h2&gt;

&lt;p&gt;If you author or consume a real-device testing plugin, run something like this against your own pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;wait_for_ready&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;boot_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;

    &lt;span class="n"&gt;shots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;take_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;shots&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;delete_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;boot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;boot_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shots&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shots&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print_percentiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five devices, five screenshots, one table. That's the baseline you can re-run whenever your pool changes — and the evidence you need to decide whether screenshot-heavy, log-heavy, or cold-start-sensitive tests should route differently.&lt;/p&gt;

&lt;p&gt;If your platform vendor's docs don't tell you which Appium endpoints work, what session cleanup actually does, or what "online" means — that's not a docs gap. That's operational risk wearing a friendly UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Cross-tool plugin standards aren't abstract architecture. They're the difference between&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We picked Android arbitrarily and paid for the variance silently."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We routed the screenshot-heavy suite based on measured platform behavior."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;kobiton/automate&lt;/code&gt; is moving in the right direction. Clean remote-MCP shape, focused skill design, sensible auth boundaries — and now PR #10 starts the cross-tool instruction surface.&lt;/p&gt;

&lt;p&gt;If you author a plugin: &lt;code&gt;README.md&lt;/code&gt; for humans, &lt;code&gt;CLAUDE.md&lt;/code&gt; for Claude-specific bits, &lt;code&gt;SKILL.md&lt;/code&gt; for skill workflow, &lt;code&gt;AGENTS.md&lt;/code&gt; for everything any agent runtime needs to know. They compose; none of them replaces another.&lt;/p&gt;

&lt;p&gt;If you consume plugins from a real-device cloud — or any AI-orchestratable platform — ask your vendor whether they publish an &lt;code&gt;AGENTS.md&lt;/code&gt; or equivalent. Then ask what's in it.&lt;/p&gt;

&lt;p&gt;If the answer is "what's that?", you found the gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postscript (2026-05-07).&lt;/strong&gt; While this post was being finalized, the &lt;code&gt;kobiton/automate&lt;/code&gt; team merged Copilot CLI support (&lt;a href="https://github.com/kobiton/automate/pull/10" rel="noopener noreferrer"&gt;PR #10&lt;/a&gt;) and opened a Phase 1 Gemini CLI extension PR (&lt;a href="https://github.com/kobiton/automate/pull/28" rel="noopener noreferrer"&gt;PR #28&lt;/a&gt;). Both reuse the same &lt;code&gt;AGENTS.md&lt;/code&gt;, the same MCP server endpoint via OAuth dynamic discovery (RFC 9728), and the same &lt;code&gt;skills/&amp;lt;name&amp;gt;/SKILL.md&lt;/code&gt; convention — three CLIs against one source of truth, zero server-side code change.&lt;/p&gt;

&lt;p&gt;The Gemini PR description is the working reference for anyone trying this pattern: &lt;code&gt;AGENTS.md&lt;/code&gt; carries the cross-tool load (no separate &lt;code&gt;GEMINI.md&lt;/code&gt; needed), dynamic-discovery OAuth lets the install flow piggyback off plumbing already deployed, and skills auto-discover from the canonical path so they don't need explicit manifest references. If you're authoring a plugin in 2026 and want to ship it across Claude Code, Copilot CLI, and Gemini CLI with one source tree, read those two PRs.&lt;/p&gt;

&lt;p&gt;OpenAI Codex CLI is the natural fourth runtime in this space and fits the same pattern — &lt;code&gt;AGENTS.md&lt;/code&gt; is read natively, MCP servers are declared in &lt;code&gt;~/.codex/config.toml&lt;/code&gt; under &lt;code&gt;[mcp_servers.&amp;lt;name&amp;gt;]&lt;/code&gt;, and the OAuth dynamic-discovery flow is identical. The only delta is the config format (TOML rather than JSON), which means a Codex extension to a multi-CLI plugin is typically just a documentation snippet — no new manifest, no new build step. Four agentic CLIs, one cross-tool surface, one MCP server. That's the convergence the AGENTS.md convention was hinting at all along.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>mcp</category>
      <category>agentsmd</category>
      <category>plugins</category>
    </item>
    <item>
      <title>Coherence as a Deliverable: How a Multi-Surface Engagement Stays Sane</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Tue, 12 May 2026 13:00:25 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/coherence-as-a-deliverable-how-a-multi-surface-engagement-stays-sane-ah7</link>
      <guid>https://dev.to/jeremy_longshore/coherence-as-a-deliverable-how-a-multi-surface-engagement-stays-sane-ah7</guid>
      <description>&lt;p&gt;A sprawling multi-surface engagement (Kobiton partner pilot, 4 months, three deliverable rounds) exposed a silent failure mode: drift doesn't announce itself. A title rename on Plane goes unnoticed when the canonical source doc still has the old framing. A partner-portal deliverable gets updated before the source file does, leaving future sessions reading stale context from what should be source-of-truth.&lt;/p&gt;

&lt;p&gt;On 2026-05-08, one session caught two separate drift instances and shipped four structural patterns to make drift cheaper to find next time. None of the drifts were bugs. Both were coherence gaps — places where a single idea lived in multiple surfaces (Plane, beads, local docs, partner portal) with different currency.&lt;/p&gt;

&lt;p&gt;The fix wasn't "use one surface." It was: &lt;strong&gt;detect drift early, make the boundaries between surfaces explicit, give pre-committed thinking a home, and grow scope through buckets instead of through accretion.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Drift Caught in Two Directions
&lt;/h2&gt;

&lt;p&gt;A May 4 session had renamed a Plane content issue from "Text-first AI triage on session logs (refined per F30)" to "AI-vision testing." The local draft file (&lt;code&gt;000-docs/020-DR-BLOG-...md&lt;/code&gt;), the partner portal copy (&lt;code&gt;m2-blog-3.md&lt;/code&gt;), and the CLAUDE.md history were all on the canonical thesis: text-first triage. Plane was the only surface out of sync.&lt;/p&gt;

&lt;p&gt;Caught by reading CLAUDE.md cold. A session with no prior context opened the resume-from-cold doc and noticed the contradiction. The fix: revert Plane back to canonical, log the vision-testing angle as a separate evergreen idea in a new file (&lt;code&gt;034-RR-OPEN&lt;/code&gt;), mark it explicitly as deferred.&lt;/p&gt;

&lt;p&gt;The reverse-drift happened the same day. An R2 fork-staging update went to the partner portal first (because the client reads that surface), but the source doc (&lt;code&gt;021-AA-AACR-r2-...md&lt;/code&gt;) was now stale. Sync brought source up to portal. Header table updated with new snapshot tag, new "Staged audit slate" metadata row. Reverse-drift is the silent kind: the deliverable surface looks current, the source looks wrong, and a future session reading source will replay outdated thinking.&lt;/p&gt;

&lt;p&gt;Two drifts in one day on the same engagement. The pattern: without explicit boundaries and a promotions log, every surface drifts toward stale.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Current-Focus Block at the Top of CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;Added to &lt;code&gt;kobiton/CLAUDE.md&lt;/code&gt;: a "Current focus" block at the very top. Three rows. Each row names a live workstream (M2 blog cadence, M3+R3 final review, hooks-as-deterministic thesis), owns it to a bead, and defines what "done" looks like.&lt;/p&gt;

&lt;p&gt;Below that: an explicit "what NOT to start" list. New evergreen blogs, project-shipping blogs, site infra, channel work — all queued but explicitly deferred until M2/M3/R3 close.&lt;/p&gt;

&lt;p&gt;Why not a checklist or a TODO list? Because a TODO is committed work. A Current-focus block is a &lt;em&gt;priority map&lt;/em&gt; for cold-starting future sessions. A TODO says "do this." The block says "this is load-bearing now; everything else queues below the line." Future sessions landing cold should know what's live without reading a month of history.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Current focus (2026-05-08) — read this first&lt;/span&gt;

| Workstream | Owner | "Done" looks like |
|---|---|---|
| M2 Blog series delivery | kobiton-z3y | Blog 1 published May 11, Blog 2 May 18, Blog 3 May 25 |
| M3 Featured Placement + R3 close | kobiton-9z0.7, kobiton-bmj | R3 deliverable filed and reviewed by May 25 |
| Hooks-as-deterministic layer thesis | kobiton-5cj | Prototype → multi-reviewer pre-flight → R3 above-spec landing |

&lt;span class="gs"&gt;**What NOT to start until M2/M3/R3 close:**&lt;/span&gt; new evergreen blog drafts, new project-shipping blogs, site-refresh work, channel infra.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A Strategic Spine for the 19-Issue Backlog
&lt;/h2&gt;

&lt;p&gt;Same day, a separate consolidation: 19+ scattered Plane content issues organized into a 6-post evergreen series in publication order, with adjacent clusters (B/C/D/E) listed so unfiled ideas have homes too. This is the antidote to backlog rot. Without a spine, every new idea fights every other idea for next session's attention. With a spine, ideas cluster, and new sessions land oriented — they read the spine, see what's live in cluster A, and know that clusters B-E are queued but real.&lt;/p&gt;

&lt;h2&gt;
  
  
  RR-OPEN: The Pre-Committed Layer
&lt;/h2&gt;

&lt;p&gt;A new file: &lt;code&gt;034-RR-OPEN-things-to-think-about.md&lt;/code&gt;. Single surface for engagement-adjacent open questions, loose threads, refinement ideas, and deferred decisions that aren't yet committed work. Not a TODO. Not a backlog. A &lt;em&gt;pre-committed thinking&lt;/em&gt; surface.&lt;/p&gt;

&lt;p&gt;Six categories. Initial seed: 10 bullets. Crucially, it includes a "Promotions log" — when a bullet matures and graduates (to Plane CONTENT, beads, KOB issues, email, or CLAUDE.md), the commit message records where it went:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Promotions log&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; 2026-05-08 — "Per-harness spec audit scope (decide before May 14)"
  RESOLVED as out-of-scope. Spec audit stays narrowly scoped to
  code.claude.com/docs/en/mcp per existing contract. The "10-12
  harness reach" framing migrates to OPS-28, not engagement scope.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why not just a TODO list or a scattered Slack thread? Because ideas that live nowhere searchable get re-invented. RR-OPEN is a single backlog-rot antidote: ideas can live here, mature visibly, and graduate with an audit trail of where they went.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scope Discipline Through Bucket Boundaries
&lt;/h2&gt;

&lt;p&gt;R3 scope expanded from one bucket to three in a single session. Normally that's a red flag. The discipline that kept it coherent: each bucket got its own bead with explicit deliverable boundaries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard re-validation (existing bead, no boundary change)&lt;/li&gt;
&lt;li&gt;Spec-conformance audit (new bead, separate surface, distinct findings)&lt;/li&gt;
&lt;li&gt;Hooks bundle (conditional on multi-reviewer pre-flight before May 23)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The empirical findings catalog (F11-F35) and the spec-conformance candidates (F36-F43) live in separate subsections. Scope can grow without losing shape if the boundaries between buckets are explicit and defensible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-Flight Catches the Signal-Type Misses
&lt;/h2&gt;

&lt;p&gt;A technical comment for a partner GitHub PR went through multi-reviewer pre-flight before posting. The catch: three signal-type mislabelings in a single comment. The same three mislabelings had propagated back into Plane CONTENT issues that referenced the same source material.&lt;/p&gt;

&lt;p&gt;Mistakes don't live in single surfaces. A mislabeling on a public comment is also on the issues that referenced the same source. Catching it pre-flight means one fix in three places. Catching it after publish means a correction comment, three issue edits, and a stale public comment that future readers will trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Also Shipped
&lt;/h2&gt;

&lt;p&gt;R2 follow-up email sent (closing the credibility gap from a May 5 commitment). One beads epic created, one stale bead closed. Three companion commits to &lt;code&gt;partner-portals/&lt;/code&gt; for the reverse-drift fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/audit-harness-v010-enforcement-travels-with-code/"&gt;Enforcement travels with the code: audit-harness v0.1.0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/building-ai-friendly-codebase-documentation-real-time-claude-md-creation-journey/"&gt;Building an AI-friendly codebase: real-time CLAUDE.md creation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/wild-deep-dive-2-claude-md/"&gt;Wild deep dive #2: CLAUDE.md as a resume-from-cold tool&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>architecture</category>
      <category>aiagents</category>
      <category>releaseengineering</category>
    </item>
    <item>
      <title>AGENTS.md as a Cross-Tool Plugin Brief: A Case Study from kobiton/automate</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Mon, 11 May 2026 04:43:57 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/agentsmd-as-a-cross-tool-plugin-brief-a-case-study-from-kobitonautomate-3ig7</link>
      <guid>https://dev.to/jeremy_longshore/agentsmd-as-a-cross-tool-plugin-brief-a-case-study-from-kobitonautomate-3ig7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Canonical home:&lt;/strong&gt; This post first appeared on Kobiton's blog at &lt;a href="https://kobiton.com/blog/agents-md-cross-tool-plugin-brief-case-study-kobiton-automate/" rel="noopener noreferrer"&gt;kobiton.com/blog/agents-md-cross-tool-plugin-brief-case-study-kobiton-automate&lt;/a&gt;. This page mirrors it; SEO authority consolidates to the Kobiton URL via &lt;code&gt;rel="canonical"&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  AGENTS.md as a Cross-Tool Plugin Brief: A Case Study from kobiton/automate
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — I ran a 5-device parity sweep against Kobiton's real-device cloud through the &lt;code&gt;kobiton/automate&lt;/code&gt; Claude Code plugin. iOS screenshot capture came in ~17% faster than Android in this run. The interesting part isn't the gap — it's that the plugin doesn't document the gap, or the post-&lt;code&gt;deleteSession&lt;/code&gt; cooldown, or which Appium log endpoints actually work. That's what an &lt;code&gt;AGENTS.md&lt;/code&gt; file is for, and PR #10 on the repo is starting to add one. This is a worked example of what should go in it.&lt;/p&gt;

&lt;p&gt;I spent last week poking at &lt;a href="https://github.com/kobiton/automate" rel="noopener noreferrer"&gt;&lt;code&gt;kobiton/automate&lt;/code&gt;&lt;/a&gt;, the Claude Code plugin that fronts Kobiton's real-device cloud. Five devices, two pools, both major mobile platforms, one small WebDriverIO harness. The numbers showed something plugin authors rarely publish: iOS screenshot capture was about 17% faster than Android across the sample.&lt;/p&gt;

&lt;p&gt;That gap isn't a bug. It's platform variance. But it's the kind of variance you want surfaced before your CI bill quietly compounds it — and surfacing things like this is exactly what a cross-tool agent brief like &lt;code&gt;AGENTS.md&lt;/code&gt; is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The plugin
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;kobiton/automate&lt;/code&gt; is a thin Claude Code plugin pointing at a remote MCP server (&lt;code&gt;https://api.kobiton.com/mcp&lt;/code&gt;). The repo holds manifests, one skill, schemas, and docs. Appium still runs the driver loop once a session opens. That's the right boundary. The plugin doesn't pretend to be Appium; it just helps the agent get into a working session and back out cleanly.&lt;/p&gt;

&lt;p&gt;The public repo currently exposes 12 MCP tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Devices&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;listDevices&lt;/code&gt;, &lt;code&gt;getDeviceStatus&lt;/code&gt;, &lt;code&gt;reserveDevice&lt;/code&gt;, &lt;code&gt;terminateReservation&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sessions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;listSessions&lt;/code&gt;, &lt;code&gt;getSession&lt;/code&gt;, &lt;code&gt;getSessionArtifacts&lt;/code&gt;, &lt;code&gt;terminateSession&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apps&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;listApps&lt;/code&gt;, &lt;code&gt;uploadAppToStore&lt;/code&gt;, &lt;code&gt;confirmAppUpload&lt;/code&gt;, &lt;code&gt;getApp&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Last week the team opened &lt;a href="https://github.com/kobiton/automate/pull/10" rel="noopener noreferrer"&gt;PR #10&lt;/a&gt;, which adds GitHub Copilot CLI support and an &lt;code&gt;AGENTS.md&lt;/code&gt; file. Five files changed, 75 lines added. As of writing it's open and marked in testing. Most of the diff is portability work — declaring skill and MCP paths, swapping Claude-specific phrasing for neutral language, and adding the agent-facing instructions file itself.&lt;/p&gt;

&lt;p&gt;That PR is what made me want to write this up. It's a real example of a plugin moving from "works in Claude Code" to "any reasonable coding agent can read this and behave."&lt;/p&gt;

&lt;h2&gt;
  
  
  The parity sweep
&lt;/h2&gt;

&lt;p&gt;The harness is small. Open an Appium session, take five screenshots, record boot wall-clock and per-screenshot p50, terminate cleanly. Five devices:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Device pool&lt;/th&gt;
&lt;th&gt;OS&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Boot ms&lt;/th&gt;
&lt;th&gt;Screenshot p50&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PRIVATE&lt;/td&gt;
&lt;td&gt;Android 13&lt;/td&gt;
&lt;td&gt;Galaxy A52s 5G&lt;/td&gt;
&lt;td&gt;4,206&lt;/td&gt;
&lt;td&gt;353&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLOUD&lt;/td&gt;
&lt;td&gt;Android 9&lt;/td&gt;
&lt;td&gt;moto g(7) play&lt;/td&gt;
&lt;td&gt;5,451&lt;/td&gt;
&lt;td&gt;297&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PRIVATE&lt;/td&gt;
&lt;td&gt;iOS 17.5.1&lt;/td&gt;
&lt;td&gt;iPhone XR&lt;/td&gt;
&lt;td&gt;5,091&lt;/td&gt;
&lt;td&gt;242&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLOUD&lt;/td&gt;
&lt;td&gt;iOS 18.6&lt;/td&gt;
&lt;td&gt;iPhone 14 Plus&lt;/td&gt;
&lt;td&gt;4,490&lt;/td&gt;
&lt;td&gt;306&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLOUD&lt;/td&gt;
&lt;td&gt;iOS 18.6.2&lt;/td&gt;
&lt;td&gt;iPad 9th Gen&lt;/td&gt;
&lt;td&gt;5,259&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boot times spread ~30%.&lt;/li&gt;
&lt;li&gt;Screenshot p50 spread ~46%.&lt;/li&gt;
&lt;li&gt;Android averaged ~325ms per screenshot.&lt;/li&gt;
&lt;li&gt;iOS averaged ~268ms — about 17% faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five devices is not a fleet study, so don't read this as "iOS wins." What's worth noticing is that platform mattered more than pixel count. The fastest screenshot in the run came off an iPhone XR at 828×1792; the slowest came off a Galaxy A52s 5G at 1080×2400. Resolution alone didn't predict the spread.&lt;/p&gt;

&lt;p&gt;That gap matters in CI. A 57ms screenshot delta sounds trivial until you compound it. At 100 tests × 50 runs/day × 3 screenshots per test, you've spent ~855 seconds a day, or ~7 hours a month, on the slower path. Push that to five screenshots per test and you're at ~12 hours/month. Not a redesign-the-suite number. But it's real queue time — enough that a routing decision ("send the screenshot-heavy suite to iOS first") starts paying for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two findings an AGENTS.md would close
&lt;/h2&gt;

&lt;p&gt;Two things came up that an agent-facing brief would have closed before I started.&lt;/p&gt;

&lt;h3&gt;
  
  
  Endpoint compatibility
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;driver.getLogs('logcat')&lt;/code&gt; didn't return usable data through the endpoint my client tried. Appium's docs distinguish between &lt;code&gt;/session/:sessionId/log&lt;/code&gt; and &lt;code&gt;/session/:sessionId/se/log&lt;/code&gt;, and which one works depends on the driver and server. A plugin like this should just say up front which log endpoints it supports, which it rejects, and what the agent should do when log retrieval fails.&lt;/p&gt;

&lt;p&gt;Without that, a test ported in from a vanilla Appium setup can silently lose its logs. The test still passes. The evidence is just gone. Worst kind of failure — the kind that smiles and waves while stealing your evidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lifecycle invisibility
&lt;/h3&gt;

&lt;p&gt;After &lt;code&gt;deleteSession&lt;/code&gt;, devices entered a brief cooldown. During the window &lt;code&gt;getDeviceStatus&lt;/code&gt; reported them as &lt;code&gt;ACTIVATED&lt;/code&gt; with &lt;code&gt;is_online=true&lt;/code&gt; — but they couldn't actually accept a new session yet. A naive scheduler sees "ready," queues the next job, and waits.&lt;/p&gt;

&lt;p&gt;The fix is a documented lifecycle. Names like &lt;code&gt;ready&lt;/code&gt; / &lt;code&gt;reserved&lt;/code&gt; / &lt;code&gt;active&lt;/code&gt; / &lt;code&gt;cleanup-required&lt;/code&gt; / &lt;code&gt;cooldown-required&lt;/code&gt; / &lt;code&gt;offline&lt;/code&gt; / &lt;code&gt;unknown&lt;/code&gt;. The wording matters less than having one. If &lt;code&gt;is_online=true&lt;/code&gt; doesn't mean session-ready, the plugin needs to say that out loud.&lt;/p&gt;

&lt;p&gt;Both gaps are documentation, not code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Claude Code conventions meet AGENTS.md
&lt;/h2&gt;

&lt;p&gt;If you've authored a Claude Code plugin you already know about &lt;code&gt;CLAUDE.md&lt;/code&gt; (Claude-specific repo guidance) and &lt;code&gt;SKILL.md&lt;/code&gt; (skill frontmatter and workflow). Neither replaces &lt;code&gt;AGENTS.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; is the tool-agnostic instruction file. A briefing packet any coding agent can read: setup, conventions, testing rules, operational caveats. &lt;code&gt;SKILL.md&lt;/code&gt; belongs to a different model entirely — the open AgentSkills.io spec defines its structure for reusable skills. Related, not interchangeable.&lt;/p&gt;

&lt;p&gt;The four files compose:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;For humans — overview and install&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Claude Code-specific guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SKILL.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skill trigger and workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cross-tool operational guidance for any agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A strong &lt;code&gt;AGENTS.md&lt;/code&gt; for an MCP-backed testing plugin should cover capabilities (what it does), costs and latency (p50/p95, screenshot timing, upload constraints, platform variance), lifecycle states (what "ready" actually means), compatibility boundaries (which Appium endpoints work, when to fall back to artifact APIs), and orchestrator requirements (what CI systems and agent runtimes need to know).&lt;/p&gt;

&lt;p&gt;When a plugin documents that, a cost-conscious agent can make decisions instead of guessing. "This suite goes to the faster capture path." "This device needs cooldown." "This log endpoint isn't available, use artifacts." Without the spec you're guessing. With it, you're routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kobiton/automate got right
&lt;/h2&gt;

&lt;p&gt;The plugin is a clean implementation of the thin-plugin / remote-MCP pattern that the AI agent ecosystem is converging on. MCP server config points to Kobiton's hosted endpoint. OAuth 2.1 is the default; API keys exist for headless CI. App uploads go through pre-signed storage URLs rather than routing binaries through the assistant. Tool schemas live as reference YAML. The &lt;code&gt;run-automation-suite&lt;/code&gt; skill stays focused on guided Appium execution and doesn't try to become a test framework.&lt;/p&gt;

&lt;p&gt;That's the right scope. A Claude Code plugin shouldn't pretend to be Appium. It should help the agent pick a target, prepare inputs, run the test, collect evidence, and report out.&lt;/p&gt;

&lt;p&gt;PR #10 adds the cross-tool layer on top of that. It isn't a complete operational spec yet, but it's pointed in the right direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's still open
&lt;/h2&gt;

&lt;p&gt;The gaps the parity sweep exposed are exactly what I'd document next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supported and unsupported Appium log endpoints.&lt;/li&gt;
&lt;li&gt;Platform-specific log retrieval guidance.&lt;/li&gt;
&lt;li&gt;Device lifecycle states between "online" and "session-ready."&lt;/li&gt;
&lt;li&gt;Cooldown behavior after &lt;code&gt;deleteSession&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Retry/backoff rules for schedulers.&lt;/li&gt;
&lt;li&gt;Error shapes for partial success, timeout cleanup, and artifact failures.&lt;/li&gt;
&lt;li&gt;Latency expectations for screenshot capture and session boot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file doesn't have to be exhaustive on day one. It has to be honest — the operational facts an agent would otherwise learn the expensive way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method note
&lt;/h2&gt;

&lt;p&gt;The matrix wasn't a vibe check. Before any device touched the harness, I had three Claude sub-agents review the script in parallel — &lt;code&gt;code-reviewer&lt;/code&gt;, &lt;code&gt;test-automator&lt;/code&gt;, &lt;code&gt;security-auditor&lt;/code&gt;. They caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orphaned cleanup on timeout.&lt;/li&gt;
&lt;li&gt;Partial success counted as full success in the fallback chain.&lt;/li&gt;
&lt;li&gt;A timing bug where a 30-second log capture window could skid by ~1.5 seconds per device under load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any one of those would have polluted the measurement. The cadence is reusable: specify the experiment, multi-review it, fix the harness, run the sweep, publish with caveats. Skipping the review step is how a 10-minute validation turns into a two-hour bug archaeology dig.&lt;/p&gt;

&lt;h2&gt;
  
  
  A test you can run this week
&lt;/h2&gt;

&lt;p&gt;If you author or consume a real-device testing plugin, run something like this against your own pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;wait_for_ready&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;boot_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;

    &lt;span class="n"&gt;shots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;take_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;shots&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;delete_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;boot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;boot_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shots&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shots&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print_percentiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five devices, five screenshots, one table. That's the baseline you can re-run whenever your pool changes — and the evidence you need to decide whether screenshot-heavy, log-heavy, or cold-start-sensitive tests should route differently.&lt;/p&gt;

&lt;p&gt;If your platform vendor's docs don't tell you which Appium endpoints work, what session cleanup actually does, or what "online" means — that's not a docs gap. That's operational risk wearing a friendly UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Cross-tool plugin standards aren't abstract architecture. They're the difference between&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We picked Android arbitrarily and paid for the variance silently."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We routed the screenshot-heavy suite based on measured platform behavior."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;kobiton/automate&lt;/code&gt; is moving in the right direction. Clean remote-MCP shape, focused skill design, sensible auth boundaries — and now PR #10 starts the cross-tool instruction surface.&lt;/p&gt;

&lt;p&gt;If you author a plugin: &lt;code&gt;README.md&lt;/code&gt; for humans, &lt;code&gt;CLAUDE.md&lt;/code&gt; for Claude-specific bits, &lt;code&gt;SKILL.md&lt;/code&gt; for skill workflow, &lt;code&gt;AGENTS.md&lt;/code&gt; for everything any agent runtime needs to know. They compose; none of them replaces another.&lt;/p&gt;

&lt;p&gt;If you consume plugins from a real-device cloud — or any AI-orchestratable platform — ask your vendor whether they publish an &lt;code&gt;AGENTS.md&lt;/code&gt; or equivalent. Then ask what's in it.&lt;/p&gt;

&lt;p&gt;If the answer is "what's that?", you found the gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postscript (2026-05-07).&lt;/strong&gt; While this post was being finalized, the &lt;code&gt;kobiton/automate&lt;/code&gt; team merged Copilot CLI support (&lt;a href="https://github.com/kobiton/automate/pull/10" rel="noopener noreferrer"&gt;PR #10&lt;/a&gt;) and opened a Phase 1 Gemini CLI extension PR (&lt;a href="https://github.com/kobiton/automate/pull/28" rel="noopener noreferrer"&gt;PR #28&lt;/a&gt;). Both reuse the same &lt;code&gt;AGENTS.md&lt;/code&gt;, the same MCP server endpoint via OAuth dynamic discovery (RFC 9728), and the same &lt;code&gt;skills/&amp;lt;name&amp;gt;/SKILL.md&lt;/code&gt; convention — three CLIs against one source of truth, zero server-side code change.&lt;/p&gt;

&lt;p&gt;The Gemini PR description is the working reference for anyone trying this pattern: &lt;code&gt;AGENTS.md&lt;/code&gt; carries the cross-tool load (no separate &lt;code&gt;GEMINI.md&lt;/code&gt; needed), dynamic-discovery OAuth lets the install flow piggyback off plumbing already deployed, and skills auto-discover from the canonical path so they don't need explicit manifest references. If you're authoring a plugin in 2026 and want to ship it across Claude Code, Copilot CLI, and Gemini CLI with one source tree, read those two PRs.&lt;/p&gt;

&lt;p&gt;OpenAI Codex CLI is the natural fourth runtime in this space and fits the same pattern — &lt;code&gt;AGENTS.md&lt;/code&gt; is read natively, MCP servers are declared in &lt;code&gt;~/.codex/config.toml&lt;/code&gt; under &lt;code&gt;[mcp_servers.&amp;lt;name&amp;gt;]&lt;/code&gt;, and the OAuth dynamic-discovery flow is identical. The only delta is the config format (TOML rather than JSON), which means a Codex extension to a multi-CLI plugin is typically just a documentation snippet — no new manifest, no new build step. Four agentic CLIs, one cross-tool surface, one MCP server. That's the convergence the AGENTS.md convention was hinting at all along.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>mcp</category>
      <category>agentsmd</category>
      <category>plugins</category>
    </item>
    <item>
      <title>Forge Dogfood Ships a Grade-A Plane Plugin, JRig Loop Closes</title>
      <dc:creator>Jeremy Longshore</dc:creator>
      <pubDate>Sun, 10 May 2026 13:00:26 +0000</pubDate>
      <link>https://dev.to/jeremy_longshore/forge-dogfood-ships-a-grade-a-plane-plugin-jrig-loop-closes-13nl</link>
      <guid>https://dev.to/jeremy_longshore/forge-dogfood-ships-a-grade-a-plane-plugin-jrig-loop-closes-13nl</guid>
      <description>&lt;p&gt;A plugin generator is theoretical until it produces something a marketplace will actually accept. May 7 turned the &lt;code&gt;/skill-creator --forge&lt;/code&gt; workflow from an 8-gate diagram into a real artifact — a Plane plugin that scored Grade A (97/100), passed Tier 2 GREEN with zero warnings, and cleared all 12 deterministic j-rig checks across the 7-layer behavioral framework. On the same day, the JRig-Verified provenance pipe closed end-to-end: a schema, a build-time enrichment step, a per-plugin verification page, and a validator tier all landed in the same window. The thesis the day proves: compound commands and build-time enrichment beat raw API surfaces and runtime joins, and the way to find that out is to run the full pipeline once on something real.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "theoretical" looked like on May 6
&lt;/h2&gt;

&lt;p&gt;The forge workflow had eight gates defined in spec. None had been exercised together. The JRig-Verified badge UI shipped earlier in the same day in PR #696 — it rendered, but the data path behind it terminated at an empty placeholder. A plugin detail page could display "JRig-Verified · N/7 layers" if the right shape of data showed up, but nothing in the build pipeline produced that data, and the &lt;code&gt;/plugins/&amp;lt;name&amp;gt;/verification&lt;/code&gt; link the badge pointed at was a 404.&lt;/p&gt;

&lt;p&gt;The pre-May-7 state, in a sentence each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forge:&lt;/strong&gt; documented, scaffolded, never run end-to-end on a real API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JRig badge:&lt;/strong&gt; UI complete, no data source, dangling link target&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator:&lt;/strong&gt; 100-point rubric, no static production checks beyond it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marketplace homepage:&lt;/strong&gt; 422 plugins, no curated entry surface for the first five minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provenance metadata:&lt;/strong&gt; spec defined &lt;code&gt;generated&lt;/code&gt; and &lt;code&gt;author_type&lt;/code&gt; fields; no consumers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those gaps closed on May 7.&lt;/p&gt;

&lt;h2&gt;
  
  
  The forge dogfood — Plane as a team behavior observatory
&lt;/h2&gt;

&lt;p&gt;The forge takes two inputs: an API spec and a one-line NOI (Notion of Intent — the answer to "what makes this plugin different from a CRUD wrapper?"). The NOI is the forcing function. Without it, an LLM-generated plugin defaults to one command per endpoint, and the result is a thinner, slower duplicate of whatever MCP server already wraps the API.&lt;/p&gt;

&lt;p&gt;The NOI for this run rejected that framing outright:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Plane is a team behavior observatory.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sentence does most of the design work. The existing &lt;code&gt;mcp__plane&lt;/code&gt; MCP server already covers CRUD — listing cycles, creating issues, updating worklogs. A plugin that wraps the same surface is dead weight. A plugin that surfaces the &lt;em&gt;behavioral signal&lt;/em&gt; hiding inside JOINed Plane data is something the MCP server cannot do, because MCP tools are endpoint-shaped and behavior is JOIN-shaped.&lt;/p&gt;

&lt;p&gt;The five compound commands the NOI produced each answer a question that no single Plane endpoint can answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plane-cycle-velocity        — does cycle close-out match cycle planning?
/plane-stale-tickets          — which In Progress tickets quietly fail under shared ownership?
/plane-reviewer-gate-strength — which reviewers gate-keep harder than the spec demands?
/plane-priority-drift         — does the team plan high-priority and ship low-priority?
/plane-cross-project-load     — which engineers are spread across too many projects?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each requires JOINing at least two Plane resources, applying a scoring formula, and producing ranked output. None of them is &lt;code&gt;GET /cycles/{id}&lt;/code&gt; plus a render template.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 8 gates and what came out of each
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. NOI&lt;/td&gt;
&lt;td&gt;Accepted: "Plane is a team behavior observatory"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Ecosystem absorb&lt;/td&gt;
&lt;td&gt;5 competing tools cataloged; behavioral-synthesis gap confirmed uncovered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. API surface&lt;/td&gt;
&lt;td&gt;14 &lt;a href="https://docs.plane.so/api-reference/introduction" rel="noopener noreferrer"&gt;Plane API&lt;/a&gt; endpoints documented in &lt;code&gt;api-surface.md&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Domain archetype&lt;/td&gt;
&lt;td&gt;Project / Workflow tracker; default compound set adopted + extended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Compound commands&lt;/td&gt;
&lt;td&gt;5 commands designed with synthesis logic + scoring formula&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Generation&lt;/td&gt;
&lt;td&gt;SKILL.md, 2 agents, 3 references, plugin.json, README.md written&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. Validation&lt;/td&gt;
&lt;td&gt;Tier 1 Grade A (97/100), Tier 2 GREEN, Tier 3A GREEN (12/12 j-rig)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8. PR + catalog&lt;/td&gt;
&lt;td&gt;PR #703&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The output of Gate 7 is the headline number. It is also the first piece of evidence that the workflow produces marketplace-grade output, not lab-bench output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproducibility — the receipts anyone can re-run
&lt;/h3&gt;

&lt;p&gt;Both checks shipped in the PR body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 scripts/validate-skills-schema.py &lt;span class="nt"&gt;--marketplace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  plugins/productivity/plane/skills/plane/SKILL.md
&lt;span class="c"&gt;# → Grade A (97/100), Tier 2 GREEN, 0 errors, 0 warnings&lt;/span&gt;

j-rig check plugins/productivity/plane/skills/plane
&lt;span class="c"&gt;# → 12 passed, 0 warnings, 0 errors&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actual stdout from the run captured in the PR body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Grade: A (97/100)
Tier 2: GREEN — 0 errors, 0 warnings
Tier 3A: 12/12 j-rig checks passed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anyone with &lt;a href="https://github.com/jeremylongshore/claude-code-plugins" rel="noopener noreferrer"&gt;the &lt;code&gt;claude-code-plugins&lt;/code&gt; repo&lt;/a&gt; checked out can rerun those two commands and deterministically verify the result. Provenance without reproducibility is decoration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "compound" beats "wrapper" — the design rationale in detail
&lt;/h3&gt;

&lt;p&gt;The CRUD-wrapper anti-pattern is seductive because it is easy to generate. An LLM with an OpenAPI spec can produce one command per endpoint in a few minutes. The output passes most surface-level checks: it has commands, it has parameters, it talks to the API. What it does not have is &lt;em&gt;value beyond the API&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A user who wants to list cycles in Plane already has &lt;code&gt;mcp__plane.list_cycles&lt;/code&gt;. A plugin command called &lt;code&gt;/plane-list-cycles&lt;/code&gt; is a strictly worse interface — slower (slash command overhead), harder to discover (lives in plugin catalog instead of MCP tool list), and provides no transformation of the result. The user gets the raw response either way; the plugin command added one round-trip and zero insight.&lt;/p&gt;

&lt;p&gt;A compound command flips the value equation. &lt;code&gt;/plane-cycle-velocity&lt;/code&gt; calls &lt;code&gt;list_cycles&lt;/code&gt;, then for each cycle calls &lt;code&gt;list_cycle_issues&lt;/code&gt;, joins the planning data against the close-out data, computes a velocity ratio per cycle, and returns ranked output with a behavioral interpretation. The user could in principle do this themselves with five MCP calls and a calculator. They will not. The plugin earns its place by collapsing five mechanical steps into one named operation that produces actionable signal.&lt;/p&gt;

&lt;p&gt;The NOI gate exists to force this distinction during generation. "Plane is a team behavior observatory" is not a marketing tagline — it is a constraint that disqualifies any command that does not surface behavioral signal. The forge uses the NOI to filter the generated command list: a command that fails to tie back to the NOI gets cut, regardless of how cleanly it wraps an endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture choices the dogfood surfaced
&lt;/h3&gt;

&lt;p&gt;The forge produced AI-generated output that passed Tier 2 without post-generation edits — a first for the workflow. The PR is 1,123 lines, but the orchestrator skill is only 150. That ratio is intentional. SKILL.md routes — it does not implement. Implementation lives in two agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;plane-expert&lt;/code&gt;&lt;/strong&gt; — API-surface specialist. Knows endpoints, parameters, auth shape. Does not call live Plane. Used for design questions and shape verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;plane-analyst&lt;/code&gt;&lt;/strong&gt; — behavioral synthesis. Calls &lt;code&gt;mcp__plane&lt;/code&gt; endpoints, applies JOIN logic and scoring formulas, returns ranked output. The five compound commands all delegate here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three references back the agents: &lt;code&gt;noi.md&lt;/code&gt; (the design anchor every output ties back to), &lt;code&gt;api-surface.md&lt;/code&gt; (the 14 endpoints), and &lt;code&gt;compound-commands.md&lt;/code&gt; (the synthesis logic and scoring formulas).&lt;/p&gt;

&lt;p&gt;MCP server scaffolding got skipped. The forge offers to scaffold an MCP server when the API has no existing wrapper; &lt;code&gt;mcp__plane&lt;/code&gt; already exists. Producing a duplicate would have been the exact CRUD-wrapper anti-pattern the NOI rejected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Provenance metadata — the seam that wires this to JRig
&lt;/h3&gt;

&lt;p&gt;Two fields landed in the plugin's &lt;code&gt;plugin.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"generated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"forge"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are read by the marketplace renderer (PR #696's earlier work) to display the "Forge-generated" pill on the plugin page. They are also the inputs the JRig data flow keys on, which is the next half of May 7's story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the JRig-Verified loop
&lt;/h2&gt;

&lt;p&gt;PR #696 landed the badge UI earlier the same day. The badge rendered conditionally on a &lt;code&gt;plugin.jrig&lt;/code&gt; overlay that nothing wrote. The next four PRs closed the gap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema — &lt;code&gt;forge_proofs&lt;/code&gt; and three new columns (PR #699)
&lt;/h3&gt;

&lt;p&gt;Three columns added to &lt;code&gt;skill_compliance&lt;/code&gt;, all nullable, all idempotent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jrig_passed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;INTEGER, nullable&lt;/td&gt;
&lt;td&gt;Boolean — did all 7 JRig layers pass on the model matrix?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jrig_tier_blocked&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;INTEGER, nullable&lt;/td&gt;
&lt;td&gt;Which JRig layer (1–7) failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jrig_baseline_delta&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;REAL, nullable&lt;/td&gt;
&lt;td&gt;Performance delta vs. naked Claude. &amp;gt;0 helps, &amp;lt;0 hurts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A new table holds verification artifacts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;forge_proofs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;plugin_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;verification_type&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;-- 'tier1' / 'tier2' / 'tier3-jrig' / 'dogfood'&lt;/span&gt;
  &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;evidence&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;layers_passed&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total_layers&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;baseline_delta&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;verified_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plugin_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verification_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The migration only ADDs — never DROPs, never RENAMEs. Re-runs are no-ops. PRAGMA-check guards prevent duplicate column adds. The schema is forward-compatible by construction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build pipeline — &lt;code&gt;enrich-jrig-data&lt;/code&gt; step (PR #700)
&lt;/h3&gt;

&lt;p&gt;The data path that wires &lt;code&gt;forge_proofs&lt;/code&gt; rows into the rendered marketplace page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;forge_proofs (freshie/inventory.sqlite)
    │  SELECT … WHERE verification_type='tier3-jrig' AND passed=1
    ▼
enrich-jrig-data.mjs  ←  jrig:enrich build step (new)
    │  writes flat plugin_name → {verified, layers_passed, total_layers,
    │  baseline_delta, verified_at} map
    ▼
marketplace/src/data/jrig-data.json
    │  imported by [name].astro at static-build time
    ▼
plugin.jrig overlay  ←  PR #696's existing optional-chain rendering
    ▼
"JRig-Verified · N/7 layers" pill on plugin detail page
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The build pipeline order post-merge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. discover-skills         → skills-catalog.json
2. extract-readme-sections → readme-sections.json
3. sync-catalog            → catalog.json
4. enrich-jrig-data        → jrig-data.json     ← NEW
5. generate-unified-search → unified-search-index.json
6. build-cowork-zips       → cowork zips + manifest
7. astro build             → static site
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The sqlite driver decision
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;enrich-jrig-data.mjs&lt;/code&gt; reads &lt;code&gt;freshie/inventory.sqlite&lt;/code&gt; to produce &lt;code&gt;jrig-data.json&lt;/code&gt;. Two driver options were on the table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;better-sqlite3&lt;/code&gt;&lt;/strong&gt; — native module, single-call query, ~1 ms per read&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sqlite3&lt;/code&gt; CLI subprocess&lt;/strong&gt; — already on every dev machine and CI runner, ~50 ms per query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;better-sqlite3&lt;/code&gt; would add a postinstall native build step on every CI run. That step adds 30–90 seconds, fails on architecture mismatches, and has bitten enough Node projects to be a known smell. The &lt;code&gt;sqlite3&lt;/code&gt; CLI is already installed everywhere the build runs — &lt;code&gt;sqlite3 -json&lt;/code&gt; returns parseable JSON natively. Trade: ~50 ms subprocess overhead per query, dwarfed by the 20-second astro pipeline that follows.&lt;/p&gt;

&lt;p&gt;Today's &lt;code&gt;jrig-data.json&lt;/code&gt; content: &lt;code&gt;{}&lt;/code&gt;. Empty by design — no &lt;code&gt;forge_proofs&lt;/code&gt; rows have landed yet. The build degrades to "no badge rendered" for every plugin, which is the correct fallback. As soon as the first JRig run writes a &lt;code&gt;tier3-jrig&lt;/code&gt; row, the next site build picks it up automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why build-time, not request-time
&lt;/h3&gt;

&lt;p&gt;Two build-time vs. request-time architectures were on the table for getting &lt;code&gt;forge_proofs&lt;/code&gt; data onto plugin pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request-time JOIN&lt;/strong&gt; — plugin page fetches &lt;code&gt;forge_proofs&lt;/code&gt; rows on each render, joins against catalog data, renders the badge inline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build-time enrichment&lt;/strong&gt; — &lt;code&gt;forge_proofs&lt;/code&gt; rows pre-computed into a flat JSON map at build time, imported by the static page renderer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Request-time wins on freshness — the moment a JRig run writes a row, the next page view sees it. Build-time wins on everything else. The marketplace is a static site (&lt;a href="https://docs.astro.build/en/concepts/why-astro/" rel="noopener noreferrer"&gt;Astro SSG&lt;/a&gt; output, served from CDN). Putting a database in the request path of a static site means giving up the static-site benefits: no edge caching, no instant cold start, no "drop the build into any object store and it works." The freshness gap from build-time is bounded by the deploy cadence — currently sub-hourly on commit, more than fast enough for a verification badge that does not need to update in real time.&lt;/p&gt;

&lt;p&gt;The data flow shape is also worth noting: &lt;code&gt;enrich-jrig-data.mjs&lt;/code&gt; produces a &lt;em&gt;flat map keyed by plugin name&lt;/em&gt;. Not a relational join, not a graph, not nested objects — a flat key/value map small enough to import in full at render time. That shape was chosen because Astro's SSG model imports static JSON at the top of the render function. A flat map adds zero query logic to the page; a nested or relational structure would have forced filtering or joining inside the page template, which is the wrong place for that work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Page target — &lt;code&gt;/plugins/&amp;lt;name&amp;gt;/verification&lt;/code&gt; (PR #702)
&lt;/h3&gt;

&lt;p&gt;The badge in PR #696 was a link. The link target was a 404. PR #702 shipped the destination page (306 lines of Astro) with two states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verified&lt;/strong&gt; — green pill, baseline delta vs. naked Claude, verified-at timestamp, 7-layer breakdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pending&lt;/strong&gt; — neutral status, two paths to JRig (forge generation or manual &lt;code&gt;j-rig eval&lt;/code&gt;), reassurance that grade and Tier 2 results remain authoritative when JRig data is absent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Graceful degradation is built in: the &lt;code&gt;jrig-data.json&lt;/code&gt; import is wrapped in a try/catch with an empty-object fallback. Environments without the data file still build the site; the verification page just renders the pending state for everyone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Homepage starter pack (PR #701, Phase 4C)
&lt;/h3&gt;

&lt;p&gt;Five curated Grade-A plugins now anchor the homepage:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plugin&lt;/th&gt;
&lt;th&gt;Persona&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ai-commit-gen&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;productivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;conversational-api-debugger&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ci-cd-pipeline-builder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;design-to-code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;frontend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;excel-analyst-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;business&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The marketplace had 422 plugins and no first-five-minutes surface. The starter pack is editorial cadence — quarterly rotation, not algorithmic ranking. Curation beats search when the catalog is too large to skim and the visitor has no query yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validator Tier 2 gate — +273 lines of Python (PR #698)
&lt;/h3&gt;

&lt;p&gt;Five deterministic checks now fire alongside the standard 100-point rubric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tier2:allowed-tools-accuracy   — declared tools must appear in body          (warn)
tier2:auth-documented           — API surfaces require auth method documented (warn)
tier2:dead-code                 — literal-false branches detected             (warn, capped at 3 surfaces)
tier2:tool-safety               — unscoped Bash + Write/WebFetch needs        (error at marketplace)
                                  Safety Justification
tier2:orchestration-bounds      — skills shouldn't claim cross-skill          (error at marketplace)
                                  orchestration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first three warn. The last two error at the marketplace tier. That split matches risk: shipping a skill that says "I orchestrate other skills" is a behavioral hazard; shipping one with a stale &lt;code&gt;allowed-tools&lt;/code&gt; line is sloppy but not dangerous.&lt;/p&gt;

&lt;h3&gt;
  
  
  The false-positive guard — a generalizable pattern
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;tier2:orchestration-bounds&lt;/code&gt; check initially flagged &lt;code&gt;/validate-skillmd&lt;/code&gt; itself. That skill &lt;em&gt;documents&lt;/em&gt; the anti-pattern in its body — it has a section explaining "skills shouldn't orchestrate other skills." The check, scanning prose for orchestration claims, hit those sentences and emitted an error.&lt;/p&gt;

&lt;p&gt;The wrong fix would have been to special-case &lt;code&gt;/validate-skillmd&lt;/code&gt;. The right fix was a generic guard on every Tier 2 prose check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skip lines inside code fences&lt;/li&gt;
&lt;li&gt;Skip lines starting with &lt;code&gt;&amp;gt;&lt;/code&gt; (block quotes) or &lt;code&gt;|&lt;/code&gt; (table cells)&lt;/li&gt;
&lt;li&gt;Skip lines containing negation markers: &lt;code&gt;" not "&lt;/code&gt; (space-padded so it does not match "annotate" or "notable"), &lt;code&gt;never&lt;/code&gt;, &lt;code&gt;avoid&lt;/code&gt;, &lt;code&gt;don't&lt;/code&gt;, &lt;code&gt;do not&lt;/code&gt;, &lt;code&gt;must not&lt;/code&gt;, &lt;code&gt;should not&lt;/code&gt;, &lt;code&gt;forbidden&lt;/code&gt;, &lt;code&gt;disallow&lt;/code&gt;, &lt;code&gt;anti-pattern&lt;/code&gt;, &lt;code&gt;antipattern&lt;/code&gt;, &lt;code&gt;wrong:&lt;/code&gt;, &lt;code&gt;bad:&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That guard generalizes to any static-analysis check that runs over prose. A document might describe the very pattern a check is looking for — to teach against it, to warn about it, to compare alternatives. The check has to recognize description versus assertion. The negation-marker list is a cheap heuristic that handles the common cases without an NLP dependency.&lt;/p&gt;

&lt;p&gt;This pattern is reusable. Every prose-level lint rule on a documentation site eventually hits it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs the day shipped
&lt;/h2&gt;

&lt;p&gt;Nothing free landed. Each piece carries a cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sqlite3 CLI subprocess&lt;/strong&gt; — ~50 ms per query overhead. Acceptable inside a 20 s build, would not be acceptable at request time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;jrig-data.json&lt;/code&gt; starts as &lt;code&gt;{}&lt;/code&gt;&lt;/strong&gt; — degrades gracefully today, but a misconfigured CI runner that fails the &lt;code&gt;enrich-jrig-data&lt;/code&gt; step silently produces an empty file and every JRig badge disappears. The fallback is friendly; the failure mode is silent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plane plugin compound commands&lt;/strong&gt; — JOIN logic and scoring formulas match the playbooks, but no live Plane workspace has run them yet. The math is correct on paper. Behavior under real data drift is unverified until someone runs them against a real workspace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator Tier 2 negation-marker guard&lt;/strong&gt; — list-based, not parser-based. Documents that paraphrase negation in unusual ways could still trip false positives. The fix when that happens is to extend the list, not to switch to a heavier parser.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Spec docs that landed alongside
&lt;/h2&gt;

&lt;p&gt;Four spec PRs framed the day's work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PR #693&lt;/strong&gt; — master skills spec bumped from 3.1.0 to 3.3.1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #695&lt;/strong&gt; — JRig Tier 3A spec snapshots added, with &lt;code&gt;.gitignore&lt;/code&gt; exceptions to keep the snapshots tracked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #696&lt;/strong&gt; — tagline plus JRig-Verified plus forge-generated badges added to plugin pages (the UI that PRs #700 and #702 wired to data on May 7)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #697&lt;/strong&gt; — IS-extension fields for forge provenance landed (&lt;code&gt;generated&lt;/code&gt;, &lt;code&gt;author_type&lt;/code&gt;) — Phase 5A of the "Use the Printing Press to Learn" plan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Phase 4C (homepage starter pack) and Phase 5A (forge provenance schema) both closed on May 7. The forge dogfood itself was Phase 3 of the same plan. Three phases of a multi-phase plan, all converging in one window — not a coincidence. The plan was structured so the dogfood and the provenance pipeline would close together. Running the dogfood without the provenance pipeline produces a plugin nobody can verify; shipping the provenance pipeline without a dogfood produces a UI for data that does not exist. Both halves had to land at once for either to mean anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond the marketplace
&lt;/h2&gt;

&lt;p&gt;Three patterns from May 7 generalize past the immediate work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compound commands beat endpoint wrappers when the value is in the JOIN.&lt;/strong&gt; The Plane plugin proves the design. An MCP server plus an LLM gives you &lt;code&gt;GET&lt;/code&gt; per resource. A compound command gives you &lt;code&gt;WHICH cycles closed late AND had reviewer churn AND had priority drift?&lt;/code&gt;, which no API endpoint exposes directly. The forge's NOI gate exists to force that question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build-time enrichment beats runtime joins for static marketplaces.&lt;/strong&gt; &lt;code&gt;jrig-data.json&lt;/code&gt; is computed once per build, served as static JSON, and read by Astro at SSG time. Runtime joining &lt;code&gt;freshie/inventory.sqlite&lt;/code&gt; against the page render would have meant a database in the request path of a static site. The build step keeps the runtime simple and the cache cold-key small.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provenance metadata is structural, not cosmetic.&lt;/strong&gt; The &lt;code&gt;generated: true, author_type: "forge"&lt;/code&gt; fields are not just for the badge. They are the seam that lets the JRig pipeline filter, the marketplace render, the validator behave differently, and future tooling cite the origin. Two boolean-ish fields, multiple downstream consumers — that is a metadata investment that pays compounding interest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False-positive guards are part of the gate, not an afterthought.&lt;/strong&gt; The Tier 2 orchestration check that flagged &lt;code&gt;/validate-skillmd&lt;/code&gt; could have been dismissed as "fix it later." The decision to ship the negation-marker guard &lt;em&gt;with&lt;/em&gt; the check is the difference between a gate that earns trust and a gate that gets bypassed because it cries wolf. Static-analysis checks live and die on their false-positive rate; once that rate goes above a small threshold, engineers route around them and the gate stops being a gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Intent Solutions thread
&lt;/h2&gt;

&lt;p&gt;The forge dogfood and the JRig loop close the same theme that has run through this site for the past month: turning policy into mechanism, then turning mechanism into evidence. The validator is policy. Tier 2 is mechanism. Grade A (97/100) is evidence. JRig is policy. &lt;code&gt;forge_proofs&lt;/code&gt; is mechanism. The verification page is evidence. None of the three is sufficient alone, and the chain is what makes a marketplace claim defensible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Also shipped — same day
&lt;/h2&gt;

&lt;p&gt;The day did not stop at the marketplace and the forge.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;intent-solutions-landing&lt;/code&gt;&lt;/strong&gt; — PR #18 migrated &lt;code&gt;intentsolutions.io&lt;/code&gt; off Firebase to the Contabo VPS (the canonical VPS-as-the-home pattern). PR #19 disabled &lt;code&gt;compressHTML&lt;/code&gt; and bumped the line-length cap to 50k to fix a deploy regression. PR #20 dropped the Resend/SQLite form-flow notes — Slack-only is the final shape. Umami tracker landed alongside the existing Firebase Analytics. The trustbar gained a "53k+ npm Downloads" stat badge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Umami analytics rollout across three sites&lt;/strong&gt; — claude-code-plugins (PR #692, with &lt;code&gt;data-domains&lt;/code&gt; spam guard), &lt;code&gt;jeremylongshore.com&lt;/code&gt;, and &lt;code&gt;intent-solutions-landing&lt;/code&gt; all wired to the self-hosted Umami instance in one day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;contributing-clanker&lt;/code&gt;&lt;/strong&gt; — URL-or-repo argument now drives a two-branch onboarding-and-briefing flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;partner-portals&lt;/code&gt;&lt;/strong&gt; — Kobiton portal got an editorial pass (engagement-structure table tightened, status pills dropped, upcoming-work cards added).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;kobiton&lt;/code&gt;&lt;/strong&gt; — CLAUDE.md sync absorbed engagement history and the sub-bead table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;intent-eval-lab&lt;/code&gt;&lt;/strong&gt; — umbrella repo scaffolded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;j-rig-binary-eval&lt;/code&gt;&lt;/strong&gt; — skill-spec sources of truth pulled into the repo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marquee fixes&lt;/strong&gt; — PR #689 throttled the npm fetch to dodge registry rate limits and restored the live total. PR #691 relabeled the marquee from '30d' to 'total downloads'.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/guidewire-mcp-v0-1-0-v0-1-1-76-minutes/"&gt;Guidewire MCP v0.1.0 → v0.1.1 in 76 minutes&lt;/a&gt; — release engineering with the same evidence-first discipline&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/anti-slop-framework-found-three-bugs-inside-itself/"&gt;The Anti-Slop Framework Found Three Bugs Inside Itself&lt;/a&gt; — validator dogfooding, the same pattern that produced today's false-positive guard&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/propagation-day-when-the-spec-becomes-the-migration-plan/"&gt;Propagation Day: When the Spec Becomes the Migration Plan&lt;/a&gt; — spec-to-execution arcs, the same shape this dogfood follows&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>aiagents</category>
      <category>automation</category>
      <category>releaseengineering</category>
    </item>
  </channel>
</rss>
