DEV Community: Jeremy Longshore

Five Tags, Zero Ships: How an Auto-Release Workflow Lied for a Whole Day

Jeremy Longshore — Sun, 24 May 2026 13:00:27 +0000

Five GitHub tags. v1.0.4 through v1.1.0. Five green checkmarks on the workflow. Five formatted release notes. The npm registry stayed at v1.0.5 the entire time.

This is what it looks like when a release workflow ships tags without shipping code. Every observable surface said "done" except the one that mattered — the registry. The bug wasn't in one place; it was three independent failures that combined to make the lie convincing.

What the Checkmarks Promised

gh release list showed all five tags with formatted changelogs. The workflow run logs were entirely green. If you ran npm install -g intentional-cognition-os, you got v1.0.5. No error. No warning. Silently wrong for anyone relying on v1.0.5+, silently right for everyone else.

The pattern repeated across the morning: commit → auto-release fires → tag appears → npm registry unchanged. The workflow was perfectly honest about tagging. It just wasn't releasing anything.

Bug 1: Tests That Passed by Lying

The "Verify readiness" step was:

- name: Verify readiness
  run: pnpm test || true

The || true is the tell. Every test failed. Failed to resolve entry for package @ico/types — the workspace packages hadn't been built yet, so pnpm test resolved nothing, threw hard errors, and the || true swallowed them all. The workflow saw exit code 0 and kept going.

In a monorepo, the build step is not optional ceremony. The test runner needs the workspace packages to be built first. The fix:

- name: Verify readiness
  run: |
    set -e
    pnpm build
    pnpm test
    pnpm lint
    pnpm typecheck

set -e means any non-zero exit stops the workflow. If tests fail after the build, you find out. If the build fails, you stop. Lint and typecheck went into the same step because they were already in the local pre-push hook; the only reason to keep them out of the release gate is laziness or speed, and a release gate is the wrong place to optimize either.

Bug 2: Nine Version Sources, Six Ignored

Nine surfaces emit a version string in this repo: root package.json, version.txt, CHANGELOG.md, the five workspace package.json files (packages/cli, packages/kernel, packages/compiler, packages/types, packages/benchmarks), and the runtime constant at packages/kernel/src/version.ts. The workflow bumped three of them — root, version.txt, CHANGELOG.md — and silently left the other six behind.

Result: root said 1.0.4, workspace packages said 1.0.3. Root said 1.0.5, workspace said 1.0.4. Drift every run. ico --version told users the workspace's number, not the tag's.

Lock-step monorepos need single-source-of-truth version sync. A helper that picks up the six the workflow was missing:

bump_pkg_json() {
  local file=$1
  local version=$2
  node -e "
    const fs = require('fs');
    const pkg = JSON.parse(fs.readFileSync('$file', 'utf8'));
    pkg.version = '$version';
    fs.writeFileSync('$file', JSON.stringify(pkg, null, 2) + '\n');
  "
}

bump_pkg_json package.json "$VERSION"
for pkg in packages/*/package.json; do
  bump_pkg_json "$pkg" "$VERSION"
done
sed -i "s/export const VERSION = '.*';/export const VERSION = '$VERSION';/" \
  packages/kernel/src/version.ts

All nine sources now move together. ico --version reports the truth.

Bug 3: The Step That Wasn't There

The workflow tagged releases. It never published to npm. There was no npm publish step. That's not a typo — the workflow was complete without it. Every release ran. Every release skipped the one thing that makes it a release.

Here's what belongs after "Create GitHub Release":

- name: Publish to npm
  env:
    NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
  run: |
    set -e
    if [ -z "$NPM_TOKEN" ]; then
      echo "NPM_TOKEN not set — skipping publish"
      exit 0
    fi
    if npm view "intentional-cognition-os@$VERSION" version 2>/dev/null; then
      echo "intentional-cognition-os@$VERSION already on npm — skipping"
      exit 0
    fi
    echo "//registry.npmjs.org/:_authToken=$NPM_TOKEN" > ~/.npmrc
    pnpm --filter intentional-cognition-os publish --no-git-checks
    sleep 5
    npm view "intentional-cognition-os@$VERSION" version

Three guards, all in the script — not in the step's if: condition. (Step-level env: isn't available to that step's own if: in GitHub Actions, so if: env.NPM_TOKEN != '' would always evaluate false. The check belongs inside run:, where the env is real.) Token presence fails safe if it's missing. Idempotency skips if already published (covers manual publishes). Post-publish verification re-queries the registry to confirm it landed.

A release workflow that doesn't end with a verifiable artifact in the registry isn't a release workflow. It's a tagging workflow with extra steps.

The State Behind the Process

Fixing the workflow forward didn't fix the present. When the workflow was corrected (commit 7681dd5), main was drifted: root at 1.1.0, workspace at 1.0.5. Users running ico --version got 1.0.5. One-time backfill in commit c651de8 aligned all nine version sources to 1.1.0. Then verified: pnpm build succeeded, pnpm test 1,210/1,210 passing, ico --version → 1.1.0.

Process bugs leave state behind. Fixing the process doesn't heal the damage. You clean it up separately.

The Three-Bug Pattern

Every CI/CD pipeline that ships has these three failure modes available:

Quality gates that pass on failure (|| true, swallowed errors). Fix: set -e and explicit step order.
Monorepo workspaces with distributed version state. Fix: single-source-of-truth version sync in the workflow.
A release workflow that doesn't end with verification the artifact reached the registry. Fix: final step that queries the registry and confirms.

The icos release workflow had all three. The checkmarks lied because the workflow wasn't designed to catch itself lying.

Also Shipped 2026-05-19

Daily-log convention — the rest of the day, in one paragraph each. Not connected to the release-workflow thread; logged here because they happened on the same git day.

claude-code-slack-channel v2 cluster — 4 PRs merged with enterprise governance substrate framing. RFC 8785 JCS interop vectors (#175), cross-tier shadow detection (#176), journal v2 Ed25519 signing (#177), strip denied tool-call detail (#178).
kobiton R3 close-out — deliverable final review, Blog 3 rewrite, 5 close-out PRs merged.
claude-code-plugins partner portal — Kobiton and Nixtla brand integration, Killer Skill of the Week refresh.
intentional-cognition-os test infra — Intent Solutions Testing SOP layers L0-L7 installed (.husky/, dependency-cruiser, stryker, RTM/PERSONAS/JOURNEYS docs). 3,447 insertions in commit e0efdee.

v1.0.0: Conditional GO Through a Release Gate — The gate that flagged this path.
Honest Performance Benchmarks for a Paid-API Compiler — Earlier icos work from this release cycle; same repo, different failure mode.

A v1.0 Is a Gate, Not a Tag

Jeremy Longshore — Thu, 21 May 2026 13:00:39 +0000

Two beads were open at the start of 2026-05-18. E10-B11 was the v1.0 release-readiness gate. E10-B12 was the v1.0 release cut, blocked-by-design on B11. Epic 10 was the last epic in intentional-cognition-os (ICO). The release pipeline was wired through /release. Everything that mattered had to clear one ritual.

Five npm releases shipped that day: v0.21.0 → v0.22.0 → v0.22.1 → v0.22.2 → v1.0.0 → v1.0.1. The interesting one is v1.0.0, because the gate said GO with conditions, not GO. And the same-day v1.0.1 is the proof that "GO with conditions" is the correct verdict shape for a real release, not a binary.

The 3× degradation gate

The release ran on top of fresh benchmark infrastructure. 625691e and f7bd287 closed out E10-B06 (performance profiling) with a 500-source large-corpus benchmark. The headline addition was a 3× degradation gate — a configurable cap (default 3.0) that fails the run if per-unit cost at large scale exceeds 3× the moderate-corpus baseline.

The gate is intentionally narrow:

// utils/degradation.ts — gate stays honest by NOT inferring per-unit costs
export function computeDegradation(
  moderatePerUnitMs: number,
  largePerUnitMs: number,
  cap = 3.0
): { ratio: number; pass: boolean } {
  if (moderatePerUnitMs === 0) {
    return { ratio: Infinity, pass: false }; // catch degenerate samples loudly
  }
  const ratio = largePerUnitMs / moderatePerUnitMs;
  return { ratio, pass: ratio <= cap };
}

The runner does per-unit derivation BEFORE calling the gate. Ingest's perFile.medianMs is already per-unit (each iteration was one file). Lint's result.medianMs is whole-workspace, so the runner divides by page count first. Putting that decision in the runner instead of the gate is the difference between "gate that knows what it's measuring" and "gate that guesses at the measurement units."

Results at 500 sources: ingest 1.25× (PASS), lint 0.33× (PASS — got faster at scale, likely amortized constants). The gate had teeth and the system passed cleanly.

The release-readiness checklist (E10-B11, PR #73)

Eight items, verified item-by-item, recorded honestly. No "looks good to me" entries:

CI passes — all 4 jobs green on last 3 main runs
Evals pass — smoke eval clean; retrieval/citation/compilation handlers wired with 30+ unit tests
Coverage targets — PARTIAL. Types 100%, kernel 84.6% (target 90%), compiler 62.3% (target 80%), CLI 45.2% (target 70%)
Docs updated — current per E10-B07/B08
CHANGELOG complete — auto-generated, current through v0.22.0
No critical beads open — only B11 (this) + B12 (release cut, blocked by design)
User journey walkthrough — ico init → status → 14-command CLI surface, live smoke-tested
Performance targets met — ingest 200× headroom, lint 3000× headroom, 3× degradation gate PASS

Verdict: GO with two conditions.

C1: ico --version reported 0.1.0 (a stale kernel constant) instead of the published 0.22.x. Fix in-cut.
C2: Coverage shortfall on kernel/compiler/cli. Documented as post-v1, not blocking. 1,210 passing tests, zero known bugs.

That verdict is the artifact. Most release rituals make GO/NO-GO a binary. The conditional verdict is honest: state the gap, decide if it blocks, ship if it doesn't, document the gap permanently if it doesn't.

What "GO with conditions" actually means

A conditional release verdict is the three-state model: fix what's fixable in-cut, document what isn't, ship anyway. Unlike a binary GO/NO-GO gate that forces a boolean choice, a conditional gate acknowledges that real releases ship with known imperfections. The conditions are documented forever in the release record — no lying about readiness, no pretending gaps don't exist, but no unnecessary delays waiting for the perfect threshold that never comes.

Why not GO/NO-GO binary?

Binary GO/NO-GO encourages two bad behaviors.

Behavior one: lower the bar to ship. "The version-string bug is fine, users will figure it out." The release ships, the operator-visible defect ships with it, and the next person debugging an environment ends up reading the wrong build into their incident postmortem.

Behavior two: delay until the gate is perfect. Coverage targets met on a Tuesday that never comes. Kernel at 84.6% is allegedly not 90%, so v1.0 slips. Then 90% becomes 95%, because some new code landed during the wait. The gate becomes a treadmill.

Coverage at kernel 84.6% / compiler 62.3% / CLI 45.2% with 1,210 passing tests and zero known bugs is shippable. Blocking v1.0 on coverage uplift would have been a bigger lie than shipping with documented shortfalls. The AAR opens C2 as a post-v1 bead for the next planning cycle. The truth is in the record.

C1 is the inverse case — ico --version reporting the wrong number is shippable but ugly, and the fix is small. So fix it in-cut, document it, move on. The gate didn't pretend C1 was fine; it just didn't pretend it was a v2.0-blocker either.

The prescription is a three-part rule, not a two-part one: fix what's fixable in-cut, document what isn't, ship anyway. Binary GO/NO-GO collapses three states into two and loses the most useful one — the "shippable with known imperfections" state where most real releases actually live.

C1 fix: read your own version (PR #74)

packages/cli/src/index.ts had been importing version from @ico/kernel, which exported a hardcoded string. The kernel constant was never maintained in lock-step with the published CLI package — and shouldn't be, since they are independent artifacts on independent release cadences.

// packages/cli/src/index.ts — read from CLI's own package.json
function readCliVersion(): string {
  try {
    const pkgPath = new URL('../package.json', import.meta.url);
    const pkg = JSON.parse(readFileSync(pkgPath, 'utf-8'));
    return pkg.version;
  } catch (err) {
    console.error('[ico] failed to read CLI package.json:', err);
    return '0.0.0-unknown'; // sentinel — CLI keeps working, operator sees clear msg
  }
}
export const cliVersion = readCliVersion();

The try/catch is load-bearing. readCliVersion() runs at module load, BEFORE the process-level error handlers are installed further down the file. An uncaught throw here would surface as a raw Node stack trace and bypass the friendly [ico]-prefixed message convention every other CLI error uses. The sentinel path is what makes this safe to call at import time — the CLI keeps working, the operator gets a legible message, and the bug is visible without crashing.

The test was tightened in the same PR. /^\d+\.\d+\.\d+/ (no end anchor — would accept nonsense like 0.22.1.99) became:

expect(cliVersion).toMatch(/^\d+\.\d+\.\d+(-[\w.-]+)?$/);

Strict semver core plus optional pre-release tag. The previous regex was a one-character bug; the fix is one character plus an opt-in pre-release group.

The cut itself (52fa7a4 → v1.0.0)

The cut commit was tiny: 11 files, +54/-10 lines. It did one thing: aligned all 6 workspace package.json + version.txt + kernel/src/version.ts at 1.0.0.

The auto-release workflow had been bumping the root package.json and version.txt only — internal packages had drifted to 0.1.0 or 0.22.1 depending on history. /release Phase 3 caught the drift. Phase 5 required explicit SHA approval before any push (f1a627b). Phases 6-8 ran atomically.

Verified at v1.0:

1,210 / 1,210 tests pass across 5 packages
Lint + typecheck clean
escape-scan REFUSE=0 CHALLENGE=0 FLAG=0
ico --version reports 1.0.0

The tarball turned out incomplete (v1.0.1, same day)

During the actual npm publish flow, the pack dry-run reported 7 files when expected was 9: dist + package.json, no README, no LICENSE. The CLI's package.json declared:

"files": ["dist", "README.md", "LICENSE"]

But the CLI directory didn't OWN those files. The canonical README.md and LICENSE live at the monorepo root.

Fix landed inline before the real publish:

// packages/cli/tsup.config.ts — copy README + LICENSE at build time
export default defineConfig({
  // ... entry, format, dts, sourcemap ...
  onSuccess: 'cp ../../README.md ../../LICENSE ./',
});

The copies are gitignored (their source of truth is the repo root). v1.0.0 on npm now includes both. No version bump for the build-infra fix itself, but the same day shipped v1.0.1 for the next user-visible change.

This is the test of whether "GO with conditions" was the right shape. A binary GO/NO-GO ritual would have caught the version string (C1) and either fixed it before re-running the whole gate or punted to v1.0.1. The conditional model said: ship, here's what we know is imperfect. When the tarball turned out incomplete during the actual publish — a discovery that couldn't have been made during gate verification, because it only surfaces in the publish pipeline itself — the answer was just: ship v1.0.1 the same day. No drama. No "release is broken" panic. The model already accepted that real releases generate follow-on releases.

AAR same day

d17e10e docs(aar): v1.0.0 release after-action report landed within hours. Three lessons-for-next-release, captured while they were still warm:

Beads JSONL/Dolt sync flapping during multi-PR sessions — repeated need to re-close beads after merges. Filed as a follow-up to investigate the sync ordering.
Auto-release workflow bumps root + version.txt only — should bump packages/*/package.json in lock-step. The 11-file cut commit was entirely correcting drift the workflow could have prevented.
/release skill execution worked as designed — Phase 0 surfaced no blockers, Phases 1-3 caught the version drift, Phase 5 required SHA approval, Phases 6-8 atomic.

Same-day AAR is non-negotiable. The version-drift issue, the tarball issue, the conditional-verdict pattern — all of them lose 80% of their teaching value if you write the AAR a week later, after the warm memory of "wait, why didn't the workflow catch that?" has faded into "yeah, we shipped, it was fine."

Also shipped

The release gate constrained the v1.0.0 cut, not the working day. Three other repos kept moving in parallel — exactly the behavior the conditional-verdict model is designed to enable. A release that takes the whole org offline isn't a release ritual; it's an outage.

hustle: Phase 3 auth landed in three commits — NextAuth + Drizzle/SQLite infrastructure, dashboard cutover, password reset flow. Coordinated migration from the previous auth stack on a single feature branch.
claude-code-slack-channel: ACP session/cancel boundary adapter extracted into a module, and JSON-RPC id widened to nullable per spec §5.1 (#172, #173).
claude-code-plugins: Six PRs — repo quality audit, private vulnerability reporting enabled, validator discovers root-level SKILL.md (Anthropic-spec layout), slack-channel mirror stopped stripping upstream tests, blog cross-post infra fix.

Honest perf benchmarks for a paid-API compiler — yesterday's post on the benchmark infrastructure that fed this release gate
Five releases in fifteen minutes: Mandy cutover and freeze break — earlier five-releases-in-a-day pattern
GitHub release workflow: uncommitted changes and semantic versioning — related release-engineering theme

Honest Perf Benchmarks for a Paid-API Compiler

Jeremy Longshore — Wed, 20 May 2026 13:00:40 +0000

intentional-cognition-os is a TypeScript "compiler" — markdown sources go in one end, a structured artifact comes out the other, and several of the middle stages call paid Claude APIs to do the cognitive work. Up to today there were zero performance gates on any of it. No baseline, no regression alarm, no "did that refactor make ingest 4× slower" check.

The benchmark suite that landed across four PRs answers two design questions that had to be settled before a single line of timing code got written:

How do you compare numbers across machines when half the corpus is randomly generated text?
What do you do about the steps that cost real money on every run?

Get either answer wrong and the benchmark suite is worse than no benchmark suite — it produces numbers that look authoritative and aren't.

The corpus has to be byte-identical

The first scenario — ingest — needs a corpus. Hand-curated fixtures committed to disk were considered and rejected: they don't scale, they go stale, and they encode whoever-wrote-them's idea of "representative." A generator is the right answer, but a generator has to be deterministic or before/after diffs are noise.

The generator uses a seeded mulberry32 PRNG and pulls UUIDs from the same stream:

function mulberry32(seed: number) {
  return function () {
    let t = (seed += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function seededUuidV4(rand: () => number): string {
  // 16 bytes from the seeded stream, version + variant nibbles set per RFC 4122
  const bytes = new Uint8Array(16);
  for (let i = 0; i < 16; i++) bytes[i] = Math.floor(rand() * 256);
  bytes[6] = (bytes[6] & 0x0f) | 0x40;
  bytes[8] = (bytes[8] & 0x3f) | 0x80;
  return formatUuid(bytes);
}

The non-obvious trap is crypto.randomUUID. It would have looked correct, passed every unit test, and silently produced different UUIDs on every run — so every "identical" corpus would have differed in the front-matter id field. That breaks ingest's content-hash cache in different ways on different machines. Same seed, same count, same body-word count yields byte-identical output everywhere. That's the contract.

One more gotcha worth a sentence: the corpus generator writes front matter through gray-matter, which quotes string values. The compiler's wiki-page validator uses a hand-rolled YAML parser that does NOT strip quotes — so wiki fixtures emit all values unquoted. A quoted compiled_at would arrive at Zod's datetime check with literal " characters in it and fail. Two parsers, two rules, documented inline at the parser boundary.

An API key is not consent

The render, compile, and ask scenarios call Claude. Running them on every CI pass would either drain a budget or quietly stop running when the budget hit zero. Neither is acceptable.

The gate is two env vars, both required:

ANTHROPIC_API_KEY=sk-ant-... \
ICO_BENCH_INCLUDE_CLAUDE=1 \
pnpm --filter @ico/benchmarks bench

From PR #70's design notes, kept verbatim because the framing matters:

The double gate is intentional. An API key alone is not consent — many developers have it set for normal CLI use. ICO_BENCH_INCLUDE_CLAUDE is the explicit "yes, burn tokens on this benchmark run" signal.

This pattern shows up elsewhere — CI=true plus RUN_E2E=1, prod credentials plus --really-really-yes. The shape is the same: one signal proves capability, the second proves intent. A single-gate design fails open the first time someone forgets which shell they're in.

Skipped is not zero

The interesting design call was what to do when the gate is closed. The wrong answers:

Don't run, don't record. Trend tooling then can't tell "we stopped running render" from "render still passes."
Record a zero. Trend tooling thinks render got infinitely fast and stops alarming.

The right answer: record the scenario as skipped: true with a stable skipReason. ScenarioRecord is Partial<CommonTiming> so the timing fields legitimately don't exist on skipped records:

{
  "name": "render",
  "skipped": true,
  "skipReason": "ICO_BENCH_INCLUDE_CLAUDE not set",
  "git_sha": "9c14f02",
  "node": "v22.21.0",
  "platform": "linux-x64"
}

A baseline-comparison script can now answer three different questions instead of two: did this scenario regress, did it improve, or did it not run? Skipped runs stay visible in the JSON timeline. They don't pollute the histogram, but they prove the scenario still exists and the runner saw it.

The four PRs, briefly

PR #68 scaffolded the packages/benchmarks/ workspace, the corpus generator, a bench() timer with warmup + N-iteration median + RSS delta, and the runner that captures git SHA, Node version, and platform into results/<iso>-<sha>.json. The results/ directory is gitignored except .gitkeep — baselines get tracked explicitly, not by accident.
PR #69 added the lint scenario and moved runLint, scanWikiPages, extractWikilinks, detectOrphans, LintResult, and SchemaError out of packages/cli/src/commands/lint.ts into a new packages/compiler/src/lint.ts. The function only composes compiler + kernel primitives and has no CLI dependency — it belonged in the compiler the whole time. The CLI's lint command shrunk to a thin wrapper around commander wiring and renderLintReport. Side fix: extractWikilinks had a module-level /g regex whose lastIndex carried state between calls — the same class of bug that landed in PR #67 the day before. Fixed by constructing the regex per call.
PR #70 added the render scenario and the double-gate.
PR #71 added compile and ask, each using the same gating pattern. Roughly 70 lines of additions across both files — the gate had already done the hard work.

Why not the obvious alternatives

Vitest's built-in bench was considered. It does microbenchmarks well and integrates with the existing test runner. It does not produce the JSON timeline shape needed for cross-run comparison, and bolting that on means owning the storage layer anyway. Build it once, build it right.

Committing fixture corpora to disk was considered. They go stale, balloon the repo, and encode one author's idea of "moderate." The seeded generator is reproducible AND parameterizable — same determinism guarantee, no committed binary blobs.

Running Claude scenarios always was considered for about a minute, then rejected on cost grounds. Even with caching, a benchmark suite that costs $2 per run on a busy day stops getting run.

What the numbers say

Three scenarios ran on the dev box this afternoon (Claude-gated ones skipped because the opt-in wasn't set):

Scenario	Median	Target	Headroom
ingest (per-file, 50 sources × 500 words)	~9 ms	< 2 s	220×
lint (50 sources + 30 wiki pages)	~12 ms	< 30 s	2400×
render	SKIPPED (no opt-in)	—	recorded

The headroom isn't the point — those targets are deliberately generous because the goal is regression detection, not perf bragging. The point is that there are now numbers to regress against.

Also shipped today

claude-code-plugins repo audit. A 232-line audit landed at 266-RA-AUDT-repo-quality-audit-2026-05-17.md cataloguing a broken /about route, missing 404 handling, 14 stale MS-OLDV files still claiming v1.0.0 while the repo is at v4.30.0, and notebook content teaching the old 6-required-fields skill spec when the current spec requires 8. The first commit incorrectly flagged the wiki as empty, because gh api repos/.../wiki returns 404 even when the wiki has content — that endpoint isn't a content probe, it's a metadata probe with bad error semantics. Followup commit cloned the wiki, found 23 pages, and refreshed all of them with current numbers. Lesson noted inline: don't use API existence probes as content probes. Clone and read.

claude-code-slack-channel threat model. Added T11 (EchoLeak — instructions exfiltrated via legitimate-looking message replies) and invariant #7: admin verbs are not chat content. An operational key-management doc for the audit-signing key landed alongside the threat model update.

The transferable pattern

Five scenarios in source tree, three actively measured, two gated behind explicit consent. The numbers that get reported are honest because the inputs are reproducible and the skipped runs are visible. Forget the opt-in flag and three scenarios show up as skipped in the JSON — they don't disappear, and they don't pretend to be zero.

Any benchmark suite that mixes deterministic and paid steps needs all three pieces: a deterministic corpus that survives machine swaps, an opt-in gate strong enough to mean something, and a record shape that distinguishes "didn't run" from "ran fast." Miss one and the suite will quietly lie to you the first time someone forgets which mode they're in. The lie is worse than the gap it filled.

Five Silent Failures in One Day — the regex lastIndex bug that re-appeared in PR #69 was one of these.
Deterministic-First, LLM-Advisory CI — same principle: the deterministic gate decides, the paid gate informs.
Transitive CVE Clearance: A Dual-Layer Pattern — the double-gate is the same shape as that two-layer defense.

Five Silent Failures in One Day

Jeremy Longshore — Tue, 19 May 2026 13:00:41 +0000

A silent failure is when a tool reports PASS without doing the work it was supposed to do — the legitimate empty-set case and the broken-but-silent case produce identical output, and nothing downstream can tell them apart.

A green check is not evidence of work. It is evidence that whatever ran did not raise an error. Those are different claims, and on 2026-05-16 the difference surfaced five times in five unrelated systems before lunch.

The pattern is the same in all five: a tool reported PASS without doing the work it was supposed to do. Not a wrong answer — no answer, dressed up as a correct one. The legitimate empty-set case and the broken-but-silent case produced identical output. CI was green. Reviewers saw nothing to push back on. The signal that something was wrong came from downstream consumers noticing the work was missing.

The five instances, in the order they were found:

A CI prescreen that ran on zero plugins and called itself green
A .gitignore rule that silently dropped plugin configs from every commit
Prettier that reformatted an 11,000-line catalog and exited 0
An SSH deploy that succeeded by doing nothing
A regex that quietly skipped matches because the /g flag left state behind

Each one shipped past code review. Each one was caught by a downstream user, not by the gate that was supposed to catch it. Each one has now been re-armed with a guard whose job is to assert the work actually happened — not to assert that the command exited zero.

1. The prescreen that ran on zero plugins

Repo: claude-code-plugins, PR #730.

The pr-prescreen.yml workflow's "Compute changed plugin paths" step combined gh api --paginate with --jq in a single pipe:

- name: Compute changed plugin paths
  run: |
    gh api --paginate \
      "/repos/${{ github.repository }}/pulls/${{ github.event.pull_request.number }}/files" \
      --jq '.[].filename' \
      | grep -E '^plugins/[^/]+/' \
      | cut -d/ -f1-2 \
      | sort -u > changed-plugins.txt || true

This works on every local shell. On the GitHub Actions runner, the --paginate + --jq combination silently produced empty stdout. No error. No exit code. Just nothing on the pipe. The downstream grep | cut | sort -u happily processed zero lines and wrote an empty file. The trailing || true swallowed any failure that might have escaped the pipeline.

The classifier then read changed-plugins.txt, saw zero entries, and emitted PASS: no plugin paths matched the PR diff. Two external PRs — #726 and #728, the first contributions through the new pipeline — both landed false PASS verdicts on PRs that obviously added new plugin directories.

The fix is two changes and a guard:

- name: Fetch changed files
  run: |
    gh api --paginate \
      "/repos/${{ github.repository }}/pulls/${{ github.event.pull_request.number }}/files" \
      > pr-files.json

- name: Extract plugin paths
  run: |
    jq -r '.[].filename' pr-files.json \
      | grep -E '^plugins/[^/]+/' \
      | cut -d/ -f1-2 \
      | sort -u > changed-plugins.txt

- name: Sanity guard
  run: |
    if jq -r '.[].filename' pr-files.json | grep -qE '^plugins/'; then
      if [ ! -s changed-plugins.txt ]; then
        echo "HARD_BLOCK: PR touches plugins/ but extraction produced zero dirs"
        exit 1
      fi
    fi

Splitting gh api --paginate from jq removes the pipe-buffering interaction that ate stdout. Dropping the blanket || true lets real errors propagate. The third step is the actual fix: it asserts that if the PR diff touched any plugin path, the extraction must have produced at least one row. "I found nothing" becomes "I would have found something — fail loud."

2. The gitignore that ate plugin configs

Repo: claude-code-plugins, PR #733.

The root .gitignore contained one line that was never meant to apply globally:

.mcp.json

The original intent was dev-local — devs sometimes drop a .mcp.json at the repo root for personal MCP servers. The pattern matched everywhere. Three plugins — slack-channel, pr-to-spec, x-bug-triage — had a .mcp.json on disk because the mirror sync wrote them, and git silently never tracked any of the three. The mirror produced the file. The working tree showed the file. git status showed it as ignored. Nothing red anywhere.

Plugins without their .mcp.json fail the MCP handshake at install time. Claude Code can't determine how to spawn the server. The plugin loads, registers nothing, and the user sees commands that do nothing.

A second silent failure lived in the same PR. The mirror's sources.yaml listed source files explicitly:

plugins/x-bug-triage:
  sources:
    - server.ts
    - lib.ts

server.ts imports journal.ts, manifest.ts, policy.ts, supervisor.ts — none of which were in the allow-list. The mirror shipped a non-functional server, not because anything errored, but because the include list silently skipped the missing files. No "file not in sources" warning. No diff check. Just a partial build that compiled because the imports themselves were valid module references at type-check time but missing at runtime.

The fix:

# .gitignore
.mcp.json
!plugins/**/.mcp.json

plugins/x-bug-triage:
  sources:
    include: "*.ts"
    exclude: ["*.test.ts", "*.spec.ts"]

The negation rule re-tracks plugin configs. The glob-with-exclude replaces named-file allow-lists with a pattern that can't silently miss a new file. The three affected .mcp.json files were force-added in the same commit.

3. Prettier that reformatted 11,000 lines and exited 0

Repo: claude-code-plugins, PR #730 (same PR as the prescreen failure).

.claude-plugin/marketplace.extended.json is the canonical plugin catalog — eleven thousand lines, hand-formatted with deliberate multi-line keywords arrays for git-diff hygiene:

{
  "name": "example-plugin",
  "keywords": [
    "ci",
    "validation",
    "marketplace"
  ]
}

A contributor's format-on-save action ran prettier across the catalog. Prettier collapsed every keyword array to a single line:

{
  "name": "example-plugin",
  "keywords": ["ci", "validation", "marketplace"]
}

The JSON was still valid. Prettier exited 0. The validate-plugins.yml workflow loaded the catalog, parsed it, ran every entry through the schema — all green. The actual diff was +1 plugin entry, -1,200 lines of reformatted catalog. Every other in-flight PR's merge base was now unrecoverable without rebase-and-reformat.

The fix has two parts. First, .prettierignore:

.claude-plugin/marketplace.extended.json

Second, an active line-budget guard at scripts/check-catalog-format.py:

def expected_line_delta(base_catalog, head_catalog):
    with open(base_catalog) as f:
        base = json.load(f)
    with open(head_catalog) as f:
        head = json.load(f)
    base_by_name = {p["name"]: p for p in base["plugins"]}
    head_by_name = {p["name"]: p for p in head["plugins"]}

    added = set(head_by_name) - set(base_by_name)
    removed = set(base_by_name) - set(head_by_name)
    modified = {n for n in head_by_name & base_by_name
                if head_by_name[n] != base_by_name[n]}

    # Average plugin block is ~30 lines.
    return (len(added) + len(removed) + len(modified)) * 30

actual_delta = abs(file_line_count(head) - file_line_count(base))
expected = expected_line_delta(base, head)
budget = expected + 300  # slack for inline edits

if actual_delta > budget:
    sys.exit(f"FAIL: catalog diff {actual_delta} lines, budget {budget}")

The guard parses both catalogs structurally, computes the expected line delta from the actual content changes, and rejects PRs where the file delta exceeds that by more than 300 lines. "The file is still valid" becomes "the diff is the size we expected from the work that was claimed."

4. The SSH deploy that succeeded by doing nothing

Repo: hustle, PR #40. Documented in the intentsolutions-vps-runbook AAR for Phase 2.5 of the VPS migration.

The new Hustle VPS deploy workflow merged green. The first auto-deploy reported success. The container on the VPS was untouched.

The canonical reusable VPS deploy workflow is one SSH call:

- name: Deploy
  run: ssh ${{ env.DEPLOY_USER }}@${{ env.DEPLOY_HOST }}

There is no command argument. The whole architecture relies on a command="..." force-command directive in authorized_keys to bind the deploy key to a specific script. Connect with the key, the forced command runs, deploy happens, connection closes.

The hustle-deploy user's authorized_keys had no force-command. Plain ssh user@host with no command and no force-command opens an interactive session. The runner has no TTY. The session sits idle for a moment, the server times out the silent connection, exit 0. From the runner's perspective: SSH connected, SSH closed cleanly, deploy step SUCCESS. From the VPS's perspective: a key authenticated, nothing happened, the session ended.

The fix is a deploy script and a force-command lock:

# /usr/local/sbin/deploy-hustle
#!/bin/bash
set -euo pipefail
cd /srv/hustle
git fetch origin
git reset --hard origin/main
docker compose pull
docker compose up -d --remove-orphans
docker compose ps

# /home/hustle-deploy/.ssh/authorized_keys
command="/usr/local/sbin/deploy-hustle",no-port-forwarding,no-X11-forwarding,no-pty ssh-ed25519 AAAA... deploy@github

Now there is no path where the SSH channel can do nothing. The forced command runs or the key fails to authenticate. The second deploy ran the script end-to-end, recreated the container, and produced visible log output the runner could grep.

The generalization matters more than the fix. Every Docker-variant deploy in the fleet that depends on a force-command and doesn't have one is silently broken in the same way. lilly-75-holy and braves-booth are flagged for audit; partner-portals and claude-code-plugins-plus-skills are safe — both have the force-command directive in place. The fleet sweep is tracked as a follow-up bead off the P7 Stage C epic, not folded into this post.

5. The regex that skipped matches because `/g` left state behind

Repo: intentional-cognition-os, PR #67 (a Gemini review followup on E10-B03).

Two module-level constants:

const SOURCE_RE = /\[\^src:([^\]]+)\]/g;
const WIKILINK_RE = /\[\[([^\]]+)\]\]/g;

Used in two back-to-back RegExp.exec loops to iterate citation markers in a body of text:

export function extractCitations(body: string): Citation[] {
  const out: Citation[] = [];
  let m;
  while ((m = SOURCE_RE.exec(body)) !== null) {
    out.push({ kind: "source", id: m[1] });
  }
  while ((m = WIKILINK_RE.exec(body)) !== null) {
    out.push({ kind: "wikilink", id: m[1] });
  }
  return out;
}

RegExp instances with the /g flag carry a mutable lastIndex between calls. The exec loop is supposed to walk it to the end and let the final non-match reset it to 0 — but any code path that exits the loop early, throws mid-iteration, or runs concurrently on the same regex object leaves lastIndex mid-string. The next call to extractCitations starts searching from wherever the last one stopped.

The citation handler kept reporting "verified" because the missed citations were not checked at all — not flagged as missing, not flagged as wrong. They were invisible. Whichever entries fell before the carried-over lastIndex were skipped silently, every time.

The fix:

export function extractCitations(body: string): Citation[] {
  // Required: SOURCE_RE and WIKILINK_RE are module-level /g regexes.
  // Reset lastIndex on entry so prior loop state cannot cause this call
  // to start mid-string and silently skip matches.
  SOURCE_RE.lastIndex = 0;
  WIKILINK_RE.lastIndex = 0;

  const out: Citation[] = [];
  let m;
  while ((m = SOURCE_RE.exec(body)) !== null) {
    out.push({ kind: "source", id: m[1] });
  }
  while ((m = WIKILINK_RE.exec(body)) !== null) {
    out.push({ kind: "wikilink", id: m[1] });
  }
  return out;
}

The comment is load-bearing. Without it, the next refactor pulls the resets out as "redundant" and the silent skip comes back. Six regression tests pin the invariant: prebuilt-index honored, batch aggregation correct, 100 sequential calls return identical output, two interleaved bodies (one long, one short) stay independent of each other.

The shape of a silent failure

All five share the same anatomy. There exists a legitimate no-op outcome — no plugin paths matched, no files to include, no formatting changes needed, no command to run, no remaining matches in the string. The error path produces an observable state identical to the legitimate no-op. The downstream consumer cannot tell which one it got.

The fixes are not better error handling. The fixes are active assertions about the work that was claimed:

prescreen: if files matched the trigger, the extraction must have produced rows
gitignore + allow-list: plugin configs must reach the tree, not just the working directory — and source allow-lists must fail on missing imports, not silently ship a partial build
prettier: the diff size must match the structural work
SSH deploy: bind the command to the key — make it impossible for the channel to do nothing
regex: reset state to a known precondition before every call, and pin that contract with a test

The common verb in every fix is assert, not handle. The bug was not that errors weren't caught. The bug was that there was no point in the pipeline where the system stated, in code, what counted as the work actually being done.

The hardest silent failures to catch are the ones where the tool's success state and its silent-failure state are observationally identical. That is the category. Once auditing for it begins, more keep surfacing — most CI pipelines have at least one step that exits 0 whether or not it did anything, and most of them are downstream of a step that can legitimately produce empty output.

Silent failures don't get worse over time. They get more confident. Each green check trains the audit instinct to skip them, and the audit instinct is the only thing standing between the build status and the truth.

Deterministic-first, LLM-advisory CI — the broader argument for keeping reject/accept decisions in code that can be reasoned about, with model output as advisory signal
Three guards against shipping slop — earlier examples of the same assert-the-work pattern in plugin merges
Two false-positive fixes, same root cause — when two unrelated bugs share an underlying shape

Deterministic First, LLM Second: An Advisory CI Pre-Screen

Jeremy Longshore — Mon, 18 May 2026 13:00:26 +0000

The old PR review system ran Gemini on every submission to the claude-code-plugins repo. It broke every time — quota errors, timeout, malformed JSON, the works. On 2026-05-15 I shipped a replacement and deleted the original on the same day.

The replacement is structured around two contracts. A deterministic classifier scores each submission against 12 rules and emits one of three verdicts. A Groq LLM bolted on top writes a 5-line summary as advisory polish. The deterministic layer is the product. The LLM never blocks.

The first live invocation immediately caught two bugs in the new system. That's not failure. That's the design working exactly as intended.

Five PRs in one day: #719, #723, #721, #724, #725. Together they close the epic and demonstrate why the deterministic-first pattern lets you replace a live system without a transition period.

The never-block contract: LLM outputs are advisory only. They never block the primary CI decision. If the LLM crashes, times out, or hallucinates, the deterministic verdict posts unchanged and the rest of the pipeline runs as if the LLM step never existed.

The two contracts that matter

The pre-screen workflow lives on two non-negotiable contracts that separate the product from the polish.

The first: the deterministic classifier is the product. It ingests validator JSON output, applies 12 rules to the changeset, and emits a verdict — PASS, CHANGES_REQUESTED, or HARD_BLOCK. Three outcomes. No gray. No ambiguity.

The classifier is a pure function. No I/O. No dependencies beyond the Python stdlib. Every test case maps to a rule. Every rule maps to observable, repeatable behavior. You can trace it from input to output without waiting for an API to respond or hoping an LLM doesn't hallucinate.

PR #719 closed this layer in 579 new lines: .github/workflows/pr-prescreen.yml (~270 lines of workflow), scripts/pr-prescreen/classify.py (the classifier), and 12 unit tests covering every rule and edge case.

The workflow pattern is fork-safe: pull_request_target, SHA-pinned checkout, persist-credentials: false, never executes PR-controlled code. I copied the security pattern verbatim from the broken Gemini workflow (lines 23-79 — those were the load-bearing security). The security model didn't change. The signal source did.

The second contract: the LLM is advisory, never blocks. When the deterministic layer says PASS, the Groq LLM generates a 5-line human-readable summary. The summary is rendered as a GitHub comment. It carries zero veto power.

If Groq times out, crashes, the API key leaks, or the model hallucinates — continue-on-error: true on the workflow step ensures the pre-screen verdict still posts. The comment just doesn't appear. Slack doesn't ping a summary. The rest of the CI runs unchanged. The primary signal is independent of the advisory layer.

The verdict table:

Verdict	Deterministic rule	Slack	Comment	Retry
`PASS`	All skills ≥ C, no fatal errors	Ping	LLM summary (Groq)	No
`CHANGES_REQUESTED`	Missing fields or D/F grade	Silent	Deterministic details	No
`HARD_BLOCK`	Fatal validator error or missing impl	Ping	Deterministic details	No

Groq runs only on PASS verdicts. If Groq fails, the verdict still posts — just without the summary. The deterministic layer is the contract. The LLM is the enhancement that runs inside the bounds of the contract.

PR #723 added the Groq integration: 388 new lines, scripts/pr-prescreen/summarize.py. It calls Groq directly via stdlib urllib — no SDK, no dependency overhead, no transitive vulnerability surface. Model: llama-3.3-70b-versatile. Wall-clock budget: 5 seconds. Single attempt. No retries.

The function is dead simple: POST the verdict JSON plus the changeset summary to Groq, parse the response, format it, return. 11 unit tests including fixes from PR #720's review:

broad except Exception for the never-block contract — catch literally everything
OSError instead of TimeoutError for broader I/O coverage (socket.timeout, connection resets, the rest of the network failure surface)
json.JSONDecodeError guard for malformed responses

One tested invariant matters most: user-controlled PR content cannot override the fixed system prompt. The system prompt is a string literal in the source. The PR body is data. They never meet in the same code path. The classifier output goes into the user-role message; the system role is hard-coded and unreachable from outside.

What "never block" buys you

The old Gemini system gave an LLM veto power over 3,000+ shipped artifacts. Every time it broke, you couldn't delete it — too much workflow depended on it staying alive. The downstream blast radius made retiring it a multi-week migration.

You're stuck in maintenance mode: tuning prompts, chasing API changes, hoping the next model version handles the job the same way the last one did. You can't turn it off. You can't replace it. The veto power is a cage.

The never-block contract changes the trade-off entirely. The LLM is an enhancement layered on top of a deterministic core, not the core itself. If it malfunctions, the workflow degrades gracefully to the deterministic verdict — which you already trust to be correct.

You can replace the old system on the same day you deploy the new one. You're not hedging bets. You're not running both in parallel. You're not waiting for three weeks of production data to prove the new system is safe. You measure trust against the deterministic layer; the LLM is polish that can't revoke the decision.

PR #719 merged. The next day PR #721 deleted gemini-code-review.yml (179 lines of perpetually broken YAML) as a single breaking change.

That PR removed:

the workflow file itself
the orphaned ENABLE_GEMINI_REVIEW repo variable (operator deletes after merge)
the --thorough flag on the validator (advertised in the README but with broken plumbing)

It also added two new surfaces:

scripts/pr-prescreen/audit.py — appends one row per pre-screen run to freshie/inventory.sqlite, tracking the decision history for post-mortems and operator review. Inline CREATE TABLE IF NOT EXISTS schema. continue-on-error: true so DB failures don't mask the primary signal.
a 265-line operator runbook at 000-docs/265-DR-GUID-pr-prescreen-system.md documenting the workflow, the verdicts, the audit schema, and the operator's playbook.

No transition period. No parallel runs. No "let's keep both for safety." The new system had been live for exactly one workflow_dispatch manual invocation. That was enough to trust it — because the deterministic layer is the contract and it's testable end-to-end without the LLM in the loop.

First live invocation found two bugs

The first production run was PR #722 — a hyperflow submission from an external contributor with 8 new skills. The run immediately surfaced two design flaws the test suite didn't catch, because the test suite ran against toy data and missed the production edge cases.

Bug 1: empty-changeset explosion. PR #722 touched sources.yaml only — no plugins/ paths. The changeset filter triggered a fallback I'd written without thinking: "no plugin paths matched" → pass through all results → generate comment body for ~400 skills.

GitHub's comment API caps bodies at 65,536 characters. The post failed silently. The deterministic verdict was correct, but the comment never landed and the Slack ping fired with an incomplete reference. Confusing signal to the operator. Real production bug caught by the first real-world input.

Bug 2: pointless comments. Even after fixing Bug 1, every infrastructure or documentation PR would still get a "PASS: no plugin paths matched" comment. That's accurate — nothing happened. But it's signal without value: visible noise on every non-plugin PR. Noise erodes signal over time. After a week, the operator stops reading the comments.

PR #724 fixed both in 50 lines net (+50/-163 after deleting the dead fallback). Three changes:

Empty changeset → emit empty filtered list (the classifier reports "no plugin paths," doesn't dump everything).
Skip the Post Comment and Slack Notify steps entirely when steps.diff.outputs.count == '0'.
Cap the per-skill table at 100 rows and truncate the body to 65,000 characters with a clear marker.

Empty changeset → no comment, no ping. The deterministic signal has preconditions. When those preconditions aren't met, the system stays silent. No noise. No confusion.

The system found its own design flaws on first contact with reality. That's not a weakness of the never-block contract — that's the whole point. The deterministic classifier is safe enough to trust on its first invocation. The advisor runs under safe conditions. When reality violated the conditions, the system degraded gracefully and the operator fixed the precondition. Not the core logic. The core logic was never wrong. The assumptions feeding it were.

The spec was invisible — fix the surface

PR #722 was thoughtful. The contributor read CONTRIBUTING.md, followed the issue template, and wrote skills that made technical sense. All 8 were structurally sound. And all 8 were missing every one of the 6 marketplace-required frontmatter fields plus every one of the 7 body sections.

The expectations were buried in 000-docs/6767-b-SPEC-DR-STND-claude-skills-standard.md — the Global Master Standard for Claude Skills, v3.6.0, with the 100-point rubric and source citations against Anthropic + AgentSkills.io. Authoritative. Comprehensive. Invisible to contributors.

No link from the PR template. No mention in CONTRIBUTING.md. No signpost in the plugin-submission issue template. The deterministic classifier caught it all as D and F grades and reported each one. That's correct — the validator is working. But the feedback loop was broken: the spec was invisible to contributors. Invisible requirements produce work that looks wrong until you read the fine print. By then you've already written it.

PR #725 surfaced the spec on three contributor surfaces:

CONTRIBUTING.md — new "Read the spec before you start" callout above "Before You Submit," with the distilled requirements and direct links.
.github/PULL_REQUEST_TEMPLATE.md — top-of-template now points to CONTRIBUTING and to the spec. Also replaces stale "auto-review bot" phrasing that referred to the deleted Gemini workflow.
.github/ISSUE_TEMPLATE/plugin-submission.yml — adds a markdown description block with the spec callout and replaces 5 generic checkboxes with 7 spec-aware ones covering the real validator gates.

Also corrected two stale strings while I was there: "Gemini 2.5 Pro will post a review" → "the PR Pre-screen workflow," and the example switched from the deprecated --enterprise flag to the current --marketplace flag.

The validator still grades the same way. The standard didn't move. But now the spec isn't a surprise buried 8 directories deep; it's the first thing you see when you open a pull request. The expected drop in D/F submissions isn't a change to the validator. It's a change to the surface contributors actually touch.

The five-PR arc — #719, #723, #721, #724, #725 — is a case study in what never-block lets you do: ship something small, watch it collide with reality, and fix the collisions without unwinding the core. The deterministic classifier didn't change between Phase 1 and the hot-fix. The Groq advisory didn't change either. The preconditions and the surface visibility did, because reality demanded it.

Deterministic first, LLM second, never-block contract always. That's the formula that lets you retire the old system on the same day and trust the replacement.

Transitive CVE Clearance: The Dual-Layer Pattern

Jeremy Longshore — Sat, 16 May 2026 13:00:33 +0000

You bump a direct dependency to pull in a patched transitive. bun audit goes green. The lockfile is committed. Two weeks later, someone does a clean install on a fresh machine, and the vulnerable transitive comes back. This is the transitive CVE trap, and it catches teams with the first move alone.

The v0.9.1 release of claude-code-slack-channel cleared 6 high-severity CVEs in axios and fast-uri. It required two distinct moves: first, bump the direct deps that pull the patched transitives. Second, pin those transitives at the top-level overrides block so the lockfile cannot regress on the next bun install. Both moves are mandatory. Here's why.

The CVE Picture

Six vulnerabilities came down from the audit:

axios (multiple prototype-pollution and header-injection chains):

GHSA-q8qp-cvcw-x6jj — credential injection via prototype pollution
GHSA-pmwg-cvhr-8vh7 — NO_PROXY bypass via 127.0.0.0/8
GHSA-6chq-wfr3-2hj9 — header injection through polluted properties
GHSA-pf86-5x62-jrwf — response-tampering gadgets in prototype chain

fast-uri (percent-encoding confusion):

GHSA-v39h-62p7-jpjc — host confusion via percent-encoded delimiters
GHSA-q3j6-qgpj-74h6 — path traversal via percent-encoded dot segments

All were high severity. Axios was reachable through @slack/web-api, and fast-uri through @modelcontextprotocol/sdk.

Move 1: Bump the Direct Deps

The straightforward path: bump the deps that pull the patched versions.

@slack/web-api 7.15.0 → 7.15.2    (pulls axios ^1.13.5 → ^1.15.0, resolves to 1.16.1)
@modelcontextprotocol/sdk 1.27.1 → 1.29.0    (refreshes ajv → fast-uri 3.1.2)

Commit this, run the lockfile lock, bun audit shows green. Done, right?

Not quite.

The Lockfile Trap

Package managers use semantic versioning ranges. @slack/web-api at 7.15.2 declares axios ^1.15.0, which matches 1.15.x, 1.16.x, and newer. The first install on your CI or contributor's machine might pull 1.16.1 (the patched version). But six months later, when the MCP SDK maintainer releases a new version that also depends on axios with a different range like ^1.13.0, and a contributor runs bun install on a fresh checkout without the lockfile, the resolver has two legitimate paths to axios: one through Slack at 1.16.1 and one through MCP at 1.13.x. Package managers are free to choose — and if they pick the older one, the CVE is back.

The lockfile prevents this within a known tree, but it has a shelf life. Lockfiles can be ignored (clean install), overridden (manual dependency update), or corrupted (merge conflicts). The real guard is a top-level override that says: "No matter what ranges the transitives declare, axios stays at ^1.16.1 and fast-uri stays at ^3.1.2, always."

Move 2: Pin at the Top-Level Overrides Block

In Bun (and npm/yarn with overrides support), you declare a top-level policy:

{
  "dependencies": {
    "@slack/web-api": "7.15.2",
    "@modelcontextprotocol/sdk": "1.29.0"
  },
  "overrides": {
    "axios": "^1.16.1",
    "fast-uri": "^3.1.2"
  }
}

The overrides block forces every transitive reference to those packages to resolve through the pinned versions, regardless of what ranges the direct deps declare. Now a future lockfile, a fresh install, a contributor on a different machine — all of them get the patched versions. The CVE cannot re-emerge through a range mismatch.

Without the override, the next bun install on a clean tree could legally pull axios 1.13.x (or whatever version a new transitive path declares) and the CVE is back. With the override, it cannot.

Why Both Moves Matter

Move 1 (the dep bump) gets the patched version into the lockfile the first time and signals intent to the dependency tree. Move 2 (the override) is the insurance policy — it says "this version is non-negotiable" to any future resolver, whether it's a clean install, a new team member, or a GitHub Actions runner months from now.

Neither move alone is complete. Bump without override = fragile; override without bump = signals a different problem (the direct dep is stale and needs its own fix). Both together = the CVE cannot come back.

Evidence: The Full Gauntlet

The release ran the Intent Solutions testing gauntlet on every change:

704/704 tests passing (unit + integration + system + E2E)
98.47% line coverage, 98.82% function coverage (floor enforced by CI gate)
Cyclomatic complexity max = 28 (threshold = 30, no violations)
Harness-hash integrity verified (test policy signatures unchanged)
Depcruise clean (dependency graph validated, no cycles, no forbidden imports)
Gherkin-lint clean (all acceptance test syntax valid)
bun audit --audit-level=high clean (excluding one known unpatched transitive marked safe by policy)

A version bump that doesn't clear the full gauntlet doesn't ship. This one did.

Parallel Work in the Same Release Window

PR #162 (external contributor @PGMacDesign) fixed the file-upload extension bug — uploads were defaulting to file.txt because the filename wasn't being passed to filesUploadV2. That change rode the same release vehicle, showing the dual-layer pattern applies to all release-critical fixes, not just CVEs.

PR #164 cleaned up documentation drift after the CVE work landed — updated CLAUDE.md cross-references, dropped the gemini-review workflow (now handled via GitHub App), refreshed the source file LoC table to match the 704-test count, and softened coverage claims to "~704 / ~4,035" with a note that the floor is the real gate, not the count.

All three PRs (#162, #163, #164) merged into a single release tag with a 157-line AAR documenting the bump rationale, the CVE IDs, the test results, and the decision to include the external contribution in the same release window.

Takeaway

Clearing a transitive CVE is not a one-move operation. Bump the direct dep, run the gauntlet, add the top-level override, and commit both. The override is the difference between a fix that sticks and a fix that waits for the next fresh install to fail.

CCSC: Five Releases in One Day — Security Sprint — the prior security sprint on the same repo, where the v0.8.x baseline got hardened before this v0.9.1 patch.
Slack Channel Security Hardening v0.2.0 — External Contributors — earlier hardening pass plus an external-contributor merge story, parallel to today's #162.
Audit Harness v0.1.0 — Enforcement Travels with the Code — the vendored gauntlet (.audit-harness/) that produced the 704/704 + 98.47% evidence in this release.

Three Guards Against Shipping Slop

Jeremy Longshore — Fri, 15 May 2026 13:00:22 +0000

Seven pull requests landed on a single partner fork in one day, alongside half a dozen upstream issue filings and the closeout of a prior audit round. That is a velocity that produces slop by default. The slop did not ship — not because the work was careful, but because three distinct guards were standing between the work and the partner, each catching a different class of failure the other two would have missed.

This post is about those three guards. Not about the velocity. The velocity is the symptom. The guards are the system.

The engagement is Kobiton, a mobile device cloud partner running an MCP server at api.kobiton.com/mcp. The day's output included a hooks bundle, an agents addition, a server-side audit slate, and a consistency cleanup — PRs #39 through #45 on the fork. Any one of those, shipped wrong, would have cost partner credibility. None shipped wrong. Three guards caught the slop at three different moments in the workflow.

Guard 1: Adversarial pre-flight on the hooks bundle

Before the hooks bundle PR went up, three specialist subagents ran in parallel against the raw artifact:

code-reviewer
security-auditor
test-automator

The output was not gentle. Six BLOCKERs and eight HIGHs surfaced between the three reviewers. The PreToolUse envelope shape was wrong. The credential-handling strategy was unsafe — hooks were going to make authenticated API calls from inside a Claude Code session, with a credential surface that nobody had thought through. The appId parameter was an SSRF vector. Error responses echoed PII. There was a ReDoS in input parsing. CLAUDE_PROJECT_DIR vs CLAUDE_PLUGIN_ROOT was confused throughout. The shell-vs-exec form choice was wrong for several handlers. TLS and timeout defaults were missing entirely.

The bundle was BLOCKED from submission. Not "submit with caveats" — blocked, with a re-review date of 2026-05-21.

The hooks PR that actually landed — #44 — was a redesigned advisory-only bundle. No API calls from hooks at all. The credential surface that produced half the BLOCKERs was eliminated by design, not patched. 28 new tests passed. The artifact that shipped was a different artifact than the one that was queued to ship.

The transferable insight is about reviewer parallelism. A security reviewer reading after a code reviewer reads a different file than the one the code reviewer read. The code reviewer has already mentally cleared the surface; the security reviewer inherits that clearance silently. Running the three reviewers in parallel against the raw artifact — each one seeing the actual code, none of them inheriting another reviewer's frame — is what surfaced the BLOCKERs.

Serial review with the same three personas would likely have caught fewer issues — this is a structural inference from how reviewer framing inherits, not a measured comparison. The parallelism is load-bearing. It is also adversarial by construction: each reviewer is graded on what they find, not on consensus with the others.

Guard 2: Empirical verification over inference on the server-side audit

The R3 server-side audit slate for Kobiton — the set of findings about what the MCP server does and does not implement — started as a documentation review. Read the public docs, reason about the MCP protocol, file findings about apparent gaps.

Several DRAFT findings carried inference-grade language. "Likely missing." "Probably not declared." "Appears to omit." That language is a tell. Inference-grade findings filed to a partner are slop with a hedge attached. The hedge does not protect anyone — the partner still has to spend cycles refuting wrong claims.

The work shifted from inference to probe. Using the getCredential MCP tool to obtain a real Kobiton API key, the audit executed raw authenticated probes against api.kobiton.com/mcp:

initialize
resources/list
prompts/list
resources/templates/list

The verbatim server response to initialize:

protocolVersion: 2025-03-26
capabilities: {"tools": {}}
serverInfo: {"name":"kobiton","version":"1.0.0"}

Resources, prompts, and templates list each returned JSON-RPC error -32601 (method not found).

Six findings flipped from "likely missing" to verified against a server response: F36 (instructions field absent), F37 (resources capability absent), F38 (prompts capability absent), F42a (tools.listChanged not declared), F42b (resource subscriptions not declared), plus a newly-discovered protocol version lag — the server declares 2025-03-26 against a current spec of 2025-11-25, two releases behind.

The OAuth retraction is the load-bearing example for this guard, and it is worth describing in detail because the retraction is more valuable than the original finding would have been.

Bundle 3 DRAFT claimed Kobiton was missing three things: RFC 9728 (Protected Resource Metadata), RFC 8414 (Authorization Server Metadata), and WWW-Authenticate response headers entirely. Those claims were built from doc review. They were wrong.

The empirical probe showed all three were already implemented. Kobiton's MCP server has OAuth 2.1 with PKCE S256 and dynamic client registration. The well-known metadata endpoints respond. WWW-Authenticate is emitted. The original Bundle 3 finding — filed as a serious gap — was wrong on its central claims.

The Bundle 3 issue body got rewritten. The wrong claims were withdrawn. The bundle was narrowed to two real, verified gaps: F41d (the resource_indicators_supported field is undeclared) and F41e (the WWW-Authenticate header is inconsistent on bad-token 401 responses). The issue body included an explicit sourcing-discipline paragraph: here is what was wrong, here is why it was wrong, here is the corrected scope.

The transferable insight is about retraction economics. The credibility cost of an unwithdrawn wrong claim compounds — every future finding from the same audit gets read through the lens of "they got OAuth wrong, what else did they get wrong?" The credibility cost of an explicit retraction is small and decays fast. The partner reads the retraction, registers that the audit corrects itself, and the next finding gets evaluated on its merits.

Inference-grade findings shipped to partners are not "drafts" or "starting points." They are slop with a hedge. If the system can produce a verbatim server response, the audit has to produce one before the finding ships.

Guard 3: Post-delivery consistency sweep against fork main

After PRs #39 through #44 landed, the work ran /validate-consistency against the fork's main branch. Each PR had been internally consistent. The sweep returned seven findings anyway — all of them cross-PR drift.

Critical findings, two:

AGENTS.md was missing from the fork root, but the agents and hooks PRs both referenced it as if it existed.
package.json was still at version 1.0.0 while plugin.json had been bumped to 1.0.2. The version bump happened on one surface but not the other.

Warning findings, four:

The README had no section for the new agents/ directory introduced by PR #41.
The README had no section for the new hooks/ directory introduced by PR #44.
SKILL.md claimed Node >=18 while CI was already pinned to Node 20.
A fork-side issue reference read just #28 with no owner — ambiguous between upstream and fork.

Info finding, one:

package.json and marketplace.json had divergent descriptions.

All seven resolved in PR #45, a single cleanup pass. AGENTS.md got created at the fork root, 72 lines, every claim sourced. package.json bumped to 1.0.2 to match plugin.json. README gained sections for both agents/ and hooks/. SKILL.md Node compatibility was updated to match CI. The bare #28 was disambiguated to jeremylongshore/automate#28. The two manifest descriptions were aligned.

None of those seven findings would have been caught by reviewing any individual PR. Each PR was internally consistent. The drift only existed in the relational space between the PRs — file A references file B that does not exist yet, version X on surface 1 lags version Y on surface 2, description in manifest M diverges from description in manifest N.

The transferable insight is about review topology. Pre-submission review operates on one artifact at a time. Cross-artifact drift is structurally invisible to that frame. Running a consistency sweep as the closing move of the day catches a class of slop that pre-submission review cannot catch by design.

The sweep is cheap. The cleanup PR is small. The slop it prevents is the kind partners notice quietly and never mention — the README that does not describe what the repo contains, the version numbers that disagree with themselves, the references to files that do not exist. Quiet slop is the most expensive kind because the partner does not file a bug; they just lower their estimate of the engagement.

What three guards do not catch

These three guards target three specific failure classes. The pre-flight guard catches surface flaws in the artifact being shipped. The empirical verification guard catches inference-grade claims about external systems. The post-delivery consistency guard catches cross-artifact drift.

None of the three catches bad strategic choices. If the underlying decision to ship an advisory-only hooks bundle — rather than no hooks at all, or rather than blocking hooks with a serious credential design — was wrong, the guards would not flag it. They would clear a well-built version of the wrong thing.

None of the three catches architectural drift over weeks. All three operate on a single day's window. A long arc of individually-consistent decisions adding up to a wrong system needs a different mechanism — typically a periodic architecture review or a deliberate retro, neither of which fits inside a daily ship cycle.

None of the three catches bad communication with the partner. The guards catch wrong claims, not wrong tone, wrong cadence, or wrong escalation. A correctly-filed finding delivered with the wrong framing to the wrong person at the wrong moment is still a credibility hit. That problem lives outside the guards.

The guards are necessary, not sufficient. They eliminate a category of public embarrassment; they do not produce good engineering. Good engineering happens upstream of the guards, in the choices about what to build and what to file. The guards make sure that the choices, once made, ship in a defensible form.

What the day actually demonstrated

Seven PRs in a day on a partner engagement — with upstream filings and a prior audit round closing out in parallel — is a velocity that produces slop by default. That is the baseline. Velocity without a system underneath it is the slop pattern.

The slop did not ship today because three different mechanisms caught three different classes of error at three different moments. The pre-flight guard caught the hooks bundle before submission and forced a redesign that eliminated the credential surface entirely. The empirical verification guard caught the inference-grade OAuth claims before they shipped and converted the bundle into a narrower, defensible scope with an explicit retraction. The post-delivery consistency guard caught seven instances of cross-PR drift after the PRs landed and resolved them in a single cleanup pass.

The retraction is worth a second mention. Withdrawing wrong claims with an explicit sourcing-discipline paragraph is the kind of artifact that builds long-term credibility with a partner more than a perfect first submission would. A perfect first submission demonstrates competence. A retraction demonstrates a working error-correction loop. Partners optimize for working error-correction loops because they assume errors will happen — what they care about is what happens after.

This is the system. Not the velocity, the system underneath the velocity. The velocity is downstream of the system, not the other way around. Seven PRs in a day is safe if the three guards are running. Seven PRs in a day without the guards is a slop event waiting to be discovered by the partner — usually quietly, usually without a bug report, usually as a downward revision of trust that nobody articulates.

The lesson is not "ship faster." The lesson is "build the guards first, then the velocity is allowed."

Two False-Positive Fixes, Same Root Cause

Jeremy Longshore — Thu, 14 May 2026 13:00:27 +0000

Two separate monitoring failures on the same day, same root cause. Both fixed by answering a single question: "Am I testing for health, or am I testing for perfect conditions?" The distinction matters because perfect conditions are temporary, and health is structural. And once you see the pattern once, you see it everywhere.

Context: production on a shared VPS

The Braves stack runs on Contabo (24 GiB RAM, 6 CPUs). Five Docker stacks share that hardware: Braves (frontend, backend, pybaseball), Plane (13 containers), Twenty (5 containers), Umami (3 containers), and ntfy (1 container). 25 containers total. Single ingress: Caddy reverse proxy. Single disk. When one stack's load spikes, all five feel it.

This architecture means healthchecks and deployment validators are sensitive to global state, not just stack-local state. A healthcheck that works under isolated test conditions can fail when the VPS is under collective load. A validator that passes in the afternoon can fail at 2 AM when a different stack is doing batch work.

The symptom

On May 11, two separate failure modes emerged:

False-positive container-unhealthy alerts firing ~10 times per day. Each one triggered: manual inspection, "nope, it's fine," return to normal operations. Repeat. The notification log became noise.
Every off-hours deploy auto-rolling back without an obvious cause. Off-season deployments (which are mostly off-hours) all failed smoke checks and rolled back. The CI pipeline was effectively blocked for non-emergency pushes.

Both failures traced to monitoring expressions that mixed structural health signals with situational condition signals.

Fix one: TCP over HTTP fetch

The setup

Healthchecks for the Braves containers ran every 10 seconds, invoking Node's global fetch (or urllib.request for the Python service) to make an HTTP round-trip to a local status endpoint. The logic was straightforward: open connection, validate response, exit on failure. The Docker healthcheck timeout was 5 seconds.

Performance profile:

Light load (loadavg < 2): fetch completed in 5–20 ms.
Moderate load (loadavg 2–8): fetch completed in 100–500 ms.
High load (loadavg > 10): fetch sometimes failed to complete within 5 seconds.

The failure cascade

When the healthcheck timed out:

Docker retried the check every 10 seconds.
After 5 consecutive timeouts (50 seconds), Docker marked the container unhealthy.
Netdata observed the state change and fired a docker_container_unhealthy alert.
The alert flowed through ntfy to mobile notifications: "scorecardecho is down."
Manual inspection: the container was fine, the process was responding, load was just high.
Clear the alert, wait for the cycle to repeat.

This happened ~10 times per day, every single day.

The assumption that bit

Fetch-based healthchecks assume light load. They assume:

The event loop has microseconds to spare for I/O
The network isn't congested
The kernel isn't swapping
No other workload is competing for scheduler time

All true most of the time. Not true on a shared VPS where 24 other production containers are running. Not true when pybaseball is churning through XML parsing. Not true when Plane is sync-checking its database. The healthcheck assumed the happy path—and the production VPS spends most of its time off the happy path.

The fix (commit `cbb4f6e`)

Replace the HTTP fetch with a raw TCP connect. Verification moves from the application layer down to a single SYN/ACK exchange — the work the kernel was already doing to accept the connection.

  healthcheck:
-   test: ["CMD-SHELL", "node -e \"fetch('http://localhost:3001/api/health').then(r=>{if(!r.ok)process.exit(1)}).catch(()=>process.exit(1))\""]
-   interval: 10s
+   test: ["CMD-SHELL", "node -e \"require('net').connect(3001,'localhost').on('connect',function(){this.end();process.exit(0)}).on('error',function(){process.exit(1)})\""]
+   interval: 30s
    timeout: 5s
-   retries: 5
+   retries: 3
+   start_period: 15s

The Python service got the equivalent treatment:

- test: ["CMD-SHELL", "python3 -c \"import urllib.request; urllib.request.urlopen('http://localhost:8001/health')\" || exit 1"]
+ test: ["CMD-SHELL", "python3 -c \"import socket; s=socket.create_connection(('localhost',8001),2); s.close()\""]

Both new checks open a TCP connection to the port, immediately close it, and exit. No HTTP parsing. No JSON. No event-loop work beyond the socket call itself. The kernel completes the SYN/ACK in microseconds even when the application thread is stalled. This pattern works in any container image that already has node or python3 — no extra binaries to install.

Tuning alongside the fix

Three other changes shipped together:

Interval 10s → 30s: Polling three times less frequently means 3× fewer state transitions, 3× fewer container-state callback executions, 3× fewer potential false positives.
Retries 5 → 3: Before: unhealthy after 50 seconds. After: unhealthy after 90 seconds. Trades slightly earlier detection of real outages for dramatically lower false-positive noise.
start_period: 15s added: Containers no longer fail healthcheck during startup when they're still bootstrapping.

Operational pairing: Netdata hold-down

The VPS runs Netdata for monitoring. A separate change added a 2-minute hold-down before alerting on docker_container_unhealthy. A brief glitch—a 10-second spike in load, a temporary network hiccup—can't page anymore. It has to persist for 120 seconds.

Result

Unhealthy alerts dropped from ~10 per day to zero. The notification log went silent.

Fix two: drop the mode signal from deployment validation

The setup

The deployment smoke check for the Braves backend used a jq filter applied to the app's status endpoint:

.status == "ok" and .gumbo.running == true

The first part is a liveness signal: the app is responding and healthy. The second part is a mode signal: the gumbo processor (which handles game-update XML) is currently running. When this filter was written—probably during baseball season when games are daily—both conditions made intuitive sense. Both seemed permanent.

The failure cascade

Most of the calendar is between games:

Off-season (November–March)
Post-game (after each game ends)
Pre-game (before first pitch, morning hours)

During these windows, gumbo.running is false. Most deployments happen off-hours. So most off-hours deployments triggered a smoke check that required gumbo.running == true. The app was fine. The status was "ok". But the game processor was inactive. The filter conjunction failed. The deployment workflow interpreted the failure as "deployment is broken, roll back." Automatic rollback fired. Every single off-hours deploy. Without exception.

This blocked the entire CI pipeline for off-season work. No off-hours deployments could land unless manually overridden.

The assumption that bit

gumbo.running is a temporary signal. It's true when a game is in progress. False when there isn't one. During the offseason it's false for months straight.

The smoke check mixed a permanent structural signal (status == "ok" = the app is healthy) with a temporary situational signal (gumbo.running == true = a game is active right now). It required both to be true, as if they were equivalent. They aren't. An app is healthy between games just as much as it's healthy during games. Health and game-processing mode are orthogonal.

The fix (commit `5b9fe26`)

Remove the mode condition entirely. The filter now simply validates health:

-.status == "ok" and .gumbo.running == true
+.status == "ok"

A single question: "Is the app responding correctly?" Nothing about what it's processing. Nothing about external conditions.

Result

Off-hours deployments stopped auto-rolling back. The CI pipeline unblocked. Every deploy now passes smoke validation as long as the app is actually healthy, regardless of whether a game is in progress.

The shared lesson

Both fixes follow the same pattern: a monitoring expression conjoined two signals where one was structural and the other was situational.

Fix	Structural Signal	Situational Signal	Status
#1 (healthcheck)	"Process is listening on port 3000"	"Load is light enough for a 5-second fetch"	Always true? No.
#2 (smoke check)	"App responds with ok status"	"Game processor is running"	Always true? No.

When the situational signal became false—as situational signals do—the conjunction failed, and the alarm fired. The system was healthy. The alarm was noise.

The pattern emerges because it feels right when you write it. "The app should be healthy and the load should be light." "The container should be healthy and the game should be in progress." Both conditions seem like they should always be true. They're not. Situational conditions change. The moment you conjoin them with structural health signals, you've created a trap. The conjunction becomes true only under the narrow circumstances you happened to be testing in.

Three ways to break the trap

Remove the situational condition

Ask only the health question. Strip the conjunction down to the structural signal.

Move to a separate alert

"Is the app healthy?" and "Is the game processor running?" are two questions. They should be two checks, not one. Alert on each independently.

Document the assumption

If the check fails when a situational condition flips, say so in the alert message so responders know the system is fine without manual intervention.

The checklist before merging a monitoring expression

List every condition it depends on staying true:

"This healthcheck assumes load is under threshold N."
"This smoke check assumes a game is in progress."
"This alert assumes the cache is populated."
"This validator assumes the external service is available."

If any condition can become false — and most can — apply one of the three fixes above.

A healthcheck should answer: "Is this process alive?" A deployment validator should answer: "Does the app respond correctly?" Neither should answer: "And is everything perfect?" Perfect is temporary. Healthy is structural.

Also shipped: hubspot-pack v2.0.0 landed the same day, consolidating 30 templated skills into 10 production-engineering skills following the guidewire v2 pattern. Also: porkbun-dnssec-caa.sh script pinning DNSSEC/CAA on intentsolutions.io as a Rekor predicate precondition.

AGENTS.md as a Cross-Tool Plugin Brief: A Case Study from kobiton/automate

Jeremy Longshore — Tue, 12 May 2026 13:00:26 +0000

Canonical home: This post first appeared on Kobiton's blog at kobiton.com/blog/agents-md-cross-tool-plugin-brief-case-study-kobiton-automate. This page mirrors it; SEO authority consolidates to the Kobiton URL via rel="canonical".

AGENTS.md as a Cross-Tool Plugin Brief: A Case Study from kobiton/automate

TL;DR — I ran a 5-device parity sweep against Kobiton's real-device cloud through the kobiton/automate Claude Code plugin. iOS screenshot capture came in ~17% faster than Android in this run. The interesting part isn't the gap — it's that the plugin doesn't document the gap, or the post-deleteSession cooldown, or which Appium log endpoints actually work. That's what an AGENTS.md file is for, and PR #10 on the repo is starting to add one. This is a worked example of what should go in it.

I spent last week poking at kobiton/automate, the Claude Code plugin that fronts Kobiton's real-device cloud. Five devices, two pools, both major mobile platforms, one small WebDriverIO harness. The numbers showed something plugin authors rarely publish: iOS screenshot capture was about 17% faster than Android across the sample.

That gap isn't a bug. It's platform variance. But it's the kind of variance you want surfaced before your CI bill quietly compounds it — and surfacing things like this is exactly what a cross-tool agent brief like AGENTS.md is for.

The plugin

kobiton/automate is a thin Claude Code plugin pointing at a remote MCP server (https://api.kobiton.com/mcp). The repo holds manifests, one skill, schemas, and docs. Appium still runs the driver loop once a session opens. That's the right boundary. The plugin doesn't pretend to be Appium; it just helps the agent get into a working session and back out cleanly.

The public repo currently exposes 12 MCP tools:

Area	Tools
Devices	`listDevices`, `getDeviceStatus`, `reserveDevice`, `terminateReservation`
Sessions	`listSessions`, `getSession`, `getSessionArtifacts`, `terminateSession`
Apps	`listApps`, `uploadAppToStore`, `confirmAppUpload`, `getApp`

Last week the team opened PR #10, which adds GitHub Copilot CLI support and an AGENTS.md file. Five files changed, 75 lines added. As of writing it's open and marked in testing. Most of the diff is portability work — declaring skill and MCP paths, swapping Claude-specific phrasing for neutral language, and adding the agent-facing instructions file itself.

That PR is what made me want to write this up. It's a real example of a plugin moving from "works in Claude Code" to "any reasonable coding agent can read this and behave."

The parity sweep

The harness is small. Open an Appium session, take five screenshots, record boot wall-clock and per-screenshot p50, terminate cleanly. Five devices:

Device pool	OS	Model	Boot ms	Screenshot p50
PRIVATE	Android 13	Galaxy A52s 5G	4,206	353
CLOUD	Android 9	moto g(7) play	5,451	297
PRIVATE	iOS 17.5.1	iPhone XR	5,091	242
CLOUD	iOS 18.6	iPhone 14 Plus	4,490	306
CLOUD	iOS 18.6.2	iPad 9th Gen	5,259	256

In this run:

Boot times spread ~30%.
Screenshot p50 spread ~46%.
Android averaged ~325ms per screenshot.
iOS averaged ~268ms — about 17% faster.

Five devices is not a fleet study, so don't read this as "iOS wins." What's worth noticing is that platform mattered more than pixel count. The fastest screenshot in the run came off an iPhone XR at 828×1792; the slowest came off a Galaxy A52s 5G at 1080×2400. Resolution alone didn't predict the spread.

That gap matters in CI. A 57ms screenshot delta sounds trivial until you compound it. At 100 tests × 50 runs/day × 3 screenshots per test, you've spent ~855 seconds a day, or ~7 hours a month, on the slower path. Push that to five screenshots per test and you're at ~12 hours/month. Not a redesign-the-suite number. But it's real queue time — enough that a routing decision ("send the screenshot-heavy suite to iOS first") starts paying for itself.

Two findings an AGENTS.md would close

Two things came up that an agent-facing brief would have closed before I started.

Endpoint compatibility

driver.getLogs('logcat') didn't return usable data through the endpoint my client tried. Appium's docs distinguish between /session/:sessionId/log and /session/:sessionId/se/log, and which one works depends on the driver and server. A plugin like this should just say up front which log endpoints it supports, which it rejects, and what the agent should do when log retrieval fails.

Without that, a test ported in from a vanilla Appium setup can silently lose its logs. The test still passes. The evidence is just gone. Worst kind of failure — the kind that smiles and waves while stealing your evidence.

Lifecycle invisibility

After deleteSession, devices entered a brief cooldown. During the window getDeviceStatus reported them as ACTIVATED with is_online=true — but they couldn't actually accept a new session yet. A naive scheduler sees "ready," queues the next job, and waits.

The fix is a documented lifecycle. Names like ready / reserved / active / cleanup-required / cooldown-required / offline / unknown. The wording matters less than having one. If is_online=true doesn't mean session-ready, the plugin needs to say that out loud.

Both gaps are documentation, not code.

Where Claude Code conventions meet AGENTS.md

If you've authored a Claude Code plugin you already know about CLAUDE.md (Claude-specific repo guidance) and SKILL.md (skill frontmatter and workflow). Neither replaces AGENTS.md.

AGENTS.md is the tool-agnostic instruction file. A briefing packet any coding agent can read: setup, conventions, testing rules, operational caveats. SKILL.md belongs to a different model entirely — the open AgentSkills.io spec defines its structure for reusable skills. Related, not interchangeable.

The four files compose:

File	Purpose
`README.md`	For humans — overview and install
`CLAUDE.md`	Claude Code-specific guidance
`SKILL.md`	Skill trigger and workflow
`AGENTS.md`	Cross-tool operational guidance for any agent

A strong AGENTS.md for an MCP-backed testing plugin should cover capabilities (what it does), costs and latency (p50/p95, screenshot timing, upload constraints, platform variance), lifecycle states (what "ready" actually means), compatibility boundaries (which Appium endpoints work, when to fall back to artifact APIs), and orchestrator requirements (what CI systems and agent runtimes need to know).

When a plugin documents that, a cost-conscious agent can make decisions instead of guessing. "This suite goes to the faster capture path." "This device needs cooldown." "This log endpoint isn't available, use artifacts." Without the spec you're guessing. With it, you're routing.

What kobiton/automate got right

The plugin is a clean implementation of the thin-plugin / remote-MCP pattern that the AI agent ecosystem is converging on. MCP server config points to Kobiton's hosted endpoint. OAuth 2.1 is the default; API keys exist for headless CI. App uploads go through pre-signed storage URLs rather than routing binaries through the assistant. Tool schemas live as reference YAML. The run-automation-suite skill stays focused on guided Appium execution and doesn't try to become a test framework.

That's the right scope. A Claude Code plugin shouldn't pretend to be Appium. It should help the agent pick a target, prepare inputs, run the test, collect evidence, and report out.

PR #10 adds the cross-tool layer on top of that. It isn't a complete operational spec yet, but it's pointed in the right direction.

What's still open

The gaps the parity sweep exposed are exactly what I'd document next:

Supported and unsupported Appium log endpoints.
Platform-specific log retrieval guidance.
Device lifecycle states between "online" and "session-ready."
Cooldown behavior after deleteSession.
Retry/backoff rules for schedulers.
Error shapes for partial success, timeout cleanup, and artifact failures.
Latency expectations for screenshot capture and session boot.

The file doesn't have to be exhaustive on day one. It has to be honest — the operational facts an agent would otherwise learn the expensive way.

Method note

The matrix wasn't a vibe check. Before any device touched the harness, I had three Claude sub-agents review the script in parallel — code-reviewer, test-automator, security-auditor. They caught:

Orphaned cleanup on timeout.
Partial success counted as full success in the fallback chain.
A timing bug where a 30-second log capture window could skid by ~1.5 seconds per device under load.

Any one of those would have polluted the measurement. The cadence is reusable: specify the experiment, multi-review it, fix the harness, run the sweep, publish with caveats. Skipping the review step is how a 10-minute validation turns into a two-hour bug archaeology dig.

A test you can run this week

If you author or consume a real-device testing plugin, run something like this against your own pool:

for device in pool:
    t0 = now()
    session = create_session(device)
    wait_for_ready()
    boot_ms = now() - t0

    shots = []
    for _ in range(5):
        s = now()
        take_screenshot()
        shots.append(now() - s)

    delete_session(device)
    results[device] = {"boot": boot_ms, "shots": shots}

print_percentiles(results)

Five devices, five screenshots, one table. That's the baseline you can re-run whenever your pool changes — and the evidence you need to decide whether screenshot-heavy, log-heavy, or cold-start-sensitive tests should route differently.

If your platform vendor's docs don't tell you which Appium endpoints work, what session cleanup actually does, or what "online" means — that's not a docs gap. That's operational risk wearing a friendly UI.

The takeaway

Cross-tool plugin standards aren't abstract architecture. They're the difference between

"We picked Android arbitrarily and paid for the variance silently."

and

"We routed the screenshot-heavy suite based on measured platform behavior."

kobiton/automate is moving in the right direction. Clean remote-MCP shape, focused skill design, sensible auth boundaries — and now PR #10 starts the cross-tool instruction surface.

If you author a plugin: README.md for humans, CLAUDE.md for Claude-specific bits, SKILL.md for skill workflow, AGENTS.md for everything any agent runtime needs to know. They compose; none of them replaces another.

If you consume plugins from a real-device cloud — or any AI-orchestratable platform — ask your vendor whether they publish an AGENTS.md or equivalent. Then ask what's in it.

If the answer is "what's that?", you found the gap.

Postscript (2026-05-07). While this post was being finalized, the kobiton/automate team merged Copilot CLI support (PR #10) and opened a Phase 1 Gemini CLI extension PR (PR #28). Both reuse the same AGENTS.md, the same MCP server endpoint via OAuth dynamic discovery (RFC 9728), and the same skills/<name>/SKILL.md convention — three CLIs against one source of truth, zero server-side code change.

The Gemini PR description is the working reference for anyone trying this pattern: AGENTS.md carries the cross-tool load (no separate GEMINI.md needed), dynamic-discovery OAuth lets the install flow piggyback off plumbing already deployed, and skills auto-discover from the canonical path so they don't need explicit manifest references. If you're authoring a plugin in 2026 and want to ship it across Claude Code, Copilot CLI, and Gemini CLI with one source tree, read those two PRs.

OpenAI Codex CLI is the natural fourth runtime in this space and fits the same pattern — AGENTS.md is read natively, MCP servers are declared in ~/.codex/config.toml under [mcp_servers.<name>], and the OAuth dynamic-discovery flow is identical. The only delta is the config format (TOML rather than JSON), which means a Codex extension to a multi-CLI plugin is typically just a documentation snippet — no new manifest, no new build step. Four agentic CLIs, one cross-tool surface, one MCP server. That's the convergence the AGENTS.md convention was hinting at all along.

Coherence as a Deliverable: How a Multi-Surface Engagement Stays Sane

Jeremy Longshore — Tue, 12 May 2026 13:00:25 +0000

A sprawling multi-surface engagement (Kobiton partner pilot, 4 months, three deliverable rounds) exposed a silent failure mode: drift doesn't announce itself. A title rename on Plane goes unnoticed when the canonical source doc still has the old framing. A partner-portal deliverable gets updated before the source file does, leaving future sessions reading stale context from what should be source-of-truth.

On 2026-05-08, one session caught two separate drift instances and shipped four structural patterns to make drift cheaper to find next time. None of the drifts were bugs. Both were coherence gaps — places where a single idea lived in multiple surfaces (Plane, beads, local docs, partner portal) with different currency.

The fix wasn't "use one surface." It was: detect drift early, make the boundaries between surfaces explicit, give pre-committed thinking a home, and grow scope through buckets instead of through accretion.

Drift Caught in Two Directions

A May 4 session had renamed a Plane content issue from "Text-first AI triage on session logs (refined per F30)" to "AI-vision testing." The local draft file (000-docs/020-DR-BLOG-...md), the partner portal copy (m2-blog-3.md), and the CLAUDE.md history were all on the canonical thesis: text-first triage. Plane was the only surface out of sync.

Caught by reading CLAUDE.md cold. A session with no prior context opened the resume-from-cold doc and noticed the contradiction. The fix: revert Plane back to canonical, log the vision-testing angle as a separate evergreen idea in a new file (034-RR-OPEN), mark it explicitly as deferred.

The reverse-drift happened the same day. An R2 fork-staging update went to the partner portal first (because the client reads that surface), but the source doc (021-AA-AACR-r2-...md) was now stale. Sync brought source up to portal. Header table updated with new snapshot tag, new "Staged audit slate" metadata row. Reverse-drift is the silent kind: the deliverable surface looks current, the source looks wrong, and a future session reading source will replay outdated thinking.

Two drifts in one day on the same engagement. The pattern: without explicit boundaries and a promotions log, every surface drifts toward stale.

A Current-Focus Block at the Top of CLAUDE.md

Added to kobiton/CLAUDE.md: a "Current focus" block at the very top. Three rows. Each row names a live workstream (M2 blog cadence, M3+R3 final review, hooks-as-deterministic thesis), owns it to a bead, and defines what "done" looks like.

Below that: an explicit "what NOT to start" list. New evergreen blogs, project-shipping blogs, site infra, channel work — all queued but explicitly deferred until M2/M3/R3 close.

Why not a checklist or a TODO list? Because a TODO is committed work. A Current-focus block is a priority map for cold-starting future sessions. A TODO says "do this." The block says "this is load-bearing now; everything else queues below the line." Future sessions landing cold should know what's live without reading a month of history.

## Current focus (2026-05-08) — read this first

| Workstream | Owner | "Done" looks like |
|---|---|---|
| M2 Blog series delivery | kobiton-z3y | Blog 1 published May 11, Blog 2 May 18, Blog 3 May 25 |
| M3 Featured Placement + R3 close | kobiton-9z0.7, kobiton-bmj | R3 deliverable filed and reviewed by May 25 |
| Hooks-as-deterministic layer thesis | kobiton-5cj | Prototype → multi-reviewer pre-flight → R3 above-spec landing |

**What NOT to start until M2/M3/R3 close:** new evergreen blog drafts, new project-shipping blogs, site-refresh work, channel infra.

A Strategic Spine for the 19-Issue Backlog

Same day, a separate consolidation: 19+ scattered Plane content issues organized into a 6-post evergreen series in publication order, with adjacent clusters (B/C/D/E) listed so unfiled ideas have homes too. This is the antidote to backlog rot. Without a spine, every new idea fights every other idea for next session's attention. With a spine, ideas cluster, and new sessions land oriented — they read the spine, see what's live in cluster A, and know that clusters B-E are queued but real.

RR-OPEN: The Pre-Committed Layer

A new file: 034-RR-OPEN-things-to-think-about.md. Single surface for engagement-adjacent open questions, loose threads, refinement ideas, and deferred decisions that aren't yet committed work. Not a TODO. Not a backlog. A pre-committed thinking surface.

Six categories. Initial seed: 10 bullets. Crucially, it includes a "Promotions log" — when a bullet matures and graduates (to Plane CONTENT, beads, KOB issues, email, or CLAUDE.md), the commit message records where it went:

### Promotions log

- 2026-05-08 — "Per-harness spec audit scope (decide before May 14)"
  RESOLVED as out-of-scope. Spec audit stays narrowly scoped to
  code.claude.com/docs/en/mcp per existing contract. The "10-12
  harness reach" framing migrates to OPS-28, not engagement scope.

Why not just a TODO list or a scattered Slack thread? Because ideas that live nowhere searchable get re-invented. RR-OPEN is a single backlog-rot antidote: ideas can live here, mature visibly, and graduate with an audit trail of where they went.

Scope Discipline Through Bucket Boundaries

R3 scope expanded from one bucket to three in a single session. Normally that's a red flag. The discipline that kept it coherent: each bucket got its own bead with explicit deliverable boundaries.

Standard re-validation (existing bead, no boundary change)
Spec-conformance audit (new bead, separate surface, distinct findings)
Hooks bundle (conditional on multi-reviewer pre-flight before May 23)

The empirical findings catalog (F11-F35) and the spec-conformance candidates (F36-F43) live in separate subsections. Scope can grow without losing shape if the boundaries between buckets are explicit and defensible.

Pre-Flight Catches the Signal-Type Misses

A technical comment for a partner GitHub PR went through multi-reviewer pre-flight before posting. The catch: three signal-type mislabelings in a single comment. The same three mislabelings had propagated back into Plane CONTENT issues that referenced the same source material.

Mistakes don't live in single surfaces. A mislabeling on a public comment is also on the issues that referenced the same source. Catching it pre-flight means one fix in three places. Catching it after publish means a correction comment, three issue edits, and a stale public comment that future readers will trust.

Also Shipped

R2 follow-up email sent (closing the credibility gap from a May 5 commitment). One beads epic created, one stale bead closed. Three companion commits to partner-portals/ for the reverse-drift fix.

AGENTS.md as a Cross-Tool Plugin Brief: A Case Study from kobiton/automate

Jeremy Longshore — Mon, 11 May 2026 04:43:57 +0000

Canonical home: This post first appeared on Kobiton's blog at kobiton.com/blog/agents-md-cross-tool-plugin-brief-case-study-kobiton-automate. This page mirrors it; SEO authority consolidates to the Kobiton URL via rel="canonical".

AGENTS.md as a Cross-Tool Plugin Brief: A Case Study from kobiton/automate

The plugin

The public repo currently exposes 12 MCP tools:

Area	Tools
Devices	`listDevices`, `getDeviceStatus`, `reserveDevice`, `terminateReservation`
Sessions	`listSessions`, `getSession`, `getSessionArtifacts`, `terminateSession`
Apps	`listApps`, `uploadAppToStore`, `confirmAppUpload`, `getApp`

That PR is what made me want to write this up. It's a real example of a plugin moving from "works in Claude Code" to "any reasonable coding agent can read this and behave."

The parity sweep

The harness is small. Open an Appium session, take five screenshots, record boot wall-clock and per-screenshot p50, terminate cleanly. Five devices:

Device pool	OS	Model	Boot ms	Screenshot p50
PRIVATE	Android 13	Galaxy A52s 5G	4,206	353
CLOUD	Android 9	moto g(7) play	5,451	297
PRIVATE	iOS 17.5.1	iPhone XR	5,091	242
CLOUD	iOS 18.6	iPhone 14 Plus	4,490	306
CLOUD	iOS 18.6.2	iPad 9th Gen	5,259	256

In this run:

Boot times spread ~30%.
Screenshot p50 spread ~46%.
Android averaged ~325ms per screenshot.
iOS averaged ~268ms — about 17% faster.

Two findings an AGENTS.md would close

Two things came up that an agent-facing brief would have closed before I started.

Endpoint compatibility

Lifecycle invisibility

Both gaps are documentation, not code.

Where Claude Code conventions meet AGENTS.md

If you've authored a Claude Code plugin you already know about CLAUDE.md (Claude-specific repo guidance) and SKILL.md (skill frontmatter and workflow). Neither replaces AGENTS.md.

The four files compose:

File	Purpose
`README.md`	For humans — overview and install
`CLAUDE.md`	Claude Code-specific guidance
`SKILL.md`	Skill trigger and workflow
`AGENTS.md`	Cross-tool operational guidance for any agent

What kobiton/automate got right

That's the right scope. A Claude Code plugin shouldn't pretend to be Appium. It should help the agent pick a target, prepare inputs, run the test, collect evidence, and report out.

PR #10 adds the cross-tool layer on top of that. It isn't a complete operational spec yet, but it's pointed in the right direction.

What's still open

The gaps the parity sweep exposed are exactly what I'd document next:

Supported and unsupported Appium log endpoints.
Platform-specific log retrieval guidance.
Device lifecycle states between "online" and "session-ready."
Cooldown behavior after deleteSession.
Retry/backoff rules for schedulers.
Error shapes for partial success, timeout cleanup, and artifact failures.
Latency expectations for screenshot capture and session boot.

The file doesn't have to be exhaustive on day one. It has to be honest — the operational facts an agent would otherwise learn the expensive way.

Method note

Orphaned cleanup on timeout.
Partial success counted as full success in the fallback chain.
A timing bug where a 30-second log capture window could skid by ~1.5 seconds per device under load.

A test you can run this week

If you author or consume a real-device testing plugin, run something like this against your own pool:

for device in pool:
    t0 = now()
    session = create_session(device)
    wait_for_ready()
    boot_ms = now() - t0

    shots = []
    for _ in range(5):
        s = now()
        take_screenshot()
        shots.append(now() - s)

    delete_session(device)
    results[device] = {"boot": boot_ms, "shots": shots}

print_percentiles(results)

The takeaway

Cross-tool plugin standards aren't abstract architecture. They're the difference between

"We picked Android arbitrarily and paid for the variance silently."

and

"We routed the screenshot-heavy suite based on measured platform behavior."

kobiton/automate is moving in the right direction. Clean remote-MCP shape, focused skill design, sensible auth boundaries — and now PR #10 starts the cross-tool instruction surface.

If you consume plugins from a real-device cloud — or any AI-orchestratable platform — ask your vendor whether they publish an AGENTS.md or equivalent. Then ask what's in it.

If the answer is "what's that?", you found the gap.

Forge Dogfood Ships a Grade-A Plane Plugin, JRig Loop Closes

Jeremy Longshore — Sun, 10 May 2026 13:00:26 +0000

A plugin generator is theoretical until it produces something a marketplace will actually accept. May 7 turned the /skill-creator --forge workflow from an 8-gate diagram into a real artifact — a Plane plugin that scored Grade A (97/100), passed Tier 2 GREEN with zero warnings, and cleared all 12 deterministic j-rig checks across the 7-layer behavioral framework. On the same day, the JRig-Verified provenance pipe closed end-to-end: a schema, a build-time enrichment step, a per-plugin verification page, and a validator tier all landed in the same window. The thesis the day proves: compound commands and build-time enrichment beat raw API surfaces and runtime joins, and the way to find that out is to run the full pipeline once on something real.

What "theoretical" looked like on May 6

The forge workflow had eight gates defined in spec. None had been exercised together. The JRig-Verified badge UI shipped earlier in the same day in PR #696 — it rendered, but the data path behind it terminated at an empty placeholder. A plugin detail page could display "JRig-Verified · N/7 layers" if the right shape of data showed up, but nothing in the build pipeline produced that data, and the /plugins/<name>/verification link the badge pointed at was a 404.

The pre-May-7 state, in a sentence each:

Forge: documented, scaffolded, never run end-to-end on a real API
JRig badge: UI complete, no data source, dangling link target
Validator: 100-point rubric, no static production checks beyond it
Marketplace homepage: 422 plugins, no curated entry surface for the first five minutes
Provenance metadata: spec defined generated and author_type fields; no consumers

Every one of those gaps closed on May 7.

The forge dogfood — Plane as a team behavior observatory

The forge takes two inputs: an API spec and a one-line NOI (Notion of Intent — the answer to "what makes this plugin different from a CRUD wrapper?"). The NOI is the forcing function. Without it, an LLM-generated plugin defaults to one command per endpoint, and the result is a thinner, slower duplicate of whatever MCP server already wraps the API.

The NOI for this run rejected that framing outright:

Plane is a team behavior observatory.

That sentence does most of the design work. The existing mcp__plane MCP server already covers CRUD — listing cycles, creating issues, updating worklogs. A plugin that wraps the same surface is dead weight. A plugin that surfaces the behavioral signal hiding inside JOINed Plane data is something the MCP server cannot do, because MCP tools are endpoint-shaped and behavior is JOIN-shaped.

The five compound commands the NOI produced each answer a question that no single Plane endpoint can answer:

/plane-cycle-velocity        — does cycle close-out match cycle planning?
/plane-stale-tickets          — which In Progress tickets quietly fail under shared ownership?
/plane-reviewer-gate-strength — which reviewers gate-keep harder than the spec demands?
/plane-priority-drift         — does the team plan high-priority and ship low-priority?
/plane-cross-project-load     — which engineers are spread across too many projects?

Each requires JOINing at least two Plane resources, applying a scoring formula, and producing ranked output. None of them is GET /cycles/{id} plus a render template.

The 8 gates and what came out of each

Gate	Outcome
1. NOI	Accepted: "Plane is a team behavior observatory"
2. Ecosystem absorb	5 competing tools cataloged; behavioral-synthesis gap confirmed uncovered
3. API surface	14 Plane API endpoints documented in `api-surface.md`
4. Domain archetype	Project / Workflow tracker; default compound set adopted + extended
5. Compound commands	5 commands designed with synthesis logic + scoring formula
6. Generation	SKILL.md, 2 agents, 3 references, plugin.json, README.md written
7. Validation	Tier 1 Grade A (97/100), Tier 2 GREEN, Tier 3A GREEN (12/12 j-rig)
8. PR + catalog	PR #703

The output of Gate 7 is the headline number. It is also the first piece of evidence that the workflow produces marketplace-grade output, not lab-bench output.

Reproducibility — the receipts anyone can re-run

Both checks shipped in the PR body:

python3 scripts/validate-skills-schema.py --marketplace \
  plugins/productivity/plane/skills/plane/SKILL.md
# → Grade A (97/100), Tier 2 GREEN, 0 errors, 0 warnings

j-rig check plugins/productivity/plane/skills/plane
# → 12 passed, 0 warnings, 0 errors

Actual stdout from the run captured in the PR body:

Grade: A (97/100)
Tier 2: GREEN — 0 errors, 0 warnings
Tier 3A: 12/12 j-rig checks passed

Anyone with the claude-code-plugins repo checked out can rerun those two commands and deterministically verify the result. Provenance without reproducibility is decoration.

Why "compound" beats "wrapper" — the design rationale in detail

The CRUD-wrapper anti-pattern is seductive because it is easy to generate. An LLM with an OpenAPI spec can produce one command per endpoint in a few minutes. The output passes most surface-level checks: it has commands, it has parameters, it talks to the API. What it does not have is value beyond the API.

A user who wants to list cycles in Plane already has mcp__plane.list_cycles. A plugin command called /plane-list-cycles is a strictly worse interface — slower (slash command overhead), harder to discover (lives in plugin catalog instead of MCP tool list), and provides no transformation of the result. The user gets the raw response either way; the plugin command added one round-trip and zero insight.

A compound command flips the value equation. /plane-cycle-velocity calls list_cycles, then for each cycle calls list_cycle_issues, joins the planning data against the close-out data, computes a velocity ratio per cycle, and returns ranked output with a behavioral interpretation. The user could in principle do this themselves with five MCP calls and a calculator. They will not. The plugin earns its place by collapsing five mechanical steps into one named operation that produces actionable signal.

The NOI gate exists to force this distinction during generation. "Plane is a team behavior observatory" is not a marketing tagline — it is a constraint that disqualifies any command that does not surface behavioral signal. The forge uses the NOI to filter the generated command list: a command that fails to tie back to the NOI gets cut, regardless of how cleanly it wraps an endpoint.

Architecture choices the dogfood surfaced

The forge produced AI-generated output that passed Tier 2 without post-generation edits — a first for the workflow. The PR is 1,123 lines, but the orchestrator skill is only 150. That ratio is intentional. SKILL.md routes — it does not implement. Implementation lives in two agents:

plane-expert — API-surface specialist. Knows endpoints, parameters, auth shape. Does not call live Plane. Used for design questions and shape verification.
plane-analyst — behavioral synthesis. Calls mcp__plane endpoints, applies JOIN logic and scoring formulas, returns ranked output. The five compound commands all delegate here.

Three references back the agents: noi.md (the design anchor every output ties back to), api-surface.md (the 14 endpoints), and compound-commands.md (the synthesis logic and scoring formulas).

MCP server scaffolding got skipped. The forge offers to scaffold an MCP server when the API has no existing wrapper; mcp__plane already exists. Producing a duplicate would have been the exact CRUD-wrapper anti-pattern the NOI rejected.

Provenance metadata — the seam that wires this to JRig

Two fields landed in the plugin's plugin.json:

{
  "generated": true,
  "author_type": "forge"
}

These are read by the marketplace renderer (PR #696's earlier work) to display the "Forge-generated" pill on the plugin page. They are also the inputs the JRig data flow keys on, which is the next half of May 7's story.

Closing the JRig-Verified loop

PR #696 landed the badge UI earlier the same day. The badge rendered conditionally on a plugin.jrig overlay that nothing wrote. The next four PRs closed the gap.

Schema — `forge_proofs` and three new columns (PR #699)

Three columns added to skill_compliance, all nullable, all idempotent:

Column	Type	Purpose
`jrig_passed`	INTEGER, nullable	Boolean — did all 7 JRig layers pass on the model matrix?
`jrig_tier_blocked`	INTEGER, nullable	Which JRig layer (1–7) failed
`jrig_baseline_delta`	REAL, nullable	Performance delta vs. naked Claude. >0 helps, <0 hurts

A new table holds verification artifacts:

CREATE TABLE forge_proofs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  plugin_name TEXT NOT NULL,
  run_id INTEGER,
  verification_type TEXT NOT NULL,    -- 'tier1' / 'tier2' / 'tier3-jrig' / 'dogfood'
  passed INTEGER NOT NULL,
  evidence TEXT,
  layers_passed INTEGER,
  total_layers INTEGER DEFAULT 7,
  baseline_delta REAL,
  verified_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  UNIQUE(plugin_name, verification_type, run_id)
);

The migration only ADDs — never DROPs, never RENAMEs. Re-runs are no-ops. PRAGMA-check guards prevent duplicate column adds. The schema is forward-compatible by construction.

Build pipeline — `enrich-jrig-data` step (PR #700)

The data path that wires forge_proofs rows into the rendered marketplace page:

forge_proofs (freshie/inventory.sqlite)
    │  SELECT … WHERE verification_type='tier3-jrig' AND passed=1
    ▼
enrich-jrig-data.mjs  ←  jrig:enrich build step (new)
    │  writes flat plugin_name → {verified, layers_passed, total_layers,
    │  baseline_delta, verified_at} map
    ▼
marketplace/src/data/jrig-data.json
    │  imported by [name].astro at static-build time
    ▼
plugin.jrig overlay  ←  PR #696's existing optional-chain rendering
    ▼
"JRig-Verified · N/7 layers" pill on plugin detail page

The build pipeline order post-merge:

1. discover-skills         → skills-catalog.json
2. extract-readme-sections → readme-sections.json
3. sync-catalog            → catalog.json
4. enrich-jrig-data        → jrig-data.json     ← NEW
5. generate-unified-search → unified-search-index.json
6. build-cowork-zips       → cowork zips + manifest
7. astro build             → static site

The sqlite driver decision

enrich-jrig-data.mjs reads freshie/inventory.sqlite to produce jrig-data.json. Two driver options were on the table:

better-sqlite3 — native module, single-call query, ~1 ms per read
sqlite3 CLI subprocess — already on every dev machine and CI runner, ~50 ms per query

better-sqlite3 would add a postinstall native build step on every CI run. That step adds 30–90 seconds, fails on architecture mismatches, and has bitten enough Node projects to be a known smell. The sqlite3 CLI is already installed everywhere the build runs — sqlite3 -json returns parseable JSON natively. Trade: ~50 ms subprocess overhead per query, dwarfed by the 20-second astro pipeline that follows.

Today's jrig-data.json content: {}. Empty by design — no forge_proofs rows have landed yet. The build degrades to "no badge rendered" for every plugin, which is the correct fallback. As soon as the first JRig run writes a tier3-jrig row, the next site build picks it up automatically.

Why build-time, not request-time

Two build-time vs. request-time architectures were on the table for getting forge_proofs data onto plugin pages:

Request-time JOIN — plugin page fetches forge_proofs rows on each render, joins against catalog data, renders the badge inline
Build-time enrichment — forge_proofs rows pre-computed into a flat JSON map at build time, imported by the static page renderer

Request-time wins on freshness — the moment a JRig run writes a row, the next page view sees it. Build-time wins on everything else. The marketplace is a static site (Astro SSG output, served from CDN). Putting a database in the request path of a static site means giving up the static-site benefits: no edge caching, no instant cold start, no "drop the build into any object store and it works." The freshness gap from build-time is bounded by the deploy cadence — currently sub-hourly on commit, more than fast enough for a verification badge that does not need to update in real time.

The data flow shape is also worth noting: enrich-jrig-data.mjs produces a flat map keyed by plugin name. Not a relational join, not a graph, not nested objects — a flat key/value map small enough to import in full at render time. That shape was chosen because Astro's SSG model imports static JSON at the top of the render function. A flat map adds zero query logic to the page; a nested or relational structure would have forced filtering or joining inside the page template, which is the wrong place for that work.

Page target — `/plugins/<name>/verification` (PR #702)

The badge in PR #696 was a link. The link target was a 404. PR #702 shipped the destination page (306 lines of Astro) with two states:

Verified — green pill, baseline delta vs. naked Claude, verified-at timestamp, 7-layer breakdown
Pending — neutral status, two paths to JRig (forge generation or manual j-rig eval), reassurance that grade and Tier 2 results remain authoritative when JRig data is absent

Graceful degradation is built in: the jrig-data.json import is wrapped in a try/catch with an empty-object fallback. Environments without the data file still build the site; the verification page just renders the pending state for everyone.

Homepage starter pack (PR #701, Phase 4C)

Five curated Grade-A plugins now anchor the homepage:

Plugin	Persona
`ai-commit-gen`	productivity
`conversational-api-debugger`	debugging
`ci-cd-pipeline-builder`	ops
`design-to-code`	frontend
`excel-analyst-pro`	business

The marketplace had 422 plugins and no first-five-minutes surface. The starter pack is editorial cadence — quarterly rotation, not algorithmic ranking. Curation beats search when the catalog is too large to skim and the visitor has no query yet.

Validator Tier 2 gate — +273 lines of Python (PR #698)

Five deterministic checks now fire alongside the standard 100-point rubric:

tier2:allowed-tools-accuracy   — declared tools must appear in body          (warn)
tier2:auth-documented           — API surfaces require auth method documented (warn)
tier2:dead-code                 — literal-false branches detected             (warn, capped at 3 surfaces)
tier2:tool-safety               — unscoped Bash + Write/WebFetch needs        (error at marketplace)
                                  Safety Justification
tier2:orchestration-bounds      — skills shouldn't claim cross-skill          (error at marketplace)
                                  orchestration

The first three warn. The last two error at the marketplace tier. That split matches risk: shipping a skill that says "I orchestrate other skills" is a behavioral hazard; shipping one with a stale allowed-tools line is sloppy but not dangerous.

The false-positive guard — a generalizable pattern

The tier2:orchestration-bounds check initially flagged /validate-skillmd itself. That skill documents the anti-pattern in its body — it has a section explaining "skills shouldn't orchestrate other skills." The check, scanning prose for orchestration claims, hit those sentences and emitted an error.

The wrong fix would have been to special-case /validate-skillmd. The right fix was a generic guard on every Tier 2 prose check:

Skip lines inside code fences
Skip lines starting with > (block quotes) or | (table cells)
Skip lines containing negation markers: " not " (space-padded so it does not match "annotate" or "notable"), never, avoid, don't, do not, must not, should not, forbidden, disallow, anti-pattern, antipattern, wrong:, bad:

That guard generalizes to any static-analysis check that runs over prose. A document might describe the very pattern a check is looking for — to teach against it, to warn about it, to compare alternatives. The check has to recognize description versus assertion. The negation-marker list is a cheap heuristic that handles the common cases without an NLP dependency.

This pattern is reusable. Every prose-level lint rule on a documentation site eventually hits it.

Tradeoffs the day shipped

Nothing free landed. Each piece carries a cost:

sqlite3 CLI subprocess — ~50 ms per query overhead. Acceptable inside a 20 s build, would not be acceptable at request time.
jrig-data.json starts as {} — degrades gracefully today, but a misconfigured CI runner that fails the enrich-jrig-data step silently produces an empty file and every JRig badge disappears. The fallback is friendly; the failure mode is silent.
Plane plugin compound commands — JOIN logic and scoring formulas match the playbooks, but no live Plane workspace has run them yet. The math is correct on paper. Behavior under real data drift is unverified until someone runs them against a real workspace.
Validator Tier 2 negation-marker guard — list-based, not parser-based. Documents that paraphrase negation in unusual ways could still trip false positives. The fix when that happens is to extend the list, not to switch to a heavier parser.

Spec docs that landed alongside

Four spec PRs framed the day's work:

PR #693 — master skills spec bumped from 3.1.0 to 3.3.1
PR #695 — JRig Tier 3A spec snapshots added, with .gitignore exceptions to keep the snapshots tracked
PR #696 — tagline plus JRig-Verified plus forge-generated badges added to plugin pages (the UI that PRs #700 and #702 wired to data on May 7)
PR #697 — IS-extension fields for forge provenance landed (generated, author_type) — Phase 5A of the "Use the Printing Press to Learn" plan

Phase 4C (homepage starter pack) and Phase 5A (forge provenance schema) both closed on May 7. The forge dogfood itself was Phase 3 of the same plan. Three phases of a multi-phase plan, all converging in one window — not a coincidence. The plan was structured so the dogfood and the provenance pipeline would close together. Running the dogfood without the provenance pipeline produces a plugin nobody can verify; shipping the provenance pipeline without a dogfood produces a UI for data that does not exist. Both halves had to land at once for either to mean anything.

Why this matters beyond the marketplace

Three patterns from May 7 generalize past the immediate work.

Compound commands beat endpoint wrappers when the value is in the JOIN. The Plane plugin proves the design. An MCP server plus an LLM gives you GET per resource. A compound command gives you WHICH cycles closed late AND had reviewer churn AND had priority drift?, which no API endpoint exposes directly. The forge's NOI gate exists to force that question.

Build-time enrichment beats runtime joins for static marketplaces. jrig-data.json is computed once per build, served as static JSON, and read by Astro at SSG time. Runtime joining freshie/inventory.sqlite against the page render would have meant a database in the request path of a static site. The build step keeps the runtime simple and the cache cold-key small.

Provenance metadata is structural, not cosmetic. The generated: true, author_type: "forge" fields are not just for the badge. They are the seam that lets the JRig pipeline filter, the marketplace render, the validator behave differently, and future tooling cite the origin. Two boolean-ish fields, multiple downstream consumers — that is a metadata investment that pays compounding interest.

False-positive guards are part of the gate, not an afterthought. The Tier 2 orchestration check that flagged /validate-skillmd could have been dismissed as "fix it later." The decision to ship the negation-marker guard with the check is the difference between a gate that earns trust and a gate that gets bypassed because it cries wolf. Static-analysis checks live and die on their false-positive rate; once that rate goes above a small threshold, engineers route around them and the gate stops being a gate.

The Intent Solutions thread

The forge dogfood and the JRig loop close the same theme that has run through this site for the past month: turning policy into mechanism, then turning mechanism into evidence. The validator is policy. Tier 2 is mechanism. Grade A (97/100) is evidence. JRig is policy. forge_proofs is mechanism. The verification page is evidence. None of the three is sufficient alone, and the chain is what makes a marketplace claim defensible.

Also shipped — same day

The day did not stop at the marketplace and the forge.

intent-solutions-landing — PR #18 migrated intentsolutions.io off Firebase to the Contabo VPS (the canonical VPS-as-the-home pattern). PR #19 disabled compressHTML and bumped the line-length cap to 50k to fix a deploy regression. PR #20 dropped the Resend/SQLite form-flow notes — Slack-only is the final shape. Umami tracker landed alongside the existing Firebase Analytics. The trustbar gained a "53k+ npm Downloads" stat badge.
Umami analytics rollout across three sites — claude-code-plugins (PR #692, with data-domains spam guard), jeremylongshore.com, and intent-solutions-landing all wired to the self-hosted Umami instance in one day.
contributing-clanker — URL-or-repo argument now drives a two-branch onboarding-and-briefing flow.
partner-portals — Kobiton portal got an editorial pass (engagement-structure table tightened, status pills dropped, upcoming-work cards added).
kobiton — CLAUDE.md sync absorbed engagement history and the sub-bead table.
intent-eval-lab — umbrella repo scaffolded.
j-rig-binary-eval — skill-spec sources of truth pulled into the repo.
Marquee fixes — PR #689 throttled the npm fetch to dodge registry rate limits and restored the live total. PR #691 relabeled the marquee from '30d' to 'total downloads'.

Guidewire MCP v0.1.0 → v0.1.1 in 76 minutes — release engineering with the same evidence-first discipline
The Anti-Slop Framework Found Three Bugs Inside Itself — validator dogfooding, the same pattern that produced today's false-positive guard
Propagation Day: When the Spec Becomes the Migration Plan — spec-to-execution arcs, the same shape this dogfood follows

DEV Community: Jeremy Longshore

Five Tags, Zero Ships: How an Auto-Release Workflow Lied for a Whole Day

What the Checkmarks Promised

Bug 1: Tests That Passed by Lying

Bug 2: Nine Version Sources, Six Ignored

Bug 3: The Step That Wasn't There

The State Behind the Process

The Three-Bug Pattern

Also Shipped 2026-05-19

Related Posts

A v1.0 Is a Gate, Not a Tag

The 3× degradation gate

The release-readiness checklist (E10-B11, PR #73)

What "GO with conditions" actually means

Why not GO/NO-GO binary?

C1 fix: read your own version (PR #74)

The cut itself (52fa7a4 → v1.0.0)

The tarball turned out incomplete (v1.0.1, same day)

AAR same day

Also shipped

Related Posts

Honest Perf Benchmarks for a Paid-API Compiler

The corpus has to be byte-identical

An API key is not consent

Skipped is not zero

The four PRs, briefly

Why not the obvious alternatives

What the numbers say

Also shipped today

The transferable pattern

Related posts

Five Silent Failures in One Day

1. The prescreen that ran on zero plugins

2. The gitignore that ate plugin configs

3. Prettier that reformatted 11,000 lines and exited 0

4. The SSH deploy that succeeded by doing nothing

5. The regex that skipped matches because /g left state behind

The shape of a silent failure

Related Posts

Deterministic First, LLM Second: An Advisory CI Pre-Screen

The two contracts that matter

What "never block" buys you

First live invocation found two bugs

The spec was invisible — fix the surface

Transitive CVE Clearance: The Dual-Layer Pattern

The CVE Picture

Move 1: Bump the Direct Deps

The Lockfile Trap

Move 2: Pin at the Top-Level Overrides Block

Why Both Moves Matter

Evidence: The Full Gauntlet

Parallel Work in the Same Release Window

Takeaway

Related Posts

Three Guards Against Shipping Slop

Guard 1: Adversarial pre-flight on the hooks bundle

Guard 2: Empirical verification over inference on the server-side audit

Guard 3: Post-delivery consistency sweep against fork main

What three guards do not catch

What the day actually demonstrated

Two False-Positive Fixes, Same Root Cause

Context: production on a shared VPS

The symptom

Fix one: TCP over HTTP fetch

The setup

The failure cascade

The assumption that bit

The fix (commit cbb4f6e)

Tuning alongside the fix

Operational pairing: Netdata hold-down

Result

Fix two: drop the mode signal from deployment validation

The setup

The failure cascade

The assumption that bit

The fix (commit 5b9fe26)

Result

The shared lesson

Three ways to break the trap

Remove the situational condition

5. The regex that skipped matches because `/g` left state behind

The fix (commit `cbb4f6e`)

The fix (commit `5b9fe26`)

Schema — `forge_proofs` and three new columns (PR #699)

Build pipeline — `enrich-jrig-data` step (PR #700)

Page target — `/plugins/<name>/verification` (PR #702)