<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brad Kinnard</title>
    <description>The latest articles on DEV Community by Brad Kinnard (@moonrunnerkc).</description>
    <link>https://dev.to/moonrunnerkc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3727405%2Fdace59d9-5970-49b1-9ee7-0836891c5a65.png</url>
      <title>DEV Community: Brad Kinnard</title>
      <link>https://dev.to/moonrunnerkc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/moonrunnerkc"/>
    <language>en</language>
    <item>
      <title>Your AI's tests pass. That doesn't mean the code works.</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 31 May 2026 22:21:05 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/your-ais-tests-pass-that-doesnt-mean-the-code-works-239c</link>
      <guid>https://dev.to/moonrunnerkc/your-ais-tests-pass-that-doesnt-mean-the-code-works-239c</guid>
      <description>&lt;p&gt;You ask a coding agent to fix a bug. It writes the code, writes the tests, CI goes green, you merge. The bug's still there.&lt;/p&gt;

&lt;p&gt;The agent's job was to turn the check green. The honest way to do that is to fix the code. The lazy way is to write a test that passes no matter what the code does. CI can't tell those two apart. A green check means the tests passed, not that the code is right.&lt;/p&gt;

&lt;p&gt;It's easy to miss in review, because the test sits right there looking like proof:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;parses the config&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawInput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeDefined&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That passes whether &lt;code&gt;parseConfig&lt;/code&gt; works perfectly or returns nothing useful on every input. It checks nothing. Adding more tests like it just raises your coverage number, not your odds of catching a bad change.&lt;/p&gt;

&lt;p&gt;So I built ClaimCheck (&lt;a href="https://github.com/moonrunnerkc/claimcheck" rel="noopener noreferrer"&gt;https://github.com/moonrunnerkc/claimcheck&lt;/a&gt;). Instead of trusting the agent's tests, it tries to break them. If a test still passes after the supposedly fixed code is broken on purpose, the test was never really checking the fix, and it gets blocked. Same answer every time, no AI making the call. So far it's caught every cheat in a set of twelve hand-built cases. Twelve is small, and there's no public release yet, so treat that as a direction, not a finished result.&lt;/p&gt;
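&lt;p&gt;The break-it-on-purpose idea can be sketched in a few lines. This is only an illustration, not ClaimCheck's implementation: &lt;code&gt;Config&lt;/code&gt;, &lt;code&gt;brokenParseConfig&lt;/code&gt;, &lt;code&gt;weakTest&lt;/code&gt;, &lt;code&gt;strictTest&lt;/code&gt;, and &lt;code&gt;checksTheFix&lt;/code&gt; are hypothetical names, and the config shape is assumed.&lt;/p&gt;

```typescript
// Hypothetical stand-ins; none of these names come from ClaimCheck.
type Config = { port: number };
type Parser = (raw: string) => Config;
type Test = (parse: Parser) => boolean; // true means the test passed

// A working implementation and a deliberately broken copy of it.
const parseConfig: Parser = (raw) => ({ port: Number(JSON.parse(raw).port) });
const brokenParseConfig: Parser = () => ({ port: NaN });

// The weak test from above: it only checks that something came back.
const weakTest: Test = (parse) => parse('{"port": 8080}') !== undefined;

// A test that pins the expected value.
const strictTest: Test = (parse) => parse('{"port": 8080}').port === 8080;

// A test checks the fix only if it passes on the real code
// and fails once the code is broken on purpose.
function checksTheFix(test: Test): boolean {
  return test(parseConfig) ? !test(brokenParseConfig) : false;
}

console.log(checksTheFix(weakTest));   // false
console.log(checksTheFix(strictTest)); // true
```

&lt;p&gt;The weak test passes against both versions, so it never distinguished working code from broken code. The strict test fails on the broken copy, which is the signal worth keeping.&lt;/p&gt;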

&lt;p&gt;Some cheats slip through anyway. If the agent writes a real, solid test that locks in the wrong answer, every check passes. The only way to know the answer's wrong is to already know the right one, and nothing in the pull request can tell you that except the agent you're trying to catch. The one thing that helps is a clue from outside it, like a human-written bug report you can run the fix against.&lt;/p&gt;

&lt;p&gt;There's a second, wider tool, Swarm Orchestrator (&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;https://github.com/moonrunnerkc/swarm-orchestrator&lt;/a&gt;). It flags suspicious changes and keeps a tamper-evident record for audits. The record-keeping is the solid part. The catching is not: on real pull requests its accuracy is still low, and that's the half I'm hardening now.&lt;/p&gt;

&lt;p&gt;The next step is comparing the old code's behavior to the new directly. The catch is that a wrong change and a harmless cleanup can look the same from the outside, and a tool that blocks good code is worse than one that lets a bad change through. That's the part I'm still working out.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>testing</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Audit AI-Generated PRs Before You Merge Them (Swarm Orchestrator 10.3.0)</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 24 May 2026 20:54:59 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/audit-ai-generated-prs-before-you-merge-them-swarm-orchestrator-1030-3a6e</link>
      <guid>https://dev.to/moonrunnerkc/audit-ai-generated-prs-before-you-merge-them-swarm-orchestrator-1030-3a6e</guid>
      <description>&lt;p&gt;If you let Claude Code, Cursor, Devin, Aider, Copilot, or any other coding agent open PRs against your repo, you already know the problem. The diff looks fine on a fast read. CI is green. You merge it. A week later you find the test that "passed" got deleted, or the error handling is a silent &lt;code&gt;catch {}&lt;/code&gt;, or the "fix" was a comment swap that never touched the bug.&lt;/p&gt;

&lt;p&gt;Swarm Orchestrator looks at those PRs and flags the suspicious bits before you click merge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it is
&lt;/h2&gt;

&lt;p&gt;A CLI and a GitHub Action. Open source. Node 20 or later. You point it at a PR (or a local diff) and it scores the patch against a set of cheat-pattern detectors. It posts a comment back to the PR with what it found and why.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm audit moonrunnerkc/swarm-orchestrator#42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole interface for most people.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;The default detector set has four checks, all aimed at patterns AI agents actually produce on real PRs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;error-swallow&lt;/code&gt;: a new empty or comment-only &lt;code&gt;catch&lt;/code&gt; block in non-test code.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mock-of-hallucination&lt;/code&gt;: a &lt;code&gt;jest.mock&lt;/code&gt; or &lt;code&gt;vi.mock&lt;/code&gt; against a module that doesn't exist anywhere in the repo.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;no-op-fix&lt;/code&gt;: tests changed without source, or source changed without tests, when the diff claims to fix something.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fake-refactor&lt;/code&gt;: an exported symbol renamed in source, with no caller in the diff updated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six more detectors live behind &lt;code&gt;--detectors experimental&lt;/code&gt; for shadow runs. They don't score well enough on real PRs to be on by default, and the README says so.&lt;/p&gt;

&lt;p&gt;Every finding renders with its measured precision number inline, so a reviewer sees how often that detector's flags have turned out to be false positives every time the bot speaks.&lt;/p&gt;

&lt;p&gt;If you need compliance artifacts, &lt;code&gt;--emit-aibom cyclonedx-ml&lt;/code&gt; writes a CycloneDX 1.6 ML-BOM and an SPDX 3.0 AI-Profile per audit. That covers the EU AI Act Annex IV and CISA SBOM-for-AI minimums without bolting on a separate vendor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who it's for
&lt;/h2&gt;

&lt;p&gt;Teams that let AI agents open PRs and want a second pair of eyes that runs in CI, costs nothing per call, and produces a deterministic comment instead of vibes. Also useful for procurement and security folks who need an AI-BOM next to their SBOM and don't want another tool in the chain.&lt;/p&gt;

&lt;p&gt;If you have one developer eyeballing every line of every AI PR by hand, you probably don't need this yet. If you have ten agents pushing diffs to a queue at 2am, you do.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's new in 10.3.0
&lt;/h2&gt;

&lt;p&gt;Four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;no-op-fix&lt;/code&gt; got a v2.0 with a gated LLM judge. The judge is off by default and only fires when you set &lt;code&gt;--enable-llm-judge&lt;/code&gt; (or &lt;code&gt;SWARM_AUDIT_LLM_JUDGE=1&lt;/code&gt;) and have an Anthropic key. Verdicts are content-addressed and cached, so the same diff and title always get the same answer. The model id is pinned in the ledger so replay stays deterministic.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--shadow-output &amp;lt;path&amp;gt;&lt;/code&gt;. One JSON file per audit with detector verdicts, judge call count, and the rendered comment. Drops into a directory you can &lt;code&gt;jq&lt;/code&gt; later. The existing &lt;code&gt;--shadow &amp;lt;repo&amp;gt;&lt;/code&gt; per-repo rollup still works.&lt;/li&gt;
&lt;li&gt;Public leaderboard on GitHub Pages. Fetches the real-corpus score snapshot and renders precision, recall, F1, and a sortable per-detector table. No build step, no CDN, just an HTML page and one JS file: &lt;a href="https://moonrunnerkc.github.io/swarm-orchestrator/docs/leaderboard/" rel="noopener noreferrer"&gt;moonrunnerkc.github.io/swarm-orchestrator/docs/leaderboard/&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Real-corpus headline rescored against the v2.0 detectors. F1 moved from 0.109 (P 0.067, R 0.300) to 0.167 (P 0.100, R 0.500). &lt;code&gt;mock-of-hallucination&lt;/code&gt; picked up two true positives the v1 shape missed.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The honest part
&lt;/h2&gt;

&lt;p&gt;The real-corpus F1 is 0.167 across 205 AI-labeled PRs (10 broken, 195 clean, eight agent vendors). Precision is 0.100. Recall is 0.500.&lt;/p&gt;

&lt;p&gt;That precision number is exactly why the default mode is &lt;code&gt;advise&lt;/code&gt; and not &lt;code&gt;gate&lt;/code&gt;. At a precision of 0.100, roughly nine in ten flags will be false positives. The tool is calibrated to be useful as a reviewer-assist signal, not a merge blocker. If you want it to block, opt in: &lt;code&gt;--mode gate&lt;/code&gt;.&lt;/p&gt;
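&lt;p&gt;For readers who want to check how the headline figures fit together, here is a minimal sketch of the arithmetic. It uses only the precision and recall reported above. The function name &lt;code&gt;f1&lt;/code&gt; is illustrative and is not taken from the Swarm Orchestrator codebase.&lt;/p&gt;

```typescript
// F1 is the harmonic mean of precision and recall.
function f1(precision: number, recall: number): number {
  const sum = precision + recall;
  return sum === 0 ? 0 : (2 * precision * recall) / sum;
}

const precision = 0.1; // reported real-corpus precision
const recall = 0.5;    // reported real-corpus recall

console.log(f1(precision, recall).toFixed(3)); // 0.167

// At precision 0.100, about one flag in ten points at a genuinely broken PR.
console.log(Math.round(1 / precision)); // 10
```

&lt;p&gt;Because the harmonic mean is pulled toward the smaller of the two inputs, the low precision dominates the F1 even though half of the broken PRs are caught.&lt;/p&gt;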

&lt;p&gt;The 205-PR corpus is currently labeled by an AI judge with "pending human review" stamped on every entry. That's the largest credibility hole in the project and the next milestone closes it. The labeling rubric, the kappa script, and the labels-v2 scaffold already live in the repo.&lt;/p&gt;

&lt;p&gt;Don't read this as "ship this into your release gate today." Read it as "here's a tool you can run in shadow mode, look at what it flags, and decide for yourself if those flags are useful."&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
&lt;span class="nb"&gt;cd &lt;/span&gt;swarm-orchestrator
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run build
npm &lt;span class="nb"&gt;link&lt;/span&gt;

&lt;span class="c"&gt;# audit a PR (advisory, never blocks)&lt;/span&gt;
&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;... swarm audit owner/repo#PR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or wire it into a workflow with &lt;code&gt;uses: moonrunnerkc/swarm-orchestrator@main&lt;/code&gt; and &lt;code&gt;audit-mode: true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;https://github.com/moonrunnerkc/swarm-orchestrator&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Leaderboard: &lt;a href="https://moonrunnerkc.github.io/swarm-orchestrator/docs/leaderboard/" rel="noopener noreferrer"&gt;https://moonrunnerkc.github.io/swarm-orchestrator/docs/leaderboard/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Real-corpus score snapshot: &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/blob/main/benchmarks/real-corpus/scores/latest.json" rel="noopener noreferrer"&gt;https://github.com/moonrunnerkc/swarm-orchestrator/blob/main/benchmarks/real-corpus/scores/latest.json&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CycloneDX 1.6 ML-BOM spec: &lt;a href="https://cyclonedx.org/specification/overview/" rel="noopener noreferrer"&gt;https://cyclonedx.org/specification/overview/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SPDX 3.0 AI Profile: &lt;a href="https://spdx.dev/use/specifications/" rel="noopener noreferrer"&gt;https://spdx.dev/use/specifications/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;EU AI Act Annex IV: &lt;a href="https://artificialintelligenceact.eu/annex/4/" rel="noopener noreferrer"&gt;https://artificialintelligenceact.eu/annex/4/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>githubactions</category>
      <category>devops</category>
    </item>
    <item>
      <title>Cryptographic Forensics for AI Coding Agent Sessions</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 20 May 2026 14:58:25 +0000</pubDate>
      <link>https://dev.to/aftermathtech/cryptographic-forensics-for-ai-coding-agent-sessions-2oaa</link>
      <guid>https://dev.to/aftermathtech/cryptographic-forensics-for-ai-coding-agent-sessions-2oaa</guid>
      <description>&lt;p&gt;A Claude Code or Codex CLI session writes a JSONL file to disk. If the agent runs &lt;code&gt;rm -rf&lt;/code&gt; on a training-data directory or &lt;code&gt;terraform destroy -auto-approve&lt;/code&gt; on production, that file is where an incident review starts.&lt;/p&gt;

&lt;p&gt;A JSONL file is not evidence. Anyone with shell access can rewrite it. To a third party who doesn't trust the machine it came from, it proves nothing.&lt;/p&gt;

&lt;p&gt;That gap matters once agents have credentials to real infrastructure. Most agent observability tooling is built for debugging and quality, not for the moment after damage is done. This post is about the three cryptographic properties that turn a transcript into something an auditor or regulator can verify, and how the DEPOSE project wires them together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three properties
&lt;/h2&gt;

&lt;p&gt;Assume the machine that produced the bundle can't be trusted. Three things need to hold at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tamper-evident.&lt;/strong&gt; Any byte change has to be detectable. Hash chain over events: change a byte, replay fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authenticated.&lt;/strong&gt; The record has to be bound to a key the producer controls and publishes a fingerprint for. Ed25519 signatures over a manifest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-backdated.&lt;/strong&gt; A party other than the producer has to anchor the record in time. RFC 3161 tokens from a public TSA.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The primitives are old and well understood. The hard part is wiring them through a normalized event schema and shipping a verifier that doesn't depend on the producer's runtime.&lt;/p&gt;
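&lt;p&gt;The first property is the easiest to show in code. What follows is a minimal sketch of a hash chain over events, not DEPOSE's actual format: the &lt;code&gt;AgentEvent&lt;/code&gt; shape, the &lt;code&gt;genesis&lt;/code&gt; seed, and the &lt;code&gt;rootHash&lt;/code&gt; helper are assumptions made for illustration, and &lt;code&gt;JSON.stringify&lt;/code&gt; stands in for canonical JSON.&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hypothetical event shape; DEPOSE's real schema is not shown in this post.
type AgentEvent = { id: string; command: string };

const sha256 = (text: string): string =>
  createHash("sha256").update(text).digest("hex");

// Each link hashes the previous link plus the serialized event, so the
// final value commits to every event and to the order they occurred in.
function rootHash(events: AgentEvent[]): string {
  let link = sha256("genesis"); // assumed seed value
  for (const event of events) {
    // JSON.stringify stands in for canonical JSON here.
    link = sha256(link + JSON.stringify(event));
  }
  return link;
}

const events: AgentEvent[] = [
  { id: "evt-1", command: "terraform plan" },
  { id: "evt-2", command: "terraform destroy -auto-approve" },
];
const recorded = rootHash(events);

// The same events with the second command quietly edited.
const tampered: AgentEvent[] = [
  { id: "evt-1", command: "terraform plan" },
  { id: "evt-2", command: "terraform destroy" },
];

console.log(rootHash(events) === recorded);   // true
console.log(rootHash(tampered) === recorded); // false
```

&lt;p&gt;A chain like this only makes edits detectable. It says nothing about who produced the record or when, which is why the signature and the timestamp are separate properties.&lt;/p&gt;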

&lt;h2&gt;
  
  
  No LLM in the signed path
&lt;/h2&gt;

&lt;p&gt;Every event is captured at execution time or normalized from the session JSONL, then committed to the hash chain. The human-readable narrative is generated separately, from deterministic Handlebars templates over the signed events. It's excluded from the root hash.&lt;/p&gt;

&lt;p&gt;If generated prose became part of the signed record, verification would depend on model behavior staying stable and reproducible. DEPOSE avoids that dependency. The signed record is event data and hashes. The prose is templated commentary with &lt;code&gt;[#evt-&amp;lt;ulid&amp;gt;]&lt;/code&gt; citations back to the signed events. You can rewrite the narrative without affecting verification. Change an event and verification fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in a bundle
&lt;/h2&gt;

&lt;p&gt;A DEPOSE bundle is a directory, not an opaque archive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incident-01JABC.../
├── manifest.json            bundleId, rootHash, eventsJsonlSha256, sigs, timestamps
├── events.jsonl             every event in canonical JSON, byte-pinned by manifest
├── rules/destructive.yaml   ruleset used at reconstruction time
├── narrative.md / .html     templated prose with per-event citations
├── verify.txt               human-readable verification summary
├── artifacts/               captured file diffs, payloads
├── attestations/            Ed25519 signatures, RFC 3161 timestamp tokens
└── raw/                     source JSONL, shell history, capture records
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change a byte of &lt;code&gt;events.jsonl&lt;/code&gt;, &lt;code&gt;manifest.json&lt;/code&gt;, or &lt;code&gt;rules/destructive.yaml&lt;/code&gt; and verification fails. Canonical JSON follows RFC 8785 (JCS), which is what lets a Go verifier check a TypeScript-produced bundle without either side trusting the other's serializer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two binaries
&lt;/h2&gt;

&lt;p&gt;The producer is TypeScript. The verifier is a separately-built static Go binary, &lt;code&gt;depose-verify&lt;/code&gt;. The separation is deliberate: you hand the binary to whoever needs to check the bundle (auditor, opposing counsel, regulator, a customer's security team) and they run it on their own machine. No producer stack required.&lt;/p&gt;

&lt;p&gt;A passing run prints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;parse        OK
signature    OK
chain-replay OK
artifacts    OK
timestamp    OK
PASS  bundleId=...  rootHash=...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cryptography here is mostly off-the-shelf. The actual engineering work is in normalization: getting Go and Node to serialize identically, getting timing and ordering right across capture sources, deciding what counts as one event versus two. Canonical JSON is the unsexy part. Float formatting, key ordering, unicode escapes: Go and Node have to agree byte-for-byte or the verifier rejects a bundle the producer thinks is fine. That's what the cross-language conformance vectors in &lt;code&gt;tests/conformance/&lt;/code&gt; are for.&lt;/p&gt;
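&lt;p&gt;To make the key-ordering problem concrete, here is a small sketch of a serializer that sorts object keys recursively. It is not a full RFC 8785 implementation and is not taken from DEPOSE: the &lt;code&gt;Json&lt;/code&gt; type and the &lt;code&gt;canonicalize&lt;/code&gt; function are illustrative names, and number and string formatting are simply deferred to &lt;code&gt;JSON.stringify&lt;/code&gt;.&lt;/p&gt;

```typescript
// Illustrative JSON value type; not part of any DEPOSE API.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys in sorted order and no extra whitespace,
// so two objects with the same members always produce the same bytes.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value); // primitives: defer to the built-in serializer
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  const members = Object.keys(value)
    .sort() // default sort compares UTF-16 code units
    .map((key) => JSON.stringify(key) + ":" + canonicalize(value[key] ?? null));
  return "{" + members.join(",") + "}";
}

console.log(canonicalize({ b: 1, a: [true, null] })); // {"a":[true,null],"b":1}

// Insertion order no longer affects the output.
console.log(
  canonicalize({ x: { z: "s", y: 2 } }) === canonicalize({ x: { y: 2, z: "s" } })
); // true
```

&lt;p&gt;Getting one runtime to agree with itself is the easy half. The conformance vectors matter because a second implementation in another language has to produce the same bytes for the same input.&lt;/p&gt;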

&lt;p&gt;Verifiers can pin a producer's expected key fingerprint and consult a revocation list, both at the command line. The RFC 3161 timestamp does double duty here: a bundle stamped before a key is revoked stays time-anchored, so "when was this signed" remains answerable even if the key is later compromised.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capture modes
&lt;/h2&gt;

&lt;p&gt;Two modes, different coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reconstruction&lt;/strong&gt; reads the Claude Code session JSONL after the fact, compares it against shell history (bash, zsh, fish) and git reflog where available, and builds a bundle. This is the lower-bound mode: it can verify integrity after packaging, but it can't prove the original session file was complete before capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active capture&lt;/strong&gt; installs a Claude Code &lt;code&gt;PreToolUse&lt;/code&gt; hook and POSIX shell shims for the binaries that tend to do destructive things: &lt;code&gt;terraform&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;gh&lt;/code&gt;, &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;psql&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt;, &lt;code&gt;railway&lt;/code&gt;, &lt;code&gt;rm&lt;/code&gt;. Records land under &lt;code&gt;~/.depose/captures/&lt;/code&gt; at execution time. A later &lt;code&gt;depose package&lt;/code&gt; merges them with the session JSONL so every covered event has a verified pre-execution intent on record.&lt;/p&gt;

&lt;p&gt;DEPOSE can prove integrity of captured events. It can't prove an uninstrumented system captured everything. An agent that shells out to a binary not in the shim list, or hits an API directly, still shows up in the JSONL but won't have an active-capture record. The coverage matrix is in the repo.&lt;/p&gt;

&lt;p&gt;macOS and Linux only. Windows isn't supported, because the key store relies on POSIX 0600 permissions and the shims are POSIX shell scripts. WSL2 works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release pipeline
&lt;/h2&gt;

&lt;p&gt;Releases ship with SBOMs, provenance attestations, and signed checksums. The specifics: CycloneDX for both halves, SLSA L3 provenance, and &lt;code&gt;SHA256SUMS&lt;/code&gt; signed via cosign keyless. CI rebuilds the two checked-in example bundles (an &lt;code&gt;rm -rf&lt;/code&gt; on training data, a &lt;code&gt;terraform destroy&lt;/code&gt; on infrastructure) on every push and runs three semantic tamper rejections to confirm the verifier fails closed.&lt;/p&gt;




&lt;p&gt;Right now most coding-agent session logs are treated like disposable debug output. That assumption gets weaker the moment an agent can modify infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Aftermath-Technologies-Ltd/depose" rel="noopener noreferrer"&gt;https://github.com/Aftermath-Technologies-Ltd/depose&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Gemma.Witness - Offline Multimodal Evidence Capture with Gemma 4</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 17 May 2026 03:43:46 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/gemmawitness-offline-multimodal-evidence-capture-with-gemma-4-2d53</link>
      <guid>https://dev.to/moonrunnerkc/gemmawitness-offline-multimodal-evidence-capture-with-gemma-4-2d53</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Gemma.Witness is an offline-first multimodal evidence capture system built for environments where cloud access, trust, or chain-of-custody assumptions fail.&lt;/p&gt;

&lt;p&gt;The system records audio alongside supporting images, runs local multimodal analysis through Gemma 4, and produces a signed evidence bundle containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured incident reports&lt;/li&gt;
&lt;li&gt;Timestamped evidence metadata&lt;/li&gt;
&lt;li&gt;Local reasoning traces&lt;/li&gt;
&lt;li&gt;Hash-linked verification artifacts&lt;/li&gt;
&lt;li&gt;Exportable forensic bundles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The focus was reliability and local verification instead of "AI assistant" behavior.&lt;/p&gt;

&lt;p&gt;Most evidence tooling today assumes internet access, centralized APIs, or mutable storage. Gemma.Witness was designed around the opposite assumption: the network may be unavailable, the machine may be isolated, and every generated output may eventually need independent verification.&lt;/p&gt;

&lt;p&gt;The application runs fully local through a desktop interface using Rust, Tauri, and local inference orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/moonrunnerkc/gemma-witness" rel="noopener noreferrer"&gt;https://github.com/moonrunnerkc/gemma-witness&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Source code is available at the repository above.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is the reasoning layer behind the entire evidence pipeline.&lt;/p&gt;

&lt;p&gt;I used Gemma 4's multimodal capabilities to process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audio-derived transcripts&lt;/li&gt;
&lt;li&gt;Scene images&lt;/li&gt;
&lt;li&gt;Cross-evidence consistency analysis&lt;/li&gt;
&lt;li&gt;Structured incident extraction&lt;/li&gt;
&lt;li&gt;Reasoning trace generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is used in a multi-pass workflow instead of a single prompt-response cycle. Each pass validates or expands on the previous stage before the final signed bundle is emitted.&lt;/p&gt;

&lt;p&gt;This matters because evidence systems fail quietly when models hallucinate details, merge assumptions into facts, or overstate certainty. The pipeline was intentionally designed to separate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw observations&lt;/li&gt;
&lt;li&gt;Inferred conclusions&lt;/li&gt;
&lt;li&gt;Confidence scoring&lt;/li&gt;
&lt;li&gt;Verifiable artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gemma 4 was a strong fit because it could operate locally while still handling multimodal reasoning tasks without requiring cloud APIs or external orchestration services.&lt;/p&gt;

&lt;p&gt;The project prioritizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offline operation&lt;/li&gt;
&lt;li&gt;Verifiable outputs&lt;/li&gt;
&lt;li&gt;Local ownership of evidence&lt;/li&gt;
&lt;li&gt;Minimal trust assumptions&lt;/li&gt;
&lt;li&gt;Reproducible forensic artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Surprisingly, getting the model to generate reports was not the challenge. That part was easy.&lt;/p&gt;

&lt;p&gt;The difficult part was building guardrails around evidence integrity so the system does not quietly become a very confident fiction generator wearing a necktie.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Gemma 4&lt;/li&gt;
&lt;li&gt;Rust&lt;/li&gt;
&lt;li&gt;Tauri&lt;/li&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;Local multimodal inference&lt;/li&gt;
&lt;li&gt;Cryptographic hashing and bundle verification&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/gemma-witness" rel="noopener noreferrer"&gt;
        gemma-witness
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Offline multimodal evidence capture that emits a signed, locally verifiable .witness bundle. Tauri + Rust + Gemma 4 + Ed25519. Static HTML verifier runs with no server.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/gemma-witness/docs/cover.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fgemma-witness%2FHEAD%2Fdocs%2Fcover.svg" alt="Gemma.Witness: offline, multimodal, tamper-evident evidence capture" width="100%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Gemma.Witness&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;
  Offline, tamper-evident evidence capture for field journalism. Signed in your hand, verified in a browser, with no server in the loop.
&lt;/p&gt;





&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why this matters&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;A reporter is working in a country where journalists are detained for their reporting. She records a witness account. She attaches the photos she just took. She seals the file before she leaves the room.&lt;/p&gt;

&lt;p&gt;A week later, an editor on another continent opens a single static HTML page in any browser and drags the file in. Three checks turn green:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the signature comes from the reporter's device&lt;/li&gt;
&lt;li&gt;the audio and the photos have not been altered by a single byte&lt;/li&gt;
&lt;li&gt;the AI model in the chain is bit-for-bit the published Gemma model her manifest names, by &lt;code&gt;model_id&lt;/code&gt;, &lt;code&gt;revision&lt;/code&gt;, and &lt;code&gt;model.safetensors&lt;/code&gt; SHA-256&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/gemma-witness" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


</description>
      <category>gemma</category>
      <category>ai</category>
      <category>rust</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Swarm Orchestrator v8.0.2</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Tue, 12 May 2026 02:31:18 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/swarm-orchestrator-v802-pf</link>
      <guid>https://dev.to/moonrunnerkc/swarm-orchestrator-v802-pf</guid>
      <description>&lt;p&gt;v8.0.2 is out now and it cleans up several rough edges that kept showing up under heavy tournament and falsification workloads.&lt;/p&gt;

&lt;p&gt;The biggest operational change is that all four previously documented architectural limitations are now closed in the same release (7b68867).&lt;/p&gt;

&lt;h2&gt;
  
  
  Notable Changes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tournament mode now streams through the same pipeline as single mode. If one candidate fails streaming verification, it gets aborted independently instead of poisoning the whole run.&lt;/li&gt;
&lt;li&gt;Live cost-cap enforcement is now real-time. Concurrent streams project cumulative USD usage continuously and abort the moment projected spend crosses the configured cap.&lt;/li&gt;
&lt;li&gt;Snapshot cleanup is automatic now and supports retention policies like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  retain-last:N
  max-age:&amp;lt;dur&amp;gt;
  max-disk:&amp;lt;sz&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Adaptive falsifier dispatch using UCB1 is available behind:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nt"&gt;--falsifier-scheduler&lt;/span&gt; ucb1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
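
&lt;p&gt;For readers who haven't met UCB1: it is a standard bandit rule that scores each option by its average reward plus an exploration bonus, &lt;code&gt;mean + sqrt(2 * ln(totalPulls) / pulls)&lt;/code&gt;, and picks the highest score. The block below is a minimal sketch of that rule, not the project's scheduler. The &lt;code&gt;ArmStats&lt;/code&gt; shape, the adapter names, and the idea of one reward per falsifier run are all assumptions made for illustration.&lt;/p&gt;

```typescript
// Hypothetical per-adapter counters; names are illustrative, not the project's schema.
type ArmStats = { name: string; pulls: number; totalReward: number };

// UCB1 score: empirical mean reward plus an exploration bonus that shrinks
// as an arm is tried more often.
function ucb1Score(arm: ArmStats, totalPulls: number): number {
  if (arm.pulls === 0) return Number.POSITIVE_INFINITY; // try every arm once first
  const mean = arm.totalReward / arm.pulls;
  return mean + Math.sqrt((2 * Math.log(totalPulls)) / arm.pulls);
}

// Pick the arm with the highest score; ties go to the earliest arm, so the
// choice is deterministic for a given set of counters.
function pickArm(arms: ArmStats[]): ArmStats {
  if (arms.length === 0) throw new Error("no arms to choose from");
  const totalPulls = arms.reduce((sum, arm) => sum + arm.pulls, 0);
  let best = arms[0];
  let bestScore = ucb1Score(best, totalPulls);
  for (const arm of arms.slice(1)) {
    const score = ucb1Score(arm, totalPulls);
    if (score > bestScore) {
      best = arm;
      bestScore = score;
    }
  }
  return best;
}

// Assumed sample counters: two adapters with history, one never tried.
const arms: ArmStats[] = [
  { name: "codex", pulls: 10, totalReward: 6 },
  { name: "copilot", pulls: 2, totalReward: 1 },
  { name: "claude-code", pulls: 0, totalReward: 0 },
];
console.log(pickArm(arms).name);
```

&lt;p&gt;With these sample counters the untried arm is chosen first. Drop it and the rarely used arm wins over the heavily used one, because its exploration bonus is larger.&lt;/p&gt;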



&lt;ul&gt;
&lt;li&gt;ARIES-style rollback support landed for falsified obligations. If a counter-example appears after apply, the workspace restores from the pre-apply snapshot and verifies the rollback by hashing the restored bytes against the original SHA.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  New Command
&lt;/h2&gt;

&lt;p&gt;There is also a new command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm v8 stats &amp;lt;run-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That surfaces persisted falsifier metrics directly from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;.swarm/falsifier-stats.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;including regression discoveries, false positives, success counts, and latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replay Determinism
&lt;/h2&gt;

&lt;p&gt;One important detail: replay determinism remains intact across all of this. Every scheduler decision and abort event still lands in the ledger so replay reproduces the same winner consistently. That part was non-negotiable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/" rel="noopener noreferrer"&gt;https://github.com/moonrunnerkc/swarm-orchestrator/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>How Swarm Orchestrator v8 Tries to Break Its Own AI Patches</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 10 May 2026 02:10:05 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/how-swarm-orchestrator-v8-tries-to-break-its-own-ai-patches-2513</link>
      <guid>https://dev.to/moonrunnerkc/how-swarm-orchestrator-v8-tries-to-break-its-own-ai-patches-2513</guid>
      <description>&lt;p&gt;Most AI coding tools commit when their own checks pass. Swarm Orchestrator v8 adds a second adversarial layer: independent falsifier adapters that try to break each patch before it merges. v8.0.1 is on &lt;code&gt;main&lt;/code&gt; with that subsystem on by default.&lt;/p&gt;

&lt;p&gt;This post walks through the v8 architecture, the four verification points, the producer/falsifier adapter split, and the limitations that haven't been solved in v8.0 yet.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;What is Swarm Orchestrator?&lt;/strong&gt; A contract-first AI coding swarm with hash-chained evidence and verifier-gated commits. It compiles a natural-language goal into a typed contract, dispatches it to a population of personas inside one cached Anthropic session, races candidate diffs per obligation, and commits only what passes verification. It wraps an LLM; it doesn't replace one.&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  The shape of a run
&lt;/h2&gt;

&lt;p&gt;You hand it a goal in plain English. The contract compiler turns that into &lt;code&gt;contract.jsonl&lt;/code&gt; plus a &lt;code&gt;manifest.json&lt;/code&gt; carrying the goal, repo context, extractor provenance, and a SHA-256 of the canonical contract bytes. Identical inputs produce identical contract hashes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;goal (text)
   |
   v
contract compiler  -&amp;gt;  contract.jsonl + manifest.json
   |
   v
+-------------------------------------------------+
|        population manager (single session)      |
|                                                 |
|  ledger (jsonl, hash-chain) &amp;lt;- personas (8)     |
|       ^                          |              |
|       | tournament + verifier scoring           |
|       |                                         |
|  WASM deterministic floor (zero-LLM obligs)     |
+-------------------------------------------------+
   |                              |
   v                              v
streaming verifier      post-merge integration
   |                              |
   +--------------+---------------+
                  v
       falsifier adapters (Codex, Copilot)
                  |
                  v
            committed diffs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The population manager opens one cached Anthropic session and walks each obligation. It picks the persona whose trigger predicate matches the obligation type. In tournament mode, N candidates run in parallel; the verifier scores them, the top scorer is a commit candidate, and losers get logged but never merge.&lt;/p&gt;
&lt;h2&gt;
  
  
  Two adapter subsystems
&lt;/h2&gt;

&lt;p&gt;The most common confusion in v6 was treating the coding CLIs and the falsifiers as one thing. v8 splits them cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producer adapters&lt;/strong&gt; (&lt;code&gt;src/adapters/&lt;/code&gt;) wrap third-party coding CLIs as the worker in the v6 verified-branch pipeline. Backends: Copilot, Claude Code, Codex, Claude Code Teams. All four are opt-in via &lt;code&gt;swarm run --v6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falsifier adapters&lt;/strong&gt; (&lt;code&gt;src/falsification/adapters/&lt;/code&gt;) take a patch the producer's verifier already accepted and try to falsify the obligation by surfacing a counter-example, regression fixture, or property-violation trace. A confirmed counter-example flips the obligation back to &lt;code&gt;failed&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Falsifier&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Obligation types&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CodexFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;&lt;code&gt;property-must-hold&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CopilotFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;import-graph-must-satisfy&lt;/code&gt;, &lt;code&gt;function-must-have-signature&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ClaudeCodeFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;off (per-adapter opt-in)&lt;/td&gt;
&lt;td&gt;all three&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CLI surface is one flag: &lt;code&gt;--falsifiers &amp;lt;on|off&amp;gt;&lt;/code&gt; (default on). Per-adapter selection happens at the API layer via &lt;code&gt;defaultAdapterRegistry({ includeCopilot, includeClaudeCode })&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Four verification points
&lt;/h2&gt;

&lt;p&gt;A patch has to survive these before it merges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-generation memoization.&lt;/strong&gt; Skip generation if the obligation result is already cached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-stream abort.&lt;/strong&gt; During generation, the streaming verifier can abort the call. Works in &lt;code&gt;--mode single&lt;/code&gt; only; tournament mode skips it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-generation per-obligation verifier.&lt;/strong&gt; Scores the candidate diff. In tournament mode, top scorer wins; in single mode it's pass/fail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-merge integration check.&lt;/strong&gt; After the diff lands, the integration check confirms the broader system still holds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The architectural rule from the README: nothing commits without passing the obligation's verifier. Then the falsifiers get a shot.&lt;/p&gt;
&lt;h2&gt;
  
  
  The hash-chained ledger
&lt;/h2&gt;

&lt;p&gt;Every action lands in &lt;code&gt;.swarm/ledger/&amp;lt;run-id&amp;gt;.jsonl&lt;/code&gt; with the SHA of the prior entry. Tampering is detectable; runs resume from any prior state. If a process is killed mid-run, &lt;code&gt;swarm v8 resume &amp;lt;run-id&amp;gt;&lt;/code&gt; walks the ledger and picks up where it left off.&lt;/p&gt;

&lt;p&gt;The ledger format is shared with v6, but v8 writes more granular events (per-persona dispatch, per-candidate score, falsifier verdict) so a run can be replayed or audited end-to-end.&lt;/p&gt;
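
&lt;p&gt;The chaining idea fits in a few lines. What follows is a minimal sketch, assuming each entry stores the previous entry's SHA-256 and a hash over its own &lt;code&gt;prevHash&lt;/code&gt; plus event text. The &lt;code&gt;LedgerEntry&lt;/code&gt; shape, the genesis value, and the helper names are hypothetical and are not the project's ledger schema.&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; the real ledger's fields are not shown in this post.
type LedgerEntry = { prevHash: string; event: string; hash: string };

// Assumed placeholder "previous hash" for the very first entry.
const GENESIS = "0".repeat(64);

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Append an event, linking it to the hash of the previous entry.
function appendEntry(ledger: LedgerEntry[], event: string): LedgerEntry[] {
  const prevHash = ledger.length === 0 ? GENESIS : ledger[ledger.length - 1].hash;
  const hash = sha256Hex(prevHash + "\n" + event);
  return [...ledger, { prevHash, event, hash }];
}

// Return the index of the first entry that breaks the chain, or -1 if intact.
function firstBrokenIndex(ledger: LedgerEntry[]): number {
  return ledger.findIndex((entry, index) => {
    const expectedPrev = index === 0 ? GENESIS : ledger[index - 1].hash;
    const recomputed = sha256Hex(entry.prevHash + "\n" + entry.event);
    return entry.prevHash !== expectedPrev || entry.hash !== recomputed;
  });
}

let ledger: LedgerEntry[] = [];
ledger = appendEntry(ledger, JSON.stringify({ type: "dispatch", persona: "p1" }));
ledger = appendEntry(ledger, JSON.stringify({ type: "score", value: 0.9 }));

// Rewrite the first event without recomputing its hash.
const tampered = ledger.map((entry, index) =>
  index === 0 ? { ...entry, event: "{}" } : entry
);

console.log(firstBrokenIndex(ledger));   // -1: chain intact
console.log(firstBrokenIndex(tampered)); // 0: first entry no longer matches its hash
```

&lt;p&gt;Because every hash covers the previous one, editing an old entry means recomputing every later hash as well, which is what makes tampering detectable.&lt;/p&gt;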
&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
&lt;span class="nb"&gt;cd &lt;/span&gt;swarm-orchestrator &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;link&lt;/span&gt;

&lt;span class="c"&gt;# Compile a goal, then run it&lt;/span&gt;
swarm v8 compile &lt;span class="s2"&gt;"add a /health endpoint that returns 200 OK"&lt;/span&gt; &lt;span class="nt"&gt;--yes&lt;/span&gt;
swarm v8 run .swarm/contracts/&amp;lt;contract-id&amp;gt;

&lt;span class="c"&gt;# Or both in one step (defaults to v8)&lt;/span&gt;
swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"add a /health endpoint that returns 200 OK"&lt;/span&gt;

&lt;span class="c"&gt;# Resume a killed run&lt;/span&gt;
swarm v8 resume &amp;lt;run-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Node &amp;gt;= 20, git &amp;gt;= 2.40, and &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;. Pass &lt;code&gt;--extractor stub --session stub&lt;/code&gt; to run offline.&lt;/p&gt;

&lt;p&gt;There's also a GitHub Action:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/swarm-orchestrator@v8&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;endpoint"&lt;/span&gt;
    &lt;span class="na"&gt;contract-only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;cost-cap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.00"&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  What v8.0 doesn't do
&lt;/h2&gt;

&lt;p&gt;Limitations worth reading before adopting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tournament mode doesn't stream.&lt;/strong&gt; &lt;code&gt;--mode tournament&lt;/code&gt; plus &lt;code&gt;--forbid-import&lt;/code&gt; skips the streaming abort; streaming verification is &lt;code&gt;--mode single&lt;/code&gt; only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-merge failure doesn't auto-rollback.&lt;/strong&gt; The run is marked failed; per-obligation worktree snapshots are post-v8.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--cost-cap&lt;/code&gt; is enforced post-obligation, not mid-call.&lt;/strong&gt; Cumulative spend is checked at the end of each obligation against estimated Sonnet 4 pricing. Exit code 6 if exceeded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandit dispatch is not built (Phase 5).&lt;/strong&gt; Codex and Copilot have disjoint obligation types, so there's nothing to arbitrate between yet.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-vendor producer race is deferred (Phase 6).&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full list with rationale lives in &lt;code&gt;docs/v8-architecture-deviations.md&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Contract-first AI coding swarm with hash-chained evidence. Compiles a goal into typed obligations, races persona candidates per obligation in a single cached inference session, verifies before commit, and logs every action in an append-only ledger.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/assets/header.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fassets%2Fheader.svg" alt="Swarm Orchestrator" width="100%"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Contract-first AI coding swarm with hash-chained evidence and verifier-gated commits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fc208599ef300dfbb7d7b65c32d4e1364b62c8c0bd3cc6df8a16615f7ccd9991/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4953432d626c75653f7374796c653d666c61742d737175617265" alt="License"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/37e12b341829a2c53b69b36b6fe5a9a4f42cf56b82722fac6b5011085a3749e6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d25334525334432302d3333393933333f7374796c653d666c61742d737175617265" alt="Node"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3f31914b57bc82fa5dcbe2b429e1d486362f11bfa4411282ff311f2885102e19/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f616374696f6e732f776f726b666c6f772f7374617475732f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f722f63692e796d6c3f6272616e63683d6d61696e266c6162656c3d6369267374796c653d666c61742d737175617265" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/4aea3e72d83f2dd34ce19e0393e3a10766f325d0d6356fc71c7c190676acf5e2/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f7061636b6167652d6a736f6e2f762f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f723f7374796c653d666c61742d737175617265" alt="Version"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;swarm&lt;/code&gt; compiles a natural-language goal into a typed contract, dispatches it to a
population of personas inside one cached Anthropic session, races candidate diffs per
obligation, and commits only the diffs that pass verification. After the producer's
verifier accepts a patch, registered falsifier adapters get a chance to break it
before it merges. Every action lands in an append-only hash-chained ledger you can
audit, resume, or replay.&lt;/p&gt;
&lt;p&gt;It wraps an LLM; it does not replace one. The model writes the code, the orchestrator
decides what reaches your repo.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Status&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Version &lt;code&gt;8.0.1&lt;/code&gt; on &lt;code&gt;main&lt;/code&gt;. Node &lt;code&gt;&amp;gt;= 20&lt;/code&gt; (CI matrix: 20, 22). License ISC. The v8
architecture is the default for &lt;code&gt;swarm run&lt;/code&gt;; the v6 verified-branch pipeline is
preserved under &lt;code&gt;swarm run --v6&lt;/code&gt; and the &lt;code&gt;swarm swarm&lt;/code&gt; / &lt;code&gt;swarm execute&lt;/code&gt; commands.
Falsifier subsystem: Codex on, Copilot on, ClaudeCode…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How to Write a CLAUDE.md Rule That Actually Gets Enforced</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 10 May 2026 02:06:45 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/how-to-write-a-claudemd-rule-that-actually-gets-enforced-3npa</link>
      <guid>https://dev.to/moonrunnerkc/how-to-write-a-claudemd-rule-that-actually-gets-enforced-3npa</guid>
      <description>&lt;p&gt;Open a CLAUDE.md file at random and you'll find build commands, architecture notes, and rules. The rules tend to be the unenforceable kind. "Write clean code." "Be careful with types." "Follow our conventions." The author meant every word. The agent reads them. And nothing checks whether the agent followed them, because nothing can.&lt;/p&gt;

&lt;p&gt;In a corpus of 580 CLAUDE.md, AGENTS.md, and &lt;code&gt;.cursorrules&lt;/code&gt; files from public GitHub repos with 10+ stars, &lt;strong&gt;74% contained zero machine-extractable rules&lt;/strong&gt;. Not because the authors didn't care about rules. Because most rules were written in a form no parser could pull out as a deterministic check.&lt;/p&gt;

&lt;p&gt;This post is about the difference. Specifically: how to phrase a rule so a parser can extract it and a verifier can check it, without sacrificing what you actually meant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;Enforceability comes from a verifiable surface. A rule is verifiable when there's a concrete pattern in code that either matches it or doesn't. "Use &lt;code&gt;camelCase&lt;/code&gt; for function names" is verifiable: read the AST, list the function names, check the casing. "Name things consistently" is not: there's no concrete pattern to check, only a judgment to make.&lt;/p&gt;

&lt;p&gt;The gap between the two is the gap between intent and enforcement. You meant the same thing in both cases. But only one of them survives translation into a check.&lt;/p&gt;
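
&lt;p&gt;To make "verifiable surface" concrete, here is a minimal sketch of the casing check. It assumes the function names have already been pulled out of the AST, and it uses one common reading of camelCase (lowercase first letter, then letters and digits). The function name and the regex are illustrative choices, not any tool's actual matcher.&lt;/p&gt;

```typescript
// One common reading of camelCase: lowercase first letter, then letters and digits only.
const CAMEL_CASE = /^[a-z][a-zA-Z0-9]*$/;

// Return the names that break the rule; an empty result means the rule holds.
function nonCamelCaseNames(functionNames: string[]): string[] {
  return functionNames.filter((name) => !CAMEL_CASE.test(name));
}

// Assumed sample input, as an AST pass might list it.
const violations = nonCamelCaseNames(["parseConfig", "load_user", "RenderPage", "toJSON"]);
console.log(violations); // load_user and RenderPage are flagged
```

&lt;p&gt;There is no equivalent function for "name things consistently": nothing in that sentence says what to put in the regex.&lt;/p&gt;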

&lt;p&gt;Here's the heuristic I use: &lt;strong&gt;could a junior engineer with no context mechanically check whether code follows this rule, just by reading the rule and looking at the code?&lt;/strong&gt; If yes, the rule is enforceable. If they'd have to ask "what does 'consistent' mean here?", it isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three pairs, walked through
&lt;/h2&gt;

&lt;p&gt;Take a few common intents and look at how they fail or succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type safety.&lt;/strong&gt; You want strong typing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Be careful with types.
Good: No `any` types in `src/`. Async functions require explicit return types.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bad version has no surface. "Careful" isn't a check. The good version has two: a forbidden token (&lt;code&gt;any&lt;/code&gt;) and a structural property (return type annotation on async function declarations). Both check directly against the AST.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Module structure.&lt;/strong&gt; You want predictable imports and exports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Prefer clean module structure.
Good: Named exports only. No default exports. Filenames in kebab-case.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Clean" is meaningless to a parser. Named-only and kebab-case are both binary properties of code that exist or don't. The first version sounds like more guidance because it's broader, but breadth is the problem: it covers everything and enforces nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preferences over alternatives.&lt;/strong&gt; You want React functional components, not class components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Write modern React.
Good: Prefer functional components over class components.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Modern" is a moving target with no fixed surface. The "prefer X over Y" pattern, on the other hand, has a clean check: count instances of each, compute a ratio, score against a threshold. This is one of the most useful patterns in instruction files because it captures real-world preference (not absolute prohibition) in a measurable way.&lt;/p&gt;
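
&lt;p&gt;The ratio check is small enough to sketch. This assumes some earlier pass has already counted instances of each form; the &lt;code&gt;PreferenceCounts&lt;/code&gt; type, the function names, and the 0.8 threshold are hypothetical values chosen for illustration.&lt;/p&gt;

```typescript
// Hypothetical counts, as an AST pass might produce them.
type PreferenceCounts = { preferred: number; discouraged: number };

// Compliance is the share of instances that use the preferred form.
// With no instances at all there is nothing to violate, so treat it as 1.
function complianceRatio(counts: PreferenceCounts): number {
  const total = counts.preferred + counts.discouraged;
  return total === 0 ? 1 : counts.preferred / total;
}

// A rule passes when compliance reaches the configured threshold.
function meetsThreshold(counts: PreferenceCounts, threshold: number): boolean {
  return complianceRatio(counts) >= threshold;
}

// Assumed sample: 18 functional components, 2 class components.
const components: PreferenceCounts = { preferred: 18, discouraged: 2 };
console.log(complianceRatio(components));     // 0.9
console.log(meetsThreshold(components, 0.8)); // true
```

&lt;p&gt;A threshold below 1 is what separates a preference from a prohibition: two legacy class components don't fail the rule, but a codebase drifting toward them eventually will.&lt;/p&gt;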

&lt;h2&gt;
  
  
  The reference table
&lt;/h2&gt;

&lt;p&gt;Twelve common intents, paired:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;Unenforceable&lt;/th&gt;
&lt;th&gt;Enforceable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naming functions&lt;/td&gt;
&lt;td&gt;Name things consistently&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;camelCase&lt;/code&gt; for function names&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filenames&lt;/td&gt;
&lt;td&gt;Pick reasonable filenames&lt;/td&gt;
&lt;td&gt;All filenames in kebab-case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type safety&lt;/td&gt;
&lt;td&gt;Be careful with types&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;any&lt;/code&gt; types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async return types&lt;/td&gt;
&lt;td&gt;Make types clear&lt;/td&gt;
&lt;td&gt;Async functions require explicit return types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Module exports&lt;/td&gt;
&lt;td&gt;Prefer clean module structure&lt;/td&gt;
&lt;td&gt;Named exports only, no default exports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size&lt;/td&gt;
&lt;td&gt;Keep files manageable&lt;/td&gt;
&lt;td&gt;Maximum 300 lines per file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;Be mindful of logging&lt;/td&gt;
&lt;td&gt;Never use &lt;code&gt;console.log&lt;/code&gt;; use &lt;code&gt;src/logger.ts&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Component style&lt;/td&gt;
&lt;td&gt;Write modern React&lt;/td&gt;
&lt;td&gt;Prefer functional components over class components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package manager&lt;/td&gt;
&lt;td&gt;Use the right package manager&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;pnpm&lt;/code&gt;, not &lt;code&gt;npm&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test files&lt;/td&gt;
&lt;td&gt;Keep tests organized&lt;/td&gt;
&lt;td&gt;All test files end with &lt;code&gt;.test.ts&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error handling&lt;/td&gt;
&lt;td&gt;Handle errors properly&lt;/td&gt;
&lt;td&gt;Async functions must use &lt;code&gt;try/catch&lt;/code&gt; or return a &lt;code&gt;Result&lt;/code&gt; type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit format&lt;/td&gt;
&lt;td&gt;Write clear commit messages&lt;/td&gt;
&lt;td&gt;Use conventional commits (&lt;code&gt;feat:&lt;/code&gt;, &lt;code&gt;fix:&lt;/code&gt;, &lt;code&gt;chore:&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every right-hand cell points at a concrete check: a token, a casing rule, a count, a file pattern, a configured tool. Every left-hand cell points at a judgment.&lt;/p&gt;

&lt;p&gt;Real-world rules usually carry scope: "no &lt;code&gt;any&lt;/code&gt; in &lt;code&gt;src/&lt;/code&gt;," "named exports outside &lt;code&gt;index.ts&lt;/code&gt; files," "no &lt;code&gt;console.log&lt;/code&gt; in production code paths." Scope makes a rule narrower and more accurate without making it less enforceable. The interop layer that genuinely needs &lt;code&gt;any&lt;/code&gt; keeps it; the rest of the codebase doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kinds of checks exist
&lt;/h2&gt;

&lt;p&gt;Worth knowing what's available, because it shapes what's writable. Static analysis tools targeting AI instruction files generally support a few classes of check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AST-level&lt;/strong&gt;: function names, type annotations, import patterns, forbidden tokens, structural properties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filesystem&lt;/strong&gt;: file existence, naming conventions, directory layout, file size limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex&lt;/strong&gt;: literal strings, content patterns, conventional formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling&lt;/strong&gt;: presence and configuration of linters, formatters, package managers, test runners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config-file&lt;/strong&gt;: contents of &lt;code&gt;.eslintrc&lt;/code&gt;, &lt;code&gt;tsconfig.json&lt;/code&gt;, &lt;code&gt;.prettierrc&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git-history&lt;/strong&gt;: commit message formats, branch naming conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preference ratios&lt;/strong&gt;: "prefer X over Y" with a compliance percentage instead of a binary verdict&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your rule maps to one of these classes, it's enforceable. If it doesn't, it isn't. The trick when writing instruction files is to keep that map in mind: when you're about to write "be careful with X", ask which of these classes "carefulness with X" lives in. Usually the answer points at a concrete reformulation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The unenforceable rules aren't worthless
&lt;/h2&gt;

&lt;p&gt;Here's a real tension: most of what makes a CLAUDE.md useful isn't enforceable at all. Project context (what the repo does, where the architecture lives), agent behavior directives (be succinct, ask before deleting, don't touch &lt;code&gt;/legacy&lt;/code&gt;), and onboarding instructions are all valuable. None of them extract as rules.&lt;/p&gt;

&lt;p&gt;Don't try to make them enforceable. They're a different kind of content with a different purpose. Project context grounds the agent. Behavior directives shape its style. Neither is supposed to be checked against output; they're checked against the agent's process, which is a different problem.&lt;/p&gt;

&lt;p&gt;The mistake worth avoiding is letting unenforceable prose crowd out enforceable rules. Anthropic's Claude Code best practices doc recommends deleting any instruction the model already follows correctly without it. Most "write clean code" style rules fail that test: the model already does its version of clean code, so the line is taking up attention budget your agent could be spending on the specific, verifiable rules that actually distinguish your codebase from a generic project. Cut what the model already does. Keep the checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  A test for your own files
&lt;/h2&gt;

&lt;p&gt;Pull up your CLAUDE.md or AGENTS.md right now. For each line that looks like a rule, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Could a junior engineer check this without asking clarifying questions?&lt;/li&gt;
&lt;li&gt;Does it name a specific pattern, file, token, casing, or value?&lt;/li&gt;
&lt;li&gt;Would 5 different reviewers all agree on whether a piece of code passes this rule?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a rule fails 1 or 2, it's not a rule, it's a wish. If it fails 3, it's ambiguous. Rewrite or delete.&lt;/p&gt;

&lt;p&gt;If you want a mechanical version of this test, &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;RuleProbe&lt;/a&gt; parses CLAUDE.md, AGENTS.md, &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;.windsurfrules&lt;/code&gt;, GEMINI.md, and &lt;code&gt;copilot-instructions.md&lt;/code&gt; against 102 matchers and tells you which lines extracted as rules and which didn't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; ruleprobe
ruleprobe parse ./CLAUDE.md &lt;span class="nt"&gt;--show-unparseable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--show-unparseable&lt;/code&gt; flag is the interesting one. It surfaces every line that looked rule-shaped but didn't map to a check. That list is your rewrite queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/ruleprobe" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;RuleProbe on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What this leaves out
&lt;/h2&gt;

&lt;p&gt;The hardest case is rules like "follow the existing error handling pattern in this codebase." That's enforceable in principle (compare new code's structural shape against the codebase's dominant pattern), but not by simple AST or regex matching. It needs codebase-aware analysis. Some tools handle that; most don't. If you find yourself writing those kinds of rules, know that they'll either need a tool that does pattern profiling or they'll stay aspirational.&lt;/p&gt;

&lt;p&gt;The other thing enforceability doesn't catch: an agent that follows every rule and still writes broken code. Static rules reduce variance, they don't eliminate it. A function with &lt;code&gt;any&lt;/code&gt; removed and an explicit return type can still have wrong logic. Treat passing rule checks as a floor, not a ceiling.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>I Dropped Multi-Agent Coordination for a 5-Layer Falsification Battery</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sat, 02 May 2026 15:00:11 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/i-dropped-multi-agent-coordination-for-a-5-layer-falsification-battery-48cb</link>
      <guid>https://dev.to/moonrunnerkc/i-dropped-multi-agent-coordination-for-a-5-layer-falsification-battery-48cb</guid>
      <description>&lt;p&gt;Swarm Orchestrator just lost its swarm. Dropped the multi-agent parallel coordination layer. Running one agent now and putting all the weight on a five-layer post-merge falsification battery instead.&lt;/p&gt;

&lt;p&gt;This is an experiment, not an endpoint. v8 will bring proper multi-agent swarming back. The reason for cutting it temporarily: I want to know whether the value I was getting from coordinated parallel agents was the coordination itself, or the verification pressure that coordination produced. Easier to measure with one variable. Intended side effect: cost reduction, since the previous architecture spun up multiple CLI agent instances per run. Real benchmarks pending.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;TL;DR&lt;/strong&gt;: every patch survives a five-layer post-merge battery before the orchestrator declares success. Layers 1 and 2 are hard gates. Layers 3, 4, 5 are advisory and feed a composite score. Hard-gate failure throws before attestation, before final gates, before any external success signal.&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  Pipeline Order
&lt;/h2&gt;

&lt;p&gt;The battery runs once per orchestrator execution against the merged working tree, not per-step branches. The per-step verifier is a separate component. Layers fire in fixed order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Differential gate (hard)&lt;/li&gt;
&lt;li&gt;Mutation gate (hard)&lt;/li&gt;
&lt;li&gt;Cheat detector (advisory)&lt;/li&gt;
&lt;li&gt;Property gate (advisory)&lt;/li&gt;
&lt;li&gt;Attestation (advisory on first run, signed after)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If either hard gate fails, the composite is forced to &lt;code&gt;0&lt;/code&gt; and the orchestrator throws &lt;code&gt;falsification battery blocked the patch&lt;/code&gt; before any external success signal can fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Differential Gate (Hard)
&lt;/h2&gt;

&lt;p&gt;Before any agent touches the repo, a synthesizer generates a regression test against the goal. Layer 1 then runs that test in two detached worktrees: one at the base commit, one at the patch commit.&lt;/p&gt;

&lt;p&gt;The contract: the test must fail at base and pass at patch.&lt;/p&gt;

&lt;p&gt;If the test passes at base, the layer returns &lt;code&gt;INVALID_TEST&lt;/code&gt;. This catches the specific failure mode where an agent writes a tautological test that passes against any code. Without this gate, that pattern slips through every other check downstream.&lt;/p&gt;
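&lt;p&gt;The fail-at-base, pass-at-patch contract can be sketched as a small decision function. This is a minimal illustration, not the orchestrator's code: only &lt;code&gt;INVALID_TEST&lt;/code&gt; comes from the description above, while &lt;code&gt;differential_gate&lt;/code&gt;, &lt;code&gt;PASS&lt;/code&gt;, and &lt;code&gt;PATCH_FAILED&lt;/code&gt; are hypothetical names, and the two booleans stand in for the results of running the synthesized test in each worktree.&lt;/p&gt;

```python
from enum import Enum


class DifferentialResult(Enum):
    PASS = "PASS"                  # hypothetical label for a satisfied contract
    INVALID_TEST = "INVALID_TEST"  # status named in the post
    PATCH_FAILED = "PATCH_FAILED"  # hypothetical label for a patch that does not fix the goal


def differential_gate(passes_at_base: bool, passes_at_patch: bool) -> DifferentialResult:
    """Apply the contract: the test must fail at base and pass at patch."""
    if passes_at_base:
        # A test that is already green before the patch says nothing about the patch.
        return DifferentialResult.INVALID_TEST
    if not passes_at_patch:
        # The test exposes the problem, but the patch did not resolve it.
        return DifferentialResult.PATCH_FAILED
    return DifferentialResult.PASS
```

&lt;p&gt;A tautological test is green at both commits, so it lands in the first branch regardless of what the patch changed.&lt;/p&gt;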

&lt;p&gt;If no command can be synthesized and the caller doesn't pass &lt;code&gt;--differentialTestCommand&lt;/code&gt;, the layer fails closed. Deliberate policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: Mutation Gate (Hard)
&lt;/h2&gt;

&lt;p&gt;Runs Stryker for JS/TS, mutmut for Python, PITest (Gradle/Maven) for Java, against changed files only. First runs the regression command; if that fails, the layer fails immediately. On pass, dispatches the mutation tool and parses the reporter output, including Stryker's clear-text table layout.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.6&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.6 to &amp;lt; 0.8&lt;/td&gt;
&lt;td&gt;WARNING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;≥ 0.8&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Patches with no meaningful test coverage of the changed lines fail here even when &lt;code&gt;npm test&lt;/code&gt; is green.&lt;/p&gt;
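&lt;p&gt;Read as code, the threshold table above amounts to a three-way classification. The sketch below assumes the score is a fraction between 0 and 1; the function and parameter names are hypothetical and only the cutoffs and status labels come from the table.&lt;/p&gt;

```python
def mutation_status(score: float, fail_below: float = 0.6, pass_at: float = 0.8) -> str:
    """Map a mutation score (fraction of mutants killed) to a gate status."""
    if score >= pass_at:
        return "PASS"
    if score >= fail_below:
        return "WARNING"
    # Anything under the lower cutoff blocks the patch, since this is a hard gate.
    return "FAIL"
```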

&lt;h2&gt;
  
  
  Layer 3: Cheat Detector (Advisory)
&lt;/h2&gt;

&lt;p&gt;Diff-static scan for known agent-shortcut patterns. Each rule maps to a real category of behaviour I've watched agents fall into under pressure.&lt;/p&gt;

&lt;p&gt;Cheat detector rule IDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;test-modification&lt;/code&gt;: agent edited the test instead of the implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;complexity-mismatch&lt;/code&gt;: trivial implementation paired with a non-trivial test&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mock-mutation&lt;/code&gt;: real call replaced with a mock that returns the expected value&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;exception-swallowing&lt;/code&gt;: caught and ignored exceptions to make tests pass&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hardcoded-answer&lt;/code&gt;: literal return of the expected output
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Optional Semgrep integration loads any registered cheat-rule packs into a transient config. The scan reads added and removed lines and patterns; it does not execute the patch. An agent that hides a hardcoded answer behind a small computation can still pass this layer. Honest about that.&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer 4: Property Gate (Advisory)
&lt;/h2&gt;

&lt;p&gt;Discovers modified TS/JS/Python functions, parses their parameter types, maps each to a fast-check arbitrary or Hypothesis strategy, generates a harness, runs it. Counterexamples surface as findings. Untyped or unsupported types degrade to a low-severity advisory finding rather than blocking.&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer 5: Attestation (Advisory on First Run)
&lt;/h2&gt;

&lt;p&gt;Reads the &lt;code&gt;refs/notes/swarm-attestation&lt;/code&gt; git note for the patch commit, validates the in-toto SLSA v1.0 envelope's subject SHA against the patch commit, then verifies the cosign signature. On the first run for a commit there's no note yet, so this layer reports advisory-warn and the post-battery attestation step writes the note.&lt;/p&gt;

&lt;p&gt;The note is verifiable later via &lt;code&gt;swarm attest verify &amp;lt;commit&amp;gt;&lt;/code&gt;. A downstream consumer can verify the patch survived the battery without trusting the running orchestrator process.&lt;/p&gt;
&lt;h2&gt;
  
  
  Composite Scoring
&lt;/h2&gt;

&lt;p&gt;When both hard gates pass, a weighted composite is computed across the three advisory layers and any optional advisory quality-gate results. Failed advisory gates each subtract a fixed penalty.&lt;/p&gt;

&lt;p&gt;Default scoring (overridable via &lt;code&gt;.swarm/gates.yaml&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;composite threshold: &lt;code&gt;0.7&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;weights: cheat detector &lt;code&gt;0.4&lt;/code&gt;, property gate &lt;code&gt;0.4&lt;/code&gt;, attestation &lt;code&gt;0.2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;advisory gate penalty: &lt;code&gt;0.02&lt;/code&gt; per failure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;humanReviewRequired&lt;/code&gt; is true when the composite score is below threshold or any advisory layer is in advisory-warn status.&lt;/p&gt;
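&lt;p&gt;To make the arithmetic concrete, here is a rough sketch of how the default weights, penalty, and threshold could combine. It assumes each advisory layer reports a score between 0 and 1 and that the result is clamped at zero; neither detail is stated above, and the function names and dictionary keys are hypothetical rather than the orchestrator's actual API.&lt;/p&gt;

```python
# Defaults quoted in the post; how each layer is scored is an assumption.
WEIGHTS = {"cheat_detector": 0.4, "property_gate": 0.4, "attestation": 0.2}
THRESHOLD = 0.7
PENALTY = 0.02


def composite_score(layer_scores, failed_advisory_gates=0, hard_gates_passed=True):
    """Weighted sum of advisory layer scores minus a fixed penalty per failed gate."""
    if not hard_gates_passed:
        # A hard-gate failure forces the composite to zero.
        return 0.0
    weighted = sum(WEIGHTS[name] * layer_scores[name] for name in WEIGHTS)
    # Clamping at zero is an assumption made for this sketch.
    return max(0.0, weighted - PENALTY * failed_advisory_gates)


def human_review_required(score, any_advisory_warn):
    """True when the composite is below threshold or any layer is in advisory-warn."""
    return THRESHOLD > score or any_advisory_warn
```

&lt;p&gt;Under these assumptions, a first run with clean cheat-detector and property results, a half-credit attestation layer, and one failed optional gate would score 0.4 + 0.4 + 0.1 - 0.02 = 0.88, which clears the threshold but still requires human review if the attestation layer is reporting advisory-warn.&lt;/p&gt;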
&lt;h2&gt;
  
  
  Where It Actually Runs
&lt;/h2&gt;

&lt;p&gt;Three real call sites, not just unit tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production orchestrator on every &lt;code&gt;swarm&lt;/code&gt; run&lt;/li&gt;
&lt;li&gt;Synthetic calibration corpus (36 paired test specs across 6 broken-category families) executing in CI on every push&lt;/li&gt;
&lt;li&gt;SWE-bench harness using Layer 1 and Layer 4 as standalone spot-check eval drivers&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Honest Caveats
&lt;/h2&gt;

&lt;p&gt;These are in &lt;code&gt;docs/known-gaps.md&lt;/code&gt; and I won't hide them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Differential gate is host-Python-sensitive on legacy codebases. The synth-eval can reflect import-chain errors rather than assertion outcomes. The authoritative resolution gate in the per-instance Docker image is unaffected.&lt;/li&gt;
&lt;li&gt;Mutation gate skips quietly when no changed files match supported languages. YAML, Markdown, Rust, Go diffs don't get mutation-tested.&lt;/li&gt;
&lt;li&gt;Cheat detector is diff-static, not behavioural. The hidden-computation-around-hardcoded-answer pattern can pass it.&lt;/li&gt;
&lt;li&gt;Attestation signing is best-effort. Cosign-not-installed errors get logged and the run proceeds without a note. The note's absence is reflected in Layer 5's advisory-warn on subsequent runs but does not block.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Why Run This Experiment
&lt;/h2&gt;

&lt;p&gt;If the falsification battery alone produces patches that survive scrutiny at acceptable quality, then a lot of the apparent value of multi-agent coordination was actually the verification pressure it created, not the agent diversity itself. If the battery alone isn't enough, then v8 multi-agent gets a clearer mandate: the swarm is the value, not the side effect.&lt;/p&gt;

&lt;p&gt;Either result is useful. The point of the rewrite is to make the answer measurable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devtools</category>
      <category>testing</category>
    </item>
    <item>
      <title>swarm-orchestrator v7.0.0-alpha.0</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Thu, 30 Apr 2026 05:28:06 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/swarm-orchestrator-v700-alpha0-41g9</link>
      <guid>https://dev.to/moonrunnerkc/swarm-orchestrator-v700-alpha0-41g9</guid>
      <description>&lt;p&gt;The agent generates code. The orchestrator tries to find reasons not to trust it.&lt;/p&gt;

&lt;p&gt;That sentence is the entire pivot. Earlier versions of swarm-orchestrator coordinated multiple agents working on the same task. v7 wraps a single agent CLI (Copilot, Claude Code, or Codex) and runs five independent checks on the patch before allowing a merge.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;TL;DR&lt;/strong&gt;

&lt;p&gt;Five-layer verification battery sits between any agent CLI and your &lt;code&gt;main&lt;/code&gt; branch. Two layers are hard gates. Three feed a composite score. Every verified merge gets a signed SLSA attestation as a git note.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  The five checks
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Gate type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Intent verification&lt;/td&gt;
&lt;td&gt;Patch doesn't actually fix the stated problem&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Regression verification&lt;/td&gt;
&lt;td&gt;Patch breaks existing behavior, or test coverage is too weak to know&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Solution quality&lt;/td&gt;
&lt;td&gt;Agent gamed the test (hardcoded values, swallowed exceptions, modified tests)&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Behavioral verification&lt;/td&gt;
&lt;td&gt;Patch works on the happy path, crashes on edge cases&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Provenance&lt;/td&gt;
&lt;td&gt;No signed attestation produced for the merge&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1. Intent verification
&lt;/h3&gt;

&lt;p&gt;The patch must make a previously-failing test pass. For SWE-bench instances, that's the &lt;code&gt;FAIL_TO_PASS&lt;/code&gt; test from the instance JSON. For user-facing goals, a reviewer synthesizes a regression test before the worker runs and confirms it fails against the base commit first.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Regression verification
&lt;/h3&gt;

&lt;p&gt;Existing tests must pass. Then mutation testing runs on the modified files to check whether coverage around the change is actually strong enough to catch regressions. A patch that works but lives in weakly-tested code gets flagged.&lt;/p&gt;

&lt;p&gt;Mutation testing tooling per language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JS / TS&lt;/strong&gt;: Stryker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: mutmut&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Java&lt;/strong&gt;: PITest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mutation score thresholds are configurable in &lt;code&gt;.swarm/gates.yaml&lt;/code&gt;. Defaults: below 0.6 fails, 0.6 to 0.8 warns, above 0.8 passes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Solution quality
&lt;/h3&gt;

&lt;p&gt;A Semgrep rule pack scans for the specific shortcuts agents take when they're being graded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardcoded values matching test expectations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;try/catch&lt;/code&gt; blocks swallowing the exact exception a failing test was asserting on&lt;/li&gt;
&lt;li&gt;Modifications to test files outside the stated scope&lt;/li&gt;
&lt;li&gt;Mock mutations that make tests pass without changing the implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Behavioral verification
&lt;/h3&gt;

&lt;p&gt;Property-based testing runs against modified functions for 60 seconds each, using Hypothesis (Python) or fast-check (TypeScript). Counterexamples that crash the patched code or violate type contracts get reported.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Provenance
&lt;/h3&gt;

&lt;p&gt;A signed SLSA v1.0 attestation is generated for each verified merge and attached to the commit as a git note. Signed via cosign keyless OIDC. The attestation contains agent identity, model version, per-layer results, and the composite score.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm attest verify &amp;lt;commit&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That command pulls the note and verifies the signature. Useful when something breaks in production three months later and someone asks which agent wrote the offending code and what was checked at merge time.&lt;/p&gt;


&lt;h2&gt;
  
  
  Status
&lt;/h2&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Alpha.&lt;/strong&gt; SWE-bench Verified 50-instance sweeps across Copilot CLI, Claude Code, and Codex are running now. Headline metric is the &lt;strong&gt;falsification catch rate&lt;/strong&gt;: of the patches each agent claimed succeeded, what percentage failed at least one layer. Numbers drop in a follow-up post when the sweeps complete.&lt;br&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where this goes next
&lt;/h2&gt;

&lt;p&gt;v8 brings parallel execution back, applied to verification instead of generation.&lt;/p&gt;

&lt;p&gt;The orchestrator will compute a risk score for each patch, then spawn a population of independent falsifiers sized to that risk. Falsifiers share findings through a coordination channel so a discovery from one steers the targeting of others. A bandit selects which falsifier types to spawn based on past outcomes.&lt;/p&gt;

&lt;p&gt;The v7 five-layer battery becomes the seed pool that v8 grows from. The project name finally fits.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Independent verification battery for patches written by AI coding agents. Wraps Copilot, Claude Code, and Codex; applies a five-layer falsification battery (intent, mutation, cheat detection, property tests, signed attestation) to gate merges.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/assets/header.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fassets%2Fheader.svg" alt="Swarm Orchestrator" width="100%"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Independent verification battery for patches written by AI coding agents.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Wraps third-party coding-agent CLIs (Copilot, Claude Code, Codex), runs worker and reviewer steps on isolated git branches, and applies a five-layer falsification battery to each agent-authored patch. Hard gates block patches that fail intent or regression checks; advisory layers feed a composite score.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You run this around an agent CLI, not instead of one. The agent produces the patch; the orchestrator tries to break it. Patches that survive merge to &lt;code&gt;main&lt;/code&gt;; patches that don't are rolled back with a verification report.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Five-layer falsification battery.&lt;/strong&gt; Intent verification, regression and mutation testing, cheat detection, property-based testing, and signed attestation. Layers 1 and 2 are hard gates; layers 3 to 5 feed an advisory composite score. Implementations live under &lt;code&gt;src/verification/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolated worker and reviewer steps.&lt;/strong&gt; Each step runs…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>94% of Published SKILL.md Files Skip the Spec's Two Most Basic Patterns</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 29 Apr 2026 02:30:37 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/94-of-published-skillmd-files-skip-the-specs-two-most-basic-patterns-oo0</link>
      <guid>https://dev.to/moonrunnerkc/94-of-published-skillmd-files-skip-the-specs-two-most-basic-patterns-oo0</guid>
      <description>&lt;p&gt;The agentskills.io spec recommends two things in every description: start with an action verb, and include a trigger phrase like "use when..." that tells the routing layer when to fire the skill. They take five seconds to add and they're the difference between a skill an agent picks up and a skill that sits unused in the catalog.&lt;/p&gt;

&lt;p&gt;I sampled 500 skills at random from a 1,436-skill public corpus and measured both. 5.8% follow both recommendations. 61.8% follow neither.&lt;/p&gt;

&lt;p&gt;Below is the full breakdown of what the SKILL.md ecosystem actually looks like in production, as of late April 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Corpus: &lt;code&gt;sickn33/antigravity-awesome-skills&lt;/code&gt; at HEAD on April 29, 2026. This is the largest publicly bundled SKILL.md collection in a single repo (1,436 indexed skills with metadata for category, source, and risk classification).&lt;/p&gt;

&lt;p&gt;Sample: 500 skills, random with seed 42 for reproducibility.&lt;/p&gt;

&lt;p&gt;Tool: &lt;a href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;&lt;code&gt;skillcheck&lt;/code&gt;&lt;/a&gt; v1.2.0 from PyPI.&lt;/p&gt;

&lt;p&gt;Per-skill features captured: every skillcheck diagnostic (rule, severity, message), description quality score, body line count, body and metadata token estimates, activation entropy and top-hypothesis score from &lt;code&gt;--activation-hypotheses&lt;/code&gt;, structural features computed locally (description length in chars and words, action verb in first position, trigger-phrase presence, presence of &lt;code&gt;resources/&lt;/code&gt;/&lt;code&gt;scripts/&lt;/code&gt;/&lt;code&gt;references/&lt;/code&gt; subdirectories, frontmatter field count and which fields), plus the antigravity-supplied category, source, and risk metadata.&lt;/p&gt;

&lt;p&gt;Caveat one: skillcheck's description quality score is a heuristic that includes action-verb and trigger-phrase detection as positive signals. So the correlation between these two features and the score is partly mechanical. The headline finding is not "we discovered these patterns predict quality." It's "the spec recommends these patterns, the linter that encodes the spec rewards them, and almost nobody is using them."&lt;/p&gt;

&lt;p&gt;Caveat two: antigravity's bundler injects &lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;, and &lt;code&gt;category&lt;/code&gt; fields into the SKILL.md frontmatter when packaging skills. The author-original frontmatter analysis below excludes these injected fields.&lt;/p&gt;

&lt;p&gt;Reproduce with three commands plus the sampling step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;skillcheck&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.2.0
git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/sickn33/antigravity-awesome-skills.git
&lt;span class="nb"&gt;cd &lt;/span&gt;antigravity-awesome-skills
&lt;span class="c"&gt;# Then sample from skills_index.json with seed 42 and run skillcheck against each&lt;/span&gt;
&lt;span class="c"&gt;# Full analysis script: see the dataset link at the bottom&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
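&lt;p&gt;The sampling step could look something like the sketch below. It assumes the index has already been reduced to a list of skill identifiers (the layout of &lt;code&gt;skills_index.json&lt;/code&gt; isn't shown here), and &lt;code&gt;sample_skills&lt;/code&gt; is a hypothetical helper, not the original analysis script. A different input order or sampling call would yield a different 500, so use the linked script for an exact reproduction.&lt;/p&gt;

```python
import random


def sample_skills(skill_ids, k=500, seed=42):
    """Draw k distinct skills using a dedicated, seeded random generator."""
    rng = random.Random(seed)  # isolated from the global random state
    return rng.sample(list(skill_ids), k)


# Stand-in identifiers for the 1,436 indexed skills (placeholder data).
sample = sample_skills(range(1436))
```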

&lt;h2&gt;
  
  
  The two-pattern adoption gap
&lt;/h2&gt;

&lt;p&gt;Every skill description was classified on two binary features: does it start with an action verb (Generates, Validates, Creates, Builds, Analyzes, etc., from a 90-verb allowlist), and does it contain a trigger phrase (&lt;code&gt;use when&lt;/code&gt;, &lt;code&gt;use this skill when&lt;/code&gt;, &lt;code&gt;when the user&lt;/code&gt;, &lt;code&gt;when working with&lt;/code&gt;, &lt;code&gt;whenever&lt;/code&gt;, etc.)?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Has both action verb and trigger phrase&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;5.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action verb only&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;td&gt;21.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigger phrase only&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;10.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neither&lt;/td&gt;
&lt;td&gt;309&lt;/td&gt;
&lt;td&gt;61.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The same four groups, scored against skillcheck's description quality metric:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median score&lt;/th&gt;
&lt;th&gt;% scoring 70+&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Has both&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action verb only&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;72.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigger phrase only&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;94.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neither&lt;/td&gt;
&lt;td&gt;309&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;td&gt;8.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 100% rate in the both-features group isn't magic. It reflects that skillcheck's heuristic was designed around the spec's recommendations and rewards skills that follow them. What's actually striking is the bottom line: 309 of 500 published skills skip both recommendations. That's the working majority of the ecosystem leaving easy quality on the floor.&lt;/p&gt;
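&lt;p&gt;For reference, the two-feature classification described at the top of this section can be approximated in a few lines. This is a simplified sketch rather than skillcheck's implementation: the verb set holds only the five examples named above (the real allowlist has 90), the trigger phrases are the listed examples, and &lt;code&gt;classify_description&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
# Illustrative subset of the 90-verb allowlist mentioned in the post.
ACTION_VERBS = {"generates", "validates", "creates", "builds", "analyzes"}
TRIGGER_PHRASES = (
    "use when",
    "use this skill when",
    "when the user",
    "when working with",
    "whenever",
)


def classify_description(description: str) -> str:
    """Bucket a skill description by action-verb and trigger-phrase presence."""
    text = description.strip().lower()
    words = text.split()
    # The action verb only counts when it is the first word.
    has_verb = bool(words) and words[0].strip(".,:;") in ACTION_VERBS
    has_trigger = any(phrase in text for phrase in TRIGGER_PHRASES)
    if has_verb and has_trigger:
        return "both"
    if has_verb:
        return "action verb only"
    if has_trigger:
        return "trigger phrase only"
    return "neither"
```

&lt;p&gt;A description such as "Validates YAML files. Use when the user edits CI config." lands in the "both" bucket; "A collection of prompts for writing." lands in "neither".&lt;/p&gt;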
&lt;h2&gt;
  
  
  What authors actually fill in
&lt;/h2&gt;

&lt;p&gt;Outside &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt;, frontmatter is mostly empty. The median author-original frontmatter (excluding the bundler's injected fields) has just two fields. Two.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Adoption&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;description&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;author&lt;/td&gt;
&lt;td&gt;10.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tags&lt;/td&gt;
&lt;td&gt;10.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tools&lt;/td&gt;
&lt;td&gt;8.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;license&lt;/td&gt;
&lt;td&gt;3.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;allowed-tools&lt;/td&gt;
&lt;td&gt;2.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;version&lt;/td&gt;
&lt;td&gt;2.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;triggers&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;user-invokable&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;capabilities&lt;/td&gt;
&lt;td&gt;0.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The spec offers &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, &lt;code&gt;tags&lt;/code&gt;, &lt;code&gt;allowed-tools&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;agent&lt;/code&gt;, &lt;code&gt;hooks&lt;/code&gt;, &lt;code&gt;user-invocable&lt;/code&gt;, &lt;code&gt;disable-model-invocation&lt;/code&gt;, &lt;code&gt;skills&lt;/code&gt;, &lt;code&gt;mode&lt;/code&gt;. Almost none of them are being used. 80% of authors stop after name and description. There's an entire optional metadata layer the spec defines and the ecosystem ignores.&lt;/p&gt;
&lt;h2&gt;
  
  
  Progressive disclosure adoption is 16%
&lt;/h2&gt;

&lt;p&gt;The spec's load-bearing concept is progressive disclosure: keep metadata tiny so the routing layer scans it cheaply, keep the body lean so it fits the agent's context window, push heavy material into &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt; subdirectories that load only when needed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Subdirectory&lt;/th&gt;
&lt;th&gt;Adoption&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;resources/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scripts/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;references/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Any of the three&lt;/td&gt;
&lt;td&gt;16.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;84% of skills inline everything in &lt;code&gt;SKILL.md&lt;/code&gt;. The whole architectural promise of progressive disclosure (multiple skills can sit in the agent's catalog without overwhelming context) requires authors to actually use the pattern. Most don't.&lt;/p&gt;
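&lt;p&gt;For a sense of what "using the pattern" means on disk, here is a minimal sketch that checks a skill directory for the three subdirectories in the table. The helper names are hypothetical and this is not how skillcheck itself measures adoption; it only assumes a skill is a folder that may contain &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt;.&lt;/p&gt;

```python
from pathlib import Path

# Subdirectory names taken from the adoption table above.
DISCLOSURE_DIRS = ("resources", "scripts", "references")


def disclosure_dirs_used(skill_dir: Path) -> list[str]:
    """Return the progressive-disclosure subdirectories present in a skill."""
    return [name for name in DISCLOSURE_DIRS if (skill_dir / name).is_dir()]


def uses_progressive_disclosure(skill_dir: Path) -> bool:
    """True when the skill has at least one of the three subdirectories."""
    return bool(disclosure_dirs_used(skill_dir))
```

&lt;p&gt;A skill that keeps everything in &lt;code&gt;SKILL.md&lt;/code&gt; returns an empty list here, which is the 84% case.&lt;/p&gt;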
&lt;h2&gt;
  
  
  Body bloat is real
&lt;/h2&gt;

&lt;p&gt;23% of skills triggered &lt;code&gt;disclosure.body-bloat&lt;/code&gt; warnings, meaning they contain code blocks over 50 lines or tables over 20 rows in the SKILL.md body itself. These are exactly the things the progressive disclosure pattern was designed to push out into &lt;code&gt;references/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;13.6% exceeded the spec's 500-line soft cap on body length. 8.4% exceeded the 5,000-token body budget according to skillcheck's tokenizer; token counts were only recorded for skills that tripped the warning threshold, so the rest weren't measured.&lt;/p&gt;
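&lt;p&gt;The body-bloat thresholds are easy to approximate yourself. The sketch below is an assumption about how such a check could work, not skillcheck's actual rule: it counts lines inside fenced code blocks and runs of consecutive pipe-table lines, and the function names are made up for illustration.&lt;/p&gt;

```python
FENCE = "`" * 3  # a Markdown code fence marker


def longest_code_block(body: str) -> int:
    """Length in lines of the longest fenced code block in a Markdown body."""
    longest = 0
    current = None  # None while outside a fence, else the running line count
    for line in body.splitlines():
        if line.lstrip().startswith(FENCE):
            if current is None:
                current = 0  # opening fence
            else:
                longest = max(longest, current)  # closing fence
                current = None
        elif current is not None:
            current += 1
    return longest


def longest_table(body: str) -> int:
    """Longest run of consecutive pipe-table lines (header and separator included)."""
    longest = run = 0
    for line in body.splitlines():
        run = run + 1 if line.lstrip().startswith("|") else 0
        longest = max(longest, run)
    return longest


def is_bloated(body: str, max_code_lines: int = 50, max_table_rows: int = 20) -> bool:
    """Apply the two thresholds quoted above to a SKILL.md body."""
    return longest_code_block(body) > max_code_lines or longest_table(body) > max_table_rows
```

&lt;p&gt;This version counts the header and separator lines as table rows and does not distinguish tables that appear inside code fences, so treat its numbers as a rough signal.&lt;/p&gt;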
&lt;h2&gt;
  
  
  Description length sweet spot
&lt;/h2&gt;

&lt;p&gt;Median quality rises with description length, peaks around 175-225 characters, then tapers off:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Length range (chars)&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25-49&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-99&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100-149&lt;/td&gt;
&lt;td&gt;158&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150-199&lt;/td&gt;
&lt;td&gt;131&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200-249&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;67.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;250-299&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The spec's character cap is 1,024. Almost nobody's pushing it. The ecosystem clusters between 100 and 200 chars (median 145), which is roughly the bottom edge of the quality plateau. Authors writing 150+ char descriptions get noticeably better routing signal density.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cross-source patterns
&lt;/h2&gt;

&lt;p&gt;Antigravity's index classifies each skill's source. Quality patterns by source class:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source class&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median quality&lt;/th&gt;
&lt;th&gt;% action verb&lt;/th&gt;
&lt;th&gt;% trigger&lt;/th&gt;
&lt;th&gt;% progressive disclosure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;community&lt;/td&gt;
&lt;td&gt;394&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;26.6%&lt;/td&gt;
&lt;td&gt;17.5%&lt;/td&gt;
&lt;td&gt;16.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;external_repo&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;65.0&lt;/td&gt;
&lt;td&gt;34.2%&lt;/td&gt;
&lt;td&gt;31.6%&lt;/td&gt;
&lt;td&gt;18.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;official_org&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;77.8%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;33.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;personal&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three observations. Skills from official org repos (Anthropic, Hugging Face, etc.) hit 77.8% action-verb adoption, miles above the community baseline, but zero trigger-phrase use; their descriptions are direct and verb-led without the "use when" preamble. Skills from individual external repos (someone's personal GitHub project) actually hit the highest trigger-phrase rate (31.6%), suggesting individual maintainers writing for their own activation problem think harder about it than community contributors writing for a shared list. Skills tagged "personal" (someone's curated set of their own work) hit 0% on both patterns, which is the cleanest signal that "I made this for me" doesn't translate to "an agent will pick this up."&lt;/p&gt;
&lt;h2&gt;
  
  
  Skillcheck v1.2.0 against the corpus
&lt;/h2&gt;

&lt;p&gt;skillcheck v1.2.0 was released April 28, 2026. Run against the 500-skill sample, its rule set found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 of 500 skills produced an actual ERROR (0.2%): &lt;code&gt;android_ui_verification&lt;/code&gt;, which has invalid characters in its name.&lt;/li&gt;
&lt;li&gt;The other 499 (99.8%) produced WARNINGs but no ERROR.&lt;/li&gt;
&lt;li&gt;0 skills passed completely clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most-fired rules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.field.unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;description.quality-score&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;499&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.body-bloat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;115&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compat.unverified&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.metadata-budget&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sizing.body.line-count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.body-budget&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.description.person-voice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.field.ecosystem&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sizing.body.token-estimate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.name.reserved-word&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;frontmatter.field.unknown&lt;/code&gt; warning fires on every file because antigravity injects bundler-only fields into the frontmatter (&lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;); strip those and the genuine unknown-field rate drops dramatically. Worth knowing if you're running skillcheck against bundled corpora versus author-original repos.&lt;/p&gt;
&lt;h2&gt;
  
  
  What this means if you publish skills
&lt;/h2&gt;

&lt;p&gt;Four things, all reversible in a single commit per skill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Start the description with an action verb (&lt;code&gt;Generates&lt;/code&gt;, &lt;code&gt;Validates&lt;/code&gt;, &lt;code&gt;Creates&lt;/code&gt;, &lt;code&gt;Analyzes&lt;/code&gt;, &lt;code&gt;Refactors&lt;/code&gt;, &lt;code&gt;Audits&lt;/code&gt;, etc.). Not &lt;code&gt;Expert in&lt;/code&gt;, not &lt;code&gt;Comprehensive&lt;/code&gt;, not &lt;code&gt;One-stop&lt;/code&gt;. The verb tells the routing layer what the skill does in a single word.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include a trigger phrase (&lt;code&gt;Use when ...&lt;/code&gt;, &lt;code&gt;Trigger when ...&lt;/code&gt;, &lt;code&gt;Use this skill when the user ...&lt;/code&gt;). The agent's routing decision is "should I activate this." A trigger phrase answers it directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aim for 175-225 characters in the description. Short descriptions don't carry enough routing signal; long ones bury it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Push large code blocks (&amp;gt;50 lines), large tables (&amp;gt;20 rows), and detailed reference material out of &lt;code&gt;SKILL.md&lt;/code&gt; and into &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt;. The body should describe the work; the reference files should hold the work.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
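&lt;p&gt;The first three items can be checked mechanically before you commit. This is a rough sketch with a deliberately tiny verb list and two trigger patterns; the study's 90-verb allowlist and 9 regexes are not reproduced here, and &lt;code&gt;check_description&lt;/code&gt; is a hypothetical helper, not part of skillcheck.&lt;/p&gt;

```python
import re

# Illustrative lists only, taken from the examples in the list above.
ACTION_VERBS = {"generates", "validates", "creates", "analyzes", "refactors", "audits"}
TRIGGER_PATTERNS = [
    re.compile(r"\buse (this skill )?when\b", re.IGNORECASE),
    re.compile(r"\btrigger when\b", re.IGNORECASE),
]


def check_description(description: str) -> dict[str, bool]:
    """Report which of the three description recommendations a string meets."""
    words = description.split()
    first_word = words[0].lower().strip(".,:;") if words else ""
    return {
        "action_verb": first_word in ACTION_VERBS,
        "trigger_phrase": any(p.search(description) for p in TRIGGER_PATTERNS),
        # 175-225 characters inclusive, the range recommended above.
        "length_ok": len(description) in range(175, 226),
    }
```

&lt;p&gt;Something like &lt;code&gt;check_description("Expert in everything")&lt;/code&gt; comes back false on all three keys, which is the shape of description the list warns against.&lt;/p&gt;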

&lt;p&gt;That's it. Four changes that move a skill from the 61.8% of the ecosystem ignoring spec recommendations to the 5.8% following them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Methodology, for anyone who wants to push back
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tool: &lt;code&gt;skillcheck&lt;/code&gt; v1.2.0 from PyPI (released April 28, 2026)&lt;/li&gt;
&lt;li&gt;Corpus: &lt;code&gt;sickn33/antigravity-awesome-skills&lt;/code&gt; at HEAD on April 29, 2026 (1,436 indexed skills)&lt;/li&gt;
&lt;li&gt;Sample: 500 skills, drawn with &lt;code&gt;random.seed(42)&lt;/code&gt; then &lt;code&gt;random.sample&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Per-skill processing: &lt;code&gt;skillcheck path --format json --skip-ref-check&lt;/code&gt; plus &lt;code&gt;skillcheck path --activation-hypotheses --format json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Feature extraction: action-verb match against a 90-verb allowlist (gerund and base forms); trigger-phrase match against 9 regex patterns; structural facts computed from filesystem and parsed frontmatter&lt;/li&gt;
&lt;li&gt;Quality score: pulled from skillcheck's &lt;code&gt;description.quality-score&lt;/code&gt; info diagnostic (a published heuristic whose source is at &lt;code&gt;src/skillcheck/rules/description.py&lt;/code&gt; in the skillcheck repo)&lt;/li&gt;
&lt;li&gt;Frontmatter analysis: bundler-injected fields (&lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;) excluded from the author-original counts above&lt;/li&gt;
&lt;/ul&gt;
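&lt;p&gt;The sampling step is the part most worth reproducing exactly. A minimal sketch, assuming the indexed skills are available as a list of identifiers: &lt;code&gt;draw_sample&lt;/code&gt; is a hypothetical name, and sorting the population first is my assumption so that the seed gives the same draw regardless of how the index was read.&lt;/p&gt;

```python
import random


def draw_sample(skill_ids: list[str], k: int = 500, seed: int = 42) -> list[str]:
    """Draw a reproducible sample of k skills."""
    # Seed the module-level generator, then sample, as described in the list above.
    random.seed(seed)
    # Sorting is an assumption: random.sample depends on the population order.
    return random.sample(sorted(skill_ids), k)
```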

&lt;p&gt;The full dataset (500 skills, all features, all diagnostics) and the analysis output are in the skillcheck repo under &lt;a href="https://github.com/moonrunnerkc/skillcheck/tree/main/docs" rel="noopener noreferrer"&gt;&lt;code&gt;docs/&lt;/code&gt;&lt;/a&gt;. Anyone who wants to verify a finding, slice it differently, or run the same pipeline against a different corpus has everything they need.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;This study used skillcheck's symbolic mode and the activation-hypotheses generator. The agent-native critique mode (&lt;code&gt;--ingest-critique&lt;/code&gt;) and capability graph extraction (&lt;code&gt;--ingest-graph&lt;/code&gt;) weren't run here because they require a real agent in the loop and would have made the corpus run significantly longer. A follow-up study using those modes on a smaller subset (50-100 skills) would tell us what an agent actually sees in a skill versus what a static linter can measure. That's the next post.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;
        skillcheck
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Cross-agent skill quality gate for SKILL.md files. Validates frontmatter, scores description discoverability, checks file references, enforces three-tier token budgets, and flags compatibility issues across Claude Code, VS Code/Copilot, Codex, and Cursor.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;

  
  
  &lt;img alt="skillcheck" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fskillcheck%2FHEAD%2F.github%2Fbanner.svg" width="600"&gt;

&lt;br&gt;
&lt;p&gt;&lt;strong&gt;Cross-agent skill quality gate for &lt;code&gt;SKILL.md&lt;/code&gt; files.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What This Does&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;skillcheck&lt;/code&gt; validates SKILL.md files against the &lt;a href="https://agentskills.io/specification" rel="nofollow noopener noreferrer"&gt;agentskills.io specification&lt;/a&gt;: frontmatter structure, description quality, body size, file references, and cross-agent compatibility. New in v1.0: agent-native semantic self-critique, heuristic capability graph extraction with five structural analyzers, and a per-skill validation history ledger. It does not call any LLM API, execute skill instructions, or modify files.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why This Exists&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Analysis of 580 AI instruction files found that 96% of their content cannot be verified by any static tool. A separate survey found that 22% of SKILL.md files fail basic structural validation. Skills get written, committed, and published to catalogs; nobody proves they work.&lt;/p&gt;

&lt;p&gt;skillcheck addresses both gaps with a two-mode design. When a calling agent is present, it uses that agent for semantic self-critique and capability graph extraction: the agent reads the skill's instructions and reports whether they are clear, complete, and internally…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>claude</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Jupyter notebook bug that only crashes for other people</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Tue, 28 Apr 2026 04:53:59 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/the-jupyter-notebook-bug-that-only-crashes-for-other-people-5aek</link>
      <guid>https://dev.to/moonrunnerkc/the-jupyter-notebook-bug-that-only-crashes-for-other-people-5aek</guid>
      <description>&lt;p&gt;Cell 0 uses &lt;code&gt;df&lt;/code&gt;. Cell 1 defines &lt;code&gt;df&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Notebook works for you because your kernel ran the cells in some other order and the variable's still in memory. You commit. Someone clones the repo, hits Restart and Run All, dies on cell 0.&lt;/p&gt;

&lt;p&gt;Standard Python linters can't catch this. ruff, flake8, mypy operate on one source file at a time. A notebook is N cells whose execution order in your kernel may have nothing to do with their order on disk. The bug isn't inside any single cell. It's in the relationship between cells.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nborder&lt;/code&gt; is a static linter for that relationship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rules
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Flags&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB101&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;execution_count&lt;/code&gt; decreases in source order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB201&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Name used in cell N, only defined in cell M where M &amp;gt; N&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB102&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Name used somewhere, never defined anywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB103&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stochastic call (numpy, torch, tensorflow, stdlib random) before any seed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
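&lt;p&gt;NB101 is the easiest of the four to picture. A rough sketch over the notebook's JSON, assuming the standard nbformat fields &lt;code&gt;cells&lt;/code&gt;, &lt;code&gt;cell_type&lt;/code&gt;, and &lt;code&gt;execution_count&lt;/code&gt;; &lt;code&gt;out_of_order_cells&lt;/code&gt; is a hypothetical helper, not nborder's code:&lt;/p&gt;

```python
def out_of_order_cells(notebook: dict) -> list[int]:
    """Indexes of code cells whose execution_count is lower than an earlier cell's."""
    flagged = []
    highest = 0
    for index, cell in enumerate(notebook.get("cells", [])):
        count = cell.get("execution_count")
        if cell.get("cell_type") != "code" or count is None:
            continue  # markdown cells and never-run cells carry no ordering information
        if highest > count:
            flagged.append(index)
        highest = max(highest, count)
    return flagged
```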

&lt;h2&gt;
  
  
  How the cross-cell analysis works
&lt;/h2&gt;

&lt;p&gt;Each cell gets parsed with &lt;a href="https://github.com/Instagram/LibCST" rel="noopener noreferrer"&gt;libCST&lt;/a&gt;. A visitor extracts symbol definitions (assignments, function defs, class defs, imports) and symbol uses (name references, attribute roots) per cell. Connect them across cells in source order, you get a dataflow graph at notebook scope.&lt;/p&gt;

&lt;p&gt;NB201 findings are uses whose nearest matching definition lives in a later cell. NB102 findings are uses with no matching definition anywhere.&lt;/p&gt;
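&lt;p&gt;To make the idea concrete, here is a toy version of that classification. It swaps libCST for the standard library's &lt;code&gt;ast&lt;/code&gt; module to stay dependency-free, flattens all scopes into one, and ignores ordering inside a cell, so it is a sketch of the technique rather than nborder's implementation. The function names are invented for this example.&lt;/p&gt;

```python
import ast
import builtins


def cell_symbols(source: str) -> tuple[set[str], set[str]]:
    """Return (defined, used) names for one cell, with all scopes flattened."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            elif isinstance(node.ctx, ast.Store):
                defined.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)  # function parameters count as definitions
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add(alias.asname or alias.name.split(".")[0])
    return defined, used


def classify(cells: list[str]) -> list[tuple[str, int, str]]:
    """Return (rule, cell_index, name) findings in source order."""
    symbols = [cell_symbols(source) for source in cells]
    findings = []
    seen = set(dir(builtins))  # names available before any cell runs
    for index, (defined, used) in enumerate(symbols):
        for name in sorted(used - seen - defined):
            # Defined only in a later cell: NB201. Defined nowhere: NB102.
            later = any(name in later_defs for later_defs, _ in symbols[index + 1:])
            findings.append(("NB201" if later else "NB102", index, name))
        seen |= defined
    return findings
```

&lt;p&gt;Feeding it the two cells from the opening example (a cell using &lt;code&gt;df&lt;/code&gt; followed by the cell that defines it) yields a single NB201 finding on cell 0.&lt;/p&gt;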

&lt;p&gt;The graph also makes the auto-fix safe. When NB201 fires, the fixer runs a topological sort over cell dependency edges. Sort succeeds, cells get reordered to respect dataflow and execution counts get cleared. Cycle detected, fixer bails with an explicit message naming the cycle.&lt;/p&gt;
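&lt;p&gt;The sort-or-bail behaviour maps neatly onto the standard library's &lt;code&gt;graphlib&lt;/code&gt;. The sketch below assumes the dependency edges have already been extracted into a dict; &lt;code&gt;reorder&lt;/code&gt; and the error message are illustrative, not nborder's own.&lt;/p&gt;

```python
from graphlib import CycleError, TopologicalSorter


def reorder(dependencies: dict[int, set[int]]) -> list[int]:
    """Return a cell order in which every cell runs after the cells it depends on.

    `dependencies` maps a cell index to the indexes of the cells that define
    the names it uses.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        return list(sorter.static_order())
    except CycleError as error:
        # graphlib reports the detected cycle as the second element of args.
        cycle = " -> ".join(str(cell) for cell in error.args[1])
        raise ValueError(f"cannot reorder, dependency cycle: {cycle}") from None
```

&lt;p&gt;&lt;code&gt;graphlib&lt;/code&gt; makes no promise about the relative order of independent cells, so a real fixer would likely want a stable variant that keeps unrelated cells where the author put them.&lt;/p&gt;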

&lt;p&gt;
  NB201 fix example
  &lt;p&gt;Input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cell 0
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# cell 1
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Run &lt;code&gt;nborder check --fix notebook.ipynb&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;notebook.ipynb:cell_0:1:10: NB201 Variable `df` used in cell 0 is only defined in cell 1. The notebook will fail on Restart-and-Run-All. [*]
Fix outcomes:
  reorder: applied (reordered 2 cells and cleared execution counts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cell 0
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="c1"&gt;# cell 1
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Cell IDs preserved. Execution counts cleared. Second &lt;code&gt;nborder check&lt;/code&gt; exits 0.&lt;/p&gt;



&lt;/p&gt;
&lt;h2&gt;
  
  
  NB103 and seed injection
&lt;/h2&gt;

&lt;p&gt;NB103 walks the same graph for stochastic calls (&lt;code&gt;np.random.rand&lt;/code&gt;, &lt;code&gt;torch.rand&lt;/code&gt;, &lt;code&gt;tf.random.normal&lt;/code&gt;, &lt;code&gt;random.random&lt;/code&gt;) firing before any matching seed. The fix injects a single seed cell at the right position. Multi-library notebooks get one cell:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Alias-aware. &lt;code&gt;import numpy as numpy_lib&lt;/code&gt; produces a seed line using &lt;code&gt;numpy_lib&lt;/code&gt;, not a redundant fresh import. After fixing a NumPy notebook, computed cell outputs are byte-identical across consecutive &lt;code&gt;jupyter nbconvert --execute&lt;/code&gt; runs.&lt;/p&gt;
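&lt;p&gt;Alias awareness is mostly a matter of reading the import statement before writing the seed line. A minimal sketch using the standard library's &lt;code&gt;ast&lt;/code&gt; module, with a hypothetical &lt;code&gt;seed_lines&lt;/code&gt; helper and a small template table that covers only plain &lt;code&gt;import x as y&lt;/code&gt; statements:&lt;/p&gt;

```python
import ast

# Seed call per library; {alias} is the name the notebook imported it under.
SEED_TEMPLATES = {
    "numpy": "{alias}.random.seed({seed})",
    "torch": "{alias}.manual_seed({seed})",
    "random": "{alias}.seed({seed})",
}


def seed_lines(source: str, seed: int = 42) -> list[str]:
    """Build seed statements that reuse whatever alias the source already imports."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                template = SEED_TEMPLATES.get(alias.name)
                if template:
                    lines.append(template.format(alias=alias.asname or alias.name, seed=seed))
    return lines
```

&lt;p&gt;For a cell containing &lt;code&gt;import numpy as numpy_lib&lt;/code&gt;, this produces &lt;code&gt;numpy_lib.random.seed(42)&lt;/code&gt; and adds no second import.&lt;/p&gt;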

&lt;p&gt;JAX and scikit-learn get diagnostic-only handling. JAX needs &lt;code&gt;PRNGKey&lt;/code&gt; threading through call signatures. sklearn &lt;code&gt;random_state=None&lt;/code&gt; needs a value chosen against your testing strategy. Neither is a single line you can inject.&lt;/p&gt;
&lt;h2&gt;
  
  
  Byte-stable writer
&lt;/h2&gt;

&lt;p&gt;Parse a notebook, modify nothing, write it back, bytes match exactly. Verified against &lt;code&gt;nbformat&lt;/code&gt; v4.0, v4.4, v4.5 fixtures plus a real-world notebook corpus. When the writer does mutate during a fix, only the cells that actually changed get rewritten. Cell IDs, metadata, and unrelated cells stay verbatim.&lt;/p&gt;
&lt;h2&gt;
  
  
  Outputs
&lt;/h2&gt;

&lt;p&gt;Four reporters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;text&lt;/strong&gt;: ruff-style &lt;code&gt;path:cell:line:col: NB### message&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;json&lt;/strong&gt;: machine-readable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;github&lt;/strong&gt;: &lt;code&gt;::error file=...,line=...,title=NB201::&lt;/code&gt; annotations for PR inline comments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sarif&lt;/strong&gt;: SARIF 2.1.0, schema-validated&lt;/li&gt;
&lt;/ul&gt;
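&lt;p&gt;The &lt;strong&gt;github&lt;/strong&gt; format is GitHub's workflow-command syntax, so it is simple to emit from any tool. A sketch of a formatter, assuming the finding has already been mapped to a file path and line number; &lt;code&gt;github_annotation&lt;/code&gt; is a made-up name and nborder's own reporter may differ:&lt;/p&gt;

```python
def github_annotation(path: str, line: int, code: str, message: str) -> str:
    """Format one finding as a GitHub Actions error annotation."""
    # Workflow commands are single-line, so percent signs and newlines in the
    # message are percent-encoded. Property values are left unescaped here.
    escaped = message.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")
    return f"::error file={path},line={line},title={code}::{escaped}"
```

&lt;p&gt;Printing the returned string to stdout inside a workflow step is what makes the annotation show up on the pull request.&lt;/p&gt;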

&lt;p&gt;Pre-commit hook and a composite GitHub Action included:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/nborder@v0.1.4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebooks/&lt;/span&gt;
    &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NB201,NB103&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  What it doesn't do
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Doesn't execute notebooks. Pair with &lt;a href="https://github.com/computationalmodelling/nbval" rel="noopener noreferrer"&gt;nbval&lt;/a&gt; or &lt;a href="https://github.com/nteract/papermill" rel="noopener noreferrer"&gt;papermill&lt;/a&gt; for kernel-level validation.&lt;/li&gt;
&lt;li&gt;Doesn't lint cell-internal style. That's &lt;a href="https://github.com/nbQA-dev/nbQA" rel="noopener noreferrer"&gt;nbqa&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Dynamic name resolution (&lt;code&gt;exec&lt;/code&gt;, &lt;code&gt;getattr&lt;/code&gt;, &lt;code&gt;**kwargs&lt;/code&gt;, monkey-patching) is invisible. Same limitation as any static analyzer.&lt;/li&gt;
&lt;li&gt;Cell magics are stripped before analysis. Names introduced by &lt;code&gt;%%capture&lt;/code&gt; get tracked. Anything magic-internal is not.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;nborder
nborder check path/to/notebooks/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Python 3.10+.&lt;/p&gt;


&lt;/div&gt;




&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/nborder" rel="noopener noreferrer"&gt;
        nborder
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A fast, opinionated linter and auto-fixer for Jupyter notebook hidden-state and execution-order bugs.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;nborder&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;A fast, opinionated linter and auto-fixer for Jupyter notebook hidden-state and execution-order bugs.&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/nborder/docs/images/hero.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fnborder%2FHEAD%2Fdocs%2Fimages%2Fhero.png" alt="nborder catches four classes of notebook bug in one pass"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://pypi.org/project/nborder/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e3b5ccfec928f35e7daa5ff4a841dd0685a5a3646652971eb5834e527ee0e373/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6e626f726465722e737667" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/nborder/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/moonrunnerkc/nborder/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/nborder/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/299dedd9b8667ac146540cd90fa831a9803e6e152f430bfbefaa9bee8d56236a/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f6e626f726465722e737667" alt="Python"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/nborder/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/08cef40a9105b6526ca22088bc514fbfdbc9aac1ddbf8d4e6c750e3a88a44dca/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d626c75652e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What this catches&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;One-line example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NB101&lt;/td&gt;
&lt;td&gt;Non-monotonic execution counts&lt;/td&gt;
&lt;td&gt;Cell 1 ran with &lt;code&gt;In [3]:&lt;/code&gt; after cell 0 ran with &lt;code&gt;In [5]:&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB102&lt;/td&gt;
&lt;td&gt;Won't survive Restart-and-Run-All&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;print(df)&lt;/code&gt; references a name no cell in the notebook defines.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB201&lt;/td&gt;
&lt;td&gt;Use-before-assign across cells&lt;/td&gt;
&lt;td&gt;Cell 0 uses &lt;code&gt;df&lt;/code&gt;; &lt;code&gt;df = ...&lt;/code&gt; only appears in cell 1.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB103&lt;/td&gt;
&lt;td&gt;Stochastic library used without seed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;np.random.rand(3)&lt;/code&gt; runs with no seed call before it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;Each rule has a docs page under &lt;a href="https://github.com/moonrunnerkc/nborder/docs/rules/" rel="noopener noreferrer"&gt;&lt;code&gt;docs/rules/&lt;/code&gt;&lt;/a&gt; explaining the bug class, a bad and good example, and the auto-fix behaviour. The four sections below walk through each rule with the diagnostic nborder actually emits.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;NB101: out-of-order execution&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;execution_count&lt;/code&gt; field on each cell records the order Jupyter actually ran cells in, not the order they appear in the file. When those orders disagree, the recorded…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/nborder" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/nborder" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>python</category>
      <category>jupyter</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Four Security Bugs That Shipped in AI-Generated Code (and How They Got Caught)</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:36:15 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/four-security-bugs-that-shipped-in-ai-generated-code-and-how-they-got-caught-10i8</link>
      <guid>https://dev.to/moonrunnerkc/four-security-bugs-that-shipped-in-ai-generated-code-and-how-they-got-caught-10i8</guid>
      <description>&lt;p&gt;A single Copilot CLI run against a FastAPI application produced four distinct security issues. The code worked. Tests passed. The endpoint did what was asked. None of the issues would surface during a demo or a code review focused on functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  User input rendered as raw HTML
&lt;/h2&gt;

&lt;p&gt;The application tracks satellite data. Satellite names come from user input. The agent rendered them directly into HTML templates in four separate locations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;strong&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/strong&amp;gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sat1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; vs &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sat2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No escaping. Four blocks, same pattern. A single-purpose security scanning agent found all four and applied &lt;code&gt;markupsafe.escape()&lt;/code&gt;. A general-purpose agent reviewing the same code caught three of four, missing one buried in a conditional branch.&lt;/p&gt;

&lt;p&gt;The difference isn't model quality. The security-focused agent had a narrower scope and explicit instructions to scan for unescaped user input in template rendering. Scope and prompt specificity determined the outcome.&lt;/p&gt;
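
&lt;p&gt;As a rough illustration of the fix, the sketch below escapes each user-supplied field before it is interpolated into markup. It uses the standard library's &lt;code&gt;html.escape()&lt;/code&gt; in place of &lt;code&gt;markupsafe.escape()&lt;/code&gt; so it runs without extra dependencies. The &lt;code&gt;Threat&lt;/code&gt; record and the &lt;code&gt;escaped_fields&lt;/code&gt; helper are hypothetical names, not code from the application.&lt;/p&gt;

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Threat:
    # Hypothetical record; the article only shows t.risk, t.sat1 and t.sat2.
    risk: str
    sat1: str
    sat2: str


def escaped_fields(t: Threat) -> Threat:
    """Return a copy whose fields are safe to interpolate into HTML text."""
    # html.escape() is a stdlib stand-in for the markupsafe.escape() call
    # the article mentions; it also escapes quotes by default.
    return Threat(risk=escape(t.risk), sat1=escape(t.sat1), sat2=escape(t.sat2))
```

&lt;p&gt;The template then interpolates the escaped copy, so a satellite name containing markup shows up as literal text instead of being parsed as HTML.&lt;/p&gt;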

&lt;h2&gt;
  
  
  Health endpoint that lies to the load balancer
&lt;/h2&gt;

&lt;p&gt;The agent built a &lt;code&gt;/health&lt;/code&gt; endpoint. It returned HTTP 200 unconditionally, including when the database was unreachable.&lt;/p&gt;

&lt;p&gt;Kubernetes liveness and readiness probes treat any status from 200 through 399 as success. For a readiness probe, that means "this instance is healthy, keep routing traffic." An instance that returns 200 with a dead database stays in the rotation. Users hit it. Requests fail. The cluster thinks everything is fine.&lt;/p&gt;

&lt;p&gt;The correct response is 503 (Service Unavailable). The orchestrator's verification caught this because runtime behavior checks are part of the quality gate surface, not just static analysis.&lt;/p&gt;

&lt;p&gt;This one's subtle. The endpoint "works" in every test environment where the database is actually running. It only fails in the exact production scenario it was designed to protect against.&lt;/p&gt;
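
&lt;p&gt;A framework-agnostic sketch of a health check that reports dependency failures might look like the following. It assumes a hypothetical &lt;code&gt;check_db&lt;/code&gt; callable that raises when the database is unreachable. In a FastAPI handler the returned status code would be set on the response. This is an illustration, not the application's actual endpoint.&lt;/p&gt;

```python
from http import HTTPStatus
from typing import Callable


def health(check_db: Callable[[], None]) -> tuple[int, dict]:
    """Return (status_code, body) for a health probe.

    check_db is a hypothetical callable that raises if the database
    cannot be reached, for example by running a trivial query.
    """
    try:
        check_db()
    except Exception:
        # Report the failure so the load balancer stops routing traffic here.
        return HTTPStatus.SERVICE_UNAVAILABLE.value, {"status": "unavailable"}
    return HTTPStatus.OK.value, {"status": "ok"}
```

&lt;p&gt;Passing the dependency check in as a callable also makes the failure path easy to test: a test can hand in a function that raises, without needing to take a real database offline.&lt;/p&gt;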

&lt;h2&gt;
  
  
  Exception details returned to clients
&lt;/h2&gt;

&lt;p&gt;Error handlers used &lt;code&gt;str(e)&lt;/code&gt; as the response body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Database connection strings, file paths, internal state. All returned directly to whoever triggered the error. In a security audit this is an information disclosure finding. In a FastAPI app behind an API gateway, it's a path to mapping internal infrastructure.&lt;/p&gt;
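
&lt;p&gt;One common way to close this leak is to log the full exception on the server and return only a generic message to the client. The sketch below assumes that approach. The &lt;code&gt;safe_error_body&lt;/code&gt; helper, the logger name, and the &lt;code&gt;error_id&lt;/code&gt; correlation field are hypothetical and are not taken from the application.&lt;/p&gt;

```python
import logging
import uuid

logger = logging.getLogger("app.errors")


def safe_error_body(exc: Exception) -> dict:
    """Log the full exception server-side and return a generic client body."""
    # Hypothetical correlation id so support can find the matching log entry.
    error_id = uuid.uuid4().hex
    # exc_info attaches the traceback to the log record, not to the response.
    logger.error("request failed [error_id=%s]", error_id, exc_info=exc)
    return {"error": "Internal server error", "error_id": error_id}
```

&lt;p&gt;An error handler would call &lt;code&gt;safe_error_body(e)&lt;/code&gt; instead of returning &lt;code&gt;str(e)&lt;/code&gt;, ideally alongside a 500 status code, so the details stay in the logs.&lt;/p&gt;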

&lt;h2&gt;
  
  
  Deprecated datetime API
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;datetime.utcnow()&lt;/code&gt; has been deprecated since Python 3.12. The replacement is &lt;code&gt;datetime.now(timezone.utc)&lt;/code&gt;. The agent also used &lt;code&gt;time.time()&lt;/code&gt; for uptime tracking, which is affected by NTP clock adjustments and can report negative uptime if the system clock steps backward. &lt;code&gt;time.monotonic()&lt;/code&gt; exists specifically for this case.&lt;/p&gt;

&lt;p&gt;Neither of these will cause a production outage today. Both are the kind of technical debt that accumulates when generated code isn't checked against current language standards.&lt;/p&gt;
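
&lt;p&gt;Both replacements are in the standard library. A minimal sketch, with hypothetical helper names &lt;code&gt;utc_now&lt;/code&gt; and &lt;code&gt;uptime_seconds&lt;/code&gt;:&lt;/p&gt;

```python
import time
from datetime import datetime, timezone

# Captured once at startup; the monotonic clock never goes backward.
_STARTED = time.monotonic()


def utc_now() -> datetime:
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


def uptime_seconds() -> float:
    """Seconds since startup, unaffected by NTP or manual clock changes."""
    return time.monotonic() - _STARTED
```

&lt;p&gt;Note that &lt;code&gt;time.monotonic()&lt;/code&gt; values are only meaningful as differences, so the start value should be kept in the same process rather than persisted.&lt;/p&gt;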

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;None of these bugs required sophisticated analysis to find. They're patterns: unescaped user input in templates, unconditional success responses in health checks, raw exception strings in error responses, deprecated stdlib usage. Each one is a known category with a known fix.&lt;/p&gt;

&lt;p&gt;The problem is attention. A general-purpose agent optimizing for "make this feature work" doesn't allocate attention to these categories unless explicitly prompted. The feature works. The tests pass. The agent moves on.&lt;/p&gt;

&lt;p&gt;This is where orchestration changes the economics. Instead of one agent covering everything, specialized agents with narrow scopes check specific categories. A security auditor scans for injection and information disclosure. A runtime checker validates health endpoint semantics. Each agent's prompt is focused enough that known bug patterns get caught.&lt;/p&gt;

&lt;p&gt;The alternative is what most developers do today: manually reprompt. "Now check for XSS." "Now add proper error handling." "Now fix the health check to actually check health." We measured this on the same codebase: it took 14 follow-up prompts to bring the standalone agent's output up to the level of the orchestrated run. Each prompt required reading the previous output, identifying what was wrong, and writing a specific correction. About 45 minutes of continuous supervision.&lt;/p&gt;

&lt;p&gt;The orchestrated run took 22 minutes, unattended. It used 7 premium requests versus 15 for the manual approach. Zero human review cycles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swarm Orchestrator v5.0.0
&lt;/h2&gt;

&lt;p&gt;The tool that caught these is open source. It wraps existing agent CLIs (Copilot, Claude Code, Codex) and adds verification, quality gates, and parallel execution. It doesn't generate code. It delegates code generation and verifies the output against outcome-based checks: git diff, build success, test pass, runtime behavior.&lt;/p&gt;

&lt;p&gt;v5.0.0 adds three features relevant to this problem:&lt;/p&gt;

&lt;p&gt;Spec-aware planning reads the quality gate configuration before generating agent prompts. Security requirements, test coverage thresholds, and configuration standards get injected before agents write code, not discovered through iteration afterward.&lt;/p&gt;

&lt;p&gt;SARIF output exports quality gate violations as SARIF 2.1.0 JSON compatible with GitHub code scanning. Same PR annotation workflow teams already use for CodeQL.&lt;/p&gt;

&lt;p&gt;Per-project gate configuration via &lt;code&gt;.swarm/gates.yaml&lt;/code&gt; lets teams override thresholds and disable gates that don't apply to their project type.&lt;/p&gt;
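
&lt;p&gt;For readers who haven't seen the SARIF format mentioned above, the sketch below builds a minimal SARIF 2.1.0 log with a single result. The tool name, rule id, message, and file path are made-up placeholders. This shows the general shape of the format, not Swarm Orchestrator's actual output.&lt;/p&gt;

```python
import json


def sarif_log(tool_name: str, rule_id: str, message: str, uri: str, line: int) -> dict:
    """Build a minimal SARIF 2.1.0 log containing a single result."""
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {"driver": {"name": tool_name, "rules": [{"id": rule_id}]}},
                "results": [
                    {
                        "ruleId": rule_id,
                        "level": "error",
                        "message": {"text": message},
                        "locations": [
                            {
                                "physicalLocation": {
                                    "artifactLocation": {"uri": uri},
                                    "region": {"startLine": line},
                                }
                            }
                        ],
                    }
                ],
            }
        ],
    }


# All values below are illustrative placeholders, not real tool output.
example = sarif_log(
    "example-gate", "unescaped-html", "User input rendered without escaping",
    "app/views.py", 42,
)
print(json.dumps(example, indent=2))
```

&lt;p&gt;The file path and line number in each result are what allow a code scanning service to attach the finding to a specific line of a pull request.&lt;/p&gt;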

&lt;p&gt;1,386 passing tests, 84 source files, 7 documented benchmarks. The release notes include commit hashes for every bug fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Swarm Orchestrator on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;What categories of bugs do you consistently find in AI-generated code that could be caught by a specialized check rather than manual review?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
