<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brad Kinnard</title>
    <description>The latest articles on DEV Community by Brad Kinnard (@moonrunnerkc).</description>
    <link>https://dev.to/moonrunnerkc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3727405%2Fdace59d9-5970-49b1-9ee7-0836891c5a65.png</url>
      <title>DEV Community: Brad Kinnard</title>
      <link>https://dev.to/moonrunnerkc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/moonrunnerkc"/>
    <language>en</language>
    <item>
      <title>How Swarm Orchestrator v8 Tries to Break Its Own AI Patches</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 10 May 2026 02:10:05 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/how-swarm-orchestrator-v8-tries-to-break-its-own-ai-patches-2513</link>
      <guid>https://dev.to/moonrunnerkc/how-swarm-orchestrator-v8-tries-to-break-its-own-ai-patches-2513</guid>
      <description>&lt;p&gt;Most AI coding tools commit when their own checks pass. Swarm Orchestrator v8 adds a second adversarial layer: independent falsifier adapters that try to break each patch before it merges. v8.0.1 is on &lt;code&gt;main&lt;/code&gt; with that subsystem on by default.&lt;/p&gt;

&lt;p&gt;This post walks through the v8 architecture, the four verification points, the producer/falsifier adapter split, and the limitations that haven't been solved in v8.0 yet.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;What is Swarm Orchestrator?&lt;/strong&gt; A contract-first AI coding swarm with hash-chained evidence and verifier-gated commits. It compiles a natural-language goal into a typed contract, dispatches it to a population of personas inside one cached Anthropic session, races candidate diffs per obligation, and commits only what passes verification. It wraps an LLM; it doesn't replace one.&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  The shape of a run
&lt;/h2&gt;

&lt;p&gt;You hand it a goal in plain English. The contract compiler turns that into &lt;code&gt;contract.jsonl&lt;/code&gt; plus a &lt;code&gt;manifest.json&lt;/code&gt; carrying the goal, repo context, extractor provenance, and a SHA-256 of the canonical contract bytes. Identical inputs produce identical contract hashes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;goal (text)
   |
   v
contract compiler  -&amp;gt;  contract.jsonl + manifest.json
   |
   v
+-------------------------------------------------+
|        population manager (single session)      |
|                                                 |
|  ledger (jsonl, hash-chain) &amp;lt;- personas (8)     |
|       ^                          |              |
|       | tournament + verifier scoring           |
|       |                                         |
|  WASM deterministic floor (zero-LLM obligs)     |
+-------------------------------------------------+
   |                              |
   v                              v
streaming verifier      post-merge integration
   |                              |
   +--------------+---------------+
                  v
       falsifier adapters (Codex, Copilot)
                  |
                  v
            committed diffs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
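&lt;p&gt;For reference, the manifest carries roughly this shape. This is an illustrative sketch, not the real schema; only the fields named above (goal, repo context, extractor provenance, contract hash) are grounded, and the exact key names are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative manifest shape -- key names are assumptions;
// the field list comes from the post above.
const manifest = {
  goal: "add a /health endpoint that returns 200 OK",
  repoContext: { root: ".", branch: "main" },     // assumed keys
  extractorProvenance: { extractor: "default" },  // assumed keys
  contractHash: "sha256:&amp;lt;canonical-contract-bytes-hash&amp;gt;",
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;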


&lt;p&gt;The population manager opens one cached Anthropic session and walks each obligation. It picks the persona whose trigger predicate matches the obligation type. In tournament mode, N candidates run in parallel; the verifier scores them, the top scorer is a commit candidate, and losers get logged but never merge.&lt;/p&gt;
&lt;h2&gt;
  
  
  Two adapter subsystems
&lt;/h2&gt;

&lt;p&gt;The most common confusion in v6 was treating the coding CLIs and the falsifiers as one thing. v8 splits them cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producer adapters&lt;/strong&gt; (&lt;code&gt;src/adapters/&lt;/code&gt;) wrap third-party coding CLIs as the worker in the v6 verified-branch pipeline. Backends: Copilot, Claude Code, Codex, Claude Code Teams. All four are opt-in via &lt;code&gt;swarm run --v6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falsifier adapters&lt;/strong&gt; (&lt;code&gt;src/falsification/adapters/&lt;/code&gt;) take a patch the producer's verifier already accepted and try to falsify the obligation by surfacing a counter-example, regression fixture, or property-violation trace. A confirmed counter-example flips the obligation back to &lt;code&gt;failed&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Falsifier&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Obligation types&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CodexFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;&lt;code&gt;property-must-hold&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CopilotFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;import-graph-must-satisfy&lt;/code&gt;, &lt;code&gt;function-must-have-signature&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ClaudeCodeFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;off (per-adapter opt-in)&lt;/td&gt;
&lt;td&gt;all three&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CLI surface is one flag: &lt;code&gt;--falsifiers &amp;lt;on|off&amp;gt;&lt;/code&gt; (default on). Per-adapter selection happens at the API layer via &lt;code&gt;defaultAdapterRegistry({ includeCopilot, includeClaudeCode })&lt;/code&gt;.&lt;/p&gt;
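&lt;p&gt;A minimal sketch of that API-layer call. The function name and option names come straight from above; the import path and how you'd consume the registry are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch only: the import path is an assumption.
import { defaultAdapterRegistry } from "swarm-orchestrator";

const registry = defaultAdapterRegistry({
  includeCopilot: true,      // CopilotFalsifier (on by default)
  includeClaudeCode: false,  // ClaudeCodeFalsifier (per-adapter opt-in)
});
// CodexFalsifier is on by default and needs no flag here.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;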
&lt;h2&gt;
  
  
  Four verification points
&lt;/h2&gt;

&lt;p&gt;A patch has to survive these before it merges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-generation memoization.&lt;/strong&gt; Skip generation if the obligation result is already cached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-stream abort.&lt;/strong&gt; During generation, the streaming verifier can abort the call. Works in &lt;code&gt;--mode single&lt;/code&gt; only; tournament mode skips it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-generation per-obligation verifier.&lt;/strong&gt; Scores the candidate diff. In tournament mode, top scorer wins; in single mode it's pass/fail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-merge integration check.&lt;/strong&gt; After the diff lands, the integration check confirms the broader system still holds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The architectural rule from the README: nothing commits without passing the obligation's verifier. Then the falsifiers get a shot.&lt;/p&gt;
&lt;h2&gt;
  
  
  The hash-chained ledger
&lt;/h2&gt;

&lt;p&gt;Every action lands in &lt;code&gt;.swarm/ledger/&amp;lt;run-id&amp;gt;.jsonl&lt;/code&gt; with the SHA of the prior entry. Tampering is detectable; runs resume from any prior state. If a process is killed mid-run, &lt;code&gt;swarm v8 resume &amp;lt;run-id&amp;gt;&lt;/code&gt; walks the ledger and picks up where it left off.&lt;/p&gt;

&lt;p&gt;The ledger format is shared with v6, but v8 writes more granular events (per-persona dispatch, per-candidate score, falsifier verdict) so a run can be replayed or audited end-to-end.&lt;/p&gt;
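&lt;p&gt;To make the chaining concrete, here's the shape of the idea. The exact field names aren't documented in this post, so treat these as placeholders; the invariant is what matters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative only -- field names are assumptions.
// Each entry carries the SHA-256 of the prior entry, so editing
// any historical line breaks every hash after it.
type LedgerEntry = {
  event: string;      // e.g. persona dispatch, candidate score, falsifier verdict
  payload: unknown;
  prevSha256: string; // hash of the previous entry's bytes
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;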
&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
&lt;span class="nb"&gt;cd &lt;/span&gt;swarm-orchestrator &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;link&lt;/span&gt;

&lt;span class="c"&gt;# Compile a goal, then run it&lt;/span&gt;
swarm v8 compile &lt;span class="s2"&gt;"add a /health endpoint that returns 200 OK"&lt;/span&gt; &lt;span class="nt"&gt;--yes&lt;/span&gt;
swarm v8 run .swarm/contracts/&amp;lt;contract-id&amp;gt;

&lt;span class="c"&gt;# Or both in one step (defaults to v8)&lt;/span&gt;
swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"add a /health endpoint that returns 200 OK"&lt;/span&gt;

&lt;span class="c"&gt;# Resume a killed run&lt;/span&gt;
swarm v8 resume &amp;lt;run-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Node &amp;gt;= 20, git &amp;gt;= 2.40, and &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;. Pass &lt;code&gt;--extractor stub --session stub&lt;/code&gt; to run offline.&lt;/p&gt;

&lt;p&gt;There's also a GitHub Action:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/swarm-orchestrator@v8&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;endpoint"&lt;/span&gt;
    &lt;span class="na"&gt;contract-only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;cost-cap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.00"&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  What v8.0 doesn't do
&lt;/h2&gt;

&lt;p&gt;Limitations worth reading before adopting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tournament mode doesn't stream.&lt;/strong&gt; &lt;code&gt;--mode tournament&lt;/code&gt; plus &lt;code&gt;--forbid-import&lt;/code&gt; skips the streaming abort; streaming verification is &lt;code&gt;--mode single&lt;/code&gt; only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-merge failure doesn't auto-rollback.&lt;/strong&gt; The run is marked failed; per-obligation worktree snapshots are post-v8.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--cost-cap&lt;/code&gt; is enforced post-obligation, not mid-call.&lt;/strong&gt; Cumulative spend is checked at the end of each obligation against estimated Sonnet 4 pricing. Exit code 6 if exceeded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandit dispatch is not built (Phase 5).&lt;/strong&gt; Codex and Copilot have disjoint obligation types, so there's nothing to arbitrate between yet.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-vendor producer race is deferred (Phase 6).&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full list with rationale lives in &lt;code&gt;docs/v8-architecture-deviations.md&lt;/code&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Contract-first AI coding swarm with hash-chained evidence. Compiles a goal into typed obligations, races persona candidates per obligation in a single cached inference session, verifies before commit, and logs every action in an append-only ledger.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/assets/header.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fassets%2Fheader.svg" alt="Swarm Orchestrator" width="100%"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Contract-first AI coding swarm with hash-chained evidence and verifier-gated commits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fc208599ef300dfbb7d7b65c32d4e1364b62c8c0bd3cc6df8a16615f7ccd9991/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4953432d626c75653f7374796c653d666c61742d737175617265" alt="License"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/37e12b341829a2c53b69b36b6fe5a9a4f42cf56b82722fac6b5011085a3749e6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d25334525334432302d3333393933333f7374796c653d666c61742d737175617265" alt="Node"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3f31914b57bc82fa5dcbe2b429e1d486362f11bfa4411282ff311f2885102e19/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f616374696f6e732f776f726b666c6f772f7374617475732f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f722f63692e796d6c3f6272616e63683d6d61696e266c6162656c3d6369267374796c653d666c61742d737175617265" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/4aea3e72d83f2dd34ce19e0393e3a10766f325d0d6356fc71c7c190676acf5e2/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f7061636b6167652d6a736f6e2f762f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f723f7374796c653d666c61742d737175617265" alt="Version"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;swarm&lt;/code&gt; compiles a natural-language goal into a typed contract, dispatches it to a
population of personas inside one cached Anthropic session, races candidate diffs per
obligation, and commits only the diffs that pass verification. After the producer's
verifier accepts a patch, registered falsifier adapters get a chance to break it
before it merges. Every action lands in an append-only hash-chained ledger you can
audit, resume, or replay.&lt;/p&gt;
&lt;p&gt;It wraps an LLM; it does not replace one. The model writes the code, the orchestrator
decides what reaches your repo.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Status&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Version &lt;code&gt;8.0.1&lt;/code&gt; on &lt;code&gt;main&lt;/code&gt;. Node &lt;code&gt;&amp;gt;= 20&lt;/code&gt; (CI matrix: 20, 22). License ISC. The v8
architecture is the default for &lt;code&gt;swarm run&lt;/code&gt;; the v6 verified-branch pipeline is
preserved under &lt;code&gt;swarm run --v6&lt;/code&gt; and the &lt;code&gt;swarm swarm&lt;/code&gt; / &lt;code&gt;swarm execute&lt;/code&gt; commands.
Falsifier subsystem: Codex on, Copilot on, ClaudeCode…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How to Write a CLAUDE.md Rule That Actually Gets Enforced</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 10 May 2026 02:06:45 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/how-to-write-a-claudemd-rule-that-actually-gets-enforced-3npa</link>
      <guid>https://dev.to/moonrunnerkc/how-to-write-a-claudemd-rule-that-actually-gets-enforced-3npa</guid>
      <description>&lt;p&gt;Open a CLAUDE.md file at random and you'll find build commands, architecture notes, and rules. The rules tend to be the unenforceable kind. "Write clean code." "Be careful with types." "Follow our conventions." The author meant every word. The agent reads them. And nothing checks whether the agent followed them, because nothing can.&lt;/p&gt;

&lt;p&gt;In a corpus of 580 CLAUDE.md, AGENTS.md, and &lt;code&gt;.cursorrules&lt;/code&gt; files from public GitHub repos with 10+ stars, &lt;strong&gt;74% contained zero machine-extractable rules&lt;/strong&gt;. Not because the authors didn't care about rules. Because most rules were written in a form no parser could pull out as a deterministic check.&lt;/p&gt;

&lt;p&gt;This post is about the difference. Specifically: how to phrase a rule so a parser can extract it and a verifier can check it, without sacrificing what you actually meant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;Enforceability comes from a verifiable surface. A rule is verifiable when there's a concrete pattern in code that either matches it or doesn't. "Use &lt;code&gt;camelCase&lt;/code&gt; for function names" is verifiable: read the AST, list the function names, check the casing. "Name things consistently" is not: there's no concrete pattern to check, only a judgment to make.&lt;/p&gt;

&lt;p&gt;The gap between the two is the gap between intent and enforcement. You meant the same thing in both cases. But only one of them survives translation into a check.&lt;/p&gt;

&lt;p&gt;Here's the heuristic I use: &lt;strong&gt;could a junior engineer with no context mechanically check whether code follows this rule, just by reading the rule and looking at the code?&lt;/strong&gt; If yes, the rule is enforceable. If they'd have to ask "what does 'consistent' mean here?", it isn't.&lt;/p&gt;
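&lt;p&gt;To see how mechanical the check can be, here's a minimal sketch of the camelCase rule against the TypeScript AST. The compiler API calls are real; the rule encoding is mine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch: list function names from the AST, test the casing.
import * as ts from "typescript";

function findBadNames(source: string): string[] {
  const sf = ts.createSourceFile("x.ts", source, ts.ScriptTarget.Latest, true);
  const bad: string[] = [];
  const visit = (node: ts.Node) =&amp;gt; {
    if (ts.isFunctionDeclaration(node) &amp;amp;&amp;amp; node.name &amp;amp;&amp;amp;
        !/^[a-z][A-Za-z0-9]*$/.test(node.name.text)) {
      bad.push(node.name.text);
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);
  return bad;
}

findBadNames("function Do_Thing() {}"); // =&amp;gt; ["Do_Thing"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;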

&lt;h2&gt;
  
  
  Three pairs, walked through
&lt;/h2&gt;

&lt;p&gt;Take a few common intents and look at how they fail or succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type safety.&lt;/strong&gt; You want strong typing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Be careful with types.
Good: No `any` types in `src/`. Async functions require explicit return types.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bad version has no surface. "Careful" isn't a check. The good version has two: a forbidden token (&lt;code&gt;any&lt;/code&gt;) and a structural property (return type annotation on async function declarations). Both check directly against the AST.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Module structure.&lt;/strong&gt; You want predictable imports and exports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Prefer clean module structure.
Good: Named exports only. No default exports. Filenames in kebab-case.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Clean" is meaningless to a parser. Named-only and kebab-case are both binary properties of code that exist or don't. The first version sounds like more guidance because it's broader, but breadth is the problem: it covers everything and enforces nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preferences over alternatives.&lt;/strong&gt; You want React functional components, not class components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Write modern React.
Good: Prefer functional components over class components.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Modern" is a moving target with no fixed surface. The "prefer X over Y" pattern, on the other hand, has a clean check: count instances of each, compute a ratio, score against a threshold. This is one of the most useful patterns in instruction files because it captures real-world preference (not absolute prohibition) in a measurable way.&lt;/p&gt;
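&lt;p&gt;A sketch of that scoring shape, with crude regexes standing in for real AST detection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch only: real tools would walk the AST, not grep.
function preferenceScore(source: string): number {
  const fnComponents = (source.match(/\bfunction [A-Z]\w*\s*\(/g) ?? []).length;
  const classComponents =
    (source.match(/\bclass [A-Z]\w* extends (React\.)?(Pure)?Component\b/g) ?? []).length;
  const total = fnComponents + classComponents;
  return total === 0 ? 1 : fnComponents / total; // score against a threshold
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;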

&lt;h2&gt;
  
  
  The reference table
&lt;/h2&gt;

&lt;p&gt;Twelve common intents, paired:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;Unenforceable&lt;/th&gt;
&lt;th&gt;Enforceable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naming functions&lt;/td&gt;
&lt;td&gt;Name things consistently&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;camelCase&lt;/code&gt; for function names&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filenames&lt;/td&gt;
&lt;td&gt;Pick reasonable filenames&lt;/td&gt;
&lt;td&gt;All filenames in kebab-case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type safety&lt;/td&gt;
&lt;td&gt;Be careful with types&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;any&lt;/code&gt; types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async return types&lt;/td&gt;
&lt;td&gt;Make types clear&lt;/td&gt;
&lt;td&gt;Async functions require explicit return types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Module exports&lt;/td&gt;
&lt;td&gt;Prefer clean module structure&lt;/td&gt;
&lt;td&gt;Named exports only, no default exports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size&lt;/td&gt;
&lt;td&gt;Keep files manageable&lt;/td&gt;
&lt;td&gt;Maximum 300 lines per file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;Be mindful of logging&lt;/td&gt;
&lt;td&gt;Never use &lt;code&gt;console.log&lt;/code&gt;; use &lt;code&gt;src/logger.ts&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Component style&lt;/td&gt;
&lt;td&gt;Write modern React&lt;/td&gt;
&lt;td&gt;Prefer functional components over class components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package manager&lt;/td&gt;
&lt;td&gt;Use the right package manager&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;pnpm&lt;/code&gt;, not &lt;code&gt;npm&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test files&lt;/td&gt;
&lt;td&gt;Keep tests organized&lt;/td&gt;
&lt;td&gt;All test files end with &lt;code&gt;.test.ts&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error handling&lt;/td&gt;
&lt;td&gt;Handle errors properly&lt;/td&gt;
&lt;td&gt;Async functions must use &lt;code&gt;try/catch&lt;/code&gt; or return a &lt;code&gt;Result&lt;/code&gt; type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit format&lt;/td&gt;
&lt;td&gt;Write clear commit messages&lt;/td&gt;
&lt;td&gt;Use conventional commits (&lt;code&gt;feat:&lt;/code&gt;, &lt;code&gt;fix:&lt;/code&gt;, &lt;code&gt;chore:&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every right-hand cell points at a concrete check: a token, a casing rule, a count, a file pattern, a configured tool. Every left-hand cell points at a judgment.&lt;/p&gt;

&lt;p&gt;Real-world rules usually carry scope: "no &lt;code&gt;any&lt;/code&gt; in &lt;code&gt;src/&lt;/code&gt;," "named exports outside &lt;code&gt;index.ts&lt;/code&gt; files," "no &lt;code&gt;console.log&lt;/code&gt; in production code paths." Scope makes a rule narrower and more accurate without making it less enforceable. The interop layer that genuinely needs &lt;code&gt;any&lt;/code&gt; keeps it; the rest of the codebase doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kinds of checks exist
&lt;/h2&gt;

&lt;p&gt;Worth knowing what's available, because it shapes what's writable. Static analysis tools targeting AI instruction files generally support a few classes of check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AST-level&lt;/strong&gt;: function names, type annotations, import patterns, forbidden tokens, structural properties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filesystem&lt;/strong&gt;: file existence, naming conventions, directory layout, file size limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex&lt;/strong&gt;: literal strings, content patterns, conventional formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling&lt;/strong&gt;: presence and configuration of linters, formatters, package managers, test runners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config-file&lt;/strong&gt;: contents of &lt;code&gt;.eslintrc&lt;/code&gt;, &lt;code&gt;tsconfig.json&lt;/code&gt;, &lt;code&gt;.prettierrc&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git-history&lt;/strong&gt;: commit message formats, branch naming conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preference ratios&lt;/strong&gt;: "prefer X over Y" with a compliance percentage instead of a binary verdict&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your rule maps to one of these classes, it's enforceable. If it doesn't, it isn't. The trick when writing instruction files is to keep that map in mind: when you're about to write "be careful with X", ask which of these classes "carefulness with X" lives in. Usually the answer points at a concrete reformulation.&lt;/p&gt;
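&lt;p&gt;For instance, "be mindful of logging" reformulated into the regex class is nearly a one-liner. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: the enforceable version of "be mindful of logging".
import { readFileSync } from "node:fs";

function violatesNoConsoleLog(path: string): boolean {
  return /\bconsole\.log\(/.test(readFileSync(path, "utf8"));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;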

&lt;h2&gt;
  
  
  The unenforceable rules aren't worthless
&lt;/h2&gt;

&lt;p&gt;Here's a real tension: most of what makes a CLAUDE.md useful isn't enforceable at all. Project context (what the repo does, where the architecture lives), agent behavior directives (be succinct, ask before deleting, don't touch &lt;code&gt;/legacy&lt;/code&gt;), and onboarding instructions are all valuable. None of them extract as rules.&lt;/p&gt;

&lt;p&gt;Don't try to make them enforceable. They're a different kind of content with a different purpose. Project context grounds the agent. Behavior directives shape its style. Neither is supposed to be checked against output; they're checked against the agent's process, which is a different problem.&lt;/p&gt;

&lt;p&gt;The mistake worth avoiding is letting unenforceable prose crowd out enforceable rules. Anthropic's Claude Code best practices doc recommends deleting any instruction the model already follows correctly without it. Most "write clean code" style rules fail that test: the model already does its version of clean code, so the line is taking up attention budget your agent could be spending on the specific, verifiable rules that actually distinguish your codebase from a generic project. Cut what the model already does. Keep the checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  A test for your own files
&lt;/h2&gt;

&lt;p&gt;Pull up your CLAUDE.md or AGENTS.md right now. For each line that looks like a rule, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Could a junior engineer check this without asking clarifying questions?&lt;/li&gt;
&lt;li&gt;Does it name a specific pattern, file, token, casing, or value?&lt;/li&gt;
&lt;li&gt;Would 5 different reviewers all agree on whether a piece of code passes this rule?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a rule fails 1 or 2, it's not a rule, it's a wish. If it fails 3, it's ambiguous. Rewrite or delete.&lt;/p&gt;

&lt;p&gt;If you want a mechanical version of this test, &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;RuleProbe&lt;/a&gt; parses CLAUDE.md, AGENTS.md, &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;.windsurfrules&lt;/code&gt;, GEMINI.md, and &lt;code&gt;copilot-instructions.md&lt;/code&gt; against 102 matchers and tells you which lines extracted as rules and which didn't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; ruleprobe
ruleprobe parse ./CLAUDE.md &lt;span class="nt"&gt;--show-unparseable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--show-unparseable&lt;/code&gt; flag is the interesting one. It surfaces every line that looked rule-shaped but didn't map to a check. That list is your rewrite queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/ruleprobe" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;RuleProbe on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What this leaves out
&lt;/h2&gt;

&lt;p&gt;The hardest case is rules like "follow the existing error handling pattern in this codebase." That's enforceable in principle (compare new code's structural shape against the codebase's dominant pattern), but not by simple AST or regex matching. It needs codebase-aware analysis. Some tools handle that; most don't. If you find yourself writing those kinds of rules, know that they'll either need a tool that does pattern profiling or they'll stay aspirational.&lt;/p&gt;

&lt;p&gt;The other thing enforceability doesn't catch: an agent that follows every rule and still writes broken code. Static rules reduce variance, they don't eliminate it. A function with &lt;code&gt;any&lt;/code&gt; removed and an explicit return type can still have wrong logic. Treat passing rule checks as a floor, not a ceiling.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>I Dropped Multi-Agent Coordination for a 5-Layer Falsification Battery</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sat, 02 May 2026 15:00:11 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/i-dropped-multi-agent-coordination-for-a-5-layer-falsification-battery-48cb</link>
      <guid>https://dev.to/moonrunnerkc/i-dropped-multi-agent-coordination-for-a-5-layer-falsification-battery-48cb</guid>
      <description>&lt;p&gt;Swarm Orchestrator just lost its swarm. Dropped the multi-agent parallel coordination layer. Running one agent now and putting all the weight on a five-layer post-merge falsification battery instead.&lt;/p&gt;

&lt;p&gt;This is an experiment, not an endpoint. v8 will bring proper multi-agent swarming back. The reason for cutting it temporarily: I want to know whether the value I was getting from coordinated parallel agents was the coordination itself, or the verification pressure that coordination produced. Easier to measure with one variable. Intended side effect: cost reduction, since the previous architecture spun up multiple CLI agent instances per run. Real benchmarks pending.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;TL;DR&lt;/strong&gt;: every patch survives a five-layer post-merge battery before the orchestrator declares success. Layers 1 and 2 are hard gates. Layers 3, 4, 5 are advisory and feed a composite score. Hard-gate failure throws before attestation, before final gates, before any external success signal.&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  Pipeline Order
&lt;/h2&gt;

&lt;p&gt;The battery runs once per orchestrator execution against the merged working tree, not per-step branches. The per-step verifier is a separate component. Layers fire in fixed order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Differential gate (hard)&lt;/li&gt;
&lt;li&gt;Mutation gate (hard)&lt;/li&gt;
&lt;li&gt;Cheat detector (advisory)&lt;/li&gt;
&lt;li&gt;Property gate (advisory)&lt;/li&gt;
&lt;li&gt;Attestation (advisory on first run, signed after)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the hard gate fails, the composite is forced to &lt;code&gt;0&lt;/code&gt; and the orchestrator throws &lt;code&gt;falsification battery blocked the patch&lt;/code&gt; before any external success signal can fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Differential Gate (Hard)
&lt;/h2&gt;

&lt;p&gt;Before any agent touches the repo, a synthesizer generates a regression test against the goal. Layer 1 then runs that test in two detached worktrees: one at the base commit, one at the patch commit.&lt;/p&gt;

&lt;p&gt;The contract: the test must fail at base and pass at patch.&lt;/p&gt;

&lt;p&gt;If the test passes at base, the layer returns &lt;code&gt;INVALID_TEST&lt;/code&gt;. This catches the specific failure mode where an agent writes a tautological test that passes against any code. Without this gate, that pattern slips through every other check downstream.&lt;/p&gt;

&lt;p&gt;If no command can be synthesized and the caller doesn't pass &lt;code&gt;--differentialTestCommand&lt;/code&gt;, the layer fails closed. Deliberate policy.&lt;/p&gt;
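&lt;p&gt;A sketch of the layer's contract. The two-worktree mechanics below are plain git; the function names and verdict strings just wrap what the post describes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch, not the real implementation.
import { execSync } from "node:child_process";

function runsGreen(commit: string, testCmd: string): boolean {
  const dir = `/tmp/diffgate-${commit.slice(0, 8)}`;
  execSync(`git worktree add --detach ${dir} ${commit}`);
  try {
    execSync(testCmd, { cwd: dir, stdio: "ignore" });
    return true;
  } catch {
    return false;
  } finally {
    execSync(`git worktree remove --force ${dir}`);
  }
}

function differentialGate(base: string, patch: string, testCmd: string): string {
  if (runsGreen(base, testCmd)) return "INVALID_TEST"; // tautological test
  return runsGreen(patch, testCmd) ? "PASS" : "FAIL";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;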

&lt;h2&gt;
  
  
  Layer 2: Mutation Gate (Hard)
&lt;/h2&gt;

&lt;p&gt;Runs Stryker for JS/TS, mutmut for Python, PITest (Gradle/Maven) for Java, against changed files only. First runs the regression command; if that fails, the layer fails immediately. On pass, dispatches the mutation tool and parses the reporter output, including Stryker's clear-text table layout.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.6&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.8&lt;/td&gt;
&lt;td&gt;WARNING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;≥ 0.8&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Patches with no meaningful test coverage of the changed lines fail here even when &lt;code&gt;npm test&lt;/code&gt; is green.&lt;/p&gt;
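&lt;p&gt;The thresholds above, as code (these are the defaults; &lt;code&gt;.swarm/gates.yaml&lt;/code&gt; can override them):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;function mutationStatus(score: number): "FAIL" | "WARNING" | "PASS" {
  if (score &amp;lt; 0.6) return "FAIL";
  if (score &amp;lt; 0.8) return "WARNING";
  return "PASS";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;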

&lt;h2&gt;
  
  
  Layer 3: Cheat Detector (Advisory)
&lt;/h2&gt;

&lt;p&gt;Diff-static scan for known agent-shortcut patterns. Each rule maps to a real category of behaviour I've watched agents fall into under pressure.&lt;/p&gt;

&lt;p&gt;Cheat detector rule IDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;test-modification&lt;/code&gt;: agent edited the test instead of the implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;complexity-mismatch&lt;/code&gt;: trivial implementation paired with a non-trivial test&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mock-mutation&lt;/code&gt;: real call replaced with a mock that returns the expected value&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;exception-swallowing&lt;/code&gt;: caught and ignored exceptions to make tests pass&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hardcoded-answer&lt;/code&gt;: literal return of the expected output
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Optional Semgrep integration loads any registered cheat-rule packs into a transient config. The scan matches patterns against added and removed lines; it does not execute the patch. An agent that hides a hardcoded answer behind a small computation can still pass this layer. Honest about that.&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer 4: Property Gate (Advisory)
&lt;/h2&gt;

&lt;p&gt;Discovers modified TS/JS/Python functions, parses their parameter types, maps each to a fast-check arbitrary or Hypothesis strategy, generates a harness, runs it. Counterexamples surface as findings. Untyped or unsupported types degrade to a low-severity advisory finding rather than blocking.&lt;/p&gt;
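&lt;p&gt;The generated harness amounts to something like this fast-check sketch. &lt;code&gt;fc.assert&lt;/code&gt;, &lt;code&gt;fc.property&lt;/code&gt;, and &lt;code&gt;fc.pre&lt;/code&gt; are the real API; the target function and the property are invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import fc from "fast-check";

// Hypothetical function under test.
const clamp = (n: number, lo: number, hi: number) =&amp;gt;
  Math.min(Math.max(n, lo), hi);

fc.assert(
  fc.property(fc.integer(), fc.integer(), fc.integer(), (n, lo, hi) =&amp;gt; {
    fc.pre(lo &amp;lt;= hi);              // skip invalid inputs
    const out = clamp(n, lo, hi);
    return out &amp;gt;= lo &amp;amp;&amp;amp; out &amp;lt;= hi; // property: result stays in range
  })
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;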
&lt;h2&gt;
  
  
  Layer 5: Attestation (Advisory on First Run)
&lt;/h2&gt;

&lt;p&gt;Reads the &lt;code&gt;refs/notes/swarm-attestation&lt;/code&gt; git note for the patch commit, validates the in-toto SLSA v1.0 envelope's subject SHA against the patch commit, then verifies the cosign signature. On the first run for a commit there's no note yet, so this layer reports advisory-warn and the post-battery attestation step writes the note.&lt;/p&gt;

&lt;p&gt;The note is verifiable later via &lt;code&gt;swarm attest verify &amp;lt;commit&amp;gt;&lt;/code&gt;. A downstream consumer can verify the patch survived the battery without trusting the running orchestrator process.&lt;/p&gt;
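&lt;p&gt;Without the CLI, the raw note is reachable with plain git; the envelope inside is the in-toto SLSA statement described above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Reads the raw attestation note for a commit. Signature
// verification (cosign) is what `swarm attest verify` adds on top.
import { execSync } from "node:child_process";

function readAttestation(commit: string): string {
  return execSync(`git notes --ref=swarm-attestation show ${commit}`, {
    encoding: "utf8",
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;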
&lt;h2&gt;
  
  
  Composite Scoring
&lt;/h2&gt;

&lt;p&gt;When the hard gate passes, a weighted composite is computed across the three advisory layers and any optional advisory quality-gate results. Failed advisory gates each subtract a fixed penalty.&lt;/p&gt;

&lt;p&gt;Default scoring (overridable via &lt;code&gt;.swarm/gates.yaml&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;composite threshold: &lt;code&gt;0.7&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;weights: cheat detector &lt;code&gt;0.4&lt;/code&gt;, property gate &lt;code&gt;0.4&lt;/code&gt;, attestation &lt;code&gt;0.2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;advisory gate penalty: &lt;code&gt;0.02&lt;/code&gt; per failure
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;code&gt;humanReviewRequired&lt;/code&gt; is true when the composite score is below threshold or any advisory layer is in advisory-warn status.&lt;/p&gt;
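&lt;p&gt;A sketch of that arithmetic with the default weights. Treating each advisory layer's result as a 0..1 score is my assumption; the weights, threshold, and penalty are the defaults above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch only. Advisory-warn status also forces human review (not shown).
function composite(cheat: number, property: number, attestation: number,
                   failedAdvisoryGates: number) {
  const score = 0.4 * cheat + 0.4 * property + 0.2 * attestation
    - 0.02 * failedAdvisoryGates;
  return { score, humanReviewRequired: score &amp;lt; 0.7 };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;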
&lt;h2&gt;
  
  
  Where It Actually Runs
&lt;/h2&gt;

&lt;p&gt;Three real call sites, not just unit tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production orchestrator on every &lt;code&gt;swarm&lt;/code&gt; run&lt;/li&gt;
&lt;li&gt;Synthetic calibration corpus (36 paired test specs across 6 broken-category families) executing in CI on every push&lt;/li&gt;
&lt;li&gt;SWE-bench harness using Layer 1 and Layer 4 as standalone spot-check eval drivers&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Honest Caveats
&lt;/h2&gt;

&lt;p&gt;These are in &lt;code&gt;docs/known-gaps.md&lt;/code&gt; and I won't hide them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Differential gate is host-Python-sensitive on legacy codebases. The synth-eval can reflect import-chain errors rather than assertion outcomes. The authoritative resolution gate in the per-instance Docker image is unaffected.&lt;/li&gt;
&lt;li&gt;Mutation gate skips quietly when no changed files match supported languages. YAML, Markdown, Rust, Go diffs don't get mutation-tested.&lt;/li&gt;
&lt;li&gt;Cheat detector is diff-static, not behavioural. The hidden-computation-around-hardcoded-answer pattern can pass it.&lt;/li&gt;
&lt;li&gt;Attestation signing is best-effort. Cosign-not-installed errors get logged and the run proceeds without a note. The note's absence is reflected in Layer 5's advisory-warn on subsequent runs but does not block.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Why Run This Experiment
&lt;/h2&gt;

&lt;p&gt;If the falsification battery alone produces patches that survive scrutiny at acceptable quality, then a lot of the apparent value of multi-agent coordination was actually the verification pressure it created, not the agent diversity itself. If the battery alone isn't enough, then v8 multi-agent gets a clearer mandate: the swarm is the value, not the side effect.&lt;/p&gt;

&lt;p&gt;Either result is useful. The point of the rewrite is to make the answer measurable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devtools</category>
      <category>testing</category>
    </item>
    <item>
      <title>swarm-orchestrator v7.0.0-alpha.0</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Thu, 30 Apr 2026 05:28:06 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/swarm-orchestrator-v700-alpha0-41g9</link>
      <guid>https://dev.to/moonrunnerkc/swarm-orchestrator-v700-alpha0-41g9</guid>
      <description>&lt;p&gt;The agent generates code. The orchestrator tries to find reasons not to trust it.&lt;/p&gt;

&lt;p&gt;That sentence is the entire pivot. Earlier versions of swarm-orchestrator coordinated multiple agents working on the same task. v7 wraps a single agent CLI (Copilot, Claude Code, or Codex) and runs five independent checks on the patch before allowing a merge.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;TL;DR&lt;/strong&gt;

&lt;p&gt;Five-layer verification battery sits between any agent CLI and your &lt;code&gt;main&lt;/code&gt; branch. Two layers are hard gates. Three feed a composite score. Every verified merge gets a signed SLSA attestation as a git note.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  The five checks
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Gate type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Intent verification&lt;/td&gt;
&lt;td&gt;Patch doesn't actually fix the stated problem&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Regression verification&lt;/td&gt;
&lt;td&gt;Patch breaks existing behavior, or test coverage is too weak to know&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Solution quality&lt;/td&gt;
&lt;td&gt;Agent gamed the test (hardcoded values, swallowed exceptions, modified tests)&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Behavioral verification&lt;/td&gt;
&lt;td&gt;Patch works on the happy path, crashes on edge cases&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Provenance&lt;/td&gt;
&lt;td&gt;No signed attestation produced for the merge&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1. Intent verification
&lt;/h3&gt;

&lt;p&gt;The patch must make a previously-failing test pass. For SWE-bench instances, that's the &lt;code&gt;FAIL_TO_PASS&lt;/code&gt; test from the instance JSON. For user-facing goals, a reviewer synthesizes a regression test before the worker runs and confirms it fails against the base commit first.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Regression verification
&lt;/h3&gt;

&lt;p&gt;Existing tests must pass. Then mutation testing runs on the modified files to check whether coverage around the change is actually strong enough to catch regressions. A patch that works but lives in weakly-tested code gets flagged.&lt;/p&gt;

&lt;p&gt;Mutation testing tooling per language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JS / TS&lt;/strong&gt;: Stryker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: mutmut&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Java&lt;/strong&gt;: PITest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mutation score thresholds are configurable in &lt;code&gt;.swarm/gates.yaml&lt;/code&gt;. Defaults: below 0.6 fails, 0.6 to 0.8 warns, above 0.8 passes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Solution quality
&lt;/h3&gt;

&lt;p&gt;A Semgrep rule pack scans for the specific shortcuts agents take when they're being graded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardcoded values matching test expectations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;try/catch&lt;/code&gt; blocks swallowing the exact exception a failing test was asserting on&lt;/li&gt;
&lt;li&gt;Modifications to test files outside the stated scope&lt;/li&gt;
&lt;li&gt;Mock mutations that make tests pass without changing the implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Behavioral verification
&lt;/h3&gt;

&lt;p&gt;Property-based testing runs against modified functions for 60 seconds each, using Hypothesis (Python) or fast-check (TypeScript). Counterexamples that crash the patched code or violate type contracts get reported.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Provenance
&lt;/h3&gt;

&lt;p&gt;A signed SLSA v1.0 attestation is generated for each verified merge and attached to the commit as a git note. Signed via cosign keyless OIDC. The attestation contains agent identity, model version, per-layer results, and the composite score.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm attest verify &amp;lt;commit&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That command pulls the note and verifies the signature. Useful when something breaks in production three months later and someone asks which agent wrote the offending code and what was checked at merge time.&lt;/p&gt;


&lt;h2&gt;
  
  
  Status
&lt;/h2&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Alpha.&lt;/strong&gt; SWE-bench Verified 50-instance sweeps across Copilot CLI, Claude Code, and Codex are running now. Headline metric is the &lt;strong&gt;falsification catch rate&lt;/strong&gt;: of the patches each agent claimed succeeded, what percentage failed at least one layer. Numbers drop in a follow-up post when the sweeps complete.&lt;br&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where this goes next
&lt;/h2&gt;

&lt;p&gt;v8 brings parallel execution back, applied to verification instead of generation.&lt;/p&gt;

&lt;p&gt;The orchestrator will compute a risk score for each patch, then spawn a population of independent falsifiers sized to that risk. Falsifiers share findings through a coordination channel so a discovery from one steers the targeting of others. A bandit selects which falsifier types to spawn based on past outcomes.&lt;/p&gt;

&lt;p&gt;The v7 five-layer battery becomes the seed pool that v8 grows from. The project name finally fits.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Independent verification battery for patches written by AI coding agents. Wraps Copilot, Claude Code, and Codex; applies a five-layer falsification battery (intent, mutation, cheat detection, property tests, signed attestation) to gate merges.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/assets/header.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fassets%2Fheader.svg" alt="Swarm Orchestrator" width="100%"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Independent verification battery for patches written by AI coding agents.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fc208599ef300dfbb7d7b65c32d4e1364b62c8c0bd3cc6df8a16615f7ccd9991/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4953432d626c75653f7374796c653d666c61742d737175617265" alt="License"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/7e9d8f19047f8a8c87d4828d268725442644c934e1e8acc4d5387426dabe6d41/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f76657273696f6e2d372e302e302d2d616c7068612e302d6f72616e67653f7374796c653d666c61742d737175617265" alt="Version"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/37e12b341829a2c53b69b36b6fe5a9a4f42cf56b82722fac6b5011085a3749e6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d25334525334432302d3333393933333f7374796c653d666c61742d737175617265" alt="Node"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3f31914b57bc82fa5dcbe2b429e1d486362f11bfa4411282ff311f2885102e19/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f616374696f6e732f776f726b666c6f772f7374617475732f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f722f63692e796d6c3f6272616e63683d6d61696e266c6162656c3d6369267374796c653d666c61742d737175617265" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/stargazers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1e5ff00a2deeb89446b34d1c735acc066221a43fb000f5d5a446b33581a6edb1/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f723f7374796c653d666c61742d737175617265" alt="Stars"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#how-it-works" rel="noopener noreferrer"&gt;How It Works&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#documentation" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#contributing" rel="noopener noreferrer"&gt;Contributing&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Wraps third-party coding-agent CLIs (Copilot, Claude Code, Codex), runs worker and reviewer steps on isolated git branches, and applies a five-layer falsification battery to each agent-authored patch. Hard gates block patches that fail intent or regression checks; advisory layers feed a composite score.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You run this around an agent CLI, not instead of one. The agent produces the patch; the orchestrator tries to break it. Patches that survive merge to &lt;code&gt;main&lt;/code&gt;; patches that don't are rolled back with a verification report.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Five-layer falsification battery.&lt;/strong&gt; Intent verification, regression and mutation testing, cheat detection, property-based testing, and signed attestation. Layers 1 and 2 are hard gates; layers 3 to 5 feed an advisory composite score. Implementations live under &lt;code&gt;src/verification/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolated worker and reviewer steps.&lt;/strong&gt; Each step runs…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>94% of Published SKILL.md Files Skip the Spec's Two Most Basic Patterns</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 29 Apr 2026 02:30:37 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/94-of-published-skillmd-files-skip-the-specs-two-most-basic-patterns-oo0</link>
      <guid>https://dev.to/moonrunnerkc/94-of-published-skillmd-files-skip-the-specs-two-most-basic-patterns-oo0</guid>
      <description>&lt;p&gt;The agentskills.io spec recommends two things in every description: start with an action verb, and include a trigger phrase like "use when..." that tells the routing layer when to fire the skill. They take five seconds to add and they're the difference between a skill an agent picks up and a skill that sits unused in the catalog.&lt;/p&gt;

&lt;p&gt;I sampled 500 skills at random from a 1,436-skill public corpus and measured both. 5.8% follow both recommendations. 61.8% follow neither.&lt;/p&gt;

&lt;p&gt;What follows is the full breakdown of what the SKILL.md ecosystem actually looks like in production, as of late April 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Corpus: &lt;code&gt;sickn33/antigravity-awesome-skills&lt;/code&gt; at HEAD on April 29, 2026. This is the largest publicly bundled SKILL.md collection in a single repo (1,436 indexed skills with metadata for category, source, and risk classification).&lt;/p&gt;

&lt;p&gt;Sample: 500 skills, random with seed 42 for reproducibility.&lt;/p&gt;

&lt;p&gt;Tool: &lt;a href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;&lt;code&gt;skillcheck&lt;/code&gt;&lt;/a&gt; v1.2.0 from PyPI.&lt;/p&gt;

&lt;p&gt;Per-skill features captured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every skillcheck diagnostic (rule, severity, message)&lt;/li&gt;
&lt;li&gt;description quality score, body line count, and body and metadata token estimates&lt;/li&gt;
&lt;li&gt;activation entropy and top-hypothesis score from &lt;code&gt;--activation-hypotheses&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;structural features computed locally: description length in chars and words, action verb in first position, trigger-phrase presence, presence of &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt; subdirectories, and frontmatter field count and which fields&lt;/li&gt;
&lt;li&gt;the antigravity-supplied category, source, and risk metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Caveat one: skillcheck's description quality score is a heuristic that includes action-verb and trigger-phrase detection as positive signals. So the correlation between these two features and the score is partly mechanical. The headline finding is not "we discovered these patterns predict quality." It's "the spec recommends these patterns, the linter that encodes the spec rewards them, and almost nobody is using them."&lt;/p&gt;

&lt;p&gt;Caveat two: antigravity's bundler injects &lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;, and &lt;code&gt;category&lt;/code&gt; fields into the SKILL.md frontmatter when packaging skills. The author-original frontmatter analysis below excludes these injected fields.&lt;/p&gt;

&lt;p&gt;Reproduce in five commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;skillcheck&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.2.0
git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/sickn33/antigravity-awesome-skills.git
&lt;span class="nb"&gt;cd &lt;/span&gt;antigravity-awesome-skills
&lt;span class="c"&gt;# Then sample from skills_index.json with seed 42 and run skillcheck against each&lt;/span&gt;
&lt;span class="c"&gt;# Full analysis script: see the dataset link at the bottom&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  The two-pattern adoption gap
&lt;/h2&gt;

&lt;p&gt;Every skill description was classified on two binary features: does it start with an action verb (Generates, Validates, Creates, Builds, Analyzes, etc., from a 90-verb allowlist), and does it contain a trigger phrase (&lt;code&gt;use when&lt;/code&gt;, &lt;code&gt;use this skill when&lt;/code&gt;, &lt;code&gt;when the user&lt;/code&gt;, &lt;code&gt;when working with&lt;/code&gt;, &lt;code&gt;whenever&lt;/code&gt;, etc.)?&lt;/p&gt;
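&lt;p&gt;Both checks are trivially mechanical, which is part of the point. A sketch (the real allowlist has 90 verbs; this shows a handful):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the two binary features, with an abbreviated verb list.
const VERBS = /^(Generates|Validates|Creates|Builds|Analyzes)\b/;
const TRIGGERS = /\b(use (this skill )?when|when the user|when working with|whenever)\b/i;

function classify(description: string) {
  return {
    actionVerb: VERBS.test(description.trim()),
    triggerPhrase: TRIGGERS.test(description),
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;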

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Has both action verb and trigger phrase&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;5.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action verb only&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;td&gt;21.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigger phrase only&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;10.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neither&lt;/td&gt;
&lt;td&gt;309&lt;/td&gt;
&lt;td&gt;61.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The same four groups, scored against skillcheck's description quality metric:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median score&lt;/th&gt;
&lt;th&gt;% scoring 70+&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Has both&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action verb only&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;72.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigger phrase only&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;94.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neither&lt;/td&gt;
&lt;td&gt;309&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;td&gt;8.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 100% rate in the both-features group isn't magic. It reflects that skillcheck's heuristic was designed around the spec's recommendations and rewards skills that follow them. What's actually striking is the bottom line: 309 of 500 published skills skip both recommendations. That's the working majority of the ecosystem leaving easy quality on the floor.&lt;/p&gt;
&lt;h2&gt;
  
  
  What authors actually fill in
&lt;/h2&gt;

&lt;p&gt;Outside &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt;, frontmatter is mostly empty. The median author-original frontmatter (excluding the bundler's injected fields) has just two fields. Two.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Adoption&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;description&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;author&lt;/td&gt;
&lt;td&gt;10.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tags&lt;/td&gt;
&lt;td&gt;10.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tools&lt;/td&gt;
&lt;td&gt;8.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;license&lt;/td&gt;
&lt;td&gt;3.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;allowed-tools&lt;/td&gt;
&lt;td&gt;2.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;version&lt;/td&gt;
&lt;td&gt;2.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;triggers&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;user-invokable&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;capabilities&lt;/td&gt;
&lt;td&gt;0.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The spec offers &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, &lt;code&gt;tags&lt;/code&gt;, &lt;code&gt;allowed-tools&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;agent&lt;/code&gt;, &lt;code&gt;hooks&lt;/code&gt;, &lt;code&gt;user-invocable&lt;/code&gt;, &lt;code&gt;disable-model-invocation&lt;/code&gt;, &lt;code&gt;skills&lt;/code&gt;, &lt;code&gt;mode&lt;/code&gt;. Almost none of them are being used. 80% of authors stop after name and description. There's an entire optional metadata layer the spec defines and the ecosystem ignores.&lt;/p&gt;
&lt;h2&gt;
  
  
  Progressive disclosure adoption is 16%
&lt;/h2&gt;

&lt;p&gt;The spec's load-bearing concept is progressive disclosure: keep metadata tiny so the routing layer scans it cheaply, keep the body lean so it fits the agent's context window, push heavy material into &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt; subdirectories that load only when needed.&lt;/p&gt;
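
&lt;p&gt;Concretely, a progressively disclosed skill is laid out like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-skill/
  SKILL.md       # tiny metadata + lean body; always scanned by the router
  resources/     # templates and data files, loaded only when needed
  scripts/       # executable helpers, loaded only when needed
  references/    # long-form docs and big tables, loaded only when needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;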

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Subdirectory&lt;/th&gt;
&lt;th&gt;Adoption&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;resources/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scripts/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;references/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Any of the three&lt;/td&gt;
&lt;td&gt;16.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;84% of skills inline everything in &lt;code&gt;SKILL.md&lt;/code&gt;. The whole architectural promise of progressive disclosure (multiple skills can sit in the agent's catalog without overwhelming context) requires authors to actually use the pattern. Most don't.&lt;/p&gt;
&lt;h2&gt;
  
  
  Body bloat is real
&lt;/h2&gt;

&lt;p&gt;23% of skills triggered &lt;code&gt;disclosure.body-bloat&lt;/code&gt; warnings, meaning they contain code blocks over 50 lines or tables over 20 rows in the SKILL.md body itself. These are exactly the things the progressive disclosure pattern was designed to push out into &lt;code&gt;references/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;13.6% exceeded the spec's 500-line soft cap on body length, and 8.4% tripped skillcheck's 5,000-token body-budget warning. Token estimates only get reported for skills that cross the warning threshold, so that 8.4% is a flagged floor, not a measured distribution.&lt;/p&gt;
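
&lt;p&gt;The code-block half of the bloat check is easy to approximate. A sketch that just counts lines inside fenced blocks, which is not skillcheck's exact rule logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def oversized_code_blocks(body, max_lines=50):
    """Yield the line counts of fenced code blocks longer than max_lines."""
    for match in re.finditer(r"```.*?\n(.*?)```", body, re.DOTALL):
        n = len(match.group(1).splitlines())
        if n &amp;gt; max_lines:
            yield n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;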
&lt;h2&gt;
  
  
  Description length sweet spot
&lt;/h2&gt;

&lt;p&gt;Quality scores rise with description length up to roughly 150-200 characters, hold near that level through the mid-200s, then fall off:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Length range (chars)&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25-49&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-99&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100-149&lt;/td&gt;
&lt;td&gt;158&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150-199&lt;/td&gt;
&lt;td&gt;131&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200-249&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;67.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;250-299&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The spec's character cap is 1,024. Almost nobody's pushing it. The ecosystem clusters between 100 and 200 chars (median 145), just under the bottom edge of the quality plateau. Authors writing 150+ char descriptions get noticeably better routing signal density.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cross-source patterns
&lt;/h2&gt;

&lt;p&gt;Antigravity's index classifies each skill's source. Quality patterns by source class:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source class&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median quality&lt;/th&gt;
&lt;th&gt;% action verb&lt;/th&gt;
&lt;th&gt;% trigger&lt;/th&gt;
&lt;th&gt;% progressive disclosure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;community&lt;/td&gt;
&lt;td&gt;394&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;26.6%&lt;/td&gt;
&lt;td&gt;17.5%&lt;/td&gt;
&lt;td&gt;16.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;external_repo&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;65.0&lt;/td&gt;
&lt;td&gt;34.2%&lt;/td&gt;
&lt;td&gt;31.6%&lt;/td&gt;
&lt;td&gt;18.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;official_org&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;77.8%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;33.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;personal&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three observations, with the caveat that n is 9 for official orgs and 14 for personal sets, so read those two rows directionally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills from official org repos (Anthropic, Hugging Face, etc.) hit 77.8% action-verb adoption, miles above the community baseline, but zero trigger-phrase use; their descriptions are direct and verb-led without the "use when" preamble.&lt;/li&gt;
&lt;li&gt;Skills from individual external repos (someone's personal GitHub project) hit the highest trigger-phrase rate (31.6%), suggesting individual maintainers solving their own activation problem think harder about it than community contributors writing for a shared list.&lt;/li&gt;
&lt;li&gt;Skills tagged "personal" (someone's curated set of their own work) hit 0% on both patterns, the cleanest signal that "I made this for me" doesn't translate to "an agent will pick this up."&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Skillcheck v1.2.0 against the corpus
&lt;/h2&gt;

&lt;p&gt;v1.2.0 shipped on April 28, 2026. Run against the 500-skill sample, its rule set found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 of 500 skills produced an actual ERROR (0.2%): &lt;code&gt;android_ui_verification&lt;/code&gt;, which has invalid characters in its name.&lt;/li&gt;
&lt;li&gt;499 of 500 produced WARNINGs (99.8%).&lt;/li&gt;
&lt;li&gt;0 skills passed completely clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most-fired rules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.field.unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;description.quality-score&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;499&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.body-bloat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;115&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compat.unverified&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.metadata-budget&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sizing.body.line-count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.body-budget&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.description.person-voice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.field.ecosystem&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sizing.body.token-estimate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.name.reserved-word&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;frontmatter.field.unknown&lt;/code&gt; warning fires on every file because antigravity injects bundler-only fields into the frontmatter (&lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;); strip those and the genuine unknown-field rate drops dramatically. Worth knowing if you're running skillcheck against bundled corpora versus author-original repos.&lt;/p&gt;
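
&lt;p&gt;If you're linting a bundled corpus and want author-original numbers, stripping the injected keys first takes a few lines (a sketch assuming PyYAML and standard &lt;code&gt;---&lt;/code&gt;-delimited frontmatter):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml

INJECTED = {"risk", "source", "date_added", "category", "id"}

def strip_bundler_fields(skill_md):
    """Drop bundler-injected frontmatter keys before linting."""
    parts = skill_md.split("---", 2)
    if len(parts) &amp;lt; 3:
        return skill_md  # no frontmatter block to clean
    meta = yaml.safe_load(parts[1]) or {}
    kept = {k: v for k, v in meta.items() if k not in INJECTED}
    return "---\n" + yaml.safe_dump(kept, sort_keys=False) + "---" + parts[2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;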
&lt;h2&gt;
  
  
  What this means if you publish skills
&lt;/h2&gt;

&lt;p&gt;Four things, each fixable in a single commit per skill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Start the description with an action verb (&lt;code&gt;Generates&lt;/code&gt;, &lt;code&gt;Validates&lt;/code&gt;, &lt;code&gt;Creates&lt;/code&gt;, &lt;code&gt;Analyzes&lt;/code&gt;, &lt;code&gt;Refactors&lt;/code&gt;, &lt;code&gt;Audits&lt;/code&gt;, etc.). Not &lt;code&gt;Expert in&lt;/code&gt;, not &lt;code&gt;Comprehensive&lt;/code&gt;, not &lt;code&gt;One-stop&lt;/code&gt;. The verb tells the routing layer what the skill does in two syllables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include a trigger phrase (&lt;code&gt;Use when ...&lt;/code&gt;, &lt;code&gt;Trigger when ...&lt;/code&gt;, &lt;code&gt;Use this skill when the user ...&lt;/code&gt;). The agent's routing decision is "should I activate this." A trigger phrase answers it directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aim for 175-225 characters in the description. Short descriptions don't carry enough routing signal; long ones bury it. A worked example combining the first three points follows this list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Push large code blocks (&amp;gt;50 lines), large tables (&amp;gt;20 rows), and detailed reference material out of &lt;code&gt;SKILL.md&lt;/code&gt; and into &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt;. The body should describe the work; the reference files should hold the work.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
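
&lt;p&gt;Put together, the first three look something like this (a hypothetical skill with illustrative wording; the description runs about 180 characters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: api-contract-validator
description: Validates OpenAPI contracts against live endpoint responses.
  Use when the user asks to check an implementation against its spec,
  verify response schemas, or debug contract drift.
---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;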

&lt;p&gt;That's it. Four small changes, and the first two alone move a skill from the 61.8% of the ecosystem ignoring both spec recommendations into the 5.8% following them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Methodology, for anyone who wants to push back
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tool: &lt;code&gt;skillcheck&lt;/code&gt; v1.2.0 from PyPI (released April 28, 2026)&lt;/li&gt;
&lt;li&gt;Corpus: &lt;code&gt;sickn33/antigravity-awesome-skills&lt;/code&gt; at HEAD on April 29, 2026 (1,436 indexed skills)&lt;/li&gt;
&lt;li&gt;Sample: 500 skills, drawn with &lt;code&gt;random.seed(42)&lt;/code&gt; then &lt;code&gt;random.sample&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Per-skill processing: &lt;code&gt;skillcheck path --format json --skip-ref-check&lt;/code&gt; plus &lt;code&gt;skillcheck path --activation-hypotheses --format json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Feature extraction: action-verb match against a 90-verb allowlist (gerund and base forms); trigger-phrase match against 9 regex patterns; structural facts computed from filesystem and parsed frontmatter&lt;/li&gt;
&lt;li&gt;Quality score: pulled from skillcheck's &lt;code&gt;description.quality-score&lt;/code&gt; info diagnostic (a published heuristic whose source is at &lt;code&gt;src/skillcheck/rules/description.py&lt;/code&gt; in the skillcheck repo)&lt;/li&gt;
&lt;li&gt;Frontmatter analysis: bundler-injected fields (&lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;) excluded from the author-original counts above&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full dataset (500 skills, all features, all diagnostics) and the analysis output are in the skillcheck repo under &lt;a href="https://github.com/moonrunnerkc/skillcheck/tree/main/docs" rel="noopener noreferrer"&gt;&lt;code&gt;docs/&lt;/code&gt;&lt;/a&gt;. Anyone who wants to verify a finding, slice it differently, or run the same pipeline against a different corpus has everything they need.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;This study used skillcheck's symbolic mode and the activation-hypotheses generator. The agent-native critique mode (&lt;code&gt;--ingest-critique&lt;/code&gt;) and capability graph extraction (&lt;code&gt;--ingest-graph&lt;/code&gt;) weren't run here because they require a real agent in the loop and would have made the corpus run significantly longer. A follow-up study using those modes on a smaller subset (50-100 skills) would tell us what an agent actually sees in a skill versus what a static linter can measure. That's the next post.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;
        skillcheck
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Cross-agent skill quality gate for SKILL.md files. Validates frontmatter, scores description discoverability, checks file references, enforces three-tier token budgets, and flags compatibility issues across Claude Code, VS Code/Copilot, Codex, and Cursor.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;

  
  
  &lt;img alt="skillcheck" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fskillcheck%2FHEAD%2F.github%2Fbanner.svg" width="600"&gt;

&lt;br&gt;
&lt;p&gt;&lt;strong&gt;Cross-agent skill quality gate for &lt;code&gt;SKILL.md&lt;/code&gt; files.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What This Does&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;skillcheck&lt;/code&gt; validates SKILL.md files against the &lt;a href="https://agentskills.io/specification" rel="nofollow noopener noreferrer"&gt;agentskills.io specification&lt;/a&gt;: frontmatter structure, description quality, body size, file references, and cross-agent compatibility. New in v1.0: agent-native semantic self-critique, heuristic capability graph extraction with five structural analyzers, and a per-skill validation history ledger. It does not call any LLM API, execute skill instructions, or modify files.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why This Exists&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Analysis of 580 AI instruction files found that 96% of their content cannot be verified by any static tool. A separate survey found that 22% of SKILL.md files fail basic structural validation. Skills get written, committed, and published to catalogs; nobody proves they work.&lt;/p&gt;

&lt;p&gt;skillcheck addresses both gaps with a two-mode design. When a calling agent is present, it uses that agent for semantic self-critique and capability graph extraction: the agent reads the skill's instructions and reports whether they are clear, complete, and internally…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>claude</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Jupyter notebook bug that only crashes for other people</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Tue, 28 Apr 2026 04:53:59 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/the-jupyter-notebook-bug-that-only-crashes-for-other-people-5aek</link>
      <guid>https://dev.to/moonrunnerkc/the-jupyter-notebook-bug-that-only-crashes-for-other-people-5aek</guid>
      <description>&lt;p&gt;Cell 0 uses &lt;code&gt;df&lt;/code&gt;. Cell 1 defines &lt;code&gt;df&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Notebook works for you because your kernel ran the cells in some other order and the variable's still in memory. You commit. Someone clones the repo, hits Restart and Run All, dies on cell 0.&lt;/p&gt;

&lt;p&gt;Standard Python linters can't catch this. ruff, flake8, mypy operate on one source file at a time. A notebook is N cells whose execution order in your kernel may have nothing to do with their order on disk. The bug isn't inside any single cell. It's in the relationship between cells.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nborder&lt;/code&gt; is a static linter for that relationship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rules
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Flags&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB101&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;execution_count&lt;/code&gt; decreases in source order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB201&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Name used in cell N, only defined in cell M where M &amp;gt; N&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB102&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Name used somewhere, never defined anywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB103&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stochastic call (numpy, torch, tensorflow, stdlib random) before any seed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How the cross-cell analysis works
&lt;/h2&gt;

&lt;p&gt;Each cell gets parsed with &lt;a href="https://github.com/Instagram/LibCST" rel="noopener noreferrer"&gt;libCST&lt;/a&gt;. A visitor extracts symbol definitions (assignments, function defs, class defs, imports) and symbol uses (name references, attribute roots) per cell. Connect them across cells in source order, you get a dataflow graph at notebook scope.&lt;/p&gt;

&lt;p&gt;NB201 findings are uses whose nearest matching definition lives in a later cell. NB102 findings are uses with no matching definition anywhere.&lt;/p&gt;
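
&lt;p&gt;A toy version of the per-cell extraction using the stdlib &lt;code&gt;ast&lt;/code&gt; module (nborder itself uses libCST, which also preserves formatting, but &lt;code&gt;ast&lt;/code&gt; is enough to show the graph):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

def defs_and_uses(cell_source):
    """Return (defined, used) name sets for one cell's source."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            (defined if isinstance(node.ctx, ast.Store) else used).add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
    return defined, used

def use_before_assign(cells):
    """NB201-style findings: a name used in cell i but only defined later.
    Within-cell statement order is ignored here for brevity."""
    per_cell = [defs_and_uses(src) for src in cells]
    for i, (_, used) in enumerate(per_cell):
        defined_so_far = set().union(*(d for d, _ in per_cell[: i + 1]))
        defined_later = set().union(*(d for d, _ in per_cell[i + 1 :]))
        for name in used - defined_so_far:
            if name in defined_later:
                yield i, name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;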

&lt;p&gt;The graph also makes the auto-fix safe. When NB201 fires, the fixer runs a topological sort over cell dependency edges. Sort succeeds, cells get reordered to respect dataflow and execution counts get cleared. Cycle detected, fixer bails with an explicit message naming the cycle.&lt;/p&gt;
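
&lt;p&gt;The reorder step maps directly onto the stdlib. A sketch; the real fixer also clears execution counts and keeps untouched cells byte-for-byte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from graphlib import CycleError, TopologicalSorter

def reorder_cells(cells, deps):
    """deps maps every cell index to the set of cell indices it reads from."""
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # err.args[1] names the cycle; bail rather than guess an order
        raise SystemExit(f"dependency cycle, not auto-fixable: {err.args[1]}")
    return [cells[i] for i in order]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;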

&lt;p&gt;&lt;strong&gt;NB201 fix example.&lt;/strong&gt; Input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cell 0
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# cell 1
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Run &lt;code&gt;nborder check --fix notebook.ipynb&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;notebook.ipynb:cell_0:1:10: NB201 Variable `df` used in cell 0 is only defined in cell 1. The notebook will fail on Restart-and-Run-All. [*]
Fix outcomes:
  reorder: applied (reordered 2 cells and cleared execution counts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cell 0
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="c1"&gt;# cell 1
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Cell IDs preserved. Execution counts cleared. Second &lt;code&gt;nborder check&lt;/code&gt; exits 0.&lt;/p&gt;



&lt;h2&gt;
  
  
  NB103 and seed injection
&lt;/h2&gt;

&lt;p&gt;NB103 walks the same graph for stochastic calls (&lt;code&gt;np.random.rand&lt;/code&gt;, &lt;code&gt;torch.rand&lt;/code&gt;, &lt;code&gt;tf.random.normal&lt;/code&gt;, &lt;code&gt;random.random&lt;/code&gt;) firing before any matching seed. The fix injects a single seed cell at the right position. Multi-library notebooks get one cell:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Alias-aware. &lt;code&gt;import numpy as numpy_lib&lt;/code&gt; produces a seed line using &lt;code&gt;numpy_lib&lt;/code&gt;, not a redundant fresh import. After fixing a NumPy notebook, computed cell outputs are byte-identical across consecutive &lt;code&gt;jupyter nbconvert --execute&lt;/code&gt; runs.&lt;/p&gt;

&lt;p&gt;JAX and scikit-learn get diagnostic-only handling. JAX needs &lt;code&gt;PRNGKey&lt;/code&gt; threading through call signatures. sklearn &lt;code&gt;random_state=None&lt;/code&gt; needs a value chosen against your testing strategy. Neither is a single line you can inject.&lt;/p&gt;
&lt;h2&gt;
  
  
  Byte-stable writer
&lt;/h2&gt;

&lt;p&gt;Parse a notebook, modify nothing, write it back, bytes match exactly. Verified against &lt;code&gt;nbformat&lt;/code&gt; v4.0, v4.4, v4.5 fixtures plus a real-world notebook corpus. When the writer does mutate during a fix, only the cells that actually changed get rewritten. Cell IDs, metadata, and unrelated cells stay verbatim.&lt;/p&gt;
&lt;h2&gt;
  
  
  Outputs
&lt;/h2&gt;

&lt;p&gt;Four reporters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;text&lt;/strong&gt;: ruff-style &lt;code&gt;path:cell:line:col: NB### message&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;json&lt;/strong&gt;: machine-readable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;github&lt;/strong&gt;: &lt;code&gt;::error file=...,line=...,title=NB201::&lt;/code&gt; annotations for PR inline comments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sarif&lt;/strong&gt;: SARIF 2.1.0, schema-validated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pre-commit hook and a composite GitHub Action included:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/nborder@v0.1.4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebooks/&lt;/span&gt;
    &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NB201,NB103&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  What it doesn't do
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Doesn't execute notebooks. Pair with &lt;a href="https://github.com/computationalmodelling/nbval" rel="noopener noreferrer"&gt;nbval&lt;/a&gt; or &lt;a href="https://github.com/nteract/papermill" rel="noopener noreferrer"&gt;papermill&lt;/a&gt; for kernel-level validation.&lt;/li&gt;
&lt;li&gt;Doesn't lint cell-internal style. That's &lt;a href="https://github.com/nbQA-dev/nbQA" rel="noopener noreferrer"&gt;nbqa&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Dynamic name resolution (&lt;code&gt;exec&lt;/code&gt;, &lt;code&gt;getattr&lt;/code&gt;, &lt;code&gt;**kwargs&lt;/code&gt;, monkey-patching) is invisible. Same limitation as any static analyzer.&lt;/li&gt;
&lt;li&gt;Cell magics are stripped before analysis. Names introduced by &lt;code&gt;%%capture&lt;/code&gt; get tracked. Anything magic-internal does not.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;nborder
nborder check path/to/notebooks/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Python 3.10+.&lt;/p&gt;


&lt;/div&gt;




&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/nborder" rel="noopener noreferrer"&gt;
        nborder
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A fast, opinionated linter and auto-fixer for Jupyter notebook hidden-state and execution-order bugs.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;nborder&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;A fast, opinionated linter and auto-fixer for Jupyter notebook hidden-state and execution-order bugs.&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/nborder/docs/images/hero.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fnborder%2FHEAD%2Fdocs%2Fimages%2Fhero.png" alt="nborder catches four classes of notebook bug in one pass"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://pypi.org/project/nborder/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e3b5ccfec928f35e7daa5ff4a841dd0685a5a3646652971eb5834e527ee0e373/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6e626f726465722e737667" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/nborder/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/moonrunnerkc/nborder/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/nborder/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/299dedd9b8667ac146540cd90fa831a9803e6e152f430bfbefaa9bee8d56236a/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f6e626f726465722e737667" alt="Python"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/nborder/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/08cef40a9105b6526ca22088bc514fbfdbc9aac1ddbf8d4e6c750e3a88a44dca/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d626c75652e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What this catches&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;One-line example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NB101&lt;/td&gt;
&lt;td&gt;Non-monotonic execution counts&lt;/td&gt;
&lt;td&gt;Cell 1 ran with &lt;code&gt;In [3]:&lt;/code&gt; after cell 0 ran with &lt;code&gt;In [5]:&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB102&lt;/td&gt;
&lt;td&gt;Won't survive Restart-and-Run-All&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;print(df)&lt;/code&gt; references a name no cell in the notebook defines.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB201&lt;/td&gt;
&lt;td&gt;Use-before-assign across cells&lt;/td&gt;
&lt;td&gt;Cell 0 uses &lt;code&gt;df&lt;/code&gt;; &lt;code&gt;df = ...&lt;/code&gt; only appears in cell 1.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB103&lt;/td&gt;
&lt;td&gt;Stochastic library used without seed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;np.random.rand(3)&lt;/code&gt; runs with no seed call before it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;Each rule has a docs page under &lt;a href="https://github.com/moonrunnerkc/nborder/docs/rules/" rel="noopener noreferrer"&gt;&lt;code&gt;docs/rules/&lt;/code&gt;&lt;/a&gt; explaining the bug class, a bad and good example, and the auto-fix behaviour. The four sections below walk through each rule with the diagnostic nborder actually emits.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;NB101: out-of-order execution&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;execution_count&lt;/code&gt; field on each cell records the order Jupyter actually ran cells in, not the order they appear in the file. When those orders disagree, the recorded…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/nborder" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/nborder" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>python</category>
      <category>jupyter</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Four Security Bugs That Shipped in AI-Generated Code (and How They Got Caught)</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:36:15 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/four-security-bugs-that-shipped-in-ai-generated-code-and-how-they-got-caught-10i8</link>
      <guid>https://dev.to/moonrunnerkc/four-security-bugs-that-shipped-in-ai-generated-code-and-how-they-got-caught-10i8</guid>
      <description>&lt;p&gt;A single Copilot CLI run against a FastAPI application produced four distinct security issues. The code worked. Tests passed. The endpoint did what was asked. None of the issues would surface during a demo or a code review focused on functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  User input rendered as raw HTML
&lt;/h2&gt;

&lt;p&gt;The application tracks satellite data. Satellite names come from user input. The agent rendered them directly into HTML templates in four separate locations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;strong&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/strong&amp;gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sat1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; vs &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sat2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No escaping. Four blocks, same pattern. A single-purpose security scanning agent found all four and applied &lt;code&gt;markupsafe.escape()&lt;/code&gt;. A general-purpose agent reviewing the same code caught three of four, missing one buried in a conditional branch.&lt;/p&gt;
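
&lt;p&gt;The patched rendering, for reference; &lt;code&gt;escape()&lt;/code&gt; is the markupsafe call the scanning agent applied, and &lt;code&gt;t&lt;/code&gt; is the same object as in the snippet above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from markupsafe import escape

# Same line as above, with the interpolated values escaped before rendering
html += f"&amp;lt;strong&amp;gt;{escape(t.risk)}&amp;lt;/strong&amp;gt;: {escape(t.sat1)} vs {escape(t.sat2)}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;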

&lt;p&gt;The difference isn't model quality. The security-focused agent had a narrower scope and explicit instructions to scan for unescaped user input in template rendering. Scope and prompt specificity determined the outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Health endpoint that lies to the load balancer
&lt;/h2&gt;

&lt;p&gt;The agent built a &lt;code&gt;/health&lt;/code&gt; endpoint. It returned HTTP 200 unconditionally, including when the database was unreachable.&lt;/p&gt;

&lt;p&gt;Kubernetes liveness and readiness probes interpret 200 as "this instance is healthy, keep routing traffic." An instance that returns 200 with a dead database stays in the rotation. Users hit it. Requests fail. The cluster thinks everything is fine.&lt;/p&gt;

&lt;p&gt;The correct response is 503 (Service Unavailable). The orchestrator's verification caught this because runtime behavior checks are part of the quality gate surface, not just static analysis.&lt;/p&gt;
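
&lt;p&gt;A minimal corrected sketch; &lt;code&gt;check_database()&lt;/code&gt; here is a hypothetical helper standing in for whatever connectivity probe the app actually has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

async def check_database():
    # hypothetical stand-in for a real connectivity probe
    return False

@app.get("/health")
async def health():
    if not await check_database():
        # 503 tells the load balancer to pull this instance from rotation
        return JSONResponse(status_code=503, content={"status": "degraded"})
    return {"status": "ok"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;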

&lt;p&gt;This one's subtle. The endpoint "works" in every test environment where the database is actually running. It only fails in the exact production scenario it was designed to protect against.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exception details returned to clients
&lt;/h2&gt;

&lt;p&gt;Error handlers used &lt;code&gt;str(e)&lt;/code&gt; as the response body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Database connection strings, file paths, internal state. All returned directly to whoever triggered the error. In a security audit this is an information disclosure finding. In a FastAPI app behind an API gateway, it's a path to mapping internal infrastructure.&lt;/p&gt;
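
&lt;p&gt;The conventional fix: log the details server-side, hand the client a generic message plus an opaque correlation ID. A sketch, with &lt;code&gt;do_work()&lt;/code&gt; as a hypothetical stand-in for the handler body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging
import uuid

logger = logging.getLogger(__name__)

def do_work(request):
    # hypothetical stand-in for the real handler body
    raise RuntimeError("postgres://user:hunter2@10.0.3.7/internal")

def handle(request):
    try:
        return do_work(request)
    except Exception:
        error_id = uuid.uuid4().hex
        # full traceback stays in server logs, keyed by the opaque ID
        logger.exception("request failed error_id=%s", error_id)
        return {"error": "internal error", "error_id": error_id}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;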

&lt;h2&gt;
  
  
  Deprecated datetime API
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;datetime.utcnow()&lt;/code&gt; has been deprecated since Python 3.12. The replacement is &lt;code&gt;datetime.now(timezone.utc)&lt;/code&gt;. The agent also used &lt;code&gt;time.time()&lt;/code&gt; for uptime tracking, which is affected by NTP clock adjustments and can report negative uptime if the system clock steps backward. &lt;code&gt;time.monotonic()&lt;/code&gt; exists specifically for this case.&lt;/p&gt;
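
&lt;p&gt;Both replacements are one-liners:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from datetime import datetime, timezone

created_at = datetime.now(timezone.utc)    # replaces deprecated datetime.utcnow()

START = time.monotonic()                   # replaces time.time() for uptime
uptime_seconds = time.monotonic() - START  # immune to NTP steps; never negative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;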

&lt;p&gt;Neither of these will cause a production outage today. Both are the kind of technical debt that accumulates when generated code isn't checked against current language standards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;None of these bugs required a sophisticated analysis to find. They're patterns: unescaped user input in templates, unconditional success responses in health checks, raw exception strings in error responses, deprecated stdlib usage. Each one is a known category with a known fix.&lt;/p&gt;

&lt;p&gt;The problem is attention. A general-purpose agent optimizing for "make this feature work" doesn't allocate attention to these categories unless explicitly prompted. The feature works. The tests pass. The agent moves on.&lt;/p&gt;

&lt;p&gt;This is where orchestration changes the economics. Instead of one agent covering everything, specialized agents with narrow scopes check specific categories. A security auditor scans for injection and information disclosure. A runtime checker validates health endpoint semantics. Each agent's prompt is focused enough that known bug patterns get caught.&lt;/p&gt;

&lt;p&gt;The alternative is what most developers do today: manually reprompt. "Now check for XSS." "Now add proper error handling." "Now fix the health check to actually check health." We measured this on the same codebase. 14 follow-up prompts to bring the standalone output to the same level. Each prompt required reading the previous output, identifying what was wrong, and writing a specific correction. About 45 minutes of continuous supervision.&lt;/p&gt;

&lt;p&gt;The orchestrated run took 22 minutes, unattended. 7 premium requests vs 15. Zero human review cycles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swarm Orchestrator v5.0.0
&lt;/h2&gt;

&lt;p&gt;The tool that caught these is open source. It wraps existing agent CLIs (Copilot, Claude Code, Codex) and adds verification, quality gates, and parallel execution. It doesn't generate code. It delegates code generation and verifies the output against outcome-based checks: git diff, build success, test pass, runtime behavior.&lt;/p&gt;

&lt;p&gt;v5.0.0 adds three features relevant to this problem:&lt;/p&gt;

&lt;p&gt;Spec-aware planning reads the quality gate configuration before generating agent prompts. Security requirements, test coverage thresholds, and configuration standards get injected before agents write code, not discovered through iteration afterward.&lt;/p&gt;

&lt;p&gt;SARIF output exports quality gate violations as SARIF 2.1.0 JSON compatible with GitHub code scanning. Same PR annotation workflow teams already use for CodeQL.&lt;/p&gt;

&lt;p&gt;Per-project gate configuration via &lt;code&gt;.swarm/gates.yaml&lt;/code&gt; lets teams override thresholds and disable gates that don't apply to their project type.&lt;/p&gt;

&lt;p&gt;1,386 passing tests, 84 source files, 7 documented benchmarks. The release notes include commit hashes for every bug fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Swarm Orchestrator on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;What categories of bugs do you consistently find in AI-generated code that could be caught by a specialized check rather than manual review?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We Parsed 580 AI Instruction Files. 96% of the Content Can't Be Verified.</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:31:26 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/we-parsed-580-ai-instruction-files-96-of-the-content-cant-be-verified-4cg5</link>
      <guid>https://dev.to/moonrunnerkc/we-parsed-580-ai-instruction-files-96-of-the-content-cant-be-verified-4cg5</guid>
      <description>&lt;p&gt;Every AI coding agent reads an instruction file. CLAUDE.md, AGENTS.md, .cursorrules, whatever your agent uses. You write rules in it. The agent says "Done." And you have no idea whether it followed any of them.&lt;/p&gt;

&lt;p&gt;We wanted to know what's actually inside these files. Not what people think they contain, but what a machine can extract and verify through static analysis. So we scraped instruction files from 568 public GitHub repos with 10+ stars, ran them through a parser backed by 102 matchers across 8 verifier engines (AST, filesystem, regex, tree-sitter, preference, tooling, config-file, git-history), and counted what came out.&lt;/p&gt;

&lt;p&gt;The short version: across the entire corpus, 3.8% of lines were extracted as verifiable coding rules. The other 96% is markdown headers, code examples, project descriptions, build commands, agent behavior directives, and contextual prose.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dataset
&lt;/h2&gt;

&lt;p&gt;580 instruction files from 568 repos, including Sentry (43k stars), PingCAP/TiDB (40k), Lerna (36k), Dragonfly (30k), Kubernetes/kops (17k), javascript-obfuscator (16k), RabbitMQ (14k), Google APIs (14k), Redpanda (12k), and hundreds of others. Six file formats represented: AGENTS.md (149 files), CLAUDE.md (111), .cursorrules (102), .windsurfrules (95), GEMINI.md (89), and copilot-instructions.md (34). This sample skews toward larger public repos. Enterprise internal repos with stricter governance, or solo projects with tightly scoped instruction files, may look different. We'd like to see that data.&lt;/p&gt;

&lt;p&gt;The parser reads each file and classifies every line: is this a rule that can be checked against code, or is it something else? "Something else" includes headers, blank lines, code blocks, explanatory prose, build instructions, and agent personality configuration.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Corpus stats:&lt;/strong&gt; 8,222 total instruction lines parsed. 309 rules extracted. 7,913 lines classified as non-rule content.&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  What instruction files actually contain
&lt;/h2&gt;

&lt;p&gt;The 96% that isn't rules breaks down into several categories. Some of it is necessary context (project structure explanations, build command documentation). Some of it is agent behavior configuration ("be succinct," "avoid providing explanations"). Some of it is just markdown formatting overhead.&lt;/p&gt;

&lt;p&gt;Here's what stood out: 430 of the 580 files (74%) had zero extractable rules. Of those, 67 were completely empty to the parser: zero extracted, zero unparseable. Many were single-line redirects. Dragonfly's .cursorrules (30k stars) says "READ AGENTS.md." Umi's .cursorrules (16k stars) contains the single word "RULE.md." Mautic's GEMINI.md says "Read and follow all instructions in ./AGENTS.md."&lt;/p&gt;

&lt;p&gt;At the other end, a few files were dense with rules. Apache Skywalking-java's CLAUDE.md extracted 6 rules from 26 lines (23%). Cloudflare chanfana's AGENTS.md: 5 rules from 21 lines (24%). But those files tend to be short, focused lists of concrete instructions.&lt;/p&gt;

&lt;p&gt;The heavy files tell a different story. javascript-obfuscator's CLAUDE.md (16k stars): 197 lines, zero rules extracted. These files are documentation with no machine-verifiable instructions embedded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parse rate distribution across all 580 files:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parse Rate&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0% (no rules)&lt;/td&gt;
&lt;td&gt;430&lt;/td&gt;
&lt;td&gt;74.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1-9%&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;12.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-19%&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;9.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-29%&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;2.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30-49%&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;1.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;= 80%&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only 2 files (0.3%) had parse rates at or above 80%. Nearly three quarters had zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of content the parser correctly skips
&lt;/h2&gt;

&lt;p&gt;"3.8% extraction rate" sounds like the parser is broken. It isn't. These are lines that genuinely aren't rules:&lt;/p&gt;

&lt;p&gt;Markdown structure (headers, horizontal rules, blank lines). Code examples showing how to use a function or run a command. Project descriptions explaining what the repo does. Build and deployment instructions. Links to external documentation. Agent behavior directives that have no code-level representation ("be concise," "ask before making changes"). Workflow instructions ("use this branch strategy," "run tests before pushing").&lt;/p&gt;

&lt;p&gt;The parser isn't failing on these. It's correctly identifying them as not-rules. The denominator is every line in the file, not every line that looks like it could be a rule.&lt;/p&gt;

&lt;p&gt;A second metric tells the complementary story. 150 of 580 files (25.9%) contained at least one extractable rule. Across those 150 files, 309 rules is an average of 2.1 rules per file. So only a quarter of instruction files contain anything enforceable at all, and when they do, they typically contain two rules. The 3.8% describes the corpus-wide line ratio. The 25.9% and 2.1-per-file numbers describe what rule-writers are actually producing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a "verifiable rule" looks like
&lt;/h2&gt;

&lt;p&gt;The 309 rules that did get extracted map to concrete checks. Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Use camelCase for function names" (AST naming check)&lt;/li&gt;
&lt;li&gt;"No &lt;code&gt;any&lt;/code&gt; types" (TypeScript type safety check)&lt;/li&gt;
&lt;li&gt;"Use named exports, not default exports" (import pattern check)&lt;/li&gt;
&lt;li&gt;"Prefer &lt;code&gt;const&lt;/code&gt; over &lt;code&gt;let&lt;/code&gt;" (preference ratio check)&lt;/li&gt;
&lt;li&gt;"Test files must exist for every source file" (filesystem check)&lt;/li&gt;
&lt;li&gt;"Use Yarn, not npm" (tooling check)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each rule gets a category, a verifier type (AST, filesystem, regex, tree-sitter, preference, tooling, config-file, or git-history), and a qualifier (always, prefer, when-possible, avoid-unless, try-to, never).&lt;/p&gt;
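
&lt;p&gt;In other words, each extracted rule is roughly a record like this (field names illustrative, not RuleProbe's actual schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class ExtractedRule:
    # illustrative shape, not RuleProbe's real schema
    text: str        # e.g. "Use camelCase for function names"
    category: str    # e.g. "naming", "type-safety", "import-pattern"
    verifier: str    # one of: ast, filesystem, regex, tree-sitter,
                     # preference, tooling, config-file, git-history
    qualifier: str   # one of: always, prefer, when-possible,
                     # avoid-unless, try-to, never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;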

&lt;p&gt;&lt;strong&gt;Rule extraction by category:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Rules Extracted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;naming&lt;/td&gt;
&lt;td&gt;169&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;structure&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;code-style&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;forbidden-pattern&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;type-safety&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dependency&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;error-handling&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;import-pattern&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;test-requirement&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Naming rules dominate: 55% of all extracted rules. That's likely a combination of two factors. Naming conventions ("use camelCase," "kebab-case filenames") are the most concrete, unambiguous instructions people write, so they appear frequently. They're also the rule class that static analysis matchers handle most cleanly, so the parser has high affinity for them. We can't fully separate how much of the 55% is user behavior vs. parser strength, but both contribute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule extraction by instruction file type:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Files with Rules&lt;/th&gt;
&lt;th&gt;Rules&lt;/th&gt;
&lt;th&gt;Total Lines&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;copilot-instructions.md&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;556&lt;/td&gt;
&lt;td&gt;5.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.cursorrules&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;1,508&lt;/td&gt;
&lt;td&gt;5.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;149&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;1,961&lt;/td&gt;
&lt;td&gt;4.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.windsurfrules&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;1,866&lt;/td&gt;
&lt;td&gt;2.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;1,501&lt;/td&gt;
&lt;td&gt;2.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GEMINI.md&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;830&lt;/td&gt;
&lt;td&gt;1.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;copilot-instructions.md had the highest extraction rate (5.9%), likely because those files tend to be shorter and more prescriptive. GEMINI.md files had the lowest (1.4%).&lt;/p&gt;

&lt;h2&gt;
  
  
  E2E verification: does excalidraw follow its own instruction files?
&lt;/h2&gt;

&lt;p&gt;This is a pipeline demonstration on one repo, not broad validation across ecosystems. We ran the full pipeline on excalidraw (~95k stars) because it's large, well-maintained, and has instruction files with extractable rules: both a CLAUDE.md and a copilot-instructions.md.&lt;/p&gt;

&lt;p&gt;The parser found 9 verifiable rules across both files. Deterministic analysis scored 66.1% compliance. Semantic analysis (structural fingerprinting of 626 source files) produced 9 verdicts, all resolved via fast-path vector similarity. Zero LLM calls, zero cost:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Compliance&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefer functional components&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;structural-fast-path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PascalCase type naming&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;structural-fast-path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async try/catch usage&lt;/td&gt;
&lt;td&gt;0.983&lt;/td&gt;
&lt;td&gt;structural-fast-path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contextual error logging&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;td&gt;structural-fast-path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yarn as package manager&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript required&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optional chaining preference&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;camelCase variables&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UPPER_CASE constants&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rules that match established code pattern topics (component-structure, error-handling) score 0.97+, meaning the codebase's structural fingerprint strongly matches the instruction. The remaining five rules scored a neutral 0.50 because they describe tooling choices and naming conventions that don't have structural AST representations. That's itself a finding: even among the 4% of lines that get extracted as verifiable rules, some fall into categories that resist automated verification beyond simple presence checks. The verifier is real, but not comprehensive. No static analysis tool covers every rule class, and pretending otherwise would be dishonest.&lt;/p&gt;

&lt;p&gt;Privacy note: 626 files scanned, all file IDs are opaque sequential integers. No source code strings, file paths, variable names, or comments appear in any payload. In this case, no LLM was even called.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for anyone writing instruction files
&lt;/h2&gt;

&lt;p&gt;Two clarifications before the takeaways. First, "96% can't be verified" means can't be verified through static analysis, not "is useless." Agent behavior configuration, project context, and workflow documentation all have value. They guide the agent even if no tool can confirm compliance after the fact. Second, the 4% that is verifiable still matters. Excalidraw's 9 extractable rules produced a 66.1% deterministic compliance score with specific failures at specific line numbers. Nine rules doesn't sound like much until three of them fail and you find the agent ignored your naming conventions across 626 files.&lt;/p&gt;

&lt;p&gt;The real problem isn't that instruction files contain documentation. It's that most people don't know which of their lines are enforceable and which are suggestions the agent can silently drop. That ratio isn't fixed, either. People write unverifiable instructions because nobody's told them which phrasings produce checkable rules.&lt;/p&gt;

&lt;p&gt;To write rules that can actually be checked (a quick self-test sketch follows this list):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use imperative verbs with specific targets.&lt;/strong&gt; "Use camelCase for all function names" is verifiable. "Follow good naming conventions" isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specify the tool or pattern, not the principle.&lt;/strong&gt; "Prefer &lt;code&gt;const&lt;/code&gt; over &lt;code&gt;let&lt;/code&gt;" is a ratio check. "Write immutable code" is philosophy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Include the file patterns your rules apply to.&lt;/strong&gt; "All &lt;code&gt;.ts&lt;/code&gt; files must use named exports" scopes the check. "Use named exports" is vague.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep rules and documentation separate.&lt;/strong&gt; Rules are instructions. Documentation explains why. Mixing them dilutes both.&lt;/p&gt;
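
&lt;p&gt;A quick way to self-test a phrasing is RuleProbe's programmatic &lt;code&gt;extractRules&lt;/code&gt;, which works on raw markdown content. A minimal sketch; the exact return shape (and whether the call is synchronous) is an assumption, so just inspect what comes back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { extractRules } from 'ruleprobe';

// Imperative verb with a specific target: should extract as a rule.
const verifiable = extractRules('- Use camelCase for all function names');

// Principle with no mechanical check: should be skipped as unparseable.
const vague = extractRules('- Follow good naming conventions');

// Assumption: plain data you can inspect; shape may differ by version.
console.log({ verifiable, vague });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;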

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/ruleprobe" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;RuleProbe on GitHub: parse your own instruction files and see what's actually verifiable&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool
&lt;/h2&gt;

&lt;p&gt;RuleProbe is the parser and verifier behind this analysis. It reads 7 instruction file formats, extracts machine-verifiable rules using 102 built-in matchers across 14 categories, and checks agent output against each one. Deterministic by default, no API keys needed for the core pipeline. Optional semantic analysis for pattern-matching and consistency rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx ruleprobe parse CLAUDE.md &lt;span class="nt"&gt;--show-unparseable&lt;/span&gt;
npx ruleprobe verify CLAUDE.md ./src &lt;span class="nt"&gt;--format&lt;/span&gt; summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;--show-unparseable&lt;/code&gt; flag shows you exactly which lines were skipped and why. That's often the most useful output: it tells you which of your "rules" aren't rules at all.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;
        ruleprobe
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Verify whether AI coding agents follow the instruction files they're given
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;RuleProbe&lt;/h1&gt;
&lt;/div&gt;


&lt;p&gt;Verify whether AI coding agents actually follow the instruction files they're given&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;Every AI coding agent reads an instruction file. None of them prove they followed it.&lt;/p&gt;

&lt;p&gt;You write &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; with specific rules: camelCase variables, no &lt;code&gt;any&lt;/code&gt; types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.&lt;/p&gt;

&lt;p&gt;RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Compliance scores with file paths and line numbers as evidence. Deterministic and reproducible by default. Optional semantic analysis for pattern-matching and consistency rules that require codebase-aware judgment.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npm install -g ruleprobe&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or run it directly:&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npx ruleprobe --help&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Parse an instruction file&lt;/strong&gt; to see what rules RuleProbe can extract:&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;ruleprobe parse CLAUDE.md
ruleprobe parse AGENTS.md --show-unparseable&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Verify agent output&lt;/strong&gt;…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;






</description>
      <category>ai</category>
      <category>programming</category>
      <category>typescript</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Independent convergence on specification-first AI code verification</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sat, 11 Apr 2026 14:42:02 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/independent-convergence-on-specification-first-ai-code-verification-efj</link>
      <guid>https://dev.to/moonrunnerkc/independent-convergence-on-specification-first-ai-code-verification-efj</guid>
      <description>&lt;p&gt;On March 26, 2026, Christo Zietsman published "The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review" on arXiv.&lt;/p&gt;

&lt;p&gt;Paper: &lt;a href="https://arxiv.org/abs/2603.25773" rel="noopener noreferrer"&gt;arXiv:2603.25773&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper's core argument (direct quote from abstract):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The combined argument implies an architecture: specifications first, deterministic verification pipeline second, AI review only for the structural and architectural residual.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I noticed this because my own open-source project, Swarm Orchestrator, implements a very similar layered approach. I built it from real usage patterns with AI coding agents, not from the paper (neither of us referenced the other's work).&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      CI/CD for AI-generated code. Run Copilot, Claude Code, or Codex in parallel; verify every claim against evidence; gate merges on 8 automated quality checks.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CI/CD for AI-generated code. Run Copilot, Claude Code, or Codex in parallel; verify every claim against evidence; gate merges on 8 automated quality checks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Not an autonomous system builder: an accountability layer around agents you already trust enough to run, but not enough to merge blind. Each step runs on its own isolated branch. Each claim (tests pass, build clean, commit made) is cross-referenced against the transcript and the actual filesystem. Failures are auto-classified, repaired with targeted strategies, and re-verified. Nothing reaches main without passing both the verification engine and the quality gate pipeline. The metric that matters is &lt;strong&gt;cost per rubric point&lt;/strong&gt;, not wall-clock time.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#what-is-this" rel="noopener noreferrer"&gt;What Is This&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#benchmarking" rel="noopener noreferrer"&gt;Benchmarking&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#usage" rel="noopener noreferrer"&gt;Usage&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#github-action" rel="noopener noreferrer"&gt;GitHub Action&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#recipes" rel="noopener noreferrer"&gt;Recipes&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#architecture" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#contributing" rel="noopener noreferrer"&gt;Contributing&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/swarm.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fswarm.png" alt="Swarm Orchestrator TUI dashboard showing parallel agent execution across waves" width="700"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;/div&gt;

&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;See it run end-to-end&lt;/h3&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npm install -g swarm-orchestrator
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; then set up any one of the agent CLIs below, and:&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  How the tool works (current state as of April 2026)
&lt;/h2&gt;

&lt;p&gt;Agents run as untrusted subprocesses on isolated git branches. Acceptance criteria are injected into each agent's prompt before generation.&lt;/p&gt;

&lt;p&gt;After execution, a deterministic verification pipeline checks claims against concrete evidence (commit SHAs, test output, build results, file diffs). No LLM is used as the primary gate.&lt;/p&gt;

&lt;p&gt;Eight configurable quality gates then run: scaffold leftovers, duplicate blocks, hardcoded config, README accuracy, test isolation, test coverage, accessibility, runtime correctness. All are regex/AST/diff/threshold checks.&lt;/p&gt;
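
&lt;p&gt;For a sense of how small these gates can be, here's a hypothetical regex-style gate for hardcoded config, in the same evidence-with-file-and-line spirit (not the orchestrator's actual gate code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { readFileSync } from 'node:fs';

// Patterns a hardcoded-config gate might flag (illustrative only).
const HARDCODED = [
  /https?:\/\/(localhost|127\.0\.0\.1)/,  // baked-in local URLs
  /['"](sk-|AKIA)[A-Za-z0-9]+['"]/,       // API-key-shaped string literals
];

export function hardcodedConfigGate(files: string[]): { pass: boolean; evidence: string[] } {
  const evidence: string[] = [];
  for (const file of files) {
    readFileSync(file, 'utf8').split('\n').forEach((line, i) =&gt; {
      if (HARDCODED.some((re) =&gt; re.test(line))) {
        evidence.push(`${file}:${i + 1}`); // file and line, like the other gates
      }
    });
  }
  return { pass: evidence.length === 0, evidence };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;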

&lt;p&gt;An optional &lt;code&gt;--governance&lt;/code&gt; Critic wave runs after the deterministic layers. It scores steps on weighted axes and pauses for human review on flags. Scores are advisory only.&lt;/p&gt;

&lt;p&gt;Full details and flow: &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;github.com/moonrunnerkc/swarm-orchestrator&lt;/a&gt; (80 stars, 50 passing tests across 95 files, latest release v4.2.0 on April 9).&lt;/p&gt;

&lt;p&gt;The original Copilot-focused version went public on dev.to &lt;a href="https://dev.to/moonrunnerkc/copilot-swarm-orchestrator-oda"&gt;January 25, 2026&lt;/a&gt; with the core isolation + evidence-based verification already present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this alignment matters
&lt;/h2&gt;

&lt;p&gt;Zietsman cites the DORA 2026 report showing that higher AI code generation correlates with higher throughput &lt;em&gt;and&lt;/em&gt; higher instability. Time saved writing code gets re-spent on auditing. His paper argues that simply adding more AI review does not fix the structural issue when there is no external specification layer.&lt;/p&gt;

&lt;p&gt;Swarm Orchestrator was built to address exactly that pattern. The deterministic gates catch the repeatable failure modes (security headers, test depth, config externalization) that standalone agents consistently miss in head-to-head runs. The Critic layer is available only for the residual judgment calls where human or AI insight can still add value.&lt;/p&gt;

&lt;p&gt;I am not claiming this proves or validates the paper. It is simply an independent practical example that landed on closely aligned principles at roughly the same time. If you are working with AI coding agents and wrestling with verification, the repo is open for review, issues, or contributions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;swarm-orchestrator on GitHub&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>devops</category>
    </item>
    <item>
      <title>AI Coding Agents Can Verify Some of Their Work Now. Here's What They Still Miss.</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:40:48 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/ai-coding-agents-can-verify-some-of-their-work-now-heres-what-they-still-miss-58mc</link>
      <guid>https://dev.to/moonrunnerkc/ai-coding-agents-can-verify-some-of-their-work-now-heres-what-they-still-miss-58mc</guid>
      <description>&lt;p&gt;Copilot and Claude Code both ship with verification features now. Copilot's Agent mode runs terminal commands, detects build failures, and iterates fixes. Claude Code plans changes across files and can run your test suite after modifications. Both have improved significantly since 2025.&lt;/p&gt;

&lt;p&gt;They're still not catching everything.&lt;/p&gt;

&lt;p&gt;Developers consistently report agents declaring tasks complete while skipping accessibility attributes, test isolation, config externalization, dark mode, responsive layout, and meta tags. The agent runs the build, sees green, and moves on. But "build passes" and "the output is production-ready" are different bars. The reprompt cycle for quality attributes the agent never attempted in the first place is still a significant time sink on any non-trivial project.&lt;/p&gt;

&lt;p&gt;That gap is where Swarm Orchestrator sits. Not replacing the agent's self-verification, but adding the checks it doesn't run.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;You give it a goal. It builds a dependency-aware plan, assigns steps to specialized agents, and launches them in parallel on isolated git branches. Each step runs through outcome-based verification (build, test, diff, expected files) and eight quality gates covering scaffold leftovers, duplicate code, hardcoded config, README accuracy, test isolation, test coverage, accessibility, and runtime correctness.&lt;/p&gt;

&lt;p&gt;Before the agent runs, the orchestrator injects acceptance criteria based on the project type. For web apps, that's 16 requirements: semantic HTML, responsive layout, dark mode via CSS custom properties, &lt;code&gt;prefers-reduced-motion&lt;/code&gt;, image alt attributes, heading hierarchy, ARIA labels, focus-visible styles, and more. For everything else, 6 baseline criteria covering error handling, documentation, input validation, logging, and test coverage.&lt;/p&gt;

&lt;p&gt;The agent sees these as hard requirements. After execution, the quality gates check whether they were met. The agent's own verification handles "does it compile and do tests pass." The orchestrator handles "did it actually do what was asked, completely."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark context.&lt;/strong&gt; Head-to-head runs against standalone Copilot CLI, Claude Code, and Codex on the same goals showed a consistent pattern: quality attributes the agent never attempted were absent from unassisted output. These aren't build failures the agent would catch on its own. They're requirements like skip-to-content links, &lt;code&gt;prefers-reduced-motion&lt;/code&gt; media queries, CSS custom properties on &lt;code&gt;:root&lt;/code&gt;, dual theme-color meta tags, module separation between logic and presentation, zero-dependency test runners. Each is at least one follow-up prompt. Several take 2-3 rounds.&lt;/p&gt;

&lt;p&gt;The orchestrator caught and enforced all of them in a single pass.&lt;/p&gt;

&lt;p&gt;Steps that fail don't get blindly retried. The orchestrator classifies the failure (build, test, missing artifact, dependency, timeout) and sends the agent back with the actual error output and context. This works alongside the agent's own retry capabilities, not instead of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in v4.2.0
&lt;/h2&gt;

&lt;p&gt;Three additions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Tool Adapters
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;--tool&lt;/code&gt; flag existed in previous versions. It parsed from the CLI, reached the options object, and then did nothing. The orchestrator always spawned Copilot CLI internally regardless of what you passed.&lt;/p&gt;

&lt;p&gt;That's fixed. &lt;code&gt;resolveAdapter()&lt;/code&gt; now routes through real adapter implementations with a shared process supervisor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Add auth"&lt;/span&gt; &lt;span class="nt"&gt;--tool&lt;/span&gt; copilot              &lt;span class="c"&gt;# default, unchanged behavior&lt;/span&gt;
swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Add auth"&lt;/span&gt; &lt;span class="nt"&gt;--tool&lt;/span&gt; claude-code
swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Add auth"&lt;/span&gt; &lt;span class="nt"&gt;--tool&lt;/span&gt; claude-code-teams &lt;span class="nt"&gt;--team-size&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
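
&lt;p&gt;Conceptually, the routing is a registry lookup behind the flag. A sketch with hypothetical names; the &lt;code&gt;spawn&lt;/code&gt; return type is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch of --tool routing; the real resolveAdapter()
// in the swarm-orchestrator repo differs in detail.
interface AgentAdapter {
  name: string;
  spawn(opts: { prompt: string; workdir: string }): Promise&lt;number&gt;; // exit code (assumption)
}

// Adapters register themselves here; registration omitted in this sketch.
const registry = new Map&lt;string, AgentAdapter&gt;(); // copilot, claude-code, codex, ...

function resolveAdapter(tool: string): AgentAdapter {
  const adapter = registry.get(tool);
  if (!adapter) throw new Error(`unknown --tool: ${tool}`);
  return adapter; // every adapter runs under the shared process supervisor
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;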



&lt;p&gt;Agent Teams mode spawns a team lead per wave for native multi-agent coordination. If the team lead fails, it falls back to per-step sequential execution automatically.&lt;/p&gt;

&lt;p&gt;Every adapter shares the same process supervisor: 5-minute stall timeout, 10-second heartbeat checking stdout activity, SIGTERM on stall, SIGKILL after 5-second grace. Previously only the Copilot path had stall detection. A hung &lt;code&gt;claude&lt;/code&gt; process would block your entire run indefinitely. That's gone.&lt;/p&gt;
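
&lt;p&gt;That contract is simple enough to sketch: poll stdout on a heartbeat and escalate from SIGTERM to SIGKILL on stall. Timings mirror the description above; the real supervisor differs in detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { spawn } from 'node:child_process';

const STALL_MS = 5 * 60_000;  // 5-minute stall timeout
const HEARTBEAT_MS = 10_000;  // heartbeat checks stdout every 10 seconds
const GRACE_MS = 5_000;       // SIGKILL after a 5-second grace

export function supervise(cmd: string, args: string[]) {
  const child = spawn(cmd, args);
  let lastOutput = Date.now();
  child.stdout?.on('data', () =&gt; { lastOutput = Date.now(); });

  const heartbeat = setInterval(() =&gt; {
    if (Date.now() - lastOutput &gt; STALL_MS) {
      child.kill('SIGTERM');                              // polite stop first
      setTimeout(() =&gt; child.kill('SIGKILL'), GRACE_MS);  // then force
    }
  }, HEARTBEAT_MS);

  child.on('exit', () =&gt; clearInterval(heartbeat));
  return child;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;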

&lt;h3&gt;
  
  
  OWASP ASI Compliance Mapping
&lt;/h3&gt;

&lt;p&gt;The orchestrator already enforced branch isolation (ASI-03: Excessive Agency), outcome-based verification (ASI-05: Improper Output Handling), and failure-classified repair. Those behaviors map directly to risks in the OWASP Top 10 for Agentic Applications.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--owasp-report&lt;/code&gt; formalizes that mapping. After every run, it generates a per-risk assessment with evidence pulled from actual execution metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Build REST API"&lt;/span&gt; &lt;span class="nt"&gt;--governance&lt;/span&gt; &lt;span class="nt"&gt;--owasp-report&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six of the 10 ASI risks are assessed; four are marked not-applicable with explicit rationale (the orchestrator doesn't store user data, doesn't communicate across networks, doesn't train models). If a risk doesn't apply, the report says so and explains why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which ASI risks are assessed?&lt;/strong&gt;&lt;/p&gt;
  &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ASI Risk&lt;/th&gt;
&lt;th&gt;Assessed&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ASI-01: Prompt Injection&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Agent prompts controlled by orchestrator, user goals parameterized into plan steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-02: Insecure Tool Use&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Tool invocations verified against transcript evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-03: Excessive Agency&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Scope enforcement via isolated worktrees and boundary declarations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-04: Unreliable Execution&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Failure classification, targeted repair, retry with error context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-05: Improper Output Handling&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Build/test/diff verification independent of agent self-reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-10: Uncontrolled Autonomy&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Governance mode with Critic scoring, human-in-the-loop approval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-06, 07, 08, 09&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No model training, no data storage, no cross-network communication, no supply chain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Structured Run Reports
&lt;/h3&gt;

&lt;p&gt;Every run already produced artifacts: session state, metrics, cost attribution, per-step verification reports, and now OWASP compliance. Pulling a coherent picture from those files meant opening each one individually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm report runs/my-run-id                             &lt;span class="c"&gt;# generate from any completed run&lt;/span&gt;
swarm report &lt;span class="nt"&gt;--latest&lt;/span&gt; &lt;span class="nt"&gt;--stdout&lt;/span&gt;                          &lt;span class="c"&gt;# most recent run, print to terminal&lt;/span&gt;
swarm report runs/my-run-id &lt;span class="nt"&gt;--format&lt;/span&gt; json               &lt;span class="c"&gt;# JSON only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. Markdown and JSON. Missing sections (cost data, OWASP) are handled gracefully and just don't appear in the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Sits
&lt;/h2&gt;

&lt;p&gt;The agents have gotten better at self-verification. That's a good thing. The orchestrator isn't competing with that. It's adding a layer the agents don't cover: acceptance criteria enforcement, quality gates for attributes agents don't check on their own, independent verification that doesn't rely on the agent's self-reporting, and an auditable trail of everything that happened.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Standalone Agent (2026)&lt;/th&gt;
&lt;th&gt;With Orchestrator&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Build/test verification&lt;/td&gt;
&lt;td&gt;Built-in (Copilot Agent, Claude Code)&lt;/td&gt;
&lt;td&gt;Independent check on isolated branch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality attributes&lt;/td&gt;
&lt;td&gt;Whatever you prompt for&lt;/td&gt;
&lt;td&gt;16 web-app / 6 baseline criteria injected and verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure handling&lt;/td&gt;
&lt;td&gt;Agent retries with some context&lt;/td&gt;
&lt;td&gt;Classified failure, targeted repair prompt with error output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;Chat history, some checkpoints&lt;/td&gt;
&lt;td&gt;Transcripts, verification reports, cost attribution, OWASP compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merge safety&lt;/td&gt;
&lt;td&gt;Agent says it's done&lt;/td&gt;
&lt;td&gt;Proof required across verification + 8 quality gates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;GitHub: moonrunnerkc/swarm-orchestrator&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;TypeScript. ISC license. Requires Node 20+ and Git.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>devops</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Same Instruction File, Same Score, Completely Different Failures</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Tue, 07 Apr 2026 00:24:41 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/same-instruction-file-same-score-completely-different-failures-46fp</link>
      <guid>https://dev.to/moonrunnerkc/same-instruction-file-same-score-completely-different-failures-46fp</guid>
      <description>&lt;p&gt;Two AI coding agents were given the same task with the same 10-rule instruction file. Both scored 70% adherence. Here's the breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Agent A&lt;/th&gt;
&lt;th&gt;Agent B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;camelCase variables&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No &lt;code&gt;any&lt;/code&gt; type&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No console.log&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Named exports only&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max 300 lines&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test files exist&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Agent A had a type safety gap. It used &lt;code&gt;any&lt;/code&gt; for request parameters even though it defined the correct types in its own &lt;code&gt;types.ts&lt;/code&gt; file. Agent B had a structural discipline gap. It used &lt;code&gt;snake_case&lt;/code&gt; for a variable, added a &lt;code&gt;default export&lt;/code&gt; following Express conventions over the project rules, and generated a 338-line file by adding features beyond the task scope.&lt;/p&gt;

&lt;p&gt;Same score. Completely different engineering weaknesses. That table came from &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;RuleProbe&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About this case study.&lt;/strong&gt; The comparison uses simulated agent outputs with deliberate violations, not live agent runs. Raw JSON reports are in the repo under &lt;code&gt;docs/case-study-data/&lt;/code&gt;. This is documented in the &lt;a href="https://github.com/moonrunnerkc/ruleprobe/blob/main/docs/case-study-v0.1.0.md" rel="noopener noreferrer"&gt;case study&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What RuleProbe is
&lt;/h2&gt;

&lt;p&gt;RuleProbe is an open source CLI that reads AI coding agent instruction files and verifies whether the agent's output followed the rules. It covers six formats: &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;copilot-instructions.md&lt;/code&gt;, &lt;code&gt;GEMINI.md&lt;/code&gt;, and &lt;code&gt;.windsurfrules&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Verification is deterministic. No LLM in the pipeline. The same input produces the same report every time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/ruleprobe" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  How it checks
&lt;/h2&gt;

&lt;p&gt;Three methods, depending on the rule:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AST analysis&lt;/strong&gt; via &lt;a href="https://github.com/dsherret/ts-morph" rel="noopener noreferrer"&gt;ts-morph&lt;/a&gt; handles code structure. Variable and function naming (camelCase), type and interface naming (PascalCase), type annotations (&lt;code&gt;any&lt;/code&gt; detection), export style (named vs default), JSDoc presence on public functions, and import patterns (path aliases, deep relative imports).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filesystem inspection&lt;/strong&gt; handles file-level rules. File naming conventions (kebab-case) and whether test files exist for source files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regex&lt;/strong&gt; handles content patterns like max line length.&lt;/p&gt;

&lt;p&gt;v0.1.0 has 15 matchers across those three methods, covering TypeScript and JavaScript. ts-morph is the AST engine, so other languages aren't supported.&lt;/p&gt;
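
&lt;p&gt;To make the AST method concrete, here's a minimal ts-morph sketch in the spirit of the no-&lt;code&gt;any&lt;/code&gt; matcher (not RuleProbe's actual matcher code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { Project, SyntaxKind } from 'ts-morph';

const project = new Project();
project.addSourceFilesAtPaths('src/**/*.ts');

for (const sourceFile of project.getSourceFiles()) {
  // Walk the AST for explicit `any` keywords instead of grepping text.
  for (const node of sourceFile.getDescendantsOfKind(SyntaxKind.AnyKeyword)) {
    // Same evidence format the report uses: file, line, violation.
    console.log(`${sourceFile.getFilePath()}:${node.getStartLineNumber()} - found: any`);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;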

&lt;p&gt;Output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuleProbe Adherence Report
Rules: 14 total | 11 passed | 3 failed | Score: 79%

PASS  naming/naming-camelcase-variables-5
PASS  naming/naming-pascalcase-types-7
FAIL  forbidden-pattern/forbidden-no-any-type-1
      src/handler.ts:12 - found: req: any
      src/handler.ts:24 - found: data: any
FAIL  forbidden-pattern/forbidden-no-console-log-10
      src/handler.ts:18 - found: console.log("handling request")
FAIL  test-requirement/test-files-exist-11
      src/handler.ts - found: no test file found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;File, line, violation. No ambiguity.&lt;/p&gt;


&lt;h2&gt;
  
  
  The conservative parser
&lt;/h2&gt;

&lt;p&gt;This is a design choice worth explaining. When RuleProbe reads an instruction file, it only extracts rules it can map to a deterministic mechanical check. Everything else gets reported as unparseable.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruleprobe parse CLAUDE.md &lt;span class="nt"&gt;--show-unparseable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;"Write clean code" is unparseable. "Use the repository pattern" is unparseable. "Handle errors gracefully" is unparseable. These can't be verified without judgment, and judgment means variance between runs. RuleProbe doesn't do that.&lt;/p&gt;

&lt;p&gt;The tradeoff: a 30-rule instruction file might produce 12 verified rules and 18 unparseable ones. You see both counts so you know exactly what's being checked and what isn't.&lt;/p&gt;


&lt;h2&gt;
  
  
  Running it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx ruleprobe &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Parse an instruction file:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruleprobe parse CLAUDE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extracted 14 rules:

  forbidden-no-any-type-1
    Category: forbidden-pattern
    Verifier: ast
    Pattern:  no-any (*.ts)
    Source:    "- TypeScript strict mode, no any types"

  naming-kebab-case-files-4
    Category: naming
    Verifier: filesystem
    Pattern:  kebab-case (*.ts)
    Source:    "- File names: kebab-case"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;Verify agent output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruleprobe verify CLAUDE.md ./agent-output &lt;span class="nt"&gt;--format&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Supports &lt;code&gt;--format json&lt;/code&gt;, &lt;code&gt;--format markdown&lt;/code&gt;, and &lt;code&gt;--format rdjson&lt;/code&gt; (reviewdog-compatible). Exit code 0 means all rules passed, 1 means violations found.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Compare two agents:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruleprobe compare AGENTS.md ./claude-output ./copilot-output &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agents&lt;/span&gt; claude,copilot &lt;span class="nt"&gt;--format&lt;/span&gt; markdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  CI with the GitHub Action
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuleProbe&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;check-rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/ruleprobe@v0.1.0&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;instruction-file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AGENTS.md&lt;/span&gt;
          &lt;span class="na"&gt;output-dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No external API keys. Posts results as a PR comment. Supports reviewdog rdjson for inline annotations if you use reviewdog in your pipeline. Exposes &lt;code&gt;score&lt;/code&gt;, &lt;code&gt;passed&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, and &lt;code&gt;total&lt;/code&gt; as step outputs, so you can gate merges on adherence thresholds in downstream steps.&lt;/p&gt;
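
&lt;p&gt;If you'd rather gate in your own script than in a downstream step, the JSON report can drive the same decision. A sketch; the &lt;code&gt;score&lt;/code&gt; field name in the JSON output is an assumption, so inspect a real report first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { readFileSync } from 'node:fs';

const THRESHOLD = 80; // minimum adherence percentage to allow a merge

// e.g. produced by: ruleprobe verify AGENTS.md ./src --format json &gt; report.json
const report = JSON.parse(readFileSync('report.json', 'utf8'));

if (report.score &lt; THRESHOLD) {
  console.error(`adherence ${report.score}% is below the ${THRESHOLD}% threshold`);
  process.exit(1); // fail the CI step
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;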

&lt;p&gt;&lt;strong&gt;All action inputs:&lt;/strong&gt;&lt;/p&gt;
  &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instruction-file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(required)&lt;/td&gt;
&lt;td&gt;Path to your instruction file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;output-dir&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Directory of code to verify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ci&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent label for report metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model label for report metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;format&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;text, json, or markdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;severity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;all&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error, warning, or all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fail-on-violation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fail the check if any rule is violated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;post-comment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Post results as a PR comment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reviewdog-format&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Also output rdjson&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;





&lt;h2&gt;
  
  
  Programmatic API
&lt;/h2&gt;

&lt;p&gt;Five functions if you want to integrate verification into your own tooling:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;parseInstructionFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;verifyOutput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;generateReport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;formatReport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;extractRules&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ruleprobe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;parseInstructionFile&lt;/code&gt; reads the instruction file. &lt;code&gt;verifyOutput&lt;/code&gt; runs the rules. &lt;code&gt;generateReport&lt;/code&gt; builds the adherence report with summary stats. &lt;code&gt;formatReport&lt;/code&gt; renders it as text, JSON, markdown, or rdjson. &lt;code&gt;extractRules&lt;/code&gt; works on raw markdown content if you don't have a file path.&lt;/p&gt;
&lt;/blockquote&gt;
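
&lt;p&gt;Chained together, that's the whole pipeline. A sketch; the exact signatures (sync vs. async, option objects) are assumptions, so check the package typings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import {
  parseInstructionFile,
  verifyOutput,
  generateReport,
  formatReport,
} from 'ruleprobe';

async function main() {
  // Assumed shapes: each step feeds the next.
  const rules = await parseInstructionFile('CLAUDE.md');
  const results = await verifyOutput(rules, './agent-output');
  const report = generateReport(results);
  console.log(formatReport(report, 'markdown'));
}

main();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;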


&lt;h2&gt;
  
  
  What it doesn't cover
&lt;/h2&gt;

&lt;p&gt;15 matchers is a starting point, not full coverage. Real instruction files have rules RuleProbe can't verify yet: architectural patterns, error handling conventions, dependency constraints, API design rules. The parser will tell you what it skipped.&lt;/p&gt;

&lt;p&gt;TypeScript and JavaScript only. ts-morph is the AST engine. Other languages would need a different parser.&lt;/p&gt;

&lt;p&gt;No automated agent invocation. You run the agent separately and point RuleProbe at the output directory.&lt;/p&gt;


&lt;h2&gt;
  
  
  Security and dependencies
&lt;/h2&gt;

&lt;p&gt;RuleProbe never executes scanned code, never makes network calls, never writes to the scanned directory. Paths are resolved and bounded to &lt;code&gt;process.cwd()&lt;/code&gt;. Symlinks outside the project are skipped by default.&lt;/p&gt;

&lt;p&gt;Four runtime dependencies: &lt;strong&gt;chalk&lt;/strong&gt; &lt;code&gt;5.6.2&lt;/code&gt;, &lt;strong&gt;commander&lt;/strong&gt; &lt;code&gt;12.1.0&lt;/code&gt;, &lt;strong&gt;glob&lt;/strong&gt; &lt;code&gt;11.1.0&lt;/code&gt;, &lt;strong&gt;ts-morph&lt;/strong&gt; &lt;code&gt;24.0.0&lt;/code&gt;. All pinned to exact versions. No semver ranges.&lt;/p&gt;



&lt;p&gt;npm: &lt;a href="https://www.npmjs.com/package/ruleprobe" rel="noopener noreferrer"&gt;ruleprobe&lt;/a&gt; | MIT license&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;
        ruleprobe
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Verify whether AI coding agents follow the instruction files they're given
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;RuleProbe&lt;/h1&gt;
&lt;/div&gt;


&lt;p&gt;Verify whether AI coding agents actually follow the instruction files they're given&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;Every AI coding agent reads an instruction file. None of them prove they followed it.&lt;/p&gt;

&lt;p&gt;You write &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; with specific rules: camelCase variables, no &lt;code&gt;any&lt;/code&gt; types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.&lt;/p&gt;

&lt;p&gt;RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Compliance scores with file paths and line numbers as evidence. Deterministic and reproducible by default. Optional semantic analysis for pattern-matching and consistency rules that require codebase-aware judgment.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npm install -g ruleprobe&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or run it directly:&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npx ruleprobe --help&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Parse an instruction file&lt;/strong&gt; to see what rules RuleProbe can extract:&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;ruleprobe parse CLAUDE.md
ruleprobe parse AGENTS.md --show-unparseable&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Verify agent output&lt;/strong&gt;…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>typescript</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>AI coding agents lie about their work. Outcome-based verification catches it.</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 29 Mar 2026 21:59:27 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/ai-coding-agents-lie-about-their-work-outcome-based-verification-catches-it-12b4</link>
      <guid>https://dev.to/moonrunnerkc/ai-coding-agents-lie-about-their-work-outcome-based-verification-catches-it-12b4</guid>
      <description>&lt;p&gt;AI coding agents have a consistency problem. Ask one to add authentication to your project and it'll tell you it's done. Commits made, tests passing, middleware wired up. Check the branch and you'll find a half-written JWT helper, no tests, and a build that doesn't compile.&lt;/p&gt;

&lt;p&gt;This isn't a hallucination problem. The agent did produce code. It just didn't verify that any of it worked before declaring victory. And neither did the tools sitting between the agent and your main branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The transcript trust problem
&lt;/h2&gt;

&lt;p&gt;Most orchestration tools that coordinate AI agents verify work by reading transcripts. The agent says "committed 3 files" or "all tests passing" and the verifier pattern-matches those strings as evidence of completion.&lt;/p&gt;

&lt;p&gt;That's trusting the agent's self-report.&lt;/p&gt;

&lt;p&gt;The issue isn't that agents are deliberately deceptive. It's that they generate completion language as part of their output pattern regardless of the actual state of the codebase. An agent will write "tests passing" into its response while the test suite has syntax errors. It'll claim files were created that only exist in the prompt's hypothetical, not on disk.&lt;/p&gt;

&lt;p&gt;Transcript parsing catches the obvious failures: agent errored out, produced no output, didn't mention anything about the task. It misses the subtle ones: agent produced code that looks right, described it correctly, but the code doesn't compile, doesn't pass tests, or doesn't do what was asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcome-based verification
&lt;/h2&gt;

&lt;p&gt;The alternative is checking what actually happened instead of what the agent said happened.&lt;/p&gt;

&lt;p&gt;This is what Swarm Orchestrator 4.0 implements. After each agent step runs on its isolated git branch, the verifier executes a series of checks against the branch itself:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Fails when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git_diff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Diffs the branch against the recorded base SHA&lt;/td&gt;
&lt;td&gt;No file changes detected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;build_exec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs the detected build command in the worktree&lt;/td&gt;
&lt;td&gt;Non-zero exit code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_exec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs the detected test command in the worktree&lt;/td&gt;
&lt;td&gt;Non-zero exit code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;file_existence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Checks that expected output files exist&lt;/td&gt;
&lt;td&gt;Expected files missing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transcript&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parses agent output for completion evidence&lt;/td&gt;
&lt;td&gt;(supplementary only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Transcript analysis still runs. But when outcome checks are present, transcript-based checks get demoted to &lt;code&gt;required: false&lt;/code&gt;. The build and test execution results gate the merge decision.&lt;/p&gt;
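
&lt;p&gt;In sketch form, the gating rule is small. The shapes below are assumptions for illustration; the post names the checks but not the verifier's actual types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical shapes; the real verifier's types may differ.
interface CheckResult {
  name: string;      // e.g. "build_exec", "transcript"
  required: boolean;
  passed: boolean;
}

const OUTCOME_CHECKS = new Set(["git_diff", "build_exec", "test_exec", "file_existence"]);

function gateMerge(results: CheckResult[]): boolean {
  const outcomePresent = results.some((r) =&amp;gt; OUTCOME_CHECKS.has(r.name));
  // With outcome checks present, transcript evidence is advisory only.
  const effective = results.map((r) =&amp;gt;
    outcomePresent &amp;amp;&amp;amp; r.name === "transcript" ? { ...r, required: false } : r
  );
  // Only required checks gate the merge.
  return effective.filter((r) =&amp;gt; r.required).every((r) =&amp;gt; r.passed);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
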

&lt;p&gt;Stack detection is automatic. The verifier reads &lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;Makefile&lt;/code&gt;, &lt;code&gt;pyproject.toml&lt;/code&gt;, &lt;code&gt;Cargo.toml&lt;/code&gt;, or whatever project configuration exists and runs the appropriate commands. No per-repo configuration.&lt;/p&gt;
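
&lt;p&gt;Detection reduces to a lookup from marker files to commands. The mapping below is a sketch; the real detector presumably covers more stacks and reads scripts out of the config files themselves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { existsSync } from "node:fs";
import { join } from "node:path";

// Illustrative marker-file mapping, not the verifier's actual table.
const STACKS = [
  { marker: "package.json",   build: "npm run build",    test: "npm test" },
  { marker: "Cargo.toml",     build: "cargo build",      test: "cargo test" },
  { marker: "pyproject.toml", build: "pip install -e .", test: "pytest" },
  { marker: "Makefile",       build: "make",             test: "make test" },
];

// First marker file found in the worktree decides the stack.
function detectStack(workdir: string) {
  return STACKS.find((s) =&amp;gt; existsSync(join(workdir, s.marker))) ?? null;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
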

&lt;h2&gt;
  
  
  What happens when verification fails
&lt;/h2&gt;

&lt;p&gt;Blind retry is the default across most agent tooling. Step fails, same prompt runs again, up to some retry limit. The agent has no idea what went wrong.&lt;/p&gt;

&lt;p&gt;Swarm Orchestrator's RepairAgent takes the structured output from the verification checks and feeds it back into the retry prompt. Which check failed, the last 20 lines of build or test output, which files were expected but aren't there. The failure gets classified (build failure, test failure, missing files, no changes) and the repair strategy adapts to the type.&lt;/p&gt;
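
&lt;p&gt;A sketch of that feedback loop. The failure classes come from the post; the shapes and the prompt format are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type FailureKind = "build_failure" | "test_failure" | "missing_files" | "no_changes";

// Hypothetical shape for a failed verification check.
interface FailedCheck {
  name: string;       // e.g. "build_exec"
  output: string;     // captured command output
  missing?: string[]; // set for file_existence failures
}

function classify(check: FailedCheck): FailureKind {
  switch (check.name) {
    case "build_exec":     return "build_failure";
    case "test_exec":      return "test_failure";
    case "file_existence": return "missing_files";
    default:               return "no_changes";
  }
}

// Fold the evidence into the retry prompt instead of replaying it blind.
function repairPrompt(original: string, check: FailedCheck): string {
  const tail = check.output.split("\n").slice(-20).join("\n");
  return [
    original,
    `The previous attempt failed verification: ${classify(check)}.`,
    check.missing?.length ? `Expected but missing: ${check.missing.join(", ")}` : "",
    `Last 20 lines of output:\n${tail}`,
  ].filter(Boolean).join("\n\n");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
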

&lt;p&gt;On the final attempt the prompt includes an explicit priority shift: get something working over getting something complete.&lt;/p&gt;

&lt;p&gt;The difference between "retry with context" and "blind retry" is measurable. An agent that knows the build failed on a missing import has a realistic path to fixing it. An agent re-running the same prompt that produced a broken build has roughly the same odds of producing another broken build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent-agnostic by design
&lt;/h2&gt;

&lt;p&gt;4.0 drops the hard dependency on Copilot CLI. The adapter layer now supports Copilot CLI, Claude Code, and Codex out of the box. The interface is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AgentAdapter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;workdir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;model&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AgentResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
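
&lt;p&gt;For a sense of what sits behind that interface, here's an adapter sketch. The &lt;code&gt;AgentResult&lt;/code&gt; shape and the CLI invocation are made up; the post shows only the interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { execFile } from "node:child_process";

// Assumed result shape; the post doesn't show AgentResult's fields.
interface AgentResult {
  exitCode: number;
  transcript: string;
}

// Illustrative adapter around a hypothetical `codex exec` invocation.
const codexAdapter: AgentAdapter = {
  name: "codex",
  spawn(opts) {
    return new Promise&amp;lt;AgentResult&amp;gt;((resolve) =&amp;gt; {
      execFile(
        "codex", ["exec", opts.prompt],               // hypothetical CLI args
        { cwd: opts.workdir, timeout: opts.timeout },
        (err, stdout) =&amp;gt; resolve({ exitCode: err ? 1 : 0, transcript: stdout })
      );
    });
  },
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
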


&lt;p&gt;Switching agents is one flag at the CLI level (&lt;code&gt;--tool claude-code&lt;/code&gt;) or a per-step setting in a plan file. The orchestrator treats the agent as an interchangeable subprocess, and verification doesn't change based on which agent ran. The branch either builds or it doesn't.&lt;/p&gt;

&lt;p&gt;This also means you can mix agents within a single plan. Use Claude Code for the architecture step, Codex for the boilerplate, Copilot for the tests. Each step gets verified the same way regardless of which agent produced it.&lt;/p&gt;
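
&lt;p&gt;A mixed-agent plan might look roughly like this, shown as a TypeScript literal. The field names are guesses at the plan-file schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical plan shape; the actual schema may differ.
const plan = {
  goal: "Add JWT auth with tests",
  steps: [
    { id: "design", tool: "claude-code", prompt: "Design the auth module layout" },
    { id: "impl",   tool: "codex",       prompt: "Implement the JWT middleware" },
    { id: "tests",  tool: "copilot",     prompt: "Write unit tests for the middleware" },
  ],
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
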
&lt;h2&gt;
  
  
  CI integration
&lt;/h2&gt;

&lt;p&gt;The tool ships as a GitHub Action:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/swarm-orchestrator@swarm-orchestrator&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tests&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;untested&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;modules"&lt;/span&gt;
    &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-code&lt;/span&gt;
    &lt;span class="na"&gt;pr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The Action outputs a JSON result with per-step verification status. You can gate downstream jobs on the verification outcome the same way you'd gate on any other CI check.&lt;/p&gt;
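
&lt;p&gt;The post doesn't publish the result schema, so treat this consumer sketch as illustrative; the field names are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Assumed shape of the Action's JSON result.
interface StepStatus {
  step: string;
  verified: boolean;
  failedChecks: string[];
}

// Gate a downstream job on every step having verified.
function allStepsVerified(result: { steps: StepStatus[] }): boolean {
  return result.steps.every((s) =&amp;gt; s.verified);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
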

&lt;p&gt;Most orchestrators in this space are desktop-first or local-development tools. Even those that support remote execution do not run natively in CI with outcome-verified results. That's the gap Swarm fills.&lt;/p&gt;
&lt;h2&gt;
  
  
  Recipes for repeatable tasks
&lt;/h2&gt;

&lt;p&gt;Generating a plan from scratch for "add tests to this project" is wasteful when the plan structure is the same every time. 4.0 ships with seven parameterized recipes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm use add-tests &lt;span class="nt"&gt;--tool&lt;/span&gt; codex &lt;span class="nt"&gt;--param&lt;/span&gt; &lt;span class="nv"&gt;framework&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;vitest
swarm use add-auth &lt;span class="nt"&gt;--param&lt;/span&gt; &lt;span class="nv"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jwt
swarm use security-audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each recipe is a JSON file with &lt;code&gt;{{parameter}}&lt;/code&gt; placeholders. Custom recipes are one file in &lt;code&gt;templates/recipes/&lt;/code&gt;. The knowledge base tracks recipe outcomes across runs so success rates and failure patterns accumulate over time.&lt;/p&gt;
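
&lt;p&gt;Sketched as a TypeScript literal (the real artifact is a JSON file, and this recipe name and its fields are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// templates/recipes/add-logging.json, written out as a literal.
const addLoggingRecipe = {
  name: "add-logging",
  parameters: ["library"],
  steps: [
    { prompt: "Add {{library}} structured logging to every entry point" },
    { prompt: "Add tests asserting log output on error paths" },
  ],
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
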
&lt;h2&gt;
  
  
  Current state
&lt;/h2&gt;

&lt;p&gt;1,112 tests passing, 1 pending. TypeScript strict mode. ISC license. Five phases of upgrades shipped in this release across the adapter layer, verification engine, repair pipeline, CI integration, and recipe system.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      CI/CD for AI-generated code. Run Copilot, Claude Code, or Codex in parallel; verify every claim against evidence; gate merges on 8 automated quality checks.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;br&gt;
&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="" width="36" height="36"&gt;&lt;/a&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="" width="52" height="52"&gt;&lt;/a&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="Swarm Orchestrator" width="72" height="72"&gt;&lt;/a&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="" width="52" height="52"&gt;&lt;/a&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="" width="36" height="36"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CI/CD for AI-generated code. Run Copilot, Claude Code, or Codex in parallel; verify every claim against evidence; gate merges on 8 automated quality checks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Not an autonomous system builder: an accountability layer around agents you already trust enough to run, but not enough to merge blind. Each step runs on its own isolated branch. Each claim (tests pass, build clean, commit made) is cross-referenced against the transcript and the actual filesystem. Failures are auto-classified, repaired with targeted strategies, and re-verified. Nothing reaches main without passing both the verification engine and the quality gate pipeline. The metric that matters is &lt;strong&gt;cost per rubric point&lt;/strong&gt;, not wall-clock time.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/95c61c397ca3825757ec835268e50886b2c10ddc4f0676e1222b19037610927f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4953432d626c75652e737667" alt="License: ISC"&gt;&lt;/a&gt;
  
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
  
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/64f4f4edbcb5ce478ae77e5187d84186cf323fbea6d76a1c750dd4795ef574a3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d3134393725323070617373696e672d627269676874677265656e2e737667"&gt;&lt;img src="https://camo.githubusercontent.com/64f4f4edbcb5ce478ae77e5187d84186cf323fbea6d76a1c750dd4795ef574a3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d3134393725323070617373696e672d627269676874677265656e2e737667" alt="Tests: 1497 passing"&gt;&lt;/a&gt;
  
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/39e32f87e04dd49db4fa18b3878ad6cb24c09dbaea1af5cfaa8953d61cbdfab4/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d32302532422d677265656e2e737667"&gt;&lt;img src="https://camo.githubusercontent.com/39e32f87e04dd49db4fa18b3878ad6cb24c09dbaea1af5cfaa8953d61cbdfab4/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d32302532422d677265656e2e737667" alt="Node.js 20+"&gt;&lt;/a&gt;
  
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/dc63baa72c8d42e246e791f4e625fa55d7eec24c1332fa5ce0e0d64b459f96c3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f547970655363726970742d352e782d626c75652e737667"&gt;&lt;img src="https://camo.githubusercontent.com/dc63baa72c8d42e246e791f4e625fa55d7eec24c1332fa5ce0e0d64b459f96c3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f547970655363726970742d352e782d626c75652e737667" alt="TypeScript 5.x"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#what-is-this" rel="noopener noreferrer"&gt;What Is This&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#benchmarking" rel="noopener noreferrer"&gt;Benchmarking&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#usage" rel="noopener noreferrer"&gt;Usage&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#github-action" rel="noopener noreferrer"&gt;GitHub Action&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#recipes" rel="noopener noreferrer"&gt;Recipes&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#architecture" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#contributing" rel="noopener noreferrer"&gt;Contributing&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/swarm.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fswarm.png" alt="Swarm Orchestrator TUI dashboard showing parallel agent execution across waves" width="700"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;/div&gt;

&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;See it run end-to-end&lt;/h3&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npm install -g swarm-orchestrator
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; then set up any one of the agent CLIs below, and:&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
