<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brad Kinnard</title>
    <description>The latest articles on DEV Community by Brad Kinnard (@moonrunnerkc).</description>
    <link>https://dev.to/moonrunnerkc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3727405%2Fdace59d9-5970-49b1-9ee7-0836891c5a65.png</url>
      <title>DEV Community: Brad Kinnard</title>
      <link>https://dev.to/moonrunnerkc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/moonrunnerkc"/>
    <language>en</language>
    <item>
      <title>How Swarm Orchestrator v8 Tries to Break Its Own AI Patches</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 10 May 2026 02:10:05 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/how-swarm-orchestrator-v8-tries-to-break-its-own-ai-patches-2513</link>
      <guid>https://dev.to/moonrunnerkc/how-swarm-orchestrator-v8-tries-to-break-its-own-ai-patches-2513</guid>
      <description>&lt;p&gt;Most AI coding tools commit when their own checks pass. Swarm Orchestrator v8 adds a second adversarial layer: independent falsifier adapters that try to break each patch before it merges. v8.0.1 is on &lt;code&gt;main&lt;/code&gt; with that subsystem on by default.&lt;/p&gt;

&lt;p&gt;This post walks through the v8 architecture, the four verification points, the producer/falsifier adapter split, and the limitations that haven't been solved in v8.0 yet.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;What is Swarm Orchestrator?&lt;/strong&gt; A contract-first AI coding swarm with hash-chained evidence and verifier-gated commits. It compiles a natural-language goal into a typed contract, dispatches it to a population of personas inside one cached Anthropic session, races candidate diffs per obligation, and commits only what passes verification. It wraps an LLM; it doesn't replace one.&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  The shape of a run
&lt;/h2&gt;

&lt;p&gt;You hand it a goal in plain English. The contract compiler turns that into &lt;code&gt;contract.jsonl&lt;/code&gt; plus a &lt;code&gt;manifest.json&lt;/code&gt; carrying the goal, repo context, extractor provenance, and a SHA-256 of the canonical contract bytes. Identical inputs produce identical contract hashes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;goal (text)
   |
   v
contract compiler  -&amp;gt;  contract.jsonl + manifest.json
   |
   v
+-------------------------------------------------+
|        population manager (single session)      |
|                                                 |
|  ledger (jsonl, hash-chain) &amp;lt;- personas (8)     |
|       ^                          |              |
|       | tournament + verifier scoring           |
|       |                                         |
|  WASM deterministic floor (zero-LLM obligs)     |
+-------------------------------------------------+
   |                              |
   v                              v
streaming verifier      post-merge integration
   |                              |
   +--------------+---------------+
                  v
       falsifier adapters (Codex, Copilot)
                  |
                  v
            committed diffs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
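&lt;p&gt;For reference, the manifest carries roughly this shape. This is an illustrative sketch, not the real schema; only the fields named above (goal, repo context, extractor provenance, contract hash) are grounded, and the exact key names are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative manifest shape -- key names are assumptions;
// the field list comes from the post above.
const manifest = {
  goal: "add a /health endpoint that returns 200 OK",
  repoContext: { root: ".", branch: "main" },     // assumed keys
  extractorProvenance: { extractor: "default" },  // assumed keys
  contractHash: "sha256:&amp;lt;canonical-contract-bytes-hash&amp;gt;",
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;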


&lt;p&gt;The population manager opens one cached Anthropic session and walks each obligation. It picks the persona whose trigger predicate matches the obligation type. In tournament mode, N candidates run in parallel; the verifier scores them, the top scorer is a commit candidate, and losers get logged but never merge.&lt;/p&gt;
&lt;h2&gt;
  
  
  Two adapter subsystems
&lt;/h2&gt;

&lt;p&gt;The most common confusion in v6 was treating the coding CLIs and the falsifiers as one thing. v8 splits them cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producer adapters&lt;/strong&gt; (&lt;code&gt;src/adapters/&lt;/code&gt;) wrap third-party coding CLIs as the worker in the v6 verified-branch pipeline. Backends: Copilot, Claude Code, Codex, Claude Code Teams. All four are opt-in via &lt;code&gt;swarm run --v6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falsifier adapters&lt;/strong&gt; (&lt;code&gt;src/falsification/adapters/&lt;/code&gt;) take a patch the producer's verifier already accepted and try to falsify the obligation by surfacing a counter-example, regression fixture, or property-violation trace. A confirmed counter-example flips the obligation back to &lt;code&gt;failed&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Falsifier&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Obligation types&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CodexFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;&lt;code&gt;property-must-hold&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CopilotFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;import-graph-must-satisfy&lt;/code&gt;, &lt;code&gt;function-must-have-signature&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ClaudeCodeFalsifier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;off (per-adapter opt-in)&lt;/td&gt;
&lt;td&gt;all three&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CLI surface is one flag: &lt;code&gt;--falsifiers &amp;lt;on|off&amp;gt;&lt;/code&gt; (default on). Per-adapter selection happens at the API layer via &lt;code&gt;defaultAdapterRegistry({ includeCopilot, includeClaudeCode })&lt;/code&gt;.&lt;/p&gt;
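&lt;p&gt;A minimal sketch of that API-layer call. The function name and option names come straight from above; the import path and how you'd consume the registry are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch only: the import path is an assumption.
import { defaultAdapterRegistry } from "swarm-orchestrator";

const registry = defaultAdapterRegistry({
  includeCopilot: true,      // CopilotFalsifier (on by default)
  includeClaudeCode: false,  // ClaudeCodeFalsifier (per-adapter opt-in)
});
// CodexFalsifier is on by default and needs no flag here.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;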
&lt;h2&gt;
  
  
  Four verification points
&lt;/h2&gt;

&lt;p&gt;A patch has to survive these before it merges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-generation memoization.&lt;/strong&gt; Skip generation if the obligation result is already cached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-stream abort.&lt;/strong&gt; During generation, the streaming verifier can abort the call. Works in &lt;code&gt;--mode single&lt;/code&gt; only; tournament mode skips it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-generation per-obligation verifier.&lt;/strong&gt; Scores the candidate diff. In tournament mode, top scorer wins; in single mode it's pass/fail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-merge integration check.&lt;/strong&gt; After the diff lands, the integration check confirms the broader system still holds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The architectural rule from the README: nothing commits without passing the obligation's verifier. Then the falsifiers get a shot.&lt;/p&gt;
&lt;h2&gt;
  
  
  The hash-chained ledger
&lt;/h2&gt;

&lt;p&gt;Every action lands in &lt;code&gt;.swarm/ledger/&amp;lt;run-id&amp;gt;.jsonl&lt;/code&gt; with the SHA of the prior entry. Tampering is detectable; runs resume from any prior state. If a process is killed mid-run, &lt;code&gt;swarm v8 resume &amp;lt;run-id&amp;gt;&lt;/code&gt; walks the ledger and picks up where it left off.&lt;/p&gt;

&lt;p&gt;The ledger format is shared with v6, but v8 writes more granular events (per-persona dispatch, per-candidate score, falsifier verdict) so a run can be replayed or audited end-to-end.&lt;/p&gt;
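&lt;p&gt;To make the chaining concrete, here's the shape of the idea. The exact field names aren't documented in this post, so treat these as placeholders; the invariant is what matters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative only -- field names are assumptions.
// Each entry carries the SHA-256 of the prior entry, so editing
// any historical line breaks every hash after it.
type LedgerEntry = {
  event: string;      // e.g. persona dispatch, candidate score, falsifier verdict
  payload: unknown;
  prevSha256: string; // hash of the previous entry's bytes
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;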
&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
&lt;span class="nb"&gt;cd &lt;/span&gt;swarm-orchestrator &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;link&lt;/span&gt;

&lt;span class="c"&gt;# Compile a goal, then run it&lt;/span&gt;
swarm v8 compile &lt;span class="s2"&gt;"add a /health endpoint that returns 200 OK"&lt;/span&gt; &lt;span class="nt"&gt;--yes&lt;/span&gt;
swarm v8 run .swarm/contracts/&amp;lt;contract-id&amp;gt;

&lt;span class="c"&gt;# Or both in one step (defaults to v8)&lt;/span&gt;
swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"add a /health endpoint that returns 200 OK"&lt;/span&gt;

&lt;span class="c"&gt;# Resume a killed run&lt;/span&gt;
swarm v8 resume &amp;lt;run-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Node &amp;gt;= 20, git &amp;gt;= 2.40, and &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;. Pass &lt;code&gt;--extractor stub --session stub&lt;/code&gt; to run offline.&lt;/p&gt;

&lt;p&gt;There's also a GitHub Action:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/swarm-orchestrator@v8&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;endpoint"&lt;/span&gt;
    &lt;span class="na"&gt;contract-only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;cost-cap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.00"&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  What v8.0 doesn't do
&lt;/h2&gt;

&lt;p&gt;Limitations worth reading before adopting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tournament mode doesn't stream.&lt;/strong&gt; &lt;code&gt;--mode tournament&lt;/code&gt; plus &lt;code&gt;--forbid-import&lt;/code&gt; skips the streaming abort; streaming verification is &lt;code&gt;--mode single&lt;/code&gt; only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-merge failure doesn't auto-rollback.&lt;/strong&gt; The run is marked failed; per-obligation worktree snapshots are post-v8.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--cost-cap&lt;/code&gt; is enforced post-obligation, not mid-call.&lt;/strong&gt; Cumulative spend is checked at the end of each obligation against estimated Sonnet 4 pricing. Exit code 6 if exceeded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandit dispatch is not built (Phase 5).&lt;/strong&gt; Codex and Copilot have disjoint obligation types, so there's nothing to arbitrate between yet.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-vendor producer race is deferred (Phase 6).&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full list with rationale lives in &lt;code&gt;docs/v8-architecture-deviations.md&lt;/code&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Contract-first AI coding swarm with hash-chained evidence. Compiles a goal into typed obligations, races persona candidates per obligation in a single cached inference session, verifies before commit, and logs every action in an append-only ledger.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/assets/header.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fassets%2Fheader.svg" alt="Swarm Orchestrator" width="100%"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Contract-first AI coding swarm with hash-chained evidence and verifier-gated commits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fc208599ef300dfbb7d7b65c32d4e1364b62c8c0bd3cc6df8a16615f7ccd9991/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4953432d626c75653f7374796c653d666c61742d737175617265" alt="License"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/37e12b341829a2c53b69b36b6fe5a9a4f42cf56b82722fac6b5011085a3749e6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d25334525334432302d3333393933333f7374796c653d666c61742d737175617265" alt="Node"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3f31914b57bc82fa5dcbe2b429e1d486362f11bfa4411282ff311f2885102e19/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f616374696f6e732f776f726b666c6f772f7374617475732f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f722f63692e796d6c3f6272616e63683d6d61696e266c6162656c3d6369267374796c653d666c61742d737175617265" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/4aea3e72d83f2dd34ce19e0393e3a10766f325d0d6356fc71c7c190676acf5e2/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f7061636b6167652d6a736f6e2f762f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f723f7374796c653d666c61742d737175617265" alt="Version"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;swarm&lt;/code&gt; compiles a natural-language goal into a typed contract, dispatches it to a
population of personas inside one cached Anthropic session, races candidate diffs per
obligation, and commits only the diffs that pass verification. After the producer's
verifier accepts a patch, registered falsifier adapters get a chance to break it
before it merges. Every action lands in an append-only hash-chained ledger you can
audit, resume, or replay.&lt;/p&gt;
&lt;p&gt;It wraps an LLM; it does not replace one. The model writes the code, the orchestrator
decides what reaches your repo.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Status&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Version &lt;code&gt;8.0.1&lt;/code&gt; on &lt;code&gt;main&lt;/code&gt;. Node &lt;code&gt;&amp;gt;= 20&lt;/code&gt; (CI matrix: 20, 22). License ISC. The v8
architecture is the default for &lt;code&gt;swarm run&lt;/code&gt;; the v6 verified-branch pipeline is
preserved under &lt;code&gt;swarm run --v6&lt;/code&gt; and the &lt;code&gt;swarm swarm&lt;/code&gt; / &lt;code&gt;swarm execute&lt;/code&gt; commands.
Falsifier subsystem: Codex on, Copilot on, ClaudeCode…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How to Write a CLAUDE.md Rule That Actually Gets Enforced</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 10 May 2026 02:06:45 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/how-to-write-a-claudemd-rule-that-actually-gets-enforced-3npa</link>
      <guid>https://dev.to/moonrunnerkc/how-to-write-a-claudemd-rule-that-actually-gets-enforced-3npa</guid>
      <description>&lt;p&gt;Open a CLAUDE.md file at random and you'll find build commands, architecture notes, and rules. The rules tend to be the unenforceable kind. "Write clean code." "Be careful with types." "Follow our conventions." The author meant every word. The agent reads them. And nothing checks whether the agent followed them, because nothing can.&lt;/p&gt;

&lt;p&gt;In a corpus of 580 CLAUDE.md, AGENTS.md, and &lt;code&gt;.cursorrules&lt;/code&gt; files from public GitHub repos with 10+ stars, &lt;strong&gt;74% contained zero machine-extractable rules&lt;/strong&gt;. Not because the authors didn't care about rules. Because most rules were written in a form no parser could pull out as a deterministic check.&lt;/p&gt;

&lt;p&gt;This post is about the difference. Specifically: how to phrase a rule so a parser can extract it and a verifier can check it, without sacrificing what you actually meant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;Enforceability comes from a verifiable surface. A rule is verifiable when there's a concrete pattern in code that either matches it or doesn't. "Use &lt;code&gt;camelCase&lt;/code&gt; for function names" is verifiable: read the AST, list the function names, check the casing. "Name things consistently" is not: there's no concrete pattern to check, only a judgment to make.&lt;/p&gt;

&lt;p&gt;The gap between the two is the gap between intent and enforcement. You meant the same thing in both cases. But only one of them survives translation into a check.&lt;/p&gt;

&lt;p&gt;Here's the heuristic I use: &lt;strong&gt;could a junior engineer with no context mechanically check whether code follows this rule, just by reading the rule and looking at the code?&lt;/strong&gt; If yes, the rule is enforceable. If they'd have to ask "what does 'consistent' mean here?", it isn't.&lt;/p&gt;
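&lt;p&gt;To see how mechanical the check can be, here's a minimal sketch of the camelCase rule against the TypeScript AST. The compiler API calls are real; the rule encoding is mine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch: list function names from the AST, test the casing.
import * as ts from "typescript";

function findBadNames(source: string): string[] {
  const sf = ts.createSourceFile("x.ts", source, ts.ScriptTarget.Latest, true);
  const bad: string[] = [];
  const visit = (node: ts.Node) =&amp;gt; {
    if (ts.isFunctionDeclaration(node) &amp;amp;&amp;amp; node.name &amp;amp;&amp;amp;
        !/^[a-z][A-Za-z0-9]*$/.test(node.name.text)) {
      bad.push(node.name.text);
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);
  return bad;
}

findBadNames("function Do_Thing() {}"); // =&amp;gt; ["Do_Thing"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;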

&lt;h2&gt;
  
  
  Three pairs, walked through
&lt;/h2&gt;

&lt;p&gt;Take a few common intents and look at how they fail or succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type safety.&lt;/strong&gt; You want strong typing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Be careful with types.
Good: No `any` types in `src/`. Async functions require explicit return types.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bad version has no surface. "Careful" isn't a check. The good version has two: a forbidden token (&lt;code&gt;any&lt;/code&gt;) and a structural property (return type annotation on async function declarations). Both check directly against the AST.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Module structure.&lt;/strong&gt; You want predictable imports and exports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Prefer clean module structure.
Good: Named exports only. No default exports. Filenames in kebab-case.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Clean" is meaningless to a parser. Named-only and kebab-case are both binary properties of code that exist or don't. The first version sounds like more guidance because it's broader, but breadth is the problem: it covers everything and enforces nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preferences over alternatives.&lt;/strong&gt; You want React functional components, not class components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  Write modern React.
Good: Prefer functional components over class components.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Modern" is a moving target with no fixed surface. The "prefer X over Y" pattern, on the other hand, has a clean check: count instances of each, compute a ratio, score against a threshold. This is one of the most useful patterns in instruction files because it captures real-world preference (not absolute prohibition) in a measurable way.&lt;/p&gt;
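&lt;p&gt;A sketch of that scoring shape, with crude regexes standing in for real AST detection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch only: real tools would walk the AST, not grep.
function preferenceScore(source: string): number {
  const fnComponents = (source.match(/\bfunction [A-Z]\w*\s*\(/g) ?? []).length;
  const classComponents =
    (source.match(/\bclass [A-Z]\w* extends (React\.)?(Pure)?Component\b/g) ?? []).length;
  const total = fnComponents + classComponents;
  return total === 0 ? 1 : fnComponents / total; // score against a threshold
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;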

&lt;h2&gt;
  
  
  The reference table
&lt;/h2&gt;

&lt;p&gt;Twelve common intents, paired:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;Unenforceable&lt;/th&gt;
&lt;th&gt;Enforceable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naming functions&lt;/td&gt;
&lt;td&gt;Name things consistently&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;camelCase&lt;/code&gt; for function names&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filenames&lt;/td&gt;
&lt;td&gt;Pick reasonable filenames&lt;/td&gt;
&lt;td&gt;All filenames in kebab-case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type safety&lt;/td&gt;
&lt;td&gt;Be careful with types&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;any&lt;/code&gt; types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async return types&lt;/td&gt;
&lt;td&gt;Make types clear&lt;/td&gt;
&lt;td&gt;Async functions require explicit return types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Module exports&lt;/td&gt;
&lt;td&gt;Prefer clean module structure&lt;/td&gt;
&lt;td&gt;Named exports only, no default exports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size&lt;/td&gt;
&lt;td&gt;Keep files manageable&lt;/td&gt;
&lt;td&gt;Maximum 300 lines per file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;Be mindful of logging&lt;/td&gt;
&lt;td&gt;Never use &lt;code&gt;console.log&lt;/code&gt;; use &lt;code&gt;src/logger.ts&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Component style&lt;/td&gt;
&lt;td&gt;Write modern React&lt;/td&gt;
&lt;td&gt;Prefer functional components over class components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package manager&lt;/td&gt;
&lt;td&gt;Use the right package manager&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;pnpm&lt;/code&gt;, not &lt;code&gt;npm&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test files&lt;/td&gt;
&lt;td&gt;Keep tests organized&lt;/td&gt;
&lt;td&gt;All test files end with &lt;code&gt;.test.ts&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error handling&lt;/td&gt;
&lt;td&gt;Handle errors properly&lt;/td&gt;
&lt;td&gt;Async functions must use &lt;code&gt;try/catch&lt;/code&gt; or return a &lt;code&gt;Result&lt;/code&gt; type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit format&lt;/td&gt;
&lt;td&gt;Write clear commit messages&lt;/td&gt;
&lt;td&gt;Use conventional commits (&lt;code&gt;feat:&lt;/code&gt;, &lt;code&gt;fix:&lt;/code&gt;, &lt;code&gt;chore:&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every right-hand cell points at a concrete check: a token, a casing rule, a count, a file pattern, a configured tool. Every left-hand cell points at a judgment.&lt;/p&gt;

&lt;p&gt;Real-world rules usually carry scope: "no &lt;code&gt;any&lt;/code&gt; in &lt;code&gt;src/&lt;/code&gt;," "named exports outside &lt;code&gt;index.ts&lt;/code&gt; files," "no &lt;code&gt;console.log&lt;/code&gt; in production code paths." Scope makes a rule narrower and more accurate without making it less enforceable. The interop layer that genuinely needs &lt;code&gt;any&lt;/code&gt; keeps it; the rest of the codebase doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kinds of checks exist
&lt;/h2&gt;

&lt;p&gt;Worth knowing what's available, because it shapes what's writable. Static analysis tools targeting AI instruction files generally support a few classes of check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AST-level&lt;/strong&gt;: function names, type annotations, import patterns, forbidden tokens, structural properties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filesystem&lt;/strong&gt;: file existence, naming conventions, directory layout, file size limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex&lt;/strong&gt;: literal strings, content patterns, conventional formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling&lt;/strong&gt;: presence and configuration of linters, formatters, package managers, test runners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config-file&lt;/strong&gt;: contents of &lt;code&gt;.eslintrc&lt;/code&gt;, &lt;code&gt;tsconfig.json&lt;/code&gt;, &lt;code&gt;.prettierrc&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git-history&lt;/strong&gt;: commit message formats, branch naming conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preference ratios&lt;/strong&gt;: "prefer X over Y" with a compliance percentage instead of a binary verdict&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your rule maps to one of these classes, it's enforceable. If it doesn't, it isn't. The trick when writing instruction files is to keep that map in mind: when you're about to write "be careful with X", ask which of these classes "carefulness with X" lives in. Usually the answer points at a concrete reformulation.&lt;/p&gt;
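&lt;p&gt;For instance, "be mindful of logging" reformulated into the regex class is nearly a one-liner. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: the enforceable version of "be mindful of logging".
import { readFileSync } from "node:fs";

function violatesNoConsoleLog(path: string): boolean {
  return /\bconsole\.log\(/.test(readFileSync(path, "utf8"));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;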

&lt;h2&gt;
  
  
  The unenforceable rules aren't worthless
&lt;/h2&gt;

&lt;p&gt;Here's a real tension: most of what makes a CLAUDE.md useful isn't enforceable at all. Project context (what the repo does, where the architecture lives), agent behavior directives (be succinct, ask before deleting, don't touch &lt;code&gt;/legacy&lt;/code&gt;), and onboarding instructions are all valuable. None of them extract as rules.&lt;/p&gt;

&lt;p&gt;Don't try to make them enforceable. They're a different kind of content with a different purpose. Project context grounds the agent. Behavior directives shape its style. Neither is supposed to be checked against output; they're checked against the agent's process, which is a different problem.&lt;/p&gt;

&lt;p&gt;The mistake worth avoiding is letting unenforceable prose crowd out enforceable rules. Anthropic's Claude Code best practices doc recommends deleting any instruction the model already follows correctly without it. Most "write clean code" style rules fail that test: the model already does its version of clean code, so the line is taking up attention budget your agent could be spending on the specific, verifiable rules that actually distinguish your codebase from a generic project. Cut what the model already does. Keep the checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  A test for your own files
&lt;/h2&gt;

&lt;p&gt;Pull up your CLAUDE.md or AGENTS.md right now. For each line that looks like a rule, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Could a junior engineer check this without asking clarifying questions?&lt;/li&gt;
&lt;li&gt;Does it name a specific pattern, file, token, casing, or value?&lt;/li&gt;
&lt;li&gt;Would 5 different reviewers all agree on whether a piece of code passes this rule?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a rule fails 1 or 2, it's not a rule, it's a wish. If it fails 3, it's ambiguous. Rewrite or delete.&lt;/p&gt;

&lt;p&gt;If you want a mechanical version of this test, &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;RuleProbe&lt;/a&gt; parses CLAUDE.md, AGENTS.md, &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;.windsurfrules&lt;/code&gt;, GEMINI.md, and &lt;code&gt;copilot-instructions.md&lt;/code&gt; against 102 matchers and tells you which lines extracted as rules and which didn't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; ruleprobe
ruleprobe parse ./CLAUDE.md &lt;span class="nt"&gt;--show-unparseable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--show-unparseable&lt;/code&gt; flag is the interesting one. It surfaces every line that looked rule-shaped but didn't map to a check. That list is your rewrite queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/ruleprobe" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;RuleProbe on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What this leaves out
&lt;/h2&gt;

&lt;p&gt;The hardest case is rules like "follow the existing error handling pattern in this codebase." That's enforceable in principle (compare new code's structural shape against the codebase's dominant pattern), but not by simple AST or regex matching. It needs codebase-aware analysis. Some tools handle that; most don't. If you find yourself writing those kinds of rules, know that they'll either need a tool that does pattern profiling or they'll stay aspirational.&lt;/p&gt;

&lt;p&gt;The other thing enforceability doesn't catch: an agent that follows every rule and still writes broken code. Static rules reduce variance, they don't eliminate it. A function with &lt;code&gt;any&lt;/code&gt; removed and an explicit return type can still have wrong logic. Treat passing rule checks as a floor, not a ceiling.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>I Dropped Multi-Agent Coordination for a 5-Layer Falsification Battery</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sat, 02 May 2026 15:00:11 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/i-dropped-multi-agent-coordination-for-a-5-layer-falsification-battery-48cb</link>
      <guid>https://dev.to/moonrunnerkc/i-dropped-multi-agent-coordination-for-a-5-layer-falsification-battery-48cb</guid>
      <description>&lt;p&gt;Swarm Orchestrator just lost its swarm. Dropped the multi-agent parallel coordination layer. Running one agent now and putting all the weight on a five-layer post-merge falsification battery instead.&lt;/p&gt;

&lt;p&gt;This is an experiment, not an endpoint. v8 will bring proper multi-agent swarming back. The reason for cutting it temporarily: I want to know whether the value I was getting from coordinated parallel agents was the coordination itself, or the verification pressure that coordination produced. Easier to measure with one variable. Intended side effect: cost reduction, since the previous architecture spun up multiple CLI agent instances per run. Real benchmarks pending.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;TL;DR&lt;/strong&gt;: every patch survives a five-layer post-merge battery before the orchestrator declares success. Layers 1 and 2 are hard gates. Layers 3, 4, 5 are advisory and feed a composite score. Hard-gate failure throws before attestation, before final gates, before any external success signal.&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  Pipeline Order
&lt;/h2&gt;

&lt;p&gt;The battery runs once per orchestrator execution against the merged working tree, not per-step branches. The per-step verifier is a separate component. Layers fire in fixed order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Differential gate (hard)&lt;/li&gt;
&lt;li&gt;Mutation gate (hard)&lt;/li&gt;
&lt;li&gt;Cheat detector (advisory)&lt;/li&gt;
&lt;li&gt;Property gate (advisory)&lt;/li&gt;
&lt;li&gt;Attestation (advisory on first run, signed after)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the hard gate fails, the composite is forced to &lt;code&gt;0&lt;/code&gt; and the orchestrator throws &lt;code&gt;falsification battery blocked the patch&lt;/code&gt; before any external success signal can fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Differential Gate (Hard)
&lt;/h2&gt;

&lt;p&gt;Before any agent touches the repo, a synthesizer generates a regression test against the goal. Layer 1 then runs that test in two detached worktrees: one at the base commit, one at the patch commit.&lt;/p&gt;

&lt;p&gt;The contract: the test must fail at base and pass at patch.&lt;/p&gt;

&lt;p&gt;If the test passes at base, the layer returns &lt;code&gt;INVALID_TEST&lt;/code&gt;. This catches the specific failure mode where an agent writes a tautological test that passes against any code. Without this gate, that pattern slips through every other check downstream.&lt;/p&gt;

&lt;p&gt;If no command can be synthesized and the caller doesn't pass &lt;code&gt;--differentialTestCommand&lt;/code&gt;, the layer fails closed. Deliberate policy.&lt;/p&gt;
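&lt;p&gt;A sketch of the layer's contract. The two-worktree mechanics below are plain git; the function names and verdict strings just wrap what the post describes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch, not the real implementation.
import { execSync } from "node:child_process";

function runsGreen(commit: string, testCmd: string): boolean {
  const dir = `/tmp/diffgate-${commit.slice(0, 8)}`;
  execSync(`git worktree add --detach ${dir} ${commit}`);
  try {
    execSync(testCmd, { cwd: dir, stdio: "ignore" });
    return true;
  } catch {
    return false;
  } finally {
    execSync(`git worktree remove --force ${dir}`);
  }
}

function differentialGate(base: string, patch: string, testCmd: string): string {
  if (runsGreen(base, testCmd)) return "INVALID_TEST"; // tautological test
  return runsGreen(patch, testCmd) ? "PASS" : "FAIL";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;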

&lt;h2&gt;
  
  
  Layer 2: Mutation Gate (Hard)
&lt;/h2&gt;

&lt;p&gt;Runs Stryker for JS/TS, mutmut for Python, PITest (Gradle/Maven) for Java, against changed files only. First runs the regression command; if that fails, the layer fails immediately. On pass, dispatches the mutation tool and parses the reporter output, including Stryker's clear-text table layout.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.6&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.8&lt;/td&gt;
&lt;td&gt;WARNING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;≥ 0.8&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Patches with no meaningful test coverage of the changed lines fail here even when &lt;code&gt;npm test&lt;/code&gt; is green.&lt;/p&gt;
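&lt;p&gt;The thresholds above, as code (these are the defaults; &lt;code&gt;.swarm/gates.yaml&lt;/code&gt; can override them):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;function mutationStatus(score: number): "FAIL" | "WARNING" | "PASS" {
  if (score &amp;lt; 0.6) return "FAIL";
  if (score &amp;lt; 0.8) return "WARNING";
  return "PASS";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;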

&lt;h2&gt;
  
  
  Layer 3: Cheat Detector (Advisory)
&lt;/h2&gt;

&lt;p&gt;Diff-static scan for known agent-shortcut patterns. Each rule maps to a real category of behaviour I've watched agents fall into under pressure.&lt;/p&gt;

&lt;p&gt;Cheat detector rule IDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;test-modification&lt;/code&gt;: agent edited the test instead of the implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;complexity-mismatch&lt;/code&gt;: trivial implementation paired with a non-trivial test&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mock-mutation&lt;/code&gt;: real call replaced with a mock that returns the expected value&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;exception-swallowing&lt;/code&gt;: caught and ignored exceptions to make tests pass&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hardcoded-answer&lt;/code&gt;: literal return of the expected output
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Optional Semgrep integration loads any registered cheat-rule packs into a transient config. The scan matches patterns against added and removed lines; it does not execute the patch. An agent that hides a hardcoded answer behind a small computation can still pass this layer. Honest about that.&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer 4: Property Gate (Advisory)
&lt;/h2&gt;

&lt;p&gt;Discovers modified TS/JS/Python functions, parses their parameter types, maps each to a fast-check arbitrary or Hypothesis strategy, generates a harness, runs it. Counterexamples surface as findings. Untyped or unsupported types degrade to a low-severity advisory finding rather than blocking.&lt;/p&gt;
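&lt;p&gt;The generated harness amounts to something like this fast-check sketch. &lt;code&gt;fc.assert&lt;/code&gt;, &lt;code&gt;fc.property&lt;/code&gt;, and &lt;code&gt;fc.pre&lt;/code&gt; are the real API; the target function and the property are invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import fc from "fast-check";

// Hypothetical function under test.
const clamp = (n: number, lo: number, hi: number) =&amp;gt;
  Math.min(Math.max(n, lo), hi);

fc.assert(
  fc.property(fc.integer(), fc.integer(), fc.integer(), (n, lo, hi) =&amp;gt; {
    fc.pre(lo &amp;lt;= hi);              // skip invalid inputs
    const out = clamp(n, lo, hi);
    return out &amp;gt;= lo &amp;amp;&amp;amp; out &amp;lt;= hi; // property: result stays in range
  })
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;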
&lt;h2&gt;
  
  
  Layer 5: Attestation (Advisory on First Run)
&lt;/h2&gt;

&lt;p&gt;Reads the &lt;code&gt;refs/notes/swarm-attestation&lt;/code&gt; git note for the patch commit, validates the in-toto SLSA v1.0 envelope's subject SHA against the patch commit, then verifies the cosign signature. On the first run for a commit there's no note yet, so this layer reports advisory-warn and the post-battery attestation step writes the note.&lt;/p&gt;

&lt;p&gt;The note is verifiable later via &lt;code&gt;swarm attest verify &amp;lt;commit&amp;gt;&lt;/code&gt;. A downstream consumer can verify the patch survived the battery without trusting the running orchestrator process.&lt;/p&gt;
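&lt;p&gt;Without the CLI, the raw note is reachable with plain git; the envelope inside is the in-toto SLSA statement described above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Reads the raw attestation note for a commit. Signature
// verification (cosign) is what `swarm attest verify` adds on top.
import { execSync } from "node:child_process";

function readAttestation(commit: string): string {
  return execSync(`git notes --ref=swarm-attestation show ${commit}`, {
    encoding: "utf8",
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;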
&lt;h2&gt;
  
  
  Composite Scoring
&lt;/h2&gt;

&lt;p&gt;When the hard gate passes, a weighted composite is computed across the three advisory layers and any optional advisory quality-gate results. Failed advisory gates each subtract a fixed penalty.&lt;/p&gt;

&lt;p&gt;Default scoring (overridable via &lt;code&gt;.swarm/gates.yaml&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;composite threshold: &lt;code&gt;0.7&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;weights: cheat detector &lt;code&gt;0.4&lt;/code&gt;, property gate &lt;code&gt;0.4&lt;/code&gt;, attestation &lt;code&gt;0.2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;advisory gate penalty: &lt;code&gt;0.02&lt;/code&gt; per failure
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;code&gt;humanReviewRequired&lt;/code&gt; is true when the composite score is below threshold or any advisory layer is in advisory-warn status.&lt;/p&gt;
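&lt;p&gt;A sketch of that arithmetic with the default weights. Treating each advisory layer's result as a 0..1 score is my assumption; the weights, threshold, and penalty are the defaults above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch only. Advisory-warn status also forces human review (not shown).
function composite(cheat: number, property: number, attestation: number,
                   failedAdvisoryGates: number) {
  const score = 0.4 * cheat + 0.4 * property + 0.2 * attestation
    - 0.02 * failedAdvisoryGates;
  return { score, humanReviewRequired: score &amp;lt; 0.7 };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;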
&lt;h2&gt;
  
  
  Where It Actually Runs
&lt;/h2&gt;

&lt;p&gt;Three real call sites, not just unit tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production orchestrator on every &lt;code&gt;swarm&lt;/code&gt; run&lt;/li&gt;
&lt;li&gt;Synthetic calibration corpus (36 paired test specs across 6 broken-category families) executing in CI on every push&lt;/li&gt;
&lt;li&gt;SWE-bench harness using Layer 1 and Layer 4 as standalone spot-check eval drivers&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Honest Caveats
&lt;/h2&gt;

&lt;p&gt;These are in &lt;code&gt;docs/known-gaps.md&lt;/code&gt; and I won't hide them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Differential gate is host-Python-sensitive on legacy codebases. The synth-eval can reflect import-chain errors rather than assertion outcomes. The authoritative resolution gate in the per-instance Docker image is unaffected.&lt;/li&gt;
&lt;li&gt;Mutation gate skips quietly when no changed files match supported languages. YAML, Markdown, Rust, Go diffs don't get mutation-tested.&lt;/li&gt;
&lt;li&gt;Cheat detector is diff-static, not behavioural. The hidden-computation-around-hardcoded-answer pattern can pass it.&lt;/li&gt;
&lt;li&gt;Attestation signing is best-effort. Cosign-not-installed errors get logged and the run proceeds without a note. The note's absence is reflected in Layer 5's advisory-warn on subsequent runs but does not block.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Why Run This Experiment
&lt;/h2&gt;

&lt;p&gt;If the falsification battery alone produces patches that survive scrutiny at acceptable quality, then a lot of the apparent value of multi-agent coordination was actually the verification pressure it created, not the agent diversity itself. If the battery alone isn't enough, then v8 multi-agent gets a clearer mandate: the swarm is the value, not the side effect.&lt;/p&gt;

&lt;p&gt;Either result is useful. The point of the rewrite is to make the answer measurable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devtools</category>
      <category>testing</category>
    </item>
    <item>
      <title>swarm-orchestrator v7.0.0-alpha.0</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Thu, 30 Apr 2026 05:28:06 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/swarm-orchestrator-v700-alpha0-41g9</link>
      <guid>https://dev.to/moonrunnerkc/swarm-orchestrator-v700-alpha0-41g9</guid>
      <description>&lt;p&gt;The agent generates code. The orchestrator tries to find reasons not to trust it.&lt;/p&gt;

&lt;p&gt;That sentence is the entire pivot. Earlier versions of swarm-orchestrator coordinated multiple agents working on the same task. v7 wraps a single agent CLI (Copilot, Claude Code, or Codex) and runs five independent checks on the patch before allowing a merge.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;TL;DR&lt;/strong&gt;

&lt;p&gt;Five-layer verification battery sits between any agent CLI and your &lt;code&gt;main&lt;/code&gt; branch. Two layers are hard gates. Three feed a composite score. Every verified merge gets a signed SLSA attestation as a git note.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  The five checks
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Gate type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Intent verification&lt;/td&gt;
&lt;td&gt;Patch doesn't actually fix the stated problem&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Regression verification&lt;/td&gt;
&lt;td&gt;Patch breaks existing behavior, or test coverage is too weak to know&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Solution quality&lt;/td&gt;
&lt;td&gt;Agent gamed the test (hardcoded values, swallowed exceptions, modified tests)&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Behavioral verification&lt;/td&gt;
&lt;td&gt;Patch works on the happy path, crashes on edge cases&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Provenance&lt;/td&gt;
&lt;td&gt;No signed attestation produced for the merge&lt;/td&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1. Intent verification
&lt;/h3&gt;

&lt;p&gt;The patch must make a previously-failing test pass. For SWE-bench instances, that's the &lt;code&gt;FAIL_TO_PASS&lt;/code&gt; test from the instance JSON. For user-facing goals, a reviewer synthesizes a regression test before the worker runs and confirms it fails against the base commit first.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Regression verification
&lt;/h3&gt;

&lt;p&gt;Existing tests must pass. Then mutation testing runs on the modified files to check whether coverage around the change is actually strong enough to catch regressions. A patch that works but lives in weakly-tested code gets flagged.&lt;/p&gt;

&lt;p&gt;Mutation testing tooling per language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JS / TS&lt;/strong&gt;: Stryker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: mutmut&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Java&lt;/strong&gt;: PITest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mutation score thresholds are configurable in &lt;code&gt;.swarm/gates.yaml&lt;/code&gt;. Defaults: below 0.6 fails, 0.6 to 0.8 warns, above 0.8 passes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Solution quality
&lt;/h3&gt;

&lt;p&gt;A Semgrep rule pack scans for the specific shortcuts agents take when they're being graded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardcoded values matching test expectations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;try/catch&lt;/code&gt; blocks swallowing the exact exception a failing test was asserting on&lt;/li&gt;
&lt;li&gt;Modifications to test files outside the stated scope&lt;/li&gt;
&lt;li&gt;Mock mutations that make tests pass without changing the implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Behavioral verification
&lt;/h3&gt;

&lt;p&gt;Property-based testing runs against modified functions for 60 seconds each, using Hypothesis (Python) or fast-check (TypeScript). Counterexamples that crash the patched code or violate type contracts get reported.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Provenance
&lt;/h3&gt;

&lt;p&gt;A signed SLSA v1.0 attestation is generated for each verified merge and attached to the commit as a git note. Signed via cosign keyless OIDC. The attestation contains agent identity, model version, per-layer results, and the composite score.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm attest verify &amp;lt;commit&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That command pulls the note and verifies the signature. Useful when something breaks in production three months later and someone asks which agent wrote the offending code and what was checked at merge time.&lt;/p&gt;


&lt;h2&gt;
  
  
  Status
&lt;/h2&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Alpha.&lt;/strong&gt; SWE-bench Verified 50-instance sweeps across Copilot CLI, Claude Code, and Codex are running now. Headline metric is the &lt;strong&gt;falsification catch rate&lt;/strong&gt;: of the patches each agent claimed succeeded, what percentage failed at least one layer. Numbers drop in a follow-up post when the sweeps complete.&lt;br&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where this goes next
&lt;/h2&gt;

&lt;p&gt;v8 brings parallel execution back, applied to verification instead of generation.&lt;/p&gt;

&lt;p&gt;The orchestrator will compute a risk score for each patch, then spawn a population of independent falsifiers sized to that risk. Falsifiers share findings through a coordination channel so a discovery from one steers the targeting of others. A bandit selects which falsifier types to spawn based on past outcomes.&lt;/p&gt;

&lt;p&gt;The v7 five-layer battery becomes the seed pool that v8 grows from. The project name finally fits.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Independent verification battery for patches written by AI coding agents. Wraps Copilot, Claude Code, and Codex; applies a five-layer falsification battery (intent, mutation, cheat detection, property tests, signed attestation) to gate merges.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/assets/header.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fassets%2Fheader.svg" alt="Swarm Orchestrator" width="100%"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Independent verification battery for patches written by AI coding agents.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fc208599ef300dfbb7d7b65c32d4e1364b62c8c0bd3cc6df8a16615f7ccd9991/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4953432d626c75653f7374796c653d666c61742d737175617265" alt="License"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/7e9d8f19047f8a8c87d4828d268725442644c934e1e8acc4d5387426dabe6d41/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f76657273696f6e2d372e302e302d2d616c7068612e302d6f72616e67653f7374796c653d666c61742d737175617265" alt="Version"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/package.json" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/37e12b341829a2c53b69b36b6fe5a9a4f42cf56b82722fac6b5011085a3749e6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d25334525334432302d3333393933333f7374796c653d666c61742d737175617265" alt="Node"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3f31914b57bc82fa5dcbe2b429e1d486362f11bfa4411282ff311f2885102e19/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f616374696f6e732f776f726b666c6f772f7374617475732f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f722f63692e796d6c3f6272616e63683d6d61696e266c6162656c3d6369267374796c653d666c61742d737175617265" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/stargazers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1e5ff00a2deeb89446b34d1c735acc066221a43fb000f5d5a446b33581a6edb1/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f6d6f6f6e72756e6e65726b632f737761726d2d6f7263686573747261746f723f7374796c653d666c61742d737175617265" alt="Stars"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#how-it-works" rel="noopener noreferrer"&gt;How It Works&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#documentation" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#contributing" rel="noopener noreferrer"&gt;Contributing&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Wraps third-party coding-agent CLIs (Copilot, Claude Code, Codex), runs worker and reviewer steps on isolated git branches, and applies a five-layer falsification battery to each agent-authored patch. Hard gates block patches that fail intent or regression checks; advisory layers feed a composite score.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You run this around an agent CLI, not instead of one. The agent produces the patch; the orchestrator tries to break it. Patches that survive merge to &lt;code&gt;main&lt;/code&gt;; patches that don't are rolled back with a verification report.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Five-layer falsification battery.&lt;/strong&gt; Intent verification, regression and mutation testing, cheat detection, property-based testing, and signed attestation. Layers 1 and 2 are hard gates; layers 3 to 5 feed an advisory composite score. Implementations live under &lt;code&gt;src/verification/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolated worker and reviewer steps.&lt;/strong&gt; Each step runs…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>94% of Published SKILL.md Files Skip the Spec's Two Most Basic Patterns</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 29 Apr 2026 02:30:37 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/94-of-published-skillmd-files-skip-the-specs-two-most-basic-patterns-oo0</link>
      <guid>https://dev.to/moonrunnerkc/94-of-published-skillmd-files-skip-the-specs-two-most-basic-patterns-oo0</guid>
      <description>&lt;p&gt;The agentskills.io spec recommends two things in every description: start with an action verb, and include a trigger phrase like "use when..." that tells the routing layer when to fire the skill. They take five seconds to add and they're the difference between a skill an agent picks up and a skill that sits unused in the catalog.&lt;/p&gt;

&lt;p&gt;I sampled 500 skills at random from a 1,436-skill public corpus and measured both. 5.8% follow both recommendations. 61.8% follow neither.&lt;/p&gt;

&lt;p&gt;What follows is the full breakdown of what the SKILL.md ecosystem actually looks like in production, as of late April 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Corpus: &lt;code&gt;sickn33/antigravity-awesome-skills&lt;/code&gt; at HEAD on April 29, 2026. This is the largest publicly bundled SKILL.md collection in a single repo (1,436 indexed skills with metadata for category, source, and risk classification).&lt;/p&gt;

&lt;p&gt;Sample: 500 skills, random with seed 42 for reproducibility.&lt;/p&gt;

&lt;p&gt;Tool: &lt;a href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;&lt;code&gt;skillcheck&lt;/code&gt;&lt;/a&gt; v1.2.0 from PyPI.&lt;/p&gt;

&lt;p&gt;Per-skill features captured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every skillcheck diagnostic (rule, severity, message)&lt;/li&gt;
&lt;li&gt;description quality score, body line count, and body and metadata token estimates&lt;/li&gt;
&lt;li&gt;activation entropy and top-hypothesis score from &lt;code&gt;--activation-hypotheses&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;structural features computed locally: description length in chars and words, action verb in first position, trigger-phrase presence, presence of &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt; subdirectories, and frontmatter field count and which fields&lt;/li&gt;
&lt;li&gt;the antigravity-supplied category, source, and risk metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Caveat one: skillcheck's description quality score is a heuristic that includes action-verb and trigger-phrase detection as positive signals. So the correlation between these two features and the score is partly mechanical. The headline finding is not "we discovered these patterns predict quality." It's "the spec recommends these patterns, the linter that encodes the spec rewards them, and almost nobody is using them."&lt;/p&gt;

&lt;p&gt;Caveat two: antigravity's bundler injects &lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;, and &lt;code&gt;category&lt;/code&gt; fields into the SKILL.md frontmatter when packaging skills. The author-original frontmatter analysis below excludes these injected fields.&lt;/p&gt;

&lt;p&gt;Reproduce in five commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;skillcheck&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.2.0
git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/sickn33/antigravity-awesome-skills.git
&lt;span class="nb"&gt;cd &lt;/span&gt;antigravity-awesome-skills
&lt;span class="c"&gt;# Then sample from skills_index.json with seed 42 and run skillcheck against each&lt;/span&gt;
&lt;span class="c"&gt;# Full analysis script: see the dataset link at the bottom&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  The two-pattern adoption gap
&lt;/h2&gt;

&lt;p&gt;Every skill description was classified on two binary features: does it start with an action verb (Generates, Validates, Creates, Builds, Analyzes, etc., from a 90-verb allowlist), and does it contain a trigger phrase (&lt;code&gt;use when&lt;/code&gt;, &lt;code&gt;use this skill when&lt;/code&gt;, &lt;code&gt;when the user&lt;/code&gt;, &lt;code&gt;when working with&lt;/code&gt;, &lt;code&gt;whenever&lt;/code&gt;, etc.)?&lt;/p&gt;
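&lt;p&gt;Both checks are trivially mechanical, which is part of the point. A sketch (the real allowlist has 90 verbs; this shows a handful):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the two binary features, with an abbreviated verb list.
const VERBS = /^(Generates|Validates|Creates|Builds|Analyzes)\b/;
const TRIGGERS = /\b(use (this skill )?when|when the user|when working with|whenever)\b/i;

function classify(description: string) {
  return {
    actionVerb: VERBS.test(description.trim()),
    triggerPhrase: TRIGGERS.test(description),
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;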

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Has both action verb and trigger phrase&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;5.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action verb only&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;td&gt;21.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigger phrase only&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;10.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neither&lt;/td&gt;
&lt;td&gt;309&lt;/td&gt;
&lt;td&gt;61.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The same four groups, scored against skillcheck's description quality metric:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median score&lt;/th&gt;
&lt;th&gt;% scoring 70+&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Has both&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action verb only&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;72.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigger phrase only&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;94.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neither&lt;/td&gt;
&lt;td&gt;309&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;td&gt;8.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 100% rate in the both-features group isn't magic. It reflects that skillcheck's heuristic was designed around the spec's recommendations and rewards skills that follow them. What's actually striking is the bottom line: 309 of 500 published skills skip both recommendations. That's the working majority of the ecosystem leaving easy quality on the floor.&lt;/p&gt;
&lt;h2&gt;
  
  
  What authors actually fill in
&lt;/h2&gt;

&lt;p&gt;Outside &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt;, frontmatter is mostly empty. The median author-original frontmatter (excluding the bundler's injected fields) has just two fields. Two.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Adoption&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;description&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;author&lt;/td&gt;
&lt;td&gt;10.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tags&lt;/td&gt;
&lt;td&gt;10.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tools&lt;/td&gt;
&lt;td&gt;8.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;license&lt;/td&gt;
&lt;td&gt;3.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;allowed-tools&lt;/td&gt;
&lt;td&gt;2.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;version&lt;/td&gt;
&lt;td&gt;2.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;triggers&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;user-invokable&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;capabilities&lt;/td&gt;
&lt;td&gt;0.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The spec offers &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, &lt;code&gt;tags&lt;/code&gt;, &lt;code&gt;allowed-tools&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;agent&lt;/code&gt;, &lt;code&gt;hooks&lt;/code&gt;, &lt;code&gt;user-invocable&lt;/code&gt;, &lt;code&gt;disable-model-invocation&lt;/code&gt;, &lt;code&gt;skills&lt;/code&gt;, &lt;code&gt;mode&lt;/code&gt;. Almost none of them are being used. 80% of authors stop after name and description. There's an entire optional metadata layer the spec defines and the ecosystem ignores.&lt;/p&gt;
&lt;h2&gt;
  
  
  Progressive disclosure adoption is 16%
&lt;/h2&gt;

&lt;p&gt;The spec's load-bearing concept is progressive disclosure: keep metadata tiny so the routing layer scans it cheaply, keep the body lean so it fits the agent's context window, push heavy material into &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt; subdirectories that load only when needed.&lt;/p&gt;
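
&lt;p&gt;Concretely, a progressively disclosed skill is laid out like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-skill/
  SKILL.md       # tiny metadata + lean body; always scanned by the router
  resources/     # templates and data files, loaded only when needed
  scripts/       # executable helpers, loaded only when needed
  references/    # long-form docs and big tables, loaded only when needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;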

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Subdirectory&lt;/th&gt;
&lt;th&gt;Adoption&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;resources/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scripts/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;references/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Any of the three&lt;/td&gt;
&lt;td&gt;16.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;84% of skills inline everything in &lt;code&gt;SKILL.md&lt;/code&gt;. The whole architectural promise of progressive disclosure (multiple skills can sit in the agent's catalog without overwhelming context) requires authors to actually use the pattern. Most don't.&lt;/p&gt;
&lt;h2&gt;
  
  
  Body bloat is real
&lt;/h2&gt;

&lt;p&gt;23% of skills triggered &lt;code&gt;disclosure.body-bloat&lt;/code&gt; warnings, meaning they contain code blocks over 50 lines or tables over 20 rows in the SKILL.md body itself. These are exactly the things the progressive disclosure pattern was designed to push out into &lt;code&gt;references/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;13.6% exceeded the spec's 500-line soft cap on body length, and 8.4% tripped skillcheck's 5,000-token body-budget warning. Token estimates only get reported for skills that cross the warning threshold, so that 8.4% is a flagged floor, not a measured distribution.&lt;/p&gt;
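
&lt;p&gt;The code-block half of the bloat check is easy to approximate. A sketch that just counts lines inside fenced blocks, which is not skillcheck's exact rule logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def oversized_code_blocks(body, max_lines=50):
    """Yield the line counts of fenced code blocks longer than max_lines."""
    for match in re.finditer(r"```.*?\n(.*?)```", body, re.DOTALL):
        n = len(match.group(1).splitlines())
        if n &amp;gt; max_lines:
            yield n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;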
&lt;h2&gt;
  
  
  Description length sweet spot
&lt;/h2&gt;

&lt;p&gt;Quality scores rise with description length up to roughly 150-200 characters, hold near that level through the mid-200s, then fall off:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Length range (chars)&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25-49&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-99&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100-149&lt;/td&gt;
&lt;td&gt;158&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150-199&lt;/td&gt;
&lt;td&gt;131&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200-249&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;67.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;250-299&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The spec's character cap is 1,024. Almost nobody's pushing it. The ecosystem clusters between 100 and 200 chars (median 145), just under the bottom edge of the quality plateau. Authors writing 150+ char descriptions get noticeably better routing signal density.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cross-source patterns
&lt;/h2&gt;

&lt;p&gt;Antigravity's index classifies each skill's source. Quality patterns by source class:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source class&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Median quality&lt;/th&gt;
&lt;th&gt;% action verb&lt;/th&gt;
&lt;th&gt;% trigger&lt;/th&gt;
&lt;th&gt;% progressive disclosure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;community&lt;/td&gt;
&lt;td&gt;394&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;26.6%&lt;/td&gt;
&lt;td&gt;17.5%&lt;/td&gt;
&lt;td&gt;16.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;external_repo&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;65.0&lt;/td&gt;
&lt;td&gt;34.2%&lt;/td&gt;
&lt;td&gt;31.6%&lt;/td&gt;
&lt;td&gt;18.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;official_org&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;77.8%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;33.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;personal&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;50.0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three observations, with the caveat that n is 9 for official orgs and 14 for personal sets, so read those two rows directionally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills from official org repos (Anthropic, Hugging Face, etc.) hit 77.8% action-verb adoption, miles above the community baseline, but zero trigger-phrase use; their descriptions are direct and verb-led without the "use when" preamble.&lt;/li&gt;
&lt;li&gt;Skills from individual external repos (someone's personal GitHub project) hit the highest trigger-phrase rate (31.6%), suggesting individual maintainers solving their own activation problem think harder about it than community contributors writing for a shared list.&lt;/li&gt;
&lt;li&gt;Skills tagged "personal" (someone's curated set of their own work) hit 0% on both patterns, the cleanest signal that "I made this for me" doesn't translate to "an agent will pick this up."&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Skillcheck v1.2.0 against the corpus
&lt;/h2&gt;

&lt;p&gt;v1.2.0 shipped on April 28, 2026. Run against the 500-skill sample, its rule set found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 of 500 skills produced an actual ERROR (0.2%): &lt;code&gt;android_ui_verification&lt;/code&gt;, which has invalid characters in its name.&lt;/li&gt;
&lt;li&gt;499 of 500 produced WARNINGs (99.8%).&lt;/li&gt;
&lt;li&gt;0 skills passed completely clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most-fired rules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.field.unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;description.quality-score&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;499&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.body-bloat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;115&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compat.unverified&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.metadata-budget&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sizing.body.line-count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disclosure.body-budget&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.description.person-voice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.field.ecosystem&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sizing.body.token-estimate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontmatter.name.reserved-word&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;frontmatter.field.unknown&lt;/code&gt; warning fires on every file because antigravity injects bundler-only fields into the frontmatter (&lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;); strip those and the genuine unknown-field rate drops dramatically. Worth knowing if you're running skillcheck against bundled corpora versus author-original repos.&lt;/p&gt;
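
&lt;p&gt;If you're linting a bundled corpus and want author-original numbers, stripping the injected keys first takes a few lines (a sketch assuming PyYAML and standard &lt;code&gt;---&lt;/code&gt;-delimited frontmatter):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml

INJECTED = {"risk", "source", "date_added", "category", "id"}

def strip_bundler_fields(skill_md):
    """Drop bundler-injected frontmatter keys before linting."""
    parts = skill_md.split("---", 2)
    if len(parts) &amp;lt; 3:
        return skill_md  # no frontmatter block to clean
    meta = yaml.safe_load(parts[1]) or {}
    kept = {k: v for k, v in meta.items() if k not in INJECTED}
    return "---\n" + yaml.safe_dump(kept, sort_keys=False) + "---" + parts[2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;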
&lt;h2&gt;
  
  
  What this means if you publish skills
&lt;/h2&gt;

&lt;p&gt;Four things, each fixable in a single commit per skill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Start the description with an action verb (&lt;code&gt;Generates&lt;/code&gt;, &lt;code&gt;Validates&lt;/code&gt;, &lt;code&gt;Creates&lt;/code&gt;, &lt;code&gt;Analyzes&lt;/code&gt;, &lt;code&gt;Refactors&lt;/code&gt;, &lt;code&gt;Audits&lt;/code&gt;, etc.). Not &lt;code&gt;Expert in&lt;/code&gt;, not &lt;code&gt;Comprehensive&lt;/code&gt;, not &lt;code&gt;One-stop&lt;/code&gt;. The verb tells the routing layer what the skill does in two syllables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include a trigger phrase (&lt;code&gt;Use when ...&lt;/code&gt;, &lt;code&gt;Trigger when ...&lt;/code&gt;, &lt;code&gt;Use this skill when the user ...&lt;/code&gt;). The agent's routing decision is "should I activate this." A trigger phrase answers it directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aim for 175-225 characters in the description. Short descriptions don't carry enough routing signal; long ones bury it. A worked example combining the first three points follows this list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Push large code blocks (&amp;gt;50 lines), large tables (&amp;gt;20 rows), and detailed reference material out of &lt;code&gt;SKILL.md&lt;/code&gt; and into &lt;code&gt;resources/&lt;/code&gt;, &lt;code&gt;scripts/&lt;/code&gt;, or &lt;code&gt;references/&lt;/code&gt;. The body should describe the work; the reference files should hold the work.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
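
&lt;p&gt;Put together, the first three look something like this (a hypothetical skill with illustrative wording; the description runs about 180 characters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: api-contract-validator
description: Validates OpenAPI contracts against live endpoint responses.
  Use when the user asks to check an implementation against its spec,
  verify response schemas, or debug contract drift.
---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;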

&lt;p&gt;That's it. Four small changes, and the first two alone move a skill from the 61.8% of the ecosystem ignoring both spec recommendations into the 5.8% following them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Methodology, for anyone who wants to push back
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tool: &lt;code&gt;skillcheck&lt;/code&gt; v1.2.0 from PyPI (released April 28, 2026)&lt;/li&gt;
&lt;li&gt;Corpus: &lt;code&gt;sickn33/antigravity-awesome-skills&lt;/code&gt; at HEAD on April 29, 2026 (1,436 indexed skills)&lt;/li&gt;
&lt;li&gt;Sample: 500 skills, drawn with &lt;code&gt;random.seed(42)&lt;/code&gt; then &lt;code&gt;random.sample&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Per-skill processing: &lt;code&gt;skillcheck path --format json --skip-ref-check&lt;/code&gt; plus &lt;code&gt;skillcheck path --activation-hypotheses --format json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Feature extraction: action-verb match against a 90-verb allowlist (gerund and base forms); trigger-phrase match against 9 regex patterns; structural facts computed from filesystem and parsed frontmatter&lt;/li&gt;
&lt;li&gt;Quality score: pulled from skillcheck's &lt;code&gt;description.quality-score&lt;/code&gt; info diagnostic (a published heuristic whose source is at &lt;code&gt;src/skillcheck/rules/description.py&lt;/code&gt; in the skillcheck repo)&lt;/li&gt;
&lt;li&gt;Frontmatter analysis: bundler-injected fields (&lt;code&gt;risk&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;date_added&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;) excluded from the author-original counts above&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full dataset (500 skills, all features, all diagnostics) and the analysis output are in the skillcheck repo under &lt;a href="https://github.com/moonrunnerkc/skillcheck/tree/main/docs" rel="noopener noreferrer"&gt;&lt;code&gt;docs/&lt;/code&gt;&lt;/a&gt;. Anyone who wants to verify a finding, slice it differently, or run the same pipeline against a different corpus has everything they need.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;This study used skillcheck's symbolic mode and the activation-hypotheses generator. The agent-native critique mode (&lt;code&gt;--ingest-critique&lt;/code&gt;) and capability graph extraction (&lt;code&gt;--ingest-graph&lt;/code&gt;) weren't run here because they require a real agent in the loop and would have made the corpus run significantly longer. A follow-up study using those modes on a smaller subset (50-100 skills) would tell us what an agent actually sees in a skill versus what a static linter can measure. That's the next post.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;
        skillcheck
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Cross-agent skill quality gate for SKILL.md files. Validates frontmatter, scores description discoverability, checks file references, enforces three-tier token budgets, and flags compatibility issues across Claude Code, VS Code/Copilot, Codex, and Cursor.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;

  
  
  &lt;img alt="skillcheck" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fskillcheck%2FHEAD%2F.github%2Fbanner.svg" width="600"&gt;

&lt;br&gt;
&lt;p&gt;&lt;strong&gt;Cross-agent skill quality gate for &lt;code&gt;SKILL.md&lt;/code&gt; files.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What This Does&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;skillcheck&lt;/code&gt; validates SKILL.md files against the &lt;a href="https://agentskills.io/specification" rel="nofollow noopener noreferrer"&gt;agentskills.io specification&lt;/a&gt;: frontmatter structure, description quality, body size, file references, and cross-agent compatibility. New in v1.0: agent-native semantic self-critique, heuristic capability graph extraction with five structural analyzers, and a per-skill validation history ledger. It does not call any LLM API, execute skill instructions, or modify files.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why This Exists&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Analysis of 580 AI instruction files found that 96% of their content cannot be verified by any static tool. A separate survey found that 22% of SKILL.md files fail basic structural validation. Skills get written, committed, and published to catalogs; nobody proves they work.&lt;/p&gt;

&lt;p&gt;skillcheck addresses both gaps with a two-mode design. When a calling agent is present, it uses that agent for semantic self-critique and capability graph extraction: the agent reads the skill's instructions and reports whether they are clear, complete, and internally…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/skillcheck" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>claude</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Jupyter notebook bug that only crashes for other people</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Tue, 28 Apr 2026 04:53:59 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/the-jupyter-notebook-bug-that-only-crashes-for-other-people-5aek</link>
      <guid>https://dev.to/moonrunnerkc/the-jupyter-notebook-bug-that-only-crashes-for-other-people-5aek</guid>
      <description>&lt;p&gt;Cell 0 uses &lt;code&gt;df&lt;/code&gt;. Cell 1 defines &lt;code&gt;df&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Notebook works for you because your kernel ran the cells in some other order and the variable's still in memory. You commit. Someone clones the repo, hits Restart and Run All, dies on cell 0.&lt;/p&gt;

&lt;p&gt;Standard Python linters can't catch this. ruff, flake8, mypy operate on one source file at a time. A notebook is N cells whose execution order in your kernel may have nothing to do with their order on disk. The bug isn't inside any single cell. It's in the relationship between cells.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nborder&lt;/code&gt; is a static linter for that relationship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rules
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Flags&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB101&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;execution_count&lt;/code&gt; decreases in source order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB201&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Name used in cell N, only defined in cell M where M &amp;gt; N&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB102&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Name used somewhere, never defined anywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NB103&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stochastic call (numpy, torch, tensorflow, stdlib random) before any seed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How the cross-cell analysis works
&lt;/h2&gt;

&lt;p&gt;Each cell gets parsed with &lt;a href="https://github.com/Instagram/LibCST" rel="noopener noreferrer"&gt;libCST&lt;/a&gt;. A visitor extracts symbol definitions (assignments, function defs, class defs, imports) and symbol uses (name references, attribute roots) per cell. Connect them across cells in source order, you get a dataflow graph at notebook scope.&lt;/p&gt;

&lt;p&gt;NB201 findings are uses whose nearest matching definition lives in a later cell. NB102 findings are uses with no matching definition anywhere.&lt;/p&gt;
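
&lt;p&gt;A toy version of the per-cell extraction using the stdlib &lt;code&gt;ast&lt;/code&gt; module (nborder itself uses libCST, which also preserves formatting, but &lt;code&gt;ast&lt;/code&gt; is enough to show the graph):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

def defs_and_uses(cell_source):
    """Return (defined, used) name sets for one cell's source."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            (defined if isinstance(node.ctx, ast.Store) else used).add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
    return defined, used

def use_before_assign(cells):
    """NB201-style findings: a name used in cell i but only defined later.
    Within-cell statement order is ignored here for brevity."""
    per_cell = [defs_and_uses(src) for src in cells]
    for i, (_, used) in enumerate(per_cell):
        defined_so_far = set().union(*(d for d, _ in per_cell[: i + 1]))
        defined_later = set().union(*(d for d, _ in per_cell[i + 1 :]))
        for name in used - defined_so_far:
            if name in defined_later:
                yield i, name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;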

&lt;p&gt;The graph also makes the auto-fix safe. When NB201 fires, the fixer runs a topological sort over cell dependency edges. Sort succeeds, cells get reordered to respect dataflow and execution counts get cleared. Cycle detected, fixer bails with an explicit message naming the cycle.&lt;/p&gt;
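
&lt;p&gt;The reorder step maps directly onto the stdlib. A sketch; the real fixer also clears execution counts and keeps untouched cells byte-for-byte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from graphlib import CycleError, TopologicalSorter

def reorder_cells(cells, deps):
    """deps maps every cell index to the set of cell indices it reads from."""
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # err.args[1] names the cycle; bail rather than guess an order
        raise SystemExit(f"dependency cycle, not auto-fixable: {err.args[1]}")
    return [cells[i] for i in order]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;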

&lt;p&gt;&lt;strong&gt;NB201 fix example.&lt;/strong&gt; Input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cell 0
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# cell 1
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Run &lt;code&gt;nborder check --fix notebook.ipynb&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;notebook.ipynb:cell_0:1:10: NB201 Variable `df` used in cell 0 is only defined in cell 1. The notebook will fail on Restart-and-Run-All. [*]
Fix outcomes:
  reorder: applied (reordered 2 cells and cleared execution counts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cell 0
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="c1"&gt;# cell 1
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Cell IDs preserved. Execution counts cleared. Second &lt;code&gt;nborder check&lt;/code&gt; exits 0.&lt;/p&gt;



&lt;h2&gt;
  
  
  NB103 and seed injection
&lt;/h2&gt;

&lt;p&gt;NB103 walks the same graph for stochastic calls (&lt;code&gt;np.random.rand&lt;/code&gt;, &lt;code&gt;torch.rand&lt;/code&gt;, &lt;code&gt;tf.random.normal&lt;/code&gt;, &lt;code&gt;random.random&lt;/code&gt;) firing before any matching seed. The fix injects a single seed cell at the right position. Multi-library notebooks get one cell:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Alias-aware. &lt;code&gt;import numpy as numpy_lib&lt;/code&gt; produces a seed line using &lt;code&gt;numpy_lib&lt;/code&gt;, not a redundant fresh import. After fixing a NumPy notebook, computed cell outputs are byte-identical across consecutive &lt;code&gt;jupyter nbconvert --execute&lt;/code&gt; runs.&lt;/p&gt;

&lt;p&gt;JAX and scikit-learn get diagnostic-only handling. JAX needs &lt;code&gt;PRNGKey&lt;/code&gt; threading through call signatures. sklearn &lt;code&gt;random_state=None&lt;/code&gt; needs a value chosen against your testing strategy. Neither is a single line you can inject.&lt;/p&gt;
&lt;h2&gt;
  
  
  Byte-stable writer
&lt;/h2&gt;

&lt;p&gt;Parse a notebook, modify nothing, write it back, bytes match exactly. Verified against &lt;code&gt;nbformat&lt;/code&gt; v4.0, v4.4, v4.5 fixtures plus a real-world notebook corpus. When the writer does mutate during a fix, only the cells that actually changed get rewritten. Cell IDs, metadata, and unrelated cells stay verbatim.&lt;/p&gt;
&lt;h2&gt;
  
  
  Outputs
&lt;/h2&gt;

&lt;p&gt;Four reporters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;text&lt;/strong&gt;: ruff-style &lt;code&gt;path:cell:line:col: NB### message&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;json&lt;/strong&gt;: machine-readable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;github&lt;/strong&gt;: &lt;code&gt;::error file=...,line=...,title=NB201::&lt;/code&gt; annotations for PR inline comments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sarif&lt;/strong&gt;: SARIF 2.1.0, schema-validated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pre-commit hook and a composite GitHub Action included:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/nborder@v0.1.4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebooks/&lt;/span&gt;
    &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NB201,NB103&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  What it doesn't do
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Doesn't execute notebooks. Pair with &lt;a href="https://github.com/computationalmodelling/nbval" rel="noopener noreferrer"&gt;nbval&lt;/a&gt; or &lt;a href="https://github.com/nteract/papermill" rel="noopener noreferrer"&gt;papermill&lt;/a&gt; for kernel-level validation.&lt;/li&gt;
&lt;li&gt;Doesn't lint cell-internal style. That's &lt;a href="https://github.com/nbQA-dev/nbQA" rel="noopener noreferrer"&gt;nbqa&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Dynamic name resolution (&lt;code&gt;exec&lt;/code&gt;, &lt;code&gt;getattr&lt;/code&gt;, &lt;code&gt;**kwargs&lt;/code&gt;, monkey-patching) is invisible. Same limitation as any static analyzer.&lt;/li&gt;
&lt;li&gt;Cell magics are stripped before analysis. Names introduced by &lt;code&gt;%%capture&lt;/code&gt; get tracked. Anything magic-internal does not.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;nborder
nborder check path/to/notebooks/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Python 3.10+.&lt;/p&gt;


&lt;/div&gt;




&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/nborder" rel="noopener noreferrer"&gt;
        nborder
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A fast, opinionated linter and auto-fixer for Jupyter notebook hidden-state and execution-order bugs.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;nborder&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;A fast, opinionated linter and auto-fixer for Jupyter notebook hidden-state and execution-order bugs.&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/nborder/docs/images/hero.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fnborder%2FHEAD%2Fdocs%2Fimages%2Fhero.png" alt="nborder catches four classes of notebook bug in one pass"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://pypi.org/project/nborder/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e3b5ccfec928f35e7daa5ff4a841dd0685a5a3646652971eb5834e527ee0e373/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6e626f726465722e737667" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/nborder/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/moonrunnerkc/nborder/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/nborder/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/299dedd9b8667ac146540cd90fa831a9803e6e152f430bfbefaa9bee8d56236a/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f6e626f726465722e737667" alt="Python"&gt;&lt;/a&gt;
&lt;a href="https://github.com/moonrunnerkc/nborder/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/08cef40a9105b6526ca22088bc514fbfdbc9aac1ddbf8d4e6c750e3a88a44dca/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d626c75652e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What this catches&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;One-line example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NB101&lt;/td&gt;
&lt;td&gt;Non-monotonic execution counts&lt;/td&gt;
&lt;td&gt;Cell 1 ran with &lt;code&gt;In [3]:&lt;/code&gt; after cell 0 ran with &lt;code&gt;In [5]:&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB102&lt;/td&gt;
&lt;td&gt;Won't survive Restart-and-Run-All&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;print(df)&lt;/code&gt; references a name no cell in the notebook defines.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB201&lt;/td&gt;
&lt;td&gt;Use-before-assign across cells&lt;/td&gt;
&lt;td&gt;Cell 0 uses &lt;code&gt;df&lt;/code&gt;; &lt;code&gt;df = ...&lt;/code&gt; only appears in cell 1.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NB103&lt;/td&gt;
&lt;td&gt;Stochastic library used without seed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;np.random.rand(3)&lt;/code&gt; runs with no seed call before it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;Each rule has a docs page under &lt;a href="https://github.com/moonrunnerkc/nborder/docs/rules/" rel="noopener noreferrer"&gt;&lt;code&gt;docs/rules/&lt;/code&gt;&lt;/a&gt; explaining the bug class, a bad and good example, and the auto-fix behaviour. The four sections below walk through each rule with the diagnostic nborder actually emits.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;NB101: out-of-order execution&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;execution_count&lt;/code&gt; field on each cell records the order Jupyter actually ran cells in, not the order they appear in the file. When those orders disagree, the recorded…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/nborder" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/nborder" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>python</category>
      <category>jupyter</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Four Security Bugs That Shipped in AI-Generated Code (and How They Got Caught)</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:36:15 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/four-security-bugs-that-shipped-in-ai-generated-code-and-how-they-got-caught-10i8</link>
      <guid>https://dev.to/moonrunnerkc/four-security-bugs-that-shipped-in-ai-generated-code-and-how-they-got-caught-10i8</guid>
      <description>&lt;p&gt;A single Copilot CLI run against a FastAPI application produced four distinct security issues. The code worked. Tests passed. The endpoint did what was asked. None of the issues would surface during a demo or a code review focused on functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  User input rendered as raw HTML
&lt;/h2&gt;

&lt;p&gt;The application tracks satellite data. Satellite names come from user input. The agent rendered them directly into HTML templates in four separate locations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;strong&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/strong&amp;gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sat1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; vs &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sat2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No escaping. Four blocks, same pattern. A single-purpose security scanning agent found all four and applied &lt;code&gt;markupsafe.escape()&lt;/code&gt;. A general-purpose agent reviewing the same code caught three of four, missing one buried in a conditional branch.&lt;/p&gt;
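
&lt;p&gt;The patched rendering, for reference; &lt;code&gt;escape()&lt;/code&gt; is the markupsafe call the scanning agent applied, and &lt;code&gt;t&lt;/code&gt; is the same object as in the snippet above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from markupsafe import escape

# Same line as above, with the interpolated values escaped before rendering
html += f"&amp;lt;strong&amp;gt;{escape(t.risk)}&amp;lt;/strong&amp;gt;: {escape(t.sat1)} vs {escape(t.sat2)}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;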

&lt;p&gt;The difference isn't model quality. The security-focused agent had a narrower scope and explicit instructions to scan for unescaped user input in template rendering. Scope and prompt specificity determined the outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Health endpoint that lies to the load balancer
&lt;/h2&gt;

&lt;p&gt;The agent built a &lt;code&gt;/health&lt;/code&gt; endpoint. It returned HTTP 200 unconditionally, including when the database was unreachable.&lt;/p&gt;

&lt;p&gt;Kubernetes liveness and readiness probes interpret 200 as "this instance is healthy, keep routing traffic." An instance that returns 200 with a dead database stays in the rotation. Users hit it. Requests fail. The cluster thinks everything is fine.&lt;/p&gt;

&lt;p&gt;The correct response is 503 (Service Unavailable). The orchestrator's verification caught this because runtime behavior checks are part of the quality gate surface, not just static analysis.&lt;/p&gt;
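
&lt;p&gt;A minimal corrected sketch; &lt;code&gt;check_database()&lt;/code&gt; here is a hypothetical helper standing in for whatever connectivity probe the app actually has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

async def check_database():
    # hypothetical stand-in for a real connectivity probe
    return False

@app.get("/health")
async def health():
    if not await check_database():
        # 503 tells the load balancer to pull this instance from rotation
        return JSONResponse(status_code=503, content={"status": "degraded"})
    return {"status": "ok"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;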

&lt;p&gt;This one's subtle. The endpoint "works" in every test environment where the database is actually running. It only fails in the exact production scenario it was designed to protect against.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exception details returned to clients
&lt;/h2&gt;

&lt;p&gt;Error handlers used &lt;code&gt;str(e)&lt;/code&gt; as the response body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Database connection strings, file paths, internal state. All returned directly to whoever triggered the error. In a security audit this is an information disclosure finding. In a FastAPI app behind an API gateway, it's a path to mapping internal infrastructure.&lt;/p&gt;
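
&lt;p&gt;The conventional fix: log the details server-side, hand the client a generic message plus an opaque correlation ID. A sketch, with &lt;code&gt;do_work()&lt;/code&gt; as a hypothetical stand-in for the handler body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging
import uuid

logger = logging.getLogger(__name__)

def do_work(request):
    # hypothetical stand-in for the real handler body
    raise RuntimeError("postgres://user:hunter2@10.0.3.7/internal")

def handle(request):
    try:
        return do_work(request)
    except Exception:
        error_id = uuid.uuid4().hex
        # full traceback stays in server logs, keyed by the opaque ID
        logger.exception("request failed error_id=%s", error_id)
        return {"error": "internal error", "error_id": error_id}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;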

&lt;h2&gt;
  
  
  Deprecated datetime API
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;datetime.utcnow()&lt;/code&gt; has been deprecated since Python 3.12. The replacement is &lt;code&gt;datetime.now(timezone.utc)&lt;/code&gt;. The agent also used &lt;code&gt;time.time()&lt;/code&gt; for uptime tracking, which is affected by NTP clock adjustments and can report negative uptime if the system clock steps backward. &lt;code&gt;time.monotonic()&lt;/code&gt; exists specifically for this case.&lt;/p&gt;
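
&lt;p&gt;Both replacements are one-liners:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from datetime import datetime, timezone

created_at = datetime.now(timezone.utc)    # replaces deprecated datetime.utcnow()

START = time.monotonic()                   # replaces time.time() for uptime
uptime_seconds = time.monotonic() - START  # immune to NTP steps; never negative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;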

&lt;p&gt;Neither of these will cause a production outage today. Both are the kind of technical debt that accumulates when generated code isn't checked against current language standards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;None of these bugs required a sophisticated analysis to find. They're patterns: unescaped user input in templates, unconditional success responses in health checks, raw exception strings in error responses, deprecated stdlib usage. Each one is a known category with a known fix.&lt;/p&gt;

&lt;p&gt;The problem is attention. A general-purpose agent optimizing for "make this feature work" doesn't allocate attention to these categories unless explicitly prompted. The feature works. The tests pass. The agent moves on.&lt;/p&gt;

&lt;p&gt;This is where orchestration changes the economics. Instead of one agent covering everything, specialized agents with narrow scopes check specific categories. A security auditor scans for injection and information disclosure. A runtime checker validates health endpoint semantics. Each agent's prompt is focused enough that known bug patterns get caught.&lt;/p&gt;

&lt;p&gt;The alternative is what most developers do today: manually reprompt. "Now check for XSS." "Now add proper error handling." "Now fix the health check to actually check health." We measured this on the same codebase. 14 follow-up prompts to bring the standalone output to the same level. Each prompt required reading the previous output, identifying what was wrong, and writing a specific correction. About 45 minutes of continuous supervision.&lt;/p&gt;

&lt;p&gt;The orchestrated run took 22 minutes, unattended. 7 premium requests vs 15. Zero human review cycles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swarm Orchestrator v5.0.0
&lt;/h2&gt;

&lt;p&gt;The tool that caught these is open source. It wraps existing agent CLIs (Copilot, Claude Code, Codex) and adds verification, quality gates, and parallel execution. It doesn't generate code. It delegates code generation and verifies the output against outcome-based checks: git diff, build success, test pass, runtime behavior.&lt;/p&gt;

&lt;p&gt;v5.0.0 adds three features relevant to this problem:&lt;/p&gt;

&lt;p&gt;Spec-aware planning reads the quality gate configuration before generating agent prompts. Security requirements, test coverage thresholds, and configuration standards get injected before agents write code, not discovered through iteration afterward.&lt;/p&gt;

&lt;p&gt;SARIF output exports quality gate violations as SARIF 2.1.0 JSON compatible with GitHub code scanning. Same PR annotation workflow teams already use for CodeQL.&lt;/p&gt;

&lt;p&gt;Per-project gate configuration via &lt;code&gt;.swarm/gates.yaml&lt;/code&gt; lets teams override thresholds and disable gates that don't apply to their project type.&lt;/p&gt;

&lt;p&gt;1,386 passing tests, 84 source files, 7 documented benchmarks. The release notes include commit hashes for every bug fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Swarm Orchestrator on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;What categories of bugs do you consistently find in AI-generated code that could be caught by a specialized check rather than manual review?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We Parsed 580 AI Instruction Files. 96% of the Content Can't Be Verified.</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:31:26 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/we-parsed-580-ai-instruction-files-96-of-the-content-cant-be-verified-4cg5</link>
      <guid>https://dev.to/moonrunnerkc/we-parsed-580-ai-instruction-files-96-of-the-content-cant-be-verified-4cg5</guid>
      <description>&lt;p&gt;Every AI coding agent reads an instruction file. CLAUDE.md, AGENTS.md, .cursorrules, whatever your agent uses. You write rules in it. The agent says "Done." And you have no idea whether it followed any of them.&lt;/p&gt;

&lt;p&gt;We wanted to know what's actually inside these files. Not what people think they contain, but what a machine can extract and verify through static analysis. So we scraped instruction files from 568 public GitHub repos with 10+ stars, ran them through a parser backed by 102 matchers across 8 verifier engines (AST, filesystem, regex, tree-sitter, preference, tooling, config-file, git-history), and counted what came out.&lt;/p&gt;

&lt;p&gt;The short version: across the entire corpus, 3.8% of lines were extracted as verifiable coding rules. The other 96% is markdown headers, code examples, project descriptions, build commands, agent behavior directives, and contextual prose.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dataset
&lt;/h2&gt;

&lt;p&gt;580 instruction files from 568 repos, including Sentry (43k stars), PingCAP/TiDB (40k), Lerna (36k), Dragonfly (30k), Kubernetes/kops (17k), javascript-obfuscator (16k), RabbitMQ (14k), Google APIs (14k), Redpanda (12k), and hundreds of others. Six file formats represented: AGENTS.md (149 files), CLAUDE.md (111), .cursorrules (102), .windsurfrules (95), GEMINI.md (89), and copilot-instructions.md (34). This sample skews toward larger public repos. Enterprise internal repos with stricter governance, or solo projects with tightly scoped instruction files, may look different. We'd like to see that data.&lt;/p&gt;

&lt;p&gt;The parser reads each file and classifies every line: is this a rule that can be checked against code, or is it something else? "Something else" includes headers, blank lines, code blocks, explanatory prose, build instructions, and agent personality configuration.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Corpus stats:&lt;/strong&gt; 8,222 total instruction lines parsed. 309 rules extracted. 7,913 lines classified as non-rule content.&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  What instruction files actually contain
&lt;/h2&gt;

&lt;p&gt;The 96% that isn't rules breaks down into several categories. Some of it is necessary context (project structure explanations, build command documentation). Some of it is agent behavior configuration ("be succinct," "avoid providing explanations"). Some of it is just markdown formatting overhead.&lt;/p&gt;

&lt;p&gt;Here's what stood out: 430 of the 580 files (74%) had zero extractable rules. Of those, 67 were completely empty to the parser: zero extracted, zero unparseable. Many were single-line redirects. Dragonfly's .cursorrules (30k stars) says "READ AGENTS.md." Umi's .cursorrules (16k stars) contains the single word "RULE.md." Mautic's GEMINI.md says "Read and follow all instructions in ./AGENTS.md."&lt;/p&gt;

&lt;p&gt;At the other end, a few files were dense with rules. Apache Skywalking-java's CLAUDE.md extracted 6 rules from 26 lines (23%). Cloudflare chanfana's AGENTS.md: 5 rules from 21 lines (24%). But those files tend to be short, focused lists of concrete instructions.&lt;/p&gt;

&lt;p&gt;The heavy files tell a different story. javascript-obfuscator's CLAUDE.md (16k stars): 197 lines, zero rules extracted. These files are documentation with no machine-verifiable instructions embedded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parse rate distribution across all 580 files:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parse Rate&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0% (no rules)&lt;/td&gt;
&lt;td&gt;430&lt;/td&gt;
&lt;td&gt;74.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1-9%&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;12.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-19%&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;9.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-29%&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;2.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30-49%&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;1.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;= 80%&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only 2 files (0.3%) had parse rates at or above 80%. Nearly three quarters had zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of content the parser correctly skips
&lt;/h2&gt;

&lt;p&gt;"3.8% extraction rate" sounds like the parser is broken. It isn't. These are lines that genuinely aren't rules:&lt;/p&gt;

&lt;p&gt;Markdown structure (headers, horizontal rules, blank lines). Code examples showing how to use a function or run a command. Project descriptions explaining what the repo does. Build and deployment instructions. Links to external documentation. Agent behavior directives that have no code-level representation ("be concise," "ask before making changes"). Workflow instructions ("use this branch strategy," "run tests before pushing").&lt;/p&gt;

&lt;p&gt;The parser isn't failing on these. It's correctly identifying them as not-rules. The denominator is every line in the file, not every line that looks like it could be a rule.&lt;/p&gt;

&lt;p&gt;A second metric tells the complementary story. 150 of 580 files (25.9%) contained at least one extractable rule. Across those 150 files, 309 rules is an average of 2.1 rules per file. So only a quarter of instruction files contain anything enforceable at all, and when they do, they typically contain two rules. The 3.8% describes the corpus-wide line ratio. The 25.9% and 2.1-per-file numbers describe what rule-writers are actually producing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a "verifiable rule" looks like
&lt;/h2&gt;

&lt;p&gt;The 309 rules that did get extracted map to concrete checks. Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Use camelCase for function names" (AST naming check)&lt;/li&gt;
&lt;li&gt;"No &lt;code&gt;any&lt;/code&gt; types" (TypeScript type safety check)&lt;/li&gt;
&lt;li&gt;"Use named exports, not default exports" (import pattern check)&lt;/li&gt;
&lt;li&gt;"Prefer &lt;code&gt;const&lt;/code&gt; over &lt;code&gt;let&lt;/code&gt;" (preference ratio check)&lt;/li&gt;
&lt;li&gt;"Test files must exist for every source file" (filesystem check)&lt;/li&gt;
&lt;li&gt;"Use Yarn, not npm" (tooling check)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each rule gets a category, a verifier type (AST, filesystem, regex, tree-sitter, preference, tooling, config-file, or git-history), and a qualifier (always, prefer, when-possible, avoid-unless, try-to, never).&lt;/p&gt;
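
&lt;p&gt;In other words, each extracted rule is roughly a record like this (field names illustrative, not RuleProbe's actual schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class ExtractedRule:
    # illustrative shape, not RuleProbe's real schema
    text: str        # e.g. "Use camelCase for function names"
    category: str    # e.g. "naming", "type-safety", "import-pattern"
    verifier: str    # one of: ast, filesystem, regex, tree-sitter,
                     # preference, tooling, config-file, git-history
    qualifier: str   # one of: always, prefer, when-possible,
                     # avoid-unless, try-to, never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;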

&lt;p&gt;&lt;strong&gt;Rule extraction by category:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Rules Extracted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;naming&lt;/td&gt;
&lt;td&gt;169&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;structure&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;code-style&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;forbidden-pattern&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;type-safety&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dependency&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;error-handling&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;import-pattern&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;test-requirement&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Naming rules dominate: 55% of all extracted rules. That's likely a combination of two factors. Naming conventions ("use camelCase," "kebab-case filenames") are the most concrete, unambiguous instructions people write, so they appear frequently. They're also the rule class that static analysis matchers handle most cleanly, so the parser has high affinity for them. We can't fully separate how much of the 55% is user behavior vs. parser strength, but both contribute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule extraction by instruction file type:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Files with Rules&lt;/th&gt;
&lt;th&gt;Rules&lt;/th&gt;
&lt;th&gt;Total Lines&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;copilot-instructions.md&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;556&lt;/td&gt;
&lt;td&gt;5.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.cursorrules&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;1,508&lt;/td&gt;
&lt;td&gt;5.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;149&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;1,961&lt;/td&gt;
&lt;td&gt;4.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.windsurfrules&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;1,866&lt;/td&gt;
&lt;td&gt;2.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;1,501&lt;/td&gt;
&lt;td&gt;2.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GEMINI.md&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;830&lt;/td&gt;
&lt;td&gt;1.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;copilot-instructions.md had the highest extraction rate (5.9%), likely because those files tend to be shorter and more prescriptive. GEMINI.md files had the lowest (1.4%).&lt;/p&gt;

&lt;h2&gt;
  
  
  E2E verification: does excalidraw follow its own instruction files?
&lt;/h2&gt;

&lt;p&gt;This is a pipeline demonstration on one repo, not broad validation across ecosystems. We ran the full pipeline on excalidraw (~95k stars) because it's large, well-maintained, and has instruction files with extractable rules: both a CLAUDE.md and a copilot-instructions.md.&lt;/p&gt;

&lt;p&gt;The parser found 9 verifiable rules across both files. Deterministic analysis scored 66.1% compliance. Semantic analysis (structural fingerprinting of 626 source files) produced 9 verdicts, all resolved via fast-path vector similarity. Zero LLM calls, zero cost:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Compliance&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefer functional components&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;structural-fast-path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PascalCase type naming&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;structural-fast-path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async try/catch usage&lt;/td&gt;
&lt;td&gt;0.983&lt;/td&gt;
&lt;td&gt;structural-fast-path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contextual error logging&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;td&gt;structural-fast-path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yarn as package manager&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript required&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optional chaining preference&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;camelCase variables&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UPPER_CASE constants&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;no matching topic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rules that match established code pattern topics (component-structure, error-handling) score 0.97+, meaning the codebase's structural fingerprint strongly matches the instruction. The remaining five rules scored a neutral 0.50 because they describe tooling choices and naming conventions that don't have structural AST representations. That's itself a finding: even among the 4% of lines that get extracted as verifiable rules, some fall into categories that resist automated verification beyond simple presence checks. The verifier is real, but not comprehensive. No static analysis tool covers every rule class, and pretending otherwise would be dishonest.&lt;/p&gt;

&lt;p&gt;Privacy note: 626 files scanned, all file IDs are opaque sequential integers. No source code strings, file paths, variable names, or comments appear in any payload. In this case, no LLM was even called.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for anyone writing instruction files
&lt;/h2&gt;

&lt;p&gt;Two clarifications before the takeaways. First, "96% can't be verified" means can't be verified through static analysis, not "is useless." Agent behavior configuration, project context, and workflow documentation all have value. They guide the agent even if no tool can confirm compliance after the fact. Second, the 4% that is verifiable still matters. Excalidraw's 9 extractable rules produced a 66.1% deterministic compliance score with specific failures at specific line numbers. Nine rules doesn't sound like much until three of them fail and you find the agent ignored your naming conventions across 626 files.&lt;/p&gt;

&lt;p&gt;The real problem isn't that instruction files contain documentation. It's that most people don't know which of their lines are enforceable and which are suggestions the agent can silently drop. That ratio isn't fixed, either. People write unverifiable instructions because nobody's told them which phrasings produce checkable rules.&lt;/p&gt;

&lt;p&gt;To write rules that can actually be checked (a quick self-test sketch follows this list):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use imperative verbs with specific targets.&lt;/strong&gt; "Use camelCase for all function names" is verifiable. "Follow good naming conventions" isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specify the tool or pattern, not the principle.&lt;/strong&gt; "Prefer &lt;code&gt;const&lt;/code&gt; over &lt;code&gt;let&lt;/code&gt;" is a ratio check. "Write immutable code" is philosophy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Include the file patterns your rules apply to.&lt;/strong&gt; "All &lt;code&gt;.ts&lt;/code&gt; files must use named exports" scopes the check. "Use named exports" is vague.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep rules and documentation separate.&lt;/strong&gt; Rules are instructions. Documentation explains why. Mixing them dilutes both.&lt;/p&gt;
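
&lt;p&gt;A quick way to self-test a phrasing is RuleProbe's programmatic &lt;code&gt;extractRules&lt;/code&gt;, which works on raw markdown content. A minimal sketch; the exact return shape (and whether the call is synchronous) is an assumption, so just inspect what comes back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { extractRules } from 'ruleprobe';

// Imperative verb with a specific target: should extract as a rule.
const verifiable = extractRules('- Use camelCase for all function names');

// Principle with no mechanical check: should be skipped as unparseable.
const vague = extractRules('- Follow good naming conventions');

// Assumption: plain data you can inspect; shape may differ by version.
console.log({ verifiable, vague });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;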

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/ruleprobe" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;RuleProbe on GitHub: parse your own instruction files and see what's actually verifiable&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool
&lt;/h2&gt;

&lt;p&gt;RuleProbe is the parser and verifier behind this analysis. It reads 7 instruction file formats, extracts machine-verifiable rules using 102 built-in matchers across 14 categories, and checks agent output against each one. Deterministic by default, no API keys needed for the core pipeline. Optional semantic analysis for pattern-matching and consistency rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx ruleprobe parse CLAUDE.md &lt;span class="nt"&gt;--show-unparseable&lt;/span&gt;
npx ruleprobe verify CLAUDE.md ./src &lt;span class="nt"&gt;--format&lt;/span&gt; summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;--show-unparseable&lt;/code&gt; flag shows you exactly which lines were skipped and why. That's often the most useful output: it tells you which of your "rules" aren't rules at all.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;
        ruleprobe
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Verify whether AI coding agents follow the instruction files they're given
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;RuleProbe&lt;/h1&gt;
&lt;/div&gt;


&lt;p&gt;Verify whether AI coding agents actually follow the instruction files they're given&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;Every AI coding agent reads an instruction file. None of them prove they followed it.&lt;/p&gt;

&lt;p&gt;You write &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; with specific rules: camelCase variables, no &lt;code&gt;any&lt;/code&gt; types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.&lt;/p&gt;

&lt;p&gt;RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Compliance scores with file paths and line numbers as evidence. Deterministic and reproducible by default. Optional semantic analysis for pattern-matching and consistency rules that require codebase-aware judgment.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npm install -g ruleprobe&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or run it directly:&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npx ruleprobe --help&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Parse an instruction file&lt;/strong&gt; to see what rules RuleProbe can extract:&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;ruleprobe parse CLAUDE.md
ruleprobe parse AGENTS.md --show-unparseable&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Verify agent output&lt;/strong&gt;…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;






</description>
      <category>ai</category>
      <category>programming</category>
      <category>typescript</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Independent convergence on specification-first AI code verification</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sat, 11 Apr 2026 14:42:02 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/independent-convergence-on-specification-first-ai-code-verification-efj</link>
      <guid>https://dev.to/moonrunnerkc/independent-convergence-on-specification-first-ai-code-verification-efj</guid>
      <description>&lt;p&gt;On March 26, 2026, Christo Zietsman published "The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review" on arXiv.&lt;/p&gt;

&lt;p&gt;Paper: &lt;a href="https://arxiv.org/abs/2603.25773" rel="noopener noreferrer"&gt;arXiv:2603.25773&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper's core argument (direct quote from abstract):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The combined argument implies an architecture: specifications first, deterministic verification pipeline second, AI review only for the structural and architectural residual.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I noticed this because my own open-source project, Swarm Orchestrator, implements a very similar layered approach. I built it from real usage patterns with AI coding agents, not from the paper (neither of us referenced the other's work).&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      CI/CD for AI-generated code. Run Copilot, Claude Code, or Codex in parallel; verify every claim against evidence; gate merges on 8 automated quality checks.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CI/CD for AI-generated code. Run Copilot, Claude Code, or Codex in parallel; verify every claim against evidence; gate merges on 8 automated quality checks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Not an autonomous system builder: an accountability layer around agents you already trust enough to run, but not enough to merge blind. Each step runs on its own isolated branch. Each claim (tests pass, build clean, commit made) is cross-referenced against the transcript and the actual filesystem. Failures are auto-classified, repaired with targeted strategies, and re-verified. Nothing reaches main without passing both the verification engine and the quality gate pipeline. The metric that matters is &lt;strong&gt;cost per rubric point&lt;/strong&gt;, not wall-clock time.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#what-is-this" rel="noopener noreferrer"&gt;What Is This&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#benchmarking" rel="noopener noreferrer"&gt;Benchmarking&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#usage" rel="noopener noreferrer"&gt;Usage&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#github-action" rel="noopener noreferrer"&gt;GitHub Action&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#recipes" rel="noopener noreferrer"&gt;Recipes&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#architecture" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#contributing" rel="noopener noreferrer"&gt;Contributing&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/swarm.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fswarm.png" alt="Swarm Orchestrator TUI dashboard showing parallel agent execution across waves" width="700"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;/div&gt;

&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;See it run end-to-end&lt;/h3&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npm install -g swarm-orchestrator
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; then set up any one of the agent CLIs below, and:&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  How the tool works (current state as of April 2026)
&lt;/h2&gt;

&lt;p&gt;Agents run as untrusted subprocesses on isolated git branches. Acceptance criteria are injected into each agent's prompt before generation.&lt;/p&gt;

&lt;p&gt;After execution, a deterministic verification pipeline checks claims against concrete evidence (commit SHAs, test output, build results, file diffs). No LLM is used as the primary gate.&lt;/p&gt;

&lt;p&gt;Eight configurable quality gates then run: scaffold leftovers, duplicate blocks, hardcoded config, README accuracy, test isolation, test coverage, accessibility, runtime correctness. All are regex/AST/diff/threshold checks.&lt;/p&gt;
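
&lt;p&gt;For a sense of how small these gates can be, here's a hypothetical regex-style gate for hardcoded config, in the same evidence-with-file-and-line spirit (not the orchestrator's actual gate code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { readFileSync } from 'node:fs';

// Patterns a hardcoded-config gate might flag (illustrative only).
const HARDCODED = [
  /https?:\/\/(localhost|127\.0\.0\.1)/,  // baked-in local URLs
  /['"](sk-|AKIA)[A-Za-z0-9]+['"]/,       // API-key-shaped string literals
];

export function hardcodedConfigGate(files: string[]): { pass: boolean; evidence: string[] } {
  const evidence: string[] = [];
  for (const file of files) {
    readFileSync(file, 'utf8').split('\n').forEach((line, i) =&gt; {
      if (HARDCODED.some((re) =&gt; re.test(line))) {
        evidence.push(`${file}:${i + 1}`); // file and line, like the other gates
      }
    });
  }
  return { pass: evidence.length === 0, evidence };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;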

&lt;p&gt;An optional &lt;code&gt;--governance&lt;/code&gt; Critic wave runs after the deterministic layers. It scores steps on weighted axes and pauses for human review on flags. Scores are advisory only.&lt;/p&gt;

&lt;p&gt;Full details and flow: &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;github.com/moonrunnerkc/swarm-orchestrator&lt;/a&gt; (80 stars, 50 passing tests across 95 files, latest release v4.2.0 on April 9).&lt;/p&gt;

&lt;p&gt;The original Copilot-focused version went public on dev.to &lt;a href="https://dev.to/moonrunnerkc/copilot-swarm-orchestrator-oda"&gt;January 25, 2026&lt;/a&gt; with the core isolation + evidence-based verification already present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this alignment matters
&lt;/h2&gt;

&lt;p&gt;Zietsman cites the DORA 2026 report showing that higher AI code generation correlates with higher throughput &lt;em&gt;and&lt;/em&gt; higher instability. Time saved writing code gets re-spent on auditing. His paper argues that simply adding more AI review does not fix the structural issue when there is no external specification layer.&lt;/p&gt;

&lt;p&gt;Swarm Orchestrator was built to address exactly that pattern. The deterministic gates catch the repeatable failure modes (security headers, test depth, config externalization) that standalone agents consistently miss in head-to-head runs. The Critic layer is available only for the residual judgment calls where human or AI insight can still add value.&lt;/p&gt;

&lt;p&gt;I am not claiming this proves or validates the paper. It is simply an independent practical example that landed on closely aligned principles at roughly the same time. If you are working with AI coding agents and wrestling with verification, the repo is open for review, issues, or contributions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;swarm-orchestrator on GitHub&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>devops</category>
    </item>
    <item>
      <title>AI Coding Agents Can Verify Some of Their Work Now. Here's What They Still Miss.</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:40:48 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/ai-coding-agents-can-verify-some-of-their-work-now-heres-what-they-still-miss-58mc</link>
      <guid>https://dev.to/moonrunnerkc/ai-coding-agents-can-verify-some-of-their-work-now-heres-what-they-still-miss-58mc</guid>
      <description>&lt;p&gt;Copilot and Claude Code both ship with verification features now. Copilot's Agent mode runs terminal commands, detects build failures, and iterates fixes. Claude Code plans changes across files and can run your test suite after modifications. Both have improved significantly since 2025.&lt;/p&gt;

&lt;p&gt;They're still not catching everything.&lt;/p&gt;

&lt;p&gt;Developers consistently report agents declaring tasks complete while skipping accessibility attributes, test isolation, config externalization, dark mode, responsive layout, and meta tags. The agent runs the build, sees green, and moves on. But "build passes" and "the output is production-ready" are different bars. The reprompt cycle for quality attributes the agent never attempted in the first place is still a significant time sink on any non-trivial project.&lt;/p&gt;

&lt;p&gt;That gap is where Swarm Orchestrator sits. Not replacing the agent's self-verification, but adding the checks it doesn't run.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;You give it a goal. It builds a dependency-aware plan, assigns steps to specialized agents, and launches them in parallel on isolated git branches. Each step runs through outcome-based verification (build, test, diff, expected files) and eight quality gates covering scaffold leftovers, duplicate code, hardcoded config, README accuracy, test isolation, test coverage, accessibility, and runtime correctness.&lt;/p&gt;

&lt;p&gt;Before the agent runs, the orchestrator injects acceptance criteria based on the project type. For web apps, that's 16 requirements: semantic HTML, responsive layout, dark mode via CSS custom properties, &lt;code&gt;prefers-reduced-motion&lt;/code&gt;, image alt attributes, heading hierarchy, ARIA labels, focus-visible styles, and more. For everything else, 6 baseline criteria covering error handling, documentation, input validation, logging, and test coverage.&lt;/p&gt;

&lt;p&gt;The agent sees these as hard requirements. After execution, the quality gates check whether they were met. The agent's own verification handles "does it compile and do tests pass." The orchestrator handles "did it actually do what was asked, completely."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark context.&lt;/strong&gt; Head-to-head runs against standalone Copilot CLI, Claude Code, and Codex on the same goals showed a consistent pattern: quality attributes the agent never attempted were absent from unassisted output. These aren't build failures the agent would catch on its own. They're requirements like skip-to-content links, &lt;code&gt;prefers-reduced-motion&lt;/code&gt; media queries, CSS custom properties on &lt;code&gt;:root&lt;/code&gt;, dual theme-color meta tags, module separation between logic and presentation, zero-dependency test runners. Each is at least one follow-up prompt. Several take 2-3 rounds.&lt;/p&gt;

&lt;p&gt;The orchestrator caught and enforced all of them in a single pass.&lt;/p&gt;

&lt;p&gt;Steps that fail don't get blindly retried. The orchestrator classifies the failure (build, test, missing artifact, dependency, timeout) and sends the agent back with the actual error output and context. This works alongside the agent's own retry capabilities, not instead of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in v4.2.0
&lt;/h2&gt;

&lt;p&gt;Three additions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Tool Adapters
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;--tool&lt;/code&gt; flag existed in previous versions. It parsed from the CLI, reached the options object, and then did nothing. The orchestrator always spawned Copilot CLI internally regardless of what you passed.&lt;/p&gt;

&lt;p&gt;That's fixed. &lt;code&gt;resolveAdapter()&lt;/code&gt; now routes through real adapter implementations with a shared process supervisor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Add auth"&lt;/span&gt; &lt;span class="nt"&gt;--tool&lt;/span&gt; copilot              &lt;span class="c"&gt;# default, unchanged behavior&lt;/span&gt;
swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Add auth"&lt;/span&gt; &lt;span class="nt"&gt;--tool&lt;/span&gt; claude-code
swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Add auth"&lt;/span&gt; &lt;span class="nt"&gt;--tool&lt;/span&gt; claude-code-teams &lt;span class="nt"&gt;--team-size&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
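
&lt;p&gt;Conceptually, the routing is a registry lookup behind the flag. A sketch with hypothetical names; the &lt;code&gt;spawn&lt;/code&gt; return type is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch of --tool routing; the real resolveAdapter()
// in the swarm-orchestrator repo differs in detail.
interface AgentAdapter {
  name: string;
  spawn(opts: { prompt: string; workdir: string }): Promise&lt;number&gt;; // exit code (assumption)
}

// Adapters register themselves here; registration omitted in this sketch.
const registry = new Map&lt;string, AgentAdapter&gt;(); // copilot, claude-code, codex, ...

function resolveAdapter(tool: string): AgentAdapter {
  const adapter = registry.get(tool);
  if (!adapter) throw new Error(`unknown --tool: ${tool}`);
  return adapter; // every adapter runs under the shared process supervisor
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;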



&lt;p&gt;Agent Teams mode spawns a team lead per wave for native multi-agent coordination. If the team lead fails, it falls back to per-step sequential execution automatically.&lt;/p&gt;

&lt;p&gt;Every adapter shares the same process supervisor: 5-minute stall timeout, 10-second heartbeat checking stdout activity, SIGTERM on stall, SIGKILL after 5-second grace. Previously only the Copilot path had stall detection. A hung &lt;code&gt;claude&lt;/code&gt; process would block your entire run indefinitely. That's gone.&lt;/p&gt;
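
&lt;p&gt;That contract is simple enough to sketch: poll stdout on a heartbeat and escalate from SIGTERM to SIGKILL on stall. Timings mirror the description above; the real supervisor differs in detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { spawn } from 'node:child_process';

const STALL_MS = 5 * 60_000;  // 5-minute stall timeout
const HEARTBEAT_MS = 10_000;  // heartbeat checks stdout every 10 seconds
const GRACE_MS = 5_000;       // SIGKILL after a 5-second grace

export function supervise(cmd: string, args: string[]) {
  const child = spawn(cmd, args);
  let lastOutput = Date.now();
  child.stdout?.on('data', () =&gt; { lastOutput = Date.now(); });

  const heartbeat = setInterval(() =&gt; {
    if (Date.now() - lastOutput &gt; STALL_MS) {
      child.kill('SIGTERM');                              // polite stop first
      setTimeout(() =&gt; child.kill('SIGKILL'), GRACE_MS);  // then force
    }
  }, HEARTBEAT_MS);

  child.on('exit', () =&gt; clearInterval(heartbeat));
  return child;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;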

&lt;h3&gt;
  
  
  OWASP ASI Compliance Mapping
&lt;/h3&gt;

&lt;p&gt;The orchestrator already enforced branch isolation (ASI-03: Excessive Agency), outcome-based verification (ASI-05: Improper Output Handling), and failure-classified repair. Those behaviors map directly to risks in the OWASP Top 10 for Agentic Applications.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--owasp-report&lt;/code&gt; formalizes that mapping. After every run, it generates a per-risk assessment with evidence pulled from actual execution metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Build REST API"&lt;/span&gt; &lt;span class="nt"&gt;--governance&lt;/span&gt; &lt;span class="nt"&gt;--owasp-report&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six of the 10 ASI risks are assessed; four are marked not-applicable with explicit rationale (the orchestrator doesn't store user data, doesn't communicate across networks, doesn't train models). If a risk doesn't apply, the report says so and explains why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which ASI risks are assessed?&lt;/strong&gt;&lt;/p&gt;
  &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ASI Risk&lt;/th&gt;
&lt;th&gt;Assessed&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ASI-01: Prompt Injection&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Agent prompts controlled by orchestrator, user goals parameterized into plan steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-02: Insecure Tool Use&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Tool invocations verified against transcript evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-03: Excessive Agency&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Scope enforcement via isolated worktrees and boundary declarations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-04: Unreliable Execution&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Failure classification, targeted repair, retry with error context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-05: Improper Output Handling&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Build/test/diff verification independent of agent self-reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-10: Uncontrolled Autonomy&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Governance mode with Critic scoring, human-in-the-loop approval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASI-06, 07, 08, 09&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No model training, no data storage, no cross-network communication, no supply chain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Structured Run Reports
&lt;/h3&gt;

&lt;p&gt;Every run already produced artifacts: session state, metrics, cost attribution, per-step verification reports, and now OWASP compliance. Pulling a coherent picture from those files meant opening each one individually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm report runs/my-run-id                             &lt;span class="c"&gt;# generate from any completed run&lt;/span&gt;
swarm report &lt;span class="nt"&gt;--latest&lt;/span&gt; &lt;span class="nt"&gt;--stdout&lt;/span&gt;                          &lt;span class="c"&gt;# most recent run, print to terminal&lt;/span&gt;
swarm report runs/my-run-id &lt;span class="nt"&gt;--format&lt;/span&gt; json               &lt;span class="c"&gt;# JSON only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. Markdown and JSON. Missing sections (cost data, OWASP) are handled gracefully and just don't appear in the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Sits
&lt;/h2&gt;

&lt;p&gt;The agents have gotten better at self-verification. That's a good thing. The orchestrator isn't competing with that. It's adding a layer the agents don't cover: acceptance criteria enforcement, quality gates for attributes agents don't check on their own, independent verification that doesn't rely on the agent's self-reporting, and an auditable trail of everything that happened.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Standalone Agent (2026)&lt;/th&gt;
&lt;th&gt;With Orchestrator&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Build/test verification&lt;/td&gt;
&lt;td&gt;Built-in (Copilot Agent, Claude Code)&lt;/td&gt;
&lt;td&gt;Independent check on isolated branch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality attributes&lt;/td&gt;
&lt;td&gt;Whatever you prompt for&lt;/td&gt;
&lt;td&gt;16 web-app / 6 baseline criteria injected and verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure handling&lt;/td&gt;
&lt;td&gt;Agent retries with some context&lt;/td&gt;
&lt;td&gt;Classified failure, targeted repair prompt with error output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;Chat history, some checkpoints&lt;/td&gt;
&lt;td&gt;Transcripts, verification reports, cost attribution, OWASP compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merge safety&lt;/td&gt;
&lt;td&gt;Agent says it's done&lt;/td&gt;
&lt;td&gt;Proof required across verification + 8 quality gates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;GitHub: moonrunnerkc/swarm-orchestrator&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;TypeScript. ISC license. Requires Node 20+ and Git.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>devops</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Same Instruction File, Same Score, Completely Different Failures</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Tue, 07 Apr 2026 00:24:41 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/same-instruction-file-same-score-completely-different-failures-46fp</link>
      <guid>https://dev.to/moonrunnerkc/same-instruction-file-same-score-completely-different-failures-46fp</guid>
      <description>&lt;p&gt;Two AI coding agents were given the same task with the same 10-rule instruction file. Both scored 70% adherence. Here's the breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Agent A&lt;/th&gt;
&lt;th&gt;Agent B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;camelCase variables&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No &lt;code&gt;any&lt;/code&gt; type&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No console.log&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Named exports only&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max 300 lines&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test files exist&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Agent A had a type safety gap. It used &lt;code&gt;any&lt;/code&gt; for request parameters even though it defined the correct types in its own &lt;code&gt;types.ts&lt;/code&gt; file. Agent B had a structural discipline gap. It used &lt;code&gt;snake_case&lt;/code&gt; for a variable, added a &lt;code&gt;default export&lt;/code&gt; following Express conventions over the project rules, and generated a 338-line file by adding features beyond the task scope.&lt;/p&gt;

&lt;p&gt;Same score. Completely different engineering weaknesses. That table came from &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;RuleProbe&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About this case study.&lt;/strong&gt; The comparison uses simulated agent outputs with deliberate violations, not live agent runs. Raw JSON reports are in the repo under &lt;code&gt;docs/case-study-data/&lt;/code&gt;. This is documented in the &lt;a href="https://github.com/moonrunnerkc/ruleprobe/blob/main/docs/case-study-v0.1.0.md" rel="noopener noreferrer"&gt;case study&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What RuleProbe is
&lt;/h2&gt;

&lt;p&gt;RuleProbe is an open source CLI that reads AI coding agent instruction files and verifies whether the agent's output followed the rules. It covers six formats: &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;copilot-instructions.md&lt;/code&gt;, &lt;code&gt;GEMINI.md&lt;/code&gt;, and &lt;code&gt;.windsurfrules&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Verification is deterministic. No LLM in the pipeline. The same input produces the same report every time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/ruleprobe" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  How it checks
&lt;/h2&gt;

&lt;p&gt;Three methods, depending on the rule:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AST analysis&lt;/strong&gt; via &lt;a href="https://github.com/dsherret/ts-morph" rel="noopener noreferrer"&gt;ts-morph&lt;/a&gt; handles code structure. Variable and function naming (camelCase), type and interface naming (PascalCase), type annotations (&lt;code&gt;any&lt;/code&gt; detection), export style (named vs default), JSDoc presence on public functions, and import patterns (path aliases, deep relative imports).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filesystem inspection&lt;/strong&gt; handles file-level rules. File naming conventions (kebab-case) and whether test files exist for source files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regex&lt;/strong&gt; handles content patterns like max line length.&lt;/p&gt;

&lt;p&gt;v0.1.0 has 15 matchers across those three methods, covering TypeScript and JavaScript. ts-morph is the AST engine, so other languages aren't supported.&lt;/p&gt;
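
&lt;p&gt;To make the AST method concrete, here's a minimal ts-morph sketch in the spirit of the no-&lt;code&gt;any&lt;/code&gt; matcher (not RuleProbe's actual matcher code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { Project, SyntaxKind } from 'ts-morph';

const project = new Project();
project.addSourceFilesAtPaths('src/**/*.ts');

for (const sourceFile of project.getSourceFiles()) {
  // Walk the AST for explicit `any` keywords instead of grepping text.
  for (const node of sourceFile.getDescendantsOfKind(SyntaxKind.AnyKeyword)) {
    // Same evidence format the report uses: file, line, violation.
    console.log(`${sourceFile.getFilePath()}:${node.getStartLineNumber()} - found: any`);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;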

&lt;p&gt;Output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuleProbe Adherence Report
Rules: 14 total | 11 passed | 3 failed | Score: 79%

PASS  naming/naming-camelcase-variables-5
PASS  naming/naming-pascalcase-types-7
FAIL  forbidden-pattern/forbidden-no-any-type-1
      src/handler.ts:12 - found: req: any
      src/handler.ts:24 - found: data: any
FAIL  forbidden-pattern/forbidden-no-console-log-10
      src/handler.ts:18 - found: console.log("handling request")
FAIL  test-requirement/test-files-exist-11
      src/handler.ts - found: no test file found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;File, line, violation. No ambiguity.&lt;/p&gt;


&lt;h2&gt;
  
  
  The conservative parser
&lt;/h2&gt;

&lt;p&gt;This is a design choice worth explaining. When RuleProbe reads an instruction file, it only extracts rules it can map to a deterministic mechanical check. Everything else gets reported as unparseable.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruleprobe parse CLAUDE.md &lt;span class="nt"&gt;--show-unparseable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;"Write clean code" is unparseable. "Use the repository pattern" is unparseable. "Handle errors gracefully" is unparseable. These can't be verified without judgment, and judgment means variance between runs. RuleProbe doesn't do that.&lt;/p&gt;

&lt;p&gt;The tradeoff: a 30-rule instruction file might produce 12 verified rules and 18 unparseable ones. You see both counts so you know exactly what's being checked and what isn't.&lt;/p&gt;


&lt;h2&gt;
  
  
  Running it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx ruleprobe &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Parse an instruction file:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruleprobe parse CLAUDE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extracted 14 rules:

  forbidden-no-any-type-1
    Category: forbidden-pattern
    Verifier: ast
    Pattern:  no-any (*.ts)
    Source:    "- TypeScript strict mode, no any types"

  naming-kebab-case-files-4
    Category: naming
    Verifier: filesystem
    Pattern:  kebab-case (*.ts)
    Source:    "- File names: kebab-case"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;Verify agent output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruleprobe verify CLAUDE.md ./agent-output &lt;span class="nt"&gt;--format&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Supports &lt;code&gt;--format json&lt;/code&gt;, &lt;code&gt;--format markdown&lt;/code&gt;, and &lt;code&gt;--format rdjson&lt;/code&gt; (reviewdog-compatible). Exit code 0 means all rules passed, 1 means violations found.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Compare two agents:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruleprobe compare AGENTS.md ./claude-output ./copilot-output &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agents&lt;/span&gt; claude,copilot &lt;span class="nt"&gt;--format&lt;/span&gt; markdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  CI with the GitHub Action
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuleProbe&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;check-rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/ruleprobe@v0.1.0&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;instruction-file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AGENTS.md&lt;/span&gt;
          &lt;span class="na"&gt;output-dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No external API keys. Posts results as a PR comment. Supports reviewdog rdjson for inline annotations if you use reviewdog in your pipeline. Exposes &lt;code&gt;score&lt;/code&gt;, &lt;code&gt;passed&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, and &lt;code&gt;total&lt;/code&gt; as step outputs, so you can gate merges on adherence thresholds in downstream steps.&lt;/p&gt;
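
&lt;p&gt;If you'd rather gate in your own script than in a downstream step, the JSON report can drive the same decision. A sketch; the &lt;code&gt;score&lt;/code&gt; field name in the JSON output is an assumption, so inspect a real report first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { readFileSync } from 'node:fs';

const THRESHOLD = 80; // minimum adherence percentage to allow a merge

// e.g. produced by: ruleprobe verify AGENTS.md ./src --format json &gt; report.json
const report = JSON.parse(readFileSync('report.json', 'utf8'));

if (report.score &lt; THRESHOLD) {
  console.error(`adherence ${report.score}% is below the ${THRESHOLD}% threshold`);
  process.exit(1); // fail the CI step
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;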

&lt;p&gt;&lt;strong&gt;All action inputs:&lt;/strong&gt;&lt;/p&gt;
  &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instruction-file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(required)&lt;/td&gt;
&lt;td&gt;Path to your instruction file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;output-dir&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Directory of code to verify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ci&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent label for report metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model label for report metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;format&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;text, json, or markdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;severity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;all&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error, warning, or all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fail-on-violation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fail the check if any rule is violated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;post-comment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Post results as a PR comment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reviewdog-format&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Also output rdjson&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;





&lt;h2&gt;
  
  
  Programmatic API
&lt;/h2&gt;

&lt;p&gt;Five functions if you want to integrate verification into your own tooling:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;parseInstructionFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;verifyOutput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;generateReport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;formatReport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;extractRules&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ruleprobe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;parseInstructionFile&lt;/code&gt; reads the instruction file. &lt;code&gt;verifyOutput&lt;/code&gt; runs the rules. &lt;code&gt;generateReport&lt;/code&gt; builds the adherence report with summary stats. &lt;code&gt;formatReport&lt;/code&gt; renders it as text, JSON, markdown, or rdjson. &lt;code&gt;extractRules&lt;/code&gt; works on raw markdown content if you don't have a file path.&lt;/p&gt;
&lt;/blockquote&gt;
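
&lt;p&gt;Chained together, that's the whole pipeline. A sketch; the exact signatures (sync vs. async, option objects) are assumptions, so check the package typings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import {
  parseInstructionFile,
  verifyOutput,
  generateReport,
  formatReport,
} from 'ruleprobe';

async function main() {
  // Assumed shapes: each step feeds the next.
  const rules = await parseInstructionFile('CLAUDE.md');
  const results = await verifyOutput(rules, './agent-output');
  const report = generateReport(results);
  console.log(formatReport(report, 'markdown'));
}

main();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;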


&lt;h2&gt;
  
  
  What it doesn't cover
&lt;/h2&gt;

&lt;p&gt;15 matchers is a starting point, not full coverage. Real instruction files have rules RuleProbe can't verify yet: architectural patterns, error handling conventions, dependency constraints, API design rules. The parser will tell you what it skipped.&lt;/p&gt;

&lt;p&gt;TypeScript and JavaScript only. ts-morph is the AST engine. Other languages would need a different parser.&lt;/p&gt;

&lt;p&gt;No automated agent invocation. You run the agent separately and point RuleProbe at the output directory.&lt;/p&gt;


&lt;h2&gt;
  
  
  Security and dependencies
&lt;/h2&gt;

&lt;p&gt;RuleProbe never executes scanned code, never makes network calls, never writes to the scanned directory. Paths are resolved and bounded to &lt;code&gt;process.cwd()&lt;/code&gt;. Symlinks outside the project are skipped by default.&lt;/p&gt;

&lt;p&gt;Four runtime dependencies: &lt;strong&gt;chalk&lt;/strong&gt; &lt;code&gt;5.6.2&lt;/code&gt;, &lt;strong&gt;commander&lt;/strong&gt; &lt;code&gt;12.1.0&lt;/code&gt;, &lt;strong&gt;glob&lt;/strong&gt; &lt;code&gt;11.1.0&lt;/code&gt;, &lt;strong&gt;ts-morph&lt;/strong&gt; &lt;code&gt;24.0.0&lt;/code&gt;. All pinned to exact versions. No semver ranges.&lt;/p&gt;



&lt;p&gt;npm: &lt;a href="https://www.npmjs.com/package/ruleprobe" rel="noopener noreferrer"&gt;ruleprobe&lt;/a&gt; | MIT license&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;
        ruleprobe
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Verify whether AI coding agents follow the instruction files they're given
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;RuleProbe&lt;/h1&gt;
&lt;/div&gt;


&lt;p&gt;Verify whether AI coding agents actually follow the instruction files they're given&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;Every AI coding agent reads an instruction file. None of them prove they followed it.&lt;/p&gt;

&lt;p&gt;You write &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; with specific rules: camelCase variables, no &lt;code&gt;any&lt;/code&gt; types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.&lt;/p&gt;

&lt;p&gt;RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Compliance scores with file paths and line numbers as evidence. Deterministic and reproducible by default. Optional semantic analysis for pattern-matching and consistency rules that require codebase-aware judgment.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npm install -g ruleprobe&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or run it directly:&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npx ruleprobe --help&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Parse an instruction file&lt;/strong&gt; to see what rules RuleProbe can extract:&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;ruleprobe parse CLAUDE.md
ruleprobe parse AGENTS.md --show-unparseable&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Verify agent output&lt;/strong&gt;…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/ruleprobe" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>typescript</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>AI coding agents lie about their work. Outcome-based verification catches it.</title>
      <dc:creator>Brad Kinnard</dc:creator>
      <pubDate>Sun, 29 Mar 2026 21:59:27 +0000</pubDate>
      <link>https://dev.to/moonrunnerkc/ai-coding-agents-lie-about-their-work-outcome-based-verification-catches-it-12b4</link>
      <guid>https://dev.to/moonrunnerkc/ai-coding-agents-lie-about-their-work-outcome-based-verification-catches-it-12b4</guid>
      <description>&lt;p&gt;AI coding agents have a consistency problem. Ask one to add authentication to your project and it'll tell you it's done. Commits made, tests passing, middleware wired up. Check the branch and you'll find a half-written JWT helper, no tests, and a build that doesn't compile.&lt;/p&gt;

&lt;p&gt;This isn't a hallucination problem. The agent did produce code. It just didn't verify that any of it worked before declaring victory. And neither did the tools sitting between the agent and your main branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The transcript trust problem
&lt;/h2&gt;

&lt;p&gt;Most orchestration tools that coordinate AI agents verify work by reading transcripts. The agent says "committed 3 files" or "all tests passing" and the verifier pattern-matches those strings as evidence of completion.&lt;/p&gt;

&lt;p&gt;That's trusting the agent's self-report.&lt;/p&gt;

&lt;p&gt;The issue isn't that agents are deliberately deceptive. It's that they generate completion language as part of their output pattern regardless of the actual state of the codebase. An agent will write "tests passing" into its response while the test suite has syntax errors. It'll claim files were created that only exist in the prompt's hypothetical, not on disk.&lt;/p&gt;

&lt;p&gt;Transcript parsing catches the obvious failures: agent errored out, produced no output, didn't mention anything about the task. It misses the subtle ones: agent produced code that looks right, described it correctly, but the code doesn't compile, doesn't pass tests, or doesn't do what was asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcome-based verification
&lt;/h2&gt;

&lt;p&gt;The alternative is checking what actually happened instead of what the agent said happened.&lt;/p&gt;

&lt;p&gt;This is what Swarm Orchestrator 4.0 implements. After each agent step runs on its isolated git branch, the verifier executes a series of checks against the branch itself:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Fails when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git_diff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Diffs the branch against the recorded base SHA&lt;/td&gt;
&lt;td&gt;No file changes detected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;build_exec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs the detected build command in the worktree&lt;/td&gt;
&lt;td&gt;Non-zero exit code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_exec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs the detected test command in the worktree&lt;/td&gt;
&lt;td&gt;Non-zero exit code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;file_existence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Checks that expected output files exist&lt;/td&gt;
&lt;td&gt;Expected files missing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transcript&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parses agent output for completion evidence&lt;/td&gt;
&lt;td&gt;(supplementary only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Transcript analysis still runs. But when outcome checks are present, transcript-based checks get demoted to &lt;code&gt;required: false&lt;/code&gt;. The build and test execution results gate the merge decision.&lt;/p&gt;
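
&lt;p&gt;In sketch form, the gating rule is small. The shapes below are assumptions for illustration; the post names the checks but not the verifier's actual types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical shapes; the real verifier's types may differ.
interface CheckResult {
  name: string;      // e.g. "build_exec", "transcript"
  required: boolean;
  passed: boolean;
}

const OUTCOME_CHECKS = new Set(["git_diff", "build_exec", "test_exec", "file_existence"]);

function gateMerge(results: CheckResult[]): boolean {
  const outcomePresent = results.some((r) =&amp;gt; OUTCOME_CHECKS.has(r.name));
  // With outcome checks present, transcript evidence is advisory only.
  const effective = results.map((r) =&amp;gt;
    outcomePresent &amp;amp;&amp;amp; r.name === "transcript" ? { ...r, required: false } : r
  );
  // Only required checks gate the merge.
  return effective.filter((r) =&amp;gt; r.required).every((r) =&amp;gt; r.passed);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
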

&lt;p&gt;Stack detection is automatic. The verifier reads &lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;Makefile&lt;/code&gt;, &lt;code&gt;pyproject.toml&lt;/code&gt;, &lt;code&gt;Cargo.toml&lt;/code&gt;, or whatever project configuration exists and runs the appropriate commands. No per-repo configuration.&lt;/p&gt;
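
&lt;p&gt;Detection reduces to a lookup from marker files to commands. The mapping below is a sketch; the real detector presumably covers more stacks and reads scripts out of the config files themselves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { existsSync } from "node:fs";
import { join } from "node:path";

// Illustrative marker-file mapping, not the verifier's actual table.
const STACKS = [
  { marker: "package.json",   build: "npm run build",    test: "npm test" },
  { marker: "Cargo.toml",     build: "cargo build",      test: "cargo test" },
  { marker: "pyproject.toml", build: "pip install -e .", test: "pytest" },
  { marker: "Makefile",       build: "make",             test: "make test" },
];

// First marker file found in the worktree decides the stack.
function detectStack(workdir: string) {
  return STACKS.find((s) =&amp;gt; existsSync(join(workdir, s.marker))) ?? null;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
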

&lt;h2&gt;
  
  
  What happens when verification fails
&lt;/h2&gt;

&lt;p&gt;Blind retry is the default across most agent tooling. Step fails, same prompt runs again, up to some retry limit. The agent has no idea what went wrong.&lt;/p&gt;

&lt;p&gt;Swarm Orchestrator's RepairAgent takes the structured output from the verification checks and feeds it back into the retry prompt. Which check failed, the last 20 lines of build or test output, which files were expected but aren't there. The failure gets classified (build failure, test failure, missing files, no changes) and the repair strategy adapts to the type.&lt;/p&gt;
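
&lt;p&gt;A sketch of that feedback loop. The failure classes come from the post; the shapes and the prompt format are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type FailureKind = "build_failure" | "test_failure" | "missing_files" | "no_changes";

// Hypothetical shape for a failed verification check.
interface FailedCheck {
  name: string;       // e.g. "build_exec"
  output: string;     // captured command output
  missing?: string[]; // set for file_existence failures
}

function classify(check: FailedCheck): FailureKind {
  switch (check.name) {
    case "build_exec":     return "build_failure";
    case "test_exec":      return "test_failure";
    case "file_existence": return "missing_files";
    default:               return "no_changes";
  }
}

// Fold the evidence into the retry prompt instead of replaying it blind.
function repairPrompt(original: string, check: FailedCheck): string {
  const tail = check.output.split("\n").slice(-20).join("\n");
  return [
    original,
    `The previous attempt failed verification: ${classify(check)}.`,
    check.missing?.length ? `Expected but missing: ${check.missing.join(", ")}` : "",
    `Last 20 lines of output:\n${tail}`,
  ].filter(Boolean).join("\n\n");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
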

&lt;p&gt;On the final attempt the prompt includes an explicit priority shift: get something working over getting something complete.&lt;/p&gt;

&lt;p&gt;The difference between "retry with context" and "blind retry" is measurable. An agent that knows the build failed on a missing import has a realistic path to fixing it. An agent re-running the same prompt that produced a broken build has roughly the same odds of producing another broken build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent-agnostic by design
&lt;/h2&gt;

&lt;p&gt;4.0 drops the hard dependency on Copilot CLI. The adapter layer now supports Copilot CLI, Claude Code, and Codex out of the box. The interface is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AgentAdapter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;workdir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;model&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AgentResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
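
&lt;p&gt;For a sense of what sits behind that interface, here's an adapter sketch. The &lt;code&gt;AgentResult&lt;/code&gt; shape and the CLI invocation are made up; the post shows only the interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { execFile } from "node:child_process";

// Assumed result shape; the post doesn't show AgentResult's fields.
interface AgentResult {
  exitCode: number;
  transcript: string;
}

// Illustrative adapter around a hypothetical `codex exec` invocation.
const codexAdapter: AgentAdapter = {
  name: "codex",
  spawn(opts) {
    return new Promise&amp;lt;AgentResult&amp;gt;((resolve) =&amp;gt; {
      execFile(
        "codex", ["exec", opts.prompt],               // hypothetical CLI args
        { cwd: opts.workdir, timeout: opts.timeout },
        (err, stdout) =&amp;gt; resolve({ exitCode: err ? 1 : 0, transcript: stdout })
      );
    });
  },
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
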


&lt;p&gt;Switching agents is one flag at the CLI level (&lt;code&gt;--tool claude-code&lt;/code&gt;) or a per-step setting in a plan file. The orchestrator treats the agent as an interchangeable subprocess, and verification doesn't change based on which agent ran. The branch either builds or it doesn't.&lt;/p&gt;

&lt;p&gt;This also means you can mix agents within a single plan. Use Claude Code for the architecture step, Codex for the boilerplate, Copilot for the tests. Each step gets verified the same way regardless of which agent produced it.&lt;/p&gt;
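
&lt;p&gt;A mixed-agent plan might look roughly like this, shown as a TypeScript literal. The field names are guesses at the plan-file schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical plan shape; the actual schema may differ.
const plan = {
  goal: "Add JWT auth with tests",
  steps: [
    { id: "design", tool: "claude-code", prompt: "Design the auth module layout" },
    { id: "impl",   tool: "codex",       prompt: "Implement the JWT middleware" },
    { id: "tests",  tool: "copilot",     prompt: "Write unit tests for the middleware" },
  ],
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
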
&lt;h2&gt;
  
  
  CI integration
&lt;/h2&gt;

&lt;p&gt;The tool ships as a GitHub Action:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moonrunnerkc/swarm-orchestrator@swarm-orchestrator&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tests&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;untested&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;modules"&lt;/span&gt;
    &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-code&lt;/span&gt;
    &lt;span class="na"&gt;pr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The Action outputs a JSON result with per-step verification status. You can gate downstream jobs on the verification outcome the same way you'd gate on any other CI check.&lt;/p&gt;
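
&lt;p&gt;The post doesn't publish the result schema, so treat this consumer sketch as illustrative; the field names are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Assumed shape of the Action's JSON result.
interface StepStatus {
  step: string;
  verified: boolean;
  failedChecks: string[];
}

// Gate a downstream job on every step having verified.
function allStepsVerified(result: { steps: StepStatus[] }): boolean {
  return result.steps.every((s) =&amp;gt; s.verified);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
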

&lt;p&gt;Most orchestrators in this space are desktop-first or local-development tools. Even those that support remote execution do not run natively in CI with outcome-verified results. That's the gap Swarm fills.&lt;/p&gt;
&lt;h2&gt;
  
  
  Recipes for repeatable tasks
&lt;/h2&gt;

&lt;p&gt;Generating a plan from scratch for "add tests to this project" is wasteful when the plan structure is the same every time. 4.0 ships with seven parameterized recipes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swarm use add-tests &lt;span class="nt"&gt;--tool&lt;/span&gt; codex &lt;span class="nt"&gt;--param&lt;/span&gt; &lt;span class="nv"&gt;framework&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;vitest
swarm use add-auth &lt;span class="nt"&gt;--param&lt;/span&gt; &lt;span class="nv"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jwt
swarm use security-audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each recipe is a JSON file with &lt;code&gt;{{parameter}}&lt;/code&gt; placeholders. Custom recipes are one file in &lt;code&gt;templates/recipes/&lt;/code&gt;. The knowledge base tracks recipe outcomes across runs so success rates and failure patterns accumulate over time.&lt;/p&gt;
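
&lt;p&gt;Sketched as a TypeScript literal (the real artifact is a JSON file, and this recipe name and its fields are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// templates/recipes/add-logging.json, written out as a literal.
const addLoggingRecipe = {
  name: "add-logging",
  parameters: ["library"],
  steps: [
    { prompt: "Add {{library}} structured logging to every entry point" },
    { prompt: "Add tests asserting log output on error paths" },
  ],
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
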
&lt;h2&gt;
  
  
  Current state
&lt;/h2&gt;

&lt;p&gt;1,112 tests passing, 1 pending. TypeScript strict mode. ISC license. Five phases of upgrades shipped in this release across the adapter layer, verification engine, repair pipeline, CI integration, and recipe system.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/moonrunnerkc" rel="noopener noreferrer"&gt;
        moonrunnerkc
      &lt;/a&gt; / &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;
        swarm-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      CI/CD for AI-generated code. Run Copilot, Claude Code, or Codex in parallel; verify every claim against evidence; gate merges on 8 automated quality checks.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;br&gt;
&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="" width="36" height="36"&gt;&lt;/a&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="" width="52" height="52"&gt;&lt;/a&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="Swarm Orchestrator" width="72" height="72"&gt;&lt;/a&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="" width="52" height="52"&gt;&lt;/a&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/wasp.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fwasp.svg" alt="" width="36" height="36"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Swarm Orchestrator&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CI/CD for AI-generated code. Run Copilot, Claude Code, or Codex in parallel; verify every claim against evidence; gate merges on 8 automated quality checks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Not an autonomous system builder: an accountability layer around agents you already trust enough to run, but not enough to merge blind. Each step runs on its own isolated branch. Each claim (tests pass, build clean, commit made) is cross-referenced against the transcript and the actual filesystem. Failures are auto-classified, repaired with targeted strategies, and re-verified. Nothing reaches main without passing both the verification engine and the quality gate pipeline. The metric that matters is &lt;strong&gt;cost per rubric point&lt;/strong&gt;, not wall-clock time.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/95c61c397ca3825757ec835268e50886b2c10ddc4f0676e1222b19037610927f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4953432d626c75652e737667" alt="License: ISC"&gt;&lt;/a&gt;
  
&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/moonrunnerkc/swarm-orchestrator/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
  
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/64f4f4edbcb5ce478ae77e5187d84186cf323fbea6d76a1c750dd4795ef574a3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d3134393725323070617373696e672d627269676874677265656e2e737667"&gt;&lt;img src="https://camo.githubusercontent.com/64f4f4edbcb5ce478ae77e5187d84186cf323fbea6d76a1c750dd4795ef574a3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d3134393725323070617373696e672d627269676874677265656e2e737667" alt="Tests: 1497 passing"&gt;&lt;/a&gt;
  
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/39e32f87e04dd49db4fa18b3878ad6cb24c09dbaea1af5cfaa8953d61cbdfab4/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d32302532422d677265656e2e737667"&gt;&lt;img src="https://camo.githubusercontent.com/39e32f87e04dd49db4fa18b3878ad6cb24c09dbaea1af5cfaa8953d61cbdfab4/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d32302532422d677265656e2e737667" alt="Node.js 20+"&gt;&lt;/a&gt;
  
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/dc63baa72c8d42e246e791f4e625fa55d7eec24c1332fa5ce0e0d64b459f96c3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f547970655363726970742d352e782d626c75652e737667"&gt;&lt;img src="https://camo.githubusercontent.com/dc63baa72c8d42e246e791f4e625fa55d7eec24c1332fa5ce0e0d64b459f96c3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f547970655363726970742d352e782d626c75652e737667" alt="TypeScript 5.x"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#what-is-this" rel="noopener noreferrer"&gt;What Is This&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#benchmarking" rel="noopener noreferrer"&gt;Benchmarking&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#usage" rel="noopener noreferrer"&gt;Usage&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#github-action" rel="noopener noreferrer"&gt;GitHub Action&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#recipes" rel="noopener noreferrer"&gt;Recipes&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#architecture" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt; · &lt;a href="https://github.com/moonrunnerkc/swarm-orchestrator#contributing" rel="noopener noreferrer"&gt;Contributing&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/moonrunnerkc/swarm-orchestrator/docs/media/swarm.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmoonrunnerkc%2Fswarm-orchestrator%2FHEAD%2Fdocs%2Fmedia%2Fswarm.png" alt="Swarm Orchestrator TUI dashboard showing parallel agent execution across waves" width="700"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;/div&gt;

&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;See it run end-to-end&lt;/h3&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;npm install -g swarm-orchestrator
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; then set up any one of the agent CLIs below, and:&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/moonrunnerkc/swarm-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
