Make your AI code review refuse to fake a PASS

dualform — Sun, 14 Jun 2026 04:22:29 +0000

When you ask a coding agent to "review this change," you often get a confident PASS — one that skipped axes, cited no evidence, and never ran the tests. A green light you can't trust is worse than no review: it tells you to ship.

review-audit is a small Claude Code skill built around one rule: an axis is "audited" only when it can show evidence.

How it works

A single, read-only pass over your change across six axes:

correctness — boundary / null / type / timezone mistakes, with a reproducing input where possible
wiring (anti-Potemkin) — new code that's defined but never imported or called is dead, not done; it greps the call sites and counts references
security — secrets, command/SQL injection, unsafe eval or deserialize
test efficacy — tests that would pass even against an unimplemented stub (tautologies, no asserts, happy-path only)
spec compliance — when a spec exists, it checks the acceptance criteria
regression — it actually runs the tests/build and reports the exit code

For each axis it writes down how it checked: a file:line, a grep result, a command and its exit code. "I didn't check this" is a first-class output, not a silent gap. And an unexamined axis cannot be part of a PASS.

Read-only, by design

It proposes fixes but does not apply them. A before/after checksum confirms your tree was not touched — the audit can't quietly "fix" something and then call it reviewed.

Why evidence matters

The failure mode of AI review isn't being wrong — it's being plausibly right. "Looks fine" reads like a pass. So the skill won't PASS regression or wiring on "looks fine": regression needs a real run with an exit code; wiring needs concrete grep or file:line. No evidence, no PASS on that axis.

It runs in the calling agent's own context — no sub-agent fan-out — so it stays cheap enough to run on every change. When one pass genuinely isn't enough (a release gate, a high-risk change), it tells you, in its own output, to escalate.

Try it

git clone https://github.com/dualform-labs/review-audit.git
cp -r review-audit/skills/review-audit ~/.claude/skills/

Then give Claude Code a change that looks fine but has an unverified path — say, a function that's defined but never called — and run /review-audit. Watch it flag the wiring, list each axis as audited / partial / not-audited, and refuse to PASS what it didn't verify.

One prompt file, no dependencies, Apache-2.0. No network calls, no telemetry, no bypass-permissions.

Companion skill: spec — decide the build before the agent writes code.

— a dualform project

Two tiny Claude Code skills that fixed my two biggest agent problems

dualform — Sun, 14 Jun 2026 03:57:16 +0000

Two open-source skills for Claude Code. Each is a single prompt file, Apache-2.0, no dependencies. Repos at the bottom.

Working with a coding agent, I kept hitting the same two failure modes. Not "the model can't write code" — it writes code fine. The failures were upstream and downstream of the code: the agent guessing on an ambiguous task, and me trusting a review that hadn't actually checked anything.

So I built one small skill for each. Here's what they do and why they're shaped the way they are.

Problem 1: agents guess, then you redo the work

Hand a vague task to an agent and you watch the same thing happen. It guesses. It drifts. It quietly makes a call you'd have made differently — and you find out after the code is written, when changing your mind is expensive. The cost isn't the typing. It's the rework.

spec moves the decisions to the front. You run /spec <one-line idea>, it reads your repo and the conversation, then asks only what it couldn't infer — a short batch of multiple-choice questions, each with a recommended default you can reject in a tap. The answers go into one spec file with a fixed 13-section template. Then it builds straight through, pausing only when a choice is genuinely yours to make.

Two things keep it honest:

A self-check before you see anything. The draft is handed to a fresh-context sub-agent with only the spec file, asked "could you build this as-is?" If something's still ambiguous, that section gets fixed before you're shown the spec. (No sub-agents available? It falls back to a self-audit that has to quote a grounding line per section.)
Anti-Potemkin completion. Every acceptance criterion must be an executable command or a numbered visual check — not "file exists" or "imports OK." A feature is done only after it ran on a real case and the output was shown.

The whole skill is one prompt file (SKILL.md). No build, no dependencies.

Problem 2: "review this" returns a PASS that checked nothing

Ask an AI to "review this change" and you can get a confident, plausible PASS — that skipped half the checks, cited no evidence, and never ran the tests. A green light you can't trust is worse than no review.

review-audit is a read-only, single-pass audit over your change across six axes: correctness, wiring (built-but-never-called / dead code), security, test efficacy, spec compliance, and regression. The discipline is simple and strict:

An axis is marked audited only when the report shows concrete file:line + grep/run evidence. "I didn't check this" is a first-class output, not a silent gap.
An unexamined axis cannot be part of a PASS.
It's read-only: it proposes fixes but doesn't apply them, and a before/after checksum confirms your tree wasn't touched.
Regression needs a real run with an exit code; wiring needs concrete grep / file:line. "Looks fine" isn't evidence.

Install

Both are one prompt file each.

# spec
git clone https://github.com/dualform-labs/spec-skill.git
cp -r spec-skill/skills/spec ~/.claude/skills/

# review-audit
git clone https://github.com/dualform-labs/review-audit.git
cp -r review-audit/skills/review-audit ~/.claude/skills/

Then in Claude Code: /spec a menu-bar app that warns me when my Mac is thermally throttled, or /review-audit on a change before you call it done. Output language is auto / ja / en.

No network calls (Claude Code only), no telemetry, no bypass-permissions.

Honest notes

These are prompt-file skills, not magic. Single-pass detection in review-audit is model-dependent; if you need per-run proof of detection power or fresh-context adversarial verification, that's the heavier review-audit-pro tier (coming soon). And spec won't make a bad idea good — it just makes the decisions explicit before code is written, where they're cheap to change.

If you try them, I'd genuinely like to hear where they break or annoy you.

spec: https://github.com/dualform-labs/spec-skill
review-audit: https://github.com/dualform-labs/review-audit
dualformai.com/spec-skill · dualformai.com/review-audit