Mike
I Built a Claude Code Plugin That Stops It from Shipping Broken Code

Rubber duck factory with specs, deterministic gate, and bounded review stations

dev-process-toolkit is a Claude Code plugin that forces a repeatable workflow on your AI coding agent: specs as source of truth → TDD → deterministic gate checks → bounded self-review → human approval. Instead of the agent deciding whether its code is correct, the compiler decides.

```shell
/plugin marketplace add nesquikm/dev-process-toolkit
/plugin install dev-process-toolkit@nesquikm-dev-process-toolkit
```

Then run /dev-process-toolkit:setup. It reads your package.json, pubspec.yaml, pyproject.toml (or whatever you have), generates a CLAUDE.md with your actual gate commands, configures tool permissions, and optionally scaffolds spec files.
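The exact output depends on your stack, but as a rough illustration (section names and wording here are assumptions, not the tool's literal output), the generated CLAUDE.md might contain a gate section like:

```markdown
## Gate Commands (generated by setup)

- Typecheck: `npm run typecheck`
- Lint: `npm run lint`
- Test: `npm run test`

All three must exit 0 before self-review begins. Never commit on a failing gate.
```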

Works with any stack that has typecheck/lint/test commands: TypeScript, Flutter/Dart, Python, Rust, Go, Java — same methodology, different compilers. Battle-tested on three production projects: a TypeScript/React web dashboard, a Node/MCP server, and a Flutter retail app.

Why I Built This

AI coding agents are probabilistic systems making deterministic claims. An agent can review its own code, reason about it, and conclude it's correct — while tsc catches three type errors in under a second.

The problem isn't that agents are bad at coding. It's that they're bad at knowing when they're wrong. The same confident reasoning that makes them useful makes them dangerous as their own quality gate. Without external enforcement, the agent evaluates the agent, finds nothing wrong, and ships.

How It Works

The plugin enforces a four-phase cycle. Three layers of defense are baked in — specs constrain what to build, deterministic gates catch what reasoning misses, and bounded review prevents infinite loops.

Phase 1 — Understand. The agent reads the spec (or issue, or task description), extracts every acceptance criterion as a binary pass/fail checklist, and presents a plan. No code yet. If you're using full SDD, specs live in a hierarchy:

```
specs/
├── requirements.md     # WHAT to build (FRs, ACs, NFRs)
├── technical-spec.md   # HOW to build it (architecture, patterns)
├── testing-spec.md     # HOW to test it (conventions, coverage)
└── plan.md             # WHEN to build it (milestones, task order)
```

Spec precedence: requirements > testing > technical > plan. If they contradict each other, the higher one wins. You can also skip specs entirely and use GitHub issues or plain task descriptions — the plugin adapts.
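The precedence rule is simple enough to express in a few lines. A minimal sketch (the function and types are hypothetical, not part of the plugin's API): given the set of specs that make a claim about some conflicting point, return the claim from the highest-precedence spec.

```typescript
// Precedence order from the article: requirements > testing > technical > plan.
const PRECEDENCE = ["requirements", "testing", "technical", "plan"] as const;

// claims maps a spec name to what that spec says about the disputed point.
// The first spec in precedence order that has an opinion wins.
function resolveConflict(claims: Map<string, string>): string | undefined {
  for (const spec of PRECEDENCE) {
    const value = claims.get(spec);
    if (value !== undefined) return value;
  }
  return undefined;
}
```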

Phase 2 — Build (TDD). For each change: write the test first, confirm it fails (RED), implement the minimum code to pass (GREEN), run the full gate check (VERIFY):

```shell
npm run typecheck && npm run lint && npm run test
# or: flutter analyze && flutter test
# or: mypy . && ruff check . && pytest
```

Non-zero exit code = stop. Fix. Re-run. The gate check is deterministic — compiler output overrides the agent's judgment about whether the code "looks correct." This is the single most important constraint. Without it, self-review becomes an echo chamber.
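The whole point of the gate is that it checks exit codes, not opinions. A minimal sketch of that idea (this is an illustration, not the plugin's actual implementation): run each command, and the first non-zero exit status fails the gate.

```typescript
import { spawnSync } from "node:child_process";

// Each entry is [command, ...args]. The gate passes only if every
// command exits 0 — the compiler's verdict, not the agent's.
function runGate(commands: string[][]): { passed: boolean; failedAt?: string } {
  for (const [cmd, ...args] of commands) {
    const result = spawnSync(cmd, args, { stdio: "inherit" });
    if (result.status !== 0) {
      return { passed: false, failedAt: [cmd, ...args].join(" ") };
    }
  }
  return { passed: true };
}
```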

Phase 3 — Self-review (max 2 rounds). The agent walks the AC checklist: ✓ pass, ✗ fail, ⚠ partial. Then audits for logic bugs, edge cases, pattern violations. If round 1 finds problems → fix, re-run gates. If round 2 finds the same issue classes → the agent is going in circles. It stops and escalates to a human instead of burning tokens on round 3.

Why cap at 2? If the agent couldn't fix a category of issue in two passes, more context-identical attempts won't either. Same principle from Your Agents Run Forever — the kill switch must be deterministic code, not a prompt asking "should we continue?"
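The escalation decision itself can be deterministic code. A sketch of the logic described above (hypothetical names, not the plugin's source): stop when the round cap is hit, or when an issue class from the previous round reappears.

```typescript
// Decide whether to escalate to a human instead of running another
// self-review round. issues/prevIssues are the classes of problems
// found in the current and previous rounds (e.g. "logic", "edge-case").
function shouldEscalate(
  round: number,
  issues: Set<string>,
  prevIssues: Set<string>,
  maxRounds = 2,
): boolean {
  if (issues.size === 0) return false; // clean review: done, no escalation
  if (round >= maxRounds) return true; // hard cap: never a round 3
  // A repeated issue class means the agent is circling, not converging.
  return [...issues].some((cls) => prevIssues.has(cls));
}
```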

Phase 4 — Report. AC checklist with pass/fail status, files changed, test coverage, gate results, any spec deviations. The agent never commits without a human saying "go ahead."

SPEC_DEVIATION Markers

When reality disagrees with the spec, the agent doesn't silently diverge — it drops a marker:

```typescript
// SPEC_DEVIATION: Using client-side filtering instead of server-side
// Reason: All data is already in memory from the mock generator
```

The self-review phase collects these and surfaces them in the report. You see exactly where and why the implementation diverged from the plan.
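Collecting the markers is mechanical — a scan for the `SPEC_DEVIATION:` token. A minimal sketch of what that collection step could look like (illustrative only, not the plugin's code):

```typescript
type Deviation = { line: number; text: string };

// Scan one file's source for SPEC_DEVIATION comments so the review
// report can list every divergence with its line number.
function collectDeviations(source: string): Deviation[] {
  return source
    .split("\n")
    .map((text, i) => ({ line: i + 1, text: text.trim() }))
    .filter((d) => d.text.includes("SPEC_DEVIATION:"));
}
```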

What You Get

| Command | What it does |
| --- | --- |
| `setup` | Detect stack, scaffold process, generate CLAUDE.md |
| `implement` | Full cycle: understand → TDD → review → report |
| `tdd` | RED → GREEN → VERIFY cycle |
| `gate-check` | Run typecheck + lint + test, report pass/fail |
| `spec-write` | Guided spec authoring (requirements → technical → testing → plan) |
| `spec-review` | Audit code against spec requirements |
| `simplify` | Code quality cleanup on changed files |
| `pr` | Pull request creation |
Plus two specialist agents (code-reviewer, test-writer) that Claude spawns as subagents when needed. All commands are namespaced under /dev-process-toolkit: (e.g., /dev-process-toolkit:implement).

Start small: install, run setup, then try gate-check on your repo. That alone — making the agent run your compiler/linter/tests as a non-negotiable phase — fixes the most common failure mode. Add tdd and implement when you want the full workflow.

Why a Plugin Beats a Prompt

You can tell Claude Code "always run tests before committing." But it will forget, or decide the tests aren't relevant, or skip them because it's "confident." A plugin encodes the workflow as structured phases with hard gates. The agent can't skip Phase 2 to get to Phase 4. The gate check runs real commands and checks real exit codes — not "does the agent think the tests passed."

If you'd rather not install a plugin, you can also copy the skills and agents manually into your project's .claude/ directory — the adaptation guide walks through it step by step.


The bounded loop and convergence detection patterns come from Your Agents Run Forever. The contract testing approach is covered in I Test My Agents Like Distributed Systems.
