Mike
I Built a Claude Code Plugin That Stops It from Shipping Broken Code

Rubber duck factory with specs, deterministic gate, and bounded review stations

dev-process-toolkit is a Claude Code plugin that forces a repeatable workflow on your AI coding agent: specs as source of truth → TDD → deterministic gate checks → bounded self-review → human approval. Instead of the agent deciding whether its code is correct, the compiler decides.

```shell
/plugin marketplace add nesquikm/dev-process-toolkit
/plugin install dev-process-toolkit@nesquikm-dev-process-toolkit
```

Then run /dev-process-toolkit:setup. It reads your package.json, pubspec.yaml, pyproject.toml (or whatever you have), generates a CLAUDE.md with your actual gate commands, configures tool permissions, and optionally scaffolds spec files.
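The exact output depends on your stack, but as a rough illustration (section names and wording here are assumptions, not the tool's literal output), the generated CLAUDE.md might contain a gate section like:

```markdown
## Gate Commands (generated by setup)

- Typecheck: `npm run typecheck`
- Lint: `npm run lint`
- Test: `npm run test`

All three must exit 0 before self-review begins. Never commit on a failing gate.
```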

Works with any stack that has typecheck/lint/test commands: TypeScript, Flutter/Dart, Python, Rust, Go, Java — same methodology, different compilers. Battle-tested on three production projects: a TypeScript/React web dashboard, a Node/MCP server, and a Flutter retail app.

Why I Built This

AI coding agents are probabilistic systems making deterministic claims. An agent can review its own code, reason about it, and conclude it's correct — while tsc catches three type errors in under a second.

The problem isn't that agents are bad at coding. It's that they're bad at knowing when they're wrong. The same confident reasoning that makes them useful makes them dangerous as their own quality gate. Without external enforcement, the agent evaluates the agent, finds nothing wrong, and ships.

How It Works

The plugin enforces a four-phase cycle. Three layers of defense are baked in — specs constrain what to build, deterministic gates catch what reasoning misses, and bounded review prevents infinite loops.

Phase 1 — Understand. The agent reads the spec (or issue, or task description), extracts every acceptance criterion as a binary pass/fail checklist, and presents a plan. No code yet. If you're using full SDD, specs live in a hierarchy:

```
specs/
├── requirements.md     # WHAT to build (FRs, ACs, NFRs)
├── technical-spec.md   # HOW to build it (architecture, patterns)
├── testing-spec.md     # HOW to test it (conventions, coverage)
└── plan.md             # WHEN to build it (milestones, task order)
```

Spec precedence: requirements > testing > technical > plan. If they contradict each other, the higher one wins. You can also skip specs entirely and use GitHub issues or plain task descriptions — the plugin adapts.
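The precedence rule is simple enough to express in a few lines. A minimal sketch (the function and types are hypothetical, not part of the plugin's API): given the set of specs that make a claim about some conflicting point, return the claim from the highest-precedence spec.

```typescript
// Precedence order from the article: requirements > testing > technical > plan.
const PRECEDENCE = ["requirements", "testing", "technical", "plan"] as const;

// claims maps a spec name to what that spec says about the disputed point.
// The first spec in precedence order that has an opinion wins.
function resolveConflict(claims: Map<string, string>): string | undefined {
  for (const spec of PRECEDENCE) {
    const value = claims.get(spec);
    if (value !== undefined) return value;
  }
  return undefined;
}
```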

Phase 2 — Build (TDD). For each change: write the test first, confirm it fails (RED), implement the minimum code to pass (GREEN), run the full gate check (VERIFY):

```shell
npm run typecheck && npm run lint && npm run test
# or: flutter analyze && flutter test
# or: mypy . && ruff check . && pytest
```

Non-zero exit code = stop. Fix. Re-run. The gate check is deterministic — compiler output overrides the agent's judgment about whether the code "looks correct." This is the single most important constraint. Without it, self-review becomes an echo chamber.
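The whole point of the gate is that it checks exit codes, not opinions. A minimal sketch of that idea (this is an illustration, not the plugin's actual implementation): run each command, and the first non-zero exit status fails the gate.

```typescript
import { spawnSync } from "node:child_process";

// Each entry is [command, ...args]. The gate passes only if every
// command exits 0 — the compiler's verdict, not the agent's.
function runGate(commands: string[][]): { passed: boolean; failedAt?: string } {
  for (const [cmd, ...args] of commands) {
    const result = spawnSync(cmd, args, { stdio: "inherit" });
    if (result.status !== 0) {
      return { passed: false, failedAt: [cmd, ...args].join(" ") };
    }
  }
  return { passed: true };
}
```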

Phase 3 — Self-review (max 2 rounds). The agent walks the AC checklist: ✓ pass, ✗ fail, ⚠ partial. Then audits for logic bugs, edge cases, pattern violations. If round 1 finds problems → fix, re-run gates. If round 2 finds the same issue classes → the agent is going in circles. It stops and escalates to a human instead of burning tokens on round 3.

Why cap at 2? If the agent couldn't fix a category of issue in two passes, more context-identical attempts won't either. Same principle from Your Agents Run Forever — the kill switch must be deterministic code, not a prompt asking "should we continue?"
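The escalation decision itself can be deterministic code. A sketch of the logic described above (hypothetical names, not the plugin's source): stop when the round cap is hit, or when an issue class from the previous round reappears.

```typescript
// Decide whether to escalate to a human instead of running another
// self-review round. issues/prevIssues are the classes of problems
// found in the current and previous rounds (e.g. "logic", "edge-case").
function shouldEscalate(
  round: number,
  issues: Set<string>,
  prevIssues: Set<string>,
  maxRounds = 2,
): boolean {
  if (issues.size === 0) return false; // clean review: done, no escalation
  if (round >= maxRounds) return true; // hard cap: never a round 3
  // A repeated issue class means the agent is circling, not converging.
  return [...issues].some((cls) => prevIssues.has(cls));
}
```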

Phase 4 — Report. AC checklist with pass/fail status, files changed, test coverage, gate results, any spec deviations. The agent never commits without a human saying "go ahead."

SPEC_DEVIATION Markers

When reality disagrees with the spec, the agent doesn't silently diverge — it drops a marker:

```typescript
// SPEC_DEVIATION: Using client-side filtering instead of server-side
// Reason: All data is already in memory from the mock generator
```

The self-review phase collects these and surfaces them in the report. You see exactly where and why the implementation diverged from the plan.
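Collecting the markers is mechanical — a scan for the `SPEC_DEVIATION:` token. A minimal sketch of what that collection step could look like (illustrative only, not the plugin's code):

```typescript
type Deviation = { line: number; text: string };

// Scan one file's source for SPEC_DEVIATION comments so the review
// report can list every divergence with its line number.
function collectDeviations(source: string): Deviation[] {
  return source
    .split("\n")
    .map((text, i) => ({ line: i + 1, text: text.trim() }))
    .filter((d) => d.text.includes("SPEC_DEVIATION:"));
}
```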

What You Get

| Command | What it does |
| --- | --- |
| `setup` | Detect stack, scaffold process, generate CLAUDE.md |
| `implement` | Full cycle: understand → TDD → review → report |
| `tdd` | RED → GREEN → VERIFY cycle |
| `gate-check` | Run typecheck + lint + test, report pass/fail |
| `spec-write` | Guided spec authoring (requirements → technical → testing → plan) |
| `spec-review` | Audit code against spec requirements |
| `simplify` | Code quality cleanup on changed files |
| `pr` | Pull request creation |
Plus two specialist agents (code-reviewer, test-writer) that Claude spawns as subagents when needed. All commands are namespaced under /dev-process-toolkit: (e.g., /dev-process-toolkit:implement).

Start small: install, run setup, then try gate-check on your repo. That alone — making the agent run your compiler/linter/tests as a non-negotiable phase — fixes the most common failure mode. Add tdd and implement when you want the full workflow.

Why a Plugin Beats a Prompt

You can tell Claude Code "always run tests before committing." But it will forget, or decide the tests aren't relevant, or skip them because it's "confident." A plugin encodes the workflow as structured phases with hard gates. The agent can't skip Phase 2 to get to Phase 4. The gate check runs real commands and checks real exit codes — not "does the agent think the tests passed."

If you'd rather not install a plugin, you can also copy the skills and agents manually into your project's .claude/ directory — the adaptation guide walks through it step by step.


The bounded loop and convergence detection patterns come from Your Agents Run Forever. The contract testing approach is covered in I Test My Agents Like Distributed Systems.
