dev-process-toolkit is a Claude Code plugin that forces a repeatable workflow on your AI coding agent: specs as source of truth → TDD → deterministic gate checks → bounded self-review → human approval. Instead of the agent deciding whether its code is correct, the compiler decides.
/plugin marketplace add nesquikm/dev-process-toolkit
/plugin install dev-process-toolkit@nesquikm-dev-process-toolkit
Then run /dev-process-toolkit:setup. It reads your package.json, pubspec.yaml, pyproject.toml (or whatever you have), generates a CLAUDE.md with your actual gate commands, configures tool permissions, and optionally scaffolds spec files.
Works with any stack that has typecheck/lint/test commands: TypeScript, Flutter/Dart, Python, Rust, Go, Java — same methodology, different compilers. Battle-tested on three production projects: a TypeScript/React web dashboard, a Node/MCP server, and a Flutter retail app.
Why I Built This
AI coding agents are probabilistic systems making deterministic claims. An agent can review its own code, reason about it, and conclude it's correct — while tsc catches three type errors in under a second.
The problem isn't that agents are bad at coding. It's that they're bad at knowing when they're wrong. The same confident reasoning that makes them useful makes them dangerous as their own quality gate. Without external enforcement, the agent evaluates the agent, finds nothing wrong, and ships.
How It Works
The plugin enforces a four-phase cycle. Three layers of defense are baked in — specs constrain what to build, deterministic gates catch what reasoning misses, and bounded review prevents infinite loops.
Phase 1 — Understand. The agent reads the spec (or issue, or task description), extracts every acceptance criterion as a binary pass/fail checklist, and presents a plan. No code yet. If you're using full SDD, specs live in a hierarchy:
specs/
├── requirements.md # WHAT to build (FRs, ACs, NFRs)
├── technical-spec.md # HOW to build it (architecture, patterns)
├── testing-spec.md # HOW to test it (conventions, coverage)
└── plan.md # WHEN to build it (milestones, task order)
Spec precedence: requirements > testing > technical > plan. If they contradict each other, the higher one wins. You can also skip specs entirely and use GitHub issues or plain task descriptions — the plugin adapts.
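The precedence rule above can be sketched as a tiny resolver. This is a hypothetical illustration, not the plugin's internals — the names and function are mine:

```typescript
// Assumed sketch: given spec files that disagree, the one earlier in the
// precedence list wins (requirements > testing > technical > plan).
const SPEC_PRECEDENCE = [
  "requirements.md",
  "testing-spec.md",
  "technical-spec.md",
  "plan.md",
];

function winningSpec(conflicting: string[]): string {
  // Lower index = higher precedence; sort and take the top-ranked file.
  return [...conflicting].sort(
    (a, b) => SPEC_PRECEDENCE.indexOf(a) - SPEC_PRECEDENCE.indexOf(b),
  )[0];
}
```

So if `plan.md` and `requirements.md` contradict each other, `winningSpec(["plan.md", "requirements.md"])` picks `requirements.md`.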
Phase 2 — Build (TDD). For each change: write the test first, confirm it fails (RED), implement the minimum code to pass (GREEN), run the full gate check (VERIFY):
npm run typecheck && npm run lint && npm run test
# or: flutter analyze && flutter test
# or: mypy . && ruff check . && pytest
Non-zero exit code = stop. Fix. Re-run. The gate check is deterministic — compiler output overrides the agent's judgment about whether the code "looks correct." This is the single most important constraint. Without it, self-review becomes an echo chamber.
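The core of the gate is just "run real commands, check real exit codes." A minimal sketch in Node (the command list is illustrative — `setup` generates the real one from your project config):

```typescript
import { spawnSync } from "node:child_process";

// Deterministic gate check: run each command through the shell and stop at
// the first non-zero exit code. No interpretation, no judgment calls.
function gateCheck(commands: string[]): boolean {
  for (const cmd of commands) {
    const result = spawnSync(cmd, { shell: true, stdio: "inherit" });
    if (result.status !== 0) {
      console.error(`GATE FAILED: ${cmd} (exit ${result.status})`);
      return false; // the exit code decides, not the agent's confidence
    }
  }
  return true;
}
```

Usage: `gateCheck(["npm run typecheck", "npm run lint", "npm run test"])` — one `false` anywhere and the phase halts.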
Phase 3 — Self-review (max 2 rounds). The agent walks the AC checklist: ✓ pass, ✗ fail, ⚠ partial. Then audits for logic bugs, edge cases, pattern violations. If round 1 finds problems → fix, re-run gates. If round 2 finds the same issue classes → the agent is going in circles. It stops and escalates to a human instead of burning tokens on round 3.
Why cap at 2? If the agent couldn't fix a category of issue in two passes, more context-identical attempts won't either. Same principle from Your Agents Run Forever — the kill switch must be deterministic code, not a prompt asking "should we continue?"
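The bounded loop with convergence detection can be sketched like this. The interface is assumed — `reviewOnce` is a hypothetical callback standing in for one review pass, returning the classes of issues it found:

```typescript
// Sketch of the bounded-review kill switch (assumed interface, not the
// plugin's actual code). MAX_ROUNDS is deterministic code, not a prompt.
const MAX_ROUNDS = 2;

function boundedReview(
  reviewOnce: (round: number) => string[], // e.g. ["logic-bug", "edge-case"]
): "clean" | "escalate" {
  let previous = new Set<string>();
  for (let round = 1; round <= MAX_ROUNDS; round++) {
    const issues = reviewOnce(round);
    if (issues.length === 0) return "clean"; // nothing left to fix
    // Convergence check: a repeated issue class means we're going in circles.
    if (round > 1 && issues.some((i) => previous.has(i))) return "escalate";
    previous = new Set(issues);
  }
  return "escalate"; // hard cap hit — hand off to a human
}
```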
Phase 4 — Report. AC checklist with pass/fail status, files changed, test coverage, gate results, any spec deviations. The agent never commits without a human saying "go ahead."
SPEC_DEVIATION Markers
When reality disagrees with the spec, the agent doesn't silently diverge — it drops a marker:
// SPEC_DEVIATION: Using client-side filtering instead of server-side
// Reason: All data is already in memory from the mock generator
The self-review phase collects these and surfaces them in the report. You see exactly where and why the implementation diverged from the plan.
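Collecting the markers is straightforward text scanning. A sketch under assumed shapes (the plugin's actual report format may differ):

```typescript
import { readFileSync } from "node:fs";

// Scan changed files for SPEC_DEVIATION markers so self-review can surface
// them in the report, with file and line for each one.
function collectDeviations(
  files: string[],
): { file: string; line: number; text: string }[] {
  const found: { file: string; line: number; text: string }[] = [];
  for (const file of files) {
    readFileSync(file, "utf8")
      .split("\n")
      .forEach((text, i) => {
        if (text.includes("SPEC_DEVIATION")) {
          found.push({ file, line: i + 1, text: text.trim() });
        }
      });
  }
  return found;
}
```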
What You Get
| Command | What it does |
|---|---|
| `setup` | Detect stack, scaffold process, generate CLAUDE.md |
| `implement` | Full cycle: understand → TDD → review → report |
| `tdd` | RED → GREEN → VERIFY cycle |
| `gate-check` | Run typecheck + lint + test, report pass/fail |
| `spec-write` | Guided spec authoring (requirements → technical → testing → plan) |
| `spec-review` | Audit code against spec requirements |
| `simplify` | Code quality cleanup on changed files |
| `pr` | Pull request creation |
Plus two specialist agents (code-reviewer, test-writer) that Claude spawns as subagents when needed. All commands are namespaced under /dev-process-toolkit: (e.g., /dev-process-toolkit:implement).
Start small: install, run setup, then try gate-check on your repo. That alone — making the agent run your compiler/linter/tests as a non-negotiable phase — fixes the most common failure mode. Add tdd and implement when you want the full workflow.
Why a Plugin Beats a Prompt
You can tell Claude Code "always run tests before committing." But it will forget, or decide the tests aren't relevant, or skip them because it's "confident." A plugin encodes the workflow as structured phases with hard gates. The agent can't skip Phase 2 to get to Phase 4. The gate check runs real commands and checks real exit codes — not "does the agent think the tests passed."
If you'd rather not install a plugin, you can also copy the skills and agents manually into your project's .claude/ directory — the adaptation guide walks through it step by step.
The bounded loop and convergence detection patterns come from Your Agents Run Forever. The contract testing approach is covered in I Test My Agents Like Distributed Systems.