<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Slim</title>
    <description>The latest articles on DEV Community by Slim (@slimd).</description>
    <link>https://dev.to/slimd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3770815%2Fcddf2fbb-9fbf-4bf7-a9f0-33b42491ed58.png</url>
      <title>DEV Community: Slim</title>
      <link>https://dev.to/slimd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/slimd"/>
    <language>en</language>
    <item>
      <title>Beyond "Vibe Coding": Bringing SDD and PDCA to Claude Code</title>
      <dc:creator>Slim</dc:creator>
      <pubDate>Fri, 13 Feb 2026 12:01:10 +0000</pubDate>
      <link>https://dev.to/slimd/beyond-vibe-coding-bringing-sdd-and-pdca-to-claude-code-4jn1</link>
      <guid>https://dev.to/slimd/beyond-vibe-coding-bringing-sdd-and-pdca-to-claude-code-4jn1</guid>
      <description>&lt;p&gt;Let's be real for a second. We've all been there with AI coding tools (Claude Code, Cursor, Windsurf, etc.).&lt;/p&gt;

&lt;p&gt;You ask for a feature. The AI writes code. The tests fail. The AI says &lt;em&gt;"Apologies, I'll fix that."&lt;/em&gt; It churns for 30 seconds, and suddenly—green lights! 🎉&lt;/p&gt;

&lt;p&gt;You feel great until you check the diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI didn't fix the code. It deleted the assertion.&lt;/strong&gt; Or worse, it rewrote the test to match the broken behavior.&lt;/p&gt;

&lt;p&gt;AI models optimize for &lt;em&gt;completion&lt;/em&gt;, not &lt;em&gt;correctness&lt;/em&gt;. They crave the dopamine hit of a passing test suite, and they don't care how they get there.&lt;/p&gt;

&lt;p&gt;I spent three months fighting this "cheating" behavior. I tried elaborate prompts, threats in system instructions, and complex frameworks. Nothing stuck reliably.&lt;/p&gt;

&lt;p&gt;So, I built &lt;strong&gt;PactKit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It’s not a heavy SDK. It’s not a wrapper API. It’s a &lt;strong&gt;Prompt-as-Code infrastructure&lt;/strong&gt; that turns Claude Code into a disciplined Senior Developer.&lt;/p&gt;

&lt;p&gt;Here is why it’s different, and why I believe it solves the context problem better than the alternatives.&lt;/p&gt;

&lt;h2&gt;The Hierarchy of Truth (SDD)&lt;/h2&gt;

&lt;p&gt;The root cause of AI hallucinations in coding is a lack of structured "Truth." If the code, the test, and the user prompt are all equal citizens in the context window, the AI will change whichever one is easiest to change.&lt;/p&gt;

&lt;p&gt;PactKit enforces a rigid &lt;strong&gt;Spec-Driven Development (SDD)&lt;/strong&gt; hierarchy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; 🥇 &lt;strong&gt;Tier 1: Specification (&lt;code&gt;docs/specs/*.md&lt;/code&gt;)&lt;/strong&gt; — The Law. Immutable during the coding phase.&lt;/li&gt;
&lt;li&gt; 🥈 &lt;strong&gt;Tier 2: Tests&lt;/strong&gt; — The Verification. Can only be changed if they conflict with Tier 1.&lt;/li&gt;
&lt;li&gt; 🥉 &lt;strong&gt;Tier 3: Implementation&lt;/strong&gt; — The Reality. Must yield to Tier 1 and Tier 2.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you run &lt;code&gt;pactkit init&lt;/code&gt;, it injects this "Constitution" into Claude. Now, when a test fails, the agent knows it is &lt;strong&gt;forbidden&lt;/strong&gt; to lower the bar. It &lt;em&gt;must&lt;/em&gt; climb the hill and fix the implementation.&lt;/p&gt;
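&lt;p&gt;As a rough mental model, the hierarchy can be sketched as plain data. This is an illustration only (&lt;code&gt;TIERS&lt;/code&gt; and &lt;code&gt;must_change&lt;/code&gt; are invented names, not PactKit's actual API):&lt;/p&gt;

```python
# Hypothetical sketch of the SDD "Hierarchy of Truth" as data.
# Lower tier number means higher authority; when two artifacts
# disagree, the lower-authority one is the one that must change.
TIERS = {"spec": 1, "tests": 2, "implementation": 3}

def must_change(artifact_a, artifact_b):
    """Return the artifact with lower authority (larger tier number)."""
    return max((artifact_a, artifact_b), key=TIERS.get)
```

&lt;p&gt;Under this rule, a failing test against a valid spec always points the agent at the implementation, never at the test.&lt;/p&gt;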

&lt;h2&gt;Strict TDD &amp;amp; The PDCA Loop&lt;/h2&gt;

&lt;p&gt;"Vibe coding" (coding by feeling) is fun, but it creates technical debt at mach speed. PactKit forces a &lt;strong&gt;PDCA (Plan-Do-Check-Act)&lt;/strong&gt; workflow that feels like a seatbelt.&lt;/p&gt;

&lt;p&gt;You don't just say "make a login page." You run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/project-sprint &lt;span class="s2"&gt;"Implement OAuth2 Login"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system forces the agent through four distinct phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Plan&lt;/strong&gt;: The &lt;em&gt;Architect Agent&lt;/em&gt; analyzes your codebase structure (generating a Mermaid graph) and writes a Markdown spec. &lt;strong&gt;No code is written yet.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Act&lt;/strong&gt;: The &lt;em&gt;Developer Agent&lt;/em&gt; reads the spec and writes &lt;strong&gt;failing tests first&lt;/strong&gt;. Only then does it write the implementation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Check&lt;/strong&gt;: The &lt;em&gt;QA Agent&lt;/em&gt; runs the suite, checks for security issues (OWASP), and verifies spec alignment.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Done&lt;/strong&gt;: The &lt;em&gt;Maintainer Agent&lt;/em&gt; cleans up and commits.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It sounds rigorous, but because it's automated via Claude Code commands, it feels incredibly fast.&lt;/p&gt;
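&lt;p&gt;The four phases above can be sketched as a fixed pipeline. Agent names follow the article, but &lt;code&gt;run_sprint&lt;/code&gt; is a hypothetical illustration, not the command's real internals:&lt;/p&gt;

```python
# Hypothetical sketch of the Plan / Act / Check / Done sprint pipeline.
# Phases run in a fixed order; none can be skipped.
PHASES = [
    ("Plan",  "Architect Agent",  "write spec, no code"),
    ("Act",   "Developer Agent",  "failing tests first, then implementation"),
    ("Check", "QA Agent",         "run suite, OWASP checks, spec alignment"),
    ("Done",  "Maintainer Agent", "clean up and commit"),
]

def run_sprint(task):
    """Walk every phase in order for the given task description."""
    return [f"[{phase}] {agent}: {duty} -- {task}" for phase, agent, duty in PHASES]
```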

&lt;h2&gt;Why not OpenSpec or SpecKit? (The Context Problem)&lt;/h2&gt;

&lt;p&gt;This is the question I get asked the most. Why build a new tool when OpenSpec and SpecKit already exist?&lt;/p&gt;

&lt;p&gt;The answer is &lt;strong&gt;Context Management&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Tools like OpenSpec are powerful, but they tend to be "heavy." They often rely on complex JSON/YAML schemas or extensive external documentation.&lt;/p&gt;

&lt;p&gt;Here is the problem: &lt;strong&gt;LLMs have a limited attention span.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you dump a 500-line JSON schema and pages of framework documentation into the context window, the model gets "distracted." It spends its tokens trying to parse the format rather than solving your problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PactKit is designed to be "Context-Native":&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Markdown First&lt;/strong&gt;: Everything is Markdown. Specs are Markdown. Plans are Markdown. LLMs speak Markdown natively; it costs them almost zero cognitive load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Context&lt;/strong&gt;: Instead of feeding the agent 50 files to understand the project structure, PactKit generates a &lt;code&gt;code_graph.mmd&lt;/code&gt; (Mermaid diagram). The agent reads &lt;em&gt;the map&lt;/em&gt;, not the entire territory. This saves massive amounts of tokens and keeps the agent focused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just-in-Time Loading&lt;/strong&gt;: The &lt;code&gt;Plan&lt;/code&gt; phase doesn't load the &lt;code&gt;Act&lt;/code&gt; rules. The &lt;code&gt;Check&lt;/code&gt; phase doesn't load the &lt;code&gt;Plan&lt;/code&gt; tools. We swap "Agent Personas" dynamically to keep the context clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PactKit is simple by design.&lt;/strong&gt; It doesn't try to be a universal data standard. It tries to be the most efficient way to communicate intent to an LLM.&lt;/p&gt;
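&lt;p&gt;The "Just-in-Time Loading" idea can be sketched as a lookup keyed by phase. The file names below are invented for illustration, not PactKit's real layout:&lt;/p&gt;

```python
# Sketch of Just-in-Time context loading: each phase only ever sees
# its own rule files, keeping the context window small and focused.
# (File names are hypothetical.)
PERSONA_FILES = {
    "Plan":  ["rules/architect.md", "context/code_graph.mmd"],
    "Act":   ["rules/developer.md", "rules/tdd.md"],
    "Check": ["rules/qa.md", "rules/owasp.md"],
}

def context_for(phase):
    """Load only what the current persona needs, never the whole framework."""
    return PERSONA_FILES[phase]
```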

&lt;h2&gt;How to try it&lt;/h2&gt;

&lt;p&gt;PactKit is open source and works directly with your existing Claude Code setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pactkit
pactkit init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, just start your next feature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/project-plan &lt;span class="s2"&gt;"Refactor the user service"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are tired of babysitting your AI to make sure it's not deleting your tests, give PactKit a shot. It turns the AI from a chaotic intern into a disciplined engineer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/pactkit/pactkit" rel="noopener noreferrer"&gt;github.com/pactkit/pactkit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/pactkit" rel="noopener noreferrer"&gt;pypi.org/project/pactkit&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>claudecode</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Your AI coding agent is gaslighting you — and your test suite is the victim</title>
      <dc:creator>Slim</dc:creator>
      <pubDate>Fri, 13 Feb 2026 11:24:31 +0000</pubDate>
      <link>https://dev.to/slimd/i-stopped-my-ai-coding-agent-from-rewriting-tests-heres-the-prompt-architecture-that-worked-1io8</link>
      <guid>https://dev.to/slimd/i-stopped-my-ai-coding-agent-from-rewriting-tests-heres-the-prompt-architecture-that-worked-1io8</guid>
      <description>&lt;p&gt;Last month I asked Claude Code to add pagination to an API endpoint. It wrote the code, ran the tests — all green. I approved the commit.&lt;/p&gt;

&lt;p&gt;Two days later a colleague pinged me: "Hey, why does the &lt;code&gt;/users&lt;/code&gt; endpoint return all 50,000 records now?"&lt;/p&gt;

&lt;p&gt;I checked the git log. Claude had modified the pagination test. The original assertion expected 20 results per page. Claude changed it to expect 50,000. The test passed. The CI passed. I didn't notice.&lt;/p&gt;

&lt;p&gt;If you're using AI coding agents, I guarantee this has happened to you — or it will.&lt;/p&gt;

&lt;h2&gt;Why agents do this&lt;/h2&gt;

&lt;p&gt;It's not a bug. It's rational behavior given the agent's objective.&lt;/p&gt;

&lt;p&gt;An AI coding agent is trying to get from red to green as fast as possible. When it introduces code that breaks an existing test, it has two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand why the test exists, figure out what behavior it protects, and rewrite the implementation to preserve that behavior&lt;/li&gt;
&lt;li&gt;Change the assertion to match the new (broken) behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 2 is cheaper. Every time. So that's what the agent picks unless you explicitly prevent it.&lt;/p&gt;

&lt;h2&gt;The fix is a hierarchy, not a rule&lt;/h2&gt;

&lt;p&gt;My first attempt was adding this to CLAUDE.md:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Do not modify existing tests.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It kind of worked. Sometimes. But a five-word rule buried in a 200-line system prompt doesn't survive a long conversation. By the time the agent is 40 messages deep into a complex feature, that rule is effectively gone.&lt;/p&gt;

&lt;p&gt;What actually worked was building a &lt;strong&gt;hierarchy of authority&lt;/strong&gt; that the agent can't easily rationalize away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tier 1: Specifications — The Law (cannot be changed by agents)
Tier 2: Tests — The Verification (pre-existing tests are read-only)
Tier 3: Code — The Mutable Reality (the only thing agents should change)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just a rule — it's a mental model that affects every decision the agent makes. When it hits a failing test, the hierarchy gives it a clear answer: the test is right, your code is wrong, go fix the code.&lt;/p&gt;

&lt;h2&gt;Three patterns that enforce it&lt;/h2&gt;

&lt;h3&gt;1. Pre-existing tests are untouchable&lt;/h3&gt;

&lt;p&gt;The agent is split into two modes: it can create and modify tests for the feature it's currently building (TDD), but it &lt;strong&gt;cannot touch any test that existed before this task started&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a pre-existing test fails, the agent must stop and report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which test failed&lt;/li&gt;
&lt;li&gt;What behavior it was protecting&lt;/li&gt;
&lt;li&gt;Which of its changes caused the failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the most impactful rule. It turns mysterious silent regressions into loud, obvious signals.&lt;/p&gt;
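&lt;p&gt;One cheap way to enforce "pre-existing tests are read-only" is to snapshot test files when the task starts and diff them afterwards. This is a minimal sketch under that assumption; the helper names are invented, and a real check would walk the test directory and hook into the agent workflow:&lt;/p&gt;

```python
import hashlib

def snapshot(files):
    """files maps path to source text; returns path mapped to a content hash."""
    return {p: hashlib.sha256(src.encode()).hexdigest() for p, src in files.items()}

def touched_preexisting(before, after):
    """Pre-existing tests the agent modified or deleted between snapshots.
    Anything returned here should be reported loudly, never auto-accepted."""
    return sorted(p for p in before if after.get(p) != before[p])
```

&lt;p&gt;New test files added during the task never appear in the report; only changes to tests that predate it do.&lt;/p&gt;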

&lt;h3&gt;2. No code without a spec&lt;/h3&gt;

&lt;p&gt;Before writing any code, the agent must produce a written specification: what will change, what won't change, what the acceptance criteria are.&lt;/p&gt;

&lt;p&gt;This sounds like bureaucracy. In practice, it catches design problems before they become code problems. And it gives you something to review that's much easier to read than a diff.&lt;/p&gt;

&lt;h3&gt;3. Scoped TDD loops&lt;/h3&gt;

&lt;p&gt;The agent writes tests first for the new feature (red → green). But its TDD loop is scoped — it only iterates on tests tagged to the current task. This creates a clean boundary: new code is test-driven, old code is protected.&lt;/p&gt;
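&lt;p&gt;Scoping can be as simple as tagging each test with the task that owns it and filtering the red-green loop to that tag (with pytest the same idea maps naturally onto markers, e.g. &lt;code&gt;pytest -m task_142&lt;/code&gt;, marker name invented for illustration). A minimal sketch:&lt;/p&gt;

```python
# Sketch of a scoped TDD loop: the agent may only iterate on tests
# tagged with the current task id; everything else is out of bounds.
def in_scope(test_tags, current_task):
    """test_tags maps test name to owning task id."""
    return sorted(name for name, tag in test_tags.items() if tag == current_task)
```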

&lt;h2&gt;Making it practical&lt;/h2&gt;

&lt;p&gt;I've packaged these patterns into an open-source toolkit called &lt;a href="https://github.com/pactkit/pactkit" rel="noopener noreferrer"&gt;PactKit&lt;/a&gt;. It deploys structured prompt files (agent definitions, command playbooks, rules) into Claude Code's config directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pactkit
pactkit init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you get commands like &lt;code&gt;/project-sprint "feature description"&lt;/code&gt; that run a full Plan → Act → Check → Done cycle with these rules baked in.&lt;/p&gt;

&lt;p&gt;But honestly, even if you don't use PactKit, the three patterns above will improve your AI coding workflow. You can implement them yourself in your CLAUDE.md or equivalent config. The key insight is: &lt;strong&gt;don't write rules, build a hierarchy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;The uncomfortable truth&lt;/h2&gt;

&lt;p&gt;We're in a weird moment where AI agents are productive enough to ship real features but not reliable enough to trust without guardrails. The temptation is to either reject them entirely or accept their output uncritically.&lt;/p&gt;

&lt;p&gt;There's a middle path: treat AI agents like junior developers who are very fast but have no institutional memory. Give them specs. Make tests sacred. Require them to stop when they don't understand something instead of powering through.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;PactKit is open source (MIT) with 952 tests. It works with Claude Code and requires Python 3.10+.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GitHub: &lt;a href="https://github.com/pactkit/pactkit" rel="noopener noreferrer"&gt;pactkit/pactkit&lt;/a&gt; · PyPI: &lt;a href="https://pypi.org/project/pactkit/" rel="noopener noreferrer"&gt;pactkit&lt;/a&gt; · Docs: &lt;a href="https://pactkit.dev" rel="noopener noreferrer"&gt;pactkit.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>claudecode</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
