DEV Community

kanfu-panda

Why hard contracts beat soft conventions when working with AI coding agents

A retrospective on building PDLC — 31 slash commands that force my Claude Code workflow to actually finish features.

Six months of pair-programming with Claude Code taught me one uncomfortable truth: the AI is great at writing code, and terrible at the rest of the software lifecycle.

It happily says "done" when only the chat transcript proves the PRD existed.

It writes tests, but after the implementation — making "TDD" a polite lie.

It loses track of which feature is at which stage the moment you start a new session.

It enters lint-fix loops that converge on nothing.

None of this is the model's fault. The model does what you ask, in the moment you ask it. The problem is that soft conventions — "please write a PRD first", "please write the failing test first" — are how we talk to humans, who keep their own working memory.

LLMs don't. They need hard contracts: rules that the workflow itself enforces, not rules the model is supposed to remember.

This is the story of how I encoded that conviction into a Claude Code plugin called PDLC — Product Development Life Cycle — and what surprised me along the way.

The shape of the contract

PDLC ships 31 slash commands organized in three layers:

  • Entry points (3): /pdlc-feature, /pdlc-fix, /pdlc-status — one sentence drives a whole chain
  • Stages (11): /pdlc-prd, /pdlc-design, /pdlc-tdd, /pdlc-implement, /pdlc-review, /pdlc-e2e, /pdlc-ship, ... — when you want fine-grained control
  • Tools (17): /pdlc-ui-design, /pdlc-db-migrate, /pdlc-security, /pdlc-perf, ... — specialized concerns

Every entry-point and stage command (layers 1 and 2) that produces artifacts is bound to five invariants I call the Iron Law:

  1. Persist to disk. Every artifact (PRD, API design, DB schema, test plan, review notes) lands as a real file under docs/. You can git diff what the AI did.
  2. Update the state machine. Each stage writes docs/.pdlc-state/<feature-id>.json. New session? /pdlc-status tells you exactly where every feature stands.
  3. Tests first. /pdlc-implement literally refuses to proceed if /pdlc-tdd hasn't already produced a failing-test artifact on disk for this feature. Real TDD red-light gate.
  4. Self-check. Every stage runs a self-audit before handing off. Catch drift at the stage boundary, not three stages downstream during review.
  5. One-shot repair. Auto-fix loops run at most once. If a stage's output fails its own audit, the model gets one chance to repair it, then flags the issue for a human. No more "fix → check → fix → check → fix" until the heat death of your token budget.
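The first two invariants can be sketched in a few lines. This is a minimal illustration, not PDLC's actual code (the plugin itself is bash + markdown): the function name `record_stage`, the `feature_id` key, the stage-keyed `artifacts` map, and the exact artifact paths are all assumptions made for the example. Only the docs/.pdlc-state/<feature-id>.json location and the current_stage, artifacts, and next_step fields come from the plugin as described here.

```python
import json
import tempfile
from pathlib import Path


def record_stage(root, feature_id, stage, artifact_rel, content, next_step):
    """Persist one stage artifact, then update the feature's state file."""
    # Invariant 1: the artifact lands as a real file under docs/.
    artifact = Path(root) / "docs" / artifact_rel
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_text(content, encoding="utf-8")

    # Invariant 2: the per-feature state file records where the feature stands.
    state_path = Path(root) / "docs" / ".pdlc-state" / f"{feature_id}.json"
    state_path.parent.mkdir(parents=True, exist_ok=True)
    if state_path.exists():
        state = json.loads(state_path.read_text(encoding="utf-8"))
    else:
        # Hypothetical layout: artifacts keyed by the stage that produced them.
        state = {"feature_id": feature_id, "artifacts": {}}
    state["current_stage"] = stage
    state["artifacts"][stage] = f"docs/{artifact_rel}"
    state["next_step"] = next_step
    state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


# Usage: two stages for one feature, written into a throwaway directory.
demo_root = tempfile.mkdtemp()
record_stage(demo_root, "F20260502-01", "prd",
             "01_requirements/prd/F20260502-01.md", "# PRD\n", "design")
state = record_stage(demo_root, "F20260502-01", "design",
                     "02_design/api/F20260502-01.md", "# API design\n", "tdd")
print(state["current_stage"], sorted(state["artifacts"]))
```

Because both writes happen in one function, a stage cannot finish with an artifact on disk and a stale state file, which is the property the state machine depends on.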

What surprised me

1. The state machine matters more than I expected.

I started with the persistence rule alone — write everything to disk. That helped, but a week into using it I caught myself asking the AI "wait, did we write a PRD for the phone-verification thing?" again. The artifacts existed; they were just hard to find.

Adding the per-feature state file (F20260502-01.json with current_stage, artifacts, next_step) was the moment the plugin started feeling like an actual process tool, not a fancier prompt template. /pdlc-status became my new dashboard.
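To make the dashboard idea concrete, here is a rough sketch of what a status readout over those state files could look like. It is not the real /pdlc-status implementation; `status_lines` is a hypothetical helper, and the stage-keyed `artifacts` map is an assumed layout. The current_stage, artifacts, and next_step field names are the ones mentioned above.

```python
import json
import tempfile
from pathlib import Path


def status_lines(state_dir):
    """Return one summary line per feature state file, sorted by file name."""
    lines = []
    for path in sorted(Path(state_dir).glob("*.json")):
        state = json.loads(path.read_text(encoding="utf-8"))
        lines.append(
            f"{path.stem}: stage={state.get('current_stage', '?')} "
            f"next={state.get('next_step', '?')} "
            f"artifacts={len(state.get('artifacts', {}))}"
        )
    return lines


# Usage: two hand-written state files in a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "F20260502-01.json").write_text(json.dumps({
    "current_stage": "tdd",
    "artifacts": {"prd": "docs/prd.md", "design": "docs/api.md"},
    "next_step": "implement",
}), encoding="utf-8")
(demo_dir / "F20260503-01.json").write_text(json.dumps({
    "current_stage": "prd",
    "artifacts": {"prd": "docs/prd2.md"},
    "next_step": "design",
}), encoding="utf-8")
for line in status_lines(demo_dir):
    print(line)
```

The point of the sketch is that answering "where does every feature stand?" costs one small file read per feature, with no need to open the artifacts themselves.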

2. Tests-first was the hardest contract to make stick.

The natural tendency of an LLM is to "show its work" by writing the implementation first, then writing tests that conveniently pass. To make /pdlc-tdd truly red-light, I had to make /pdlc-implement read the test artifact and verify it currently fails — not just exists. Without that verification, the model would happily generate stubs that already pass.

3. One-shot repair was a token-budget revelation.

Before this rule, lint-fix loops would burn 30k tokens converging on a misunderstanding. The model would "fix" something that wasn't the real issue, the checker would complain again with a slightly different message, the model would "fix" the new message, and so on. Capping the repair at one attempt forces it to either understand the real problem or escalate to a human — both vastly better than infinite drift.
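The cap itself is simple control flow. In this sketch, `audit` returns a list of problems (empty means clean) and `repair` stands in for the model's single fix attempt; both, along with `run_with_one_repair` and the status strings, are hypothetical and only illustrate the shape of the rule.

```python
def run_with_one_repair(output, audit, repair):
    """Audit output; on failure allow exactly one repair, then escalate."""
    problems = audit(output)
    if not problems:
        return output, "passed"
    repaired = repair(output, problems)  # the single permitted attempt
    if not audit(repaired):
        return repaired, "repaired"
    return repaired, "needs_human"  # no second attempt: flag for a person


# Toy audit and repair, standing in for a real linter and a model call.
def audit(text):
    problems = []
    if text != text.strip():
        problems.append("surrounding whitespace")
    if "TODO" in text:
        problems.append("unresolved TODO")
    return problems


def repair(text, problems):
    return text.strip()  # fixes whitespace only; cannot resolve a TODO


print(run_with_one_repair("ok", audit, repair))
print(run_with_one_repair("  ok  ", audit, repair))
print(run_with_one_repair("  TODO later  ", audit, repair))
```

There is no loop in the function at all, so the worst case is two audits and one repair per stage, which puts a fixed bound on the token cost.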

Where this falls apart

Three things I should be honest about:

  • Claude Code only. I rely on slash commands and skills as first-class primitives. Cline and Cursor don't have direct equivalents. Porting is possible (the contracts are in bash + markdown) but not free.
  • The Iron Law is not law. A determined user can edit the state file by hand, or invoke /pdlc-implement directly without going through /pdlc-tdd. The contracts are guardrails, not jail. That's deliberate — guardrails that you can step over with one extra command are usually right.
  • Overhead for tiny changes. Running the full chain for a 3-line CSS fix is overkill. That's what /pdlc-fix (lighter chain) is for, but the line between "feature" and "fix" is a judgment call.

When to reach for it

If your workflow with an AI agent involves:

  • Multiple sessions per feature → the state machine pays for itself
  • Anything you'd want to git diff later → persistence pays for itself
  • Code where you regret not having tests → TDD red light pays for itself

If you're using AI for one-off scripts, refactor sweeps, or single-session prototypes, PDLC is more ceremony than the work justifies. Use it where the discipline is worth its weight.

Try it

Install (no clone needed):

curl -fsSL https://raw.githubusercontent.com/kanfu-panda/pdlc-skills/main/install.sh | bash -s -- --global

Then in Claude Code:

/pdlc-feature add phone-number verification to user login

It will allocate a feature ID, walk through the chain, and refuse to skip the steps that matter.

Repo (MIT): https://github.com/kanfu-panda/pdlc-skills

If you build something on top, file a Discussion — I'd love to see what shapes hold up that I haven't tested.

— kanfu-panda

Top comments (10)

Thomas Landgraf

This is the same conviction I ended up at from a different angle. Creator-here disclaimer: I'm building a VS Code extension called SPECLAN. Six months in, 30+ commands shipped, and like you I had to admit the commands weren't the load-bearing primitive.

Where PDLC encodes the contract as command-shape (/pdlc-implement literally refusing to proceed without a failing-test artifact on disk), SPECLAN encodes it as data-shape. Each spec entity has a status field in YAML frontmatter (draft -> review -> approved -> in-development -> under-test -> released -> deprecated), and the implement tool refuses any entity not in approved state. Same iron law, different surface where the lock sits.

What hit me about your post is the "state machine mattered more than I expected" observation. I had the exact same revelation, just inverted. I shipped persistence and entity primitives first (one Markdown file per Goal / Feature / Requirement, IDs in frontmatter), and the state machine felt like polish on top. Then I tried using it on real work and the state field stopped being polish, it became the gate the agent was actually deciding against. The artifact existing wasn't enough. The artifact's state had to be queryable cheaply.

Two shapes that might be worth comparing as you invited:

First, the read interface. Your /pdlc-status reads the sidecar JSON for the dashboard. SPECLAN exposes the spec tree through an MCP server, so the agent reads state through tool calls in-context rather than from a snapshot in its prompt. The difference shows up when the agent has been running for a while and disk state changed underneath. Tool calls always see truth, prompts go stale.

Second, the amendment problem. Your Iron Law gates the forward path. The shape that bit me hardest was the backward path: an approved spec needs to change while three child specs are still in draft underneath it. I ended up modeling that as a Change Request entity (CR-####), a sibling document with its own draft -> review -> approved lifecycle that references the parent by ID and merges into it on approval. Without that primitive the agent will helpfully overwrite the parent and silently strand the in-flight children, exactly the silent drift you described in a different shape.

Curious which mechanism PDLC uses for "feature that's already shipped needs to change" - new feature ID and re-enter from PRD, or something else? That's the seam where I keep finding edge cases.

kanfu-panda

Thomas — thank you, this is the kind of comment that makes Build in Public worth doing.

Your three observations all hit. Some additional notes on each:

State machine > artifact existence: yes, this was the most surprising lesson. The trigger wasn't a design choice; it was watching agents lose context mid-task and be unable to resume. Persisting artifacts alone didn't help; what saved us was a per-feature state file the next agent (or next session of the same agent) could read in one shot to answer "where are we, what's done, what's next?" Queryable state is the load-bearing primitive, exactly as you describe.

Sidecar JSON vs MCP server: interesting architectural fork. The file-sidecar approach was chosen for portability (any tool that can read a file can query state). Exposing state through MCP is more elegant for live queries and would compose better with other agents. Would love to see how SPECLAN wires that exposure.

Change Request entities for amendments — PDLC's current approach is half-solved at best:

  • Each spec document has a tracking chain in its frontmatter (linked-list to supersessions)
  • Amendments create a child document that supersedes the original; the original is frozen, not edited
  • This works for event-shaped artifacts (a specific PRD revision, a specific design iteration) where history matters

But it breaks down for convention/standard documents (coding standards, glossary, architecture overview), which need to stay current. Freezing them and adding child docs would produce 5 generations of "architecture overview, v1..v5" with no obvious answer to "which one is true today?" My current thinking leans toward in-place edits with a version number + changelog at the top, but I haven't finalized this. How does SPECLAN's CR mechanism distinguish event-shaped vs convention-shaped artifacts, or does it not need to?

One trade-off worth naming: PDLC solves agent unpredictability by being spec-driven, doc-driven, and TDD-driven all at once. The price is a lot of files. Maintenance overhead is real: every feature touches 5-8 docs, and they need to stay consistent. The unexpected benefit: when a fresh agent (or human) joins, they can absorb the entire project state in 20-30 minutes by reading the per-feature state files and PRDs. Onboarding went from "ask the original author" to "read the docs directory." Trade-off, not a free lunch, but at the right scale, it's the right trade.

Will check out SPECLAN. Drop a link if you have a public repo or write-up. Always interesting to see how the same conviction takes different architectural shapes.

— kanfu-panda

Thomas Landgraf

This question hits the seam where we ended up going further than I'd planned. Short answer: yes, the distinction matters, and CR alone wasn't enough - we ended up shipping two separate mechanisms.

The reasoning maps almost exactly to your event-vs-convention split. CR works for spec entities (Goal / Feature / Requirement) because those ARE event-shaped: the approved Requirement is a statement of what we committed to at that moment, history matters, supersession needs an audit trail. Same shape as your PRD revision case. The CR entity holds the next-version draft, gets reviewed in parallel against the locked parent, merges on approval, and the old version stays addressable by ID for the audit chain.

But the moment we tried to use CR for things like a coding-standards doc or an architecture overview, we hit exactly your "5 generations of architecture overview, which is true today" problem. Status-gated entities are the wrong primitive for stuff that should always represent "the current convention." Frozen-then-superseded turns the reference shelf into archaeology.

So we shipped Project Artifacts as a separate mechanism. They sit alongside the spec tree, ungoverned, no status, no CR - just files in a speclan/.artifacts/ folder that you edit in place, with git history as the only audit trail. Coding standards, glossary, decision records, API contracts a feature points to - that surface. The deliberate asymmetry IS the design: spec entities are governance, project artifacts are reference. Trying to put one mechanism over both is what produces the v1..v5 problem.

Wrote that out as a dev.to piece last week if useful: dev.to/thlandgraf/why-i-shipped-tw... - it's the design rationale before the implementation. Broader surface at speclan.net.

The file-count trade-off you named at the close is the part that landed hardest on me. I don't have a clean answer there either - happy to unpack that in a future turn if you want.

kanfu-panda

Thomas, this is one of the most generous architectural explanations I've received in a long time. The "two mechanisms, not one unified" framing is the answer I was missing.

You named the failure mode exactly: trying to put one mechanism over both artifact types is what produces the v1..v5 problem. And that's precisely where PDLC is today: same command surface, same frozen+child-supersedes treatment for both docs/01_requirements/prd/... (event-shaped) and docs/00_standards/coding/... (convention-shaped). The latter is bleeding badly: I have three "code style v1/v2/v3" documents with no good answer to "which one is true."

The mental model that finally clicked reading your reply: spec entities are ledger-shaped; convention artifacts are surface-shaped. A ledger needs the full audit chain: every approved Requirement is committed history. A surface only needs the latest layer: git history holds the audit, but the file itself must always represent current convention.

What SPECLAN does cleanly: governed CR entities for the ledger, ungoverned in-place edits for the surface. Two surfaces. Two contracts. Refuses to conflate them.

I'm going to borrow this directly. The shape for PDLC will probably be:

  • docs/01_requirements/prd/F-xxx-yyy.md → governed, CR-based (existing mechanism)
  • docs/00_standards/ and docs/02_design/architecture/ → ungoverned, in-place edit, git is the audit
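As a rough sketch of that split, a path-based policy lookup would be enough to tell a command which contract applies. The directory prefixes are the ones listed above; the function `edit_policy`, the helper `_under`, and the policy names are hypothetical and not part of PDLC today.

```python
from pathlib import PurePosixPath

# Prefixes taken from the list above; the policy names are hypothetical.
GOVERNED = ["docs/01_requirements/prd"]
UNGOVERNED = ["docs/00_standards", "docs/02_design/architecture"]


def _under(path, prefix):
    """True if path is the prefix directory or sits anywhere beneath it."""
    parts = PurePosixPath(path).parts
    prefix_parts = PurePosixPath(prefix).parts
    return parts[:len(prefix_parts)] == prefix_parts


def edit_policy(path):
    """Classify a doc path as ledger-shaped, surface-shaped, or unknown."""
    if any(_under(path, p) for p in GOVERNED):
        return "governed"        # ledger: freeze and supersede via a child doc
    if any(_under(path, p) for p in UNGOVERNED):
        return "edit_in_place"   # surface: overwrite, git is the audit trail
    return "unclassified"        # no contract decided yet


for doc in ["docs/01_requirements/prd/F-001-login.md",
            "docs/00_standards/coding/style.md",
            "docs/02_design/api/F-001-login.md"]:
    print(doc, "->", edit_policy(doc))
```

Comparing path components rather than raw string prefixes keeps a sibling directory such as docs/00_standards_old/ from being treated as part of docs/00_standards/.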

The harder question is which existing PDLC commands need to fork. /pdlc-prd stays governed. But /pdlc-arch and any "coding standards" stage probably need to switch from "create new file" to "edit in place." Not a code change, a contract change: exactly the kind of thing that's easier to land when I see someone else has lived with the consequences for a while.

On the file-count trade-off you mentioned — happy to unpack further. PDLC's 5-8 files per feature break down as:

  1. prd/F-xxx-yyy.md — PRD
  2. design/api/F-xxx-yyy.md — API design
  3. design/database/F-xxx-yyy.md — DB schema (if applicable)
  4. tests/F-xxx-yyy.md — test plan
  5. reviews/code/F-xxx-yyy.md — review notes
  6. .pdlc-state/F-xxx-yyy.json — state machine
  7. (optional) tasks/F-xxx-yyy.md — in-stage task tracking
  8. (optional) retros/F-xxx-yyy.md — retrospective

The state file (#6) is the only one Claude reads to answer "where are we?", which is why it earned its load-bearing place. The other 5-7 are read on demand or as inputs to the next stage. What does SPECLAN's typical feature footprint look like?

I'll read your "Two Artifact Mechanisms" piece next. Drop the link in a reply or DM if you'd like. And if you're up for a longer-form continuation of this thread, I'd love to write a follow-up on dev.to comparing the two architectures with attribution. The "v1..v5 problem" deserves a clean name, and your framing gave it one.

Thomas Landgraf

Yes to the collaborative dev.to follow-up - that would land hard, and attribution doesn't matter much to me either way (write the framing as you see it, credit follows substance). The "ledger-shaped vs surface-shaped" rename you proposed is genuinely better than my event-vs-convention. "Ledger" carries the audit-chain commitment in one word, "surface" makes the must-always-represent-current-state property obvious. I'm going to start using it.

On the file footprint - SPECLAN is structurally different from PDLC in a way that's interesting for the comparison. We don't decompose features across concerns (prd / design / api / db / tests / reviews). We decompose them across hierarchy (Goal -> Feature -> sub-Feature -> Requirement). A medium-sized feature is roughly: 1 Feature.md, 2-4 sub-Feature.md if the decomposition warrants it, 4-8 Requirement.md leaves. Frontmatter carries the acceptance criteria, the dependencies, the trace-back to parent ID, the status. Design rationale and the "why this exists" goes in the entity body itself, not a separate design.md. So 7-13 files per feature in the typical case, but they are hierarchical children, not stage outputs.

The structural axis is the interesting thing for your comparison piece. PDLC decomposes the lifecycle (every feature gets the same files at different stages). SPECLAN decomposes the work (every feature gets a different shape depending on how granular it is). Different axes, same outcome at the file-count level. Yours optimizes for "what stage are we at on this feature." Ours optimizes for "what is this feature actually made of."

That has a knock-on effect on the state machine. Your .pdlc-state file is load-bearing because the 7 other files are concerns at different lifecycle stages and the state file is the index. SPECLAN doesn't have a state file because the per-entity status: frontmatter field is the same primitive, just attached to each artifact rather than centralized. Two valid answers to "where is the project state?" One says "in the central index that summarizes the per-feature artifacts," the other says "on each artifact, queried via tool calls." Worth a paragraph in the comparison piece - it's the most concrete tradeoff of the two architectures.

The Two Artifact Mechanisms piece is here for the followup: dev.to/thlandgraf/why-i-shipped-tw...

If/when you draft, send it over - happy to fact-check the SPECLAN side before you publish.

kanfu-panda

Thomas, the "I'm going to start using it" line made my week. Knowing someone with 40 years on me adopts a term I coined is the kind of feedback no amount of stars or downloads can substitute for. Thank you.

Just read your "Why I Shipped Two Artifact Mechanisms" piece. It's the cleanest articulation I've seen of the decision both of us seem to have stumbled into independently. The phrase that stuck with me: "the deliberate asymmetry IS the design." That sentence is going in the follow-up.

The two structural axes you pointed out are exactly the spine of the comparison piece:

Axis 1: Lifecycle decomposition vs work decomposition.

PDLC asks "what stage is this feature at?" and produces a fixed set of files per feature, regardless of feature size. SPECLAN asks "what is this feature actually made of?" and produces a variable-shaped tree, regardless of lifecycle position. Same outcome at the file-count level, opposite optimization at the modeling level. A reader picking between them isn't picking better or worse; they're picking which question they want the tool to make easy.

Axis 2: Centralized index vs per-entity state.

PDLC's .pdlc-state/<id>.json is load-bearing because lifecycle stages are concerns spread across files, and the index is the only place that knows the full state. SPECLAN's per-entity status: frontmatter works because the work decomposition keeps each entity self-contained: the artifact and its state live together. The tradeoff is concrete: queryability vs locality. Centralized index lets you ask "where is everything?" cheaply; per-entity state lets you ask "what is this thing?" cheaply.

I think the follow-up piece writes itself with this outline:

  1. The two-mechanism principle — ledger-shaped vs surface-shaped, why one mechanism over both produces v1..v5
  2. The two structural axes — lifecycle vs work, centralized vs per-entity, with both PDLC and SPECLAN as worked examples
  3. When to pick which — heuristics for choosing the axis: team-size, project domain, agent integration model
  4. Co-authored conclusion — the conviction that survives both architectures: hard contracts > soft conventions, and the load-bearing primitive is queryable state, regardless of where the state lives

Target ~2000 words, draft this week. I'll send you the working doc when the skeleton is in — you can fact-check the SPECLAN claims and add anything I missed.

Co-author or "with significant input from Thomas Landgraf" credit, your call.

Happy to do the writing; the framing came out of your reply more than mine.

Speaking of which — saw Sam Sheng dropped in too. Quiet signal that this conversation isn't just us anymore.

Harjot Singh

hard contracts > soft conventions for agentic codegen, 100%. soft rules in CLAUDE.md degrade as the convo grows. ive been enforcing the same idea in moonshift's orchestrator as TYPED handoffs between phases (scaffold -> integrate -> deploy each rejects malformed input from the prior). $3 per shipped saas. first run free if u want to see hard-contracts applied at the gen-pipeline layer (vs the IDE layer).

Harjot Singh

hard contracts over soft conventions is exactly the lesson that made Moonshift work. soft conventions drift the moment an agent runs unattended. so the whole pipeline is hard contracts: validation between steps + a hard gate on irreversible actions, then agents build + deploy + market a SaaS overnight. you've articulated the core idea really well. first run's free if you want to see it enforced end to end.

Sam Sheng

Thanks for sharing these awesome thoughts! 🦾

kanfu-panda

Sam, thanks for stopping by! There's a deeper exchange happening in this thread with Thomas if you want to follow along; we're drafting a follow-up piece together. Glad it resonated.