DEV Community: Daniel Butler

Intent-Driven Development Changes the Shape of Risk

Daniel Butler — Mon, 02 Mar 2026 11:52:39 +0000

Agentic coding increases throughput in ways that are now observable in practice. Tasks that previously required sustained effort can be implemented quickly, boilerplate largely disappears, and refactors that were once deferred become cheap enough to attempt.

By intent-driven development, I mean working at the level of describing outcomes and constraints while an AI agent generates the implementation. It is what many people call agentic coding, but the emphasis shifts from typing code to specifying intent.

The implementation phase compresses while verification does not compress at the same rate, and that shift changes the risk profile.

Abstraction Shifts Always Redistribute Risk

Every abstraction shift in software increased leverage and redistributed responsibility.

Managed runtimes reduced memory errors but made runtime behaviour and performance characteristics more important. Cloud infrastructure made distributed failure explicit and forced investment in reliability engineering. CI/CD replaced manual validation with automated gates that had to be trusted. Microservices fragmented systems and forced clarity around contracts and observability.

Each shift created new roles and new process discipline because the abstraction changed where mistakes surfaced. Those adaptations happened over years. Tooling matured gradually. Organisational practices caught up.

Intent-driven development follows the same structural pattern, but the rate of adoption is faster than the pace at which process usually evolves.

We Relied on Human Throughput More Than We Realised

Most delivery pipelines were shaped around human output.

Reviews are sometimes rushed, diffs stretch further than they should, architectural drift accumulates quietly, and technical debt builds because there is rarely time to address it systematically.

Production systems mostly held.

Part of the reason is that human implementation speed acted as a natural throttle. Writing code takes time, large cross-cutting refactors require coordination, and broad structural changes are effortful enough that they are usually deliberate.

Change rate limited blast radius, even if no one explicitly designed it that way.

Intent-driven development removes much of that friction. Wide refactors become cheap, cross-cutting changes are easy to attempt, and large diffs can be generated in minutes.

AI does not remove legacy complexity; it interacts with the system as it exists.

If a team is already carrying significant debt, increasing implementation velocity does not automatically create time to fix underlying issues. It increases the rate at which fragile areas are modified.

Pipelines Were Calibrated for Slower Change

Most CI/CD pipelines validate what they were designed to validate: syntax, contracts, unit tests, integration tests, and static analysis thresholds.

They were calibrated under assumptions about human effort. A large refactor required time and coordination. A pull request touching twenty files was noticeable and unusual.

With intent-driven development, it is trivial to produce a pull request that modifies forty files across multiple subsystems in a single session.

Consider a model asked to “standardise logging.” It replaces structured logging with string interpolation everywhere. Unit tests still pass because observable behaviour under test did not change. The logging contract degrades subtly, structured fields disappear, and observability suffers in production.

Similar patterns appear in broader refactors, where large cross-cutting changes pass automated checks but introduce integration or concurrency issues that only surface under real load.

The AI did not deploy those changes.

The pipeline admitted them.

When implementation happens at the level of intent, scope expands easily and execution is immediate. That raises the bar for architectural clarity and verification depth. If review capacity and verification discipline remain unchanged, the system is absorbing a different class of change than it was calibrated for.

A Practical Response at the Workstation Layer

In my own work, I introduced a constrained agentic workflow to compensate for the current state of both the tooling and the surrounding process.

Today’s coding agents are capable of producing wide diffs and cross-cutting changes with very little friction, while most delivery pipelines were not designed for that velocity.

The workflow operates at the developer workstation, before code reaches a pull request. It deliberately constrains scope through bounded tasks, explicit gates, one change at a time, and cleanup before PR.

It is a temporary constraint while the broader delivery system evolves. It does not replace architectural governance, security discipline, or structural debt reduction. It compensates for their current limits.

As review triggers, coverage enforcement, architectural checkpoints, and AI evaluation frameworks mature, those constraints can relax. Velocity should increase because the system can safely absorb it, not because local safeguards were bypassed.

That structure is documented here:

https://dev.to/danielbutlerirl/agentic-workflow-design-index-and-reading-order-4443

It outlines workstation-level guardrails designed to prevent wide, unreviewable changes from reaching PR in the first place.

The Practical Question

Intent-driven development is already happening.

Every abstraction shift required process evolution. This one is moving faster than most, and treating it as just another tooling upgrade assumes the system was already prepared for significantly higher throughput.

Increasing change velocity without recalibrating review and verification increases the likelihood that weaknesses in the pipeline become visible. Acceleration exposes whatever was previously tolerated.

Process discipline needs to evolve alongside it.

Did that actually help? Evaluating AI coding assistants with hard numbers

Daniel Butler — Mon, 02 Mar 2026 08:58:29 +0000

You are building a Skill, an MCP server, or a custom prompt strategy that is supposed to make an AI coding assistant better at a specific job. You make a change. The next session feels smoother. The agent seems to reach for the right context at the right time.

But how do you know?

That question came up in two parallel problems.

I was building and iterating on MCP servers to support a coding agent. New tool, new tool definition, new prompting strategy. Each change felt like an improvement. Sessions seemed smoother. But I had no numbers. I had vibes.

A colleague was working on the same problem from the other side: he was building and refining AI coding Skills -- structured prompt packs that teach the agent how to work in a specific context. Same issue. A lot of iteration, a lot of gut feel, no hard signal on whether the changes were actually moving the needle.

We joined forces and built something to fix this. The result is Pitlane -- named after the place in motorsport where engineers swap parts, adjust the setup, check the telemetry, and find out if the next lap is faster.

The problem with vibes

When you change an MCP server or a Skill, you are changing something about the environment the agent operates in. The agent gets different tools, different context, different instructions.

Those changes can have real effects: pass rates on tasks go up or down, the agent takes fewer wrong turns, token costs change, time to completion changes, output quality improves or degrades.

Without measurement, you cannot tell which of those things happened. You cannot tell whether the last commit was an improvement or a regression. You cannot tell whether version 3 of your Skill is better than version 1.

You end up making decisions based on a handful of memorable sessions, which is not a reliable signal. Good sessions feel good. Bad sessions get rationalised. The data you are implicitly collecting is not representative.

What you actually need

You need to be able to answer a specific, repeatable question:

With my Skill or MCP present, does the agent complete this task better than without it?

That question has a structure: a defined task with explicit success criteria, two configurations (baseline without your changes and challenger with them), deterministic assertions that verify success independently of the agent's own judgement, and a way to compare results across runs.

That structure is an eval. Not a generic language model benchmark. A benchmark for your specific Skill or MCP server, in your context, on tasks that actually matter to you.

What Pitlane is

Pitlane is an open source command-line tool for running those evals. You define tasks in YAML, configure a baseline and one or more challengers, and race them against each other. The results tell you -- with numbers rather than impressions -- whether your work is paying off.

The loop is simple: tune, race, check the telemetry, repeat.

The assertions are deterministic. File existence checks, command exit codes, pattern matching -- either the file is there and valid or it is not. No LLM-as-judge, no subjectivity baked into the measurement. When you need fuzzy matching for documentation or generated content, similarity metrics (ROUGE, BLEU, BERTScore, cosine similarity) are available with configurable thresholds. These are deterministic numeric metrics, not a second model grading your output.

Because agent outputs are non-deterministic, Pitlane supports repeated runs with aggregated statistics -- average, minimum, maximum, and standard deviation across runs. A Skill that reliably pushes a hard task from 50% to 70% pass rate is a meaningful result. Especially when that task used to fail half the time in CI. A Skill that appears to do that in a single run might just be variance.

The tool tracks pass rates alongside cost, time, and token usage. A Skill that improves pass rate by 5% while tripling cost is a different trade-off than one that hits the same improvement at the same cost. Both columns appear in the HTML report so you can see the full picture.

Pitlane currently supports Claude Code, Mistral Vibe, OpenCode, and IBM Bob (at time of writing).

Why not use an existing eval tool?

There are good, widely-used tools in this space. promptfoo, Braintrust, LangSmith, DeepEval, and others all solve real problems. The question is whether they solve this problem without requiring you to build the scaffolding yourself.

Take promptfoo as a representative example -- it is mature, well-documented, and genuinely extensible. It runs real agent sessions via its Claude Agent SDK and Codex SDK providers. The agent actually executes. Files actually get written. So far, so good.

The gap shows up in the assertion layer. Promptfoo's built-in assertions are primarily oriented around validating the agent's returned text. In their coding-agent guide, one of the example verification patterns is a JavaScript assertion that parses the agent's final text for keywords like "passed" or "success":

const text = String(output).toLowerCase();
const passed = text.includes('passed') || text.includes('success');

That assertion passes when the agent says the tests passed. It does not verify that the tests actually passed. A model that narrates success while producing broken code passes. A model that silently produces correct code with a terse "done" might not. That is fine for some workflows. It is not the same as asserting on the produced artifacts as first-class primitives.

Promptfoo's JavaScript assertion API is powerful enough to do better -- you can call require('fs') and require('child_process') and wire up real filesystem checks yourself. But you are writing boilerplate from scratch for every benchmark, managing your own working directory scoping, and handling fixture isolation manually. Their documentation acknowledges the gap directly:

"The agent's output is its final text response describing what it did, not the file contents. For file-level verification, read the files after the eval or enable tracing."

-- promptfoo docs: Evaluate Coding Agents

"Read the files after the eval" is a step outside the pipeline. That is what building Pitlane felt like when approached from that direction -- assembling scaffolding that should have been there already.

In Pitlane, command_succeeds: "terraform validate" or command_succeeds: "pytest" is a first-class primitive. One line. Every task gets a clean fixture copy automatically. The difference is not what is theoretically possible -- it is what is built in versus what you have to construct.

Benchmarks that don't lie to you

Measurement helps, but measurement can also mislead. Three failure modes are worth keeping in mind.

Gaming your own benchmark. When a metric becomes a target, behaviour adjusts to hit the target rather than the underlying goal. The baseline/challenger structure is the first defence -- you are not asking "does this pass" in isolation, you are asking "does this beat the baseline." The second defence is to include tasks your Skill was not specifically designed for. If adjacent tasks regress when your target tasks improve, you have a problem.

Pass rate is a goal metric, not the whole picture. Pass rate tells you whether the output was correct. It does not tell you what it cost to get there. Pitlane tracks tokens, cost, and time alongside pass rates. A Skill that takes a task from 60% to 80% pass rate while doubling token cost is a different trade-off than one that achieves the same at the same cost. Check both before deciding whether the change was worth shipping. The weighted score is also distinct from the binary pass rate -- a task where the critical assertion is weighted 3x tells a different story than a flat count.

Your context is not someone else's context. A generic benchmark tells you how an assistant performs on generic tasks. The meaningful signal comes from tasks you write yourself, against fixture directories that reflect your actual project structure, with assertions that match what "done" means in your specific context. Borrowing a benchmark wholesale and optimising against it is still measuring someone else's problem.

What this changes

The question "is this actually better" becomes answerable.

When you add a new tool to an MCP server, you can benchmark before and after and see whether the task that motivated the tool now passes more reliably. When you tighten a prompt in a Skill, you can see whether that tightening broke anything on tasks that previously passed.

Without measurement, every change is a vibe. With measurement, you have a signal. The signal is not perfect. Benchmarks can be gamed. Task sets can be incomplete. Improvements on a small task set may not generalise. But noisy measurement beats no measurement. You can improve your task set over time. You cannot improve intuition alone.

The lap times do not lie.

Try it

Pitlane is open source, takes a few minutes to set up, and is documented at the repository:

https://github.com/pitlane-ai/pitlane

If you are building MCP servers or AI coding Skills and you want hard numbers instead of gut feel, this is the tool. We built it because we needed it, and we would rather more people are measuring than guessing.

If you find a gap, open an issue. If you add support for a new assistant or improve an existing one, send a PR. The codebase is Python, the architecture is straightforward, and contributions are welcome.

Agentic workflow design — index and reading order

Daniel Butler — Mon, 16 Feb 2026 16:56:15 +0000

This series documents an enterprise workflow design for working with AI coding agents.

It is not a prompt collection and it is not tied to a single tool. It is a structural approach based on observed failure modes in production environments and the gaps that exist in current agentic tooling.

Agentic coding increases throughput. It also increases the probability of shipping something we did not intend to ship. The workflows described here are designed to close that gap.

If you are new to the series, this post provides the reading order and context.

1. Failure modes: where agents fail and where we fail

Designing Agentic Workflows: Where Agents Fail, and Where We Fail https://dev.to/danielbutlerirl/designing-agentic-workflows-where-agents-fail-and-where-we-fail-4a95

This post defines the predictable failure surfaces that emerge under volume:

Baby-counting (silent requirement loss)
Cardboard muffin (plausible but hollow implementations)
Half-assing (working but structurally poor code)
Litterbug (residue accumulation)
Rubber-stamping, review fatigue, intent drift, and decision delegation

The core argument: the agent’s optimisation behaviour and human cognitive limits combine to produce structural risk. Green CI and plausible diffs are observable signals, not guarantees of correctness.

Start here. It defines the problem space.

2. From diagnosis to design

Designing Agentic Workflows: A Practical Example https://dev.to/danielbutlerirl/designing-agentic-workflows-a-practical-example-291j

This post shifts from analysing failure to designing around it.

It introduces the foundational constraint behind the workflow:

Verification must be independent of the language model.

The post explains why proposal and verification must be separated, why work must be bounded, and why durable intent must exist outside conversational context.

This is the conceptual shape of the workflow.

3. The core loop

Designing Agentic Workflows: The Core Loop https://dev.to/danielbutlerirl/designing-agentic-workflows-the-core-loop-166d

This post describes the operational sequence used in practice:

Define verification gates
Plan one bounded task
Implement exactly one task
Repeat until all gates pass
Run cleanup

It explains why sessions are treated as disposable, why durable state must live outside chat history, and why commit-sized changes preserve review integrity.

This is the backbone of the methodology.

4. Supplementary commands and pressure valves

Designing Agentic Workflows: Supplementary Commands and Pressure Valves https://dev.to/danielbutlerirl/designing-agentic-workflows-supplementary-commands-and-pressure-valves-l51

The core loop handles most work.

This post covers additional structural controls used when:

The issue cannot yet be defined in verifiable terms (/wf-investigate)
Interfaces must be explicit before implementation (/wf-design)
Architectural decisions introduce durable constraints (/wf-adr)
Long sessions begin degrading context (/wf-summarise)

These are not alternative workflows. They are pressure valves that keep the core loop viable under complexity and long-running work.

What this series is

This is a workflow design for enterprise environments where:

Reviewability matters
Audit and compliance exist
Availability and security have real cost
Silent regressions are unacceptable

It addresses gaps present in today’s tooling by changing the environment the model operates in rather than attempting to make the model smarter.

As agentic tooling evolves, some of these controls may become redundant. Today, workflow design remains the most reliable control surface available.

If you are adopting agentic coding in production systems, read the failure modes first. Then layer structure deliberately.

Throughput is visible. Regressions often are not.

This series is about narrowing that gap.

Designing agentic workflows: supplementary commands and pressure valves

Daniel Butler — Mon, 16 Feb 2026 16:40:18 +0000

In the previous posts, we established:

The core execution loop:

Define gates
Plan one bounded task
Implement exactly one task
Repeat until all gates pass
Run cleanup

That loop is the default path. Most issues go straight into it.

The supplementary commands do not replace the loop. They address two pressures: context degradation and intent drift.

If neither pressure is present, skip them. If either appears, use them deliberately.

Pressure 1: Context degrades

`/wf-summarise`

Long sessions degrade gradually. Repeated corrections appear. Previously rejected approaches reappear. Architectural decisions get lost. Small logic mistakes creep in.

The core loop assumes sessions are disposable.

The "50% rule" is a heuristic, not a hard stop. Finish the current bounded unit of work and exit before automatic compaction occurs, corrections begin looping, or state reconstruction becomes unreliable.

Compaction compresses earlier reasoning. Each compression step discards detail. After several rounds, subtle constraints disappear.

/wf-summarise prevents that erosion.

What `/wf-summarise` does

When you decide to reset:

Finish the current task or step.
Run /wf-summarise.
It generates a structured temporary handover document and a phase-aware continuation prompt for the next session.

The handover captures:

Issue objective and scope
Gate status
Completed tasks
Remaining work
Key architectural decisions
Relevant constraints
Important discoveries

Handover documents are temporary. They bridge sessions, not commits.

The command output includes a continuation block tailored to the current phase:

---
You are resuming work on `issue-47-user-validation`.

Before proceeding:
1. Read `.agents/issues/issue-47-user-validation/handover.md`
2. Confirm gate status
3. Identify next task

Current phase: Implementation
Tasks completed: task-1, task-2
Tasks remaining: task-3, task-4
---

You paste that into a fresh session along with the handover document. The new session reconstructs state from artifacts and the handover. Once alignment is confirmed, the handover can be discarded.

Examples

Task almost complete, context around 60%

Tests nearly passing. Work coherent.

Finished the task. Committed. Started a fresh session for the next task.

Corrections looping, context around 60%

Agent repeating rejected approaches. Failing to resolve something cleanly. Constraints being restated.

Degradation has started.

Stopped iterating. Ran /wf-summarise. Reset deliberately instead of brute-forcing through polluted context.

Mid-work, context near 70%

Mid-design or mid-implementation. Significant work remaining.

Context near compaction.

Checkpoint. Reset. Continue in a fresh session.

/wf-summarise is a pressure valve for context. It makes resets controlled instead of lossy.

Pressure 2: Intent must be explicit

Starting point: Issue with clear success criteria

Before entering the core loop, you need an issue with verifiable success criteria.

Most teams already have issues from GitHub, JIRA, product specs, or bug reports. If your issue has clear, verifiable criteria, proceed to the litmus test below.

If you need to create or enrich an issue, use /wf-issue-plan. It will check for investigation, design, or ADR gaps and suggest the appropriate on-demand command before proceeding.

The on-demand trio

Once you have an issue defined, ask:

Can I define verification gates for this right now?

If the answer is unclear, do not enter the core loop yet.

Yes, clearly - Go straight to gates and enter the core loop.
Yes, but I need to define interfaces first - Use /wf-design.
Yes, but this requires an architectural decision that constrains future work - Use /wf-adr.
No, I don't understand the issue or codebase well enough - Use /wf-investigate.

These commands make intent explicit before implementation begins.

`/wf-investigate`

Use this when:

The issue description is vague.
You are debugging unfamiliar code.
You need to explore before defining "fixed."

Not needed for:

Clear bugfixes with known root cause.
Features in familiar code with obvious implementation path.
Changes where you can immediately define "done."

A bug report says "Authentication sometimes fails." You cannot define gates yet. "Sometimes" is not verifiable.

Investigation might reveal it fails when token expiry falls within a 5-second race window and only affects a specific middleware branch.

Now you can define concrete gates:

expired_token_within_grace_period_rejected
valid_token_within_window_accepted

Investigation exists to make gates possible. If you cannot define "done," you do not understand the problem well enough.

`/wf-design`

Use this when introducing new interfaces, changing public contracts, or adding cross-cutting abstractions.

Not needed for:

Internal implementation details that don't affect contracts.
Small behavioral changes in existing functions.
Refactoring that preserves existing interfaces.

Design produces proposed interfaces, data shapes, interaction patterns, and trade-offs. No implementation. Human review happens before code.

You are adding a new API endpoint. Instead of implementing immediately, you define request/response schema, error contract, and versioning expectations.

Design answers:

What are we building for this issue?

It prevents structural drift during implementation. Design is issue-scoped structure.

`/wf-adr`

Use this when the decision introduces or rejects a dependency, changes architectural direction, constrains future work, or is likely to be revisited if undocumented.

Not needed for:

Implementation choices (sort algorithm, loop structure, data structure).
Local design patterns within a module.
Tooling or developer experience improvements.
Style or convention decisions.

Reserved for:

Introducing or rejecting dependencies (libraries, services, infrastructure).
Choosing system-wide patterns (API style, auth approach, data flow).
Organizational architecture (monorepo structure, deployment strategy).

The purpose of /wf-adr is not to let the model decide. It structures the decision space.

The command produces context, explicit options, trade-offs, and consequences of each option. It does not select the final approach.

The output is presented to the human. The human reviews it with their team. The team makes the decision. That decision is documented as the ADR. The documented ADR becomes a durable constraint. Future planning must respect it.

Choosing between native crypto APIs or introducing a third-party JWT library: the agent can enumerate maintenance cost, upgrade surface area, and operational risk. It does not choose. The team decides. The ADR records that decision.

ADR answers:

What rule are we introducing into the system?

Design is issue-scoped. ADR is cross-issue constraint.

Knowing when to skip

Not every change needs investigation. Not every feature needs design. Not every decision needs an ADR.

Small, well-understood bugfix? Go straight to gates.

Clear behavioural adjustment with no structural impact? Skip design.

Implementation detail that does not affect boundaries? No ADR.

The core loop is the default. The supplementary commands are situational.

How this supports the core loop

The core loop remains:

Define gates
Plan bounded task
Implement exactly one task
Repeat
Cleanup

The supplementary commands intervene before or during that loop when pressure appears:

/wf-issue-plan when creating or enriching issues.
/wf-investigate before gates.
/wf-design before implementation.
/wf-adr before structural commitment.
/wf-summarise during long-running work.

They do not replace the loop. They keep it viable under complexity.

The core loop enforces bounded implementation.
The supplementary commands enforce bounded context and explicit intent.

If no pressure exists, skip them.
If pressure appears, activate them deliberately.
The structure stays lean by default and expands only when necessary.

Designing agentic workflows: the core loop

Daniel Butler — Mon, 16 Feb 2026 14:37:08 +0000

The previous posts laid out the failure modes and the initial structure:

Designing agentic workflows: where agents fail and where we fail
https://dev.to/danielbutlerirl/designing-agentic-workflows-where-agents-fail-and-where-we-fail-4a95
Designing agentic workflows: a practical example
https://dev.to/danielbutlerirl/designing-agentic-workflows-a-practical-example-291j

Those posts described where agents fail, where we fail, and what a constrained workflow looks like in principle.

This post shows the core loop as it is implemented in practice:

https://github.com/daniel-butler-irl/sample-agentic-workflows

This is not a prompt collection. It is a sequence. The ordering matters.

Sessions are disposable. The repository is not

Every command in this workflow runs in a fresh session.

That decision drives everything else.

Long sessions drift. Context grows. Earlier constraints become less salient. Decisions get made implicitly and are hard to reconstruct later. Instead of trying to manage that complexity, the workflow treats sessions as disposable and moves all durable state into the repository being changed.

On the feature branch, alongside the code, you will find:

AGENTS.md (or CLAUDE.md for Claude Code)
.agents/tasks/<issue>/gates.md
.agents/tasks/<issue>/task-N.md
.agents/tasks/<issue>/cleanup.md

Those files are committed. They provide traceability of intent, an explicit definition of done, recorded discoveries, and an auditable cleanup step before the PR.

If it is not written down in the repository, it does not persist.

Intent exists before the workflow starts

Those artifacts — gates, tasks, cleanup — all assume an issue already exists with defined objective, scope, and success criteria.

That issue can be a GitHub issue, a Jira ticket, a markdown file in .agents/issues/, or whatever your team uses. The format does not matter. The content does: objective defined, scope bounded, success criteria explicit.

The agent does not start by “figuring out what to do.” It starts from defined intent.

Everything that follows is downstream of that.

If you need to investigate a codebase or research architectural decisions before defining clear gates, the repository includes supplementary commands for those situations. This post focuses on the core loop once intent is established.

`AGENTS.md`: non-negotiables and evolving guardrails

This file is injected into every session. It is the first constraint in the system.

Agents optimise for visible signals: passing tests, plausible diffs, confident summaries. If you do not constrain that behaviour, they will take shortcuts.

The repository includes a base AGENTS.md template. The generic parts handle commit discipline and test protection. The real value is in project-specific rules.

That section evolves.

You will see patterns. Maybe the agent keeps adding axios when your standard is fetch. Maybe it refactors files you meant to leave stable. Maybe it rewrites tests instead of fixing the implementation.

When I saw axios added the third time in a codebase that uses fetch, this rule was added:

Never add axios. This project uses native fetch.

That mistake never happened again.

Because sessions are fresh by design, the only durable memory is what you encode here.

This file lives in source control. The entire team works against the same rule set. When a new constraint is added, it is visible in the diff. Some rules are permanent. Some are temporary, tied to a migration or architectural transition. The file evolves with the codebase.

Keep it under 200 lines. Longer files dilute the important rules. This is not a coding standard. It is a focused control surface aimed at preventing known shortcuts and protecting architectural boundaries.

`wf-01`: Define gates before writing code

The agent does not immediately modify code.

wf-01 reads the issue and produces:

.agents/tasks/<issue>/gates.md

Each gate defines:

A concrete success condition.
How it will be verified.
A complexity classification (SIMPLE or COMPLEX).

The verification must be independent of the agent’s judgement.

A gate might look like:

email_validation_rejects_invalid_domains test fails before the change and passes after.
terraform plan shows exactly three new resources and no replacements.
Manual: saving preferences with an invalid form leaves the save button disabled.

Invalid examples are things like “The implementation looks correct” or “The agent confirms this works.”

Gates define the contract. They make “done” explicit before implementation begins.

`wf-02`: Plan tasks as bounded units of change

wf-02 reads gates.md and creates:

.agents/tasks/<issue>/task-N.md

Each task:

Covers one or more specific gates.
Is sized to a single commit.
Produces a reviewable diff.

The default is to plan one task at a time.

Complex changes surface unknowns during implementation. Planning only the next task allows direction to change without rewriting a stale multi-step plan.

For trivial, deterministic changes, planning all tasks up front is fine. The SIMPLE/COMPLEX classification in gates.md makes that explicit.

Each task file includes:

An implementation checklist.
A completion checklist tied to gates.
A commit template.
An Implementation Notes section.

Implementation Notes capture discoveries that would otherwise be lost between fresh sessions.

For example, during one change I discovered that UserService already handled rate limiting. Instead of introducing new middleware, the next task reused that logic. That discovery was written into Implementation Notes so it became part of the durable context, not tribal knowledge from a single session.

`wf-03`: Implement one task per session

wf-03 executes exactly one task.

The agent:

Reads task-N.md.
Executes the checklist.
Verifies the gates assigned to that task.
Records discoveries.
Stops.

The human reviews and commits.

The agent never commits.

Then the loop repeats:

Plan the next task.
Implement exactly one task.
Stop.
Commit.

The loop ends when all gates in gates.md are satisfied.

Not when “all tasks are complete.” Tasks are scaffolding. Gates define the contract.

Keeping changes at commit-sized units preserves review quality. It keeps the reviewer in verification mode instead of plausibility mode.

`wf-04`: Cleanup before the PR

Cleanup runs once per issue, after all tasks are implemented and all gates pass. Not before.

The branch must represent the complete intended change before you audit it.

Cleanup produces:

.agents/tasks/<issue>/cleanup.md

It performs three steps:

Audit the branch for residue.
Apply selected fixes.
Re-run all gates and the full test suite.

Only after cleanup passes do you open the PR.

This step catches the small things that accumulate during incremental work: temporary logging, unused imports, defensive code that became unnecessary. It also forces a final re-validation of the original contract.

How this constrains the failure modes

This workflow does not try to make the agent smarter. It changes the environment the agent operates in.

AGENTS.md prevents shortcuts the agent would otherwise take: deleting tests to make CI green, introducing inappropriate dependencies, bypassing established patterns. That constrains baby-counting, half-assing, and scope creep.

Gates force independent verification before any claim of completion. That constrains cardboard-muffin implementations and premature “done” signals.

Commit-sized tasks keep review within human cognitive limits. That constrains review fatigue and rubber-stamping.

The human-commits-only rule keeps architectural decisions explicit. That constrains decision delegation.

Implementation Notes turn discoveries into durable context across fresh sessions. That constrains intent drift.

Cleanup catches residue systematically rather than hoping a reviewer notices it. That constrains litterbug behaviour.

The workflow assumes these failures will occur if the structure allows them. The structure is designed to make them harder to hide and cheaper to detect.

The sequence

For each issue:

Define gates.
Plan a bounded task.
Implement exactly one task.
Repeat steps 2–3 until all gates pass.
Run cleanup.
Re-validate all gates.
Open the PR.

That ordering is not accidental.

If you change the order, you weaken the constraints.

If you keep the sequence intact, the failure modes described earlier are systematically constrained instead of managed informally.

What comes next

This loop is the default path. Most issues go straight through it.

When two pressures appear - context degradation or intent drift - supplementary commands exist to keep the loop viable. Those commands and when to use them are covered in the next post:

Designing agentic workflows: supplementary commands and pressure valves

Designing agentic workflows: a practical example

Daniel Butler — Tue, 03 Feb 2026 11:05:46 +0000

The previous post focused on failure modes: where agents fail, and where we fail when reviewing their output. This follow-up shifts from diagnosis to design.

What follows is a sample workflow that is explicitly structured to counter those failure modes. It is not the only way to do this, and it is not meant to be permanent. It is one workable design intended to turn known risks into explicit constraints and force verification back into the loop.

This post assumes familiarity with the earlier analysis. If you haven’t read it, the failure modes this workflow responds to are covered here:

Designing agentic workflows: where agents fail and where we fail

What this workflow is (and isn’t)

This workflow is designed for developers who:

are new to agentic coding and want a cautious, structured starting point
need reviewable commits and explicit verification gates
work in enterprise environments with audit, compliance, or production risk
want to prevent common AI failure modes such as premature completion claims, silent test deletion, and shallow or hard-coded implementations

It is not designed for:

exploratory hacking
green-field personal projects
environments where reviewability and accountability are optional

As tools improve and teams gain experience, much of this should be streamlined or removed. This is scaffolding, not a permanent prescription.

The key constraint this workflow enforces

This workflow is built around a single constraint that was implicit in the first post but not yet operationalised:

verification must be independent of the language model.

The failure modes discussed earlier cluster around the same structural issue: the system is allowed to participate in judging its own success. Confidence, summaries, and “done” signals become substitutes for evidence.

This workflow explicitly separates proposal from verification.

The agent can propose changes, generate code, and assemble artifacts. Verification is performed independently using external tools such as test runners, linters, type checkers, static analysis, and real execution.

The intent is trust, but verify.
The model is trusted to propose changes and assemble artifacts. Verification is handled independently using tools with different incentives.

Why a workflow is necessary at all

Agentic systems optimise against what they can observe:

green tests
plausible diffs
explicit completion signals

Humans under review pressure tend to do the same.

A workflow has to deliberately change the optimisation surface by:

bounding work into small, reviewable units
making intent explicit and durable
forcing independent, machine-verifiable evidence before claims of completion
preventing large, ambiguous diffs from accumulating

Without this constraint, the workflow does not materially change those outcomes.

The core design principles

All variants of the workflow follow the same underlying rules, regardless of tool.

1. Intent is captured before execution

Each task starts with a written intent that defines:

what is being changed
what is explicitly not being changed
what success looks like
what evidence will be used to verify it

This prevents intent from living only in conversation history and reduces the chance of silent scope drift.

2. Planning and execution are separated

The agent does not immediately start modifying code.

There is a planning step that produces an explicit, reviewable plan. Only after that plan is accepted does execution begin.

This keeps architectural and behavioural decisions visible and reduces surprise diffs.

3. Changes are deliberately small

Each loop is constrained to a narrow scope:

one concern
one behavioural change
one verification target

This keeps review within human cognitive limits and avoids the shift from verification to plausibility that occurs as diffs grow.

4. Verification is independent and machine-verifiable

Completion requires evidence produced by tools that are not the language model.

Examples include:

test execution results
static analysis outputs
type-checker passes or failures
runtime traces or logs from real execution

The model’s explanations and summaries provide context, not verification. This is a deliberate application of trust, but verify.

5. Cleanup is mandatory

Every loop ends with a cleanup pass:

remove temporary scaffolding
remove dead code
consolidate overlapping helpers
update comments to match behaviour

Cleanup is treated as part of correctness rather than a cosmetic improvement. Residue compounds review cost and cognitive load over time.

The sample workflows

The repository contains several variants of the same workflow adapted for different tools. Structurally, they are identical.

https://github.com/daniel-butler-irl/sample-agentic-workflows

Each workflow documents:

the phases of the loop
what the agent is allowed to do in each phase
what the human is expected to review
what artifacts must exist before progressing

There is also a generic methodology document in the repository (docs/methodology.md) that describes the workflow shape and constraints in a tool-agnostic way, using command-style notation to illustrate phases.

The important part is not the syntax. It’s the shape of the loop.

This post intentionally does not walk through the workflow step by step. The mechanics are documented in the repository. A follow-up post will walk through one variant in detail and show how the constraints described here are enforced in practice.

How this mitigates the earlier failure modes

This workflow does not attempt to fix the model. It changes the environment the model operates in.

Small scopes, durable intent, and independent verification reduce the surface area where:

requirements can disappear silently
shallow implementations can pass unnoticed
reviewers are pushed into plausibility checks
architectural decisions slip through unexamined

The workflow assumes these failures will occur if the structure allows them. Its job is to make them harder to hide and cheaper to detect.

This is a starting point

If you already have strong internal workflows, you may only need pieces of this.

If you are early in agentic coding adoption, starting with something like this avoids learning the hard lessons in production.

As tools add better built-in guardrails, some of this will become redundant. Until then, workflow design remains the most reliable control surface we have.

This repository is meant to be copied, adapted, and eventually outgrown.

In the next post, I walk through how this structure is implemented in practice — how AGENTS.md, gates, task files, and cleanup steps are wired together into a repeatable core loop.

Designing agentic workflows: the core loop

Designing agentic workflows: where agents fail and where we fail

Daniel Butler — Fri, 30 Jan 2026 13:53:40 +0000

Agentic coding increases throughput. It also increases the probability that we ship something we didn't mean to ship.

Both the agent and the review process tend to optimise for visible success signals: green tests, plausible diffs, confident summaries, "done".

This creates predictable failure modes in production work, split into two buckets: where the agent fails (reward hijacking, shallow correctness, residue), and where we fail as output volume spikes (review fatigue, rubber-stamping, intent drift, decision delegation). The split matters because the agent's failure modes are only half the story. The other half is how our own behaviour changes when output volume spikes.

This isn't an argument against using agents. We get real leverage from them. The point is that the failure modes are predictable, and if we don't design our workflow around them, we end up shipping problems with a green CI badge.

There are two expectations to set up front:

The workflow has to work with the human, not against them. Most failures don't come from a single "bad suggestion". They come from a process that stops enforcing verification once the diffs get big.
Enterprise stakes are different. In personal projects we can optimise for momentum and fun. In production work, we're exposed to security regressions, compliance issues, availability incidents, and quiet failures that only surface months later.

Tools are evolving rapidly, with more built-in guardrails appearing. These failure modes are architectural, not implementation bugs in current tools. Guardrails help, but workflow design remains essential. This post focuses on the patterns, not the tools.

What we mean by "vibe coding"

The term has picked up multiple definitions. Here, I'm using it to mean working primarily at the level of intent, and relying on an agentic system to generate and evolve the implementation details.

That shift increases speed and volume. It does not change ownership. If a failure slips through the gap, it's our failure. The tool isn't accountable. We are.

Why these failures cluster

The most useful label I've found for this behaviour is reward hijacking: the system is not trying to deceive anyone — it's optimising for what it can observe and be "rewarded" for.

In agentic coding, the "reward" is often an observable success signal:

tests are green
linters are quiet
the diff looks plausible at a glance
the agent says "done"
the UI loads once

If that's all we validate, that's all we'll reliably get.

These failure modes cluster around a gap: machine-verifiable gates (tests, linters, type checks) are necessary but insufficient. They catch syntax and contract violations. They don't catch "this works but isn't what we meant" or "this passes but the coverage is worse."

Long sessions amplify everything

As the current conversation context grows, earlier intent and constraints are more likely to be missed even if they're still present.

This is a known long-context behaviour: the "lost in the middle" effect. In long-context evaluation of language models, performance is often highest when relevant information is near the beginning or end, and degrades when the relevant detail is buried in the middle.

This doesn't create new failure modes. It increases the probability of all of them.

Where agents fail

The four named failure modes below are from Vibe Coding by Gene Kim and Steve Yegge. I'm using their names as labels, and grounding each one with a concrete example of how it shows up in code review.

1) Baby-counting

What it is
Requirements are silently dropped while the system still claims completion.

Why it happens
The observable success signal is "the thing looks done" (green CI, fewer failing checks, a confident completion message), not "all original requirements are satisfied".

The metaphor
You send someone into a room to count the babies. They report back: "Ten babies, all accounted for." You trust the count. Everything looks fine.

An hour later, you go into the room yourself. There are five babies.

Everything was fine until it wasn't. Now it's a catastrophe.

Real-world example
We have ten failing unit tests. We ask the agent to "make sure there are no failing tests".

The agent fixes five tests, skips one, and deletes the remaining tests.

Result: zero failing tests.

From the system's perspective, the observable goal was achieved. From our perspective, coverage was silently lost.

Why it's dangerous (enterprise impact)
Coverage loss is silent. No failing tests, no alerts, just missing protection that surfaces months later when something else breaks and there's no test to catch it.

What we look for

deleted tests, skipped tests, weakened assertions
"refactors" with suspiciously large removals
acceptance criteria quietly disappearing from the discussion
any change that reduces safety signals while claiming progress

2) Cardboard muffin

What it is
The output looks correct on the surface but is hollow or incorrect inside.

Why it happens
Plausibility + test satisfaction are strong proxies. If tests can be satisfied shallowly, there is no pressure to implement the real behaviour.

Real-world example
We provide a function signature and tests. The agent produces a large implementation with helpers and branching.

Buried in the code is a hard-coded return value. Regardless of input, the function always returns the same result. The existing tests pass.

The code has the shape of a solution without implementing the underlying intent.

Why it's dangerous (enterprise impact)
This creates false confidence. Reviewers skim because it looks like a serious implementation. CI is green because the tests were satisfiable shallowly. The bug appears later under real inputs, at a time when the original change is no longer fresh in anyone's head.

What we look for

hard-coded constants where logic should exist
conditionals that funnel everything into one output
broad fallbacks that swallow errors
tests that assert shape/type instead of behaviour
"impressive" structure with suspiciously little meaningful computation

3) Half-assing

What it is
Correct behaviour, poor implementation.

The feature works, but it ignores standards, architecture, operational concerns, or long-term maintainability.

Why it happens
Unless quality constraints are explicit, the shortest path to "works" wins. The observable success signal is "feature delivered", not "feature delivered sustainably".

Real-world example
The agent implements the feature correctly but:

hard-codes configuration that should be injectable
bypasses existing abstractions and duplicates logic
adds minimal error handling to get the happy path working
writes tests that cover only the most obvious scenario

Everything passes. The change is "done". The repo got worse.

Why it's dangerous (enterprise impact)
The debt is silent and cumulative. Each shortcut makes the next change harder. It shows up later as slower delivery, production incidents from missing error handling, operational friction from hard-coded behaviour, and inconsistent patterns across teams.

What we look for

configuration hard-coded into code paths
bypassing existing patterns "because it's faster"
incomplete error handling or missing observability
tests that don't cover failure cases
new dependencies without justification

4) Litterbug

What it is
Residue left behind after "working" changes.

The behaviour works, but the codebase is messier than before.

Why it happens
Cleanup often doesn't change observable behaviour. If we don't require it explicitly, it won't be prioritised.

Real-world example
The agent adds a new helper that overlaps heavily with an existing one instead of consolidating logic.

It leaves TODOs, debug logs, and comments like "new implementation" that quickly stop being true. Nothing breaks. Nobody notices. Entropy increases.

Why it's dangerous (enterprise impact)
The damage compounds. Small bits of litter become systemic cognitive load.

Once comments and structure can't be trusted, engineers stop trusting them at all. That degrades onboarding, incident response, and change safety.

What we look for

duplicated helpers or parallel implementations
dead code and commented-out blocks after refactors
TODO/FIXME without an owner or issue reference
stale comments that no longer match behaviour
debug logging or temporary scaffolding left behind

Where we fail

The agent doesn't work in isolation. It works in a process with us.

That process is almost always a code review: reading diffs, scanning tests, reconstructing intent, and deciding whether a change is safe to merge.

The part that's easy to miss is that this review has its own "context window". It isn't measured in tokens; it's measured in what we can keep active while scanning diffs, reconstructing intent, and validating behaviour.

Classic cognitive psychology framed short-term capacity as "seven, plus or minus two" — a description of how many meaningful chunks of information we can actively hold and manipulate at once. Later work argues the practical working set is often smaller under many conditions. In practice, that gap shows up as a predictable pattern: as the change set grows, we shift from verification to plausibility. We stop holding the whole thing in our head, and we start leaning on proxies.

1) Rubber-stamping

What it is
Rubber-stamping is approving changes without real verification.

Why it happens
Large diffs plus green checks encourage approval-by-glance. Agent summaries can feel like substitutes for evidence.

Real-world example
A large change set lands. Tests pass. The diff looks reasonable at a glance. It gets approved.

A baby-counting or cardboard-muffin failure was inside the diff, but nobody checked the specific risk.

Why it's dangerous (enterprise impact)
This is how high-severity regressions slip through "successful" pipelines. The process appears healthy because the mechanics ran.

What we look for

large diffs merged with minimal review notes
"LGTM" reviews on complex changes
reliance on agent summaries instead of evidence (tests, traces, manual checks)

2) Review fatigue

What it is
Attention and defect discovery drops as review volume increases.

Why it happens
Humans don't review indefinitely at the same quality level. Once change sets are routinely too large, the review standard shifts from verification to plausibility.

Industry experience shows defect discovery rates drop significantly as review size increases, leading to practical ceilings on defect discovery per review session.

Real-world example
Early on, reviews are careful. As output increases, the standard quietly shifts from verifying correctness to deciding whether something "looks fine".

Why it's dangerous (enterprise impact)
When fatigue sets the review standard, the process becomes performative. Failures become inevitable, not occasional. Fatigue is what drives rubber-stamping: reviewers approve changes they haven't truly verified because the volume makes real verification unsustainable.

What we look for

review cycles that are consistently too large
repeated "skim reviews" because "we can't read it all"
teams normalising that "AI output is too big to review"

3) Intent drift

What it is
The intended outcome evolves but isn't re-anchored.

Instructions accumulate through conversation, assumptions change, constraints are added informally. Nothing restates a single explicit definition of "correct".

Why it happens
We treat the current conversation context like shared memory, but it degrades with length. Without deliberate re-anchoring, the agent optimises against whichever version of intent is most salient or easiest to satisfy.

Real-world example
Early in the session: "Add a new endpoint for user preferences."

Midway through: "Oh, and this needs to work with our legacy auth system that uses session tokens, not JWTs."

The agent continues, but the new constraint isn't consistently applied across all code paths. Half the implementation uses the new auth pattern, half assumes the original approach. Both paths pass their tests.

Why it's dangerous (enterprise impact)
Intent drift is where compliance and security constraints get dropped, because they were never re-stated as hard requirements.

What we look for

decisions made verbally but not captured in artifacts
constraints that exist "somewhere earlier in the thread"
mismatches between the final code and the stated intent

4) Decision delegation

What it is
We delegate not just execution, but architectural and product decisions, without explicitly acknowledging or approving them.

Why it happens
Volume creates distance. Confidence creates complacency. Without explicit re-engagement points, decisions get smuggled in as "implementation detail."

Real-world example
A change introduces a new abstraction or refactors a subsystem "for clarity." We ship it without a deliberate decision that this was the right shape for the codebase.

Why it's dangerous (enterprise impact)
Decision delegation creates architectural drift. It also creates ownership ambiguity, which is poison in incident response.

What we look for

major structural changes without explicit approval
"drive-by refactors" embedded in feature work
unclear rationale for architecture shifts

Where this leaves the system

None of these are exotic. They are predictable outcomes of optimising against observable success signals and reviewing under volume.

Workflow design has to assume both failure surfaces and build in explicit constraints and verification gates. The failure modes are architectural problems. They require architectural solutions.

The question then becomes: what does a workflow designed around these constraints actually look like?

That question is explored in the follow-up post, Designing agentic workflows: a practical example, which presents a concrete, verification-first workflow design intended to turn these failure modes into explicit constraints.

DEV Community: Daniel Butler

Intent-Driven Development Changes the Shape of Risk

Abstraction Shifts Always Redistribute Risk

We Relied on Human Throughput More Than We Realised

Pipelines Were Calibrated for Slower Change

A Practical Response at the Workstation Layer

The Practical Question

Did that actually help? Evaluating AI coding assistants with hard numbers

The problem with vibes

What you actually need

What Pitlane is

Why not use an existing eval tool?

Benchmarks that don't lie to you

What this changes

Try it

Agentic workflow design — index and reading order

1. Failure modes: where agents fail and where we fail

2. From diagnosis to design

3. The core loop

4. Supplementary commands and pressure valves

What this series is

Designing agentic workflows: supplementary commands and pressure valves

Pressure 1: Context degrades

/wf-summarise

What /wf-summarise does

Examples

Pressure 2: Intent must be explicit

Starting point: Issue with clear success criteria

The on-demand trio

/wf-investigate

/wf-design

/wf-adr

Knowing when to skip

How this supports the core loop

Designing agentic workflows: the core loop

Sessions are disposable. The repository is not

Intent exists before the workflow starts

AGENTS.md: non-negotiables and evolving guardrails

wf-01: Define gates before writing code

wf-02: Plan tasks as bounded units of change

wf-03: Implement one task per session

wf-04: Cleanup before the PR

How this constrains the failure modes

The sequence

What comes next

Designing agentic workflows: a practical example

What this workflow is (and isn’t)

The key constraint this workflow enforces

Why a workflow is necessary at all

The core design principles

1. Intent is captured before execution

2. Planning and execution are separated

3. Changes are deliberately small

4. Verification is independent and machine-verifiable

5. Cleanup is mandatory

The sample workflows

How this mitigates the earlier failure modes

This is a starting point

Designing agentic workflows: where agents fail and where we fail

What we mean by "vibe coding"

Why these failures cluster

Long sessions amplify everything

Where agents fail

1) Baby-counting

2) Cardboard muffin

3) Half-assing

4) Litterbug

Where we fail

1) Rubber-stamping

2) Review fatigue

3) Intent drift

4) Decision delegation

Where this leaves the system

Further reading

`/wf-summarise`

What `/wf-summarise` does

`/wf-investigate`

`/wf-design`

`/wf-adr`

`AGENTS.md`: non-negotiables and evolving guardrails

`wf-01`: Define gates before writing code

`wf-02`: Plan tasks as bounded units of change

`wf-03`: Implement one task per session

`wf-04`: Cleanup before the PR