Ian Johnson

Posted on Jun 18 • Originally published at tacoda.Medium on Jun 15

Intent-Driven Delivery

#agenticworkflow #agenticai #harnessengineering #softwareengineering

Two poles dominate how teams talk about AI in their workflow right now. On one end, vibe coding: open a chat, describe the change, accept what comes back, move on. On the other end, spec-driven development: write the formal spec, generate the code from it, treat the spec as the source of truth. Both have a real argument. Both miss the same thing.

Vibe coding is fast and unreliable. The work that comes back looks fine and is wrong in ways nobody can name, because nobody wrote down what right was. Spec-driven development is reliable and slow. The spec gets so heavy that writing it becomes the project, and the code ends up locked to the spec rather than to the outcome.

Waterfall has the same problems whether the actor is a human or an agent.

Intent-driven delivery is the middle path. It is not a coding style. It is a delivery model: how a change moves through the whole system, from idea to production, in a way an agent can run and a human can trust. The unit of work is an intent contract: a small, falsifiable spec for one outcome, with named scope, acceptance, and a feature flag. Humans approve at three gates inside the cycle and one more at production. Agents do the middle. The flow is defined by constraints, not vibes and not waterfalls.

It answers four questions at once. What do agents do, if not vibe code or generate from a heavy spec? How does work get tracked when the ticket isn’t the unit of work anymore? Who decides a change is safe for users to see? And what’s the shape of a delivery pipeline when both the org and the project have to constrain it? One model covers all four because they’re the same problem in four costumes — the gap between human intent and shipped behavior, with an agent in between.

I’ve been running this against a real project for a few months. The skeleton holds. One part is still soft and I’ll say where.

Not vibe coding, not spec-driven development

The point isn’t which pole is right. They’re answering different questions. Vibe coding asks how fast a change can ship if you trust the agent. Spec-driven development asks how reliable a change can be if you write everything down up front.

Neither asks how a customer signal becomes a green deploy without burning out the team.

Intent-driven delivery picks a small contract per outcome — much smaller than a full spec — and then trusts the constraints around the contract to do the work the spec would have done in a heavier model. The harness catches what the contract didn’t say. The feature flag absorbs the rollout risk. The three touchpoints place humans where humans add value, and nowhere else.

This is closer to agile than to either pole. Small batches, working software, falsifiable acceptance, continuous delivery, retros on what’s slowing the team down. None of that is new. What’s new is that the unit of work is structured enough for an agent to read it, and the flow is constrained enough for the agent to run the middle without a human babysitting. Agile assumed every step had a human in it.

Intent-driven delivery keeps the cadence and the values, and lifts the human out of the steps where the human was a bottleneck instead of a contributor.

That is also where it splits from spec-driven development. Spec-driven development wants the spec to drive the code generation end-to-end. Intent-driven delivery wants the contract to drive one outcome, then get out of the way. The next contract gets its own spec. The system never accumulates a giant document that has to stay in sync with the code.

A delivery model, not a coding style

A coding style stops at the editor. A delivery model covers the whole pipe.

In this one, the pipe is:

1. A signal lands: support ticket, metric, incident, customer request.
2. A human turns the signal into a contract.
3. An agent reads the contract and does the work.
4. The agent posts evidence against the contract’s acceptance.
5. A human checks the evidence against the contract.
6. The change merges, CI runs on main, the flag stays off until the rollout is ready.

Six steps, three human touchpoints: seed review between steps 1 and 2, contract approval at step 2, and completion review at step 5. Everything else is mechanical. The model is the whole pipe, including the bits that are usually invisible: post-merge CI, the flag, the rollback story. If any of those is missing, the model leaks somewhere and the team eats the cost.

The thing to keep in mind: well-defined constraints are what hold the pipe together. The contract constrains what the agent may touch and what done looks like. The harness constrains how the work gets done. The flag constrains when users see it. The CI watch constrains when the contract closes. Pull any one of those out and the middle stops being safe.

What this assumes

This isn’t a stand-alone proposal. Three things have to be in place before the contract model earns its keep.

An agentic flow. One agent or a fleet. Babysitting a single tab or dispatching dozens. The shape doesn’t matter; the existence does. If the team is still writing every line by hand, the contract is overkill. The contract pays for itself when an agent is reading it.

Continuous delivery. Squash-merge, green main, deploy-on-merge. If a contract gets accepted on Tuesday and ships on Friday after a release-train meeting, the feedback loop is too long for the model to work. The contract presumes the merge is the deploy; the feature flag, not a release train, decides when users see the change.

A project harness with guides and sensors. Guides are the rules the agent reads — coding standards, patterns to prefer, files to leave alone. Sensors are the checks that fire when a rule is broken — linters, tests, type checks, custom inspections. Without guides, the agent invents conventions. Without sensors, nobody catches it. The harness is what makes the agent’s freedom inside a contract safe.

Miss any of the three and the contract becomes paperwork. Have all three and the contract becomes the steering wheel.

The contract

The replacement is a contract with named sections. Not a wishlist. A spec the agent and the human can both check against.

The sections that earn their place:

Outcome. One sentence. The observable end state. “Users on an expired card see a clear renewal message instead of a generic error.” Not “fix the expired card issue.”

Signal. Where this came from. A support ticket, a metric, an incident, an internal request. Keeps the context attached so reviewers can sanity-check the framing.

Scope in. The files, modules, or surfaces the work is allowed to touch. Concrete paths or named components. Not “the billing area.”

Scope out. The places the work must not touch, even if they look related. This one is load-bearing. An empty Scope out is a draft defect — it means the author didn’t think about boundaries.

Acceptance. A numbered list of falsifiable conditions. Each one has to be the kind of thing a test or a command can decide. “Renewal message appears” is acceptable. “Improved error UX” is not. “Improved” is a draft defect. Block the contract until it’s rewritten.

Release gating. Name the feature flag and its default. If the change is user-visible and there’s no flag, the contract can’t be approved. Default is off. Always.

Risk surface. What this could plausibly break. Auth, billing, data integrity, third-party dependencies. The reviewer uses this to decide how hard to look.

Open Questions. Things the author couldn’t resolve before drafting. A contract with non-empty Open Questions cannot be approved. They have to be answered first, even if the answer is “we accept this risk.”

Why each section is there: the contract has to be falsifiable end-to-end. If a section is empty, either the section doesn’t apply (rare) or the author is hiding a decision. The contract works because the gaps are visible.
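The approval rules above are mechanical enough to check in code. What follows is a minimal sketch, not the harness I run: the `IntentContract` class, its field names, the `user_visible` switch, and the vague-word list are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Words that signal a wish rather than a check. Illustrative, not exhaustive.
VAGUE_WORDS = ("improve", "improved", "cleaner", "better")


@dataclass
class IntentContract:
    outcome: str
    signal: str
    scope_in: list[str]
    scope_out: list[str]
    acceptance: list[str]
    flag: Optional[str] = None
    flag_default: str = "off"
    user_visible: bool = True  # hypothetical field, not a named section
    risk_surface: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)


def draft_defects(contract: IntentContract) -> list[str]:
    """Return the reasons this contract cannot be approved yet."""
    defects = []
    if not contract.scope_in:
        defects.append("Scope in is empty")
    if not contract.scope_out:
        defects.append("Scope out is empty")
    if not contract.acceptance:
        defects.append("Acceptance is empty")
    for number, line in enumerate(contract.acceptance, start=1):
        # Crude tokenizer: lowercase, drop commas and periods, split on spaces.
        words = line.lower().replace(",", " ").replace(".", " ").split()
        if any(vague in words for vague in VAGUE_WORDS):
            defects.append(f"Acceptance {number} is not falsifiable: {line!r}")
    if contract.user_visible and not contract.flag:
        defects.append("User-visible change has no feature flag")
    if contract.flag_default != "off":
        defects.append("Flag default must be off")
    if contract.open_questions:
        defects.append("Open Questions must be answered before approval")
    return defects
```

An empty list means the draft can go to contract approval. A word list will not prove a line is falsifiable; it only catches the obvious wishes before a human has to.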

The shift in mindset: the ticket used to be a prompt. Now it’s a test. If the work passes the contract, it’s done. If it doesn’t, it isn’t. The reviewer’s eyeballs are no longer the rubric.

Four phases, three touchpoints

The cycle around the contract has four phases. Humans show up at three named gates. Nowhere else.

Three touchpoints, four phases. Notice the agent stretch in the middle — that’s the part that used to be hallway conversations.

The thing to notice in the diagram: between TP2 and TP3, no human is in the loop. If the agent gets stuck, it doesn’t ask Slack. It files a structured interrupt against the contract, with options and a recommendation, and stops. The human resolves the interrupt in writing, and the agent resumes. Interrupts attach to the contract and become part of the record.

This matters. Most of what makes the old model expensive is the unstructured interruption: “quick question on the dashboard ticket?” pinged twelve times a day, no record kept, no decision attributed. Force the interrupt into a structured artifact and the cost of asking goes up enough that the agent learns to only ask when it matters, and the answer is preserved for the next contract. It can also serve as a source for agent post-mortems to fix failure modes.
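A structured interrupt can be as small as a record with a question, options, and a recommendation. This is a sketch under assumptions: the `Interrupt` class, the `<!-- intent:interrupt -->` marker, and the two-option minimum are hypothetical, modeled on the marker pattern used for completion comments.

```python
from dataclasses import dataclass

# Hypothetical marker, modeled on the completion marker used in this post.
INTERRUPT_MARKER = "<!-- intent:interrupt -->"


@dataclass
class Interrupt:
    contract_id: str
    question: str
    options: list[str]
    recommendation: str

    def __post_init__(self) -> None:
        # Assumed rule: a question with no real choice is not worth an interrupt.
        if len(self.options) < 2:
            raise ValueError("an interrupt needs at least two options")
        if self.recommendation not in self.options:
            raise ValueError("the recommendation must be one of the options")

    def render(self) -> str:
        """Render the interrupt as a comment body for the contract's issue."""
        lines = [
            INTERRUPT_MARKER,
            f"Contract: {self.contract_id}",
            f"Question: {self.question}",
            "Options:",
        ]
        for number, option in enumerate(self.options, start=1):
            lines.append(f"{number}. {option}")
        lines.append(f"Recommendation: {self.recommendation}")
        return "\n".join(lines)
```

Forcing the recommendation to be one of the listed options is the design choice that matters: the agent has to commit to an answer, and the human only has to confirm or override it in writing.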

Three touchpoints, no more. Seed review at TP1 (is this worth doing?), contract approval at TP2 (is this spec good enough?), completion review at TP3 (did we get what we asked for?). Anyone who tries to insert a fourth gate in the middle is reinventing the status field.

A more granular look at the Signal to Completion flow.

Where the contract lives

I made a wrong call here and want to flag it.

First version: contracts lived as markdown files in a separate repo. Clean, version-controlled, diffable, lovely. Nobody read them. The issue tracker still had the old ticket, with a vague title and no acceptance, and the team kept working from the ticket because that’s where the search lives and that’s where the dashboards point. The contracts drifted from the tickets.

The fix is to make the issue description be the contract. The whole contract, in the description field, with a tagged YAML block for the structured parts. The tracker is where everyone already looks. The contract goes there.

The intent block at the top of the issue:

intent:
  id: IC-PROJ-42
  outcome: Users on an expired card see a clear renewal message instead of a generic error.
  flag: FEATURE_EXPIRED_CARD_MSG
  flag_default: off
  scope_in:
    - app/Billing/ExpiredCardNotice.php
    - resources/views/billing/notice.blade.php
  scope_out:
    - app/Billing/PaymentGateway.php
  acceptance:
    - When the card is expired, the renewal message renders with the flag on.
    - When the flag is off, behavior is identical to current main.
    - No change to the payment gateway code path.

Completion and interrupts go in as comments on the same issue, with a leading marker so the harness can find them programmatically:

<!-- intent:completion -->
Acceptance:
1. Pass - renewal message renders under flag on. See test BillingNoticeTest::testExpiredCardWithFlag.
2. Pass - flag off path unchanged. See test ExpiredCardLegacyPathTest.
3. Pass - payment gateway file untouched in diff.
Interrupts: none.
Change summary: added ExpiredCardNotice service and a Blade partial behind FEATURE_EXPIRED_CARD_MSG.

The trick is the marker. HTML comments such as <!-- intent:completion --> are invisible in the tracker UI but addressable by the harness. The humans see a normal comment. The agent sees a structured artifact.
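Finding those comments programmatically takes one regular expression. A minimal sketch, assuming comment bodies arrive as plain strings from whatever tracker adapter is in use; `parse_artifact` and `artifacts_of_kind` are hypothetical names.

```python
import re
from typing import Optional

# Matches a leading marker such as <!-- intent:completion --> and captures the kind.
MARKER = re.compile(r"^\s*<!--\s*intent:([a-z_-]+)\s*-->\s*")


def parse_artifact(comment_body: str) -> Optional[tuple[str, str]]:
    """Return (kind, body) for a marked comment, or None for an ordinary one."""
    match = MARKER.match(comment_body)
    if match is None:
        return None
    return match.group(1), comment_body[match.end():].strip()


def artifacts_of_kind(comments: list[str], kind: str) -> list[str]:
    """Return the bodies of every comment carrying the given marker kind."""
    bodies = []
    for comment in comments:
        parsed = parse_artifact(comment)
        if parsed is not None and parsed[0] == kind:
            bodies.append(parsed[1])
    return bodies
```

Only a leading marker counts. A comment that quotes a marker halfway down is treated as human conversation and ignored.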

This is the part that answers “what comes after ticket trackers.” The tracker doesn’t go away — that’s where the org’s gravity is, where search lives, where the internal stakeholders look. What changes is what lives inside it. The vague ticket is gone; the structured contract takes its place. Status isn’t typed by humans dragging cards — it’s derived from artifacts on the contract. A contract is “in progress” because a branch exists and no completion comment has landed. It’s “verified” because the completion comment is there and post-merge CI went green. It’s “rolled out” because the rollout marker is in the comments. The kanban board still renders. It just reads truth from the contract instead of asking a human what they did yesterday.
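Derived status is a pure function of the artifacts. Here is a sketch of that idea; the `ContractArtifacts` fields are assumptions, and only "in progress", "verified", and "rolled out" come from the description above. The other state names are hypothetical fillers for the gaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractArtifacts:
    branch_exists: bool = False
    completion_posted: bool = False
    post_merge_ci_green: bool = False
    rollout_recorded: bool = False
    open_interrupts: int = 0


def derive_status(artifacts: ContractArtifacts) -> str:
    """Compute a contract's status from what exists, not from what anyone typed."""
    if artifacts.rollout_recorded:
        return "rolled out"
    if artifacts.completion_posted and artifacts.post_merge_ci_green:
        return "verified"
    if artifacts.open_interrupts > 0:
        return "blocked"  # hypothetical state: waiting on a human answer
    if artifacts.completion_posted:
        return "in review"  # hypothetical state: evidence posted, CI not green yet
    if artifacts.branch_exists:
        return "in progress"
    return "approved"  # hypothetical state: contract exists, no branch yet
```

The checks run from the latest artifact backwards, so a contract can never show an earlier state than its evidence supports.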

Ports, adapters, and not getting locked in

The contract is a data shape. The phases are a workflow. Neither cares which vendor you’re using.

Four ports do all the I/O:

signal-source: where signals come from and where contracts live. Issue tracker.
contract-store: where the markdown body of the contract is rendered and parsed. Often the same as signal-source.
code-host: where the branches and PRs live.
notification: where the agent posts when a touchpoint is ready for a human.

The install I run uses one major tracker and one major code host. The adapters are about 200 lines each. If I had to move to a different tracker or a different code host tomorrow, I’d write new adapters and the commands would not change. The contract format wouldn’t change. The phases wouldn’t change.

This is the part that should make people relax about the proposal. It’s not “switch to this product.” It’s a shape and a set of touchpoints. You pick the providers. The shape stays.

You can overlay onto the services and tools that you already use.

If you decide to use a completely different way to manage issue information, you can plug that in. Instead of backing that with Jira or Linear, you could do it with Notion or Markdown files or an SQLite database. If you want it, just write an adapter for it.
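In code, a port is just an interface and an adapter is anything that satisfies it. The sketch below is one possible shape, not the adapters I run: every method name is an assumption, and the in-memory classes stand in for a real tracker and a real chat tool.

```python
from typing import Protocol


class SignalSource(Protocol):
    def fetch_signal(self, signal_id: str) -> str: ...


class ContractStore(Protocol):
    def load(self, contract_id: str) -> str: ...
    def save(self, contract_id: str, body: str) -> None: ...


class CodeHost(Protocol):
    def open_pr(self, branch: str, title: str) -> str: ...


class Notification(Protocol):
    def notify(self, touchpoint: str, contract_id: str) -> None: ...


class InMemoryStore:
    """Stand-in adapter; a real one would call the tracker's API."""

    def __init__(self) -> None:
        self._bodies: dict[str, str] = {}

    def load(self, contract_id: str) -> str:
        return self._bodies[contract_id]

    def save(self, contract_id: str, body: str) -> None:
        self._bodies[contract_id] = body


class RecordingNotifier:
    """Stand-in adapter that records notifications instead of sending them."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def notify(self, touchpoint: str, contract_id: str) -> None:
        self.sent.append((touchpoint, contract_id))


def submit_for_approval(
    store: ContractStore, notifier: Notification, contract_id: str, body: str
) -> None:
    """A command that depends only on ports, so swapping vendors leaves it alone."""
    store.save(contract_id, body)
    notifier.notify("TP2", contract_id)
```

`submit_for_approval` never imports a vendor SDK. Moving from one tracker to another means writing a new class with `load` and `save`, and nothing else changes.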

What runs through the code host

The code-host adapter has more moving parts than the others, because the agent does most of its work there. The conventions:

Branch name intent/IC-PROJ-42. The branch is the contract.
One PR per contract. No stacked PRs across contracts. Every PR branches from main.
PR label intent-driven. So the team can filter agent work from human work.
No GitHub assignee. The contract is the owner; the assignee field invites the old model back in. Note: this is a temporary constraint until assignee and approval flows have been fully explored.

The verify step has a sharp edge. Before doing anything else, it asks the code host for the PR’s mergeable state. If it’s CONFLICTING, the agent attempts an auto-rebase onto origin/main and pushes the rebased branch with --force-with-lease. If the rebase is clean, work continues. If there are file-level conflicts, the agent stops and files an interrupt. Humans decide how to resolve, not the agent.
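The decision logic of the verify step fits in a few lines. This is a sketch under assumptions: CONFLICTING comes from the description above, MERGEABLE and UNKNOWN are assumed to be the other states the code host reports, and the function names and action strings are hypothetical. Note that --force-with-lease is a flag of git push, so it applies to the push that follows the rebase.

```python
from typing import Optional


def rebase_commands(branch: str) -> list[list[str]]:
    """Commands for the auto-rebase attempt, in the order they would run."""
    return [
        ["git", "fetch", "origin", "main"],
        ["git", "checkout", branch],
        ["git", "rebase", "origin/main"],
        # Only reached when the rebase is clean. The lease refuses the push
        # if someone else has updated the remote branch in the meantime.
        ["git", "push", "--force-with-lease", "origin", branch],
    ]


def next_action(mergeable: str, rebase_clean: Optional[bool] = None) -> str:
    """Decide what the agent does next, given the PR's mergeable state.

    rebase_clean is None before a rebase has been attempted, then True or False.
    """
    if mergeable == "MERGEABLE":
        return "continue"
    if mergeable == "CONFLICTING":
        if rebase_clean is None:
            return "attempt_rebase"
        return "continue" if rebase_clean else "file_interrupt"
    # Assumed handling: the host has not computed mergeability yet.
    return "retry_later"
```

The agent never resolves a file-level conflict itself. The only outcomes are continue, try the rebase, wait, or hand the decision to a human through an interrupt.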

When acceptance is verified, the merge is one command:

gh pr merge https://github.com/org/some-app/pull/123 --squash --delete-branch

Squash-merge keeps main linear and one-commit-per-contract. The branch is gone the moment it lands.

Then comes the part that I’ve seen skipped everywhere else and which matters more than the merge itself. Post-merge CI is watched:

gh run watch <run-id> --exit-status

The issue tracker does not advance the contract to “verified” until CI returns green on main. If main goes red after the merge, the contract is still open and the agent owns the cleanup. This closes the loophole where the PR is “merged” and the team finds out three hours later that main is broken.
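Wiring the CI watch to the contract state can look like the sketch below. It assumes the gh CLI is installed and authenticated; the function name and the "verified" and "open" strings are hypothetical. The `run` parameter exists so the logic can be exercised without touching a real repository.

```python
import subprocess


def contract_state_after_merge(run_id: str, run=subprocess.run) -> str:
    """Block until the post-merge run on main finishes, then map it to a state.

    `gh run watch --exit-status` exits non-zero when the run fails, so the
    return code is the only signal needed.
    """
    result = run(["gh", "run", "watch", run_id, "--exit-status"])
    if result.returncode == 0:
        return "verified"
    # Main is red: the contract stays open and the agent owns the cleanup.
    return "open"
```

There is no third outcome. A merged PR with a failing main is an open contract, not a closed one with a follow-up ticket.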

The production gate

The merge isn’t the release. That’s the whole point of the flag, and it’s where the fourth human gate lives. The merge ships dark — code on main, flag off, no user impact. The rollout is a separate decision on a separate cadence.

The three in-cycle touchpoints decide whether the change is right. The production gate decides whether the change is ready for users. Different question, different rubric, often a different reviewer. Engineering can ship a contract that’s technically correct on a Tuesday and still wait until Thursday to flip the flag, because support is staffed differently or a marketing email goes out that morning. That decision doesn’t belong in the contract review.

In practice: the contract closes when post-merge CI on main is green and the flag is still off. A named human — product owner, on-call, whoever the team has agreed on — flips the flag when deployment health, support readiness, and customer comms line up. If something goes sideways at rollout, the fix is to flip the flag back, not to revert the merge. The code stays on main. The decision being rolled back is the rollout, not the change.
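Rolling back by flipping the flag only works if unknown flags read as off and both code paths stay on main. A minimal sketch of that behavior; the `FlagStore` class and the two message strings are hypothetical, and a real setup would back the store with whatever flag service the team already uses.

```python
class FlagStore:
    """In-memory flag store where every flag is off until a human flips it."""

    def __init__(self) -> None:
        self._on: set[str] = set()

    def is_on(self, name: str) -> bool:
        return name in self._on  # unknown flags read as off

    def flip_on(self, name: str) -> None:
        self._on.add(name)

    def flip_off(self, name: str) -> None:
        self._on.discard(name)


def expired_card_message(flags: FlagStore) -> str:
    """Both paths ship in the same merge; the flag picks one at request time."""
    if flags.is_on("FEATURE_EXPIRED_CARD_MSG"):
        return "Your card has expired. Update it to renew."  # hypothetical copy
    return "Something went wrong."  # hypothetical legacy message
```

Flipping the flag off restores the legacy path without a revert, a redeploy, or a new contract.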

The flip is recorded on the contract with the same marker pattern as completion:

<!-- intent:rollout -->
Flag FEATURE_EXPIRED_CARD_MSG flipped on at 2026-06-12 14:02 UTC by @sara.
Cohort: 100%. No rollback as of 24h post-flip.

That marker is what closes the contract for good. Until it lands, the change exists but doesn’t ship. This is the gate that prevents agent-driven middles from quietly putting things in front of customers.

I chose an explicit gate at production because of how many horror stories involve an agent misusing production access. Our solution? Never give the agent production access. Production releases are human-triggered business decisions, just as in the Continuous Delivery model.

The honest part: bot identity

Here’s where the design is still soft.

GitHub blocks the PR author from approving their own PR, and from being requested as a reviewer on it. If you try the latter, the API returns:

Review cannot be requested from pull request author.

For a single-identity agent setup — where the agent opens the PR under the same user account as the human reviewer — this is a wall. You can’t have a separate human approve, because there’s no separate identity.

The clean fix is to give the agent its own bot identity. The bot opens the PR. The human reviews and approves it. Two GitHub users, no API conflict, normal review flow. The harness I run has the bot config commented out as a follow-up. It works in single-identity mode for solo projects and breaks the moment another reviewer is involved.

I’m calling this out because the rest of the design is tight and this part isn’t. The fix is operational, not architectural — register a bot, give it write access to the repo, point the code-host adapter at its token. A day of work. I haven’t done it yet. If you build this for a team, do it on day one.

This part also does not need to be automated. The completion review could, in fact, be an ordinary human PR review, and some organizations may want exactly that. We chose not to because (a) our harness is mature, and (b) we practice continuous delivery (not deployment), so merging into main triggers the next human actions: user acceptance testing in staging environments.

What changes for the team

The rituals shift in ways that are mostly good.

Stand-ups become read-throughs. Instead of “what did you do yesterday,” you walk the contracts in flight. Each one has a state, a current touchpoint, and an open interrupt or not. The status comes from the contract, not the human’s memory. Meetings shorten.

Estimation sharpens. A contract has falsifiable acceptance. Estimation against falsifiable acceptance is meaningfully easier than estimation against “improve performance.” The team gets better at sizing because the targets are sharper.

TP3 review becomes real. When the work comes back, the reviewer reads the acceptance, reads the completion comment, and checks each numbered item. The review takes ten or fifteen minutes for most contracts because the rubric is already written. The reviewer isn’t deciding what done means; they’re checking whether done was reached. This also opens an opportunity for exploratory testing, which has a high product value and customer impact.

Status becomes derived. No more “drag the card.” The contract’s state is computed from artifacts — does a completion comment exist, did CI go green, are there open interrupts. The Kanban board still renders, but nobody curates it. It reads the truth from the contracts.

The thing that doesn’t change: the human is still the source of intent. The contract is what the human wants. The phases protect the human from being interrupted constantly. They don’t replace the human.

Org norms and project specifics

A delivery pipeline has two halves that usually get conflated. The “what” — organizational norms, product strategy, the standards the company holds itself to. The “how” — project conventions, code patterns, the sensors that catch the agent doing the wrong thing in a particular codebase. Most teams encode the “how” in their repos (linters, tests, docs) and the “what” in human heads (roadmaps, all-hands slides, Slack threads). Agents can read the “how.” They can’t read the heads.

Intent-driven delivery uses the same contract shape at both altitudes.

The project harness is what I’ve assumed through most of this post. Guides and sensors local to one repository. Files to touch, patterns to follow, checks to pass. Acceptance falsifiable against a test runner.

The contract works at the project level because the project harness constrains how the work gets done. The “how” is local. The “what” is not. The “what” is product. What the org wants, who it’s for, why it matters. Today that lives in roadmaps, PRD drives, Slack threads, and a few people’s heads. Same problem as the old ticket, one altitude higher.

The org harness sits one layer up. Same primitives — guides and sensors — but the subject is the company, not the code. Guides describe what the org cares about: who the customer is, which metrics matter, what changes need product review, what data classifications mean. Sensors run against contracts before they reach a project: does this align with a current org intent, does it touch a regulated surface, does the outcome map to a tracked metric. An org intent (“reduce billing-related support volume by 30% this quarter”) decomposes into project intents (“show a clear renewal message when the card is expired”), each with its own contract and gates.
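Org sensors can be sketched the same way as project sensors: functions that take a contract and return findings. Everything here is an assumption for illustration, including the `OrgIntent` and `ProjectContract` shapes, the `parent_intent` link, and the idea of marking regulated surfaces by path prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrgIntent:
    id: str
    outcome: str
    metric: str  # the tracked metric this intent moves; empty if none


@dataclass
class ProjectContract:
    id: str
    outcome: str
    parent_intent: Optional[str]
    scope_in: list[str]


def org_sensor_findings(
    contract: ProjectContract,
    org_intents: dict[str, OrgIntent],
    regulated_prefixes: list[str],
) -> list[str]:
    """Run org-level sensors against a contract before it reaches a project."""
    findings = []
    parent = org_intents.get(contract.parent_intent)
    if parent is None:
        findings.append("does not align with a current org intent")
    elif not parent.metric:
        findings.append("parent org intent has no tracked metric")
    regulated = [
        path for path in contract.scope_in
        if path.startswith(tuple(regulated_prefixes))
    ]
    if regulated:
        findings.append("touches a regulated surface: " + ", ".join(regulated))
    return findings
```

A finding is not a veto. It routes the contract to the right human before the project harness ever sees it.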

Product and engineering stop being two languages stapled together. They share a contract format, they share gates, they share the same definition of falsifiable. An agent running inside a project harness inherits the org’s standards by reference instead of by reminder. The CEO opening the tracker doesn’t see “improve performance” — they see an intent that traces up to a quarterly outcome and down to a green CI run.

How to try this on your next ticket

You don’t need to install anything to test the idea. Pick a real ticket from your current sprint and do this:

Rewrite acceptance so every line is falsifiable. If a line says “improved,” “cleaner,” or “better,” rewrite it as a check a test or a command could run. If you can’t, the line wasn’t acceptance — it was a wish.

Add an explicit Scope out section. Three to five files, modules, or surfaces that the work must not touch. If you can’t think of any, you haven’t thought about the boundaries yet. The exercise is worth ten minutes.

Name the feature flag. If the change is user-visible and there’s no flag, add one before you start. Default off. The flag is part of the same change, not a follow-up.

Push back on tickets where “done” is what the reviewer says. If the next ticket coming your way doesn’t have falsifiable acceptance, return it. Don’t start work. The cost of writing the acceptance up front is much lower than the cost of arguing about “done” at the end.

Run one cycle where humans show up at three gates only. Seed, contract approval, completion. No mid-execution check-ins. Watch what happens to the interruption volume. Watch what happens to review time. Watch what happens to the diff size.

Most of the value shows up on the first ticket you rewrite, before any tooling exists. The contract is the lever. The phases are how to operate it without burning out the team. The flow — signal in, deploy out, three in-cycle touchpoints plus a production gate, constraints from the org and the project pressing in from both sides — is what makes intent-driven delivery a delivery model and not another spec. One shape answers what to do, how to track it, who decides it ships, and where the org and the project meet.

That’s the bet.

DEV Community