This is Article 6 of Beyond the Coding Assistant, a multi-part series on AI-assisted software engineering at enterprise scale. The full series is available free of any paywall at https://articles.zimetic.com/. Previously: Article 5 — SDLC, Not Task. Coming next: Article 7 — A Taxonomy of Agents.
The pattern is familiar by now. An engineer asks an AI agent to "implement this feature." The output looks plausible. It passes a smoke test. Then something subtle breaks in production: a timezone bug, a missed edge case, a security check silently bypassed.
Ask the same agent to perform a narrow, well-defined sub-task — "given this spec and these tests, implement this one function" — and the success rate is dramatically higher. The difference is not the agent. It's the scope of what was asked.
Now picture the opposite failure mode. It's 2 a.m. Production is on fire. The on-call engineer opens the ticket. The workflow engine dutifully walks them through requirements gathering, stakeholder interviews, architectural review, and a formal specification pass. By pass four, the outage is twenty minutes old. By pass five, somebody has called the engineer's manager.
Both failure modes have the same root cause: the workflow's structure didn't fit the work. In the first case the task was too large for a single pass. In the second case the workflow was too elaborate for the kind of work in front of it. Article 5 made the case that the unit of work for AI-assisted engineering is the work item, not the session. This article is about how that work item gets executed: with the right number of passes, in a workflow shape that matches the work.
What the short history of AI coding agents has already taught us
When researchers measure AI coding agents against benchmarks that require multi-step planning and reasoning, the numbers are sobering. One widely-cited study found GPT-4 achieving roughly a 14% success rate on complex multi-stage tasks where humans score above 92% [arXiv: Measuring AI Ability to Complete Long Tasks]. Work on dynamic task decomposition frameworks — notably the TDAG framework from researchers at the Chinese Academy of Sciences and collaborators — shows that breaking the same underlying problem into smaller, dynamically-generated subtasks produces substantially better reward scores and success rates [arXiv: TDAG]. Amazon's own applied-AI research group has published the same point: task decomposition combined with smaller LLMs is one of the main paths to making AI both more reliable and more affordable at enterprise scale [Amazon Science].
The MIT Sloan paper from Article 1 reaches the same conclusion from a different angle: AI's biggest impact comes from chaining tasks — clustering AI-friendly steps so AI executes them as a continuous sequence — rather than from single-task speedups [MIT Sloan: How AI is reshaping workflows and redefining jobs]. Same observation, multiple research traditions: small, well-bounded steps with structured handoffs are how AI delivers reliable work.
The pattern is consistent. AI coding agents succeed at small, well-scoped, well-specified work and fail at large, ambiguous, multi-step work. Failures cluster around context limits, cascading errors, and quality that varies wildly with prompt wording rather than with the structural correctness of the task.
Good news: this is not a new class of problem. We have been breaking complex problems into smaller, self-contained components for decades. We have developed dozens of tools and techniques for doing this over the years; the best known is probably Domain-Driven Design (DDD).
A short historical beat
Compilers ran into a related version of this problem in the 1960s. Single-pass compilers worked for simple, forward-only languages; as languages got richer, single-pass broke down. Scope resolution needed context the first pass didn't have. Errors cascaded. "Works on the example" didn't generalize.
The answer was multi-pass architectures: parse into an intermediate representation, then analyze semantically, then optimize, then generate target code. Each pass had a narrow job, produced a reviewable artifact, and was testable in isolation.
InfoWorld framed the parallel explicitly in a 2026 piece titled "The two-pass compiler is back — this time, it's fixing AI code generation" [InfoWorld]. The argument: today's LLM-based code generation tools are, architecturally, single-pass compilers. Split them into two passes and you get structural benefits prompt engineering can't provide — security issues eliminated at the IR boundary, hallucinated properties caught and stripped before they reach generated code, and because the generation pass is deterministic, output that's reproducible and auditable.
Spec-driven development is a good start
GitHub's open-source Spec Kit is the clearest current example of the move from one pass to two in AI coding [GitHub Blog] [Spec Kit]. It organizes projects around a .specify directory and a small set of slash commands — /speckit.specify to document requirements, /speckit.plan for the implementation plan, /speckit.tasks to break the plan down, /speckit.implement to execute, plus /speckit.clarify and /speckit.analyze to catch inconsistencies before implementation. The repository passed 28,000 GitHub stars in its first year — not a small signal.
Separating "specify what should be built" from "generate the code" is a real improvement, and many of the successes being reported today come from variations on this multi-pass structure. Credit where due: it's the move from single-pass to multi-pass, applied to AI code generation, and it helps.
But for most enterprise work, two passes is not enough. Real work involves design decisions that can't be settled by the spec, constraint reconciliation across services and teams and regulatory regimes, multiple verification concerns (correctness, security, compliance, performance, backward compatibility), integration with other changes, deployment considerations. Forcing all of that into two passes asks too much of either pass.
The framework this series argues for goes further, and not just by adding more passes. What I am proposing is that we define the process (the set of passes) to fit the work we are doing. It won't always be the same: a title change on a UI does not need the same process as a major refactor of a 10-year-old service that has grown too complicated to maintain.
More passes, deliberately chosen
For complex work, an illustrative pipeline — not the canonical one, but one that is in use and has proven effective:
- Spec pass. Requirement → formal specification.
- Design pass. Specification → DDD artifacts. Architecture decisions, API shapes, data models.
- IR pass. Design → structured intermediate representation. AI-generated, not production code.
- Validation pass. IR → verified IR. Specialized agents (security, compliance, testing) check alignment against the spec and against policy.
- Generation pass. Validated IR → production code. Ideally deterministic.
- Verification pass. Code → test results, review findings.
- Integration pass. Code → merged, deployed, monitored.
Each pass produces a set of artifacts. Each artifact is reviewable. At the end of each pass there can be one or more gates. You wouldn't want the coding assistant to interrupt you at every one of these steps for a title change or a minor layout issue, but for a change that will impact a key system and has performance and security implications, you might want to check at most, if not all, of them.
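To make the shape concrete, here is a minimal sketch of a pass pipeline that keeps every intermediate artifact and only stops at the passes that have a gate. The names `Pass`, `GateFailed`, and `run_pipeline` are hypothetical, and the lambdas are placeholders for what would really be an agent or tool call; this is one possible shape, not the framework's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Pass:
    name: str
    run: Callable[[str], str]                      # input artifact -> output artifact
    gate: Optional[Callable[[str], bool]] = None   # None means no gate after this pass


class GateFailed(Exception):
    """Raised when a pass's output artifact does not clear its gate."""


def run_pipeline(passes: list[Pass], work_item: str) -> dict[str, str]:
    """Run passes in order and keep every intermediate artifact for review."""
    artifacts: dict[str, str] = {}
    current = work_item
    for p in passes:
        current = p.run(current)
        artifacts[p.name] = current
        if p.gate is not None and not p.gate(current):
            raise GateFailed(f"{p.name} pass did not clear its gate")
    return artifacts


# Placeholder passes (assumption): real ones would call an agent or a tool.
full_pipeline = [
    Pass("spec", lambda req: f"SPEC({req})", gate=lambda a: a.startswith("SPEC(")),
    Pass("design", lambda spec: f"DESIGN({spec})"),  # no gate on this pass
    Pass("ir", lambda design: f"IR({design})", gate=lambda a: "DESIGN(" in a),
]

artifacts = run_pipeline(full_pipeline, "add audit log export")
```

Because the gate is optional per pass, the same runner can serve a low-risk change (few or no gates) and a high-risk one (a gate after most passes) without changing the passes themselves.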
If that list looks like a prescription, it isn't. It's a starting point for a particular kind of work: a complex new feature in a multi-system codebase. A different kind of work wants a different shape. By defining this SDLC, or workflow, in markdown files, we have made it possible for our agents to repeat the same process on every ticket. That turns an agent into a workflow orchestrator, and the good news is you don't need an expensive foundation model to be the orchestrator agent!
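As an illustration of how a markdown-defined workflow can be turned into something an orchestrator can act on, here is a small sketch that parses pass definitions out of a markdown list. The line format (`- <Name> pass. <input> → <output>.`) simply mirrors the list above and is an assumption, as are the names `PassSpec` and `parse_workflow`; your own files will likely carry more detail, such as gates and reviewers.

```python
import re
from typing import NamedTuple


class PassSpec(NamedTuple):
    name: str
    consumes: str
    produces: str


# Hypothetical line format: "- <Name> pass. <input> → <output>."
# Both the arrow character and the ASCII "->" are accepted.
PASS_LINE = re.compile(
    r"^- (?P<name>.+?) pass\. (?P<consumes>.+?) (?:→|->) (?P<produces>[^.]+)\."
)


def parse_workflow(markdown: str) -> list[PassSpec]:
    """Extract the ordered list of passes; lines that don't match are ignored."""
    passes = []
    for line in markdown.splitlines():
        m = PASS_LINE.match(line.strip())
        if m:
            passes.append(PassSpec(m["name"], m["consumes"], m["produces"]))
    return passes


WORKFLOW_MD = """\
# Workflow: complex feature

- Spec pass. Requirement → formal specification.
- Design pass. Specification → DDD artifacts. Architecture decisions, API shapes, data models.
- IR pass. Design → structured intermediate representation.
"""

for spec in parse_workflow(WORKFLOW_MD):
    print(f"{spec.name}: {spec.consumes} -> {spec.produces}")
```

The orchestrator only needs the ordered list of passes and what each one consumes and produces, which is part of why it does not have to be the most capable model in the system.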
One workflow doesn't fit all
While the workflow described above is the right answer for my team when we are doing large features, it isn't the only workflow or SDLC we use. We have several workflows, and our team adjusts them as we use them and learn what works and what does not. Some questions we ask when we make adjustments are:
- If we remove this step, how do we keep these artifacts in sync? Fundamentally, every story changes the system, and every change impacts code, but it also impacts documentation. Ugh, there is that nasty word again! But the documentation makes it easier to remember what you (or your coworker) did to the system three months ago that is related to this ticket. So we have a defined set of documentation we want the system to keep up to date: things like the bounded contexts, the contracts between systems, API docs, and so on. These are the minimum set of documentation we try to maintain in all of our processes. For a bug fix in production the documentation can wait, but we create a story to update it and kick that off after we get the system back online.
- What is the advantage of changing the workflow? Is it a cost savings? Does it make the work more reliable? Does it reduce the burden on someone? We try to make the system more reliable while still producing the new capabilities that will help us in our marketplace. So is this change likely to result in as good or better code generation from the agents? Is it going to make it so a cheaper model can generate the code, or the code can be generated faster? Is it going to make it easier for the agents to find any bugs before we have to look at the change and find them ourselves?
- Does this new flow support all the gates we need? Is it going to create a backlog of review tasks for the human, turning the review into a false gate? A false gate is one where a human performs a check but, under pressure to review so many deliverables, just passes it and hopes for the best. Those don't help us or make the system any better.
How many workflows is the right number?
Only you and your team can determine how many workflows you need and which combinations are right. My team has several:
- New feature — complex, multi-system. Full multi-pass: ideation through deployment. Slow, deliberate, high-touch.
- Production bug — urgent. Diagnose → patch → verify → deploy → post-incident. Speed and correctness dominate.
- Security incident. Production-bug shape plus compliance capture and mandatory human gates.
- Refactor / tech debt. Heavy on verification, light on requirements.
- Documentation / config update. Lightweight. One or two passes.
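One way to picture this is a registry that maps a work item's classification to its list of passes. The sketch below is illustrative only: the production-bug passes come from the list above, while the pass names for the other workflows, the ticket labels, and the precedence order are all assumptions.

```python
from enum import Enum


class WorkType(Enum):
    NEW_FEATURE = "new_feature"
    PRODUCTION_BUG = "production_bug"
    SECURITY_INCIDENT = "security_incident"
    REFACTOR = "refactor"
    DOCS_CONFIG = "docs_config"


# Pass names other than the production-bug ones are illustrative assumptions.
WORKFLOWS: dict[WorkType, list[str]] = {
    WorkType.NEW_FEATURE: ["spec", "design", "ir", "validation",
                           "generation", "verification", "integration"],
    WorkType.PRODUCTION_BUG: ["diagnose", "patch", "verify", "deploy", "post_incident"],
    WorkType.SECURITY_INCIDENT: ["diagnose", "patch", "verify", "compliance_capture",
                                 "deploy", "post_incident"],
    WorkType.REFACTOR: ["design", "generation", "verification", "integration"],
    WorkType.DOCS_CONFIG: ["generation", "verification"],
}

# Hypothetical ticket labels, checked in order: the most constrained type wins.
PRECEDENCE = [
    ("security", WorkType.SECURITY_INCIDENT),
    ("incident", WorkType.PRODUCTION_BUG),
    ("refactor", WorkType.REFACTOR),
    ("docs", WorkType.DOCS_CONFIG),
]


def classify(labels: set[str]) -> WorkType:
    """Pick a work type from ticket labels; unlabeled work gets the full workflow."""
    for label, work_type in PRECEDENCE:
        if label in labels:
            return work_type
    return WorkType.NEW_FEATURE


def select_workflow(work_type: WorkType) -> list[str]:
    """Return a copy so callers can adjust a run without mutating the registry."""
    return list(WORKFLOWS[work_type])


print(select_workflow(classify({"incident", "p1"})))
```

Defaulting unclassified work to the fullest workflow is a deliberately cautious choice in this sketch; a team could just as reasonably require a human to classify anything the labels don't cover.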
Can we standardize?
When we first started down this road, engineers wanted to have their own workflows. But over time and after much discussion, we found that most of our disagreements were about the semantics of naming the phases and artifacts, not about the actual details of what was happening. We were able to make compromises and find a middle ground that worked for all of us, and then iterate as we got used to that workflow.
Most teams can standardize their workflows within the team, but each team has a different set of requirements. Let's say you work for a healthcare company and your application impacts patient care. Well, that is going to have a whole different set of requirements (and workflow) than if you work for a realtor building a marketplace of homes for sale. Different organizations have different risk profiles and different levels of risk aversion to take into consideration.
This is why the framework doesn't dictate the workflow. Instead, it lets the team and the individual define the workflow, and then it processes the task through that workflow. The good news is that at almost every company the taxonomy is already there. Making it explicit is what lets AI orchestration act on it.
Why intermediate representations matter
I'm going to take a short detour here to discuss the intermediate representation step in our workflow, because it was a controversial one. You are free to go without it. What we have found, though, is that when the agent first generates pseudocode, or describes the system in a higher-level language than the actual code in whatever framework you are implementing in, it can focus on the concepts and the logic needed to implement the solution without dealing with syntactic sugar and the other minutiae that often lead to bugs. Then, in the next pass, the agent can focus on just implementing that function, subroutine, module, component, or whatever your framework calls them. This multi-pass approach breaks the process down so that the agent has a much higher success rate.
In addition, breaking this into two passes allows:
- Another agent to validate the logic without worrying about syntax.
- An agent to determine the unit tests, integration tests, and e2e tests that are needed.
- A gate that validates the logic will support the business needs in the requirements.
- A human architect to review it for architectural integrity and make corrections at this level, before code generation.
All of these allow the system to build a much more solid solution and validate it much better than if we went straight to implementing code.
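Here is a minimal sketch of what such an intermediate representation could look like, and of the checks it makes possible before any production code exists. The structure (`IRFunction`), the requirement IDs, and the helper names are all hypothetical; a real IR would be richer, and the stub renderer below only stands in for a real generation pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRFunction:
    name: str
    params: tuple[str, ...]
    steps: tuple[str, ...]    # the logic in plain language, one step per entry
    covers: tuple[str, ...]   # IDs of the requirements this function satisfies


def uncovered_requirements(required: set[str], ir: list[IRFunction]) -> set[str]:
    """Logic-level check: which requirements does no IR function claim to cover?"""
    covered = {req for fn in ir for req in fn.covers}
    return required - covered


def derive_test_names(ir: list[IRFunction]) -> list[str]:
    """One test per (function, requirement) pair, derived before code exists."""
    return [
        f"test_{fn.name}_{req.lower().replace('-', '_')}"
        for fn in ir
        for req in fn.covers
    ]


def generate_stub(fn: IRFunction) -> str:
    """Template rendering: the same IR always yields the same text."""
    lines = [f"def {fn.name}({', '.join(fn.params)}):"]
    lines += [f"    # {i}. {step}" for i, step in enumerate(fn.steps, start=1)]
    lines.append("    raise NotImplementedError")
    return "\n".join(lines)


# Hypothetical IR for an illustrative feature.
ir = [
    IRFunction(
        name="export_audit_log",
        params=("tenant_id", "since"),
        steps=(
            "Load entries for tenant_id newer than since",
            "Redact fields marked sensitive",
            "Serialize the remaining fields as CSV",
        ),
        covers=("REQ-1", "REQ-2"),
    )
]

print(uncovered_requirements({"REQ-1", "REQ-2", "REQ-3"}, ir))
print(generate_stub(ir[0]))
```

None of these checks care about syntax in the target framework. They operate on the logic and its traceability to requirements, which is the level a validating agent, a gate, or a human architect is reviewing at this point.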
Review gates play a vital part
The review gate — human, agent, or hybrid — plays the role semantic analysis plays in a compiler. It enforces criteria the generator can't self-check. "Does this match the spec?" "Does this comply with policy X?" "Does this preserve the invariants the existing tests cover?" These are exactly the kinds of questions that look obvious in retrospect and are the first things to slip through when you rely only on generation.
A review gate is a workflow primitive with written acceptance criteria, a defined reviewer (human, agent, or both), and an escalation policy when it doesn't pass. We will go into these in detail in a later article, but for now: every pass ends in a gate, and the gate is what makes "more passes" actually pay off.
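As a rough sketch of that primitive, the three parts named above (written criteria, a defined reviewer, an escalation policy) fit in a small data structure. The class and field names are hypothetical, and the string checks stand in for whatever a human or agent reviewer would actually evaluate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    description: str                 # the written acceptance criterion
    check: Callable[[str], bool]     # placeholder for a human or agent judgment


@dataclass(frozen=True)
class ReviewGate:
    name: str
    reviewer: str                    # "human", "agent", or "hybrid"
    criteria: tuple[Criterion, ...]
    escalate_to: str                 # who receives the artifact when the gate fails

    def review(self, artifact: str) -> tuple[bool, list[str]]:
        """Return (passed, descriptions of the criteria that failed)."""
        failed = [c.description for c in self.criteria if not c.check(artifact)]
        return (not failed, failed)


def next_step(gate: ReviewGate, artifact: str) -> str:
    passed, failed = gate.review(artifact)
    if passed:
        return "proceed"
    return f"escalate to {gate.escalate_to}: " + "; ".join(failed)


# Illustrative gate; the criteria and the "tech-lead" escalation target are assumptions.
spec_gate = ReviewGate(
    name="spec-review",
    reviewer="hybrid",
    criteria=(
        Criterion("cites at least one requirement ID", lambda a: "REQ-" in a),
        Criterion("contains no TODO markers", lambda a: "TODO" not in a),
    ),
    escalate_to="tech-lead",
)

print(next_step(spec_gate, "REQ-12: export audit log as CSV"))
print(next_step(spec_gate, "TODO: write spec"))
```

Because a failed gate reports which written criteria were missed, the escalation carries its own explanation, and that record can double as part of the audit trail.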
What this combination buys you
A workflow that has the right number of passes for the work in front of it, drawn from a workflow appropriate to the work item's classification, produces:
- Reproducibility. Deterministic generation from a validated IR.
- Auditability. Every pass produces an artifact that's reviewable and retainable.
- Compliance. The criteria for each gate are written down. The audit trail is the workflow's natural output, not a separate ceremony.
- Cost efficiency. Cache and reuse IRs instead of re-prompting; route cheaper models to the passes that don't need frontier capability.
- Flexibility. Engineers don't route a trivial change through a 10-step process just because "we always do it this way"; they can pick the workflow that is appropriate for the task at hand.
- Quality. The trilemma move from Article 4 — actually navigated, instead of pretended away.
Multi-pass alone doesn't deliver this. Multi-workflow alone doesn't deliver this. Multi-pass and multi-workflow together do.
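The cost-efficiency point in the list above can be sketched in a few lines: key each IR by a hash of the spec it was built from, and pick a model tier per pass. Everything here is an assumption made for illustration, including the tier names, which passes get which tier, and the `IRCache` class itself.

```python
import hashlib
from typing import Callable

# Hypothetical routing table: which passes get a cheaper model is a team decision.
MODEL_FOR_PASS = {
    "spec": "frontier",
    "design": "frontier",
    "ir": "frontier",
    "validation": "small",
    "generation": "small",
}


def model_for(pass_name: str) -> str:
    """Unknown passes fall back to the most capable tier (a cautious default)."""
    return MODEL_FOR_PASS.get(pass_name, "frontier")


class IRCache:
    """Content-addressed cache: an unchanged spec never triggers a second build."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(spec: str) -> str:
        return hashlib.sha256(spec.encode("utf-8")).hexdigest()

    def get_or_build(self, spec: str, build: Callable[[str], str]) -> str:
        k = self.key(spec)
        if k in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[k] = build(spec)
        return self._store[k]


def build_ir(spec: str) -> str:
    # Placeholder for the expensive, model-backed IR pass.
    return f"IR({spec})"


demo_cache = IRCache()
demo_cache.get_or_build("export audit log as CSV", build_ir)
demo_cache.get_or_build("export audit log as CSV", build_ir)  # served from the cache
print(demo_cache.hits, demo_cache.misses)
```

Hashing the spec text means any edit to the spec invalidates the cached IR automatically, which keeps reuse safe without a separate invalidation step.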
Implications for agents
Different workflows need different agents.
The production-bug workflow needs a fast diagnostic agent and a cautious deployer. It does not need a full-spec writer. The new-feature workflow needs an architect and a specifier; it does not need an incident-response triage process. The security-incident workflow needs reviewers with compliance-aware context, an audit-capture agent, and stricter human gates. The deployer for security incidents has different policies than the deployer for routine work.
This is the natural setup for the next article. Specialization at the agent level is a direct consequence of specialization at the workflow level. A flat "worker agent" pool can't service the workflows this article describes; the agent set has to mirror the workflow set.
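A small sketch shows why a flat pool falls short: if each workflow declares the agent roles it needs, checking a pool against those declarations is a set difference. The role names below paraphrase the ones described above and are otherwise hypothetical.

```python
# Agent roles each workflow needs (names are illustrative assumptions).
REQUIRED_AGENTS: dict[str, set[str]] = {
    "production_bug": {"diagnostician", "deployer"},
    "new_feature": {"specifier", "architect"},
    "security_incident": {"compliance_reviewer", "audit_capture", "incident_deployer"},
}


def missing_agents(workflow: str, available: set[str]) -> set[str]:
    """Roles the workflow needs that the pool cannot supply."""
    return REQUIRED_AGENTS[workflow] - available


def uncovered_workflows(available: set[str]) -> list[str]:
    """Workflows that cannot run with the given agent pool."""
    return sorted(w for w in REQUIRED_AGENTS if missing_agents(w, available))


flat_pool = {"worker"}
print(uncovered_workflows(flat_pool))
```

With only a generic `worker` role available, every workflow in this sketch reports missing roles, which is the mismatch the next article picks up.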
Closing
The short history of AI agents has taught us what works: small, well-scoped tasks. The right answer is not heroic prompting; it is structured, multi-pass workflows with the right granularity for the work — and more than one workflow, because real engineering organizations do real engineering work, and that work is not all the same shape.
The compiler community solved the multi-pass problem half a century ago. The incident-response, change-management, and audit communities have been solving the multiple-workflow problem for almost as long. We can borrow both moves without reinventing them. And then we can ask the more interesting question: what kind of agents, on what kind of compute, with what kind of context, do we need to actually run all of this?
Coming next
Article 7: A Taxonomy of Agents. Different workflows need different agents, and engineering organizations are already a taxonomy of specialized roles — most of which are not "developer." Mapping agents to that taxonomy is the move that makes the whole-team frame from Articles 0 and 1 actually executable.
Sources
- Measuring AI Ability to Complete Long Tasks — arXiv
- TDAG: Dynamic Task Decomposition and Agent Generation — arXiv
- How task decomposition and smaller LLMs can make AI more affordable — Amazon Science
- How AI is reshaping workflows and redefining jobs — Kristin Burnham, MIT Sloan Ideas Made to Matter, April 2026
- The two-pass compiler is back — this time, it's fixing AI code generation — InfoWorld
- Spec-driven development with AI: new open source toolkit — GitHub Blog
- Spec Kit
