If you have spent any time using an AI coding agent, you have probably experienced this: you ask it to refactor a module, and it immediately starts editing files, halfway through realizes it needs a different approach, backtracks, leaves orphaned imports, and produces something that half-works. The agent was not stupid. It was doing two fundamentally different cognitive tasks at the same time — figuring out what to do and doing it — and that interleaving is where things fall apart.
The fix is an architectural pattern borrowed from classical AI planning: separate the planning phase from the execution phase. Generate a complete plan first — a spec, a file map, test criteria, an ordering of changes — and only then execute against that plan. This single structural change transforms flaky, meandering agent sessions into predictable, reviewable workflows.
What You Will Learn
- Why interleaving planning and execution causes most AI coding agent failures
- The two-phase workflow pattern: what it looks like in practice
- How to write prompts that produce structured, actionable plans
- How to implement a feedback loop between execution and the original plan
- Concrete code examples you can adapt for your own agent pipelines
- When this pattern helps and when it adds unnecessary overhead
Why Agents Fail: The Interleaving Problem
When you ask an LLM to "add authentication to this Express app," the model faces two distinct problems. First, it needs to reason about the task: which files need to change, what dependencies are required, what the middleware chain should look like, how existing routes will be affected. Second, it needs to produce code: actual syntax, correct imports, proper error handling.
These are different kinds of work. Planning is divergent — it explores possibilities, considers constraints, and makes trade-offs. Execution is convergent — it commits to specific syntax and produces concrete artifacts. When an LLM tries to do both simultaneously, it makes execution decisions before planning is complete. It starts writing a JWT middleware before deciding whether sessions or tokens are the right approach. Then it has to backtrack, and backtracking inside a code generation context means lost coherence.
This is not just a theory. If you audit failed agent sessions, you will find a consistent pattern: the agent made a reasonable local decision early on that turned out to be globally wrong. It edited file A in a way that is incompatible with the changes file B needs, because it had not thought through the full dependency graph before starting.
Classical AI planning research formalized this problem decades ago. The HDDL 2.1 formalism for Hierarchical Task Networks, for example, explicitly separates task decomposition (planning) from action execution, because interleaving them in complex domains leads to invalid plans. The same principle applies to LLM-based coding agents.
The Two-Phase Pattern: Plan, Then Execute
The core idea is straightforward: split every agent task into two distinct phases with a clear boundary between them.
Phase 1 — Planning: The LLM analyzes the task, examines relevant code, and produces a structured plan. This plan includes what files will be created or modified, what the expected behavior change is, what tests should pass afterward, and in what order changes should be applied. No code is written yet.
Phase 2 — Execution: A separate agent call (or series of calls) takes the plan as input and implements each step. The execution agent can reference the plan to maintain consistency. After each step, validation checks confirm the execution matches the plan.
Here is the workflow visualized:
graph TD
A["Task Description"] --> B["Phase 1: Planning Agent"]
B --> C["Structured Plan"]
C --> D{"Human Review<br/>(optional)"}
D -->|Approved| E["Phase 2: Execution Agent"]
D -->|Revise| B
E --> F["Step 1: Implement Change"]
F --> G{"Validate Against Plan"}
G -->|Pass| H["Step 2: Next Change"]
G -->|Fail| I["Re-attempt with<br/>Plan Context"]
I --> G
H --> J["Step N: Final Change"]
J --> K{"Run Test Criteria<br/>from Plan"}
K -->|Pass| L["Done"]
K -->|Fail| M["Diagnose Against Plan"]
M --> E
The boundary between phases is the structured plan — a document that both humans and the execution agent can read, review, and reference. This is where the pattern gets its reliability: you can inspect the plan before any code changes happen.
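The whole loop reduces to a small driver. Here is a minimal sketch; `plan_fn`, `execute_fn`, and `validate_fn` are hypothetical callables standing in for the planning, execution, and validation pieces built out later in this article:

```python
from typing import Callable

def run_two_phase(
    task: str,
    plan_fn: Callable[[str], list[str]],          # returns ordered step descriptions
    execute_fn: Callable[[str, int, str], None],  # implements one step
    validate_fn: Callable[[int], bool],           # checks one step
    max_retries: int = 2,
) -> bool:
    """Plan once, then execute step by step with validation between steps."""
    steps = plan_fn(task)  # Phase 1: plan only, no code written yet
    plan_text = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    for i, step in enumerate(steps, start=1):
        for attempt in range(max_retries + 1):
            execute_fn(plan_text, i, step)  # Phase 2: one step at a time
            if validate_fn(i):
                break  # step confirmed; move on to the next one
        else:
            return False  # step kept failing validation; stop here
    return True
```

The structural point is that `plan_fn` runs exactly once, before any execution, and everything downstream only reads the plan.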
The Plan Is Not a Suggestion
A common mistake is generating a vague, prose-heavy plan and then letting the execution agent freestyle. The plan must be specific enough that you could hand it to a different developer (or a different LLM) and get the same result. If your plan says "update the authentication logic," it is too vague. If it says "add a `verifyToken` middleware in `src/middleware/auth.ts` that decodes a JWT from the `Authorization` header and attaches the payload to `req.user`," that is actionable.
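To make "actionable" concrete, it helps to treat each step as structured data rather than prose. A sketch, with `PlanStep` as a hypothetical name (it mirrors the JSON schema shown later in this article):

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    """One actionable step: specific enough to hand to any developer or LLM."""
    order: int
    description: str
    files_to_modify: list[str] = field(default_factory=list)
    files_to_create: list[str] = field(default_factory=list)
    validation: list[str] = field(default_factory=list)  # shell commands to run

# The auth example from above, expressed as a step:
step = PlanStep(
    order=1,
    description=(
        "Add a verifyToken middleware in src/middleware/auth.ts that decodes "
        "a JWT from the Authorization header and attaches the payload to req.user"
    ),
    files_to_create=["src/middleware/auth.ts"],
    validation=["npx tsc --noEmit"],
)
```

Every field is something the execution agent can act on or a validator can check; nothing is left to interpretation.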
Designing the Planning Phase
The planning phase prompt needs to guide the LLM toward producing structured, actionable output. Here is a concrete prompt template you can adapt:
PLANNING_PROMPT = """
You are a senior software architect. Your job is to produce a detailed
implementation plan. Do NOT write any code yet.
## Task
{task_description}
## Current Codebase Context
{relevant_file_contents}
## Instructions
Analyze the task and produce a plan with the following sections:
### 1. Summary
One paragraph describing what this change accomplishes.
### 2. Files to Modify
For each file:
- File path
- What changes are needed (be specific about functions, classes, exports)
- Dependencies on other file changes (ordering constraints)
### 3. New Files to Create
For each new file:
- File path
- Purpose
- Key exports or interfaces
### 4. Dependency Changes
- New packages to install (with versions if relevant)
- Configuration changes (tsconfig, eslint, etc.)
### 5. Execution Order
Numbered list of steps, ordered so that each step can be validated
independently. Earlier steps should not depend on later ones.
### 6. Test Criteria
Specific, verifiable conditions that confirm the change is correct:
- Unit tests to add or modify
- Manual verification steps
- Edge cases to handle
### 7. Risks and Assumptions
- What could go wrong
- What assumptions you are making about the codebase
"""
Let us walk through using this with a real task.
1. Gather Context
Before sending the planning prompt, collect the files the LLM will need to reason about. This means reading the relevant source files, the project's package.json or equivalent, and any existing tests. Do not dump the entire codebase — scope it to what is relevant.
import os

def gather_context(file_paths: list[str]) -> str:
    """Read and concatenate file contents with clear delimiters."""
    context_parts = []
    for path in file_paths:
        if os.path.exists(path):
            with open(path, 'r') as f:
                content = f.read()
            context_parts.append(f"### {path}\n```\n{content}\n```")
    return "\n\n".join(context_parts)

# Example: gathering context for an auth feature
context = gather_context([
    "src/app.ts",
    "src/routes/users.ts",
    "src/middleware/index.ts",
    "package.json",
    "tsconfig.json"
])
2. Generate the Plan
Send the planning prompt to your LLM. Use a high-reasoning model for this phase — planning benefits more from careful thinking than from fast token generation.
import openai

def generate_plan(task: str, context: str) -> str:
    """Generate an implementation plan without writing any code."""
    client = openai.OpenAI()
    prompt = PLANNING_PROMPT.format(
        task_description=task,
        relevant_file_contents=context
    )
    response = client.chat.completions.create(
        model="o3",  # Use a reasoning-capable model for planning
        messages=[
            {"role": "system", "content": "You are a software architect. Produce plans, not code."},
            {"role": "user", "content": prompt}
        ]
        # Reasoning models fix their sampling parameters, so no
        # temperature is passed here.
    )
    return response.choices[0].message.content

plan = generate_plan(
    task="Add JWT-based authentication middleware to the Express app. "
         "Protect all /api/* routes except /api/auth/login and /api/auth/register.",
    context=context
)
print(plan)
3. Review the Plan
This is the step most people skip, and it is the most valuable. Read the plan. Does the execution order make sense? Are there missing files? Does the LLM assume something about your codebase that is wrong? Catching a mistake here costs you 30 seconds. Catching it after execution costs you a debugging session.
def save_plan_for_review(plan: str, output_path: str = "plan.md"):
    """Save the plan to a file for human review before execution."""
    with open(output_path, 'w') as f:
        f.write(plan)
    print(f"Plan saved to {output_path}. Review before proceeding.")

save_plan_for_review(plan)
Designing the Execution Phase
Once the plan is approved, the execution agent takes over. The key design decision here is granularity: execute one plan step at a time, validate, then proceed. Do not send the entire plan and ask the LLM to implement everything at once — that reintroduces the interleaving problem.
import json
import subprocess
EXECUTION_PROMPT = """
You are a senior developer implementing a specific step from an approved plan.
## Full Plan (for context — do NOT implement other steps)
{full_plan}
## Current Step to Implement
Step {step_number}: {step_description}
## Current File Contents
{current_file_contents}
## Instructions
Implement ONLY this step. Return the complete updated file contents.
Do not skip to later steps. Do not modify files not mentioned in this step.
"""
def execute_step(
    plan: str,
    step_number: int,
    step_description: str,
    file_paths: list[str]
) -> str:
    """Execute a single step from the plan."""
    client = openai.OpenAI()
    context = gather_context(file_paths)
    prompt = EXECUTION_PROMPT.format(
        full_plan=plan,
        step_number=step_number,
        step_description=step_description,
        current_file_contents=context
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # Fast model is fine for execution
        messages=[
            {"role": "system", "content": "You are a developer. Write clean, working code."},
            {"role": "user", "content": prompt}
        ],
        temperature=0  # Low temperature for deterministic code output
    )
    return response.choices[0].message.content
Notice that the execution prompt includes the full plan for context. This is critical — the execution agent needs to know the big picture to make locally correct decisions. But the instruction constrains it to only implement the current step.
The Validation Loop
After each step, run validation. This can be as simple as a syntax check or as thorough as running the test suite. The validation result feeds back into the execution agent if something failed.
def validate_step(step_number: int, validation_commands: list[str]) -> dict:
    """Run validation commands and return results."""
    results = {"step": step_number, "passed": True, "errors": []}
    for cmd in validation_commands:
        try:
            result = subprocess.run(
                cmd.split(),
                capture_output=True,
                text=True,
                timeout=30  # Prevent hanging
            )
            if result.returncode != 0:
                results["passed"] = False
                results["errors"].append({
                    "command": cmd,
                    "stderr": result.stderr[:500]  # Truncate long errors
                })
        except subprocess.TimeoutExpired:
            results["passed"] = False
            results["errors"].append({
                "command": cmd,
                "stderr": "Command timed out after 30 seconds"
            })
    return results

# After implementing step 1 (e.g., creating the auth middleware file):
validation = validate_step(1, [
    "npx tsc --noEmit",  # Type check
    "npx eslint src/middleware/auth.ts"  # Lint the new file
])
if not validation["passed"]:
    print(f"Step 1 failed validation: {validation['errors']}")
    # Feed errors back to the execution agent for a retry
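That feedback can be wrapped in a small retry helper. This is a sketch with hypothetical callables so the logic stays independent of any particular LLM client; the key move is folding the validation errors into the next prompt rather than retrying blind:

```python
def execute_with_retry(
    execute_fn,            # callable(step_description) -> generated code
    validate_fn,           # callable() -> dict with "passed" and "errors" keys
    step_description: str,
    max_retries: int = 2,
) -> str:
    """Run one step, validate, and feed failures back for a bounded retry."""
    description = step_description
    for attempt in range(max_retries + 1):
        code = execute_fn(description)
        # (apply `code` to disk here before validating)
        result = validate_fn()
        if result["passed"]:
            return code
        # Give the model the concrete failure, not just "try again"
        description = (
            f"{step_description}\n\n"
            f"A previous attempt failed validation with:\n{result['errors']}"
        )
    raise RuntimeError(f"Step failed after {max_retries + 1} attempts")
```

Bounding the retries matters: if a step cannot pass validation after a few attempts, the plan itself is probably wrong, and the right response is to revise the plan, not to keep hammering the execution agent.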
sequenceDiagram
participant H as Human
participant P as Planning Agent
participant E as Execution Agent
participant V as Validator
H->>P: Task + Codebase Context
P->>H: Structured Plan
H->>H: Review & Approve Plan
H->>E: Plan + Step 1
E->>V: Generated Code
V->>V: Run checks (tsc, eslint, tests)
alt Validation Passes
V->>E: Proceed to Step 2
else Validation Fails
V->>E: Error details + retry
E->>V: Revised code
end
E->>H: All steps complete
Comparing Approaches
| Aspect | Single-Phase (Plan + Execute Together) | Two-Phase (Plan, Then Execute) |
|---|---|---|
| Reliability | Degrades on multi-file tasks | Consistent across task sizes |
| Reviewability | Must read generated code to understand intent | Plan is reviewable before any code changes |
| Error recovery | Must undo partial changes and restart | Retry individual steps against the plan |
| Token efficiency | Sometimes fewer tokens for simple tasks | More tokens overall but fewer wasted on backtracking |
| Latency | Faster for trivial changes | Adds planning overhead (seconds to minutes) |
| Best for | Single-file edits, quick fixes | Multi-file features, refactors, architectural changes |
Real-World Considerations
When this pattern is overkill. If you are asking an agent to rename a variable or fix a typo, a planning phase adds latency with no benefit. Use the two-phase pattern for tasks that touch multiple files, introduce new abstractions, or change system behavior. A good heuristic: if you would write a design doc before doing it yourself, use the two-phase pattern.
Plan staleness. If your plan references file contents that change during execution (because earlier steps modified them), the plan can become stale. Mitigate this by re-reading file contents before each execution step rather than caching them from the planning phase.
Plan format matters. Structured formats (numbered steps, explicit file paths) work better than prose paragraphs. Consider having the planning agent output JSON or YAML if you want to parse the plan programmatically:
STRUCTURED_PLAN_PROMPT = """
...same as before, but add:

Output the plan as a JSON object with this schema:
{
  "summary": "string",
  "steps": [
    {
      "order": 1,
      "description": "string",
      "files_to_modify": ["path/to/file.ts"],
      "files_to_create": ["path/to/new_file.ts"],
      "validation": ["npx tsc --noEmit"]
    }
  ],
  "test_criteria": ["string"],
  "risks": ["string"]
}
"""
Context window pressure. Including the full plan in every execution prompt consumes tokens. For large plans, consider summarizing completed steps and only including the full detail for the current and adjacent steps.
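One way to relieve that pressure is a sliding window over the plan: full detail for the current step and its neighbors, one-line summaries for the rest. A sketch (`condensed_plan_context` is a hypothetical helper that assumes steps shaped like the JSON schema above):

```python
def condensed_plan_context(steps: list[dict], current: int, window: int = 1) -> str:
    """Keep full detail only for steps near the current one."""
    parts = []
    for step in steps:
        order = step["order"]
        if abs(order - current) <= window:
            # Full detail for the current step and its neighbors
            parts.append(
                f"Step {order}: {step['description']}\n"
                f"  files: {step.get('files_to_modify', []) + step.get('files_to_create', [])}\n"
                f"  validation: {step.get('validation', [])}"
            )
        else:
            # Distant steps get a cheap one-line summary (first sentence)
            summary = step["description"].split(". ")[0]
            parts.append(f"Step {order} (summary): {summary}")
    return "\n".join(parts)
```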
Do Not Skip Validation Between Steps
It is tempting to batch multiple execution steps together for speed. Resist this. If step 3 depends on step 2 and step 2 introduced a bug, batching means step 3 builds on a broken foundation. The validation loop between steps is what makes this pattern reliable. Without it, you are back to hoping the agent gets everything right in one shot.
Persisting plans across sessions. One practical challenge is that planning context gets lost when a session ends. If you generate a plan today and want to resume execution tomorrow, you need the plan — and the reasoning behind it — to be available. This is where persistent memory layers become useful: they let execution agents reference prior planning decisions without re-deriving them from scratch.
Further Reading and Tools
- HDDL 2.1: Towards Defining a Formalism and a Semantics for Temporal HTN Planning by Pellier et al. — Formalizes hierarchical task decomposition, which is the theoretical foundation for separating planning from execution in agent systems.
- Provenance-Based Assessment of Plans in Context by Friedman et al. — Explores how plan metadata (provenance) can be used to assess confidence and risk, relevant if you want to build plan-quality scoring into your pipeline.
- Anthropic's Claude Agent Guidelines — Anthropic's documentation on agentic tool use includes recommendations for structured multi-step workflows that align with the planning-execution separation.
- SuperLocalMemory — A persistent memory layer for AI agents that preserves planning context across sessions, letting execution agents reference prior decisions.
- LangGraph — A framework for building stateful agent workflows with explicit graph-based control flow, useful for implementing the plan-then-execute pattern with built-in state management.
Key Takeaways
- Separate planning from execution. Have the LLM produce a complete, structured plan before it writes any code. This eliminates the class of errors caused by premature execution decisions.
- Make plans specific and reviewable. A plan that says "update the auth logic" is useless. A plan that specifies file paths, function signatures, and execution order is actionable.
- Execute one step at a time with validation. Run checks (type checking, linting, tests) after each step. Feed errors back to the execution agent with the original plan as context.
- Include the full plan during execution. The execution agent needs the big picture to make locally correct decisions, even though it only implements one step at a time.
- Use this pattern selectively. Single-file fixes do not need a planning phase. Multi-file features, refactors, and architectural changes benefit enormously from it.
- Persist your plans. If execution spans multiple sessions, make sure the plan and its rationale survive. Lost context leads to inconsistent execution.