If you have spent any time using an AI coding agent, you have probably experienced this: you ask it to refactor a module, and it immediately starts editing files, halfway through realizes it needs a different approach, backtracks, leaves orphaned imports, and produces something that half-works. The agent was not stupid. It was doing two fundamentally different cognitive tasks at the same time — figuring out what to do and doing it — and that interleaving is where things fall apart.
The fix is an architectural pattern borrowed from classical AI planning: separate the planning phase from the execution phase. Generate a complete plan first — a spec, a file map, test criteria, an ordering of changes — and only then execute against that plan. This single structural change transforms flaky, meandering agent sessions into predictable, reviewable workflows.
What You Will Learn
- Why interleaving planning and execution causes most AI coding agent failures
- The two-phase workflow pattern: what it looks like in practice
- How to write prompts that produce structured, actionable plans
- How to implement a feedback loop between execution and the original plan
- Concrete code examples you can adapt for your own agent pipelines
- When this pattern helps and when it adds unnecessary overhead
Why Agents Fail: The Interleaving Problem
When you ask an LLM to "add authentication to this Express app," the model faces two distinct problems. First, it needs to reason about the task: which files need to change, what dependencies are required, what the middleware chain should look like, how existing routes will be affected. Second, it needs to produce code: actual syntax, correct imports, proper error handling.
These are different kinds of work. Planning is divergent — it explores possibilities, considers constraints, and makes trade-offs. Execution is convergent — it commits to specific syntax and produces concrete artifacts. When an LLM tries to do both simultaneously, it makes execution decisions before planning is complete. It starts writing a JWT middleware before deciding whether sessions or tokens are the right approach. Then it has to backtrack, and backtracking inside a code generation context means lost coherence.
This is not just a theory. If you audit failed agent sessions, you will find a consistent pattern: the agent made a reasonable local decision early on that turned out to be globally wrong. It edited file A in a way that is incompatible with the changes file B needs, because it had not thought through the full dependency graph before starting.
Classical AI planning research formalized this problem decades ago. The HDDL 2.1 formalism for Hierarchical Task Networks, for example, explicitly separates task decomposition (planning) from action execution, because interleaving them in complex domains leads to invalid plans. The same principle applies to LLM-based coding agents.
The Two-Phase Pattern: Plan, Then Execute
The core idea is straightforward: split every agent task into two distinct phases with a clear boundary between them.
Phase 1 — Planning: The LLM analyzes the task, examines relevant code, and produces a structured plan. This plan includes what files will be created or modified, what the expected behavior change is, what tests should pass afterward, and in what order changes should be applied. No code is written yet.
Phase 2 — Execution: A separate agent call (or series of calls) takes the plan as input and implements each step. The execution agent can reference the plan to maintain consistency. After each step, validation checks confirm the execution matches the plan.
Here is the workflow visualized:
graph TD
A["Task Description"] --> B["Phase 1: Planning Agent"]
B --> C["Structured Plan"]
C --> D{"Human Review<br/>(optional)"}
D -->|Approved| E["Phase 2: Execution Agent"]
D -->|Revise| B
E --> F["Step 1: Implement Change"]
F --> G{"Validate Against Plan"}
G -->|Pass| H["Step 2: Next Change"]
G -->|Fail| I["Re-attempt with<br/>Plan Context"]
I --> G
H --> J["Step N: Final Change"]
J --> K{"Run Test Criteria<br/>from Plan"}
K -->|Pass| L["Done"]
K -->|Fail| M["Diagnose Against Plan"]
M --> E
The boundary between phases is the structured plan — a document that both humans and the execution agent can read, review, and reference. This is where the pattern gets its reliability: you can inspect the plan before any code changes happen.
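The whole loop reduces to a small driver. Here is a minimal sketch; `plan_fn`, `execute_fn`, and `validate_fn` are hypothetical callables standing in for the planning, execution, and validation pieces built out later in this article:

```python
from typing import Callable

def run_two_phase(
    task: str,
    plan_fn: Callable[[str], list[str]],          # returns ordered step descriptions
    execute_fn: Callable[[str, int, str], None],  # implements one step
    validate_fn: Callable[[int], bool],           # checks one step
    max_retries: int = 2,
) -> bool:
    """Plan once, then execute step by step with validation between steps."""
    steps = plan_fn(task)  # Phase 1: plan only, no code written yet
    plan_text = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    for i, step in enumerate(steps, start=1):
        for attempt in range(max_retries + 1):
            execute_fn(plan_text, i, step)  # Phase 2: one step at a time
            if validate_fn(i):
                break  # step confirmed; move on to the next one
        else:
            return False  # step kept failing validation; stop here
    return True
```

The structural point is that `plan_fn` runs exactly once, before any execution, and everything downstream only reads the plan.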
The Plan Is Not a Suggestion
A common mistake is generating a vague, prose-heavy plan and then letting the execution agent freestyle. The plan must be specific enough that you could hand it to a different developer (or a different LLM) and get the same result. If your plan says "update the authentication logic," it is too vague. If it says "add a `verifyToken` middleware in `src/middleware/auth.ts` that decodes a JWT from the `Authorization` header and attaches the payload to `req.user`," that is actionable.
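To make "actionable" concrete, it helps to treat each step as structured data rather than prose. A sketch, with `PlanStep` as a hypothetical name (it mirrors the JSON schema shown later in this article):

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    """One actionable step: specific enough to hand to any developer or LLM."""
    order: int
    description: str
    files_to_modify: list[str] = field(default_factory=list)
    files_to_create: list[str] = field(default_factory=list)
    validation: list[str] = field(default_factory=list)  # shell commands to run

# The auth example from above, expressed as a step:
step = PlanStep(
    order=1,
    description=(
        "Add a verifyToken middleware in src/middleware/auth.ts that decodes "
        "a JWT from the Authorization header and attaches the payload to req.user"
    ),
    files_to_create=["src/middleware/auth.ts"],
    validation=["npx tsc --noEmit"],
)
```

Every field is something the execution agent can act on or a validator can check; nothing is left to interpretation.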
Designing the Planning Phase
The planning phase prompt needs to guide the LLM toward producing structured, actionable output. Here is a concrete prompt template you can adapt:
PLANNING_PROMPT = """
You are a senior software architect. Your job is to produce a detailed
implementation plan. Do NOT write any code yet.
## Task
{task_description}
## Current Codebase Context
{relevant_file_contents}
## Instructions
Analyze the task and produce a plan with the following sections:
### 1. Summary
One paragraph describing what this change accomplishes.
### 2. Files to Modify
For each file:
- File path
- What changes are needed (be specific about functions, classes, exports)
- Dependencies on other file changes (ordering constraints)
### 3. New Files to Create
For each new file:
- File path
- Purpose
- Key exports or interfaces
### 4. Dependency Changes
- New packages to install (with versions if relevant)
- Configuration changes (tsconfig, eslint, etc.)
### 5. Execution Order
Numbered list of steps, ordered so that each step can be validated
independently. Earlier steps should not depend on later ones.
### 6. Test Criteria
Specific, verifiable conditions that confirm the change is correct:
- Unit tests to add or modify
- Manual verification steps
- Edge cases to handle
### 7. Risks and Assumptions
- What could go wrong
- What assumptions you are making about the codebase
"""
Let us walk through using this with a real task.
1. Gather Context
Before sending the planning prompt, collect the files the LLM will need to reason about. This means reading the relevant source files, the project's package.json or equivalent, and any existing tests. Do not dump the entire codebase — scope it to what is relevant.
import os

def gather_context(file_paths: list[str]) -> str:
    """Read and concatenate file contents with clear delimiters."""
    context_parts = []
    for path in file_paths:
        if os.path.exists(path):
            with open(path, 'r') as f:
                content = f.read()
            context_parts.append(f"### {path}\n```\n{content}\n```")
    return "\n\n".join(context_parts)

# Example: gathering context for an auth feature
context = gather_context([
    "src/app.ts",
    "src/routes/users.ts",
    "src/middleware/index.ts",
    "package.json",
    "tsconfig.json"
])
2. Generate the Plan
Send the planning prompt to your LLM. Use a high-reasoning model for this phase — planning benefits more from careful thinking than from fast token generation.
import openai

def generate_plan(task: str, context: str) -> str:
    """Generate an implementation plan without writing any code."""
    client = openai.OpenAI()
    prompt = PLANNING_PROMPT.format(
        task_description=task,
        relevant_file_contents=context
    )
    response = client.chat.completions.create(
        model="o3",  # Use a reasoning-capable model for planning
        messages=[
            {"role": "system", "content": "You are a software architect. Produce plans, not code."},
            {"role": "user", "content": prompt}
        ]
        # Reasoning models fix their sampling parameters, so no
        # temperature is passed here.
    )
    return response.choices[0].message.content

plan = generate_plan(
    task="Add JWT-based authentication middleware to the Express app. "
         "Protect all /api/* routes except /api/auth/login and /api/auth/register.",
    context=context
)
print(plan)
3. Review the Plan
This is the step most people skip, and it is the most valuable. Read the plan. Does the execution order make sense? Are there missing files? Does the LLM assume something about your codebase that is wrong? Catching a mistake here costs you 30 seconds. Catching it after execution costs you a debugging session.
def save_plan_for_review(plan: str, output_path: str = "plan.md"):
    """Save the plan to a file for human review before execution."""
    with open(output_path, 'w') as f:
        f.write(plan)
    print(f"Plan saved to {output_path}. Review before proceeding.")

save_plan_for_review(plan)
Designing the Execution Phase
Once the plan is approved, the execution agent takes over. The key design decision here is granularity: execute one plan step at a time, validate, then proceed. Do not send the entire plan and ask the LLM to implement everything at once — that reintroduces the interleaving problem.
import json
import subprocess
EXECUTION_PROMPT = """
You are a senior developer implementing a specific step from an approved plan.
## Full Plan (for context — do NOT implement other steps)
{full_plan}
## Current Step to Implement
Step {step_number}: {step_description}
## Current File Contents
{current_file_contents}
## Instructions
Implement ONLY this step. Return the complete updated file contents.
Do not skip to later steps. Do not modify files not mentioned in this step.
"""
def execute_step(
    plan: str,
    step_number: int,
    step_description: str,
    file_paths: list[str]
) -> str:
    """Execute a single step from the plan."""
    client = openai.OpenAI()
    context = gather_context(file_paths)
    prompt = EXECUTION_PROMPT.format(
        full_plan=plan,
        step_number=step_number,
        step_description=step_description,
        current_file_contents=context
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # Fast model is fine for execution
        messages=[
            {"role": "system", "content": "You are a developer. Write clean, working code."},
            {"role": "user", "content": prompt}
        ],
        temperature=0  # Low temperature for deterministic code output
    )
    return response.choices[0].message.content
Notice that the execution prompt includes the full plan for context. This is critical — the execution agent needs to know the big picture to make locally correct decisions. But the instruction constrains it to only implement the current step.
The Validation Loop
After each step, run validation. This can be as simple as a syntax check or as thorough as running the test suite. The validation result feeds back into the execution agent if something failed.
def validate_step(step_number: int, validation_commands: list[str]) -> dict:
    """Run validation commands and return results."""
    results = {"step": step_number, "passed": True, "errors": []}
    for cmd in validation_commands:
        try:
            result = subprocess.run(
                cmd.split(),
                capture_output=True,
                text=True,
                timeout=30  # Prevent hanging
            )
            if result.returncode != 0:
                results["passed"] = False
                results["errors"].append({
                    "command": cmd,
                    "stderr": result.stderr[:500]  # Truncate long errors
                })
        except subprocess.TimeoutExpired:
            results["passed"] = False
            results["errors"].append({
                "command": cmd,
                "stderr": "Command timed out after 30 seconds"
            })
    return results

# After implementing step 1 (e.g., creating the auth middleware file):
validation = validate_step(1, [
    "npx tsc --noEmit",  # Type check
    "npx eslint src/middleware/auth.ts"  # Lint the new file
])
if not validation["passed"]:
    print(f"Step 1 failed validation: {validation['errors']}")
    # Feed errors back to the execution agent for a retry
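That feedback can be wrapped in a small retry helper. This is a sketch with hypothetical callables so the logic stays independent of any particular LLM client; the key move is folding the validation errors into the next prompt rather than retrying blind:

```python
def execute_with_retry(
    execute_fn,            # callable(step_description) -> generated code
    validate_fn,           # callable() -> dict with "passed" and "errors" keys
    step_description: str,
    max_retries: int = 2,
) -> str:
    """Run one step, validate, and feed failures back for a bounded retry."""
    description = step_description
    for attempt in range(max_retries + 1):
        code = execute_fn(description)
        # (apply `code` to disk here before validating)
        result = validate_fn()
        if result["passed"]:
            return code
        # Give the model the concrete failure, not just "try again"
        description = (
            f"{step_description}\n\n"
            f"A previous attempt failed validation with:\n{result['errors']}"
        )
    raise RuntimeError(f"Step failed after {max_retries + 1} attempts")
```

Bounding the retries matters: if a step cannot pass validation after a few attempts, the plan itself is probably wrong, and the right response is to revise the plan, not to keep hammering the execution agent.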
sequenceDiagram
participant H as Human
participant P as Planning Agent
participant E as Execution Agent
participant V as Validator
H->>P: Task + Codebase Context
P->>H: Structured Plan
H->>H: Review & Approve Plan
H->>E: Plan + Step 1
E->>V: Generated Code
V->>V: Run checks (tsc, eslint, tests)
alt Validation Passes
V->>E: Proceed to Step 2
else Validation Fails
V->>E: Error details + retry
E->>V: Revised code
end
E->>H: All steps complete
Comparing Approaches
| Aspect | Single-Phase (Plan + Execute Together) | Two-Phase (Plan, Then Execute) |
|---|---|---|
| Reliability | Degrades on multi-file tasks | Consistent across task sizes |
| Reviewability | Must read generated code to understand intent | Plan is reviewable before any code changes |
| Error recovery | Must undo partial changes and restart | Retry individual steps against the plan |
| Token efficiency | Sometimes fewer tokens for simple tasks | More tokens overall but fewer wasted on backtracking |
| Latency | Faster for trivial changes | Adds planning overhead (seconds to minutes) |
| Best for | Single-file edits, quick fixes | Multi-file features, refactors, architectural changes |
Real-World Considerations
When this pattern is overkill. If you are asking an agent to rename a variable or fix a typo, a planning phase adds latency with no benefit. Use the two-phase pattern for tasks that touch multiple files, introduce new abstractions, or change system behavior. A good heuristic: if you would write a design doc before doing it yourself, use the two-phase pattern.
Plan staleness. If your plan references file contents that change during execution (because earlier steps modified them), the plan can become stale. Mitigate this by re-reading file contents before each execution step rather than caching them from the planning phase.
Plan format matters. Structured formats (numbered steps, explicit file paths) work better than prose paragraphs. Consider having the planning agent output JSON or YAML if you want to parse the plan programmatically:
STRUCTURED_PLAN_PROMPT = """
...same as before, but add:

Output the plan as a JSON object with this schema:
{
  "summary": "string",
  "steps": [
    {
      "order": 1,
      "description": "string",
      "files_to_modify": ["path/to/file.ts"],
      "files_to_create": ["path/to/new_file.ts"],
      "validation": ["npx tsc --noEmit"]
    }
  ],
  "test_criteria": ["string"],
  "risks": ["string"]
}
"""
Context window pressure. Including the full plan in every execution prompt consumes tokens. For large plans, consider summarizing completed steps and only including the full detail for the current and adjacent steps.
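One way to relieve that pressure is a sliding window over the plan: full detail for the current step and its neighbors, one-line summaries for the rest. A sketch (`condensed_plan_context` is a hypothetical helper that assumes steps shaped like the JSON schema above):

```python
def condensed_plan_context(steps: list[dict], current: int, window: int = 1) -> str:
    """Keep full detail only for steps near the current one."""
    parts = []
    for step in steps:
        order = step["order"]
        if abs(order - current) <= window:
            # Full detail for the current step and its neighbors
            parts.append(
                f"Step {order}: {step['description']}\n"
                f"  files: {step.get('files_to_modify', []) + step.get('files_to_create', [])}\n"
                f"  validation: {step.get('validation', [])}"
            )
        else:
            # Distant steps get a cheap one-line summary (first sentence)
            summary = step["description"].split(". ")[0]
            parts.append(f"Step {order} (summary): {summary}")
    return "\n".join(parts)
```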
Do Not Skip Validation Between Steps
It is tempting to batch multiple execution steps together for speed. Resist this. If step 3 depends on step 2 and step 2 introduced a bug, batching means step 3 builds on a broken foundation. The validation loop between steps is what makes this pattern reliable. Without it, you are back to hoping the agent gets everything right in one shot.
Persisting plans across sessions. One practical challenge is that planning context gets lost when a session ends. If you generate a plan today and want to resume execution tomorrow, you need the plan — and the reasoning behind it — to be available. This is where persistent memory layers become useful: they let execution agents reference prior planning decisions without re-deriving them from scratch.
Further Reading and Tools
- HDDL 2.1: Towards Defining a Formalism and a Semantics for Temporal HTN Planning by Pellier et al. — Formalizes hierarchical task decomposition, which is the theoretical foundation for separating planning from execution in agent systems.
- Provenance-Based Assessment of Plans in Context by Friedman et al. — Explores how plan metadata (provenance) can be used to assess confidence and risk, relevant if you want to build plan-quality scoring into your pipeline.
- Anthropic's Claude Agent Guidelines — Anthropic's documentation on agentic tool use includes recommendations for structured multi-step workflows that align with the planning-execution separation.
- SuperLocalMemory — A persistent memory layer for AI agents that preserves planning context across sessions, letting execution agents reference prior decisions.
- LangGraph — A framework for building stateful agent workflows with explicit graph-based control flow, useful for implementing the plan-then-execute pattern with built-in state management.
Key Takeaways
- Separate planning from execution. Have the LLM produce a complete, structured plan before it writes any code. This eliminates the class of errors caused by premature execution decisions.
- Make plans specific and reviewable. A plan that says "update the auth logic" is useless. A plan that specifies file paths, function signatures, and execution order is actionable.
- Execute one step at a time with validation. Run checks (type checking, linting, tests) after each step. Feed errors back to the execution agent with the original plan as context.
- Include the full plan during execution. The execution agent needs the big picture to make locally correct decisions, even though it only implements one step at a time.
- Use this pattern selectively. Single-file fixes do not need a planning phase. Multi-file features, refactors, and architectural changes benefit enormously from it.
- Persist your plans. If execution spans multiple sessions, make sure the plan and its rationale survive. Lost context leads to inconsistent execution.