The standard agentic loop — give the agent a task, get code back — has no checkpoint between your intent and the agent's execution. You find out what the agent decided to do by reading what it already did. The CORE workflow closes that gap by splitting every agentic task into two sequential sessions: one that drafts a plan and stops, and one that executes the approved plan and returns a PR. Human judgment sits at the decision point between them — not at every tool call, not after the damage is done.
TL;DR
The agentic approval workflow has six stages:
- Write a structured task file with scope constraints and an explicit stop instruction
- Run a planning session — agent drafts a plan, writes it to disk, and stops
- Review the plan file and edit it directly if needed; whatever sits on disk becomes the approved plan
- Run a separate execution session that reads the approved plan and returns a diff
- Run a QA inspector agent to audit the diff for scope violations before merge
- Review the PR diff against the approved plan and merge
No micromanaging tool calls. Two human decision points. Every irreversible action is preceded by a plan you reviewed.
Goal
Build a repeatable workflow where AI coding agents handle implementation autonomously while human judgment gates two high-stakes checkpoints: what the agent is about to do (the plan), and what the agent actually did (the diff). The output of every task is a reviewable PR — not a clarification question, not a half-finished change, not a surprise.
The pattern is called the CORE workflow, shared by developers in r/ClaudeCode who wanted to "write tasks and come back to PRs, not decisions." It works with any Claude Code-compatible setup and requires no additional tooling to get started.
Prerequisites
- Claude Code installed and authenticated (claude --version)
- Git initialized in your project directory
- A consistent task file format (templates below)
- Optional: Grass for mobile approval forwarding when sessions run while you're away from your desk
Step 1: Write a Structured Task File
A task file is not a prompt. It's a contract between you and the agent. It specifies the objective, the scope boundaries, the constraints, and — critically — an explicit stop instruction that prevents the agent from executing before you've reviewed the plan.
```markdown
# TASK.md

## Objective
Add JWT-based authentication to the Express API.

## Scope
- In scope: src/middleware/, src/routes/api.ts, src/models/User.ts
- Out of scope: src/frontend/, package.json (list proposed dependency additions in PLAN.md only — do not install anything during the planning phase)

## Constraints
- Do not modify existing route signatures
- All new functions require a corresponding test
- No network calls during the planning phase

## Deliverable
Write your implementation plan to PLAN.md. List every file you will
create or modify with a one-line summary of each change. List any
dependencies you need to add with version numbers.

DO NOT write any source code. Write PLAN.md and stop.
```
The last line is the checkpoint instruction. Burying it inside the constraints section weakens it — agents prioritize recency, and the final instruction carries the most weight. Putting it in all-caps at the end of the file is not excessive; it's precise.
Scope files as a broader pattern: Developers running multiple concurrent agents extend this into a per-agent scope file architecture. As documented in a multi-agent ops thread on r/ClaudeCode, practitioners running seven agents across three concurrent projects maintain separate scope files (what each agent can touch), soul files (decision heuristics and style preferences), and guardrail files (hard stops and forbidden operations) for each agent. The QA inspector is the seventh agent — reviewing the outputs of the other six. We'll implement that pattern in Step 5.
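The thread doesn't prescribe an on-disk layout, but a plausible arrangement looks like the sketch below (directory and file names here are illustrative, not canonical):

```
agents/
├── backend-auth/
│   ├── SCOPE.md       # directories and files this agent may touch
│   ├── SOUL.md        # decision heuristics and style preferences
│   └── GUARDRAILS.md  # hard stops and forbidden operations
├── docs-writer/
│   └── ...            # same three files, scoped to docs/
└── qa-inspector/
    └── SCOPE.md       # read-only: plan files and git diffs
```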
Step 2: Run the Planning Session
Invoke Claude Code against your task file with a prompt that reinforces the stop instruction:
```bash
claude --model claude-sonnet-4-6 \
  "Read TASK.md carefully. Write your implementation plan to PLAN.md \
  exactly as instructed. Stop when PLAN.md is written. Do not modify \
  any other files."
```
The agent reasons through the implementation, writes PLAN.md, and exits. Your terminal returns. No source files were touched. No dependencies were installed.
This is the key architectural decision in the CORE workflow: planning and execution are separate invocations, not a single long session with an internal checkpoint. A single session with a "plan first, then wait for my approval" instruction is unreliable — the agent may misread its own prior output or interpret a follow-up message as implicit approval. Two sessions with an explicit handoff removes that ambiguity entirely.
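If you want that handoff enforced mechanically rather than by habit, a small wrapper script works. A minimal sketch; the script name and the pause-for-review step are conventions of this sketch, not part of the workflow spec:

```bash
#!/usr/bin/env bash
# plan-then-execute.sh: chain the two sessions with a hard human gate.
set -euo pipefail

# Session 1: plan only. The process exits after writing PLAN.md.
claude --model claude-sonnet-4-6 \
  "Read TASK.md carefully. Write your implementation plan to PLAN.md \
  exactly as instructed. Stop when PLAN.md is written."

# The human gate: nothing executes until you return to the terminal.
read -rp "Review and edit PLAN.md, then press Enter to execute (Ctrl-C aborts): "

# Session 2: execute whatever plan is on disk right now.
claude --model claude-sonnet-4-6 \
  "PLAN.md has been reviewed and approved. Execute it exactly as specified."
```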
Step 3: Review and Approve the Plan
Open PLAN.md and read it. This is the moment where you see the agent's intentions before they become actions.
A well-structured PLAN.md looks like:
```markdown
# Implementation Plan: JWT Authentication

## Files to create or modify
- src/middleware/auth.ts — CREATE; JWT verification middleware
- src/routes/api.ts — MODIFY lines ~45–60; add auth middleware to /api/data
- src/models/User.ts — MODIFY; add passwordHash field + bcrypt methods

## Proposed dependencies
- jsonwebtoken@9.0.0
- bcrypt@5.1.0

## New test files
- src/__tests__/auth.test.ts — CREATE; integration tests for token issuance and verification

## Execution sequence
1. Update User model with password hashing
2. Create auth middleware
3. Protect routes
4. Write tests
5. Create branch feat/jwt-auth and commit all changes
```
Review checklist before approving:
- [ ] All listed files are within the scope defined in TASK.md
- [ ] No files listed that you didn't expect
- [ ] Proposed dependencies are acceptable (version ranges, license)
- [ ] Execution sequence doesn't front-load destructive operations
- [ ] Test coverage is specified, not implied
If something's wrong, edit PLAN.md directly. You don't need another round-trip with the agent — the execution session reads what's on disk, not what the planning session originally proposed. Edit the plan, save it, and you've changed the execution contract.
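One optional convention, not required by the workflow: commit the plan at the moment you approve it, so the QA inspector and any later session can diff against exactly what you signed off on.

```bash
# Snapshot the approved plan; any later edits to PLAN.md show up as diffs
git add PLAN.md
git commit -m "Approve plan: JWT authentication"
```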
Step 4: Run the Execution Session
With PLAN.md reviewed and correct, trigger the execution session:
```bash
claude --model claude-sonnet-4-6 \
  "PLAN.md has been reviewed and approved. Execute it exactly as specified. \
  When complete: run the test suite, fix any failures, then create branch \
  feat/jwt-auth and commit all changes. Do not ask clarifying questions — \
  if you encounter ambiguity, make the conservative choice and note it in \
  a NOTES.md file."
```
This session runs autonomously. The agent implements what the plan specified, runs tests, and produces a clean branch. You don't watch it. You don't approve individual tool calls. The plan review was your checkpoint.
As Martin Fowler notes in his exploration of humans and agents in software engineering loops, the practical design question is where human judgment belongs in the loop — not whether to include it. Inserting humans at every bash execution kills the productivity gain. Inserting them at plan review and diff review keeps judgment at the decisions that actually matter.
One caveat on session length: Autonomy instructions erode as context accumulates. Complex executions — especially those involving 15 or more tool calls — are where agents start inserting unsolicited check-ins or silently revising the plan toward a "safer" interpretation that wasn't requested. See Why Your Claude Agent Ignores Rules Past ~15 Tool Calls for the root cause analysis. For large tasks, break the execution into stages (each with its own PLAN.md equivalent) and run each stage as a separate session.
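A sketch of that staged variant, assuming the plan has been split into per-stage files (PLAN-1.md, PLAN-2.md, and so on are illustrative names):

```bash
# One session per stage keeps each run short enough that autonomy
# instructions stay intact; you review between stages.
for plan in PLAN-1.md PLAN-2.md PLAN-3.md; do
  claude --model claude-sonnet-4-6 \
    "$plan has been reviewed and approved. Execute it exactly as specified, then stop."
  read -rp "Stage $plan complete. Review the diff, then press Enter to continue: "
done
```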
Step 5: Run the QA Inspector
The QA inspector is the oversight layer most people skip. It's a separate Claude Code invocation — the "seventh agent" in the multi-agent framework — that reviews the execution diff before you look at it manually. It catches silent scope expansion and security patterns that are easy to miss in a line-by-line diff review.
```bash
claude --model claude-sonnet-4-6 \
  "You are a QA inspector. Review the output of 'git diff main...feat/jwt-auth'.
Check for:
1. Scope violations — files modified that are NOT listed in PLAN.md
2. Security patterns — hardcoded secrets, unvalidated user input in
   SQL queries or shell commands, missing input sanitization
3. Coverage gaps — new functions without corresponding test assertions
4. Dependency drift — packages installed that are not in PLAN.md

Output format:
STATUS: PASS or FAIL
If FAIL, list each issue as:
- [SCOPE|SECURITY|COVERAGE|DEPENDENCY] file/path:line — description"
```
Give the inspector a specific rubric. A vague instruction ("check for security issues") produces vague output that doesn't catch real problems. Specific patterns — "unvalidated user input in SQL queries," "hardcoded secrets matching /api[_-]?key|token/i" — produce actionable line references.
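For calibration, here is what a useful FAIL report looks like in that format (file paths and line numbers are invented for illustration):

```
STATUS: FAIL
- [SCOPE] src/frontend/LoginForm.tsx:1 — modified but not listed in PLAN.md
- [SECURITY] src/middleware/auth.ts:42 — JWT secret hardcoded as a string literal
- [COVERAGE] src/models/User.ts:88 — new comparePassword() has no test assertions
```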
Step 6: Review the Diff and Merge
Once the QA inspector passes, review the diff yourself:
```bash
git diff main...feat/jwt-auth
```
Everything in the diff should match what PLAN.md specified. If there are surprises — files you didn't expect, changes outside the planned scope — check the session's transcript before deciding whether to merge or roll back. The post How to Review AI-Generated Code That Ships Faster Than You Can Read covers the four-checkpoint review framework for diffs that move faster than you can read them linearly.
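Before reading line by line, a quick mechanical scope check helps. A sketch that assumes PLAN.md names changed file paths verbatim:

```bash
# Flag any file changed on the branch that the approved plan never mentions
for f in $(git diff --name-only main...feat/jwt-auth); do
  grep -qF "$f" PLAN.md || echo "NOT IN PLAN: $f"
done
```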
Verification: Is the Workflow Running Correctly?
A correctly functioning workflow has these observable properties:
- PLAN.md exists and is timestamped before any source files in the execution branch are modified (a quick spot-check follows this list)
- The QA inspector produces output with specific file paths and line numbers, not generic statements
- Every changed file in the PR diff is listed in PLAN.md
- Long executions are split into discrete stages — no single session runs more than 15–20 tool calls before a checkpoint
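A rough spot-check for the first property (GNU stat shown; macOS uses stat -f '%Sm %N'):

```bash
# PLAN.md's modification time should predate the branch's first commit
stat -c '%y %n' PLAN.md
git log --reverse --format='%ci %h %s' main..feat/jwt-auth | head -n 1
```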
If your execution session is producing PLAN.md and modifying source files in the same session, the planning checkpoint isn't holding. The stop instruction needs to be more explicit, or you need to run the planning invocation with a more restricted scope (e.g., read-only tool access).
Troubleshooting
Agent starts executing during the planning session
The stop instruction needs to be the last line of the prompt, not embedded in a constraint list. Make it explicit and terminal: "Write PLAN.md and stop. Do not modify any source files under any circumstances." Some developers run the planning session with a .claude/settings.json that sets "allowedTools": ["Read", "Write"] — limiting the session to file reads and the single PLAN.md write.
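As a sketch, that settings file would look like the following. The allowedTools key is as the thread describes it; verify it against your Claude Code version's settings schema before relying on it:

```json
{
  "allowedTools": ["Read", "Write"]
}
```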
Agent drifts from the approved plan during execution
This is the context erosion problem documented in Why Your Claude Agent Ignores Rules Past ~15 Tool Calls. The most reliable fix: include the full text of PLAN.md in the execution prompt body, not just a reference to the file. As context accumulates, the agent's attention to the original file reference degrades; quoted plan text in the prompt is more durable.
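A sketch of that inlining, using shell command substitution (assumes a plan short enough to sit comfortably in the prompt):

```bash
claude --model claude-sonnet-4-6 \
  "The following plan has been reviewed and approved. Execute it exactly as written.

$(cat PLAN.md)

When complete: run the test suite, fix any failures, then commit to feat/jwt-auth."
```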
QA inspector flags false positives
Tighten the rubric. Replace "check for security issues" with specific patterns the inspector should match against. Provide an example of a PASS output and a FAIL output in the prompt so the inspector has a calibration reference before it starts reviewing.
Subagents don't follow the plan
When Claude Code spawns subagents via Task() calls internally, those agents don't inherit the parent session's PLAN.md context. Pass the relevant sections of PLAN.md explicitly in the subagent's task description. For tool-level control during execution, How to Build Human-in-the-Loop Approval Gates for AI Coding Agents covers the PreToolUse hook layer that sits beneath the plan-review layer described here.
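You can't call Task() yourself, so the fix lives in the execution prompt. An illustrative addition:

```bash
claude --model claude-sonnet-4-6 \
  "PLAN.md has been reviewed and approved. Execute it exactly as specified. \
  If you delegate any work to subagents, copy the relevant PLAN.md section \
  verbatim into each subagent's task description; subagents cannot see \
  this session's context."
```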
The Subagent Visibility Gap
The six-step workflow above closes the plan-review checkpoint cleanly. There's one scenario it doesn't fully handle: subagents running inside an execution session.
When your primary agent spawns subagents via internal Task() calls, you lose three things simultaneously. As one developer described in a r/ClaudeCode thread on subagent usage: "can't see diffs like main agent, difficult to interrupt, auto-allow means no feedback loop when denying permissions."
The structural fix: treat subagents as separate sequential sessions, each with their own scoped PLAN.md, rather than concurrent fire-and-forget spawns. Each subagent task gets its own planning checkpoint before execution begins. This makes the subagent outputs independently reviewable rather than folded invisibly into the parent session's diff.
As breyta.ai's overview of human-in-the-loop coding agents notes: "Human-in-the-loop workflows add planned human checkpoints to agent runs. Coding agents do the work. People approve, correct, or supply context when it matters." The planning session is where context is supplied — not mid-execution.
Inference.sh's approval gates documentation captures the design principle directly: "Approval gates are not a limitation on agent capability. They are what makes powerful capabilities safe to deploy. The combination of automation for routine actions and oversight for consequential ones gives you the benefits of both without the risks of either alone." The CORE workflow operationalizes this by making plan review the non-negotiable gate for every consequential action.
How Grass Makes This Workflow Better
The six-step workflow above runs entirely from your terminal. It works well when you're at your desk. Where it breaks down: the execution session is running, it hits an unexpected permission gate — a file write outside the planned scope, a bash command with side effects you didn't anticipate — and you're not there.
In a standard setup, that means either the session blocks indefinitely waiting for a terminal response, or auto-allow mode runs the operation through and you find out about it only when you read the diff.
How Grass closes the gap:
Grass is a machine built for AI coding agents. The CLI (npm install -g @grass-ai/ide) runs a local HTTP server that bridges your running Claude Code sessions to a native iOS app. Permission requests from any active session — including subagent sessions that would normally be invisible — surface as modal notifications on your phone. You see the exact tool name, the exact command or file path, and tap Allow or Deny. No terminal polling. No SSH session open on your phone.
```bash
npm install -g @grass-ai/ide
grass start
# Scan the QR code with the Grass iOS app
# Running sessions now forward permission requests to your phone
```
The Grass diff viewer shows git diff HEAD parsed into per-file views with syntax-highlighted additions and deletions — the same review you'd run manually in step 6, accessible from your phone as soon as the execution session finishes. You can approve the plan in step 3 and review the output diff in step 6 from the same surface, without being at your laptop for either.
For sessions that run over long time windows — overnight builds, tasks dispatched during your commute — Grass runs on an always-on cloud VM per user. Agents don't die when your laptop sleeps. The execution session you dispatched at 8am is still running when you check from your phone at noon.
The Grass server also supports a mode parameter in the chat API: "plan" (planning only, no tool execution) and "build" (execution mode). This maps directly to the CORE workflow's two-session split — you can trigger the planning session and the execution session from your phone without re-entering prompts.
Setup (optional enhancement to the core workflow): install the CLI and pair the iOS app as shown above. Free tier at codeongrass.com: 10 hours, no credit card required.
FAQ
How do I prevent my AI coding agent from running code before I've reviewed the plan?
Split the task into two separate Claude Code invocations: a planning session that writes PLAN.md and stops, and an execution session that reads the approved PLAN.md and implements it. The stop instruction must be the final, explicit directive in your planning prompt: "Write PLAN.md and stop. Do not modify any source files." Running them as separate sessions removes ambiguity about when execution begins — there is no way for the agent to "accidentally" continue into execution when the session has already exited.
What is the CORE agentic approval workflow?
CORE is a task management pattern for AI coding agents: write a structured task file → agent drafts a plan → human reviews and approves → a separate session executes the plan → session returns a reviewable PR. The key property is that the agent never writes source code without a human-reviewed plan on disk. The pattern was shared by developers in r/ClaudeCode who wanted to dispatch tasks and return to PRs rather than mid-task decision prompts.
What is the 7-agent framework and what is the QA inspector's role?
A multi-agent architecture documented by developers running concurrent agents across multiple projects. Each agent gets three files: a scope file (what directories and files it can touch), a soul file (decision heuristics and style preferences), and a guardrail file (hard stops and forbidden operations). The seventh agent is a QA inspector that runs after execution and before the diff is merged — reviewing every other agent's output for scope violations, security patterns, and coverage gaps. It's the oversight layer that most multi-agent setups omit.
How do I handle permission requests from subagents I can't monitor directly?
The structural fix is to avoid concurrent subagent spawning. Run subagents as sequential, scoped tasks — each with its own plan checkpoint — so you can review between runs. For live permission forwarding from active subagent sessions, Grass surfaces all permission requests (including from subagents) as mobile notifications that can be allowed or denied without a terminal.
What should the QA inspector check in a post-execution diff?
At minimum: scope violations (files modified that weren't in the approved PLAN.md), security patterns (hardcoded secrets, unvalidated input in SQL or shell commands, missing sanitization), coverage gaps (new functions without corresponding test assertions), and dependency drift (packages installed that weren't in the plan). Always give the inspector a specific, enumerated rubric rather than a generic "check for issues" instruction — vague prompts produce vague output that doesn't catch real problems.
Next steps: The plan-review layer handles upstream scope control. For the tool-level enforcement layer beneath it — blocking specific bash patterns, allowlisting file operations, and forwarding live approval gates — see How to Build Human-in-the-Loop Approval Gates for AI Coding Agents. The two layers are complementary: plan review catches scope before execution starts; PreToolUse hooks catch dangerous operations during execution. Run both and you've covered both checkpoints.
Originally published at codeongrass.com