tl;dr - Plan Mode, Outcomes, Skills, and agent-as-judge workflows point toward a shared pattern: reliable coding agents depend less on a single prompt and more on the planning, steering, memory, and verification systems around the model.
A coding agent can be safe and still produce the wrong work.
A typical failure is not catastrophic behavior. It is a smaller mismatch between the requested change and the implemented change: an issue asks for a sliding-window rate limiter on /api/upload, and the agent implements a token bucket on /api/files, adds an unnecessary configuration flag, and updates the docs around that mistaken interpretation. The tests may still pass.
The problem is not model safety in the broad sense. It is intent alignment inside a specific repository, under incomplete requirements, with constraints that may not have been written into the prompt.
The current tooling suggests that the fix is no longer just waiting for a smarter model. Plans, skills, persistent instructions, rubrics, judge agents, review agents, tasks, and memory are all becoming part of the same system.
Taken together, these tools suggest that alignment is becoming less concentrated in the model itself and more dependent on the control plane around it: the artifacts, rubrics, reviewers, memories, and human decision points that shape the agent's work.
The old prompt loop is too narrow
The default workflow still treats intent as something that can be captured in one exchange:
- Give the agent a long prompt, a /goal, or a bundle of instructions.
- Let it edit files, run tests, and possibly spawn reviewers.
- Review the resulting diff.
- Explain what it misunderstood.
- Repeat until the result is acceptable.
That loop made sense when the unit of work was one chat session and one assistant. It starts to break once agents can run for hours, touch many files, write tests, and produce changes faster than a human can review them.
A final pull request is not a large enough channel for intent in that setting.
By the time the diff exists, the agent has already made several important decisions: which files matter, which constraints are optional, which existing pattern to copy, what "done" means, and how to behave when tests are missing. If those decisions are wrong, review becomes a reconstruction exercise rather than a steering mechanism.
The newer tools address that problem by making the agent externalize its assumptions before, during, and after the work.
First, make the plan editable
Cursor's Plan Mode makes the agent research, ask clarifying questions, and write an editable Markdown plan before it changes code. Cognition's Devin 2.0 puts a preliminary plan and the relevant files in front of the user early. JetBrains Junie, GitHub Spec Kit, and AWS Kiro are all moving toward a similar sequence:
requirements.md → plan.md → tasks → code
The important part of this sequence is that it moves misunderstanding earlier in the process.
If the agent misunderstands the request at the plan stage, the correction is usually a sentence in a Markdown file. If it misunderstands the request during implementation, the correction requires diff review, rework, and cleanup of side effects that may already have spread across the codebase.
A useful plan makes the important constraints inspectable before code exists: use /api/upload, not /api/files; avoid a new configuration flag; reuse the existing limiter; add the regression test first.
The plan is the first alignment surface.
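One way to keep those constraints inspectable is to give them a fixed place in the plan file, so that people and later checks read the same list. The sketch below assumes a hypothetical plan.md layout with a "## Constraints" heading followed by bullets. The layout and the function name `parse_constraints` are illustrative assumptions, not a format defined by Cursor, Devin, Junie, Spec Kit, or Kiro.

```python
def parse_constraints(plan_markdown: str) -> list[str]:
    """Return the bullet items under a '## Constraints' heading.

    The heading name and bullet layout are assumptions for illustration,
    not a format defined by any planning tool.
    """
    constraints: list[str] = []
    in_section = False
    for raw_line in plan_markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("#"):
            # Any heading either opens or closes the constraints section.
            in_section = line.lstrip("#").strip().lower() == "constraints"
        elif in_section and line.startswith("- "):
            constraints.append(line[2:].strip())
    return constraints


# Hypothetical plan text built from the constraints in the rate-limiter example.
PLAN = """\
# Plan: rate-limit uploads

## Constraints
- Use /api/upload, not /api/files
- Do not add a new configuration flag
- Reuse the existing limiter
- Add the regression test first

## Steps
- Write failing test
"""

constraints = parse_constraints(PLAN)
```

Keeping constraints in one predictable section means a human edit to the plan is also an edit to whatever checks read it.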
Steering has to continue during the run
A plan does not remove drift from long agent runs.
Anthropic's "Measuring AI agent autonomy in practice" reported that, on the most complex tasks, Claude Code stops to ask for clarification more often than humans interrupt it. That behavior is not only a UX detail. It is a control mechanism.
A useful agent should be able to identify when it is guessing.
Task lists serve the same purpose. They are not only progress reporting. They expose the agent's decomposition while the work is still in progress:
- [ ] Add upload rate-limit middleware
- [ ] Create new limiter config flag
- [ ] Update /api/files tests
The second and third items show the misunderstanding while it is still cheap to correct. Waiting until the final diff makes the same issue harder to isolate.
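A rough version of that check can be automated. The sketch below assumes the approved plan has already been reduced to a set of in-scope routes and a list of forbidden phrases. `PlanScope`, `flag_drift`, and the route regex are hypothetical names for illustration, and plain string matching stands in for whatever comparison a real tool would use.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PlanScope:
    """Hypothetical summary of what the approved plan allows."""
    allowed_routes: set[str]
    forbidden_phrases: list[str] = field(default_factory=list)


# Assumption: routes in task text look like /api/<something>.
ROUTE_PATTERN = re.compile(r"/api/[\w/-]+")


def flag_drift(tasks: list[str], scope: PlanScope) -> dict[str, list[str]]:
    """Map each suspicious task to the reasons it looks out of scope."""
    flagged: dict[str, list[str]] = {}
    for task in tasks:
        reasons = []
        for route in ROUTE_PATTERN.findall(task):
            if route not in scope.allowed_routes:
                reasons.append(f"route {route} is not in the plan")
        lowered = task.lower()
        for phrase in scope.forbidden_phrases:
            if phrase in lowered:
                reasons.append(f"plan forbids '{phrase}'")
        if reasons:
            flagged[task] = reasons
    return flagged


scope = PlanScope(allowed_routes={"/api/upload"}, forbidden_phrases=["config flag"])
tasks = [
    "Add upload rate-limit middleware",
    "Create new limiter config flag",
    "Update /api/files tests",
]
flagged = flag_drift(tasks, scope)
```

With the task list from the example, the first item passes and the other two are flagged. That is enough to trigger a question to the human before any code is written.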
Parallel skills make this more important. A frontend skill, migration skill, test-writing skill, and reviewer agent can all run at once. That can increase throughput, but it also creates more places where the system can choose a different definition of done unless each worker exposes its assumptions, tasks, and intermediate state.
Once a workflow involves multiple agents, chat and PR review are not enough. The system needs visibility into what each agent believes it is doing while the work is still happening.
Intent has to survive the session
The next problem is memory.
This is not memory as user personalization, and it is not only a library of reusable skills. Procedures are useful, but the more important layer is actionable memory: what the system learns from what happened in previous runs.
Every serious agent run leaves evidence behind. The plan it wrote, the files it opened, the assumptions it made, the commands it ran, the reviewer comments it ignored, and the human correction that finally made the work acceptable are all useful signals.
That evidence should not disappear into chat history. It can become a learning loop:
run logs → candidate lessons → evaluator → behavior updates → future runs
After the rate-limiter failure, the lesson is not simply "write a rate-limiter skill." The useful lesson is narrower: when an issue names an endpoint, verify the route before editing; do not introduce product-facing configuration unless the plan explicitly asks for it; if the requested algorithm and the existing pattern disagree, stop and ask.
Those are behavior updates. They can land in project instructions, planner rubrics, reviewer checks, interruption policies, or sometimes a skill. They should also be evaluated before they change future behavior. The evaluation should ask whether the lesson is supported by the logs, scoped narrowly enough to be safe, and likely to prevent the same class of mistake.
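The first two of those questions can be approximated mechanically. The sketch below is a minimal gate under stated assumptions: `CandidateLesson`, `evaluate_lesson`, the `min_evidence` threshold, and the run IDs are all hypothetical. The third question, whether the lesson is likely to prevent the same class of mistake, is deliberately left to a human or a judge agent.

```python
from dataclasses import dataclass


@dataclass
class CandidateLesson:
    text: str
    trigger: str                 # when the lesson applies; empty means "always"
    evidence_run_ids: list[str]  # runs the lesson claims as support


def evaluate_lesson(
    lesson: CandidateLesson,
    failed_run_ids: set[str],
    min_evidence: int = 2,  # assumed threshold, tune per team
) -> tuple[bool, list[str]]:
    """Return (accepted, reasons_for_rejection)."""
    problems = []
    # Supported by the logs: cited runs must be logged failures.
    supported = [r for r in lesson.evidence_run_ids if r in failed_run_ids]
    if len(supported) < min_evidence:
        problems.append(f"only {len(supported)} logged failure(s) support it")
    # Scoped narrowly: a lesson with no trigger would fire on every task.
    if not lesson.trigger.strip():
        problems.append("no trigger: the lesson would apply to every task")
    return (not problems, problems)


failed = {"run-101", "run-117"}  # made-up run IDs
narrow = CandidateLesson(
    text="Verify the route named in the issue before editing.",
    trigger="issue names an endpoint",
    evidence_run_ids=["run-101", "run-117"],
)
broad = CandidateLesson(
    text="Always ask before coding.",
    trigger="",
    evidence_run_ids=["run-101"],
)
narrow_result = evaluate_lesson(narrow, failed)
broad_result = evaluate_lesson(broad, failed)
```

The narrow lesson passes. The broad one is rejected on both counts, which keeps a single bad run from turning into a blanket instruction.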
This also changes how improvement can be measured. Keeping logs, lessons, and evaluation outcomes makes it possible to ask whether scope drift is decreasing, whether reviewers are catching fewer repeat mistakes, whether human corrections are getting smaller, and whether a memory update reduced errors or only added prompt noise.
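As a minimal sketch of that measurement, assume each run is stored with a flag for whether it happened after a given memory update, whether a reviewer caught a repeat mistake, and how large the human correction was. The `RunRecord` fields, the `summarize` function, and the sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    run_id: str
    after_update: bool     # did the run happen after the memory update?
    repeat_mistake: bool   # did a reviewer flag an already-known mistake?
    correction_lines: int  # size of the human correction, in changed lines


def summarize(runs: list[RunRecord], after_update: bool) -> dict[str, float]:
    """Summarize the runs on one side of the memory update."""
    group = [r for r in runs if r.after_update == after_update]
    if not group:
        return {"runs": 0, "repeat_rate": 0.0, "avg_correction_lines": 0.0}
    return {
        "runs": len(group),
        "repeat_rate": sum(r.repeat_mistake for r in group) / len(group),
        "avg_correction_lines": sum(r.correction_lines for r in group) / len(group),
    }


# Invented sample data, not measurements from any real system.
runs = [
    RunRecord("a", after_update=False, repeat_mistake=True, correction_lines=40),
    RunRecord("b", after_update=False, repeat_mistake=True, correction_lines=20),
    RunRecord("c", after_update=True, repeat_mistake=False, correction_lines=5),
    RunRecord("d", after_update=True, repeat_mistake=True, correction_lines=15),
]
before = summarize(runs, after_update=False)
after = summarize(runs, after_update=True)
```

A before/after comparison like this is crude, and with few runs it proves little. It still turns "did the memory update help?" into a question with an answer instead of an impression.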
Memory without evaluation is mostly accumulation: it grows the prompt without showing that behavior improved. Memory tied to logs and evals becomes an improvement system.
Verification has to check intent
Tests describe whether the code behaves according to the cases that were written down.
They do not necessarily show whether the agent built the requested thing.
That gap is why Anthropic's Outcomes primitive is interesting. An Outcome is a Markdown rubric that a separate agent can use to grade the result in a fresh context. The planner can loop until the rubric is satisfied. The relevant shift is that the system does not only ask whether a command passed. It asks whether the result satisfies criteria that were written before implementation started.
For the rate limiter example, a useful rubric might say:
The implementation is successful if:
- /api/upload enforces a sliding-window limit per authenticated user.
- No new product-facing configuration is introduced.
- The change does not affect /api/files.
That gives the reviewer a contract. The review is no longer based only on whether the diff looks reasonable.
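The build-and-grade loop around such a rubric can be sketched as follows. This is not Anthropic's API. It is a minimal illustration in which the rubric is a list of strings, the builder and judge are plain callables, and every name (`build_until_satisfied`, `stub_build`, `stub_judge`) is hypothetical. In a real system the judge would be a separate agent in a fresh context, not a string match.

```python
from typing import Callable

Rubric = list[str]
# A judge takes (rubric, artifact) and returns the criteria that are NOT met.
Judge = Callable[[Rubric, str], list[str]]
# A builder takes the unmet criteria from the last round and returns a new artifact.
Builder = Callable[[list[str]], str]


def build_until_satisfied(
    rubric: Rubric, build: Builder, judge: Judge, max_rounds: int = 3
) -> tuple[str, list[str], int]:
    """Loop build -> judge until the rubric passes or rounds run out.

    Returns (last_artifact, unmet_criteria, rounds_used).
    """
    unmet: list[str] = list(rubric)
    artifact = ""
    for round_number in range(1, max_rounds + 1):
        artifact = build(unmet)
        unmet = judge(rubric, artifact)
        if not unmet:
            return artifact, [], round_number
    return artifact, unmet, max_rounds


RUBRIC = [
    "/api/upload enforces a sliding-window limit per authenticated user",
    "No new product-facing configuration is introduced",
    "The change does not affect /api/files",
]

# Stand-ins for real agents: the first attempt drifts, the second does not.
attempts = iter(["touches /api/files", "touches /api/upload only"])


def stub_build(unmet: list[str]) -> str:
    return next(attempts)


def stub_judge(rubric: Rubric, artifact: str) -> list[str]:
    # Crude placeholder check for the third criterion only.
    return [rubric[2]] if "/api/files" in artifact else []


artifact, unmet, rounds = build_until_satisfied(RUBRIC, stub_build, stub_judge)
```

The round cap matters: when the loop ends with criteria still unmet, the remaining list is what gets handed to a human rather than retried indefinitely.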
The Agent-as-a-Judge survey points in the same direction: judge agents become more reliable when they observe intermediate steps instead of grading only the final answer. Observation matters because the review agent needs to understand how the work happened, not only inspect the final diff.
One agent builds. Another checks. A rubric defines the target. The human still decides which failures matter.
The stack is converging
Plan. Steer. Remember. Verify.
These used to look like separate research and product threads. They are now becoming the standard shape of the agent stack.
| Layer | Artifact | What it protects |
|---|---|---|
| Plan | plan.md, .cursor/plans/, Spec Kit, Playbooks | The agent's understanding before code |
| Steer | Tasks, clarification interrupts, live progress | The trajectory during work |
| Remember | Run logs, evaluated lessons, behavior updates, memory systems | Learning from previous work across sessions |
| Verify | Outcomes, rubrics, judge agents, review agents | Whether the result matches intent |
The important shift is that agent reliability becomes a systems problem, not only a prompting problem.
The plan is an interface. The skill is an interface. The task list is an interface. The rubric is an interface. The reviewer prompt is an interface. Each one carries intent from one part of the system to another.
Designing those interfaces is software design applied to a different set of components.
What this means for engineering teams
Model providers can ship better models. Tool vendors can ship better planners, spec workflows, playbooks, and review systems.
Those tools still do not know what correctness means inside a specific codebase.
They do not know that the billing system treats retries differently for enterprise customers. They do not know that migrations must be generated through drizzle:generate rather than written as raw SQL. They do not know which product constraints are implicit because everyone on the team has internalized them.
That knowledge has to live somewhere durable and usable.
If it lives only in a person's head, the agent will miss it. If it lives only in chat, it will disappear. If it appears only in final PR review, it will arrive after many decisions have already been made.
The practical work is straightforward and valuable:
- Write plans the team can edit.
- Preserve run logs, reviewer findings, and human corrections.
- Turn repeated failures into scoped candidate lessons.
- Evaluate those lessons before changing the agent's behavior.
- Make success criteria explicit before implementation.
- Measure whether memory updates reduce drift, repeat mistakes, and human correction.
This is the control plane for coding agents.
The control plane is becoming the main editing surface
For years, the editor was where most software work happened. Code changed, the compiler or test suite responded, and the feedback loop stayed local and visible.
Coding agents move part of that loop up a level.
Important edits are now often made outside handler.ts: in plan.md, SKILL.md, CLAUDE.md, task lists, rubrics, and reviewer instructions. Those files decide what the agent sees, what it is allowed to change, how it reports progress, and how the result gets judged.
The model is still important, but it is not the whole product. The product is the system around it: artifacts, reviewers, memories, checks, and human decision points that convert model capability into reliable work.
The practical advantage comes from designing that surrounding system well.