Mixture of Experts

Posted on May 26

Alignment is moving into the agent control plane

#ai #agents #coding #programming

tl;dr - Plan Mode, Outcomes, Skills, and agent-as-judge workflows point toward a shared pattern: reliable coding agents depend less on a single prompt and more on the planning, steering, memory, and verification systems around the model.

A coding agent can be safe and still produce the wrong work.

A typical failure is not catastrophic behavior. It is a smaller mismatch between the requested change and the implemented change: an issue asks for a sliding-window rate limiter on /api/upload, and the agent implements a token bucket on /api/files, adds an unnecessary configuration flag, and updates the docs around that mistaken interpretation. The tests may still pass.
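To see why the tests may still pass, here is a minimal sketch of the two algorithms side by side. The class names, the per-key `allow` interface, and the parameters (2 requests per 10 seconds) are illustrative assumptions, not code from any real repository. A shallow test that only checks that a third quick request is rejected passes for both limiters, while a later request exposes the difference.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have aged out of the trailing window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


class TokenBucketLimiter:
    """Refill `rate` tokens per second up to `capacity`; each request spends one."""

    def __init__(self, capacity: int, rate: float) -> None:
        self.capacity = capacity
        self.rate = rate
        self._state: dict[str, tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, key: str, now: float) -> bool:
        tokens, last = self._state.get(key, (float(self.capacity), now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1:
            self._state[key] = (tokens, now)
            return False
        self._state[key] = (tokens - 1, now)
        return True


def replay(limiter, times: list[float]) -> list[bool]:
    """Send one request per timestamp for a single hypothetical user."""
    return [limiter.allow("user-1", t) for t in times]


# Both limiters reject the third request at t=2, so a shallow test passes for either.
# At t=6 the token bucket has refilled enough to admit a request; the window has not moved on.
sliding = replay(SlidingWindowLimiter(limit=2, window_s=10), [0, 1, 2, 6])
bucket = replay(TokenBucketLimiter(capacity=2, rate=0.2), [0, 1, 2, 6])
```

With these assumed parameters, `sliding` and `bucket` agree on the first three requests and disagree on the fourth, which is the kind of gap a test suite written around the wrong interpretation would not catch.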

The problem is not model safety in the broad sense. It is intent alignment inside a specific repository, under incomplete requirements, with constraints that may not have been written into the prompt.

The current tooling suggests that the fix is no longer just waiting for a smarter model. Plans, skills, persistent instructions, rubrics, judge agents, review agents, tasks, and memory are all becoming part of the same system.

Taken together, these tools suggest that alignment is becoming less concentrated in the model itself and more dependent on the control plane around it: the artifacts, rubrics, reviewers, memories, and human decision points that shape the agent's work.

The old prompt loop is too narrow

The default workflow still treats intent as something that can be captured in one exchange:

1. Give the agent a long prompt, a /goal, or a bundle of instructions.
2. Let it edit files, run tests, and possibly spawn reviewers.
3. Review the resulting diff.
4. Explain what it misunderstood.
5. Repeat until the result is acceptable.

That loop made sense when the unit of work was one chat session and one assistant. It starts to break once agents can run for hours, touch many files, write tests, and produce changes faster than a human can review them.

A final pull request is not a large enough channel for intent in that setting.

By the time the diff exists, the agent has already made several important decisions: which files matter, which constraints are optional, which existing pattern to copy, what "done" means, and how to behave when tests are missing. If those decisions are wrong, review becomes a reconstruction exercise rather than a steering mechanism.

The newer tools address that problem by making the agent externalize its assumptions before, during, and after the work.

First, make the plan editable

Cursor's Plan Mode makes the agent research, ask clarifying questions, and write an editable Markdown plan before it changes code. Cognition's Devin 2.0 puts a preliminary plan and the relevant files in front of the user early. JetBrains Junie, GitHub Spec Kit, and AWS Kiro are all moving toward a similar sequence:

requirements.md → plan.md → tasks → code

The important part of this sequence is that it moves misunderstanding earlier in the process.

If the agent misunderstands the request at the plan stage, the correction is usually a sentence in a Markdown file. If it misunderstands the request during implementation, the correction requires diff review, rework, and cleanup from side effects that may already have spread across the codebase.

A useful plan makes the important constraints inspectable before code exists: use /api/upload, not /api/files; avoid a new configuration flag; reuse the existing limiter; add the regression test first.

The plan is the first alignment surface.

Steering has to continue during the run

A plan does not remove drift from long agent runs.

Anthropic's report "Measuring AI agent autonomy in practice" found that, on complex tasks, Claude Code pauses to ask for clarification more often than humans interrupt it. That behavior is not only a UX detail. It is a control mechanism.

A useful agent should be able to identify when it is guessing.

Task lists serve the same purpose. They are not only progress reporting. They expose the agent's decomposition while the work is still in progress:

- [ ] Add upload rate-limit middleware
- [ ] Create new limiter config flag
- [ ] Update /api/files tests

The second and third items show the misunderstanding while it is still cheap to correct. Waiting until the final diff makes the same issue harder to isolate.
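One way to treat a task list as a steering signal is to compare it against the scope already agreed in the plan. The sketch below is a hypothetical check, not a feature of any tool named here: `parse_tasks`, `flag_tasks`, and the list of out-of-scope terms are assumptions, and plain substring matching is deliberately crude.

```python
import re

# Matches Markdown task items such as "- [ ] Do something" or "- [x] Done thing".
TASK_RE = re.compile(r"^- \[( |x|X)\] (.+)$")


def parse_tasks(markdown: str) -> list[tuple[bool, str]]:
    """Return (done, text) pairs for every task-list line in the input."""
    tasks = []
    for line in markdown.splitlines():
        match = TASK_RE.match(line.strip())
        if match:
            tasks.append((match.group(1).lower() == "x", match.group(2)))
    return tasks


def flag_tasks(tasks: list[tuple[bool, str]],
               out_of_scope_terms: list[str]) -> list[tuple[str, list[str]]]:
    """Return tasks whose text mentions a term the plan ruled out."""
    flagged = []
    for _done, text in tasks:
        hits = [term for term in out_of_scope_terms if term.lower() in text.lower()]
        if hits:
            flagged.append((text, hits))
    return flagged


task_list = """\
- [ ] Add upload rate-limit middleware
- [ ] Create new limiter config flag
- [ ] Update /api/files tests
"""

# Terms taken from the plan constraints: wrong endpoint, no new configuration flag.
flagged = flag_tasks(parse_tasks(task_list), ["/api/files", "config flag"])
```

Here `flagged` contains the second and third tasks and leaves the first alone. A real check would need something better than substring matching, but even this version raises the question before any code is written.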

Parallel skills make this more important. A frontend skill, migration skill, test-writing skill, and reviewer agent can all run at once. That can increase throughput, but it also creates more places where the system can choose a different definition of done unless each worker exposes its assumptions, tasks, and intermediate state.

Once a workflow involves multiple agents, chat and PR review are not enough. The system needs visibility into what each agent believes it is doing while the work is still happening.
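As a rough illustration of that visibility, suppose each worker reports its working assumptions as key-value pairs. The report format, the worker names, and `find_conflicts` are all hypothetical; the point is only that disagreements about the definition of done become detectable once assumptions are written down in a shared shape.

```python
from collections import defaultdict


def find_conflicts(reports: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Return assumption keys on which workers disagree, as key -> value -> workers."""
    by_key: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for worker, assumptions in reports.items():
        for key, value in assumptions.items():
            by_key[key][value].append(worker)
    # Keep only the keys that have more than one distinct value.
    return {key: dict(values) for key, values in by_key.items() if len(values) > 1}


# Hypothetical reports from three parallel workers on the rate-limiter task.
reports = {
    "frontend-skill": {"endpoint": "/api/upload", "algorithm": "sliding-window"},
    "test-writer": {"endpoint": "/api/files", "algorithm": "sliding-window"},
    "reviewer": {"endpoint": "/api/upload"},
}

conflicts = find_conflicts(reports)
```

In this example the workers agree on the algorithm but not on the endpoint, so `conflicts` holds a single entry for `endpoint`. That is a natural point for a clarification interrupt rather than a surprise in the final diff.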

Intent has to survive the session

The next problem is memory.

This is not memory as user personalization, and it is not only a library of reusable skills. Procedures are useful, but the more important layer is actionable memory: what the system learns from what happened in previous runs.

Every serious agent run leaves evidence behind. The plan it wrote, the files it opened, the assumptions it made, the commands it ran, the reviewer comments it ignored, and the human correction that finally made the work acceptable are all useful signals.

That evidence should not disappear into chat history. It can become a learning loop:

run logs → candidate lessons → evaluator → behavior updates → future runs

After the rate-limiter failure, the lesson is not simply "write a rate-limiter skill." The useful lesson is narrower: when an issue names an endpoint, verify the route before editing; do not introduce product-facing configuration unless the plan explicitly asks for it; if the requested algorithm and the existing pattern disagree, stop and ask.

Those are behavior updates. They can land in project instructions, planner rubrics, reviewer checks, interruption policies, or sometimes a skill. They should also be evaluated before they change future behavior. The evaluation should ask whether the lesson is supported by the logs, scoped narrowly enough to be safe, and likely to prevent the same class of mistake.
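The first two evaluation questions can be partly mechanized. The sketch below is one possible gate, under stated assumptions: the `CandidateLesson` fields, the scope labels (`repo`, `path:<dir>`, `global`), and the two-run evidence threshold are all invented for illustration. The third question, whether the lesson is likely to prevent the same class of mistake, is left to a human or a judge agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateLesson:
    text: str
    evidence_run_ids: tuple[str, ...]  # runs whose logs are claimed to support the lesson
    scope: str                         # hypothetical labels: "repo", "path:<dir>", "global"
    target: str                        # where it would land, e.g. "planner-rubric"


def evaluate(lesson: CandidateLesson, logged_run_ids: set[str],
             min_evidence: int = 2) -> tuple[bool, list[str]]:
    """Gate a lesson on log support and narrow scope; return (accepted, reasons)."""
    reasons: list[str] = []
    # Only count evidence that actually exists in the preserved run logs.
    supported = [run for run in lesson.evidence_run_ids if run in logged_run_ids]
    if len(supported) < min_evidence:
        reasons.append(f"only {len(supported)} supporting run(s) found in logs")
    if lesson.scope == "global":
        reasons.append("scope is global; narrow it to a repo or path")
    return (not reasons, reasons)


logs = {"run-17", "run-22", "run-30"}  # made-up run identifiers

narrow = CandidateLesson(
    text="When an issue names an endpoint, verify the route before editing.",
    evidence_run_ids=("run-17", "run-22"),
    scope="repo",
    target="planner-rubric",
)
broad = CandidateLesson(
    text="Never add configuration.",
    evidence_run_ids=("run-17",),
    scope="global",
    target="project-instructions",
)

narrow_verdict = evaluate(narrow, logs)
broad_verdict = evaluate(broad, logs)
```

The narrow lesson is accepted, while the broad one is rejected for both thin evidence and global scope. Returning the reasons alongside the verdict keeps a rejected lesson available for rewording instead of silently dropping it.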

This also changes how improvement can be measured. Keeping logs, lessons, and evaluation outcomes makes it possible to ask whether scope drift is decreasing, whether reviewers are catching fewer repeat mistakes, whether human corrections are getting smaller, and whether a memory update reduced errors or only added prompt noise.
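A minimal sketch of that measurement, assuming each run is logged with reviewer-assigned mistake tags and the size of the human correction. The `RunRecord` fields, the tag names, and the sample data are made up for illustration and are not results from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    after_update: bool            # did the run happen after the memory update?
    mistake_tags: frozenset[str]  # reviewer-assigned labels, e.g. "scope-drift"
    correction_lines: int         # size of the human correction, in changed lines


def mistake_rate(runs: list[RunRecord], tag: str) -> float:
    """Fraction of runs in which reviewers tagged the given mistake."""
    if not runs:
        return 0.0
    return sum(tag in run.mistake_tags for run in runs) / len(runs)


def mean_correction(runs: list[RunRecord]) -> float:
    """Average human correction size across runs."""
    if not runs:
        return 0.0
    return sum(run.correction_lines for run in runs) / len(runs)


def compare(runs: list[RunRecord], tag: str) -> dict[str, tuple[float, float]]:
    """Return (before, after) pairs for each metric around the memory update."""
    before = [run for run in runs if not run.after_update]
    after = [run for run in runs if run.after_update]
    return {
        "mistake_rate": (mistake_rate(before, tag), mistake_rate(after, tag)),
        "mean_correction_lines": (mean_correction(before), mean_correction(after)),
    }


# Invented sample: four runs before the update, four after.
runs = [
    RunRecord("r1", False, frozenset({"scope-drift"}), 40),
    RunRecord("r2", False, frozenset({"scope-drift"}), 20),
    RunRecord("r3", False, frozenset(), 0),
    RunRecord("r4", False, frozenset({"missing-test"}), 20),
    RunRecord("r5", True, frozenset({"scope-drift"}), 10),
    RunRecord("r6", True, frozenset(), 0),
    RunRecord("r7", True, frozenset(), 6),
    RunRecord("r8", True, frozenset(), 0),
]

summary = compare(runs, "scope-drift")
```

On this invented sample the scope-drift rate falls from 0.5 to 0.25 and the mean correction from 20 lines to 4. With so few runs, a difference like that is only a prompt to keep measuring, not evidence that the update worked.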

Memory without evaluation is hard to trust, because nothing shows whether a stored lesson helps or only adds prompt noise. Memory tied to logs and evals becomes an improvement system.

Verification has to check intent

Tests describe whether the code behaves according to the cases that were written down.

They do not necessarily show whether the agent built the requested thing.

That gap is why Anthropic's Outcomes primitive is interesting. An Outcome is a Markdown rubric that a separate agent can use to grade the result in a fresh context. The planner can loop until the rubric is satisfied. The relevant shift is that the system does not only ask whether a command passed. It asks whether the result satisfies criteria that were written before implementation started.

For the rate limiter example, a useful rubric might say:

The implementation is successful if:
- /api/upload enforces a sliding-window limit per authenticated user.
- No new product-facing configuration is introduced.
- The change does not affect /api/files.

That gives the reviewer a contract. The review is no longer based only on whether the diff looks reasonable.
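The loop around such a rubric can be sketched generically. This is not Anthropic's Outcomes API: `unmet`, `run_until_satisfied`, and the callable signatures are assumptions, and the judge and builder here are stubs standing in for separate agents. The only point is the control flow, in which criteria are fixed before the work starts and unmet criteria feed the next attempt.

```python
from typing import Callable

Judge = Callable[[str, str], bool]    # (criterion, result) -> satisfied?
Attempt = Callable[[list[str]], str]  # unmet criteria from the last round -> new result


def unmet(rubric: list[str], result: str, judge: Judge) -> list[str]:
    """Return the rubric criteria the judge considers unsatisfied."""
    return [criterion for criterion in rubric if not judge(criterion, result)]


def run_until_satisfied(rubric: list[str], attempt: Attempt, judge: Judge,
                        max_rounds: int = 3) -> tuple[str, list[str], int]:
    """Retry until the rubric is met or rounds run out; return (result, unmet, rounds)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: list[str] = []
    result = ""
    for round_no in range(1, max_rounds + 1):
        result = attempt(feedback)
        feedback = unmet(rubric, result, judge)
        if not feedback:
            return result, feedback, round_no
    # Out of rounds: hand the remaining failures to a human.
    return result, feedback, max_rounds


# Stubs for illustration only. A real judge would be a separate agent in a fresh context.
rubric = ["limits /api/upload", "no new config flag"]


def stub_judge(criterion: str, result: str) -> bool:
    return criterion in result


def stub_attempt(feedback: list[str]) -> str:
    if not feedback:
        return "limits /api/upload"
    return "limits /api/upload; no new config flag"


final_result, remaining, rounds = run_until_satisfied(rubric, stub_attempt, stub_judge)
```

With these stubs the first attempt misses one criterion and the second satisfies both. When the round limit is reached, the function returns the unmet criteria instead of raising, so the human can decide which failures matter.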

The Agent-as-a-Judge survey points in the same direction: judge agents become more reliable when they observe intermediate steps instead of grading only the final answer. Observation matters because the review agent needs to understand how the work happened, not only inspect the final diff.

One agent builds. Another checks. A rubric defines the target. The human still decides which failures matter.

The stack is converging

Plan. Steer. Remember. Verify.

These used to look like separate research and product threads. They are now becoming the standard shape of the agent stack.

| Layer | Artifact | What it protects |
| --- | --- | --- |
| Plan | `plan.md`, `.cursor/plans/`, Spec Kit, Playbooks | The agent's understanding before code |
| Steer | Tasks, clarification interrupts, live progress | The trajectory during work |
| Remember | Run logs, evaluated lessons, behavior updates, memory systems | Learning from previous work across sessions |
| Verify | Outcomes, rubrics, judge agents, review agents | Whether the result matches intent |

The important shift is that agent reliability becomes a systems problem, not only a prompting problem.

The plan is an interface. The skill is an interface. The task list is an interface. The rubric is an interface. The reviewer prompt is an interface. Each one carries intent from one part of the system to another.

Designing those interfaces is software design applied to a different set of components.

What this means for engineering teams

Model providers can ship better models. Tool vendors can ship better planners, spec workflows, playbooks, and review systems.

Those tools still do not know what correctness means inside a specific codebase.

They do not know that the billing system treats retries differently for enterprise customers. They do not know that migrations must be generated with drizzle:generate rather than written as raw SQL. They do not know which product constraints are implicit because everyone on the team has internalized them.

That knowledge has to live somewhere durable and usable.

If it lives only in a person's head, the agent will miss it. If it lives only in chat, it will disappear. If it appears only in final PR review, it will arrive after many decisions have already been made.

The practical work is straightforward and valuable:

- Write plans the team can edit.
- Preserve run logs, reviewer findings, and human corrections.
- Turn repeated failures into scoped candidate lessons.
- Evaluate those lessons before changing the agent's behavior.
- Make success criteria explicit before implementation.
- Measure whether memory updates reduce drift, repeat mistakes, and human correction.

This is the control plane for coding agents.

The control plane is becoming the main editing surface

For years, the editor was where most software work happened. Code changed, the compiler or test suite responded, and the feedback loop stayed local and visible.

Coding agents move part of that loop up a level.

Important edits are now often made outside handler.ts: in plan.md, SKILL.md, CLAUDE.md, task lists, rubrics, and reviewer instructions. Those files decide what the agent sees, what it is allowed to change, how it reports progress, and how the result gets judged.

The model is still important, but it is not the whole product. The product is the system around it: artifacts, reviewers, memories, checks, and human decision points that convert model capability into reliable work.

The practical advantage comes from designing that surrounding system well.

Top comments (2)

PracHub • May 26

The article is right about coding agents needing more than a single prompt for reliable output. Moving misunderstanding earlier with tools like Plan Mode seems practical given the complexity of real-world tasks. But does relying on memory and verification really solve the issue if an agent fundamentally misunderstands the initial problem?

Mixture of Experts • Jun 3

I think these are ways to help reduce the challenges. Another part of this is being able to steer effectively mid-run: get proper alignment at the start, refine it in the middle of the process, and make it easy to revert changes at the end in case the result is off. We need to be able to control alignment at each stage of the build process, because that is how humans work too, whether we handle a task alone or on a team. It is an iterative approach that shifts, not one straight line to alignment.