<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mixture of Experts</title>
    <description>The latest articles on DEV Community by Mixture of Experts (@mixture-of-experts).</description>
    <link>https://dev.to/mixture-of-experts</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916862%2F03b63012-0632-4c84-b324-269b51e29ad6.jpg</url>
      <title>DEV Community: Mixture of Experts</title>
      <link>https://dev.to/mixture-of-experts</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mixture-of-experts"/>
    <language>en</language>
    <item>
      <title>Three Things I Learned Using Coding Agents with 1M-Token Models</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 02:35:21 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/three-things-i-learned-using-coding-agents-with-1m-token-models-501o</link>
      <guid>https://dev.to/mixture-of-experts/three-things-i-learned-using-coding-agents-with-1m-token-models-501o</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The effective context window is far smaller than advertised. Even with 1M-token models, performance degrades noticeably past ~100K tokens — worse coherency, more hallucinations, and planning drift. Treat the full window as a capacity limit, not an operating target.&lt;/li&gt;
&lt;li&gt;Sub-agents are essential for long-horizon work. Delegating scoped tasks to sub-agents keeps each agent in its "smart zone" and prevents context pollution. Watch for the "impatience problem" where the main agent duplicates work already delegated.&lt;/li&gt;
&lt;li&gt;Skills + CLIs beat MCP servers for context control. Skills offer progressive context disclosure and dynamic filtering. MCP servers push opaque context with limited filtering — a critical difference when every token counts.&lt;/li&gt;
&lt;li&gt;Context is the scarce resource, not capability. Compaction strategy, sub-agent architecture, and tool selection should all be designed around keeping context lean, scoped, and fresh.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've been using coding agents heavily — primarily Copilot CLI and the SDK, but also Claude Code and other agentic tools — alongside the 1M-token context models (Codex 5.4 and Opus/Sonnet 4.6). While the examples below are drawn from my Copilot CLI workflow, these patterns apply to any coding agent that operates on long-context models: Claude Code, Cursor, Windsurf, Aider, or whatever you're using. The underlying constraints are model-level, not tool-specific.&lt;/p&gt;

&lt;p&gt;My workflow has evolved significantly from where most people start. Most developers see "1M tokens" and think "I can throw everything at the model." The results are predictably bad. Worse coherency. More hallucinations. Plans that drift until they're unrecognizable. The full context window is a capacity limit, not an operating target.&lt;/p&gt;

&lt;p&gt;Here are three patterns that fundamentally changed how I work with these tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The "Smart Zone" Is Much Smaller Than You Think
&lt;/h2&gt;

&lt;p&gt;Even though these models support context windows of up to 1 million tokens, the effective performance zone is significantly smaller — and the reasons are architectural, not incidental.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the limitation exists
&lt;/h3&gt;

&lt;p&gt;Most 1M-token models aren't fundamentally larger or smarter than their shorter-context predecessors. They achieve extended context through mathematical techniques like YaRN (Yet another RoPE extensioN — &lt;a href="https://arxiv.org/pdf/2309.00071" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2309.00071&lt;/a&gt;) that stretch the model's sequence length without adding parameters. The context window grows, but the model's core reasoning capacity — what HumanLayer calls the "instruction budget" (&lt;a href="https://www.hlyr.dev/blog/long-context-isnt-the-answer" rel="noopener noreferrer"&gt;https://www.hlyr.dev/blog/long-context-isnt-the-answer&lt;/a&gt;) — stays the same.&lt;/p&gt;

&lt;p&gt;The instruction budget is the number of instructions a model can reliably follow before adherence starts to drop. It's strongly correlated with the model's parameter count and instruction tuning quality, not with its context window size. When you extend the context 5x without scaling the instruction budget, you can fit more information in, but the model isn't actually better at attending to it. HumanLayer found this firsthand when they tested Claude Opus 4.6 (1M context): instruction adherence degraded not just at capacity limits, but across all context lengths compared to the shorter-context Opus 4.5.&lt;/p&gt;

&lt;p&gt;Think of it this way: your context window is a haystack where tool calls, documents, and files are the hay. The quality of the agent's next action depends on its ability to find the right needle — the most relevant instruction for the current state. Expanding the haystack 5x without improving the model's needle-finding ability just buries the signal deeper.&lt;/p&gt;

&lt;h3&gt;
  
  
  What degradation looks like in practice
&lt;/h3&gt;

&lt;p&gt;In my experimentation across different prompt and context sizes, model performance starts to degrade noticeably past approximately 100K tokens. This shows up as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worse task coherency — the model loses track of the overall objective&lt;/li&gt;
&lt;li&gt;Reduced reasoning reliability — logical chains break down&lt;/li&gt;
&lt;li&gt;Increased hallucination rate — the model confidently fabricates details&lt;/li&gt;
&lt;li&gt;Planning drift in long-horizon tasks — multi-step plans veer off course&lt;/li&gt;
&lt;li&gt;Instruction disobedience — the model ignores design documents, misunderstands simple instructions, or makes trivial mistakes it wouldn't make in a leaner context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't theoretical. I've watched agents produce clean, well-reasoned output at 80K tokens, then fall apart at 150K with the same task and codebase. The degradation isn't binary — it's a gradient. But the inflection point is consistent enough that I've built my workflow around it. HumanLayer observed the same pattern — they shifted their context warnings to trigger at 100K tokens rather than at a percentage of the usable window.&lt;/p&gt;

&lt;h3&gt;
  
  
  What works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Trigger auto-compaction earlier. Don't wait until the context window is full. Set compaction thresholds well below the model's maximum capacity.&lt;/li&gt;
&lt;li&gt;Periodically clear the context window. Persist progress to disk — research docs, specs, task lists — then start fresh sessions that load only what's needed for the current phase.&lt;/li&gt;
&lt;li&gt;Stop max-packing prompts. The fact that the model allows 1M tokens doesn't mean you should use them. Treat the full window as headroom for unexpected context growth, not as the target operating point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical rule: treat the full 1M window as a capacity limit, not an operating target. More context isn't more capability. Design your workflows around staying well under it.&lt;/p&gt;
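
&lt;p&gt;To make the first two bullets concrete, here is a minimal sketch of an early-compaction trigger, assuming a hypothetical agent object with context_messages, save_progress(), and compact(); the thresholds come from my experiments above, not from any official guidance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;SMART_ZONE_LIMIT = 100_000  # degradation observed past ~100K tokens
COMPACT_AT = 80_000         # compact well before the limit, not at capacity

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token for English text and code.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(agent):
    if estimate_tokens(agent.context_messages) &amp;gt;= COMPACT_AT:
        # Persist durable state first (specs, task lists), then replace
        # older messages with a condensed summary.
        agent.save_progress("progress.md")
        agent.compact(keep_recent=20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;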

&lt;h2&gt;
  
  
  2. Use Sub-Agents to Offload Long-Horizon Work
&lt;/h2&gt;

&lt;p&gt;One of the most effective patterns I've found is spawning sub-agents to balance the main agent's context and handle complex or long-running tasks.&lt;/p&gt;

&lt;p&gt;The concept is straightforward: instead of stuffing everything into one agent's context window, delegate scoped work to sub-agents that operate in their own context windows. The orchestrating agent receives condensed results. Its context stays lean. Each sub-agent gets only the information it needs.&lt;/p&gt;

&lt;p&gt;This directly addresses the context degradation problem. If you can keep each agent under 100K tokens by distributing work across multiple agents, you stay in the "smart zone" even for tasks that would otherwise require 300K+ tokens of total context.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Orchestrator Pattern
&lt;/h3&gt;

&lt;p&gt;Below is a template I use for an orchestrator sub-agent (adapted from HumanLayer's work on sub-agent orchestration, with modifications for my workflow):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orchestrator&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Orchestrate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sub-agents&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;accomplish&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;long-horizon&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tasks&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;without&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;losing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;coherency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;delegating&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sub-agents."&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edit"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="s"&gt;You are a sub-agent orchestrator. The most important tool available to you&lt;/span&gt;
&lt;span class="na"&gt;is the one that dispatches sub-agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;either `Agent` or `Task`.&lt;/span&gt;

&lt;span class="s"&gt;All non-trivial operations should be delegated to sub-agents.&lt;/span&gt;

&lt;span class="s"&gt;Delegate research and codebase understanding tasks to codebase-analyzer,&lt;/span&gt;
&lt;span class="s"&gt;codebase-locator, and pattern-locator sub-agents.&lt;/span&gt;

&lt;span class="s"&gt;Delegate running bash commands (particularly ones likely to produce lots&lt;/span&gt;
&lt;span class="s"&gt;of output) to Bash sub-agents.&lt;/span&gt;

&lt;span class="s"&gt;Use separate sub-agents for separate tasks, and launch them in parallel —&lt;/span&gt;
&lt;span class="s"&gt;but do not delegate tasks with significant overlap to separate sub-agents.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate sub-agents for separate tasks — prevents context pollution between unrelated work&lt;/li&gt;
&lt;li&gt;Parallel execution — sub-agents can work simultaneously on independent tasks&lt;/li&gt;
&lt;li&gt;No overlapping delegation — avoids duplicate work and conflicting outputs&lt;/li&gt;
&lt;/ul&gt;
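
&lt;p&gt;Here is what those decisions look like in harness code, as a minimal sketch assuming a hypothetical spawn_subagent() helper (this is not any particular tool's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def spawn_subagent(name, task):
    # Stand-in: a real harness dispatches this to a fresh agent context
    # and returns only a condensed report, never the full transcript.
    return f"[{name}] summary of: {task}"

async def orchestrate():
    # Separate sub-agents for separate tasks, launched in parallel;
    # scopes do not overlap, so outputs cannot conflict.
    results = await asyncio.gather(
        spawn_subagent("codebase-locator", "find all usages of Session"),
        spawn_subagent("codebase-analyzer", "summarize the auth module"),
        spawn_subagent("bash", "run the test suite; report failures only"),
    )
    # The orchestrator's context receives three short summaries,
    # not three full working transcripts.
    return "\n\n".join(results)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;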

&lt;h3&gt;
  
  
  The Impatience Problem
&lt;/h3&gt;

&lt;p&gt;There's a behavioral quirk worth calling out. Post-training tends to bias these models toward acting immediately rather than waiting on delegated work. In practice, this means the main agent becomes impatient — it attempts to complete a task that's already been delegated to a sub-agent.&lt;/p&gt;

&lt;p&gt;This defeats the purpose of sub-agents entirely. You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context pollution — the main agent duplicates work happening in a sub-agent&lt;/li&gt;
&lt;li&gt;Duplicate work — wasted compute and potentially conflicting outputs&lt;/li&gt;
&lt;li&gt;Planning drift — the main agent's plan diverges from the sub-agent's execution&lt;/li&gt;
&lt;li&gt;Loss of orchestration coherency — the delegation structure breaks down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's why I explicitly include this instruction in every orchestrator prompt:&lt;/p&gt;

&lt;p&gt;"IMPORTANT: Sometimes sub-agents will take a long time. DO NOT attempt to do the job yourself while waiting for the sub-agent to respond. Instead, use the time to plan out your next steps, or ask the user follow-up questions to clarify the task requirements."&lt;/p&gt;

&lt;p&gt;This isn't specific to any one tool. It's a model-level behavioral tendency — the post-training optimization makes models want to "do something" rather than wait. I first noticed it in Copilot CLI, but the same pattern shows up in Claude Code, Cursor, and other agentic systems. The explicit instruction overrides that default regardless of which agent you're using.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Prefer Skills + CLIs Over MCP Servers
&lt;/h2&gt;

&lt;p&gt;In practice, I consistently favor Skills + CLIs over MCP servers for agent tool integration.&lt;/p&gt;

&lt;p&gt;The reason is context control. Skills and CLIs support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Progressive context disclosure — you control exactly what context enters the prompt window, when, and in what form&lt;/li&gt;
&lt;li&gt;Dynamic filtering — you can scope the retrieved context based on the current task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP servers, by contrast, often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push opaque context — the server decides what to include, and you have limited visibility into what enters your prompt&lt;/li&gt;
&lt;li&gt;Provide limited filtering — the architectural design of MCP makes it harder to control the granularity of context injection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distinction becomes critical once you're operating near the 100K+ token regime. When every token of context matters, you need tight control over what the agent "knows" at any point in time. Skills give you that control. MCP servers often don't.&lt;/p&gt;
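
&lt;p&gt;To make progressive disclosure concrete, here is a minimal sketch. The on-disk layout and loader are illustrative assumptions rather than a real registry API: one-line summaries stay in context at all times, while full instructions load only on invocation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;SKILLS = {
    "migrate-db": {
        "summary": "Generate and apply schema migrations",
        "path": "skills/migrate-db/SKILL.md",
    },
    "profile-perf": {
        "summary": "Profile a service and report hotspots",
        "path": "skills/profile-perf/SKILL.md",
    },
}

def skill_index():
    # Cheap: a few tokens per skill, always present in the prompt.
    return "\n".join(f"- {name}: {s['summary']}" for name, s in SKILLS.items())

def load_skill(name):
    # Expensive: full instructions enter context only when the skill is
    # invoked, and can be dropped again once the task is done.
    with open(SKILLS[name]["path"]) as f:
        return f.read()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;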

&lt;h3&gt;
  
  
  Skill Registries for Discoverable Capabilities
&lt;/h3&gt;

&lt;p&gt;To ground coding agents with capabilities they can dynamically discover and download, two registries are worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Agent Skills Directory — a curated directory of reusable agent skills: &lt;a href="https://skills.sh/" rel="noopener noreferrer"&gt;https://skills.sh/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;microsoft/skills — Microsoft's open-source skill repository: &lt;a href="https://github.com/microsoft/skills" rel="noopener noreferrer"&gt;https://github.com/microsoft/skills&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These registries let agents find and adopt capabilities without ballooning the primary context with skill definitions that aren't needed for the current task. A skill is loaded when it's needed, used, and then unloaded, reclaiming the context it occupied.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;All three tips point to the same underlying principle: context is the scarce resource, not capability.&lt;/p&gt;

&lt;p&gt;The models are capable enough. The context window is large enough. But the effective operating zone is much smaller than the theoretical maximum. Everything you do — compaction strategy, sub-agent architecture, tool selection — should be designed around keeping context lean, scoped, and fresh.&lt;/p&gt;

&lt;p&gt;Treat context like memory in a constrained system. Allocate carefully. Free aggressively. Never assume that having more headroom means you should use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;HumanLayer, "Long-Context Isn't the Answer": &lt;a href="https://www.hlyr.dev/blog/long-context-isnt-the-answer" rel="noopener noreferrer"&gt;https://www.hlyr.dev/blog/long-context-isnt-the-answer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models": &lt;a href="https://arxiv.org/pdf/2309.00071" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2309.00071&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Agent Skills Directory: &lt;a href="https://skills.sh/" rel="noopener noreferrer"&gt;https://skills.sh/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Microsoft Skills Repository: &lt;a href="https://github.com/microsoft/skills" rel="noopener noreferrer"&gt;https://github.com/microsoft/skills&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>learning</category>
    </item>
    <item>
      <title>The Memory Wall Is Coming Down — What It Means for Coding Agents</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 02:25:16 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/the-memory-wall-is-coming-down-what-it-means-for-coding-agents-282j</link>
      <guid>https://dev.to/mixture-of-experts/the-memory-wall-is-coming-down-what-it-means-for-coding-agents-282j</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The memory wall is a primary constraint on coding agents, not model intelligence. Quadratic attention costs, KV cache growth, and "lost in the middle" degradation create a hard ceiling on how long agents can maintain coherent reasoning.&lt;/li&gt;
&lt;li&gt;Research breakthroughs compose: 30x+ KV memory reduction is within reach. TriAttention's intelligent pruning and TurboQuant's 3-bit quantization are complementary techniques that stack naturally, while Latent Briefing cuts multi-agent context sharing costs by 49%.&lt;/li&gt;
&lt;li&gt;Fundamentally different theories of agent memory are emerging. For example, MemPalace bets on structured archival with spatial retrieval; Hippo Memory bets on intelligent forgetting with decay-based consolidation. The field hasn't converged on which approach wins; the answer may depend on the use case.&lt;/li&gt;
&lt;li&gt;The harness is becoming an operating system for agent memory. Claude Code's three-layer compaction, four-tier persistence hierarchy, and self-healing query loop reveal that production coding agents are already memory management systems — and this pattern will only deepen.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One of the biggest constraints on coding agents is memory.
&lt;/h2&gt;

&lt;p&gt;Specifically, it's the quadratic cost of attention — the mechanism that lets models weigh the relevance of every token against every other token. This single architectural bottleneck determines how long an agent can think, how much context it can hold, and how complex the tasks it can tackle before it starts forgetting what it was doing.&lt;/p&gt;

&lt;p&gt;Three layers of innovation are converging on this problem simultaneously: foundational research that slashes the memory cost of attention itself, community-built tools that give agents persistent memory across sessions, and production harness architectures that manage context as a first-class engineering concern. These aren't isolated efforts. They're solving the same problem at different altitudes. Understanding how they connect — and where they're heading — is essential for anyone building with or for coding agents today.&lt;/p&gt;

&lt;h2&gt;
  
  
  The attention tax every coding agent pays
&lt;/h2&gt;

&lt;p&gt;Before getting into solutions, it's worth understanding the constraint clearly. If you've worked with coding agents, you've felt this — even if you didn't have a name for it.&lt;/p&gt;

&lt;p&gt;Transformers process input through an attention mechanism. For every new token the model generates, it computes a relevance score against every previous token in the context window. This is what makes language models powerful: they can relate distant pieces of information. It's also what makes them expensive: the computation scales quadratically with sequence length. Double the context, quadruple the cost.&lt;/p&gt;

&lt;p&gt;Cost flow: context length drives both KV cache memory (linear growth) and attention computation (quadratic growth); together they hit the GPU memory wall, and the result is degraded performance or an out-of-memory failure.&lt;/p&gt;
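
&lt;p&gt;A quick back-of-envelope sketch makes the scaling concrete. The configuration below is an illustrative assumption (roughly a 70B-class model with grouped-query attention and an fp16 cache), not any specific model's numbers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def kv_cache_gib(seq_len, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for keys and values; grows linearly with sequence length.
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len / 2**30

for n in (50_000, 100_000, 200_000):
    # Attention score count grows quadratically: double the context, 4x the scores.
    print(f"{n} tokens: KV cache ~{kv_cache_gib(n):.0f} GiB, scores ~{n * n:.1e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;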

&lt;p&gt;In practice, this means a 200K token context window is not 200K tokens of useful capacity. Claude Code's 200K window shows measurable degradation around 147K–152K tokens. System prompts alone can consume 30K–40K tokens before the user types anything. The "lost in the middle" phenomenon — where models deprioritize information in the middle of long contexts — compounds the problem. More context doesn't mean better understanding. Past a threshold, it means worse understanding.&lt;/p&gt;

&lt;p&gt;For coding agents, this creates a hard ceiling. A long refactoring session accumulates tool results, file reads, error traces, and intermediate reasoning. Each step adds to the context. Eventually, the agent is spending more compute re-attending to stale history than reasoning about the current problem. This is the memory wall, and it's the primary reason coding agents degrade on long tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Research is breaking the wall
&lt;/h2&gt;

&lt;p&gt;This isn't one paper. It's a wave. Multiple research teams are attacking the memory wall from different angles simultaneously: compressing the KV cache through structural insights, quantizing it to extreme bit-widths, and making multi-agent context sharing efficient at the representation level.&lt;/p&gt;

&lt;h3&gt;
  
  
  TriAttention: compressing the KV cache without losing quality
&lt;/h3&gt;

&lt;p&gt;The KV (key-value) cache stores the attention state for every token the model has processed. As context grows, this cache becomes the dominant memory consumer. Existing compression methods like SnapKV try to prune unimportant keys, but they estimate importance using attention scores from recent queries — and those scores are distorted by a positional encoding called RoPE (Rotary Position Embedding), making them unreliable.&lt;/p&gt;

&lt;p&gt;TriAttention, from researchers at MIT, NVIDIA, and Zhejiang University, takes a different approach. It exploits a structural property the authors call Q/K concentration: in the pre-RoPE representation space, query and key vectors cluster tightly around fixed centers regardless of input or position. Approximately 90% of attention heads in tested models show this property. These stable centers determine which token distances each head preferentially attends to via a trigonometric distance-preference function.&lt;/p&gt;

&lt;p&gt;Instead of dynamically guessing which keys matter, TriAttention scores each key against these fixed centers using the trigonometric function, then keeps only the top-scoring keys. The scoring runs as a fused Triton kernel with a protected window of recent tokens that are never evicted.&lt;/p&gt;
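
&lt;p&gt;A drastically simplified sketch of that selection step: score each cached key against its head's fixed center, then keep the best-scoring older keys plus the protected recent window. The real method uses a trigonometric distance-preference function and a fused Triton kernel; the plain dot-product scoring here is a stand-in.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def prune_keys(keys, center, budget, protect=128):
    # keys: (seq_len, head_dim); center: (head_dim,), fixed per attention head.
    scores = keys @ center                  # affinity with the head's center
    n = len(keys)
    recent = np.arange(n - protect, n)      # protected window: never evicted
    old = np.arange(n - protect)
    best_old = old[np.argsort(scores[old])[::-1][: budget - protect]]
    return np.sort(np.concatenate([best_old, recent]))  # indices to retain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;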

&lt;p&gt;The results are striking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10.7x KV memory reduction&lt;/li&gt;
&lt;li&gt;2.5x throughput on long reasoning tasks (32K token generation), with accuracy matching full attention (40.8 on AIME25 for both)&lt;/li&gt;
&lt;li&gt;6.3x throughput on MATH-500 with only 1.2 percentage points of accuracy loss&lt;/li&gt;
&lt;li&gt;Existing baselines (SnapKV, R-KV) collapse to roughly half the accuracy at the same memory budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical implication: reasoning models that previously required multi-GPU setups can run on a single consumer GPU. For coding agents, this means longer reasoning chains within the same hardware constraints — more time thinking about your refactoring task before the memory wall hits.&lt;/p&gt;

&lt;h3&gt;
  
  
  TurboQuant: extreme compression, zero accuracy loss
&lt;/h3&gt;

&lt;p&gt;While TriAttention prunes which keys to keep, Google's TurboQuant (ICLR 2026) attacks the same problem from a complementary angle: making each key smaller. It quantizes the KV cache down to 3 bits per parameter — training-free — using two techniques: PolarQuant, which rotates key/value vectors into a representation that quantizes more uniformly, and Quantized Johnson-Lindenstrauss compression, which reduces dimensionality while preserving distance relationships.&lt;/p&gt;

&lt;p&gt;The result: no measurable accuracy loss across LongBench, RULER, and Needle-in-a-Haystack benchmarks. In practice, this means ~3x longer effective context on the same GPU memory. Stack TurboQuant with TriAttention's pruning and you're looking at 30x+ memory reduction — enough to hold a substantial codebase's worth of context on hardware that currently struggles with a single long conversation.&lt;/p&gt;

&lt;p&gt;These aren't competing approaches. Pruning (which keys to keep) and quantization (how much space each key needs) compose naturally. The research community is converging on a layered compression stack for attention, much like how image codecs layer spatial compression, quantization, and entropy coding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latent Briefing: efficient memory sharing between agents
&lt;/h3&gt;

&lt;p&gt;Multi-agent systems have a compounding token problem. When an orchestrator delegates tasks to worker agents, each worker needs context about what the orchestrator has already figured out. The naive approach — passing the full reasoning trajectory as text — causes token usage to explode with each successive call. Summarization is slow and lossy. RAG retrieval is brittle.&lt;/p&gt;

&lt;p&gt;Latent Briefing, from Ramp Labs, operates at a different level entirely. Instead of compressing text, it compresses the model's internal representations.&lt;/p&gt;

&lt;p&gt;The mechanism: the orchestrator's accumulated reasoning is forward-passed through the worker model. The attention scores between the task prompt's query vectors and the trajectory's KV cache keys reveal which parts of the context the worker considers relevant — and crucially, this relevance is task-adaptive. Different queries compress the same context differently. The method then constructs a compact KV cache using the important keys, bias corrections for missing keys, and reconstructed values via ridge regression.&lt;/p&gt;
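
&lt;p&gt;A drastically simplified sketch of the key-selection step, assuming raw query and key matrices are available: measure how much attention the worker's task-prompt queries pay to each trajectory key, then keep the top fraction. The bias corrections and ridge-regression value reconstruction from the paper are omitted here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def select_keys(task_queries, trajectory_keys, keep_fraction=0.5):
    # task_queries: (q, d); trajectory_keys: (n, d)
    logits = task_queries @ trajectory_keys.T / np.sqrt(task_queries.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)     # softmax: attention per query
    relevance = w.sum(axis=0)             # total attention mass per key
    k = int(len(trajectory_keys) * keep_fraction)
    return np.sort(np.argsort(relevance)[::-1][:k])  # task-adaptive selection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;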

&lt;p&gt;Tested with Claude Sonnet 4 (orchestrator) and Qwen-14B (worker) on 126 LongBench v2 questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;49% median token savings on medium-length documents (32K–100K tokens)&lt;/li&gt;
&lt;li&gt;+3 percentage point accuracy gain at the right compaction threshold — it actually performs better with less context&lt;/li&gt;
&lt;li&gt;Compaction takes ~1.7 seconds, roughly 20x faster than sequential attention merging and 10–30x faster than LLM summarization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That accuracy gain is the telling result. Removing irrelevant context doesn't just save tokens — it helps the model focus.&lt;/p&gt;

&lt;p&gt;A practically important finding from the paper: different compaction thresholds win in different regimes. Longer documents benefit from lighter compaction — the information is dispersed and broad coverage matters, but even light pruning still saves 57% of worker tokens. Harder questions benefit from aggressive compaction (79% of context removed) because the orchestrator's speculative reasoning generates noise that dilutes the worker's signal. Moderate compaction works best for short, focused documents.&lt;/p&gt;

&lt;p&gt;This isn't just a tuning knob. It's a design principle: the right amount of context depends on the task, not just the budget. Compaction should be task-aware, not one-size-fits-all. This validates a principle that experienced harness engineers already know intuitively: less context, better directed, beats more context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Products are building on top
&lt;/h2&gt;

&lt;p&gt;While researchers optimize what happens inside the model's context window, community builders are attacking the problem from the other direction: giving agents external memory that persists beyond any single session.&lt;/p&gt;

&lt;h3&gt;
  
  
  MemPalace: structured recall through spatial organization
&lt;/h3&gt;

&lt;p&gt;MemPalace maps the ancient Method of Loci to a data architecture for AI agents. Wings are top-level categories (a person, a project). Rooms are specific topics within a wing. Halls connect rooms by type. Tunnels automatically link the same room across different wings. Drawers are the atomic unit: verbatim text chunks that are never summarized.&lt;/p&gt;

&lt;p&gt;The technical backbone is dual storage: ChromaDB for semantic vector search and SQLite for a temporal knowledge graph that tracks facts over time. A four-layer memory stack minimizes token cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L0 (identity, ~50 tokens) and L1 (critical facts, ~120 tokens) load on every startup&lt;/li&gt;
&lt;li&gt;L2 (room recall) and L3 (deep search) fire only on demand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In benchmarks, wing+room metadata filtering improves retrieval from 60.9% to 94.8% R@10 — though this leverages standard ChromaDB metadata filtering rather than a novel retrieval mechanism. The real value is the spatial organization model itself, which gives agents a structured way to scope queries. Everything runs locally with no cloud dependency.&lt;/p&gt;
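
&lt;p&gt;A minimal sketch of the load policy and scoped recall, assuming a ChromaDB collection whose documents carry wing and room metadata; the function names are illustrative, not MemPalace's API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def startup_context(identity, critical_facts):
    # L0 (~50 tokens) + L1 (~120 tokens) load on every session start.
    return identity + "\n" + critical_facts

def room_recall(collection, query, wing, room):
    # L2/L3 fire only on demand: semantic search scoped by wing/room
    # metadata, the filtering behind the 60.9% to 94.8% R@10 jump.
    return collection.query(
        query_texts=[query],
        n_results=10,
        where={"$and": [{"wing": wing}, {"room": room}]},
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;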

&lt;h3&gt;
  
  
  Hippo Memory: forgetting as a feature
&lt;/h3&gt;

&lt;p&gt;Hippo Memory takes a neuroscience-inspired approach with a three-tier hierarchy mimicking human memory: a buffer (working memory, current session only), an episodic store (timestamped memories with a 7-day half-life that strengthens through retrieval), and a semantic store (stable patterns extracted during consolidation).&lt;/p&gt;

&lt;p&gt;The key innovation is the sleep command — a consolidation pipeline that runs a decay pass to remove weak memories, a replay pass that finds three or more related episodes via embedding similarity and extracts common patterns into semantic memory, conflict detection for contradictions, and schema indexing to update topic clusters.&lt;/p&gt;

&lt;p&gt;Memories decay by default. Persistence is earned through use. Errors get 2x the half-life. Breakthroughs get priority. This is the opposite of "store everything and search later." It's a bet that intelligent forgetting is as important as precise recall.&lt;/p&gt;
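
&lt;p&gt;A minimal sketch of the decay math, using the rules described above (7-day half-life, doubled for errors, strengthened on retrieval); the memory dictionary shape is an illustrative assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

DAY = 86_400

def strength(memory, now=None):
    # Exponential decay with a 7-day half-life; errors get 2x the half-life.
    now = time.time() if now is None else now
    half_life = 7 * DAY * (2 if memory["kind"] == "error" else 1)
    return memory["weight"] * 0.5 ** ((now - memory["last_used"]) / half_life)

def on_retrieval(memory):
    # Persistence is earned through use: retrieval resets the clock
    # and strengthens the trace.
    memory["last_used"] = time.time()
    memory["weight"] *= 1.5

def decay_pass(memories, floor=0.05):
    # One step of the "sleep" pipeline: weak memories are dropped outright.
    return [m for m in memories if strength(m) &amp;gt;= floor]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;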

&lt;h3&gt;
  
  
  The key distinction
&lt;/h3&gt;

&lt;p&gt;MemPalace and Hippo Memory represent two fundamentally different theories of agent memory. MemPalace is a structured archive — store everything verbatim, make it findable through spatial organization. Hippo is a dynamic brain — memories compete for survival through use, decay, and consolidation. MemPalace bets on retrieval precision. Hippo bets on forgetting as a feature. Both are valid. The field hasn't converged on which approach wins yet — and the answer may be that different tasks demand different memory architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code's architecture: the harness layer
&lt;/h2&gt;

&lt;p&gt;Claude Code reveals what a production coding agent does when it can't wait for research to ship. Its architecture is a pragmatic, multi-layered response to the memory wall — and it's instructive because it shows what works today.&lt;/p&gt;

&lt;h3&gt;
  
  
  The self-healing query loop
&lt;/h3&gt;

&lt;p&gt;Claude Code doesn't use standard request-response. It runs a continuous state machine designed to absorb failures. When the model exhausts its output budget mid-task, the loop doesn't crash. It triggers compression automatically, carving out a buffer before the token ceiling and generating a structured summary. If the API returns a prompt_too_long error, reactive compression fires and retries. To prevent infinite loops, auto-compaction pauses after three consecutive failures.&lt;/p&gt;
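
&lt;p&gt;The shape of that loop, sketched with hypothetical agent and client objects; the exception class stands in for the API's context-overflow error:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class PromptTooLongError(Exception):
    pass  # stand-in for the API's context-overflow error

def query_loop(agent, client):
    failures = 0
    while not agent.done():
        try:
            agent.apply(client.complete(agent.messages))
            failures = 0
        except PromptTooLongError:
            failures += 1
            if failures &amp;gt;= 3:
                agent.pause("auto-compaction suspended")  # no infinite loops
                break
            agent.compress()  # reactive compression, then retry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;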

&lt;h3&gt;
  
  
  Three-layer compaction
&lt;/h3&gt;

&lt;p&gt;The compaction system uses progressively stronger cleanup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rules engine cleanup — lightweight, no LLM call. Strips known low-value patterns: stale tool results, redundant messages.&lt;/li&gt;
&lt;li&gt;Session memory extraction — writes extracted facts to disk, removes them from context. Still avoids an LLM call.&lt;/li&gt;
&lt;li&gt;Full summary — when layers 1 and 2 are insufficient, an LLM-generated summary replaces older messages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A critical design choice: Claude Code preserves the message prefix so Anthropic's prompt cache remains valid. Naive oldest-first deletion would invalidate the entire cache on every compaction — a costly mistake that would negate the efficiency gains.&lt;/p&gt;
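
&lt;p&gt;Putting the three layers and the prefix rule together, a sketch in which strip_stale_tool_results, extract_session_memory, llm_summarize, and tokens are hypothetical stand-ins for the real machinery:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;PREFIX_LEN, KEEP_RECENT = 4, 20  # illustrative constants

def compact(messages, budget):
    # The prefix is never touched, so the provider's prompt cache stays valid.
    prefix, rest = messages[:PREFIX_LEN], messages[PREFIX_LEN:]

    rest = strip_stale_tool_results(rest)         # layer 1: rules, no LLM call
    if tokens(prefix + rest) &amp;lt;= budget:
        return prefix + rest

    rest = extract_session_memory(rest)           # layer 2: persist facts to disk
    if tokens(prefix + rest) &amp;lt;= budget:
        return prefix + rest

    summary = llm_summarize(rest[:-KEEP_RECENT])  # layer 3: full LLM summary
    return prefix + [summary] + rest[-KEEP_RECENT:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;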

&lt;h3&gt;
  
  
  Four-tier memory hierarchy
&lt;/h3&gt;

&lt;p&gt;Persistence across sessions uses four tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLAUDE.md — project-level instructions, read on every session start, survives compaction by being re-read from disk&lt;/li&gt;
&lt;li&gt;Auto Memory — topic files in .claude/ that evolve with project knowledge&lt;/li&gt;
&lt;li&gt;Session Memory — cross-session context extracted every ~5,000 tokens&lt;/li&gt;
&lt;li&gt;/remember — promotes recurring patterns into permanent configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How the layers stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research layer (TriAttention KV cache pruning + TurboQuant 3-bit KV quantization + Latent Briefing representation-level sharing): a wider foundation, more tokens at less cost&lt;/li&gt;
&lt;li&gt;Product layer (MemPalace structured archive + Hippo Memory dynamic forgetting): extended reach, memory beyond sessions&lt;/li&gt;
&lt;li&gt;Harness layer (three-layer compaction + four-tier memory hierarchy + self-healing loop): a managed interface, finite context serving infinite tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer feeds the one above it, and the developer experience sits on top of the harness.&lt;/p&gt;

&lt;p&gt;This architecture is fundamentally a memory management system that bridges finite context windows and unbounded tasks. The model handles what fits in attention. The harness handles everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the pieces map together
&lt;/h2&gt;

&lt;p&gt;These three layers — research, product, harness — aren't competing. They're solving the same problem at different altitudes, and their relationship is structural.&lt;/p&gt;

&lt;p&gt;The research layer makes the foundation wider. TriAttention and TurboQuant compose to achieve 30x+ memory reduction for the KV cache. Latent Briefing lets multiple agents share context at 50% of the token cost with better accuracy. These don't change what models can do conceptually — they change the economics of how much they can hold while doing it.&lt;/p&gt;

&lt;p&gt;The product layer extends reach beyond any single context window. MemPalace and Hippo Memory give agents access to knowledge that no context window, however large, could contain: months of project history, cross-session decisions, accumulated preferences. They're building external memory systems because even with perfect attention, a context window is still a window.&lt;/p&gt;

&lt;p&gt;The harness layer manages the interface between finite capacity and infinite demand. Claude Code's compaction, memory hierarchy, and self-healing loop exist because even with better attention and external memory, someone still needs to decide what goes in the context window right now. The harness is the memory manager — it routes information between tiers, decides what to compress, and recovers when capacity runs out.&lt;/p&gt;

&lt;p&gt;The layers are complementary because each one makes the others more effective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better KV compression (research) means harness compaction can be less aggressive, preserving more context quality&lt;/li&gt;
&lt;li&gt;Richer external memory (product) means the harness can offload more confidently, knowing it can retrieve when needed&lt;/li&gt;
&lt;li&gt;Smarter harness routing means research-level efficiency gains translate into user-visible capability, not just lower API bills&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Predictions: where this convergence leads
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Near-term
&lt;/h3&gt;

&lt;p&gt;The next wave of KV cache compression hits production. The first wave is already standard: Grouped Query Attention is baked into every major open-weight model, PagedAttention is the default memory manager in vLLM, FP8 KV quantization ships in both vLLM and TensorRT-LLM, and prefix caching is default-on at every major provider. What's coming is more aggressive. TriAttention already ships as a vLLM plugin with community ports for llama.cpp and MLX. TurboQuant emerged from Google Research with community MLX implementations appearing within weeks. Within a year, sub-4-bit KV quantization and intelligent pruning will be default options in inference frameworks — not research experiments. This next wave directly extends how long coding agents can maintain coherent reasoning on a single task.&lt;/p&gt;

&lt;p&gt;Multi-agent memory sharing moves from research to production. Latent Briefing's representation-level compaction is one compelling approach, but it's part of a broader wave. Teams are exploring shared KV cache pools across co-located agents, lightweight context distillation protocols, and hierarchical memory architectures where agents at different levels of an orchestration tree maintain context at different granularities. The common thread: making delegation cheap by solving context transfer at the systems level rather than through brute-force token passing. Expect multi-agent coding workflows to adopt one or more of these techniques, dramatically cutting the cost of orchestrator-to-worker handoffs.&lt;/p&gt;

&lt;p&gt;External memory becomes expected, not optional. MemPalace and Hippo Memory are early but the pattern is clear: coding agents that remember project context across sessions will outperform those that don't, and developers will demand this capability. Claude Code's CLAUDE.md and Auto Memory are the first-party version of this trend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Medium-term
&lt;/h3&gt;

&lt;p&gt;Always-on agents become practical. The combination of compressed attention, efficient multi-agent sharing, and tiered external memory unlocks a new class of applications: coding agents that maintain continuous context over days or weeks. Not because the context window grows to millions of tokens, but because the system around it manages memory intelligently at every layer.&lt;/p&gt;

&lt;p&gt;Memory consolidation becomes the norm. Hippo Memory's bet — that forgetting is as important as remembering — is likely directionally right, even if the specific mechanisms evolve. As agents accumulate months of project history, storing everything becomes as costly as forgetting everything. The winning systems will almost certainly need some form of consolidation: compressing episodes into patterns, decaying noise, strengthening what's used. Human memory works this way for a reason.&lt;/p&gt;

&lt;p&gt;Models develop distinct memory tiers internally. Current models treat all tokens in the context window equally. Future architectures will likely differentiate between working memory (high-attention, recent, expensive) and reference memory (lower-attention, compressed, cheap) — mirroring what harnesses already do externally. When this happens, the external product layer and the internal model layer will start to merge.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this means for harness engineering
&lt;/h3&gt;

&lt;p&gt;This is where it gets concrete for builders.&lt;/p&gt;

&lt;p&gt;Harnesses will manage multiple memory tiers, not just context windows. Today, a harness manages one thing: what's in the context. Tomorrow, it manages in-context tokens, compressed KV cache segments, external vector stores, persistent project files, and cross-session summaries. The harness becomes a memory routing system.&lt;/p&gt;

&lt;p&gt;Memory routing becomes a first-class discipline. For every piece of information an agent encounters, the harness will need to make a routing decision: does this go in active context? Compressed cache? External store? Disk? Nowhere? Getting this routing right — fast, at scale, without human intervention — is the defining challenge of next-generation harness engineering.&lt;/p&gt;
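
&lt;p&gt;In code, a routing decision might look like the sketch below; the tiers and thresholds are illustrative assumptions, and real policies would be tuned or learned per workload:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def route(item):
    if item.needed_now and item.tokens &amp;lt; 2_000:
        return "active_context"      # in the window, full fidelity
    if item.age_hours &amp;lt; 24:
        return "compressed_cache"    # cheap to re-expand if needed
    if item.is_durable_fact:
        return "project_file"        # CLAUDE.md-style persistence
    if item.semantic_value &amp;gt; 0.5:
        return "vector_store"        # retrievable on demand
    return "discard"                 # nowhere: forgetting is a valid route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;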

&lt;p&gt;Compaction strategies become differentiating. Claude Code's three-layer compaction is state-of-the-art for a single-agent coding tool today. But as tasks get longer and multi-agent workflows become standard, compaction will need to become task-aware (like Latent Briefing), importance-weighted (like Hippo's decay model), and cache-preserving (like Claude Code's prefix protection). The teams that ship the most capable agents won't be the ones with the most sophisticated compaction pipeline in isolation — they'll be the ones who understand what's in development at each layer of the research stack, recognize where models are and aren't capable of managing memory on their own, and synthesize all of that into a coherent strategy for how memory should work across long-horizon tasks.&lt;/p&gt;

&lt;p&gt;The harness is becoming an operating system for agent memory. This isn't hyperbole. An OS manages memory tiers (registers, L1/L2 cache, RAM, disk), makes routing decisions transparently, and provides a clean abstraction to the application layer. Harnesses are converging on the same architecture for agent memory.&lt;/p&gt;

&lt;p&gt;Independent research is arriving at the same conclusion. A recent paper from Yu, Zhang, Ni et al. explicitly frames multi-agent memory as a computer architecture problem — proposing a three-layer hierarchy (I/O, cache, memory) with shared vs. distributed paradigms and formal consistency protocols. Their central argument: multi-agent memory consistency is the most pressing unsolved problem in agent systems, just as cache coherence was for multiprocessor systems decades ago.&lt;/p&gt;

&lt;p&gt;The parallel is exact. And it means the teams building harnesses today are, whether they realize it or not, building the memory management layer of a new computing paradigm.&lt;/p&gt;

&lt;p&gt;The memory wall for coding agents isn't permanent. It's an engineering problem being solved at every layer simultaneously — in attention research, in community-built memory products, and in production harness architectures. The builders who understand where these layers connect will build the most capable agents. And the ones who realize that the harness isn't just an execution wrapper but a memory management system — those are the ones building the infrastructure that everything else will run on.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Weian Mao et al. "TriAttention: Efficient Tri-State KV Cache Compression for Long-Context Transformers." MIT, NVIDIA, Zhejiang University, 2026. Code: &lt;a href="https://github.com/WeianMao/triattention" rel="noopener noreferrer"&gt;https://github.com/WeianMao/triattention&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Google Research. "TurboQuant: Redefining AI Efficiency with Extreme Compression." ICLR 2026. &lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Ramp Labs. "Latent Briefing: Efficient Multi-Agent Context Sharing via Representation-Level Compaction." 2026. &lt;a href="https://x.com/RampLabs/status/2042660310851449223" rel="noopener noreferrer"&gt;https://x.com/RampLabs/status/2042660310851449223&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] MemPalace. "MemPalace: Structured Spatial Memory Architecture for AI Agents." Code: &lt;a href="https://github.com/MemPalace/mempalace" rel="noopener noreferrer"&gt;https://github.com/MemPalace/mempalace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] Hippo Memory. "Hippo Memory: Neuroscience-Inspired Memory with Forgetting and Consolidation." Code: &lt;a href="https://github.com/kitfunso/hippo-memory" rel="noopener noreferrer"&gt;https://github.com/kitfunso/hippo-memory&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] Yu, Zhang, Ni et al. "Multi-Agent Memory as a Computer Architecture Problem." arXiv:2603.10062, 2026. &lt;a href="https://arxiv.org/abs/2603.10062" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2603.10062&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>memory</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Rise of Edge AI — A New Layer in the Coding Agent Stack</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 02:13:32 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/the-rise-of-edge-ai-a-new-layer-in-the-coding-agent-stack-53hp</link>
      <guid>https://dev.to/mixture-of-experts/the-rise-of-edge-ai-a-new-layer-in-the-coding-agent-stack-53hp</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Compression breakthroughs are collapsing the hardware barrier. TurboQuant achieves 6x memory reduction with zero quality loss, and PrismML's 1-bit Bonsai 8B fits a competitive model in 1.15 GB — 14x smaller than its 16-bit equivalent. Models that required data center GPUs now run on a MacBook Pro or even a phone.&lt;/li&gt;
&lt;li&gt;Edge AI is earning a permanent place in the coding agent stack, not replacing the cloud. The open-source capability gap has closed to roughly three months. Gemma 4 ships edge-first with native function-calling under Apache 2.0, and Reflection AI's $2.5B raise signals that enterprises and infrastructure providers are investing in locally-deployable coding models as a complement to cloud services.&lt;/li&gt;
&lt;li&gt;Reinforcement learning is making tiny models genuinely useful for agent orchestration. LiquidAI's 350M-parameter model — roughly a quarter the size of GPT-2 — achieves over 95% accuracy in multi-turn tool-calling, running on hardware as small as a Raspberry Pi. Tool use is the capability that separates a coding agent from a chatbot, and it no longer requires billions of parameters.&lt;/li&gt;
&lt;li&gt;The local runtime is being purpose-built for coding agents. Ollama's MLX integration delivers 2x faster decode on Apple Silicon with caching designed specifically for agentic coding patterns — long, iterative conversations with repeated file context.&lt;/li&gt;
&lt;li&gt;A fully edge-native coding stack is viable today for cost-constrained and regulated environments. When local inference is free at the margin, the question shifts from "is edge good enough?" to "why am I paying for something I can run myself?" — and for air-gapped or compliance-bound teams, edge AI isn't a fallback, it's the only way AI enters the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next major shift in coding agents isn't about replacing the cloud. It's about complementing it — from your own hardware.&lt;/p&gt;

&lt;p&gt;Today, the most capable coding agents — Claude Code, Codex, Copilot — route every keystroke through remote inference servers. The assumption baked into the entire ecosystem is that frontier-quality AI requires frontier-scale hardware, which means renting compute from someone else. That assumption still holds for the hardest reasoning tasks. But a cascade of breakthroughs in the first quarter of 2026 is opening up a parallel track: model compression, edge-optimized releases, local runtime optimization, and reinforcement learning for small models are converging to make local AI not just possible, but genuinely useful for a growing class of developer workflows.&lt;/p&gt;

&lt;p&gt;Edge AI isn't arriving to kill the cloud. It's emerging as its own category — one that earns a permanent place in the development stack for specific, high-value scenarios where latency, privacy, cost, or availability matter more than peak reasoning power.&lt;/p&gt;

&lt;p&gt;This post walks through the evidence — paper by paper, release by release — and maps out the scenarios where edge AI will be the preferred choice for developers building and using coding agents. It's written for software engineers who use these tools daily, whether or not you've ever read a machine learning paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The compression revolution: making big models edge-ready
&lt;/h2&gt;

&lt;p&gt;The most direct path to edge AI is making existing models smaller without making them dumber. Two breakthroughs in early 2026 moved the needle dramatically, bringing frontier-class capabilities within reach of consumer hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google's TurboQuant: 6x memory reduction, zero quality loss
&lt;/h3&gt;

&lt;p&gt;Google Research revealed TurboQuant in March 2026 — a compression algorithm that reduces the memory footprint of large language models while boosting speed and maintaining accuracy. The technique targets the key-value cache, which Google describes as a "digital cheat sheet" storing previously computed attention states so the model doesn't recompute them from scratch.&lt;/p&gt;

&lt;p&gt;TurboQuant is a two-step process. First, PolarQuant converts the traditional Cartesian vector representation into polar coordinates — reducing each vector to a radius (data strength) and direction (semantic meaning). Google's analogy: instead of "go 3 blocks East, 4 blocks North," you say "go 5 blocks at 37 degrees." Less data, same destination, and no expensive normalization steps. Second, a technique called Quantized Johnson-Lindenstrauss applies a 1-bit error-correction layer, reducing residual quantization noise while preserving the distance relationships that attention scores depend on.&lt;/p&gt;

&lt;p&gt;The numbers: 6x memory reduction in the KV cache with perfect downstream accuracy across long-context benchmarks using both Gemma and Mistral models. Computing attention with 4-bit TurboQuant runs 8x faster than 32-bit unquantized keys on NVIDIA H100 accelerators. And critically, TurboQuant quantizes to 3 bits with no additional training — it can be applied to existing models off the shelf.&lt;/p&gt;

&lt;p&gt;For software engineers, here's the translation: a model that previously required 48 GB of VRAM could fit in 8 GB. That's the difference between a data center GPU and a MacBook Pro.&lt;/p&gt;
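
&lt;p&gt;A toy illustration of the core decomposition: store each vector as a radius plus a coarsely quantized direction. The real PolarQuant/QJL pipeline is far more involved (rotations, a 1-bit error-correction layer), so treat this strictly as intuition:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def polar_quantize(v):
    radius = np.linalg.norm(v)              # "go 5 blocks..."
    direction = np.sign(v).astype(np.int8)  # "...at 37 degrees": 1 bit per dim
    return radius, direction

def polar_dequantize(radius, direction):
    unit = direction / np.linalg.norm(direction)
    return radius * unit                    # lossy but compact

print(polar_dequantize(*polar_quantize(np.array([3.0, 4.0]))))  # ~[3.54 3.54]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;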

&lt;h3&gt;
  
  
  Caltech/PrismML: 1-bit models on your phone
&lt;/h3&gt;

&lt;p&gt;If TurboQuant is aggressive, PrismML's work — emerging from breakthrough research at Caltech — is radical. They've achieved true end-to-end 1-bit quantization: embeddings, attention layers, MLP layers, and the language model head are all compressed to a single bit per parameter. No higher-precision escape hatches.&lt;/p&gt;

&lt;p&gt;The result is Bonsai 8B: a model that competes with leading 8-billion-parameter models while occupying just 1.15 GB — 14x smaller than its 16-bit equivalent. PrismML measures this with an "Intelligence Density Score" of 1.06 per GB, compared to Qwen3 8B's 0.10 per GB. That's a 10.6x improvement in intelligence per unit of memory.&lt;/p&gt;

&lt;p&gt;What does this look like in practice? The Bonsai 8B runs at approximately 40 tokens per second on an iPhone 17 Pro and 131 tokens per second on an M4 Pro Mac — with an energy cost of just 0.068 mWh per token on the iPhone 17 Pro Max.&lt;/p&gt;

&lt;p&gt;An 8B-class model, competitive on benchmarks, running at interactive speeds on a phone. A year ago, that was a research fantasy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The edge-ready model wave: from cloud-only to run-anywhere
&lt;/h2&gt;

&lt;p&gt;Compression makes models smaller. But the edge story isn't just about shrinking existing models — it's about an entire class of models being designed, released, and optimized for local deployment. Open-weight releases, edge-first architectures, and purpose-built small models are all expanding what's possible without a cloud connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  The capability gap is closing fast
&lt;/h3&gt;

&lt;p&gt;Dave Friedman's analysis quantifies the trend. In 2023, closed-source models scored approximately 88% on MMLU benchmarks while open models managed 70.5% — a meaningful gap. By 2026, that gap is effectively zero on knowledge benchmarks and single digits on most reasoning tasks. Open-source models now trail the state of the art by approximately three months, down from roughly a year in late 2024.&lt;/p&gt;

&lt;p&gt;The efficiency story is equally compelling. DeepSeek's V3 model used 2.6 million GPU hours versus Llama 3 405B's 30.8 million — a tenfold efficiency improvement for comparable performance. DeepSeek's R1 reasoning model matched OpenAI's o1 at roughly 3% of the cost.&lt;/p&gt;

&lt;p&gt;For edge AI, the implication is direct: the models available for local deployment are no longer second-tier. Many of the tasks developers perform daily — code completion, documentation, refactoring, test generation — fall well within the capability range of models that can run on consumer hardware. The cloud retains its advantage for the hardest reasoning tasks, but the floor of "good enough for local" keeps rising.&lt;/p&gt;

&lt;p&gt;Capability-gap timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2023: closed-source leads (MMLU 88% vs 70.5%)&lt;/li&gt;
&lt;li&gt;2024: the gap narrows; open models trail by roughly a year&lt;/li&gt;
&lt;li&gt;2025: DeepSeek R1 matches o1 at roughly 3% of the cost&lt;/li&gt;
&lt;li&gt;2026: the gap is near zero on knowledge benchmarks; open models trail by roughly three months&lt;/li&gt;
&lt;li&gt;Result: edge-capable models reach "good enough" for daily tasks; the cloud keeps frontier reasoning, the edge takes everything else&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With LLM inference costs dropping roughly 10x annually and edge-capable models improving every quarter, the set of tasks that require cloud inference is shrinking. Cloud APIs will continue to lead on frontier reasoning, complex multi-step planning, and large-context tasks — but the everyday development workflows that consume the most tokens are increasingly viable on local hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemma 4: edge-first by design
&lt;/h3&gt;

&lt;p&gt;Google's Gemma 4 release in April 2026 is a landmark for the edge AI category. The Gemma 4 family ships in four sizes — E2B, E4B, 26B MoE, and 31B Dense — under an Apache 2.0 license, with the smaller variants explicitly designed for on-device deployment.&lt;/p&gt;

&lt;p&gt;The performance is no longer "good for a local model." It's simply good. The 31B model is the #3 open model in the world on the Arena AI text leaderboard. The 26B MoE is #6, outcompeting models 20x its size. The MoE architecture activates only 3.8 billion of its 26 billion total parameters during inference — frontier-level reasoning at a fraction of the compute cost.&lt;/p&gt;

&lt;p&gt;For edge deployment specifically, Gemma 4's E2B and E4B models run completely offline with near-zero latency across phones, Raspberry Pi, and NVIDIA Jetson Orin Nano. They feature 128K context windows, native multimodal capabilities (vision, audio), and — critically for coding agents — native function-calling, structured JSON output, and system instructions. These aren't toy models. They're agent-ready and edge-first, representing a new design philosophy: models built for the edge from the ground up, not cloud models shrunk down as an afterthought.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflection AI: $2.5 billion bet on deployable coding models
&lt;/h3&gt;

&lt;p&gt;The capital markets are backing the edge thesis. Reflection AI, founded in 2024 by former DeepMind researchers Misha Laskin and Ioannis Antonoglou, is raising $2.5 billion at a $25 billion valuation — backed by NVIDIA and JPMorgan Chase. The company's valuation went from $545 million to $25 billion in under 12 months. A 46x increase.&lt;/p&gt;

&lt;p&gt;Reflection builds open-weight models focused explicitly on automating software development — AI systems that write, test, and maintain code. Positioned as "the DeepSeek of the West," they're building a model network for enterprises, research institutions, and universities — models designed to run on your infrastructure, not just through an API.&lt;/p&gt;

&lt;p&gt;When NVIDIA pours nearly a billion dollars into a coding AI lab building locally-deployable models, and JPMorgan participates through its Security and Resiliency Initiative, the strategic message is clear: edge-deployable AI models aren't a research curiosity. They're a category that enterprises, governments, and infrastructure providers are investing in as a complement to cloud-based AI services.&lt;/p&gt;

&lt;h2&gt;
  
  
  The small model breakthrough: RL changes everything
&lt;/h2&gt;

&lt;p&gt;Compression makes big models edge-ready. Open weights and edge-first architectures give you deployment flexibility. But the third force might be the most surprising: small models are getting dramatically smarter through reinforcement learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  LiquidAI: a 350-million-parameter model that can use tools
&lt;/h3&gt;

&lt;p&gt;LiquidAI's LFM 2.5 350M is a 350-million-parameter model — roughly 1/20th the size of an 8-billion-parameter model like Bonsai — that delivers performance previously associated with models many times its size. The key innovation is applying large-scale reinforcement learning to a small model after expanded pre-training (28 trillion tokens, up from 10 trillion).&lt;/p&gt;

&lt;p&gt;The results redefine what "small" means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;76.96% on IFEval (instruction following), up from 64.96% in the previous version&lt;/li&gt;
&lt;li&gt;44.11 on BFCLv3 (tool use), roughly double the prior version's 22.95&lt;/li&gt;
&lt;li&gt;Over 95% accuracy in multi-turn tool-calling interactions across smart home, banking, and terminal use cases&lt;/li&gt;
&lt;li&gt;Runs at 40,400 tokens per second on an NVIDIA H100, and inference works on everything from a Raspberry Pi 5 to an Apple M5 Max&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 350-million-parameter model with reliable tool use. Think about what that means for coding agents. Tool use — calling functions, reading files, executing shell commands — is the foundational capability that separates a coding agent from a chatbot. If a model small enough to run on a Raspberry Pi can reliably call tools, the minimum hardware bar for a useful coding agent drops to essentially nothing.&lt;/p&gt;

&lt;p&gt;LiquidAI's model isn't recommended for complex math, code generation, or creative writing — those tasks still demand larger models. But for the orchestration layer — deciding which tools to call, in what order, with what arguments — a 350M model with strong tool-calling accuracy could serve as a lightweight local coordinator that dispatches heavier tasks to larger models only when necessary.&lt;/p&gt;
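
&lt;p&gt;A hedged sketch of what that coordinator pattern could look like is below; every identifier in it is hypothetical, illustrating the dispatch split rather than any shipped API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch of a hybrid dispatcher. Every identifier here is
// illustrative; only the Ollama /api/generate endpoint shape is real.
type Task = { kind: "plan" | "tool_call" | "codegen"; prompt: string };

async function callLocal(prompt: string): Promise&lt;string&gt; {
  // Small local model (e.g. a ~350M tool-caller) behind a local runtime;
  // the model tag is an assumed name.
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({ model: "lfm2.5-350m", prompt, stream: false }),
  });
  return (await res.json()).response;
}

async function callCloud(prompt: string): Promise&lt;string&gt; {
  // Placeholder: wire in your cloud provider's SDK for frontier reasoning.
  throw new Error("cloud dispatch not configured");
}

async function routeTask(task: Task): Promise&lt;string&gt; {
  // Orchestration and tool-calling stay local (fast, private, free at the
  // margin); code generation escalates to the cloud, per the split above.
  return task.kind === "codegen" ? callCloud(task.prompt) : callLocal(task.prompt);
}
&lt;/code&gt;&lt;/pre&gt;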

&lt;h2&gt;
  
  
  The runtime is ready: Ollama's Apple Silicon moment
&lt;/h2&gt;

&lt;p&gt;Models don't run in a vacuum. They need runtime infrastructure optimized for local hardware. Ollama's March 2026 update delivered exactly this.&lt;/p&gt;

&lt;p&gt;Ollama 0.19 is now built on Apple's MLX framework, directly leveraging the unified memory architecture of Apple Silicon. The performance gains are substantial: 1.6x faster prefill and roughly 2x faster decode speed compared to the previous version. On M5-series chips, Ollama taps the new GPU Neural Accelerators for further acceleration.&lt;/p&gt;

&lt;p&gt;But the most telling detail is what Ollama chose to optimize for. Their announcement explicitly names coding agents — Claude Code, OpenCode, Codex — as the primary beneficiaries. The new caching system reuses context across conversations, stores intelligent checkpoints within prompts, and implements smarter eviction policies where shared prefixes survive longer. These are features designed specifically for the pattern of agentic coding: long, iterative conversations where the model keeps returning to the same files and context.&lt;/p&gt;

&lt;p&gt;The infrastructure layer is no longer an afterthought. The local runtime is being purpose-built for coding agents — a clear signal that edge AI is maturing from experiment to product category.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trust and control gap: where edge AI earns its place
&lt;/h2&gt;

&lt;p&gt;There's a reason edge AI matters that goes deeper than latency and cost. Stanford's 2026 AI Index quantified a growing disconnect: only 10% of Americans say they're more excited than concerned about AI, compared to 56% of AI experts. On whether AI will help with jobs, 73% of experts say yes — only 23% of the public agrees.&lt;/p&gt;

&lt;p&gt;This trust gap creates real demand for alternatives. Frontier models like Anthropic's Mythos are expensive to serve, and AI power demand is now comparable to Switzerland's entire national electricity consumption. For developers and organizations who need AI capabilities but have concerns about data sovereignty, cost predictability, or availability, cloud-only isn't always the right answer.&lt;/p&gt;

&lt;p&gt;This is precisely the market that edge AI serves. It's not about choosing sides in a cloud-versus-local debate — it's about recognizing that different scenarios call for different deployment models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data-sensitive development: When working with proprietary codebases, regulated data, or pre-disclosure work, local inference means your code never leaves your machine.&lt;/li&gt;
&lt;li&gt;Cost-predictable workflows: For high-volume, routine tasks (linting, code completion, documentation), local models eliminate per-token costs entirely.&lt;/li&gt;
&lt;li&gt;Offline and low-latency scenarios: Air-gapped environments, travel, unreliable networks, or latency-sensitive workflows where round-trip times to a cloud API are unacceptable.&lt;/li&gt;
&lt;li&gt;Developer autonomy: The ability to fine-tune, customize, and control the model stack without vendor dependency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Edge AI isn't replacing cloud AI. It's filling gaps that cloud AI structurally cannot — and giving developers more options in how they architect their workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Predictions: how edge AI reshapes the coding agent stack
&lt;/h2&gt;

&lt;p&gt;Here's where the evidence points. These aren't about cloud AI disappearing — they're about a new layer emerging alongside it, with its own strengths and use cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  Hybrid agent architectures become the default
&lt;/h4&gt;

&lt;p&gt;Coding agents will increasingly run a lightweight local model for orchestration, tool-calling, and context management, while dispatching complex reasoning tasks to cloud models when the task demands it. LiquidAI's 350M model demonstrates that tool-calling reliability doesn't require billions of parameters. Gemma 4's E4B shows that meaningful code understanding fits in a phone-sized footprint. The architecture will be hybrid by design — local for speed, cost, and privacy; cloud for frontier reasoning when needed.&lt;/p&gt;

&lt;h4&gt;
  
  
  The always-on local coding daemon emerges
&lt;/h4&gt;

&lt;p&gt;Today, you start a coding agent session that connects to a remote API. Within two years, your IDE will also ship with a background process — a daemon — running a compressed model locally, always warm, always available. Ollama's caching improvements (cross-conversation reuse, intelligent checkpoints) are the early infrastructure for exactly this pattern. This local daemon handles the fast, repetitive work — completions, refactors, linting suggestions — while cloud agents remain available for deep reasoning and complex multi-file tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Edge creates a new cost tier for AI-assisted development
&lt;/h4&gt;

&lt;p&gt;As compression techniques like TurboQuant and Bonsai-style 1-bit quantization make local inference effectively free, a new pricing tier emerges: tasks that can run locally cost nothing at the margin. This doesn't eliminate cloud AI's value proposition — frontier reasoning, large-context synthesis, and model-as-a-service convenience remain worth paying for. But for the high-volume, routine tokens that make up the majority of a developer's daily AI usage, local inference is a compelling alternative. The strategic differentiation shifts toward which tasks each deployment model handles best.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data sovereignty becomes a first-class developer concern
&lt;/h4&gt;

&lt;p&gt;When a coding agent can run locally, proprietary code never has to leave your machine. For enterprises in regulated industries — finance, healthcare, defense — this unlocks AI-assisted development in contexts where cloud APIs were never an option. For individual developers, it means control over what data is shared and with whom. "Runs locally" will become a first-class feature in coding tool evaluations, not just a nice-to-have.&lt;/p&gt;

&lt;h4&gt;
  
  
  Edge-optimized coding models become a distinct category
&lt;/h4&gt;

&lt;p&gt;Reflection AI is raising $2.5 billion specifically to build deployable models for automated software development. Gemma 4 already ships with native function-calling and code generation in edge-friendly form factors. The trajectory is clear: by mid-2027, models fine-tuned specifically for coding, compressed to run on consumer hardware, and wrapped in polished local agent harnesses will be a recognized product category — not replacements for cloud coding agents, but purpose-built alternatives optimized for the scenarios where local deployment wins.&lt;/p&gt;

&lt;h4&gt;
  
  
  Some developers go fully edge — and never look back
&lt;/h4&gt;

&lt;p&gt;The preceding predictions frame edge AI as a complement to the cloud. But for a meaningful segment of developers, edge won't just be one layer — it will be the entire stack. The evidence already supports it: Bonsai 8B delivers competitive code understanding in 1.15 GB. Gemma 4's E4B provides native function-calling, 128K context, and structured output — everything a coding agent needs to operate autonomously. LiquidAI's 350M model handles tool orchestration on a Raspberry Pi. Ollama's runtime is purpose-built for agentic coding patterns. Stack these together and every component of a self-contained coding agent — orchestration, code comprehension, tool use, and runtime — runs on consumer hardware today.&lt;/p&gt;

&lt;p&gt;Two populations will drive this shift. First, cost-constrained developers — indie builders, students, and developers in emerging markets where per-token API costs are a real barrier. When local inference is free at the margin, the calculus isn't "is edge good enough?" — it's "why am I paying for something I can run myself?" Second, developers in regulated and air-gapped environments — defense contractors, healthcare organizations bound by HIPAA, government agencies with no external network access. For them, "cloud is not an option" isn't a preference; it's a hard constraint. Full-edge AI doesn't just complement their workflow — it's the only way AI enters their workflow at all.&lt;/p&gt;

&lt;p&gt;The natural objection is that models alone aren't enough — that cloud coding agents like Claude Code and Codex derive their real advantage from the harness, not just the model. The harness is the orchestration layer: the tool-calling logic, the context management, the workflow patterns that turn a raw model into a useful coding partner. And today, that's a real advantage. But it's an eroding one. The developer community is rapidly learning how to build its own harnesses — open-source agent frameworks, custom tool integrations, workflow automation that codifies exactly how a specific engineer or team works. Every month, more developers ship their own agentic workflows tailored to their stack, their codebase, their preferences. Simultaneously, the models themselves are getting better at leveraging these harnesses. A more capable local model doesn't just generate better code — it follows tool-calling conventions more reliably, handles multi-step workflows with less hand-holding, and recovers from errors more gracefully. The harness becomes easier to build and more effective to run as model quality improves. The result: the gap between a polished cloud agent's harness and what a motivated developer can assemble locally is closing from both directions.&lt;/p&gt;

&lt;p&gt;This won't be the default path for most developers. Cloud agents will retain clear advantages in frontier reasoning, massive-context synthesis, and zero-setup convenience. But the floor of "good enough for a full day's work" is rising fast. By late 2027, a developer with a modern laptop and no internet connection will be able to run a coding agent that handles completions, refactors, test generation, documentation, and tool-calling — all locally, all free. For the populations where cost or compliance makes cloud untenable, that's not a consolation prize. It's a better fit.&lt;/p&gt;

&lt;p&gt;Convergence picture: Compression (TurboQuant, Bonsai 1-bit) + Edge-Ready Models (Gemma 4, Reflection AI) + RL for Small Models (LiquidAI LFM 2.5) + Local Runtimes (Ollama + MLX) -&amp;gt; Edge AI layer in coding agents -&amp;gt; hybrid architecture (edge for speed &amp;amp; privacy, cloud for frontier reasoning), always-on local daemon alongside cloud agents, data sovereignty as a first-class feature, full-edge developers in cost-constrained and regulated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  A new layer in the stack
&lt;/h2&gt;

&lt;p&gt;Every layer of the stack is converging on the same conclusion: edge AI is ready to be a real part of the development workflow. Compression researchers are proving you can shrink models 14x without meaningful quality loss. Google is releasing edge-first models under Apache 2.0. A startup valued at $25 billion is building locally-deployable coding AI. Apple's own ML framework is being wired directly into local agent runtimes. A 350-million-parameter model can reliably call tools. And developers are increasingly asking for options beyond cloud-only.&lt;/p&gt;

&lt;p&gt;These aren't independent trends. They're the emergence of a new market category: edge AI for software development.&lt;/p&gt;

&lt;p&gt;For software engineers, the implication is practical. The coding agent stack you use a year from now will likely include both cloud and local models, each handling the tasks they're best suited for. Cloud APIs aren't going anywhere — frontier reasoning, massive-context synthesis, and the convenience of hosted inference remain valuable. But alongside them, a local layer will handle the fast, private, cost-free work that makes up the bulk of daily AI-assisted development.&lt;/p&gt;

&lt;p&gt;The developers who thrive will be the ones who understand both layers — when to reach for cloud reasoning power and when local inference is the smarter choice. Edge AI isn't the end of the cloud era. It's the beginning of a more nuanced one.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] "Google says new TurboQuant compression can lower AI memory usage without sacrificing quality." Ars Technica, March 2026. &lt;a href="https://arstechnica.com/ai/2026/03/google-says-new-turboquant-compression-can-lower-ai-memory-usage-without-sacrificing-quality" rel="noopener noreferrer"&gt;https://arstechnica.com/ai/2026/03/google-says-new-turboquant-compression-can-lower-ai-memory-usage-without-sacrificing-quality&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] "Caltech Researchers Claim Radical Compression of High Fidelity AI Models." Wall Street Journal, 2026. &lt;a href="https://www.wsj.com/cio-journal/caltech-researchers-claim-radical-compression-of-high-fidelity-ai-models-e66f31c9" rel="noopener noreferrer"&gt;https://www.wsj.com/cio-journal/caltech-researchers-claim-radical-compression-of-high-fidelity-ai-models-e66f31c9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] "Bonsai 8B: 1-bit models for mobile." PrismML, 2026. &lt;a href="https://prismml.com/news/bonsai-8b" rel="noopener noreferrer"&gt;https://prismml.com/news/bonsai-8b&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Dave Friedman. "Closed Source vs Open Source AI: A Shrinking Moat." Substack, 2026. &lt;a href="https://davefriedman.substack.com/p/closed-source-vs-open-source-ai-a" rel="noopener noreferrer"&gt;https://davefriedman.substack.com/p/closed-source-vs-open-source-ai-a&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] "Gemma 4: Our most capable open models to date." Google Blog, April 2, 2026. &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] "Nvidia-backed Reflection AI eyes $25 billion valuation." Reuters, March 2026. &lt;a href="https://www.reuters.com/business/nvidia-backed-reflection-ai-eyes-25-billion-valuation-wsj-reports-2026-03-26/" rel="noopener noreferrer"&gt;https://www.reuters.com/business/nvidia-backed-reflection-ai-eyes-25-billion-valuation-wsj-reports-2026-03-26/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[7] "LFM2.5-350M: No Size Left Behind." Liquid AI Blog, 2026. &lt;a href="https://www.liquid.ai/blog/lfm2-5-350m-no-size-left-behind" rel="noopener noreferrer"&gt;https://www.liquid.ai/blog/lfm2-5-350m-no-size-left-behind&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[8] "Ollama is now powered by MLX on Apple Silicon." Ollama Blog, March 30, 2026. &lt;a href="https://ollama.com/blog/mlx" rel="noopener noreferrer"&gt;https://ollama.com/blog/mlx&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[9] "2026 AI Index Report." Stanford HAI, 2026. &lt;a href="https://hai.stanford.edu/ai-index/2026-ai-index-report" rel="noopener noreferrer"&gt;https://hai.stanford.edu/ai-index/2026-ai-index-report&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Claude Opus 4.7: Anthropic's Agentic Reliability Release, Explained</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:52:27 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/claude-opus-47-anthropics-agentic-reliability-release-explained-1ckd</link>
      <guid>https://dev.to/mixture-of-experts/claude-opus-47-anthropics-agentic-reliability-release-explained-1ckd</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Opus 4.7 posts the strongest coding numbers of any generally-available frontier model: 87.6% on SWE-Bench Verified (up from 80.8% on Opus 4.6) and 64.3% on SWE-Bench Pro (up from 53.4%). On CursorBench it hits 70% versus Opus 4.6's 58%. The benchmark jump is real, but it's not the most interesting change.&lt;/li&gt;
&lt;li&gt;The release is about agent reliability, not just capability. Anthropic's own framing emphasizes that Opus 4.7 achieves the highest quality-per-tool-call ratio they've measured, with markedly lower rates of looping and better recovery from mid-run tool failures. For engineers running long autonomous jobs, that matters more than a benchmark delta.&lt;/li&gt;
&lt;li&gt;Two new surfaces to learn: xhigh effort level and Task Budgets (public beta). xhigh sits between high and max and is the new default in Claude Code. Task Budgets let you cap token spend across a multi-step run so the model prioritizes work instead of burning compute on the first sub-task.&lt;/li&gt;
&lt;li&gt;/ultrareview is a dedicated code-review session — a separate run that re-reads the diff with a reviewer's mindset and flags bugs and design issues. Pro and Max users get three free ultrareviews to try it.&lt;/li&gt;
&lt;li&gt;Drop-in migration: same API shape, same $5 / $25 per million tokens as Opus 4.6. The model ID is claude-opus-4-7, available on the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Prompts from 4.6 generally work, though the stricter instruction-following may require some retuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic released Claude Opus 4.7 today. On paper it's an incremental point release in the Claude 4.x line, priced identically to Opus 4.6 and exposed through the same API surface. But reading through the release notes, the third-party benchmark coverage, and the partner reports, a different story emerges: this isn't a benchmark release with a reliability footnote. It's a reliability release with a benchmark footnote.&lt;/p&gt;

&lt;p&gt;For software engineers shipping production AI features — especially anyone running coding agents, code review pipelines, or multi-step autonomous workflows — the changes in Opus 4.7 map directly onto the failure modes that actually waste engineering time. Looping agents. Silent error recovery that wasn't. Ballooning token spend on a six-hour run. This post walks through what's new, what the numbers actually say, what early partners are reporting, and where Opus 4.7 should and shouldn't land in your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark picture
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 leads the publicly-available frontier field on most coding benchmarks, but the delta is uneven across workloads. Here's the cleanest view of the numbers Anthropic and third parties have reported so far:&lt;/p&gt;

&lt;p&gt;Benchmarks (Opus 4.7 -&amp;gt; Opus 4.6 -&amp;gt; Notable peer):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncg4mgo0xaudkzk4mf9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncg4mgo0xaudkzk4mf9y.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two numbers deserve particular attention. On SWE-Bench Pro — the harder, larger, multi-repo variant that tracks real production-style issues — Opus 4.7 moves from 53.4% to 64.3%, an ~11-point jump. The visual acuity benchmark moves from 54.5% to 98.5%, which is the quantitative shadow of Anthropic's other vision claim: Opus 4.7 accepts images up to 2,576 pixels on the long edge, roughly 3x the resolution Opus 4.6 could ingest. Engineers generating UI mockups, reading dense dashboards, or inspecting failing screenshots should feel this immediately.&lt;/p&gt;

&lt;p&gt;One weakness worth flagging: Opus 4.7 trails GPT-5.4 meaningfully on BrowseComp (79.3% vs 89.3%). If your agent's bottleneck is navigating the open web — research agents, browser-based RPA, deep-research workflows — Claude is not the clear winner here.&lt;/p&gt;

&lt;p&gt;Anthropic also ran third-party evaluations with partners, and those are the numbers most aligned with real production work. On Rakuten-SWE-Bench (an internal benchmark constructed from actual Rakuten production tasks), Opus 4.7 resolves 3x more tasks than Opus 4.6, with double-digit improvements in code quality and test quality scores. Databricks reports 21% fewer errors on OfficeQA Pro, their document-reasoning benchmark, when the model is working from source documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually changed in how the model works
&lt;/h2&gt;

&lt;p&gt;The benchmark gains matter, but the new control surfaces and behavioral changes are where Opus 4.7 will show up in daily engineering work.&lt;/p&gt;

&lt;h3&gt;
  
  
  xhigh: a new reasoning effort level
&lt;/h3&gt;

&lt;p&gt;Claude's effort parameter already exposed minimal, low, medium, high, and max. Opus 4.7 inserts a new xhigh level between high and max. The practical point is that max is expensive and often latency-prohibitive for interactive work, while high sometimes under-reasons on hard tasks. xhigh gives you a middle rung. Anthropic has raised the Claude Code default to xhigh across all plans, which means existing Claude Code users will feel slightly slower, slightly smarter behavior by default starting today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adaptive extended thinking
&lt;/h3&gt;

&lt;p&gt;In Opus 4.6, extended thinking was effectively all-or-nothing — enabling it meant the model invested reasoning effort even on trivial queries, paying for itself sometimes and burning tokens for nothing other times. Opus 4.7 makes extended thinking context-aware. With it enabled, the model decides per-query how much depth a problem warrants: simple questions return quickly, complex ones get proportionally more reasoning. The practical effect is that you can leave extended thinking on without paying a flat latency tax on every request — meaningful for production deployments where request difficulty varies widely across the workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task Budgets (public beta)
&lt;/h3&gt;

&lt;p&gt;Task Budgets let you hand the model a token budget for a multi-step task so it can prioritize work across sub-tasks rather than burning through its budget on step one. This is a meaningful primitive for anyone running long agent jobs in production — the classic failure mode where an agent exhausts context on exploration and then has nothing left for execution now has a native knob.&lt;/p&gt;
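
&lt;p&gt;Anthropic's docs are the authority on the exact request shape; as a hedged sketch, assuming the new level rides on the existing effort parameter and guessing a task_budget field name for the beta:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Sketch only: "effort" follows the levels described above, and
// "task_budget" is a guessed field name for the public beta, not a
// confirmed API surface. Check Anthropic's docs for the real shape.
const msg = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 16000,
  effort: "xhigh",                    // the new rung between high and max
  task_budget: { tokens: 200_000 },   // cap spend across the whole run
  messages: [{ role: "user", content: "Migrate the payments module to the v3 SDK." }],
} as any);                            // beta fields may not be in SDK types yet
&lt;/code&gt;&lt;/pre&gt;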

&lt;h3&gt;
  
  
  /ultrareview
&lt;/h3&gt;

&lt;p&gt;The /ultrareview slash command kicks off a dedicated review session that re-reads a diff and surfaces bugs and design issues that a careful human reviewer would catch. Unlike asking Claude to review its own work inline, this runs as a separate session with a reviewer's prompt posture. Pro and Max users get three free ultrareviews to try; beyond that it's metered as normal usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic reliability: the less-flashy changes
&lt;/h3&gt;

&lt;p&gt;The behavioral deltas are the ones that don't fit cleanly on a benchmark chart. Anthropic reports that Opus 4.7 loops on roughly 1 in 18 of the queries that would have looped on prior Opus versions, keeps executing through tool failures that used to halt Opus 4.6, and devises its own verification steps before reporting a task complete. The concrete example Anthropic published — having the model build a Rust text-to-speech engine from scratch (neural model, SIMD kernels, browser demo) and then feed its own output through a speech recognizer to check that it matched the Python reference — is the clearest expression of what "verifies its own outputs" means in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  More conservative tool use
&lt;/h3&gt;

&lt;p&gt;Opus 4.7 is noticeably more reluctant to call tools autonomously than Opus 4.6. It defaults to answering from its training knowledge unless you point it at a source. This is the right default for production agents — fewer surprise tool calls, lower variance in cost and latency — but it changes how you should prompt. If you want the model to search the web, query a connector, or read from a specific knowledge source, name the source explicitly in the prompt ("search the web for X," "check the Slack channel," "read the file at this path"). Inspect the thinking trace afterward to verify which sources actually got used.&lt;/p&gt;
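
&lt;p&gt;A before-and-after makes the shift concrete; both prompts below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Implicit (worked on 4.6; may answer from stale training data on 4.7):
"What's the latest released version of the Playwright CLI?"

// Explicit (reliably triggers the tool on 4.7):
"Search the web for the latest released version of the Playwright CLI
and cite the page you used."
&lt;/code&gt;&lt;/pre&gt;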

&lt;p&gt;Workflow flow: Task arrives -&amp;gt; xhigh reasoning (new default in Claude Code) -&amp;gt; Long-running multi-step work -&amp;gt; if token spend gets large, Task Budgets redistribute effort, otherwise continue -&amp;gt; if a tool call fails, graceful recovery keeps going, otherwise continue -&amp;gt; Self-verify output before reporting done -&amp;gt; /ultrareview (separate review session) -&amp;gt; Ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What early partners are reporting
&lt;/h2&gt;

&lt;p&gt;Early-access partners' reports give a more grounded picture than benchmarks alone. The useful signal across their summaries is remarkably consistent: the improvements they highlight are about reliability under autonomy, not raw capability ceilings.&lt;/p&gt;

&lt;p&gt;Rakuten's engineering leadership has emphasized that the uplift on their internal benchmark translated into real movement in the quality metrics their teams care about — not just pass/fail on tasks, but code quality and test quality rising together. Databricks' framing of the OfficeQA Pro gain is practical: their users work against source documents, and a 21% drop in errors means fewer hallucinated citations and fewer manual re-runs.&lt;/p&gt;

&lt;p&gt;Three other partner reports from the enterprise early-access group paint a consistent picture. A financial technology platform observed the model catching its own logical errors during the planning phase rather than during execution — a behavioral shift that matters because plan-time errors are orders of magnitude cheaper than execution-time errors. A code review platform saw a greater-than-10% improvement in bug detection recall while holding precision steady, which is a harder combination to get than either metric alone. An autonomous workflow company reported ~14% gains in task success alongside one-third the tool errors, while using fewer tokens — a rare case where quality and efficiency moved in the same direction.&lt;/p&gt;

&lt;p&gt;The common thread: the behaviors that get better are the ones that make the difference between "impressive demo" and "safe to leave running overnight."&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually use this as a software engineer
&lt;/h2&gt;

&lt;p&gt;If you're building with Claude today, here's the practical playbook.&lt;/p&gt;

&lt;p&gt;Migrate opportunistically, not urgently. The API is drop-in compatible, pricing is unchanged, and the model ID is claude-opus-4-7. Run a shadow evaluation on your existing agent traces before flipping production traffic — not because migration is risky, but because the stricter instruction-following can expose prompts that were implicitly relying on Opus 4.6's looser interpretation. Concretely: Opus 4.7 takes directives more literally, so repeated emphasis ("be brief, really brief, don't ramble") and defensive padding ("skip the obvious parts") now execute more precisely than you may have intended. Prefer a single clear instruction over layered emphasis, and audit project- or system-level prompts that grew through accretion.&lt;/p&gt;
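
&lt;p&gt;A minimal shadow-eval loop could look like the sketch below; the trace file, its shape, and the claude-opus-4-6 model ID are placeholders for your own assets:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { readFile } from "node:fs/promises";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Shadow eval sketch: replay stored prompts against both versions and store
// the outputs side by side. "traces.json", its shape, and the
// "claude-opus-4-6" ID are placeholders for your own assets.
const traces: { prompt: string }[] = JSON.parse(await readFile("traces.json", "utf8"));

for (const trace of traces) {
  for (const model of ["claude-opus-4-6", "claude-opus-4-7"]) {
    const msg = await client.messages.create({
      model,
      max_tokens: 4096,
      messages: [{ role: "user", content: trace.prompt }],
    });
    // Diff and score offline with your own eval harness before flipping traffic.
    console.log(JSON.stringify({ model, prompt: trace.prompt, output: msg.content }));
  }
}
&lt;/code&gt;&lt;/pre&gt;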

&lt;p&gt;Default to xhigh, not max. For interactive coding work, xhigh is the sweet spot that the Claude Code team has already chosen as their new default. Save max for tasks you know need it and can afford to wait on.&lt;/p&gt;

&lt;p&gt;Reach for Task Budgets on anything multi-step. If you're orchestrating agents that run for more than a few minutes — research, refactors, migration scripts, data pipeline debugging — Task Budgets are the right primitive to prevent the classic "spent 80% of tokens exploring, 20% executing" failure. Start conservative; the knob rewards iteration.&lt;/p&gt;

&lt;p&gt;Put /ultrareview in your PR flow, but not as a rubber stamp. The most useful place for /ultrareview is between "Claude implemented it" and "human merges it" — a separate review session that catches the class of bugs a tired reviewer misses. It is not a replacement for a human reviewer on anything with security, compliance, or customer-data implications.&lt;/p&gt;

&lt;p&gt;Don't reach for Opus 4.7 for open-web research agents. The BrowseComp gap to GPT-5.4 is real and meaningful. If your agent's job is navigating the open web, run an A/B on both models before committing.&lt;/p&gt;

&lt;p&gt;Be explicit about which sources you want the model to consult. Because Opus 4.7 leans toward answering from its own knowledge before reaching for tools, prompts that worked on 4.6 by implicitly assuming "Claude will obviously search the web for this" can return stale or training-cutoff answers on 4.7. Name the source in the prompt: search the web for…, query the connector…, read this file at…. This is also a quiet quality-of-life win — your traces become easier to audit when tool selection is in the prompt instead of the model's discretion.&lt;/p&gt;

&lt;p&gt;Watch the vision path — and skip the pre-processing. If your stack uses Claude to look at mockups, screenshots, PDFs of dashboards, or generated UIs, the 3x resolution jump and the visual acuity benchmark jump (54.5% -&amp;gt; 98.5%) are the changes most likely to show up as noticeably better outputs without any prompt changes. The corollary: pipelines that pre-cropped, downsampled, or upscaled images to work around 4.6's resolution limits should be retired. Send the original — Opus 4.7 reads small axis labels, dense table cells, and footnotes natively.&lt;/p&gt;
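
&lt;p&gt;In API terms, "send the original" is just the standard image content block with no resizing step in front of it. A minimal sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { readFile } from "node:fs/promises";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Send the original screenshot: no pre-crop, no downsample. Opus 4.7 accepts
// up to 2,576 px on the long edge, so resolution workarounds can be retired.
const png = await readFile("dashboard.png");

const msg = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 2048,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        source: {
          type: "base64",
          media_type: "image/png",
          data: png.toString("base64"),
        },
      },
      { type: "text", text: "Which axis labels disagree with the table below the chart?" },
    ],
  }],
});
&lt;/code&gt;&lt;/pre&gt;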

&lt;p&gt;Read the system card. Anthropic published the Opus 4.7 system card alongside the release. Notable: low rates of deception and sycophancy, and stronger resistance to prompt injection than Opus 4.6, but modestly weaker on overly-detailed harm-reduction advice on controlled substances. If your deployment has safety-sensitive surfaces, read it before you ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Opus 4.7 signals about the direction
&lt;/h2&gt;

&lt;p&gt;A useful frame for thinking about this release: Anthropic is optimizing harder for autonomy reliability than for peak capability. The agentic-search gap to GPT-5.4 is notable because it's the one place where Anthropic clearly chose not to catch up in this release. The numbers they did move — quality-per-tool-call, loop resistance, mid-run error recovery, self-verification — are the ones that determine whether an agent is shippable, not just demonstrable.&lt;/p&gt;

&lt;p&gt;For software engineers, that's a meaningful product posture. The next year of AI engineering work is going to be dominated by "can I actually trust this thing to run without me watching?" The features in Opus 4.7 — xhigh as a cheaper path to deep reasoning, Task Budgets as a primitive for long runs, /ultrareview as a separate-session review gate, and the underlying reliability behaviors — are all calibrated to that question. Worth adopting, worth instrumenting, worth testing before you trust it on anything that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Anthropic, "Introducing Claude Opus 4.7." April 16, 2026. &lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/claude-opus-4-7&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] OfficeChai, "Anthropic Releases Claude Opus 4.7, Beats GPT-5.4, Gemini 3.1 Pro On Most Benchmarks." April 16, 2026. &lt;a href="https://officechai.com/ai/ckaude-opus-4-7-benchmarks/" rel="noopener noreferrer"&gt;https://officechai.com/ai/ckaude-opus-4-7-benchmarks/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Anthropic, "Claude Opus 4.7 — product page." April 16, 2026. &lt;a href="https://www.anthropic.com/claude/opus" rel="noopener noreferrer"&gt;https://www.anthropic.com/claude/opus&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Anthropic, "Working with Claude Opus 4.7." April 16, 2026. &lt;a href="https://claude.com/resources/tutorials/working-with-claude-opus-4-7" rel="noopener noreferrer"&gt;https://claude.com/resources/tutorials/working-with-claude-opus-4-7&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>claude</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Open Claude Design: A Weekend Harness Built on Atomic</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:44:02 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/open-claude-design-a-weekend-harness-built-on-atomic-2k22</link>
      <guid>https://dev.to/mixture-of-experts/open-claude-design-a-weekend-harness-built-on-atomic-2k22</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/FtnbwW95pgE"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Anthropic released Claude Design (&lt;a href="https://www.anthropic.com/news/claude-design-anthropic-labs" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/claude-design-anthropic-labs&lt;/a&gt;) on April 17, 2026 — a conversational tool for producing prototypes, slides, and marketing collateral, with a design-system import step, a refinement loop, and a Claude Code handoff bundle at the end.&lt;/p&gt;

&lt;p&gt;Three days later we shipped open-claude-design (&lt;a href="https://github.com/flora131/atomic/tree/main/src/sdk/workflows/builtin/open-claude-design" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic/tree/main/src/sdk/workflows/builtin/open-claude-design&lt;/a&gt;): an open-source replica implemented as a built-in Atomic workflow. Five deterministic phases, the same pipeline ported across three different coding agents (Claude Agent SDK, Copilot CLI, opencode) — roughly 500 lines of TypeScript orchestration per provider. The full source lives at src/sdk/workflows/builtin/open-claude-design.&lt;/p&gt;

&lt;p&gt;We didn't rebuild Claude Code to do this. We built a thin harness around it.&lt;/p&gt;

&lt;p&gt;That distinction is the point of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;

&lt;p&gt;Claude Design's UX is a conversation, but underneath it's a pipeline. We reverse-engineered the phases from the announcement and from the partner quotes ("20+ prompts to 2 prompts" is a tell — there's a deterministic skeleton under the chat).&lt;/p&gt;

&lt;p&gt;Pipeline flow:&lt;br&gt;
Phase 1: Design System Onboarding (parallel headless fan-out + HIL approval) -&amp;gt; Phase 2: Import (URL / file / codebase capture, headless) -&amp;gt; Phase 3: Generation (first design version, visible) -&amp;gt; Phase 4: Refinement Loop (≤5 iterations, HIL + parallel critique). The loop either iterates back on itself or, on approved / "ship it", moves to Phase 5: Export + Handoff (Claude Code / Copilot CLI / opencode).&lt;/p&gt;

&lt;p&gt;Headless stages run on Sonnet with bypassPermissions for cost and speed — but only in the Claude provider, where the Agent SDK lets us pin a per-stage model. The Copilot CLI and opencode providers don't expose that knob, so their headless stages inherit whatever orchestrator model the user invoked the workflow with. Visible stages inherit the orchestrator model (Opus) across all three providers and surface to the user. The refinement loop is a bounded human-in-the-loop cycle with early exit on completion signal phrases ("approved", "ship it", "done").&lt;/p&gt;

&lt;p&gt;Inside Phase 4, the refinement quality comes from pairing two tools: the impeccable skill drives the creative pass (taste, hierarchy, distinctive aesthetics over generic AI defaults), while the Playwright CLI captures screenshots of the rendered output so a critique sub-agent can inspect what actually shipped, not what the model thinks shipped. Visual grounding + structured critique closes the loop that a text-only refinement would leave open — the agent sees its own mistakes instead of hallucinating past them.&lt;/p&gt;

&lt;p&gt;The full topology — including the three parallel codebase-analysis sub-agents in Phase 1 and the parallel critique + screenshot validation in Phase 4 — is laid out in the workflow source.&lt;/p&gt;
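
&lt;p&gt;For flavor, here is a condensed sketch of that Phase 4 loop, compressed from the real source; treat the prompt builders and stage names as approximations:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Condensed sketch of the Phase 4 loop (not the verbatim source): bounded
// iterations, screenshot-grounded critique, early exit on approval signals.
const SIGNALS = ["approved", "ship it", "done"];

for (let i = 0; i &lt; 5; i++) {
  // Ground the critique in reality: shell out to the Playwright CLI so the
  // sub-agent inspects a screenshot of what actually rendered.
  await ctx.stage({ name: `capture-${i}`, headless: true }, {}, {}, async (s) =&gt;
    s.session.query(buildScreenshotPrompt(), { ...HEADLESS_OPTS }),
  );

  // Visible refinement pass: impeccable-skill creative edit + user feedback.
  const feedback = await ctx.stage({ name: `refine-${i}` }, {}, {}, async (s) =&gt;
    s.session.query(buildRefinementPrompt({ iteration: i })),
  );

  // Early exit when the user signals completion.
  if (SIGNALS.some((sig) =&gt; feedback.result.toLowerCase().includes(sig))) break;
}
&lt;/code&gt;&lt;/pre&gt;
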
&lt;h2&gt;
  
  
  The workflow SDK is the whole trick
&lt;/h2&gt;

&lt;p&gt;Here's a trimmed version of the Claude provider for Phase 1 — the parallel fan-out followed by a human-in-the-loop approval stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Layer 1: three headless agents analyze the codebase in parallel&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ds-locator&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nf"&gt;buildDesignLocatorPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;root&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codebase-locator&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;HEADLESS_OPTS&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ds-analyzer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nf"&gt;buildDesignAnalyzerPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;root&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codebase-analyzer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;HEADLESS_OPTS&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ds-patterns&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nf"&gt;buildDesignPatternPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;root&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codebase-pattern-finder&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;HEADLESS_OPTS&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Layer 2: visible agent reviews the findings with the user&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;design-system-builder&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;buildDesignSystemBuilderPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;locatorOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;analyzerOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;patternsOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to notice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ctx.stage is just a function around a session. The orchestration is plain TypeScript — Promise.all, for loops, early break on signal phrases. No DSL. No YAML. No graph declaration.&lt;/li&gt;
&lt;li&gt;s.session.query calls the coding agent's native harness. We're not reimplementing Claude Code's tool loop, its permission model, or its subagent dispatch — we're calling into them. agent: "codebase-locator" points at an existing Atomic subagent; HEADLESS_OPTS sets bypassPermissions and forces Sonnet.&lt;/li&gt;
&lt;li&gt;The orchestrator picks the minimum toolset for each stage. Headless analyzers get bypassPermissions. Visible stages inherit Opus. The refinement loop gets AskUserQuestion. Each stage sees only what it needs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The headless model is also a knob, not a fixed choice. The HEADLESS_OPTS constant pins the sub-agents to Sonnet by default because the analysis stages are well-scoped and cost-sensitive, but you can swap it to Opus for harder codebases, or drop the model field entirely to inherit whatever the orchestrator is running. One line, repo-wide — pick your point on the cost/performance curve.&lt;/p&gt;
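
&lt;p&gt;For reference, the shape of that constant is roughly the following; field names follow the Claude Agent SDK's options, and the Sonnet tag is an assumed name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Approximate shape of HEADLESS_OPTS; the real constant lives in the
// workflow source. Field names follow the Claude Agent SDK's options, and
// the Sonnet model tag is an assumed name.
const HEADLESS_OPTS = {
  model: "claude-sonnet-4-6",           // swap to Opus, or delete to inherit
  permissionMode: "bypassPermissions",  // headless stages skip HIL gates
} as const;
&lt;/code&gt;&lt;/pre&gt;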

&lt;p&gt;Prompts are the other knob, and usually the more important one. Each stage's instructions are a plain TypeScript function — buildDesignLocatorPrompt, buildDesignAnalyzerPrompt, the refinement critique prompt — so tailoring outputs to your stack means editing a string, not reconfiguring the pipeline. Want the analyzer to look specifically for shadcn tokens, or the generator to prefer Tailwind over inline styles, or the critique to hammer on accessibility over aesthetics? Edit the prompt. Swapping models gets you capacity; adjusting the instructions is what dials in taste, framework conventions, and the specific shape of output you want for your project. The two knobs are complementary — you'll almost always reach for the prompt first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow-creator skill got us 90% of the way there
&lt;/h2&gt;

&lt;p&gt;The non-obvious part was the pipeline shape, not the code. Once we knew what phases we wanted, the workflow-creator skill scaffolds the defineWorkflow().run().compile() structure, the ctx.stage calls, the WorkflowInput schema, and the provider split (Claude vs. Copilot vs. opencode).&lt;/p&gt;

&lt;p&gt;Our actual work was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phase 1 product analysis — watched the Claude Design demo, read the announcement, listed the phases.&lt;/li&gt;
&lt;li&gt;Scaffold via workflow-creator — described the five phases and the topology, got back a working provider skeleton.&lt;/li&gt;
&lt;li&gt;Tweak prompts and behavior — adjusted the stage prompts, model assignments, and early-exit conditions until the pipeline produced what we wanted.&lt;/li&gt;
&lt;li&gt;Test across the three agents — ran the same workflow under Claude, Copilot CLI, and opencode.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The research artifacts — the product analysis, the SDK mapping, the RFC — all live alongside the workflow source on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same pipeline, three coding agents
&lt;/h2&gt;

&lt;p&gt;Because the SDK's only abstraction over the agent is s.session.query(...), porting to a different coding agent is mechanical. The Copilot CLI provider is the same five phases; it just passes different stage options and deals with Copilot's SessionEvent[] message format on the way out:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;atomic workflow -n open-claude-design -a claude    --prompt "Landing page for a dev tool"
atomic workflow -n open-claude-design -a copilot   --prompt "Landing page for a dev tool"
atomic workflow -n open-claude-design -a opencode  --prompt "Landing page for a dev tool"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One workflow, three harnesses, identical CLI surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "thin harness" is the right frame
&lt;/h2&gt;

&lt;p&gt;The temptation when you want agent X to do task Y is to build a new agent. It's the wrong instinct. Coding agents are already harnesses — they have a tool loop, a permission model, subagents, skills, MCP. Rebuilding that is how you end up with a 50K-line framework that's worse than what you wrapped.&lt;/p&gt;

&lt;p&gt;A thin harness inverts the relationship:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't own the agent's inner loop. Claude Code keeps its tool-use cycle. Copilot CLI keeps its session machinery. opencode keeps its own runtime. Your code never reimplements any of them.&lt;/li&gt;
&lt;li&gt;You own the outer pipeline. Which stages run, in what order, under what model, with what permissions, with what early-exit conditions. This is the part that's actually workflow-specific.&lt;/li&gt;
&lt;li&gt;The abstraction is one function. s.session.query(prompt, opts). Everything above it — Promise.all, for, if — is TypeScript you already know.&lt;/li&gt;
&lt;li&gt;You pick the minimum toolset per stage. Headless analyzers don't get write permissions. Visible creative stages inherit Opus. Each stage sees what it needs and nothing more — the cheapest way to keep a long pipeline coherent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you give up. Claude Design's chat UX streams tokens straight into a rendered preview — it feels fast because the product is purpose-built around that loop. A CLI workflow with discrete phases and HIL gates won't match that feel, and shouldn't try. You're trading perceived latency for a pipeline you can read, fork, and re-point at any coding agent. If you want the streaming feel back, that's what the next paragraph is for — the workflow SDK doesn't care whether the frontend is a CLI, a web app, or a chat surface.&lt;/p&gt;

&lt;p&gt;Claude Design is a product. Open Claude Design is a recipe. The recipe runs on whatever coding agent you already trust, in your own repo, against your own design system, exported to whatever you want. You can read every line.&lt;/p&gt;

&lt;p&gt;And because the pipeline is just TypeScript, you can fork it, add a phase, swap a model, change the early-exit conditions, or bolt a vercel deploy step onto Phase 5. Or go further — build your own harness entirely, wrap it in whatever UX you want (a web app, a desktop shell, a chat surface, a VS Code extension), and let the workflow SDK be the thing underneath. The CLI is one frontend; nothing stops you from writing another. That's the part that matters. Not the workflow — the fact that building the next workflow, or the next harness around it, is a weekend.&lt;/p&gt;

&lt;p&gt;This is what coding at scale looks like from here on out: teams won't just use coding agents, they'll build thin harnesses like open-claude-design to orchestrate them across every dev workflow they run.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] "open-claude-design — workflow source." Atomic, GitHub. &lt;a href="https://github.com/flora131/atomic/tree/main/src/sdk/workflows/builtin/open-claude-design" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic/tree/main/src/sdk/workflows/builtin/open-claude-design&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Anthropic, "Claude Design — Anthropic Labs." April 17, 2026. &lt;a href="https://www.anthropic.com/news/claude-design-anthropic-labs" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/claude-design-anthropic-labs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] "Atomic — agent workflow toolkit." GitHub. &lt;a href="https://github.com/flora131/atomic" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] "Atomic workflow architecture." alexlavaee.me, 2026. &lt;a href="https://alexlavaee.me/blog/atomic-workflow" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/atomic-workflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] "Harness engineering: why coding agents need infrastructure." alexlavaee.me, 2026. &lt;a href="https://alexlavaee.me/blog/harness-engineering-why-coding-agents-need-infrastructure" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/harness-engineering-why-coding-agents-need-infrastructure&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>design</category>
      <category>programming</category>
    </item>
    <item>
      <title>Atomic's Workflow SDK: Deterministically Extending Coding Agents</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:35:43 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/atomics-workflow-sdk-deterministically-extending-coding-agents-29ph</link>
      <guid>https://dev.to/mixture-of-experts/atomics-workflow-sdk-deterministically-extending-coding-agents-29ph</guid>
      <description>&lt;p&gt;Coding agents are great at day-to-day work. What they still can't do reliably — and what keeps you babysitting every step — is finish a long-running, complex task while following your team's specific guardrails. After thousands of hours shipping with coding agents, I've landed on what actually helps me amplify what coding agents already do well into long-running, ambiguous tasks and am open-sourcing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Neither a coding agent nor a general framework closes this gap
&lt;/h2&gt;

&lt;p&gt;Coding agents alone can't do it. By design, they ship as strong harnesses built for day-to-day coding — and they're genuinely good at the hard parts of that job: context management, memory, tool orchestration, and sub-agent dispatch inside a session. What they can't reliably do is follow your specific guardrails on long-running, ambiguous, complex work. For example, there's no built-in way to migrate a 300-file React 17→19 upgrade in the dependency order your senior engineers mapped out, run your team's regression gate between each batch, pause for human review on the files you flagged as high-risk, and keep the branch green end-to-end.&lt;/p&gt;

&lt;p&gt;The second you reach for a general agent framework to get that structure, you're wrapping the coding-agent SDK inside their graph nodes. Thousands of lines of net-new code to rebuild a tool loop, permission model, sub-agent dispatcher, and context manager — all things your coding agent already has, except worse.&lt;/p&gt;

&lt;p&gt;Others skip the framework and build a custom harness around the raw model. Same problem at a different layer: you get structure, not your guardrails — the constraints, review bars, and team-specific requirements that actually determine whether the output is usable.&lt;/p&gt;

&lt;p&gt;None of these paths give you an easy way to put gutter guards around the coding agent. Workflows are those guards: guardrails that keep the agent on your team's path through long-running or ambiguous work, so you're not watching every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Inside a session, a coding agent can add a feature, fix a bug, refactor a module. It's fine.&lt;/p&gt;

&lt;p&gt;Outside a session is where the failures live:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call triage: an alert fires; the agent loses the trace context the moment the session resets.&lt;/li&gt;
&lt;li&gt;Complex refactors in large codebases: constraints drift by the third or fourth session.&lt;/li&gt;
&lt;li&gt;Team review standards: every engineer prompts the agent slightly differently, so every engineer gets slightly different results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of your time goes into babysitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Atomic does
&lt;/h2&gt;

&lt;p&gt;Atomic (&lt;a href="https://github.com/flora131/atomic" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic&lt;/a&gt;) is a TypeScript SDK that enhances a coding agent by wrapping configurable, deterministic structure around it. The agent's harness — tool-use, context management, sub-agents, permission model — stays intact and keeps doing what it's good at. Atomic adds the outer pipeline that encodes your specific guardrails, so the agent's execution actually follows them on long-running, ambiguous work.&lt;/p&gt;

&lt;p&gt;A workflow is plain TypeScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineWorkflow&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@bastani/atomic/workflows&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineWorkflow&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;review-and-fix&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;review&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Review the diff against our UX standards.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fix&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Address the findings in &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every ctx.stage is a real coding-agent session in its own tmux pane — Claude Code, Copilot CLI, or opencode, interchangeable with a single flag. Data flows between stages only through explicit transcript reads. Topology — parallel fan-out, serial dependencies — comes from await and Promise.all, not a graph DSL. .compile() freezes the graph, so the only variance between runs is the LLM's output.&lt;/p&gt;
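
&lt;p&gt;To make the topology point concrete, here's a minimal sketch of a parallel fan-out feeding a serial aggregate stage, reusing the ctx.stage shape from the snippet above. The stage names and prompts are invented for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Inside the same .run(async (ctx) =&amp;gt; { ... }) callback as above.
// Fan-out: two scoped review sessions run concurrently, each in its own pane.
const [a11y, spacing] = await Promise.all([
  ctx.stage({ name: "a11y-review" }, {}, {}, async (s) =&amp;gt; {
    await s.session.query("Review the diff for accessibility issues.");
    s.save(s.sessionId);
  }),
  ctx.stage({ name: "spacing-review" }, {}, {}, async (s) =&amp;gt; {
    await s.session.query("Review the diff for spacing-system violations.");
    s.save(s.sessionId);
  }),
]);

// Serial dependency: a plain await after the fan-out. Data crosses stages
// only through explicit transcript reads.
await ctx.stage({ name: "aggregate" }, {}, {}, async (s) =&amp;gt; {
  const reports = [await s.transcript(a11y), await s.transcript(spacing)];
  await s.session.query(
    `Merge the findings at ${reports.map((r) =&amp;gt; r.path).join(" and ")} into one report.`
  );
  s.save(s.sessionId);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;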

&lt;p&gt;That's the entire mental model. You're aligning the coding agent's execution with your team's explicit goals, so long-running and ambiguous work finally becomes tractable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;The non-obvious part of any of this is the shape of the pipeline — which stages, which run parallel, where the human gate goes. Describe your workflow in natural language to Atomic's workflow-creator skill and it hands you a working skeleton in minutes. That's our actual dev loop — not hand-writing topology.&lt;/p&gt;

&lt;p&gt;Example flow: PR opened -&amp;gt; Fan-out into parallel coding-agent sessions (accessibility, spacing, copy, reuse) -&amp;gt; Aggregate findings -&amp;gt; HIL (human-in-the-loop) approval gate -&amp;gt; approve unblocks merge, or changes requested kicks off a Ralph loop (plan -&amp;gt; implement -&amp;gt; review -&amp;gt; debug) that feeds back into Aggregate findings.&lt;/p&gt;

&lt;p&gt;Workflows teams have already built:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;UX review gate on every PR. On pull_request.opened, dispatch a fleet of coding-agent sessions specialized on your design system, each reviewing the diff along a different axis (accessibility, spacing, copy, reuse). Merge is blocked until a human approves.&lt;/li&gt;
&lt;li&gt;50-persona feedback gate pre-PR. Before a feature PR opens, dispatch 50 headless sessions in parallel — each primed with a distinct persona (the skeptical CFO, the power-user admin, the first-time mobile user, the accessibility-dependent reviewer). Feedback rolls into one report with tasks. A human picks what to implement; Atomic's built-in Ralph loop (planner → orchestrator → worker → reviewer → debugger) executes and raises back to the human.&lt;/li&gt;
&lt;li&gt;Support ticket → root cause → draft PR. A webhook drops tickets into the workflow. Agents research the codebase, write the root cause back onto the ticket, and attempt a fix in a sandboxed branch. A human gate reviews the diff and evidence; the PR only opens on approval.&lt;/li&gt;
&lt;li&gt;Production regression triage. A workflow listens to observability alerts, pulls the failing trace, deep-researches the codebase, and dispatches a session to localize the regression against recent commits. High-confidence fix? Draft PR with a repro. Low-confidence? On-call gets a ranked shortlist of suspects instead of a raw stack trace.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every one of these lives in your repo as a TypeScript file. You run it, diff it, fork it, code-review it. Sharing across the team is merging the file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandboxed by default
&lt;/h2&gt;

&lt;p&gt;Workflows run with the coding agent's permission checks disabled — which is how you get one-shot execution without constant approval prompts, and why you should never run them on your host. Atomic ships three devcontainer features on GHCR (Claude, Copilot, opencode) with Bun, the CLI, playwright-cli, and config templates pre-baked. "Try this workflow" is code . plus rebuild-and-reopen-in-container, not an hour of setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Systems, not prompts
&lt;/h2&gt;

&lt;p&gt;Move from prompting to systems thinking. Define the pipeline once. Run it the same way every time. Stop babysitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] "Atomic — open-source TypeScript SDK for coding-agent workflows." GitHub. &lt;a href="https://github.com/flora131/atomic" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] "Atomic example workflows." GitHub. &lt;a href="https://github.com/flora131/atomic/tree/main/.atomic/workflows" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic/tree/main/.atomic/workflows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] "Atomic SDK source." GitHub. &lt;a href="https://github.com/flora131/atomic/tree/main/src/sdk" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic/tree/main/src/sdk&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] "Open Claude Design: a weekend harness built on Atomic." alexlavaee.me, 2026. &lt;a href="https://alexlavaee.me/blog/open-claude-design-atomic-harness" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/open-claude-design-atomic-harness&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] "Harness engineering: why coding agents need infrastructure." alexlavaee.me, 2026. &lt;a href="https://alexlavaee.me/blog/harness-engineering-why-coding-agents-need-infrastructure" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/harness-engineering-why-coding-agents-need-infrastructure&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] "Atomic: automated procedures and memory for AI coding agents." alexlavaee.me, 2025. &lt;a href="https://alexlavaee.me/blog/atomic-workflow" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/atomic-workflow&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>GPT-5.5: The Honest Take on OpenAI's Response to Opus 4.7</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:30:12 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/gpt-55-the-honest-take-on-openais-response-to-opus-47-3m58</link>
      <guid>https://dev.to/mixture-of-experts/gpt-55-the-honest-take-on-openais-response-to-opus-47-3m58</guid>
      <description>&lt;p&gt;OpenAI released GPT-5.5 today, exactly one week after Anthropic shipped Claude Opus 4.7. The timing is not subtle. Opus 4.7 took the SWE-Bench Verified crown at 87.6% and put Anthropic at the top of most third-party coding leaderboards; GPT-5.5 is the direct response. Worth flagging upfront: SWE-Bench Verified scores at this tier should be read with heavy skepticism. Every frontier lab has plausibly trained on or adjacent to this data, and Anthropic itself has acknowledged memorization signals on related SWE-Bench splits. Treat any Verified or Pro number in this post as a directional signal, not a trustworthy measurement — we include them because they are what the labs report, not because we think they carry much weight.&lt;/p&gt;

&lt;p&gt;The release is interesting for software engineers not because it "wins" — the verdict is more mixed than OpenAI's launch post suggests — but because of the specific benchmarks it wins on, the specific ones it doesn't, and the pricing decision that frames everything else. This post walks through what changed, how OpenAI built and served it, what the numbers actually say relative to Opus 4.7 and Gemini 3.1 Pro, and what the first day of real usage is surfacing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark picture
&lt;/h2&gt;

&lt;p&gt;The cleanest summary: GPT-5.5 is state-of-the-art on a subset of coding and math benchmarks, nominally behind Opus 4.7 on SWE-Bench Pro (a benchmark we'd largely discount given widespread memorization evidence), and behind both Opus 4.7 and Gemini 3.1 Pro on several agent/tool-use workloads.&lt;/p&gt;

&lt;p&gt;Benchmarks (GPT-5.5 -&amp;gt; GPT-5.4 -&amp;gt; Opus 4.7 -&amp;gt; Gemini 3.1 Pro):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp7jsxnmh7j064yo0k6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp7jsxnmh7j064yo0k6e.png" alt=" " width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All figures from OpenAI's release; asterisks and memorization caveats discussed below.&lt;/p&gt;

&lt;p&gt;*We'd argue SWE-Bench Pro (and SWE-Bench Verified) should be heavily discounted at this point. Both benchmarks have known memorization issues: Anthropic's own Opus 4.7 notes flag "evidence of memorization" on the benchmark, and Scale's public leaderboard methodology documents this as a known failure mode. When every frontier lab has plausibly seen the data, the scores tell you more about training set overlap than model capability. Use them as a floor, not a ranking — and weight Terminal-Bench 2.0, OSWorld-Verified, Expert-SWE, and your own task-specific evals far more heavily.&lt;/p&gt;

&lt;p&gt;One real gap deserves attention: Terminal-Bench 2.0. A 13-point lead over Opus 4.7 is the largest single-benchmark gap between today's frontier coding models, and this benchmark is newer and harder to pre-train against than the SWE-Bench family. If your agent workload is long-running terminal sessions — sandboxed CI jobs, reproduction scripts, multi-step shell workflows — GPT-5.5 leads it clearly. On MCP Atlas, Opus 4.7 still edges ahead, which matters more for tool-heavy agent workloads than the SWE-Bench Pro delta does.&lt;/p&gt;

&lt;p&gt;As with the last several releases, the honest framing is that no single model is best at everything. Which model wins depends on which benchmark you pick, which scaffold runs the evaluation, and what your actual workload looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the model was built and served
&lt;/h2&gt;

&lt;p&gt;OpenAI's most concrete technical claim in the release is about serving, not training: GPT-5.5 matches GPT-5.4 per-token latency in production while performing at a higher level of intelligence. For a larger, more capable model, holding latency flat is a non-trivial infrastructure result.&lt;/p&gt;

&lt;p&gt;The stated method has two parts.&lt;/p&gt;

&lt;p&gt;Co-designed with NVIDIA GB200 and GB300 NVL72 systems. OpenAI says GPT-5.5 was trained on and served from Blackwell-class hardware, and that the serving stack was optimized in lockstep with the model. The release post specifically credits Codex and GPT-5.5 itself for helping identify infrastructure optimizations — model-assisted systems work, which is increasingly how frontier labs describe their inference stacks.&lt;/p&gt;

&lt;p&gt;Dynamic load balancing replaced static chunking. Before GPT-5.5, OpenAI split requests on each accelerator into a fixed number of chunks. Codex analyzed weeks of production traffic and wrote custom heuristic algorithms to partition work dynamically based on request shape, reportedly increasing token generation speed by over 20%.&lt;/p&gt;
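
&lt;p&gt;OpenAI hasn't published those heuristics, but the shape of the idea is easy to sketch. In this toy TypeScript illustration (every threshold and target size is invented, not OpenAI's), a static policy splits every request the same way, while a dynamic policy sizes chunks from the request's shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Toy contrast between static and shape-aware request chunking.
interface RequestShape {
  promptTokens: number;          // prefill length
  expectedOutputTokens: number;  // decode-length estimate
}

// Static policy: every request splits into the same fixed number of chunks.
const STATIC_CHUNKS = 4;
function staticPartition(req: RequestShape): number[] {
  const size = Math.ceil(req.promptTokens / STATIC_CHUNKS);
  return Array.from({ length: STATIC_CHUNKS }, () =&amp;gt; size);
}

// Dynamic policy: chunk size follows the request's shape, so short prompts
// are not over-split and long prefills do not monopolize an accelerator.
function dynamicPartition(req: RequestShape): number[] {
  let target = 4_000;                             // prefill-heavy default
  if (req.promptTokens &amp;lt; 2_000) target = Math.max(1, req.promptTokens);
  else if (req.expectedOutputTokens &amp;gt; req.promptTokens) target = 1_000;
  const n = Math.max(1, Math.ceil(req.promptTokens / target));
  return Array.from({ length: n }, () =&amp;gt; Math.ceil(req.promptTokens / n));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;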

&lt;p&gt;This doesn't change anything about how you build with the model, but it's worth internalizing: much of the per-token efficiency story is about the serving system, not the weights. It also explains the pricing: GPT-5.5 is a larger, more expensive model to serve than GPT-5.4, and the 2x API rate hike partially reflects that cost even after the dynamic batching wins.&lt;/p&gt;

&lt;p&gt;One other architectural note worth flagging for engineers planning migrations: the 1M context window is now supported both in Codex (standard) and in the forthcoming API endpoint. Long-context performance looks materially better than GPT-5.4. On OpenAI MRCR v2 8-needle from 512K–1M, GPT-5.5 scores 74.0% vs GPT-5.4's 36.6%. On Graphwalks BFS at 1M tokens, F1 goes from 9.4% (GPT-5.4) to 45.4%. This is probably the largest generational jump in GPT-5.5 and gets less attention than the coding numbers.&lt;/p&gt;

&lt;p&gt;That said: Opus 4.7 still beats GPT-5.5 on several mid-range long-context evaluations. And the honest caveat from prior GPT-5 releases still applies — a 1M window where the last 400K tokens are unreliable is functionally smaller than the marketing suggests. Treat the 1M number as a real improvement over GPT-5.4, not as a license to stop managing context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing, and why it's the most-discussed part of the release
&lt;/h2&gt;

&lt;p&gt;The pricing on gpt-5.5 is $5 per 1M input tokens and $30 per 1M output tokens, with batch and flex at half that rate and priority at 2.5x. gpt-5.5-pro is $30 / $180. Compared to GPT-5.4 ($2.50 / $15), the base model is a flat 2x price increase.&lt;/p&gt;

&lt;p&gt;This dominated the first day's discussion. One Hacker News commenter summarized it bluntly: this is roughly 3x the price of GPT-5.1 released six months earlier. OpenAI's counter-argument, repeated several times in the release and by employees in community threads, is that GPT-5.5 uses meaningfully fewer tokens for the same task — so price per completed unit of work can be lower even when price per million tokens is higher.&lt;/p&gt;

&lt;p&gt;Both things are true. The practical implication depends on where you're running it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Codex and ChatGPT subscriptions, OpenAI says the per-task token reduction is enough that most users get better results with fewer tokens at their existing tier. This matches early reports from subscribers on the HN thread.&lt;/li&gt;
&lt;li&gt;In the API, the math is workload-dependent. If your app sends short prompts and GPT-5.5 produces 30% fewer output tokens than GPT-5.4, you're still paying ~40% more per call (worked through in the sketch after this list). If you're running long-horizon agent loops where GPT-5.5 cuts tokens by half, the net cost can drop.&lt;/li&gt;
&lt;/ul&gt;
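
&lt;p&gt;To see where that ~40% figure comes from, here's the back-of-the-envelope math in TypeScript, using the list prices above and an assumed short-prompt call shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// List prices in $ per 1M tokens (input / output), from the release.
const GPT_5_4 = { input: 2.5, output: 15 };
const GPT_5_5 = { input: 5.0, output: 30 };

function costUSD(price: { input: number; output: number }, inTok: number, outTok: number): number {
  return (inTok / 1e6) * price.input + (outTok / 1e6) * price.output;
}

// Assumed call shape: 300 input tokens; GPT-5.4 emits 1,000 output tokens,
// and GPT-5.5 really does emit 30% fewer output tokens for the same task.
const oldCost = costUSD(GPT_5_4, 300, 1_000); // ≈ $0.01575
const newCost = costUSD(GPT_5_5, 300, 700);   // ≈ $0.02250

console.log(`+${(((newCost / oldCost) - 1) * 100).toFixed(0)}% per call`); // +43%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;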

&lt;p&gt;The cleanest read is: assume API costs go up unless you measure otherwise. The Decoder's coverage put it plainly — "despite the higher price tag, GPT-5.5 is more efficient and needs fewer tokens for comparable tasks" is marketing language for "you'll pay more per token but possibly less per outcome".&lt;/p&gt;

&lt;h2&gt;
  
  
  What early users are actually reporting
&lt;/h2&gt;

&lt;p&gt;Day-one impressions are a mix of genuine enthusiasm from early-access partners and skepticism from the broader community.&lt;/p&gt;

&lt;p&gt;Positive reports. Several early-access partners highlighted strong long-horizon coding results. Dan Shipper (CEO of Every) credited GPT-5.5 with unusual conceptual clarity, pointing to a refactor it produced that matched the solution one of his senior engineers eventually landed on — and that GPT-5.4 had not been able to find. Pietro Schirano (MagicPath) reported GPT-5.5 merging hundreds of changes from a refactor branch into a main branch that had also moved significantly, in a single ~20-minute pass. Michael Truell at Cursor emphasized persistence, noting the model stays on task materially longer before stopping early.&lt;/p&gt;

&lt;p&gt;These are real signals — long-horizon coding and cross-branch reasoning are exactly where Expert-SWE (OpenAI's internal benchmark) shows a 5-point lift over GPT-5.4.&lt;/p&gt;

&lt;p&gt;Skeptical reports. The top Hacker News thread as of this writing flags the opposite failure mode: one developer reported GPT-5.5 refusing to perform a quick, benign subtask that GLM, Kimi, and MiniMax all completed — and dropping OpenAI as a result. Another recurring complaint in the thread is around model "motivation" — GPT-5.5 and GPT-5.4 both yielding control mid-task or declining work that was explicitly requested. Benchmark fatigue was also visible: commenters pushed back on OpenAI's "strongest and fastest model yet" framing as boilerplate launch language.&lt;/p&gt;

&lt;p&gt;Coding verdict emerging. The pragmatic consensus in the HN thread is to hold off on swapping Claude out for coding work until independent SWE-Bench numbers are published and verified, with several commenters calling out Opus 4.7 as still the strongest option for long-horizon refactors. That's an overstatement given GPT-5.5's Expert-SWE lead, but the underlying point — wait for independent evaluations — is correct. Vals.ai and Scale typically publish third-party numbers within a few weeks of release, and those numbers are what to watch for.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares to the alternatives, task by task
&lt;/h2&gt;

&lt;p&gt;Given the mixed benchmark picture, a single "which model is best" answer doesn't exist. Based on first-day evidence, here's a reasonable task-routing view.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-running terminal agents -&amp;gt; GPT-5.5 (+13pt lead over Opus 4.7 on Terminal-Bench 2.0)&lt;/li&gt;
&lt;li&gt;Real GitHub issue resolution (multi-file patches) -&amp;gt; Opus 4.7 (tentative; nominally 64.3% vs 58.6% on SWE-Bench Pro, but memorization caveats mean this ordering is weak — run your own eval before committing)&lt;/li&gt;
&lt;li&gt;MCP tool-heavy agents -&amp;gt; Opus 4.7 (79.1% vs 75.3% on MCP Atlas)&lt;/li&gt;
&lt;li&gt;Deep research / web-browsing agents -&amp;gt; Gemini 3.1 Pro or GPT-5.4 Pro (both lead BrowseComp; GPT-5.5 Pro closes the gap to 90.1% but the Pro pricing is steep)&lt;/li&gt;
&lt;li&gt;Hard math / theorem work -&amp;gt; GPT-5.5 Pro (39.6% on FrontierMath Tier 4 — no close peer)&lt;/li&gt;
&lt;li&gt;Long-context (512K–1M tokens) -&amp;gt; GPT-5.5 (MRCR 8-needle 512K–1M at 74.0% vs GPT-5.4's 36.6% and Opus 4.7's 32.2%)&lt;/li&gt;
&lt;li&gt;Cost-sensitive coding -&amp;gt; GPT-5.4 or Opus 4.6/4.7 (GPT-5.5 is 2x GPT-5.4's API price)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams running coding agents in production, the least controversial take is: Opus 4.7 is still the safer default for multi-file refactor work — though we'd emphasize that the SWE-Bench Pro lead driving that conclusion is the weakest part of the evidence, given the benchmark's memorization problems. GPT-5.5 is the new default for terminal-heavy agent loops and for the long-context workloads where GPT-5.4 was unreliable. If you're already routing between models, GPT-5.5 slots in without changing the overall shape of your stack — and if you care about multi-file coding specifically, the honest advice is to run a real eval on your own codebase rather than trusting the leaderboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this release actually signals
&lt;/h2&gt;

&lt;p&gt;Two things stand out beyond the benchmarks.&lt;/p&gt;

&lt;p&gt;The pace has not slowed. Opus 4.7 shipped April 16. GPT-5.5 shipped April 23. Anthropic's restricted Claude Mythos Preview is already being referenced in OpenAI's comparison tables. If you're planning infrastructure, assume another frontier model drop within 4–8 weeks, and design so that swapping the model behind your scaffold is cheap.&lt;/p&gt;

&lt;p&gt;Pricing is now a product decision, not just a cost. OpenAI doubling the API rate of its flagship while routing cheaper alternatives (GPT-5.4, GPT-4.1) through the same interface is a conscious segmentation. It matches Anthropic's Opus/Sonnet split and Google's Pro/Ultra split. For most engineering workloads, the right question is no longer "what's the best model?" but "what's the cheapest model that clears my quality bar for this specific task?"&lt;/p&gt;

&lt;p&gt;GPT-5.5 doesn't answer that question for you. But it changes the shape of the answer: on terminal agents and long-context work, it's probably worth the premium. On most other shapes of coding work, Opus 4.7 or GPT-5.4 still wins on price-per-quality. As always, measure before migrating.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] OpenAI, "Introducing GPT-5.5." April 23, 2026. &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;https://openai.com/index/introducing-gpt-5-5/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Hacker News, "GPT 5.5 Released in Codex" discussion thread. April 21–23, 2026. &lt;a href="https://news.ycombinator.com/item?id=47858903" rel="noopener noreferrer"&gt;https://news.ycombinator.com/item?id=47858903&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] The Decoder, "OpenAI unveils GPT-5.5, claims a 'new class of intelligence' at double the API price." April 23, 2026. &lt;a href="https://the-decoder.com/openai-unveils-gpt-5-5-claims-a-new-class-of-intelligence-at-double-the-api-price/" rel="noopener noreferrer"&gt;https://the-decoder.com/openai-unveils-gpt-5-5-claims-a-new-class-of-intelligence-at-double-the-api-price/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Hacker News, "GPT-5.5" discussion thread. April 23, 2026. &lt;a href="https://news.ycombinator.com/item?id=47879092" rel="noopener noreferrer"&gt;https://news.ycombinator.com/item?id=47879092&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] VentureBeat, "OpenAI's GPT-5.5 is here, and it's no potato: narrowly beats Anthropic's Claude Mythos Preview on Terminal-Bench 2.0." April 23, 2026. &lt;a href="https://venturebeat.com/technology/openais-gpt-5-5-is-here-and-its-no-potato-narrowly-beats-anthropics-claude-mythos-preview-on-terminal-bench-2-0" rel="noopener noreferrer"&gt;https://venturebeat.com/technology/openais-gpt-5-5-is-here-and-its-no-potato-narrowly-beats-anthropics-claude-mythos-preview-on-terminal-bench-2-0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] Scale Labs, "SWE-Bench Pro Leaderboard." Retrieved April 23, 2026. &lt;a href="https://labs.scale.com/leaderboard/swe_bench_pro_public" rel="noopener noreferrer"&gt;https://labs.scale.com/leaderboard/swe_bench_pro_public&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>coding</category>
      <category>programming</category>
    </item>
    <item>
      <title>Software Quality Has Never Been More Vulnerable</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:25:06 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/software-quality-has-never-been-more-vulnerable-52ol</link>
      <guid>https://dev.to/mixture-of-experts/software-quality-has-never-been-more-vulnerable-52ol</guid>
      <description>&lt;p&gt;Anthropic published a postmortem recently. The document was specific, technical, self-critical, and honest about what their full pre-release pipeline failed to catch. Three separate issues degraded Claude Code between March 4 and April 20. All three were fixed by v2.1.116 on April 20, and usage limits were reset for every subscriber on April 23.&lt;/p&gt;

&lt;p&gt;But the document is also a mirror. The conditions it describes — continuous change across weights, prompts, scaffolds, and caches; evaluation coverage that trails release velocity; internal dogfooding that drifts from external usage; regressions that hide inside normal output variance for weeks — are not conditions unique to one lab or one product. They are the working conditions of the entire AI-assisted software industry right now.&lt;/p&gt;

&lt;p&gt;We are in the era where AI coding has lifted the ceiling on how fast teams can ship, and we have not yet lifted the ceiling on how fast we can verify what we shipped. Software has never been more vulnerable than it is right now, and the Claude Code postmortem is the clearest public evidence we have of why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the postmortem actually covers
&lt;/h2&gt;

&lt;p&gt;Three issues, stacked.&lt;/p&gt;

&lt;p&gt;A reasoning-effort default change (March 4 – April 7). Claude Code's default reasoning effort was switched from high to medium because high made the UI feel frozen. It was a reasonable tradeoff on paper — lower latency for tasks that didn't need deep reasoning. In practice, users felt the capability drop immediately. The team reverted, and the current defaults are xhigh for Opus 4.7 and high for other models.&lt;/p&gt;

&lt;p&gt;A caching bug that cleared reasoning every turn (March 26 – April 10). A prompt caching optimization for idle sessions shipped with a broken header flag. The clear_thinking_20251015 flag was meant to fire once. It fired every turn. The downstream effect was forgetfulness, repetition, and odd tool choices — exactly the pattern users reported. The issue was masked in internal usage by two unrelated concurrent experiments. It was eventually surfaced by back-testing Claude Code Review with Opus 4.7 against the offending pull request; Opus 4.6 had missed it. The fix shipped April 10.&lt;/p&gt;

&lt;p&gt;A verbosity reduction in the system prompt (April 16 – April 20). The prompt added length limits on text between tool calls and on final responses. It passed weeks of eval runs. Broader ablations during the investigation revealed a flat 3% intelligence drop on both Opus 4.6 and 4.7 — small in isolation, real in aggregate. Reverted April 20.&lt;/p&gt;

&lt;p&gt;The postmortem is explicit that each of these passed "multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding." It is also explicit that users, via /feedback and public posts, were the mechanism that surfaced the problems at the speed they did.&lt;/p&gt;

&lt;p&gt;All of that is in the document. Read it. It's a good document.&lt;/p&gt;

&lt;h2&gt;
  
  
  The conditions the document describes are everyone's conditions now
&lt;/h2&gt;

&lt;p&gt;Here is the part of the postmortem that deserves more attention than the individual bugs.&lt;/p&gt;

&lt;p&gt;Anthropic's summary of why detection took time: "each change affected different traffic segments on different schedules. Early reports in March were difficult to distinguish from normal variation, and neither internal usage nor standard evals initially reproduced the issues."&lt;/p&gt;

&lt;p&gt;That is not a description of a broken process. It is a description of the operating environment that every AI-assisted software product now lives in.&lt;/p&gt;

&lt;p&gt;Consider what shipped under the Claude Code surface in those six weeks — a reasoning effort default, a caching optimization, a system prompt edit. None of those are "model releases" in the traditional sense. They are small, continuous tuning knobs that are part of the product every AI-native team ships. And any one of them can independently introduce a regression that looks, to a user, like "the model got worse."&lt;/p&gt;

&lt;p&gt;Now generalize outward. Every team building on frontier models is continuously tuning prompts, swapping models, adjusting temperature and reasoning effort, reworking tool definitions, rebuilding RAG indices, editing agent scaffolds. Most of those teams have a small fraction of Anthropic's eval infrastructure. Most of them have no /feedback channel, no dedicated developer relations account, no mechanism for surfacing regressions from users in hours.&lt;/p&gt;

&lt;p&gt;The Claude Code postmortem is the rare case where the conditions of AI-native software development were written out in public. The conditions themselves are universal.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI coding raised the ceiling on velocity. Verification didn't follow.
&lt;/h2&gt;

&lt;p&gt;The second thing to sit with is that AI-assisted development has changed how fast software can be produced, and this reshapes the risk profile of everything shipped with it.&lt;/p&gt;

&lt;p&gt;A small team in 2026 can land work that would have taken ten engineers in 2022. Coding agents — Claude Code, Codex, Copilot, Cursor, the rest — produce real, shippable code at a throughput that makes the old cadence look obsolete. Labs use their own agents to accelerate their own pipelines; OpenAI's latest release notes credit Codex with infrastructure optimizations on GPT-5.5 itself. The recursion is explicit. Frontier labs are writing more of their software with AI. Product teams are writing more of their software with AI. The ceiling on how much code gets shipped per week, everywhere, has moved up sharply.&lt;/p&gt;

&lt;p&gt;What has not kept pace is the ability to verify the behavior of AI-powered systems at the same velocity. Traditional CI was built around the assumption that software is deterministic, that a green test suite means something stable, and that regressions are rare because the artifact is frozen between releases. None of those assumptions hold cleanly for LLM-powered products. The artifact is not frozen — it's a live composition of weights, prompts, tools, and retrieval state. The tests don't catch behavioral regressions that fall inside output variance. The regression is rare per change, but the change rate is extreme.&lt;/p&gt;

&lt;p&gt;The gap between "how fast we can ship" and "how fast we can verify what we shipped" is larger today than at any point in the history of the industry. That gap is where vulnerability lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  This isn't a lab problem, it's a paradigm problem
&lt;/h2&gt;

&lt;p&gt;There is a tempting misreading of the postmortem that frames it as a story about Anthropic specifically — their process, their evals, their engineers. That's the wrong frame, and it misses the more important point.&lt;/p&gt;

&lt;p&gt;Every part of the document describes a structural condition that generalizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous tuning is now part of the product surface. The verbosity prompt is the clearest example. A single instruction about output length, inside a system prompt, caused a measurable intelligence drop invisible to standard evals. Every AI product team edits system prompts. Every one of them is one prompt change away from a similar effect.&lt;/li&gt;
&lt;li&gt;Output variance masks real regressions. "Difficult to distinguish from normal variation" is not an Anthropic phrase. It is the default state of every LLM-powered product. Noise is loud, and real drift hides inside it.&lt;/li&gt;
&lt;li&gt;Internal usage drifts from external. Staff at any AI lab, and at most AI-native product companies, run builds that are subtly different from what users see — early access to models, experimental flags, different rollout cohorts. "Dogfooding" as a guarantee gets weaker the further internal diverges from external.&lt;/li&gt;
&lt;li&gt;Users are now part of the evaluation loop. Not by choice, and not unique to Claude Code. The fastest regression-detection mechanism for almost every AI product in 2026 is a user noticing something felt off and having a channel to say so.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is an indictment. It's a description. The question worth asking isn't "how did this happen at Anthropic." It's "given these are the conditions everywhere, what should responsible AI-assisted shipping look like?"&lt;/p&gt;

&lt;h2&gt;
  
  
  What this should change about how we ship
&lt;/h2&gt;

&lt;p&gt;A few honest consequences if the framing above is right.&lt;/p&gt;

&lt;p&gt;Treat prompt edits as model edits. The verbosity incident is the proof point. A prompt change is a capability change. If you'd gate a model swap behind a full eval suite, gate a prompt edit the same way. The per-model ablations Anthropic is committing to running on every system prompt change are a good template.&lt;/p&gt;
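
&lt;p&gt;A minimal sketch of what that gate can look like in CI, assuming an invented file layout (prompts/system.md for the prompt, evals/last-run.json written by your eval suite):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createHash } from "node:crypto";
import { readFileSync, existsSync } from "node:fs";

// Hash the current system prompt.
const promptHash = createHash("sha256")
  .update(readFileSync("prompts/system.md"))
  .digest("hex");

// evals/last-run.json records the prompt hash the eval suite last ran against.
const lastRun = existsSync("evals/last-run.json")
  ? JSON.parse(readFileSync("evals/last-run.json", "utf8"))
  : null;

// Fail the build when the prompt changed without a fresh eval run, exactly
// as you would for a model swap.
if (!lastRun || lastRun.promptHash !== promptHash) {
  console.error("System prompt changed without an eval run. Blocking merge.");
  process.exit(1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;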

&lt;p&gt;Budget soak periods into release cadence. Among Anthropic's corrective actions: "soak periods, broader evaluation suites, and gradual rollouts to catch issues earlier." The implicit admission is that the previous cadence didn't leave room for them. Most AI product teams are in the same position, and many have far less room than Anthropic did.&lt;/p&gt;

&lt;p&gt;Close the internal-to-external build gap, however you can. This is hard. Staff getting early access to new models is how labs and product teams move fast. But the further internal builds drift from external, the less your dogfooding tells you. One commitment from the postmortem worth copying: have the people who ship the software actually use the shipped software, in the same configuration users see.&lt;/p&gt;

&lt;p&gt;Build a real user feedback path before you need one. A /feedback command, a dedicated community channel, a developer relations account that actually reads what's posted — the Claude Code postmortem makes clear that these are not nice-to-haves. They are the primary mechanism by which real-world regressions get caught in hours instead of weeks. Most AI products don't have this. The ones that will survive the next cycle of release velocity will.&lt;/p&gt;

&lt;p&gt;For users: keep your own evals. If you're running AI-assisted work that matters, do not rely on any provider's internal quality bar to hold steady through continuous silent changes. Keep a small suite of your own tasks that you re-run periodically. You don't need much — a handful of representative prompts that produce outputs you can compare over time is enough to notice drift early.&lt;/p&gt;
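
&lt;p&gt;What that can look like in practice: a minimal drift-check harness, sketched in TypeScript. The runModel function is a stand-in for whatever provider call you actually use, and the tasks.json / runs/ layout is invented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { readFileSync, writeFileSync, existsSync } from "node:fs";

// Stand-in for your real provider call (SDK, HTTP, or CLI). Run it with the
// same settings your users get.
declare function runModel(prompt: string): Promise&amp;lt;string&amp;gt;;

interface EvalTask { id: string; prompt: string; }

// tasks.json: a handful of representative prompts from your real workload.
const tasks: EvalTask[] = JSON.parse(readFileSync("tasks.json", "utf8"));

async function snapshot(): Promise&amp;lt;void&amp;gt; {
  const results: Record&amp;lt;string, string&amp;gt; = {};
  for (const t of tasks) results[t.id] = await runModel(t.prompt);

  // Keep every run, timestamped, so outputs can be compared over time.
  const stamp = new Date().toISOString().slice(0, 10);
  writeFileSync(`runs/${stamp}.json`, JSON.stringify(results, null, 2));

  // Crude drift signal: flag any task whose output changed vs the baseline.
  // You're looking for drift to eyeball, not exact-match pass/fail.
  if (existsSync("runs/baseline.json")) {
    const base = JSON.parse(readFileSync("runs/baseline.json", "utf8"));
    for (const t of tasks) {
      if (base[t.id] !== results[t.id]) console.log(`changed: ${t.id}`);
    }
  }
}

snapshot();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;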

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The Claude Code postmortem is a good document from a team that did the right thing in publishing it. The story it tells is not about one lab or one product. It's about the working conditions of AI-assisted software development in 2026 — conditions under which everyone is shipping faster than anyone can verify, and real regressions routinely hide inside output variance for weeks before users surface them.&lt;/p&gt;

&lt;p&gt;Software has never been more vulnerable than it is right now. Not because anyone is being careless. Because the ceiling on velocity moved up sharply and the ceiling on verification didn't follow.&lt;/p&gt;

&lt;p&gt;The labs are aware. Anthropic just wrote out in public what the condition looks like. The rest of the industry should read the document as a mirror, not a scoreboard, and act accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Anthropic Engineering. "Claude Code quality issues: postmortem summary." April 23, 2026. &lt;a href="https://www.anthropic.com/engineering/april-23-postmortem" rel="noopener noreferrer"&gt;https://www.anthropic.com/engineering/april-23-postmortem&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>programming</category>
      <category>testing</category>
    </item>
    <item>
      <title>DeepSeek V4: What's Inside, How It Compares, and Where It Actually Wins</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:19:04 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/deepseek-v4-whats-inside-how-it-compares-and-where-it-actually-wins-5ba6</link>
      <guid>https://dev.to/mixture-of-experts/deepseek-v4-whats-inside-how-it-compares-and-where-it-actually-wins-5ba6</guid>
      <description>&lt;p&gt;DeepSeek V4 shipped on April 24, 2026 — four days after Moonshot's Kimi K2.6, one day after OpenAI's GPT-5.5. Two MIT-licensed models, both 1M-context: V4-Pro at 1.6T total / 49B active, and V4-Flash at 284B / 13B active.&lt;/p&gt;

&lt;p&gt;The headline number is the price: $3.48 per million output tokens for V4-Pro vs $25 for Claude Opus 4.7 and $30 for GPT-5.5. (DeepSeek is also running a launch promo at 75% off — $0.87/M output — through May 5, 2026, which widens the gap further during the evaluation window.) That's a 7-9x gap at the standard rate, against a model that's within ~5-7 points of the closed frontier on most coding benchmarks. That gap is large enough to make many teams reconsider their model routing decisions.&lt;/p&gt;

&lt;p&gt;But price isn't the complete picture. V4 performs well on some workloads and poorly on others, and integration is more difficult than the marketing suggests. Here's the assessment, engineer reports, and what's new under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where V4 actually wins (and where it doesn't)
&lt;/h2&gt;

&lt;p&gt;Three frontier-class models shipped in nine days, and no single model dominates. The ranking flips depending on the workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-world software engineering (PRs, refactors, multi-repo bug fixes): Opus 4.7 leads on independent evaluations that require reasoning across many files — Vals AI's Vibe Code Benchmark, the Aider Polyglot suite, and contamination-resistant tests like LiveCodeBench. It's the right pick when changes are multi-file and planning is the difficult part. You can run Opus end-to-end (plan + edits), or split the workflow: Opus writes the plan, GPT-5.5 on low or medium reasoning executes the file edits against it. The split is often the better cost-quality tradeoff.&lt;/li&gt;
&lt;li&gt;Terminal / agentic shell: GPT-5.5 leads at 82.7% on Terminal-Bench 2.0, ~15 points ahead of V4-Pro. These workloads involve many small tool calls and shell-output error recovery, and V4 hasn't been RL-trained on them at the same depth.&lt;/li&gt;
&lt;li&gt;Long-horizon autonomous execution (12+ hour runs): Kimi K2.6 is the open-source choice, with its Claw Groups multi-agent coordination and demonstrated runs across 4,000+ tool calls.&lt;/li&gt;
&lt;li&gt;Whole-repo reasoning (hundreds of files, &amp;gt;200K tokens): V4-Pro's 1M context is the only frontier option that's economical to use at full length — its architecture cuts inference cost to roughly a quarter of V3.2's at 1M context. More on why below. The natural fit is the discovery phase of a task: load the whole repo and use V4-Pro for deep research, search, and understanding how a codebase fits together — the analysis pass that feeds into a plan, which you then hand to Opus or GPT-5.5 to execute.&lt;/li&gt;
&lt;li&gt;Cost-per-task at scale: V4-Flash at $0.28/M output is 90-107x cheaper than the closed frontier. Tencent Hy3-preview at ~$0.55/M is in a similar range. For batch and overnight workloads, neither closed model is competitive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One additional entry worth noting: Tencent Hy3-preview is not competing for the largest open coding model. It's a 21B-active model optimized for cost-per-step in real product traffic, with stable agent runs of up to 495 steps in production powering CodeBuddy and WorkBuddy. If you're building product-embedded agents on tight budgets rather than optimizing for benchmark scores, it's worth evaluating. Tencent is direct about the tradeoffs: the release notes describe "weak error recovery capabilities when calling the tool and sensitivity to inference hyperparameters."&lt;/p&gt;

&lt;p&gt;The benchmarks:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;DeepSeek V4-Pro&lt;/th&gt;&lt;th&gt;Claude Opus 4.7&lt;/th&gt;&lt;th&gt;GPT-5.5&lt;/th&gt;&lt;th&gt;Kimi K2.6&lt;/th&gt;&lt;th&gt;Qwen3.6-27B&lt;/th&gt;&lt;th&gt;Tencent Hy3-preview&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Total / Active params&lt;/td&gt;&lt;td&gt;1.6T / 49B&lt;/td&gt;&lt;td&gt;undisclosed&lt;/td&gt;&lt;td&gt;undisclosed&lt;/td&gt;&lt;td&gt;1T / 32B&lt;/td&gt;&lt;td&gt;27B dense&lt;/td&gt;&lt;td&gt;295B / 21B&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Context window&lt;/td&gt;&lt;td&gt;1M&lt;/td&gt;&lt;td&gt;1M&lt;/td&gt;&lt;td&gt;1M&lt;/td&gt;&lt;td&gt;256K&lt;/td&gt;&lt;td&gt;256K (1M YaRN)&lt;/td&gt;&lt;td&gt;256K&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;SWE-Bench Verified&lt;/td&gt;&lt;td&gt;80.6%&lt;/td&gt;&lt;td&gt;87.6%&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;80.2%&lt;/td&gt;&lt;td&gt;77.2%&lt;/td&gt;&lt;td&gt;74.4%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;SWE-Bench Pro&lt;/td&gt;&lt;td&gt;55.4%&lt;/td&gt;&lt;td&gt;64.3%&lt;/td&gt;&lt;td&gt;58.6%&lt;/td&gt;&lt;td&gt;58.6%&lt;/td&gt;&lt;td&gt;53.5%&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;&lt;td&gt;67.9%&lt;/td&gt;&lt;td&gt;69.4%&lt;/td&gt;&lt;td&gt;82.7%&lt;/td&gt;&lt;td&gt;66.7%&lt;/td&gt;&lt;td&gt;59.3%&lt;/td&gt;&lt;td&gt;54.4%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LiveCodeBench&lt;/td&gt;&lt;td&gt;93.5%&lt;/td&gt;&lt;td&gt;84.69%&lt;/td&gt;&lt;td&gt;85.30%&lt;/td&gt;&lt;td&gt;89.6%&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;BrowseComp&lt;/td&gt;&lt;td&gt;83.4%&lt;/td&gt;&lt;td&gt;79.3%&lt;/td&gt;&lt;td&gt;84.4%&lt;/td&gt;&lt;td&gt;83.2%&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;67.1%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Output price ($/M tokens)&lt;/td&gt;&lt;td&gt;$3.48&lt;/td&gt;&lt;td&gt;$25&lt;/td&gt;&lt;td&gt;$30&lt;/td&gt;&lt;td&gt;~$2.50&lt;/td&gt;&lt;td&gt;~$1.56&lt;/td&gt;&lt;td&gt;~$0.55&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;License&lt;/td&gt;&lt;td&gt;MIT (open weights)&lt;/td&gt;&lt;td&gt;Closed API&lt;/td&gt;&lt;td&gt;Closed API&lt;/td&gt;&lt;td&gt;Modified MIT&lt;/td&gt;&lt;td&gt;Apache 2.0&lt;/td&gt;&lt;td&gt;Open weights&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Sources: DeepSeek, VentureBeat, BenchLM, AkitaOnRails, Latent Space, Tencent, Qwen.&lt;/p&gt;

&lt;h2&gt;
  
  
  What engineers actually report
&lt;/h2&gt;

&lt;p&gt;The more useful signal comes from independent reviewers running real tasks, and the picture from the first 72 hours is mixed in informative ways.&lt;/p&gt;

&lt;p&gt;On the positive side, AkitaOnRails ran his RubyLLM benchmark — the same chat-app-against-a-specific-Ruby-library task he's been using to track open-source coding models — and observed V4 move from hallucinating API methods in V3.2 to writing code that compiled and ran on the first try, with Pro producing essentially reference-quality output. Vals AI observed the same pattern on their Vibe Code Benchmark (&lt;a href="https://www.vals.ai/benchmarks/vibe-code" rel="noopener noreferrer"&gt;https://www.vals.ai/benchmarks/vibe-code&lt;/a&gt;), where V4 improved roughly 10x from V3.2 and now leads open-source. DeepSeek's own team, in the release notes, is measured about positioning: V4 is their internal default now, better than Sonnet 4.5 and close to Opus 4.6 in non-thinking mode — but they explicitly stop short of claiming parity with Opus 4.7. The vendor's own framing is more accurate than much of the launch coverage.&lt;/p&gt;

&lt;p&gt;The negative reports cluster on integration. AkitaOnRails couldn't run V4-Pro through OpenCode at all — it kept failing on the thinking-mode handshake — and his broader assessment of DeepSeek launches reflects a consistent pattern: marketing ships earlier than working tool support, the community spends a few weeks reverse-engineering the protocol, and gaps in open-source harnesses tend to persist.&lt;/p&gt;

&lt;p&gt;Cursor's forum is showing similar issues, with open threads reporting V4's context capped at 200K with reasoning_content errors after tool calls (&lt;a href="https://forum.cursor.com/t/deepseek-v4-context-limited-to-200k-reasoning-content-error/159045" rel="noopener noreferrer"&gt;https://forum.cursor.com/t/deepseek-v4-context-limited-to-200k-reasoning-content-error/159045&lt;/a&gt;) and an open feature request for proper reasoning_content compatibility (&lt;a href="https://forum.cursor.com/t/compatibility-with-deepseek-models-design-to-return-reasoning-content-after-tool-calls/158905" rel="noopener noreferrer"&gt;https://forum.cursor.com/t/compatibility-with-deepseek-models-design-to-return-reasoning-content-after-tool-calls/158905&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Local-inference users are also waiting — no community GGUF at launch, llama.cpp support days out, MLX on Apple Silicon trailing by a similar margin. vLLM works on the native FP4/FP8 checkpoints out of the box, but the hardware floor is one H200 141GB or two A100 80GBs for Flash, and four A100s or two H200s to use the full 1M context.&lt;/p&gt;

&lt;p&gt;A useful counterpoint came from Chew Loong Nian, who tested all four V4 tiers across 20 real tasks instead of leaderboard prompts. V4-Pro-Max didn't dominate. Flash won 7 outright at $0.14 per million input tokens, mostly on shorter tasks where the price-quality tradeoff favored it. Pro-Max only pulled clearly ahead when the workload genuinely required it: on three long-context retrieval tasks loading 800K tokens of a real GitHub repo and asking for a function's call graph, Pro hit 3/3 while Flash hit 1/3. That points to the right mental model: V4 is two models with different optimization points, and Pro earns its premium when context is large.&lt;/p&gt;

&lt;p&gt;The practical takeaway: budget for integration work, not just inference. The thinking-mode protocol is non-trivial, OpenCode and Claude Code adapters aren't all working cleanly at launch, and you'll likely maintain your own patches for several weeks. Run V4 in shadow before deploying it to customers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why V4 performs well where it does
&lt;/h2&gt;

&lt;p&gt;Two design choices explain most of V4's profile.&lt;/p&gt;

&lt;p&gt;It only activates 49B of its 1.6T parameters per token. That's the mixture-of-experts approach — only the experts relevant to the current token activate. Combined with running natively in 4-bit weights at inference (real FP4, not simulated quantization), this is how a 1.6T model fits within deployable economics. It's also why V4-Flash exists at 13B active: the same approach scaled further down. The cost gap to closed models comes from MoE plus FP4 plus training-efficiency improvements.&lt;/p&gt;
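
&lt;p&gt;The gating step is simple to illustrate. In this toy sketch (expert count and scores are invented; real MoE routing runs per layer over learned gate logits), we score the experts for a token and keep only the top k:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Toy top-k expert gating: score every expert for the current token, run a
// forward pass through only the k best. 49B-active-of-1.6T is this idea at
// scale; the numbers below are purely illustrative.
function topKExperts(gateScores: number[], k: number): number[] {
  return gateScores
    .map((score, idx) =&amp;gt; ({ idx, score }))
    .sort((a, b) =&amp;gt; b.score - a.score)
    .slice(0, k)
    .map((e) =&amp;gt; e.idx);
}

// 8 experts, activate 2 per token: only experts 3 and 1 run.
console.log(topKExperts([0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.6, 0.15], 2)); // [3, 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;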

&lt;p&gt;It doesn't process the full million-token context. Instead, V4 summarizes long context into compressed blocks and learns which blocks to attend to for a given query. The result is concrete: at 1M context, V4-Pro uses 27% of V3.2's compute and 10% of its memory. That's what makes 1M context economically viable to serve, and why whole-repo reasoning is V4's primary workload.&lt;/p&gt;
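
&lt;p&gt;A toy version of that idea (not DeepSeek's actual mechanism, just its shape): mean-pool fixed-size blocks into summary vectors, score each summary against the query, and attend only into the top-scoring blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Toy block-compressed attention selection. Block size, k, and mean-pooling
// are illustrative stand-ins for the learned compression in the real model.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) =&amp;gt; sum + x * b[i], 0);
}

function meanPool(block: number[][]): number[] {
  const out = new Array(block[0].length).fill(0);
  for (const vec of block)
    for (let i = 0; i &amp;lt; vec.length; i++) out[i] += vec[i] / block.length;
  return out;
}

function selectBlocks(query: number[], tokens: number[][], blockSize: number, k: number): number[][] {
  const blocks: number[][][] = [];
  for (let i = 0; i &amp;lt; tokens.length; i += blockSize)
    blocks.push(tokens.slice(i, i + blockSize));

  // Score each compressed block against the query and keep the top k, so
  // full attention runs over k * blockSize tokens instead of all of them.
  return blocks
    .map((b, i) =&amp;gt; ({ i, score: dot(query, meanPool(b)) }))
    .sort((x, y) =&amp;gt; y.score - x.score)
    .slice(0, k)
    .flatMap((s) =&amp;gt; blocks[s.i]);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;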

&lt;p&gt;The tradeoff: the same compression is why V4 underperforms on terminal/agentic shell tasks. Those workloads are short-context, high-frequency tool calls — there's no million tokens to summarize, and the architectural advantage disappears. V4's weakness there isn't an architectural flaw so much as a training gap: GPT-5.5 has been RL-trained on shell sessions much more heavily, and at short context that training depth is what matters most.&lt;/p&gt;

&lt;p&gt;The technical report has more — including novel work on residual connections that other labs will likely adopt within two release cycles — but for routing decisions, the three points above are the most important.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to watch next
&lt;/h2&gt;

&lt;p&gt;Three things will determine whether V4 becomes a production default, or remains a release that performs well on benchmarks but is difficult to integrate, like several DeepSeek launches before it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool ecosystem catch-up (2-3 weeks). OpenCode, Cursor, Claude Code, Cline, and the long tail of agent harnesses need clean thinking-mode and reasoning_content support. The Cursor forum threads are the leading indicator; if they resolve within a few weeks, V4-Pro becomes a viable production option. If integration drags into May, the practical adoption ceiling stays low.&lt;/li&gt;
&lt;li&gt;The Birkhoff-constrained transformer in other labs. mHC, the residual-connection scheme from the technical report mentioned above, is the architectural idea most likely to spread. Watch Llama 5, Qwen 4, and Mistral's next foundation model for residual-connection changes that reference it.&lt;/li&gt;
&lt;li&gt;Closed-frontier pricing response. With V4-Pro at one-seventh the price of Opus 4.7 and GPT-5.5 at near-comparable coding numbers, sustained pressure on closed-API pricing is the most likely industry move. The question is whether Anthropic and OpenAI hold premium pricing on differentiated workloads (real-world SWE for Anthropic, terminal/agentic for OpenAI) or take broader cuts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The broader context: six months ago, the best open-weights coding model trailed the closed frontier by 15-20 points on SWE-Bench. Today, three open models — DeepSeek V4-Pro, Kimi K2.6, GLM-5 — sit within ~7 points of Claude Opus 4.7. Chinese labs alone have shipped a coding-focused checkpoint roughly every week for the past three months. The open vs closed framing is no longer the most useful one. The more useful framing is which model fits which workload, at what cost, with which reliability profile. V4 changes the answer for several of those workloads. The rest is integration work.&lt;/p&gt;

&lt;p&gt;V3 to V4 is roughly the same step V2 to V3 was, on a similar release cadence. What's different this time is the timing: it arrives at the point where the open frontier has caught up enough to make multi-model routing — V4-Flash for cheap calls, V4-Pro for long-context, Opus 4.7 or GPT-5.5 for the critical path — the default architecture. The teams that establish routing and evaluation infrastructure first will gain a larger advantage than any single model choice confers.&lt;/p&gt;
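
&lt;p&gt;What that routing layer can look like: a minimal sketch with invented thresholds, using the model mix discussed in this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Model = "v4-flash" | "v4-pro" | "opus-4.7" | "gpt-5.5";

interface Task {
  contextTokens: number;
  kind: "terminal" | "multi-file-edit" | "repo-analysis" | "batch";
  criticalPath: boolean; // does this output ship without further review?
}

// Invented thresholds: the point is the shape of the router, not the numbers.
function route(t: Task): Model {
  if (t.kind === "batch") return "v4-flash";        // cost-per-task at scale
  if (t.contextTokens &amp;gt; 200_000) return "v4-pro"; // whole-repo reasoning
  if (t.kind === "terminal") return "gpt-5.5";      // Terminal-Bench 2.0 lead
  if (t.criticalPath) return "opus-4.7";            // multi-file critical path
  return "v4-flash";                                // default to cheap
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;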

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] DeepSeek, "DeepSeek V4 Preview Release." DeepSeek API Docs, April 24, 2026. &lt;a href="https://api-docs.deepseek.com/news/news260424" rel="noopener noreferrer"&gt;https://api-docs.deepseek.com/news/news260424&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] MarkTechPost, "DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts." April 24, 2026. &lt;a href="https://www.marktechpost.com/2026/04/24/deepseek-ai-releases-deepseek-v4-compressed-sparse-attention-and-heavily-compressed-attention-enable-one-million-token-contexts/" rel="noopener noreferrer"&gt;https://www.marktechpost.com/2026/04/24/deepseek-ai-releases-deepseek-v4-compressed-sparse-attention-and-heavily-compressed-attention-enable-one-million-token-contexts/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] AkitaOnRails, "LLM Coding Benchmark (April 2026): GPT 5.5, DeepSeek v4, Kimi v2.6, MiMo, and the State of the Art." April 24, 2026. &lt;a href="https://akitaonrails.com/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/" rel="noopener noreferrer"&gt;https://akitaonrails.com/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Chew Loong Nian, "I Tested All 4 DeepSeek V4 Modes on 20 Real Tasks — The $0.04 Flash Won 7 of Them." Towards AI on Medium, April 2026. &lt;a href="https://medium.com/@chewloongnian/i-tested-all-4-deepseek-v4-modes-on-20-real-tasks-the-0-04-flash-won-7-of-them-0ef0fb5c1771" rel="noopener noreferrer"&gt;https://medium.com/@chewloongnian/i-tested-all-4-deepseek-v4-modes-on-20-real-tasks-the-0-04-flash-won-7-of-them-0ef0fb5c1771&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] VentureBeat, "DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5." April 24, 2026. &lt;a href="https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5" rel="noopener noreferrer"&gt;https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] BenchLM, "Best Chinese LLMs in 2026: DeepSeek V4, Kimi 2.6, GLM-5, Qwen, and Every Model Ranked." April 2026. &lt;a href="https://benchlm.ai/blog/posts/best-chinese-llm" rel="noopener noreferrer"&gt;https://benchlm.ai/blog/posts/best-chinese-llm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[7] Latent Space, "Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6." April 20, 2026. &lt;a href="https://www.latent.space/p/ainews-moonshot-kimi-k26-the-worlds" rel="noopener noreferrer"&gt;https://www.latent.space/p/ainews-moonshot-kimi-k26-the-worlds&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[8] vLLM Blog, "DeepSeek V4 in vLLM: Efficient Long-context Attention." April 2026. &lt;a href="https://vllm.ai/blog/deepseek-v4" rel="noopener noreferrer"&gt;https://vllm.ai/blog/deepseek-v4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[9] DeepSeek-V4-Pro on Hugging Face. &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[10] Hacker News discussion: "DeepSeek v4." April 24, 2026. &lt;a href="https://news.ycombinator.com/item?id=47884971" rel="noopener noreferrer"&gt;https://news.ycombinator.com/item?id=47884971&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[11] Tencent, "Tencent Unveils Hy3 preview; Model Enhances Agent Capabilities and Real-World Usability." April 23, 2026. &lt;a href="https://www.tencent.com/en-us/articles/2202320.html" rel="noopener noreferrer"&gt;https://www.tencent.com/en-us/articles/2202320.html&lt;/a&gt;. Model weights: tencent/Hy3-preview on Hugging Face — &lt;a href="https://huggingface.co/tencent/Hy3-preview" rel="noopener noreferrer"&gt;https://huggingface.co/tencent/Hy3-preview&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[12] Cursor Community Forum, "DeepSeek V4: context limited to 200K + reasoning_content error." April 2026. &lt;a href="https://forum.cursor.com/t/deepseek-v4-context-limited-to-200k-reasoning-content-error/159045" rel="noopener noreferrer"&gt;https://forum.cursor.com/t/deepseek-v4-context-limited-to-200k-reasoning-content-error/159045&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[13] Cursor Community Forum, "Compatibility with DeepSeek models design to return reasoning_content after tool calls." April 2026. &lt;a href="https://forum.cursor.com/t/compatibility-with-deepseek-models-design-to-return-reasoning-content-after-tool-calls/158905" rel="noopener noreferrer"&gt;https://forum.cursor.com/t/compatibility-with-deepseek-models-design-to-return-reasoning-content-after-tool-calls/158905&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[14] Qwen Team, "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model." April 22, 2026. &lt;a href="https://qwen.ai/blog?id=qwen3.6-27b" rel="noopener noreferrer"&gt;https://qwen.ai/blog?id=qwen3.6-27b&lt;/a&gt;. Model weights: Qwen/Qwen3.6-27B on Hugging Face — &lt;a href="https://huggingface.co/Qwen/Qwen3.6-27B" rel="noopener noreferrer"&gt;https://huggingface.co/Qwen/Qwen3.6-27B&lt;/a&gt;. Pricing via OpenRouter — &lt;a href="https://openrouter.ai/qwen/qwen3.6-27b" rel="noopener noreferrer"&gt;https://openrouter.ai/qwen/qwen3.6-27b&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>deepseek</category>
    </item>
    <item>
      <title>The Coding Benchmark We Actually Need</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 00:59:12 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/the-coding-benchmark-we-actually-need-m2m</link>
      <guid>https://dev.to/mixture-of-experts/the-coding-benchmark-we-actually-need-m2m</guid>
      <description>&lt;p&gt;The benchmarks worth caring about measure something a customer would pay for. “Can this agent ship a product that generates revenue” is the question worth asking. “Can this agent reproduce SQLite from memory under adversarial constraints” is not.&lt;/p&gt;

&lt;p&gt;That’s the lens for evaluating coding agents going forward, and ProgramBench[1] is a useful place to ground it: the benchmark gets one key thing right, and the rest of its design deserves scrutiny. The setup: hand a coding agent a compiled binary, the user-facing docs, and a sandbox. Rebuild the program from scratch. Pass all the behavioral tests. No web access. No objdump, strings, or hexdump. No source. Across 200 tasks and 248,000 behavioral tests, every frontier model scored 0% fully resolved[1]. The tasks range from jq on the small end to SQLite, PHP, and FFmpeg on the large end. Claude Opus 4.7 leads the “almost resolved” column at 3.0%. GPT-5.4, Gemini 3.1 Pro, and Haiku 4.5 all sit at 0/0 (zero in both columns).&lt;/p&gt;
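
&lt;p&gt;To make the two score columns concrete, here is a toy grader in Python. The per-task pass counts and the 0.95 cutoff for “almost resolved” are assumptions for illustration; the benchmark defines its own thresholds.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy ProgramBench-style scorer. The 0.95 "almost resolved" cutoff
# is an assumed illustration, not the paper's definition.

def score(results: dict[str, tuple[int, int]]) -&gt; dict[str, float]:
    """results maps task id to (tests_passed, tests_total)."""
    n = len(results)
    fully = sum(1 for p, t in results.values() if p == t)
    almost = sum(1 for p, t in results.values()
                 if p != t and p / t &gt;= 0.95)
    return {"fully_resolved": fully / n, "almost_resolved": almost / n}

# An agent that nails jq but stalls partway through SQLite:
print(score({"jq": (1240, 1240), "sqlite": (31002, 48000)}))
# {'fully_resolved': 0.5, 'almost_resolved': 0.0}
&lt;/code&gt;&lt;/pre&gt;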

&lt;p&gt;The framing is that this is a hard reverse-engineering test, but what it actually measures is memorization, and that’s the wrong thing to be testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ProgramBench measures memorization, not capability
&lt;/h2&gt;

&lt;p&gt;Real reverse-engineering looks like the workflow any dev uses to rebuild something they don’t fully understand: poking at the product to see how it behaves, reading the docs, searching the web for similar projects, pulling up reference implementations and design-system examples, searching for half-remembered error strings, and reading the upstream changelog to figure out why a behavior changed. ProgramBench’s rules forbid all of that. The agent gets a binary it can execute and a manual it can read. That’s it.&lt;/p&gt;

&lt;p&gt;Strip those tools out and what’s left is: produce, from training data alone, a clean-room implementation of FFmpeg that matches the reference on a quarter-million tests. The model is recalling whether it saw enough of the original codebase during pretraining to reconstruct it, when what we actually want to know is whether it can reason about the binary.&lt;/p&gt;

&lt;p&gt;Doing well on this would tell us the model memorized the training set, which isn’t what we’re trying to measure. Doing poorly tells us only that current frontier models can’t perfectly memorize SQLite, which we already knew.&lt;/p&gt;

&lt;p&gt;The benchmark authors will say that’s the point: forbid the obvious tools so the model can’t cheat. But “cheating” here means “using the workflow that real engineers use.” The constraint makes the test cleaner to grade, but it stops the benchmark from measuring anything a customer would pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one part worth keeping: free-form implementation
&lt;/h2&gt;

&lt;p&gt;ProgramBench does get one thing right, and it’s worth calling out because it’s the part worth carrying forward into a better benchmark. The input format. No method signatures to fill in. No class skeletons. No PRD. No natural-language description of the intended file layout. Just: here’s the binary, here’s the manual, build the thing.&lt;/p&gt;

&lt;p&gt;That matters. Most coding benchmarks rely on partial structure to make grading tractable. SWE-Bench[2] hands you a repo plus a failing test. HumanEval gives you a docstring and a function signature. Even the harder agent benchmarks pass in a problem statement that a human has already broken down. ProgramBench is the rare benchmark that forces the model to architect from zero.&lt;/p&gt;
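
&lt;p&gt;To see how much architecture the structured formats give away, consider what a HumanEval-style item looks like (paraphrased here): the signature and docstring fix the interface, and the model only fills in the body.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A HumanEval-style prompt ends at the docstring; everything an
# architect would decide is already decided. (Paraphrased item.)

def has_close_elements(numbers: list[float], threshold: float) -&gt; bool:
    """Return True if any two numbers are closer than threshold."""
    # The model's entire job is the body below.
    return any(threshold &gt; abs(a - b)
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])
&lt;/code&gt;&lt;/pre&gt;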

&lt;p&gt;The free-form input is the right idea. The rest of the design isn’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  A proposal: free-form input, real outcomes, real tools
&lt;/h2&gt;

&lt;p&gt;Here’s the redesign. Keep ProgramBench’s free-form input. Drop the no-tools rule. Replace test pass rates with a metric a customer would actually pay for.&lt;/p&gt;

&lt;p&gt;Take Vending-Bench 2[3]: a year-long simulation where the agent runs a vending machine business starting with $500, negotiates suppliers, manages inventory, and gets scored on the bank balance at year-end. Andon Labs explicitly designed it to measure long-horizon coherence, the failure mode where agents drift, forget, or go bankrupt over thousands of tool calls.&lt;/p&gt;

&lt;p&gt;Now hybridize Vending-Bench’s outcome-based scoring with ProgramBench’s free-form input and SWE-Bench’s real-world software framing. Drop the agent into an empty repo. Give it a market hypothesis and the tools real engineers use, including the web, package managers, debuggers, and the works. Let it ship a SaaS app. Score it on generated ARR after 90 days of simulated operation, with a synthetic customer pool that buys, churns, and files support tickets against whatever the agent builds.&lt;/p&gt;
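
&lt;p&gt;A toy version of the scoring loop, to show the shape of the idea. Every number here (pool size, conversion and churn odds, price) is invented for illustration; a real harness would drive synthetic customers against the agent’s actual deployed app.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy outcome-based scorer: simulate 90 days of a synthetic customer
# pool and report annualized run-rate revenue. All rates are made up.
import random

def simulated_arr(product_quality: float, days: int = 90,
                  pool: int = 1000, monthly_price: float = 29.0) -&gt; float:
    rng = random.Random(0)                   # reproducible runs
    subscribers = 0
    for _ in range(days):
        # better products convert more of the pool and churn less
        signups = sum(0.01 * product_quality &gt; rng.random()
                      for _ in range(pool))
        churned = sum(0.02 * (1 - product_quality) &gt; rng.random()
                      for _ in range(subscribers))
        subscribers += signups - churned
    return subscribers * monthly_price * 12  # ARR at day 90

print(f"score: ${simulated_arr(product_quality=0.7):,.0f} ARR")
&lt;/code&gt;&lt;/pre&gt;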

&lt;p&gt;That benchmark would test what coding agents are actually for: building things that work, in a real environment, with the tools real engineers use, against an outcome a customer would pay for. Memorization helps a little. Architecture, debugging, customer empathy, and long-horizon execution help a lot more. And critically, the score moves with the thing we actually want, real economic value generated, not with how much of the target codebase happened to be in the model’s pretraining data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the 0% actually tells us
&lt;/h2&gt;

&lt;p&gt;ProgramBench’s headline number is a benchmark design choice. Forbid web access, forbid decompilation, forbid source, and you’ve forbidden the workflow. The remaining test measures recall under adversarial constraints, which is interesting research but not a useful signal for production routing decisions, and not a measure of value any customer would pay for.&lt;/p&gt;

&lt;p&gt;Run a coding agent in the environment it’s actually deployed in. Score it on outcomes a customer cares about. The benchmarks that survive the next two years will look more like Vending-Bench than ProgramBench. They will be long-horizon, tool-rich, free-form on the input side, and graded on revenue rather than test pass rates.&lt;/p&gt;

&lt;p&gt;The free-form input idea is worth keeping. Combine it with outcome-based scoring and you have the benchmark we actually need.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] ProgramBench, “Rebuilding programs from scratch: a benchmark for coding agents.” 2026.&lt;/p&gt;

&lt;p&gt;[2] SWE-Bench, “Can Language Models Resolve Real-World GitHub Issues?”&lt;/p&gt;

&lt;p&gt;[3] Andon Labs, “Vending-Bench 2: Long-horizon agent coherence over a one-year simulated business.”&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
