<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mixture of Experts</title>
    <description>The latest articles on DEV Community by Mixture of Experts (@mixture-of-experts).</description>
    <link>https://dev.to/mixture-of-experts</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916862%2F03b63012-0632-4c84-b324-269b51e29ad6.jpg</url>
      <title>DEV Community: Mixture of Experts</title>
      <link>https://dev.to/mixture-of-experts</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mixture-of-experts"/>
    <language>en</language>
    <item>
      <title>Three Things I Learned Using Coding Agents with 1M-Token Models</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 02:35:21 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/three-things-i-learned-using-coding-agents-with-1m-token-models-501o</link>
      <guid>https://dev.to/mixture-of-experts/three-things-i-learned-using-coding-agents-with-1m-token-models-501o</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The effective context window is far smaller than advertised. Even with 1M-token models, performance degrades noticeably past ~100K tokens — worse coherency, more hallucinations, and planning drift. Treat the full window as a capacity limit, not an operating target.&lt;/li&gt;
&lt;li&gt;Sub-agents are essential for long-horizon work. Delegating scoped tasks to sub-agents keeps each agent in its "smart zone" and prevents context pollution. Watch for the "impatience problem" where the main agent duplicates work already delegated.&lt;/li&gt;
&lt;li&gt;Skills + CLIs beat MCP servers for context control. Skills offer progressive context disclosure and dynamic filtering. MCP servers push opaque context with limited filtering — a critical difference when every token counts.&lt;/li&gt;
&lt;li&gt;Context is the scarce resource, not capability. Compaction strategy, sub-agent architecture, and tool selection should all be designed around keeping context lean, scoped, and fresh.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've been using coding agents heavily — primarily Copilot CLI and the SDK, but also Claude Code and other agentic tools — alongside the 1M-token context models (Codex 5.4 and Opus/Sonnet 4.6). While the examples below are drawn from my Copilot CLI workflow, these patterns apply to any coding agent that operates on long-context models: Claude Code, Cursor, Windsurf, Aider, or whatever you're using. The underlying constraints are model-level, not tool-specific.&lt;/p&gt;

&lt;p&gt;My workflow has evolved significantly from where most people start. Most developers see "1M tokens" and think "I can throw everything at the model." The results are predictably bad. Worse coherency. More hallucinations. Plans that drift until they're unrecognizable. The full context window is a capacity limit, not an operating target.&lt;/p&gt;

&lt;p&gt;Here are three patterns that fundamentally changed how I work with these tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The "Smart Zone" Is Much Smaller Than You Think
&lt;/h2&gt;

&lt;p&gt;Even though these models support context windows of up to 1 million tokens, the effective performance zone is significantly smaller — and the reasons are architectural, not incidental.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the limitation exists
&lt;/h3&gt;

&lt;p&gt;Most 1M-token models aren't fundamentally larger or smarter than their shorter-context predecessors. They achieve extended context through mathematical techniques like YaRN (Yet another RoPE extensioN — &lt;a href="https://arxiv.org/pdf/2309.00071" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2309.00071&lt;/a&gt;) that stretch the model's sequence length without adding parameters. The context window grows, but the model's core reasoning capacity — what HumanLayer calls the "instruction budget" (&lt;a href="https://www.hlyr.dev/blog/long-context-isnt-the-answer" rel="noopener noreferrer"&gt;https://www.hlyr.dev/blog/long-context-isnt-the-answer&lt;/a&gt;) — stays the same.&lt;/p&gt;

&lt;p&gt;The instruction budget is the number of instructions a model can reliably follow before adherence starts to drop. It's strongly correlated with the model's parameter count and instruction tuning quality, not with its context window size. When you extend the context 5x without scaling the instruction budget, you can fit more information in, but the model isn't actually better at attending to it. HumanLayer found this firsthand when they tested Claude Opus 4.6 (1M context): instruction adherence degraded not just at capacity limits, but across all context lengths compared to the shorter-context Opus 4.5.&lt;/p&gt;

&lt;p&gt;Think of it this way: your context window is a haystack where tool calls, documents, and files are the hay. The quality of the agent's next action depends on its ability to find the right needle — the most relevant instruction for the current state. Expanding the haystack 5x without improving the model's needle-finding ability just buries the signal deeper.&lt;/p&gt;

&lt;h3&gt;
  
  
  What degradation looks like in practice
&lt;/h3&gt;

&lt;p&gt;In my experimentation across different prompt and context sizes, model performance starts to degrade noticeably past approximately 100K tokens. This shows up as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worse task coherency — the model loses track of the overall objective&lt;/li&gt;
&lt;li&gt;Reduced reasoning reliability — logical chains break down&lt;/li&gt;
&lt;li&gt;Increased hallucination rate — the model confidently fabricates details&lt;/li&gt;
&lt;li&gt;Planning drift in long-horizon tasks — multi-step plans veer off course&lt;/li&gt;
&lt;li&gt;Instruction disobedience — the model ignores design documents, misunderstands simple instructions, or makes trivial mistakes it wouldn't make in a leaner context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't theoretical. I've watched agents produce clean, well-reasoned output at 80K tokens, then fall apart at 150K with the same task and codebase. The degradation isn't binary — it's a gradient. But the inflection point is consistent enough that I've built my workflow around it. HumanLayer observed the same pattern — they shifted their context warnings to trigger at 100K tokens rather than at a percentage of the usable window.&lt;/p&gt;

&lt;h3&gt;
  
  
  What works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Trigger auto-compaction earlier. Don't wait until the context window is full. Set compaction thresholds well below the model's maximum capacity.&lt;/li&gt;
&lt;li&gt;Periodically clear the context window. Persist progress to disk — research docs, specs, task lists — then start fresh sessions that load only what's needed for the current phase.&lt;/li&gt;
&lt;li&gt;Stop max-packing prompts. The fact that the model allows 1M tokens doesn't mean you should use them. Treat the full window as headroom for unexpected context growth, not as the target operating point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical rule: treat the full 1M window as a capacity limit, not an operating target. More context isn't more capability. Design your workflows around staying well under it.&lt;/p&gt;
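
&lt;p&gt;To make the first two bullets concrete, here is a minimal sketch of an early-compaction trigger, assuming a hypothetical agent object with context_messages, save_progress(), and compact(); the thresholds come from my experiments above, not from any official guidance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;SMART_ZONE_LIMIT = 100_000  # degradation observed past ~100K tokens
COMPACT_AT = 80_000         # compact well before the limit, not at capacity

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token for English text and code.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(agent):
    if estimate_tokens(agent.context_messages) &amp;gt;= COMPACT_AT:
        # Persist durable state first (specs, task lists), then replace
        # older messages with a condensed summary.
        agent.save_progress("progress.md")
        agent.compact(keep_recent=20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;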

&lt;h2&gt;
  
  
  2. Use Sub-Agents to Offload Long-Horizon Work
&lt;/h2&gt;

&lt;p&gt;One of the most effective patterns I've found is spawning sub-agents to balance the main agent's context and handle complex or long-running tasks.&lt;/p&gt;

&lt;p&gt;The concept is straightforward: instead of stuffing everything into one agent's context window, delegate scoped work to sub-agents that operate in their own context windows. The orchestrating agent receives condensed results. Its context stays lean. Each sub-agent gets only the information it needs.&lt;/p&gt;

&lt;p&gt;This directly addresses the context degradation problem. If you can keep each agent under 100K tokens by distributing work across multiple agents, you stay in the "smart zone" even for tasks that would otherwise require 300K+ tokens of total context.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Orchestrator Pattern
&lt;/h3&gt;

&lt;p&gt;Below is a template I use for an orchestrator sub-agent (adapted from HumanLayer's work on sub-agent orchestration, with modifications for my workflow):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orchestrator&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Orchestrate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sub-agents&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;accomplish&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;long-horizon&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tasks&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;without&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;losing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;coherency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;delegating&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sub-agents."&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edit"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="s"&gt;You are a sub-agent orchestrator. The most important tool available to you&lt;/span&gt;
&lt;span class="na"&gt;is the one that dispatches sub-agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;either `Agent` or `Task`.&lt;/span&gt;

&lt;span class="s"&gt;All non-trivial operations should be delegated to sub-agents.&lt;/span&gt;

&lt;span class="s"&gt;Delegate research and codebase understanding tasks to codebase-analyzer,&lt;/span&gt;
&lt;span class="s"&gt;codebase-locator, and pattern-locator sub-agents.&lt;/span&gt;

&lt;span class="s"&gt;Delegate running bash commands (particularly ones likely to produce lots&lt;/span&gt;
&lt;span class="s"&gt;of output) to Bash sub-agents.&lt;/span&gt;

&lt;span class="s"&gt;Use separate sub-agents for separate tasks, and launch them in parallel —&lt;/span&gt;
&lt;span class="s"&gt;but do not delegate tasks with significant overlap to separate sub-agents.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate sub-agents for separate tasks — prevents context pollution between unrelated work&lt;/li&gt;
&lt;li&gt;Parallel execution — sub-agents can work simultaneously on independent tasks&lt;/li&gt;
&lt;li&gt;No overlapping delegation — avoids duplicate work and conflicting outputs&lt;/li&gt;
&lt;/ul&gt;
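
&lt;p&gt;Here is what those decisions look like in harness code, as a minimal sketch assuming a hypothetical spawn_subagent() helper (this is not any particular tool's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def spawn_subagent(name, task):
    # Stand-in: a real harness dispatches this to a fresh agent context
    # and returns only a condensed report, never the full transcript.
    return f"[{name}] summary of: {task}"

async def orchestrate():
    # Separate sub-agents for separate tasks, launched in parallel;
    # scopes do not overlap, so outputs cannot conflict.
    results = await asyncio.gather(
        spawn_subagent("codebase-locator", "find all usages of Session"),
        spawn_subagent("codebase-analyzer", "summarize the auth module"),
        spawn_subagent("bash", "run the test suite; report failures only"),
    )
    # The orchestrator's context receives three short summaries,
    # not three full working transcripts.
    return "\n\n".join(results)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;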

&lt;h3&gt;
  
  
  The Impatience Problem
&lt;/h3&gt;

&lt;p&gt;There's a behavioral quirk worth calling out. Post-training tends to bias these models toward acting immediately rather than waiting on delegated work. In practice, this means the main agent becomes impatient — it attempts to complete a task that's already been delegated to a sub-agent.&lt;/p&gt;

&lt;p&gt;This defeats the purpose of sub-agents entirely. You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context pollution — the main agent duplicates work happening in a sub-agent&lt;/li&gt;
&lt;li&gt;Duplicate work — wasted compute and potentially conflicting outputs&lt;/li&gt;
&lt;li&gt;Planning drift — the main agent's plan diverges from the sub-agent's execution&lt;/li&gt;
&lt;li&gt;Loss of orchestration coherency — the delegation structure breaks down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's why I explicitly include this instruction in every orchestrator prompt:&lt;/p&gt;

&lt;p&gt;"IMPORTANT: Sometimes sub-agents will take a long time. DO NOT attempt to do the job yourself while waiting for the sub-agent to respond. Instead, use the time to plan out your next steps, or ask the user follow-up questions to clarify the task requirements."&lt;/p&gt;

&lt;p&gt;This isn't specific to any one tool. It's a model-level behavioral tendency — the post-training optimization makes models want to "do something" rather than wait. I first noticed it in Copilot CLI, but the same pattern shows up in Claude Code, Cursor, and other agentic systems. The explicit instruction overrides that default regardless of which agent you're using.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Prefer Skills + CLIs Over MCP Servers
&lt;/h2&gt;

&lt;p&gt;In practice, I consistently favor Skills + CLIs over MCP servers for agent tool integration.&lt;/p&gt;

&lt;p&gt;The reason is context control. Skills and CLIs support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Progressive context disclosure — you control exactly what context enters the prompt window, when, and in what form&lt;/li&gt;
&lt;li&gt;Dynamic filtering — you can scope the retrieved context based on the current task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP servers, by contrast, often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push opaque context — the server decides what to include, and you have limited visibility into what enters your prompt&lt;/li&gt;
&lt;li&gt;Provide limited filtering — the architectural design of MCP makes it harder to control the granularity of context injection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distinction becomes critical once you're operating near the 100K+ token regime. When every token of context matters, you need tight control over what the agent "knows" at any point in time. Skills give you that control. MCP servers often don't.&lt;/p&gt;
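
&lt;p&gt;To make progressive disclosure concrete, here is a minimal sketch. The on-disk layout and loader are illustrative assumptions rather than a real registry API: one-line summaries stay in context at all times, while full instructions load only on invocation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;SKILLS = {
    "migrate-db": {
        "summary": "Generate and apply schema migrations",
        "path": "skills/migrate-db/SKILL.md",
    },
    "profile-perf": {
        "summary": "Profile a service and report hotspots",
        "path": "skills/profile-perf/SKILL.md",
    },
}

def skill_index():
    # Cheap: a few tokens per skill, always present in the prompt.
    return "\n".join(f"- {name}: {s['summary']}" for name, s in SKILLS.items())

def load_skill(name):
    # Expensive: full instructions enter context only when the skill is
    # invoked, and can be dropped again once the task is done.
    with open(SKILLS[name]["path"]) as f:
        return f.read()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;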

&lt;h3&gt;
  
  
  Skill Registries for Discoverable Capabilities
&lt;/h3&gt;

&lt;p&gt;To ground coding agents with capabilities they can dynamically discover and download, two registries are worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Agent Skills Directory — a curated directory of reusable agent skills: &lt;a href="https://skills.sh/" rel="noopener noreferrer"&gt;https://skills.sh/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;microsoft/skills — Microsoft's open-source skill repository: &lt;a href="https://github.com/microsoft/skills" rel="noopener noreferrer"&gt;https://github.com/microsoft/skills&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These registries let agents find and adopt capabilities without ballooning the primary context with skill definitions that aren't needed for the current task. A skill is loaded when it's needed, used, and then unloaded, reclaiming the context it occupied.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;All three tips point to the same underlying principle: context is the scarce resource, not capability.&lt;/p&gt;

&lt;p&gt;The models are capable enough. The context window is large enough. But the effective operating zone is much smaller than the theoretical maximum. Everything you do — compaction strategy, sub-agent architecture, tool selection — should be designed around keeping context lean, scoped, and fresh.&lt;/p&gt;

&lt;p&gt;Treat context like memory in a constrained system. Allocate carefully. Free aggressively. Never assume that having more headroom means you should use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;HumanLayer, "Long-Context Isn't the Answer": &lt;a href="https://www.hlyr.dev/blog/long-context-isnt-the-answer" rel="noopener noreferrer"&gt;https://www.hlyr.dev/blog/long-context-isnt-the-answer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models": &lt;a href="https://arxiv.org/pdf/2309.00071" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2309.00071&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Agent Skills Directory: &lt;a href="https://skills.sh/" rel="noopener noreferrer"&gt;https://skills.sh/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Microsoft Skills Repository: &lt;a href="https://github.com/microsoft/skills" rel="noopener noreferrer"&gt;https://github.com/microsoft/skills&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>learning</category>
    </item>
    <item>
      <title>The Memory Wall Is Coming Down — What It Means for Coding Agents</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 02:25:16 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/the-memory-wall-is-coming-down-what-it-means-for-coding-agents-282j</link>
      <guid>https://dev.to/mixture-of-experts/the-memory-wall-is-coming-down-what-it-means-for-coding-agents-282j</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The memory wall is a primary constraint on coding agents, not model intelligence. Quadratic attention costs, KV cache growth, and "lost in the middle" degradation create a hard ceiling on how long agents can maintain coherent reasoning.&lt;/li&gt;
&lt;li&gt;Research breakthroughs compose: 30x+ KV memory reduction is within reach. TriAttention's intelligent pruning and TurboQuant's 3-bit quantization are complementary techniques that stack naturally, while Latent Briefing cuts multi-agent context sharing costs by 49%.&lt;/li&gt;
&lt;li&gt;Fundamentally different theories of agent memory are emerging. For example, MemPalace bets on structured archival with spatial retrieval; Hippo Memory bets on intelligent forgetting with decay-based consolidation. The field hasn't converged on which approach wins; the answer may depend on the use case.&lt;/li&gt;
&lt;li&gt;The harness is becoming an operating system for agent memory. Claude Code's three-layer compaction, four-tier persistence hierarchy, and self-healing query loop reveal that production coding agents are already memory management systems — and this pattern will only deepen.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One of the biggest constraints on coding agents is memory.
&lt;/h2&gt;

&lt;p&gt;Specifically, it's the quadratic cost of attention — the mechanism that lets models weigh the relevance of every token against every other token. This single architectural bottleneck determines how long an agent can think, how much context it can hold, and how complex the tasks it can tackle before it starts forgetting what it was doing.&lt;/p&gt;

&lt;p&gt;Three layers of innovation are converging on this problem simultaneously: foundational research that slashes the memory cost of attention itself, community-built tools that give agents persistent memory across sessions, and production harness architectures that manage context as a first-class engineering concern. These aren't isolated efforts. They're solving the same problem at different altitudes. Understanding how they connect — and where they're heading — is essential for anyone building with or for coding agents today.&lt;/p&gt;

&lt;h2&gt;
  
  
  The attention tax every coding agent pays
&lt;/h2&gt;

&lt;p&gt;Before getting into solutions, it's worth understanding the constraint clearly. If you've worked with coding agents, you've felt this — even if you didn't have a name for it.&lt;/p&gt;

&lt;p&gt;Transformers process input through an attention mechanism. For every new token the model generates, it computes a relevance score against every previous token in the context window. This is what makes language models powerful: they can relate distant pieces of information. It's also what makes them expensive: the computation scales quadratically with sequence length. Double the context, quadruple the cost.&lt;/p&gt;

&lt;p&gt;Cost flow: context length drives both KV cache memory (linear growth) and attention computation (quadratic growth); together they hit the GPU memory wall, and the result is degraded performance or an out-of-memory failure.&lt;/p&gt;
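
&lt;p&gt;A quick back-of-envelope sketch makes the scaling concrete. The configuration below is an illustrative assumption (roughly a 70B-class model with grouped-query attention and an fp16 cache), not any specific model's numbers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def kv_cache_gib(seq_len, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for keys and values; grows linearly with sequence length.
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len / 2**30

for n in (50_000, 100_000, 200_000):
    # Attention score count grows quadratically: double the context, 4x the scores.
    print(f"{n} tokens: KV cache ~{kv_cache_gib(n):.0f} GiB, scores ~{n * n:.1e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;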

&lt;p&gt;In practice, this means a 200K token context window is not 200K tokens of useful capacity. Claude Code's 200K window shows measurable degradation around 147K–152K tokens. System prompts alone can consume 30K–40K tokens before the user types anything. The "lost in the middle" phenomenon — where models deprioritize information in the middle of long contexts — compounds the problem. More context doesn't mean better understanding. Past a threshold, it means worse understanding.&lt;/p&gt;

&lt;p&gt;For coding agents, this creates a hard ceiling. A long refactoring session accumulates tool results, file reads, error traces, and intermediate reasoning. Each step adds to the context. Eventually, the agent is spending more compute re-attending to stale history than reasoning about the current problem. This is the memory wall, and it's the primary reason coding agents degrade on long tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Research is breaking the wall
&lt;/h2&gt;

&lt;p&gt;This isn't one paper. It's a wave. Multiple research teams are attacking the memory wall from different angles simultaneously: compressing the KV cache through structural insights, quantizing it to extreme bit-widths, and making multi-agent context sharing efficient at the representation level.&lt;/p&gt;

&lt;h3&gt;
  
  
  TriAttention: compressing the KV cache without losing quality
&lt;/h3&gt;

&lt;p&gt;The KV (key-value) cache stores the attention state for every token the model has processed. As context grows, this cache becomes the dominant memory consumer. Existing compression methods like SnapKV try to prune unimportant keys, but they estimate importance using attention scores from recent queries — and those scores are distorted by a positional encoding called RoPE (Rotary Position Embedding), making them unreliable.&lt;/p&gt;

&lt;p&gt;TriAttention, from researchers at MIT, NVIDIA, and Zhejiang University, takes a different approach. It exploits a structural property the authors call Q/K concentration: in the pre-RoPE representation space, query and key vectors cluster tightly around fixed centers regardless of input or position. Approximately 90% of attention heads in tested models show this property. These stable centers determine which token distances each head preferentially attends to via a trigonometric distance-preference function.&lt;/p&gt;

&lt;p&gt;Instead of dynamically guessing which keys matter, TriAttention scores each key against these fixed centers using the trigonometric function, then keeps only the top-scoring keys. The scoring runs as a fused Triton kernel with a protected window of recent tokens that are never evicted.&lt;/p&gt;
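
&lt;p&gt;A drastically simplified sketch of that selection step: score each cached key against its head's fixed center, then keep the best-scoring older keys plus the protected recent window. The real method uses a trigonometric distance-preference function and a fused Triton kernel; the plain dot-product scoring here is a stand-in.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def prune_keys(keys, center, budget, protect=128):
    # keys: (seq_len, head_dim); center: (head_dim,), fixed per attention head.
    scores = keys @ center                  # affinity with the head's center
    n = len(keys)
    recent = np.arange(n - protect, n)      # protected window: never evicted
    old = np.arange(n - protect)
    best_old = old[np.argsort(scores[old])[::-1][: budget - protect]]
    return np.sort(np.concatenate([best_old, recent]))  # indices to retain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;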

&lt;p&gt;The results are striking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10.7x KV memory reduction&lt;/li&gt;
&lt;li&gt;2.5x throughput on long reasoning tasks (32K token generation), with accuracy matching full attention (40.8 on AIME25 for both)&lt;/li&gt;
&lt;li&gt;6.3x throughput on MATH-500 with only 1.2 percentage points of accuracy loss&lt;/li&gt;
&lt;li&gt;Existing baselines (SnapKV, R-KV) collapse to roughly half the accuracy at the same memory budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical implication: reasoning models that previously required multi-GPU setups can run on a single consumer GPU. For coding agents, this means longer reasoning chains within the same hardware constraints — more time thinking about your refactoring task before the memory wall hits.&lt;/p&gt;

&lt;h3&gt;
  
  
  TurboQuant: extreme compression, zero accuracy loss
&lt;/h3&gt;

&lt;p&gt;While TriAttention prunes which keys to keep, Google's TurboQuant (ICLR 2026) attacks the same problem from a complementary angle: making each key smaller. It quantizes the KV cache down to 3 bits per parameter — training-free — using two techniques: PolarQuant, which rotates key/value vectors into a representation that quantizes more uniformly, and Quantized Johnson-Lindenstrauss compression, which reduces dimensionality while preserving distance relationships.&lt;/p&gt;

&lt;p&gt;The result: no measurable accuracy loss across LongBench, RULER, and Needle-in-a-Haystack benchmarks. In practice, this means ~3x longer effective context on the same GPU memory. Stack TurboQuant with TriAttention's pruning and you're looking at 30x+ memory reduction — enough to hold a substantial codebase's worth of context on hardware that currently struggles with a single long conversation.&lt;/p&gt;

&lt;p&gt;These aren't competing approaches. Pruning (which keys to keep) and quantization (how much space each key needs) compose naturally. The research community is converging on a layered compression stack for attention, much like how image codecs layer spatial compression, quantization, and entropy coding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latent Briefing: efficient memory sharing between agents
&lt;/h3&gt;

&lt;p&gt;Multi-agent systems have a compounding token problem. When an orchestrator delegates tasks to worker agents, each worker needs context about what the orchestrator has already figured out. The naive approach — passing the full reasoning trajectory as text — causes token usage to explode with each successive call. Summarization is slow and lossy. RAG retrieval is brittle.&lt;/p&gt;

&lt;p&gt;Latent Briefing, from Ramp Labs, operates at a different level entirely. Instead of compressing text, it compresses the model's internal representations.&lt;/p&gt;

&lt;p&gt;The mechanism: the orchestrator's accumulated reasoning is forward-passed through the worker model. The attention scores between the task prompt's query vectors and the trajectory's KV cache keys reveal which parts of the context the worker considers relevant — and crucially, this relevance is task-adaptive. Different queries compress the same context differently. The method then constructs a compact KV cache using the important keys, bias corrections for missing keys, and reconstructed values via ridge regression.&lt;/p&gt;
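
&lt;p&gt;A drastically simplified sketch of the key-selection step, assuming raw query and key matrices are available: measure how much attention the worker's task-prompt queries pay to each trajectory key, then keep the top fraction. The bias corrections and ridge-regression value reconstruction from the paper are omitted here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def select_keys(task_queries, trajectory_keys, keep_fraction=0.5):
    # task_queries: (q, d); trajectory_keys: (n, d)
    logits = task_queries @ trajectory_keys.T / np.sqrt(task_queries.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)     # softmax: attention per query
    relevance = w.sum(axis=0)             # total attention mass per key
    k = int(len(trajectory_keys) * keep_fraction)
    return np.sort(np.argsort(relevance)[::-1][:k])  # task-adaptive selection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;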

&lt;p&gt;Tested with Claude Sonnet 4 (orchestrator) and Qwen-14B (worker) on 126 LongBench v2 questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;49% median token savings on medium-length documents (32K–100K tokens)&lt;/li&gt;
&lt;li&gt;+3 percentage point accuracy gain at the right compaction threshold — it actually performs better with less context&lt;/li&gt;
&lt;li&gt;Compaction takes ~1.7 seconds, roughly 20x faster than sequential attention merging and 10–30x faster than LLM summarization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That accuracy gain is the telling result. Removing irrelevant context doesn't just save tokens — it helps the model focus.&lt;/p&gt;

&lt;p&gt;A practically important finding from the paper: different compaction thresholds win in different regimes. Longer documents benefit from lighter compaction — the information is dispersed and broad coverage matters, but even light pruning still saves 57% of worker tokens. Harder questions benefit from aggressive compaction (79% of context removed) because the orchestrator's speculative reasoning generates noise that dilutes the worker's signal. Moderate compaction works best for short, focused documents.&lt;/p&gt;

&lt;p&gt;This isn't just a tuning knob. It's a design principle: the right amount of context depends on the task, not just the budget. Compaction should be task-aware, not one-size-fits-all. This validates a principle that experienced harness engineers already know intuitively: less context, better directed, beats more context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Products are building on top
&lt;/h2&gt;

&lt;p&gt;While researchers optimize what happens inside the model's context window, community builders are attacking the problem from the other direction: giving agents external memory that persists beyond any single session.&lt;/p&gt;

&lt;h3&gt;
  
  
  MemPalace: structured recall through spatial organization
&lt;/h3&gt;

&lt;p&gt;MemPalace maps the ancient Method of Loci to a data architecture for AI agents. Wings are top-level categories (a person, a project). Rooms are specific topics within a wing. Halls connect rooms by type. Tunnels automatically link the same room across different wings. Drawers are the atomic unit: verbatim text chunks that are never summarized.&lt;/p&gt;

&lt;p&gt;The technical backbone is dual storage: ChromaDB for semantic vector search and SQLite for a temporal knowledge graph that tracks facts over time. A four-layer memory stack minimizes token cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L0 (identity, ~50 tokens) and L1 (critical facts, ~120 tokens) load on every startup&lt;/li&gt;
&lt;li&gt;L2 (room recall) and L3 (deep search) fire only on demand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In benchmarks, wing+room metadata filtering improves retrieval from 60.9% to 94.8% R@10 — though this leverages standard ChromaDB metadata filtering rather than a novel retrieval mechanism. The real value is the spatial organization model itself, which gives agents a structured way to scope queries. Everything runs locally with no cloud dependency.&lt;/p&gt;
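
&lt;p&gt;A minimal sketch of the load policy and scoped recall, assuming a ChromaDB collection whose documents carry wing and room metadata; the function names are illustrative, not MemPalace's API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def startup_context(identity, critical_facts):
    # L0 (~50 tokens) + L1 (~120 tokens) load on every session start.
    return identity + "\n" + critical_facts

def room_recall(collection, query, wing, room):
    # L2/L3 fire only on demand: semantic search scoped by wing/room
    # metadata, the filtering behind the 60.9% to 94.8% R@10 jump.
    return collection.query(
        query_texts=[query],
        n_results=10,
        where={"$and": [{"wing": wing}, {"room": room}]},
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;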

&lt;h3&gt;
  
  
  Hippo Memory: forgetting as a feature
&lt;/h3&gt;

&lt;p&gt;Hippo Memory takes a neuroscience-inspired approach with a three-tier hierarchy mimicking human memory: a buffer (working memory, current session only), an episodic store (timestamped memories with a 7-day half-life that strengthens through retrieval), and a semantic store (stable patterns extracted during consolidation).&lt;/p&gt;

&lt;p&gt;The key innovation is the sleep command — a consolidation pipeline that runs a decay pass to remove weak memories, a replay pass that finds three or more related episodes via embedding similarity and extracts common patterns into semantic memory, conflict detection for contradictions, and schema indexing to update topic clusters.&lt;/p&gt;

&lt;p&gt;Memories decay by default. Persistence is earned through use. Errors get 2x the half-life. Breakthroughs get priority. This is the opposite of "store everything and search later." It's a bet that intelligent forgetting is as important as precise recall.&lt;/p&gt;
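
&lt;p&gt;A minimal sketch of the decay math, using the rules described above (7-day half-life, doubled for errors, strengthened on retrieval); the memory dictionary shape is an illustrative assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

DAY = 86_400

def strength(memory, now=None):
    # Exponential decay with a 7-day half-life; errors get 2x the half-life.
    now = time.time() if now is None else now
    half_life = 7 * DAY * (2 if memory["kind"] == "error" else 1)
    return memory["weight"] * 0.5 ** ((now - memory["last_used"]) / half_life)

def on_retrieval(memory):
    # Persistence is earned through use: retrieval resets the clock
    # and strengthens the trace.
    memory["last_used"] = time.time()
    memory["weight"] *= 1.5

def decay_pass(memories, floor=0.05):
    # One step of the "sleep" pipeline: weak memories are dropped outright.
    return [m for m in memories if strength(m) &amp;gt;= floor]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;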

&lt;h3&gt;
  
  
  The key distinction
&lt;/h3&gt;

&lt;p&gt;MemPalace and Hippo Memory represent two fundamentally different theories of agent memory. MemPalace is a structured archive — store everything verbatim, make it findable through spatial organization. Hippo is a dynamic brain — memories compete for survival through use, decay, and consolidation. MemPalace bets on retrieval precision. Hippo bets on forgetting as a feature. Both are valid. The field hasn't converged on which approach wins yet — and the answer may be that different tasks demand different memory architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code's architecture: the harness layer
&lt;/h2&gt;

&lt;p&gt;Claude Code reveals what a production coding agent does when it can't wait for research to ship. Its architecture is a pragmatic, multi-layered response to the memory wall — and it's instructive because it shows what works today.&lt;/p&gt;

&lt;h3&gt;
  
  
  The self-healing query loop
&lt;/h3&gt;

&lt;p&gt;Claude Code doesn't use standard request-response. It runs a continuous state machine designed to absorb failures. When the model exhausts its output budget mid-task, the loop doesn't crash. It triggers compression automatically, carving out a buffer before the token ceiling and generating a structured summary. If the API returns a prompt_too_long error, reactive compression fires and retries. To prevent infinite loops, auto-compaction pauses after three consecutive failures.&lt;/p&gt;
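
&lt;p&gt;The shape of that loop, sketched with hypothetical agent and client objects; the exception class stands in for the API's context-overflow error:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class PromptTooLongError(Exception):
    pass  # stand-in for the API's context-overflow error

def query_loop(agent, client):
    failures = 0
    while not agent.done():
        try:
            agent.apply(client.complete(agent.messages))
            failures = 0
        except PromptTooLongError:
            failures += 1
            if failures &amp;gt;= 3:
                agent.pause("auto-compaction suspended")  # no infinite loops
                break
            agent.compress()  # reactive compression, then retry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;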

&lt;h3&gt;
  
  
  Three-layer compaction
&lt;/h3&gt;

&lt;p&gt;The compaction system uses progressively stronger cleanup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rules engine cleanup — lightweight, no LLM call. Strips known low-value patterns: stale tool results, redundant messages.&lt;/li&gt;
&lt;li&gt;Session memory extraction — writes extracted facts to disk, removes them from context. Still avoids an LLM call.&lt;/li&gt;
&lt;li&gt;Full summary — when layers 1 and 2 are insufficient, an LLM-generated summary replaces older messages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A critical design choice: Claude Code preserves the message prefix so Anthropic's prompt cache remains valid. Naive oldest-first deletion would invalidate the entire cache on every compaction — a costly mistake that would negate the efficiency gains.&lt;/p&gt;
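
&lt;p&gt;Putting the three layers and the prefix rule together, a sketch in which strip_stale_tool_results, extract_session_memory, llm_summarize, and tokens are hypothetical stand-ins for the real machinery:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;PREFIX_LEN, KEEP_RECENT = 4, 20  # illustrative constants

def compact(messages, budget):
    # The prefix is never touched, so the provider's prompt cache stays valid.
    prefix, rest = messages[:PREFIX_LEN], messages[PREFIX_LEN:]

    rest = strip_stale_tool_results(rest)         # layer 1: rules, no LLM call
    if tokens(prefix + rest) &amp;lt;= budget:
        return prefix + rest

    rest = extract_session_memory(rest)           # layer 2: persist facts to disk
    if tokens(prefix + rest) &amp;lt;= budget:
        return prefix + rest

    summary = llm_summarize(rest[:-KEEP_RECENT])  # layer 3: full LLM summary
    return prefix + [summary] + rest[-KEEP_RECENT:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;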

&lt;h3&gt;
  
  
  Four-tier memory hierarchy
&lt;/h3&gt;

&lt;p&gt;Persistence across sessions uses four tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLAUDE.md — project-level instructions, read on every session start, survives compaction by being re-read from disk&lt;/li&gt;
&lt;li&gt;Auto Memory — topic files in .claude/ that evolve with project knowledge&lt;/li&gt;
&lt;li&gt;Session Memory — cross-session context extracted every ~5,000 tokens&lt;/li&gt;
&lt;li&gt;/remember — promotes recurring patterns into permanent configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How the layers stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research layer (TriAttention KV cache pruning + TurboQuant 3-bit KV quantization + Latent Briefing representation-level sharing): a wider foundation, more tokens at less cost&lt;/li&gt;
&lt;li&gt;Product layer (MemPalace structured archive + Hippo Memory dynamic forgetting): extended reach, memory beyond sessions&lt;/li&gt;
&lt;li&gt;Harness layer (three-layer compaction + four-tier memory hierarchy + self-healing loop): a managed interface, finite context serving infinite tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer feeds the one above it, and the developer experience sits on top of the harness.&lt;/p&gt;

&lt;p&gt;This architecture is fundamentally a memory management system that bridges finite context windows and unbounded tasks. The model handles what fits in attention. The harness handles everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the pieces map together
&lt;/h2&gt;

&lt;p&gt;These three layers — research, product, harness — aren't competing. They're solving the same problem at different altitudes, and their relationship is structural.&lt;/p&gt;

&lt;p&gt;The research layer makes the foundation wider. TriAttention and TurboQuant compose to achieve 30x+ memory reduction for the KV cache. Latent Briefing lets multiple agents share context at 50% of the token cost with better accuracy. These don't change what models can do conceptually — they change the economics of how much they can hold while doing it.&lt;/p&gt;

&lt;p&gt;The product layer extends reach beyond any single context window. MemPalace and Hippo Memory give agents access to knowledge that no context window, however large, could contain: months of project history, cross-session decisions, accumulated preferences. They're building external memory systems because even with perfect attention, a context window is still a window.&lt;/p&gt;

&lt;p&gt;The harness layer manages the interface between finite capacity and infinite demand. Claude Code's compaction, memory hierarchy, and self-healing loop exist because even with better attention and external memory, someone still needs to decide what goes in the context window right now. The harness is the memory manager — it routes information between tiers, decides what to compress, and recovers when capacity runs out.&lt;/p&gt;

&lt;p&gt;The layers are complementary because each one makes the others more effective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better KV compression (research) means harness compaction can be less aggressive, preserving more context quality&lt;/li&gt;
&lt;li&gt;Richer external memory (product) means the harness can offload more confidently, knowing it can retrieve when needed&lt;/li&gt;
&lt;li&gt;Smarter harness routing means research-level efficiency gains translate into user-visible capability, not just lower API bills&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Predictions: where this convergence leads
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Near-term
&lt;/h3&gt;

&lt;p&gt;The next wave of KV cache compression hits production. The first wave is already standard: Grouped Query Attention is baked into every major open-weight model, PagedAttention is the default memory manager in vLLM, FP8 KV quantization ships in both vLLM and TensorRT-LLM, and prefix caching is default-on at every major provider. What's coming is more aggressive. TriAttention already ships as a vLLM plugin with community ports for llama.cpp and MLX. TurboQuant emerged from Google Research with community MLX implementations appearing within weeks. Within a year, sub-4-bit KV quantization and intelligent pruning will be default options in inference frameworks — not research experiments. This next wave directly extends how long coding agents can maintain coherent reasoning on a single task.&lt;/p&gt;

&lt;p&gt;Multi-agent memory sharing moves from research to production. Latent Briefing's representation-level compaction is one compelling approach, but it's part of a broader wave. Teams are exploring shared KV cache pools across co-located agents, lightweight context distillation protocols, and hierarchical memory architectures where agents at different levels of an orchestration tree maintain context at different granularities. The common thread: making delegation cheap by solving context transfer at the systems level rather than through brute-force token passing. Expect multi-agent coding workflows to adopt one or more of these techniques, dramatically cutting the cost of orchestrator-to-worker handoffs.&lt;/p&gt;

&lt;p&gt;External memory becomes expected, not optional. MemPalace and Hippo Memory are early but the pattern is clear: coding agents that remember project context across sessions will outperform those that don't, and developers will demand this capability. Claude Code's CLAUDE.md and Auto Memory are the first-party version of this trend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Medium-term
&lt;/h3&gt;

&lt;p&gt;Always-on agents become practical. The combination of compressed attention, efficient multi-agent sharing, and tiered external memory unlocks a new class of applications: coding agents that maintain continuous context over days or weeks. Not because the context window grows to millions of tokens, but because the system around it manages memory intelligently at every layer.&lt;/p&gt;

&lt;p&gt;Memory consolidation becomes the norm. Hippo Memory's bet — that forgetting is as important as remembering — is likely directionally right, even if the specific mechanisms evolve. As agents accumulate months of project history, storing everything becomes as costly as forgetting everything. The winning systems will almost certainly need some form of consolidation: compressing episodes into patterns, decaying noise, strengthening what's used. Human memory works this way for a reason.&lt;/p&gt;

&lt;p&gt;Models develop distinct memory tiers internally. Current models treat all tokens in the context window equally. Future architectures will likely differentiate between working memory (high-attention, recent, expensive) and reference memory (lower-attention, compressed, cheap) — mirroring what harnesses already do externally. When this happens, the external product layer and the internal model layer will start to merge.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this means for harness engineering
&lt;/h3&gt;

&lt;p&gt;This is where it gets concrete for builders.&lt;/p&gt;

&lt;p&gt;Harnesses will manage multiple memory tiers, not just context windows. Today, a harness manages one thing: what's in the context. Tomorrow, it manages in-context tokens, compressed KV cache segments, external vector stores, persistent project files, and cross-session summaries. The harness becomes a memory routing system.&lt;/p&gt;

&lt;p&gt;Memory routing becomes a first-class discipline. For every piece of information an agent encounters, the harness will need to make a routing decision: does this go in active context? Compressed cache? External store? Disk? Nowhere? Getting this routing right — fast, at scale, without human intervention — is the defining challenge of next-generation harness engineering.&lt;/p&gt;
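
&lt;p&gt;In code, a routing decision might look like the sketch below; the tiers and thresholds are illustrative assumptions, and real policies would be tuned or learned per workload:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def route(item):
    if item.needed_now and item.tokens &amp;lt; 2_000:
        return "active_context"      # in the window, full fidelity
    if item.age_hours &amp;lt; 24:
        return "compressed_cache"    # cheap to re-expand if needed
    if item.is_durable_fact:
        return "project_file"        # CLAUDE.md-style persistence
    if item.semantic_value &amp;gt; 0.5:
        return "vector_store"        # retrievable on demand
    return "discard"                 # nowhere: forgetting is a valid route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;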

&lt;p&gt;Compaction strategies become differentiating. Claude Code's three-layer compaction is state-of-the-art for a single-agent coding tool today. But as tasks get longer and multi-agent workflows become standard, compaction will need to become task-aware (like Latent Briefing), importance-weighted (like Hippo's decay model), and cache-preserving (like Claude Code's prefix protection). The teams that ship the most capable agents won't be the ones with the most sophisticated compaction pipeline in isolation — they'll be the ones who understand what's in development at each layer of the research stack, recognize where models are and aren't capable of managing memory on their own, and synthesize all of that into a coherent strategy for how memory should work across long-horizon tasks.&lt;/p&gt;

&lt;p&gt;The harness is becoming an operating system for agent memory. This isn't hyperbole. An OS manages memory tiers (registers, L1/L2 cache, RAM, disk), makes routing decisions transparently, and provides a clean abstraction to the application layer. Harnesses are converging on the same architecture for agent memory.&lt;/p&gt;

&lt;p&gt;Independent research is arriving at the same conclusion. A recent paper from Yu, Zhang, Ni et al. explicitly frames multi-agent memory as a computer architecture problem — proposing a three-layer hierarchy (I/O, cache, memory) with shared vs. distributed paradigms and formal consistency protocols. Their central argument: multi-agent memory consistency is the most pressing unsolved problem in agent systems, just as cache coherence was for multiprocessor systems decades ago.&lt;/p&gt;

&lt;p&gt;The parallel is exact. And it means the teams building harnesses today are, whether they realize it or not, building the memory management layer of a new computing paradigm.&lt;/p&gt;

&lt;p&gt;The memory wall for coding agents isn't permanent. It's an engineering problem being solved at every layer simultaneously — in attention research, in community-built memory products, and in production harness architectures. The builders who understand where these layers connect will build the most capable agents. And the ones who realize that the harness isn't just an execution wrapper but a memory management system — those are the ones building the infrastructure that everything else will run on.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Weian Mao et al. "TriAttention: Efficient Tri-State KV Cache Compression for Long-Context Transformers." MIT, NVIDIA, Zhejiang University, 2026. Code: &lt;a href="https://github.com/WeianMao/triattention" rel="noopener noreferrer"&gt;https://github.com/WeianMao/triattention&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Google Research. "TurboQuant: Redefining AI Efficiency with Extreme Compression." ICLR 2026. &lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Ramp Labs. "Latent Briefing: Efficient Multi-Agent Context Sharing via Representation-Level Compaction." 2026. &lt;a href="https://x.com/RampLabs/status/2042660310851449223" rel="noopener noreferrer"&gt;https://x.com/RampLabs/status/2042660310851449223&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] MemPalace. "MemPalace: Structured Spatial Memory Architecture for AI Agents." Code: &lt;a href="https://github.com/MemPalace/mempalace" rel="noopener noreferrer"&gt;https://github.com/MemPalace/mempalace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] Hippo Memory. "Hippo Memory: Neuroscience-Inspired Memory with Forgetting and Consolidation." Code: &lt;a href="https://github.com/kitfunso/hippo-memory" rel="noopener noreferrer"&gt;https://github.com/kitfunso/hippo-memory&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] Yu, Zhang, Ni et al. "Multi-Agent Memory as a Computer Architecture Problem." arXiv:2603.10062, 2026. &lt;a href="https://arxiv.org/abs/2603.10062" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2603.10062&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>memory</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Rise of Edge AI — A New Layer in the Coding Agent Stack</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 02:13:32 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/the-rise-of-edge-ai-a-new-layer-in-the-coding-agent-stack-53hp</link>
      <guid>https://dev.to/mixture-of-experts/the-rise-of-edge-ai-a-new-layer-in-the-coding-agent-stack-53hp</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Compression breakthroughs are collapsing the hardware barrier. TurboQuant achieves 6x memory reduction with zero quality loss, and PrismML's 1-bit Bonsai 8B fits a competitive model in 1.15 GB — 14x smaller than its 16-bit equivalent. Models that required data center GPUs now run on a MacBook Pro or even a phone.&lt;/li&gt;
&lt;li&gt;Edge AI is earning a permanent place in the coding agent stack, not replacing the cloud. The open-source capability gap has closed to roughly three months. Gemma 4 ships edge-first with native function-calling under Apache 2.0, and Reflection AI's $2.5B raise signals that enterprises and infrastructure providers are investing in locally-deployable coding models as a complement to cloud services.&lt;/li&gt;
&lt;li&gt;Reinforcement learning is making tiny models genuinely useful for agent orchestration. LiquidAI's 350M-parameter model — roughly a quarter the size of GPT-2 — achieves over 95% accuracy in multi-turn tool-calling, running on hardware as small as a Raspberry Pi. Tool use is the capability that separates a coding agent from a chatbot, and it no longer requires billions of parameters.&lt;/li&gt;
&lt;li&gt;The local runtime is being purpose-built for coding agents. Ollama's MLX integration delivers 2x faster decode on Apple Silicon with caching designed specifically for agentic coding patterns — long, iterative conversations with repeated file context.&lt;/li&gt;
&lt;li&gt;A fully edge-native coding stack is viable today for cost-constrained and regulated environments. When local inference is free at the margin, the question shifts from "is edge good enough?" to "why am I paying for something I can run myself?" — and for air-gapped or compliance-bound teams, edge AI isn't a fallback, it's the only way AI enters the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next major shift in coding agents isn't about replacing the cloud. It's about complementing it — from your own hardware.&lt;/p&gt;

&lt;p&gt;Today, the most capable coding agents — Claude Code, Codex, Copilot — route every keystroke through remote inference servers. The assumption baked into the entire ecosystem is that frontier-quality AI requires frontier-scale hardware, which means renting compute from someone else. That assumption still holds for the hardest reasoning tasks. But a cascade of breakthroughs in the first quarter of 2026 is opening up a parallel track: model compression, edge-optimized releases, local runtime optimization, and reinforcement learning for small models are converging to make local AI not just possible, but genuinely useful for a growing class of developer workflows.&lt;/p&gt;

&lt;p&gt;Edge AI isn't arriving to kill the cloud. It's emerging as its own category — one that earns a permanent place in the development stack for specific, high-value scenarios where latency, privacy, cost, or availability matter more than peak reasoning power.&lt;/p&gt;

&lt;p&gt;This post walks through the evidence — paper by paper, release by release — and maps out the scenarios where edge AI will be the preferred choice for developers building and using coding agents. It's written for software engineers who use these tools daily, whether or not you've ever read a machine learning paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The compression revolution: making big models edge-ready
&lt;/h2&gt;

&lt;p&gt;The most direct path to edge AI is making existing models smaller without making them dumber. Two breakthroughs in early 2026 moved the needle dramatically, bringing frontier-class capabilities within reach of consumer hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google's TurboQuant: 6x memory reduction, zero quality loss
&lt;/h3&gt;

&lt;p&gt;Google Research revealed TurboQuant in March 2026 — a compression algorithm that reduces the memory footprint of large language models while boosting speed and maintaining accuracy. The technique targets the key-value cache, which Google describes as a "digital cheat sheet" storing previously computed attention states so the model doesn't recompute them from scratch.&lt;/p&gt;

&lt;p&gt;TurboQuant is a two-step process. First, PolarQuant converts the traditional Cartesian vector representation into polar coordinates — reducing each vector to a radius (data strength) and direction (semantic meaning). Google's analogy: instead of "go 3 blocks East, 4 blocks North," you say "go 5 blocks at 37 degrees." Less data, same destination, and no expensive normalization steps. Second, a technique called Quantized Johnson-Lindenstrauss applies a 1-bit error-correction layer, reducing residual quantization noise while preserving the distance relationships that attention scores depend on.&lt;/p&gt;

&lt;p&gt;The numbers: 6x memory reduction in the KV cache with perfect downstream accuracy across long-context benchmarks using both Gemma and Mistral models. Computing attention with 4-bit TurboQuant runs 8x faster than 32-bit unquantized keys on NVIDIA H100 accelerators. And critically, TurboQuant quantizes to 3 bits with no additional training — it can be applied to existing models off the shelf.&lt;/p&gt;

&lt;p&gt;For software engineers, here's the translation: a model that previously required 48 GB of VRAM could fit in 8 GB. That's the difference between a data center GPU and a MacBook Pro.&lt;/p&gt;
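
&lt;p&gt;A toy illustration of the core decomposition: store each vector as a radius plus a coarsely quantized direction. The real PolarQuant/QJL pipeline is far more involved (rotations, a 1-bit error-correction layer), so treat this strictly as intuition:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def polar_quantize(v):
    radius = np.linalg.norm(v)              # "go 5 blocks..."
    direction = np.sign(v).astype(np.int8)  # "...at 37 degrees": 1 bit per dim
    return radius, direction

def polar_dequantize(radius, direction):
    unit = direction / np.linalg.norm(direction)
    return radius * unit                    # lossy but compact

print(polar_dequantize(*polar_quantize(np.array([3.0, 4.0]))))  # ~[3.54 3.54]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;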

&lt;h3&gt;
  
  
  Caltech/PrismML: 1-bit models on your phone
&lt;/h3&gt;

&lt;p&gt;If TurboQuant is aggressive, PrismML's work — emerging from breakthrough research at Caltech — is radical. They've achieved true end-to-end 1-bit quantization: embeddings, attention layers, MLP layers, and the language model head are all compressed to a single bit per parameter. No higher-precision escape hatches.&lt;/p&gt;

&lt;p&gt;The result is Bonsai 8B: a model that competes with leading 8-billion-parameter models while occupying just 1.15 GB — 14x smaller than its 16-bit equivalent. PrismML measures this with an "Intelligence Density Score" of 1.06 per GB, compared to Qwen3 8B's 0.10 per GB. That's a 10.6x improvement in intelligence per unit of memory.&lt;/p&gt;

&lt;p&gt;What does this look like in practice? The Bonsai 8B runs at approximately 40 tokens per second on an iPhone 17 Pro and 131 tokens per second on an M4 Pro Mac — with an energy cost of just 0.068 mWh per token on the iPhone 17 Pro Max.&lt;/p&gt;

&lt;p&gt;An 8B-class model, competitive on benchmarks, running at interactive speeds on a phone. A year ago, that was a research fantasy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The edge-ready model wave: from cloud-only to run-anywhere
&lt;/h2&gt;

&lt;p&gt;Compression makes models smaller. But the edge story isn't just about shrinking existing models — it's about an entire class of models being designed, released, and optimized for local deployment. Open-weight releases, edge-first architectures, and purpose-built small models are all expanding what's possible without a cloud connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  The capability gap is closing fast
&lt;/h3&gt;

&lt;p&gt;Dave Friedman's analysis quantifies the trend. In 2023, closed-source models scored approximately 88% on MMLU benchmarks while open models managed 70.5% — a meaningful gap. By 2026, that gap is effectively zero on knowledge benchmarks and single digits on most reasoning tasks. Open-source models now trail the state of the art by approximately three months, down from roughly a year in late 2024.&lt;/p&gt;

&lt;p&gt;The efficiency story is equally compelling. DeepSeek's V3 model used 2.6 million GPU hours versus Llama 3 405B's 30.8 million — a tenfold efficiency improvement for comparable performance. DeepSeek's R1 reasoning model matched OpenAI's o1 at roughly 3% of the cost.&lt;/p&gt;

&lt;p&gt;For edge AI, the implication is direct: the models available for local deployment are no longer second-tier. Many of the tasks developers perform daily — code completion, documentation, refactoring, test generation — fall well within the capability range of models that can run on consumer hardware. The cloud retains its advantage for the hardest reasoning tasks, but the floor of "good enough for local" keeps rising.&lt;/p&gt;

&lt;p&gt;Capability-gap timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2023: closed-source leads (MMLU 88% vs 70.5%)&lt;/li&gt;
&lt;li&gt;2024: the gap narrows; open models trail by roughly a year&lt;/li&gt;
&lt;li&gt;2025: DeepSeek R1 matches o1 at roughly 3% of the cost&lt;/li&gt;
&lt;li&gt;2026: the gap is near zero on knowledge benchmarks; open models trail by roughly three months&lt;/li&gt;
&lt;li&gt;Result: edge-capable models reach "good enough" for daily tasks; the cloud keeps frontier reasoning, the edge takes everything else&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With LLM inference costs dropping roughly 10x annually and edge-capable models improving every quarter, the set of tasks that require cloud inference is shrinking. Cloud APIs will continue to lead on frontier reasoning, complex multi-step planning, and large-context tasks — but the everyday development workflows that consume the most tokens are increasingly viable on local hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemma 4: edge-first by design
&lt;/h3&gt;

&lt;p&gt;Google's Gemma 4 release in April 2026 is a landmark for the edge AI category. The Gemma 4 family ships in four sizes — E2B, E4B, 26B MoE, and 31B Dense — under an Apache 2.0 license, with the smaller variants explicitly designed for on-device deployment.&lt;/p&gt;

&lt;p&gt;The performance is no longer "good for a local model." It's simply good. The 31B model is the #3 open model in the world on the Arena AI text leaderboard. The 26B MoE is #6, outcompeting models 20x its size. The MoE architecture activates only 3.8 billion of its 26 billion total parameters during inference — frontier-level reasoning at a fraction of the compute cost.&lt;/p&gt;

&lt;p&gt;For edge deployment specifically, Gemma 4's E2B and E4B models run completely offline with near-zero latency across phones, Raspberry Pi, and NVIDIA Jetson Orin Nano. They feature 128K context windows, native multimodal capabilities (vision, audio), and — critically for coding agents — native function-calling, structured JSON output, and system instructions. These aren't toy models. They're agent-ready and edge-first, representing a new design philosophy: models built for the edge from the ground up, not cloud models shrunk down as an afterthought.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflection AI: $2.5 billion bet on deployable coding models
&lt;/h3&gt;

&lt;p&gt;The capital markets are backing the edge thesis. Reflection AI, founded in 2024 by former DeepMind researchers Misha Laskin and Ioannis Antonoglou, is raising $2.5 billion at a $25 billion valuation — backed by NVIDIA and JPMorgan Chase. The company's valuation went from $545 million to $25 billion in under 12 months. A 46x increase.&lt;/p&gt;

&lt;p&gt;Reflection builds open-weight models focused explicitly on automating software development — AI systems that write, test, and maintain code. Positioned as "the DeepSeek of the West," they're building a model network for enterprises, research institutions, and universities — models designed to run on your infrastructure, not just through an API.&lt;/p&gt;

&lt;p&gt;When NVIDIA pours nearly a billion dollars into a coding AI lab building locally-deployable models, and JPMorgan participates through its Security and Resiliency Initiative, the strategic message is clear: edge-deployable AI models aren't a research curiosity. They're a category that enterprises, governments, and infrastructure providers are investing in as a complement to cloud-based AI services.&lt;/p&gt;

&lt;h2&gt;
  
  
  The small model breakthrough: RL changes everything
&lt;/h2&gt;

&lt;p&gt;Compression makes big models edge-ready. Open weights and edge-first architectures give you deployment flexibility. But the third force might be the most surprising: small models are getting dramatically smarter through reinforcement learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  LiquidAI: a 350-million-parameter model that can use tools
&lt;/h3&gt;

&lt;p&gt;LiquidAI's LFM 2.5 350M is a 350-million-parameter model — roughly 1/20th the size of an 8-billion-parameter model like Bonsai — that delivers performance previously associated with models many times its size. The key innovation is applying large-scale reinforcement learning to a small model after expanded pre-training (28 trillion tokens, up from 10 trillion).&lt;/p&gt;

&lt;p&gt;The results redefine what "small" means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;76.96% on IFEval (instruction following), up from 64.96% in the previous version&lt;/li&gt;
&lt;li&gt;44.11 on BFCLv3 (tool use), roughly double the prior version's 22.95&lt;/li&gt;
&lt;li&gt;Over 95% accuracy in multi-turn tool-calling interactions across smart home, banking, and terminal use cases&lt;/li&gt;
&lt;li&gt;Runs at 40,400 tokens per second on an NVIDIA H100, and inference works on everything from a Raspberry Pi 5 to an Apple M5 Max&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 350-million-parameter model with reliable tool use. Think about what that means for coding agents. Tool use — calling functions, reading files, executing shell commands — is the foundational capability that separates a coding agent from a chatbot. If a model small enough to run on a Raspberry Pi can reliably call tools, the minimum hardware bar for a useful coding agent drops to essentially nothing.&lt;/p&gt;

&lt;p&gt;LiquidAI's model isn't recommended for complex math, code generation, or creative writing — those tasks still demand larger models. But for the orchestration layer — deciding which tools to call, in what order, with what arguments — a 350M model with strong tool-calling accuracy could serve as a lightweight local coordinator that dispatches heavier tasks to larger models only when necessary.&lt;/p&gt;
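
&lt;p&gt;A hedged sketch of what that coordinator pattern could look like is below; every identifier in it is hypothetical, illustrating the dispatch split rather than any shipped API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch of a hybrid dispatcher. Every identifier here is
// illustrative; only the Ollama /api/generate endpoint shape is real.
type Task = { kind: "plan" | "tool_call" | "codegen"; prompt: string };

async function callLocal(prompt: string): Promise&lt;string&gt; {
  // Small local model (e.g. a ~350M tool-caller) behind a local runtime;
  // the model tag is an assumed name.
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({ model: "lfm2.5-350m", prompt, stream: false }),
  });
  return (await res.json()).response;
}

async function callCloud(prompt: string): Promise&lt;string&gt; {
  // Placeholder: wire in your cloud provider's SDK for frontier reasoning.
  throw new Error("cloud dispatch not configured");
}

async function routeTask(task: Task): Promise&lt;string&gt; {
  // Orchestration and tool-calling stay local (fast, private, free at the
  // margin); code generation escalates to the cloud, per the split above.
  return task.kind === "codegen" ? callCloud(task.prompt) : callLocal(task.prompt);
}
&lt;/code&gt;&lt;/pre&gt;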

&lt;h2&gt;
  
  
  The runtime is ready: Ollama's Apple Silicon moment
&lt;/h2&gt;

&lt;p&gt;Models don't run in a vacuum. They need runtime infrastructure optimized for local hardware. Ollama's March 2026 update delivered exactly this.&lt;/p&gt;

&lt;p&gt;Ollama 0.19 is now built on Apple's MLX framework, directly leveraging the unified memory architecture of Apple Silicon. The performance gains are substantial: 1.6x faster prefill and roughly 2x faster decode speed compared to the previous version. On M5-series chips, Ollama taps the new GPU Neural Accelerators for further acceleration.&lt;/p&gt;

&lt;p&gt;But the most telling detail is what Ollama chose to optimize for. Their announcement explicitly names coding agents — Claude Code, OpenCode, Codex — as the primary beneficiaries. The new caching system reuses context across conversations, stores intelligent checkpoints within prompts, and implements smarter eviction policies where shared prefixes survive longer. These are features designed specifically for the pattern of agentic coding: long, iterative conversations where the model keeps returning to the same files and context.&lt;/p&gt;

&lt;p&gt;The infrastructure layer is no longer an afterthought. The local runtime is being purpose-built for coding agents — a clear signal that edge AI is maturing from experiment to product category.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trust and control gap: where edge AI earns its place
&lt;/h2&gt;

&lt;p&gt;There's a reason edge AI matters that goes deeper than latency and cost. Stanford's 2026 AI Index quantified a growing disconnect: only 10% of Americans say they're more excited than concerned about AI, compared to 56% of AI experts. On whether AI will help with jobs, 73% of experts say yes — only 23% of the public agrees.&lt;/p&gt;

&lt;p&gt;This trust gap creates real demand for alternatives. Frontier models like Anthropic's Mythos are expensive to serve, and AI power demand is now comparable to Switzerland's entire national electricity consumption. For developers and organizations who need AI capabilities but have concerns about data sovereignty, cost predictability, or availability, cloud-only isn't always the right answer.&lt;/p&gt;

&lt;p&gt;This is precisely the market that edge AI serves. It's not about choosing sides in a cloud-versus-local debate — it's about recognizing that different scenarios call for different deployment models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data-sensitive development: When working with proprietary codebases, regulated data, or pre-disclosure work, local inference means your code never leaves your machine.&lt;/li&gt;
&lt;li&gt;Cost-predictable workflows: For high-volume, routine tasks (linting, code completion, documentation), local models eliminate per-token costs entirely.&lt;/li&gt;
&lt;li&gt;Offline and low-latency scenarios: Air-gapped environments, travel, unreliable networks, or latency-sensitive workflows where round-trip times to a cloud API are unacceptable.&lt;/li&gt;
&lt;li&gt;Developer autonomy: The ability to fine-tune, customize, and control the model stack without vendor dependency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Edge AI isn't replacing cloud AI. It's filling gaps that cloud AI structurally cannot — and giving developers more options in how they architect their workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Predictions: how edge AI reshapes the coding agent stack
&lt;/h2&gt;

&lt;p&gt;Here's where the evidence points. These aren't about cloud AI disappearing — they're about a new layer emerging alongside it, with its own strengths and use cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  Hybrid agent architectures become the default
&lt;/h4&gt;

&lt;p&gt;Coding agents will increasingly run a lightweight local model for orchestration, tool-calling, and context management, while dispatching complex reasoning tasks to cloud models when the task demands it. LiquidAI's 350M model demonstrates that tool-calling reliability doesn't require billions of parameters. Gemma 4's E4B shows that meaningful code understanding fits in a phone-sized footprint. The architecture will be hybrid by design — local for speed, cost, and privacy; cloud for frontier reasoning when needed.&lt;/p&gt;

&lt;h4&gt;
  
  
  The always-on local coding daemon emerges
&lt;/h4&gt;

&lt;p&gt;Today, you start a coding agent session that connects to a remote API. Within two years, your IDE will also ship with a background process — a daemon — running a compressed model locally, always warm, always available. Ollama's caching improvements (cross-conversation reuse, intelligent checkpoints) are the early infrastructure for exactly this pattern. This local daemon handles the fast, repetitive work — completions, refactors, linting suggestions — while cloud agents remain available for deep reasoning and complex multi-file tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Edge creates a new cost tier for AI-assisted development
&lt;/h4&gt;

&lt;p&gt;As compression techniques like TurboQuant and Bonsai-style 1-bit quantization make local inference effectively free, a new pricing tier emerges: tasks that can run locally cost nothing at the margin. This doesn't eliminate cloud AI's value proposition — frontier reasoning, large-context synthesis, and model-as-a-service convenience remain worth paying for. But for the high-volume, routine tokens that make up the majority of a developer's daily AI usage, local inference is a compelling alternative. The strategic differentiation shifts toward which tasks each deployment model handles best.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data sovereignty becomes a first-class developer concern
&lt;/h4&gt;

&lt;p&gt;When a coding agent can run locally, proprietary code never has to leave your machine. For enterprises in regulated industries — finance, healthcare, defense — this unlocks AI-assisted development in contexts where cloud APIs were never an option. For individual developers, it means control over what data is shared and with whom. "Runs locally" will become a first-class feature in coding tool evaluations, not just a nice-to-have.&lt;/p&gt;

&lt;h4&gt;
  
  
  Edge-optimized coding models become a distinct category
&lt;/h4&gt;

&lt;p&gt;Reflection AI is raising $2.5 billion specifically to build deployable models for automated software development. Gemma 4 already ships with native function-calling and code generation in edge-friendly form factors. The trajectory is clear: by mid-2027, models fine-tuned specifically for coding, compressed to run on consumer hardware, and wrapped in polished local agent harnesses will be a recognized product category — not replacements for cloud coding agents, but purpose-built alternatives optimized for the scenarios where local deployment wins.&lt;/p&gt;

&lt;h4&gt;
  
  
  Some developers go fully edge — and never look back
&lt;/h4&gt;

&lt;p&gt;The preceding predictions frame edge AI as a complement to the cloud. But for a meaningful segment of developers, edge won't just be one layer — it will be the entire stack. The evidence already supports it: Bonsai 8B delivers competitive code understanding in 1.15 GB. Gemma 4's E4B provides native function-calling, 128K context, and structured output — everything a coding agent needs to operate autonomously. LiquidAI's 350M model handles tool orchestration on a Raspberry Pi. Ollama's runtime is purpose-built for agentic coding patterns. Stack these together and every component of a self-contained coding agent — orchestration, code comprehension, tool use, and runtime — runs on consumer hardware today.&lt;/p&gt;

&lt;p&gt;Two populations will drive this shift. First, cost-constrained developers — indie builders, students, and developers in emerging markets where per-token API costs are a real barrier. When local inference is free at the margin, the calculus isn't "is edge good enough?" — it's "why am I paying for something I can run myself?" Second, developers in regulated and air-gapped environments — defense contractors, healthcare organizations bound by HIPAA, government agencies with no external network access. For them, "cloud is not an option" isn't a preference; it's a hard constraint. Full-edge AI doesn't just complement their workflow — it's the only way AI enters their workflow at all.&lt;/p&gt;

&lt;p&gt;The natural objection is that models alone aren't enough — that cloud coding agents like Claude Code and Codex derive their real advantage from the harness, not just the model. The harness is the orchestration layer: the tool-calling logic, the context management, the workflow patterns that turn a raw model into a useful coding partner. And today, that's a real advantage. But it's an eroding one. The developer community is rapidly learning how to build its own harnesses — open-source agent frameworks, custom tool integrations, workflow automation that codifies exactly how a specific engineer or team works. Every month, more developers ship their own agentic workflows tailored to their stack, their codebase, their preferences. Simultaneously, the models themselves are getting better at leveraging these harnesses. A more capable local model doesn't just generate better code — it follows tool-calling conventions more reliably, handles multi-step workflows with less hand-holding, and recovers from errors more gracefully. The harness becomes easier to build and more effective to run as model quality improves. The result: the gap between a polished cloud agent's harness and what a motivated developer can assemble locally is closing from both directions.&lt;/p&gt;

&lt;p&gt;This won't be the default path for most developers. Cloud agents will retain clear advantages in frontier reasoning, massive-context synthesis, and zero-setup convenience. But the floor of "good enough for a full day's work" is rising fast. By late 2027, a developer with a modern laptop and no internet connection will be able to run a coding agent that handles completions, refactors, test generation, documentation, and tool-calling — all locally, all free. For the populations where cost or compliance makes cloud untenable, that's not a consolation prize. It's a better fit.&lt;/p&gt;

&lt;p&gt;Convergence picture: Compression (TurboQuant, Bonsai 1-bit) + Edge-Ready Models (Gemma 4, Reflection AI) + RL for Small Models (LiquidAI LFM 2.5) + Local Runtimes (Ollama + MLX) -&amp;gt; Edge AI layer in coding agents -&amp;gt; hybrid architecture (edge for speed &amp;amp; privacy, cloud for frontier reasoning), always-on local daemon alongside cloud agents, data sovereignty as a first-class feature, full-edge developers in cost-constrained and regulated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  A new layer in the stack
&lt;/h2&gt;

&lt;p&gt;Every layer of the stack is converging on the same conclusion: edge AI is ready to be a real part of the development workflow. Compression researchers are proving you can shrink models 14x without meaningful quality loss. Google is releasing edge-first models under Apache 2.0. A startup valued at $25 billion is building locally-deployable coding AI. Apple's own ML framework is being wired directly into local agent runtimes. A 350-million-parameter model can reliably call tools. And developers are increasingly asking for options beyond cloud-only.&lt;/p&gt;

&lt;p&gt;These aren't independent trends. They're the emergence of a new market category: edge AI for software development.&lt;/p&gt;

&lt;p&gt;For software engineers, the implication is practical. The coding agent stack you use a year from now will likely include both cloud and local models, each handling the tasks they're best suited for. Cloud APIs aren't going anywhere — frontier reasoning, massive-context synthesis, and the convenience of hosted inference remain valuable. But alongside them, a local layer will handle the fast, private, cost-free work that makes up the bulk of daily AI-assisted development.&lt;/p&gt;

&lt;p&gt;The developers who thrive will be the ones who understand both layers — when to reach for cloud reasoning power and when local inference is the smarter choice. Edge AI isn't the end of the cloud era. It's the beginning of a more nuanced one.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] "Google says new TurboQuant compression can lower AI memory usage without sacrificing quality." Ars Technica, March 2026. &lt;a href="https://arstechnica.com/ai/2026/03/google-says-new-turboquant-compression-can-lower-ai-memory-usage-without-sacrificing-quality" rel="noopener noreferrer"&gt;https://arstechnica.com/ai/2026/03/google-says-new-turboquant-compression-can-lower-ai-memory-usage-without-sacrificing-quality&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] "Caltech Researchers Claim Radical Compression of High Fidelity AI Models." Wall Street Journal, 2026. &lt;a href="https://www.wsj.com/cio-journal/caltech-researchers-claim-radical-compression-of-high-fidelity-ai-models-e66f31c9" rel="noopener noreferrer"&gt;https://www.wsj.com/cio-journal/caltech-researchers-claim-radical-compression-of-high-fidelity-ai-models-e66f31c9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] "Bonsai 8B: 1-bit models for mobile." PrismML, 2026. &lt;a href="https://prismml.com/news/bonsai-8b" rel="noopener noreferrer"&gt;https://prismml.com/news/bonsai-8b&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Dave Friedman. "Closed Source vs Open Source AI: A Shrinking Moat." Substack, 2026. &lt;a href="https://davefriedman.substack.com/p/closed-source-vs-open-source-ai-a" rel="noopener noreferrer"&gt;https://davefriedman.substack.com/p/closed-source-vs-open-source-ai-a&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] "Gemma 4: Our most capable open models to date." Google Blog, April 2, 2026. &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] "Nvidia-backed Reflection AI eyes $25 billion valuation." Reuters, March 2026. &lt;a href="https://www.reuters.com/business/nvidia-backed-reflection-ai-eyes-25-billion-valuation-wsj-reports-2026-03-26/" rel="noopener noreferrer"&gt;https://www.reuters.com/business/nvidia-backed-reflection-ai-eyes-25-billion-valuation-wsj-reports-2026-03-26/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[7] "LFM2.5-350M: No Size Left Behind." Liquid AI Blog, 2026. &lt;a href="https://www.liquid.ai/blog/lfm2-5-350m-no-size-left-behind" rel="noopener noreferrer"&gt;https://www.liquid.ai/blog/lfm2-5-350m-no-size-left-behind&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[8] "Ollama is now powered by MLX on Apple Silicon." Ollama Blog, March 30, 2026. &lt;a href="https://ollama.com/blog/mlx" rel="noopener noreferrer"&gt;https://ollama.com/blog/mlx&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[9] "2026 AI Index Report." Stanford HAI, 2026. &lt;a href="https://hai.stanford.edu/ai-index/2026-ai-index-report" rel="noopener noreferrer"&gt;https://hai.stanford.edu/ai-index/2026-ai-index-report&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Claude Opus 4.7: Anthropic's Agentic Reliability Release, Explained</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:52:27 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/claude-opus-47-anthropics-agentic-reliability-release-explained-1ckd</link>
      <guid>https://dev.to/mixture-of-experts/claude-opus-47-anthropics-agentic-reliability-release-explained-1ckd</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Opus 4.7 posts the strongest coding numbers of any generally-available frontier model: 87.6% on SWE-Bench Verified (up from 80.8% on Opus 4.6) and 64.3% on SWE-Bench Pro (up from 53.4%). On CursorBench it hits 70% versus Opus 4.6's 58%. The benchmark jump is real, but it's not the most interesting change.&lt;/li&gt;
&lt;li&gt;The release is about agent reliability, not just capability. Anthropic's own framing emphasizes that Opus 4.7 achieves the highest quality-per-tool-call ratio they've measured, with markedly lower rates of looping and better recovery from mid-run tool failures. For engineers running long autonomous jobs, that matters more than a benchmark delta.&lt;/li&gt;
&lt;li&gt;Two new surfaces to learn: xhigh effort level and Task Budgets (public beta). xhigh sits between high and max and is the new default in Claude Code. Task Budgets let you cap token spend across a multi-step run so the model prioritizes work instead of burning compute on the first sub-task.&lt;/li&gt;
&lt;li&gt;/ultrareview is a dedicated code-review session — a separate run that re-reads the diff with a reviewer's mindset and flags bugs and design issues. Pro and Max users get three free ultrareviews to try it.&lt;/li&gt;
&lt;li&gt;Drop-in migration: same API shape, same $5 / $25 per million tokens as Opus 4.6. The model ID is claude-opus-4-7, available on the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Prompts from 4.6 generally work, though the stricter instruction-following may require some retuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic released Claude Opus 4.7 today. On paper it's an incremental point release in the Claude 4.x line, priced identically to Opus 4.6 and exposed through the same API surface. But reading through the release notes, the third-party benchmark coverage, and the partner reports, a different story emerges: this isn't a benchmark release with a reliability footnote. It's a reliability release with a benchmark footnote.&lt;/p&gt;

&lt;p&gt;For software engineers shipping production AI features — especially anyone running coding agents, code review pipelines, or multi-step autonomous workflows — the changes in Opus 4.7 map directly onto the failure modes that actually waste engineering time. Looping agents. Silent error recovery that wasn't. Ballooning token spend on a six-hour run. This post walks through what's new, what the numbers actually say, what early partners are reporting, and where Opus 4.7 should and shouldn't land in your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark picture
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 leads the publicly-available frontier field on most coding benchmarks, but the delta is uneven across workloads. Here's the cleanest view of the numbers Anthropic and third parties have reported so far:&lt;/p&gt;

&lt;p&gt;Benchmarks (Opus 4.7 -&amp;gt; Opus 4.6 -&amp;gt; Notable peer):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncg4mgo0xaudkzk4mf9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncg4mgo0xaudkzk4mf9y.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two numbers deserve particular attention. On SWE-Bench Pro — the harder, larger, multi-repo variant that tracks real production-style issues — Opus 4.7 moves from 53.4% to 64.3%, an ~11-point jump. The visual acuity benchmark moves from 54.5% to 98.5%, which is the quantitative shadow of Anthropic's other vision claim: Opus 4.7 accepts images up to 2,576 pixels on the long edge, roughly 3x the resolution Opus 4.6 could ingest. Engineers generating UI mockups, reading dense dashboards, or inspecting failing screenshots should feel this immediately.&lt;/p&gt;

&lt;p&gt;One weakness worth flagging: Opus 4.7 trails GPT-5.4 meaningfully on BrowseComp (79.3% vs 89.3%). If your agent's bottleneck is navigating the open web — research agents, browser-based RPA, deep-research workflows — Claude is not the clear winner here.&lt;/p&gt;

&lt;p&gt;Anthropic also ran third-party evaluations with partners, and those are the numbers most aligned with real production work. On Rakuten-SWE-Bench (an internal benchmark constructed from actual Rakuten production tasks), Opus 4.7 resolves 3x more tasks than Opus 4.6, with double-digit improvements in code quality and test quality scores. Databricks reports 21% fewer errors on OfficeQA Pro, their document-reasoning benchmark, when the model is working from source documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually changed in how the model works
&lt;/h2&gt;

&lt;p&gt;The benchmark gains matter, but the new control surfaces and behavioral changes are where Opus 4.7 will show up in daily engineering work.&lt;/p&gt;

&lt;h3&gt;
  
  
  xhigh: a new reasoning effort level
&lt;/h3&gt;

&lt;p&gt;Claude's effort parameter already exposed minimal, low, medium, high, and max. Opus 4.7 inserts a new xhigh level between high and max. The practical point is that max is expensive and often latency-prohibitive for interactive work, while high sometimes under-reasons on hard tasks. xhigh gives you a middle rung. Anthropic has raised the Claude Code default to xhigh across all plans, which means existing Claude Code users will feel slightly slower, slightly smarter behavior by default starting today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adaptive extended thinking
&lt;/h3&gt;

&lt;p&gt;In Opus 4.6, extended thinking was effectively all-or-nothing — enabling it meant the model invested reasoning effort even on trivial queries, paying for itself sometimes and burning tokens for nothing other times. Opus 4.7 makes extended thinking context-aware. With it enabled, the model decides per-query how much depth a problem warrants: simple questions return quickly, complex ones get proportionally more reasoning. The practical effect is that you can leave extended thinking on without paying a flat latency tax on every request — meaningful for production deployments where request difficulty varies widely across the workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task Budgets (public beta)
&lt;/h3&gt;

&lt;p&gt;Task Budgets let you hand the model a token budget for a multi-step task so it can prioritize work across sub-tasks rather than burning through its budget on step one. This is a meaningful primitive for anyone running long agent jobs in production — the classic failure mode where an agent exhausts context on exploration and then has nothing left for execution now has a native knob.&lt;/p&gt;
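
&lt;p&gt;Anthropic's docs are the authority on the exact request shape; as a hedged sketch, assuming the new level rides on the existing effort parameter and guessing a task_budget field name for the beta:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Sketch only: "effort" follows the levels described above, and
// "task_budget" is a guessed field name for the public beta, not a
// confirmed API surface. Check Anthropic's docs for the real shape.
const msg = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 16000,
  effort: "xhigh",                    // the new rung between high and max
  task_budget: { tokens: 200_000 },   // cap spend across the whole run
  messages: [{ role: "user", content: "Migrate the payments module to the v3 SDK." }],
} as any);                            // beta fields may not be in SDK types yet
&lt;/code&gt;&lt;/pre&gt;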

&lt;h3&gt;
  
  
  /ultrareview
&lt;/h3&gt;

&lt;p&gt;The /ultrareview slash command kicks off a dedicated review session that re-reads a diff and surfaces bugs and design issues that a careful human reviewer would catch. Unlike asking Claude to review its own work inline, this runs as a separate session with a reviewer's prompt posture. Pro and Max users get three free ultrareviews to try; beyond that it's metered as normal usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic reliability: the less-flashy changes
&lt;/h3&gt;

&lt;p&gt;The behavioral deltas are the ones that don't fit cleanly on a benchmark chart. Anthropic reports that Opus 4.7 loops on roughly 1 in 18 of the queries that would have looped on prior Opus versions, keeps executing through tool failures that used to halt Opus 4.6, and devises its own verification steps before reporting a task complete. The concrete example Anthropic published — having the model build a Rust text-to-speech engine from scratch (neural model, SIMD kernels, browser demo) and then feed its own output through a speech recognizer to check that it matched the Python reference — is the clearest expression of what "verifies its own outputs" means in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  More conservative tool use
&lt;/h3&gt;

&lt;p&gt;Opus 4.7 is noticeably more reluctant to call tools autonomously than Opus 4.6. It defaults to answering from its training knowledge unless you point it at a source. This is the right default for production agents — fewer surprise tool calls, lower variance in cost and latency — but it changes how you should prompt. If you want the model to search the web, query a connector, or read from a specific knowledge source, name the source explicitly in the prompt ("search the web for X," "check the Slack channel," "read the file at this path"). Inspect the thinking trace afterward to verify which sources actually got used.&lt;/p&gt;
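
&lt;p&gt;A before-and-after makes the shift concrete; both prompts below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Implicit (worked on 4.6; may answer from stale training data on 4.7):
"What's the latest released version of the Playwright CLI?"

// Explicit (reliably triggers the tool on 4.7):
"Search the web for the latest released version of the Playwright CLI
and cite the page you used."
&lt;/code&gt;&lt;/pre&gt;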

&lt;p&gt;Workflow flow: Task arrives -&amp;gt; xhigh reasoning (new default in Claude Code) -&amp;gt; Long-running multi-step work -&amp;gt; if token spend gets large, Task Budgets redistribute effort, otherwise continue -&amp;gt; if a tool call fails, graceful recovery keeps going, otherwise continue -&amp;gt; Self-verify output before reporting done -&amp;gt; /ultrareview (separate review session) -&amp;gt; Ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What early partners are reporting
&lt;/h2&gt;

&lt;p&gt;Early-access partners' reports give a more grounded picture than benchmarks alone. The useful signal across their summaries is remarkably consistent: the improvements they highlight are about reliability under autonomy, not raw capability ceilings.&lt;/p&gt;

&lt;p&gt;Rakuten's engineering leadership has emphasized that the uplift on their internal benchmark translated into real movement in the quality metrics their teams care about — not just pass/fail on tasks, but code quality and test quality rising together. Databricks' framing of the OfficeQA Pro gain is practical: their users work against source documents, and a 21% drop in errors means fewer hallucinated citations and fewer manual re-runs.&lt;/p&gt;

&lt;p&gt;Three other partner reports from the enterprise early-access group paint a consistent picture. A financial technology platform observed the model catching its own logical errors during the planning phase rather than during execution — a behavioral shift that matters because plan-time errors are orders of magnitude cheaper than execution-time errors. A code review platform saw a greater-than-10% improvement in bug detection recall while holding precision steady, which is a harder combination to get than either metric alone. An autonomous workflow company reported ~14% gains in task success alongside one-third the tool errors, while using fewer tokens — a rare case where quality and efficiency moved in the same direction.&lt;/p&gt;

&lt;p&gt;The common thread: the behaviors that get better are the ones that make the difference between "impressive demo" and "safe to leave running overnight."&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually use this as a software engineer
&lt;/h2&gt;

&lt;p&gt;If you're building with Claude today, here's the practical playbook.&lt;/p&gt;

&lt;p&gt;Migrate opportunistically, not urgently. The API is drop-in compatible, pricing is unchanged, and the model ID is claude-opus-4-7. Run a shadow evaluation on your existing agent traces before flipping production traffic — not because migration is risky, but because the stricter instruction-following can expose prompts that were implicitly relying on Opus 4.6's looser interpretation. Concretely: Opus 4.7 takes directives more literally, so repeated emphasis ("be brief, really brief, don't ramble") and defensive padding ("skip the obvious parts") now execute more precisely than you may have intended. Prefer a single clear instruction over layered emphasis, and audit project- or system-level prompts that grew through accretion.&lt;/p&gt;
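
&lt;p&gt;A minimal shadow-eval loop could look like the sketch below; the trace file, its shape, and the claude-opus-4-6 model ID are placeholders for your own assets:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { readFile } from "node:fs/promises";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Shadow eval sketch: replay stored prompts against both versions and store
// the outputs side by side. "traces.json", its shape, and the
// "claude-opus-4-6" ID are placeholders for your own assets.
const traces: { prompt: string }[] = JSON.parse(await readFile("traces.json", "utf8"));

for (const trace of traces) {
  for (const model of ["claude-opus-4-6", "claude-opus-4-7"]) {
    const msg = await client.messages.create({
      model,
      max_tokens: 4096,
      messages: [{ role: "user", content: trace.prompt }],
    });
    // Diff and score offline with your own eval harness before flipping traffic.
    console.log(JSON.stringify({ model, prompt: trace.prompt, output: msg.content }));
  }
}
&lt;/code&gt;&lt;/pre&gt;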

&lt;p&gt;Default to xhigh, not max. For interactive coding work, xhigh is the sweet spot that the Claude Code team has already chosen as their new default. Save max for tasks you know need it and can afford to wait on.&lt;/p&gt;

&lt;p&gt;Reach for Task Budgets on anything multi-step. If you're orchestrating agents that run for more than a few minutes — research, refactors, migration scripts, data pipeline debugging — Task Budgets are the right primitive to prevent the classic "spent 80% of tokens exploring, 20% executing" failure. Start conservative; the knob rewards iteration.&lt;/p&gt;

&lt;p&gt;Put /ultrareview in your PR flow, but not as a rubber stamp. The most useful place for /ultrareview is between "Claude implemented it" and "human merges it" — a separate review session that catches the class of bugs a tired reviewer misses. It is not a replacement for a human reviewer on anything with security, compliance, or customer-data implications.&lt;/p&gt;

&lt;p&gt;Don't reach for Opus 4.7 for open-web research agents. The BrowseComp gap to GPT-5.4 is real and meaningful. If your agent's job is navigating the open web, run an A/B on both models before committing.&lt;/p&gt;

&lt;p&gt;Be explicit about which sources you want the model to consult. Because Opus 4.7 leans toward answering from its own knowledge before reaching for tools, prompts that worked on 4.6 by implicitly assuming "Claude will obviously search the web for this" can return stale or training-cutoff answers on 4.7. Name the source in the prompt: search the web for…, query the connector…, read this file at…. This is also a quiet quality-of-life win — your traces become easier to audit when tool selection is in the prompt instead of the model's discretion.&lt;/p&gt;

&lt;p&gt;Watch the vision path — and skip the pre-processing. If your stack uses Claude to look at mockups, screenshots, PDFs of dashboards, or generated UIs, the 3x resolution jump and the visual acuity benchmark jump (54.5% -&amp;gt; 98.5%) are the changes most likely to show up as noticeably better outputs without any prompt changes. The corollary: pipelines that pre-cropped, downsampled, or upscaled images to work around 4.6's resolution limits should be retired. Send the original — Opus 4.7 reads small axis labels, dense table cells, and footnotes natively.&lt;/p&gt;
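
&lt;p&gt;In API terms, "send the original" is just the standard image content block with no resizing step in front of it. A minimal sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { readFile } from "node:fs/promises";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Send the original screenshot: no pre-crop, no downsample. Opus 4.7 accepts
// up to 2,576 px on the long edge, so resolution workarounds can be retired.
const png = await readFile("dashboard.png");

const msg = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 2048,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        source: {
          type: "base64",
          media_type: "image/png",
          data: png.toString("base64"),
        },
      },
      { type: "text", text: "Which axis labels disagree with the table below the chart?" },
    ],
  }],
});
&lt;/code&gt;&lt;/pre&gt;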

&lt;p&gt;Read the system card. Anthropic published the Opus 4.7 system card alongside the release. Notable: low rates of deception and sycophancy, and stronger resistance to prompt injection than Opus 4.6, but modestly weaker on overly-detailed harm-reduction advice on controlled substances. If your deployment has safety-sensitive surfaces, read it before you ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Opus 4.7 signals about the direction
&lt;/h2&gt;

&lt;p&gt;A useful frame for thinking about this release: Anthropic is optimizing harder for autonomy reliability than for peak capability. The agentic-search gap to GPT-5.4 is notable because it's the one place where Anthropic clearly chose not to catch up in this release. The numbers they did move — quality-per-tool-call, loop resistance, mid-run error recovery, self-verification — are the ones that determine whether an agent is shippable, not just demonstrable.&lt;/p&gt;

&lt;p&gt;For software engineers, that's a meaningful product posture. The next year of AI engineering work is going to be dominated by "can I actually trust this thing to run without me watching?" The features in Opus 4.7 — xhigh as a cheaper path to deep reasoning, Task Budgets as a primitive for long runs, /ultrareview as a separate-session review gate, and the underlying reliability behaviors — are all calibrated to that question. Worth adopting, worth instrumenting, worth testing before you trust it on anything that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Anthropic, "Introducing Claude Opus 4.7." April 16, 2026. &lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/claude-opus-4-7&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] OfficeChai, "Anthropic Releases Claude Opus 4.7, Beats GPT-5.4, Gemini 3.1 Pro On Most Benchmarks." April 16, 2026. &lt;a href="https://officechai.com/ai/ckaude-opus-4-7-benchmarks/" rel="noopener noreferrer"&gt;https://officechai.com/ai/ckaude-opus-4-7-benchmarks/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Anthropic, "Claude Opus 4.7 — product page." April 16, 2026. &lt;a href="https://www.anthropic.com/claude/opus" rel="noopener noreferrer"&gt;https://www.anthropic.com/claude/opus&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Anthropic, "Working with Claude Opus 4.7." April 16, 2026. &lt;a href="https://claude.com/resources/tutorials/working-with-claude-opus-4-7" rel="noopener noreferrer"&gt;https://claude.com/resources/tutorials/working-with-claude-opus-4-7&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>claude</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Open Claude Design: A Weekend Harness Built on Atomic</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:44:02 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/open-claude-design-a-weekend-harness-built-on-atomic-2k22</link>
      <guid>https://dev.to/mixture-of-experts/open-claude-design-a-weekend-harness-built-on-atomic-2k22</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/FtnbwW95pgE"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Anthropic released Claude Design (&lt;a href="https://www.anthropic.com/news/claude-design-anthropic-labs" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/claude-design-anthropic-labs&lt;/a&gt;) on April 17, 2026 — a conversational tool for producing prototypes, slides, and marketing collateral, with a design-system import step, a refinement loop, and a Claude Code handoff bundle at the end.&lt;/p&gt;

&lt;p&gt;Three days later we shipped open-claude-design (&lt;a href="https://github.com/flora131/atomic/tree/main/src/sdk/workflows/builtin/open-claude-design" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic/tree/main/src/sdk/workflows/builtin/open-claude-design&lt;/a&gt;): an open-source replica implemented as a built-in Atomic workflow. Five deterministic phases, the same pipeline ported across three different coding agents (Claude Agent SDK, Copilot CLI, opencode) — roughly 500 lines of TypeScript orchestration per provider. The full source lives at src/sdk/workflows/builtin/open-claude-design.&lt;/p&gt;

&lt;p&gt;We didn't rebuild Claude Code to do this. We built a thin harness around it.&lt;/p&gt;

&lt;p&gt;That distinction is the point of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;

&lt;p&gt;Claude Design's UX is a conversation, but underneath it's a pipeline. We reverse-engineered the phases from the announcement and from the partner quotes ("20+ prompts to 2 prompts" is a tell — there's a deterministic skeleton under the chat).&lt;/p&gt;

&lt;p&gt;Pipeline flow:&lt;br&gt;
Phase 1: Design System Onboarding (parallel headless fan-out + HIL approval) -&amp;gt; Phase 2: Import (URL / file / codebase capture, headless) -&amp;gt; Phase 3: Generation (first design version, visible) -&amp;gt; Phase 4: Refinement Loop (≤5 iterations, HIL + parallel critique). The loop either iterates back on itself or, on approved / "ship it", moves to Phase 5: Export + Handoff (Claude Code / Copilot CLI / opencode).&lt;/p&gt;

&lt;p&gt;Headless stages run on Sonnet with bypassPermissions for cost and speed — but only in the Claude provider, where the Agent SDK lets us pin a per-stage model. The Copilot CLI and opencode providers don't expose that knob, so their headless stages inherit whatever orchestrator model the user invoked the workflow with. Visible stages inherit the orchestrator model (Opus) across all three providers and surface to the user. The refinement loop is a bounded human-in-the-loop cycle with early exit on completion signal phrases ("approved", "ship it", "done").&lt;/p&gt;

&lt;p&gt;Inside Phase 4, the refinement quality comes from pairing two tools: the impeccable skill drives the creative pass (taste, hierarchy, distinctive aesthetics over generic AI defaults), while the Playwright CLI captures screenshots of the rendered output so a critique sub-agent can inspect what actually shipped, not what the model thinks shipped. Visual grounding + structured critique closes the loop that a text-only refinement would leave open — the agent sees its own mistakes instead of hallucinating past them.&lt;/p&gt;

&lt;p&gt;The full topology — including the three parallel codebase-analysis sub-agents in Phase 1 and the parallel critique + screenshot validation in Phase 4 — is laid out in the workflow source.&lt;/p&gt;
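
&lt;p&gt;For flavor, here is a condensed sketch of that Phase 4 loop, compressed from the real source; treat the prompt builders and stage names as approximations:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Condensed sketch of the Phase 4 loop (not the verbatim source): bounded
// iterations, screenshot-grounded critique, early exit on approval signals.
const SIGNALS = ["approved", "ship it", "done"];

for (let i = 0; i &lt; 5; i++) {
  // Ground the critique in reality: shell out to the Playwright CLI so the
  // sub-agent inspects a screenshot of what actually rendered.
  await ctx.stage({ name: `capture-${i}`, headless: true }, {}, {}, async (s) =&gt;
    s.session.query(buildScreenshotPrompt(), { ...HEADLESS_OPTS }),
  );

  // Visible refinement pass: impeccable-skill creative edit + user feedback.
  const feedback = await ctx.stage({ name: `refine-${i}` }, {}, {}, async (s) =&gt;
    s.session.query(buildRefinementPrompt({ iteration: i })),
  );

  // Early exit when the user signals completion.
  if (SIGNALS.some((sig) =&gt; feedback.result.toLowerCase().includes(sig))) break;
}
&lt;/code&gt;&lt;/pre&gt;
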
&lt;h2&gt;
  
  
  The workflow SDK is the whole trick
&lt;/h2&gt;

&lt;p&gt;Here's a trimmed version of the Claude provider for Phase 1 — the parallel fan-out followed by a human-in-the-loop approval stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Layer 1: three headless agents analyze the codebase in parallel&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ds-locator&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nf"&gt;buildDesignLocatorPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;root&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codebase-locator&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;HEADLESS_OPTS&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ds-analyzer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nf"&gt;buildDesignAnalyzerPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;root&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codebase-analyzer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;HEADLESS_OPTS&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ds-patterns&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nf"&gt;buildDesignPatternPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;root&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codebase-pattern-finder&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;HEADLESS_OPTS&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Layer 2: visible agent reviews the findings with the user&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;design-system-builder&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;buildDesignSystemBuilderPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;locatorOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;analyzerOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;patternsOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to notice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ctx.stage is just a function around a session. The orchestration is plain TypeScript — Promise.all, for loops, early break on signal phrases. No DSL. No YAML. No graph declaration.&lt;/li&gt;
&lt;li&gt;s.session.query calls the coding agent's native harness. We're not reimplementing Claude Code's tool loop, its permission model, or its subagent dispatch — we're calling into them. agent: "codebase-locator" points at an existing Atomic subagent; HEADLESS_OPTS sets bypassPermissions and forces Sonnet.&lt;/li&gt;
&lt;li&gt;The orchestrator picks the minimum toolset for each stage. Headless analyzers get bypassPermissions. Visible stages inherit Opus. The refinement loop gets AskUserQuestion. Each stage sees only what it needs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The headless model is also a knob, not a fixed choice. The HEADLESS_OPTS constant pins the sub-agents to Sonnet by default because the analysis stages are well-scoped and cost-sensitive, but you can swap it to Opus for harder codebases, or drop the model field entirely to inherit whatever the orchestrator is running. One line, repo-wide — pick your point on the cost/performance curve.&lt;/p&gt;
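
&lt;p&gt;For reference, the shape of that constant is roughly the following; field names follow the Claude Agent SDK's options, and the Sonnet tag is an assumed name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Approximate shape of HEADLESS_OPTS; the real constant lives in the
// workflow source. Field names follow the Claude Agent SDK's options, and
// the Sonnet model tag is an assumed name.
const HEADLESS_OPTS = {
  model: "claude-sonnet-4-6",           // swap to Opus, or delete to inherit
  permissionMode: "bypassPermissions",  // headless stages skip HIL gates
} as const;
&lt;/code&gt;&lt;/pre&gt;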

&lt;p&gt;Prompts are the other knob, and usually the more important one. Each stage's instructions are a plain TypeScript function — buildDesignLocatorPrompt, buildDesignAnalyzerPrompt, the refinement critique prompt — so tailoring outputs to your stack means editing a string, not reconfiguring the pipeline. Want the analyzer to look specifically for shadcn tokens, or the generator to prefer Tailwind over inline styles, or the critique to hammer on accessibility over aesthetics? Edit the prompt. Swapping models gets you capacity; adjusting the instructions is what dials in taste, framework conventions, and the specific shape of output you want for your project. The two knobs are complementary — you'll almost always reach for the prompt first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow-creator skill got us 90% of the way there
&lt;/h2&gt;

&lt;p&gt;The non-obvious part was the pipeline shape, not the code. Once we knew what phases we wanted, the workflow-creator skill scaffolds the defineWorkflow().run().compile() structure, the ctx.stage calls, the WorkflowInput schema, and the provider split (Claude vs. Copilot vs. opencode).&lt;/p&gt;

&lt;p&gt;Our actual work was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phase 1 product analysis — watched the Claude Design demo, read the announcement, listed the phases.&lt;/li&gt;
&lt;li&gt;Scaffold via workflow-creator — described the five phases and the topology, got back a working provider skeleton.&lt;/li&gt;
&lt;li&gt;Tweak prompts and behavior — adjusted the stage prompts, model assignments, and early-exit conditions until the pipeline produced what we wanted.&lt;/li&gt;
&lt;li&gt;Test across the three agents — ran the same workflow under Claude, Copilot CLI, and opencode.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The research artifacts — the product analysis, the SDK mapping, the RFC — all live alongside the workflow source on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same pipeline, three coding agents
&lt;/h2&gt;

&lt;p&gt;Because the SDK's only abstraction over the agent is s.session.query(...), porting to a different coding agent is mechanical. The Copilot CLI provider is the same five phases; it just passes different stage options and deals with Copilot's SessionEvent[] message format on the way out:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;atomic workflow -n open-claude-design -a claude    --prompt "Landing page for a dev tool"
atomic workflow -n open-claude-design -a copilot   --prompt "Landing page for a dev tool"
atomic workflow -n open-claude-design -a opencode  --prompt "Landing page for a dev tool"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One workflow, three harnesses, identical CLI surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "thin harness" is the right frame
&lt;/h2&gt;

&lt;p&gt;The temptation when you want agent X to do task Y is to build a new agent. It's the wrong instinct. Coding agents are already harnesses — they have a tool loop, a permission model, subagents, skills, MCP. Rebuilding that is how you end up with a 50K-line framework that's worse than what you wrapped.&lt;/p&gt;

&lt;p&gt;A thin harness inverts the relationship:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't own the agent's inner loop. Claude Code keeps its tool-use cycle. Copilot CLI keeps its session machinery. opencode keeps its own runtime. Your code never reimplements any of them.&lt;/li&gt;
&lt;li&gt;You own the outer pipeline. Which stages run, in what order, under what model, with what permissions, with what early-exit conditions. This is the part that's actually workflow-specific.&lt;/li&gt;
&lt;li&gt;The abstraction is one function. s.session.query(prompt, opts). Everything above it — Promise.all, for, if — is TypeScript you already know.&lt;/li&gt;
&lt;li&gt;You pick the minimum toolset per stage. Headless analyzers don't get write permissions. Visible creative stages inherit Opus. Each stage sees what it needs and nothing more — the cheapest way to keep a long pipeline coherent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you give up. Claude Design's chat UX streams tokens straight into a rendered preview — it feels fast because the product is purpose-built around that loop. A CLI workflow with discrete phases and HIL gates won't match that feel, and shouldn't try. You're trading perceived latency for a pipeline you can read, fork, and re-point at any coding agent. If you want the streaming feel back, that's what the next paragraph is for — the workflow SDK doesn't care whether the frontend is a CLI, a web app, or a chat surface.&lt;/p&gt;

&lt;p&gt;Claude Design is a product. Open Claude Design is a recipe. The recipe runs on whatever coding agent you already trust, in your own repo, against your own design system, exported to whatever you want. You can read every line.&lt;/p&gt;

&lt;p&gt;And because the pipeline is just TypeScript, you can fork it, add a phase, swap a model, change the early-exit conditions, or bolt a vercel deploy step onto Phase 5. Or go further — build your own harness entirely, wrap it in whatever UX you want (a web app, a desktop shell, a chat surface, a VS Code extension), and let the workflow SDK be the thing underneath. The CLI is one frontend; nothing stops you from writing another. That's the part that matters. Not the workflow — the fact that building the next workflow, or the next harness around it, is a weekend.&lt;/p&gt;

&lt;p&gt;This is what coding at scale looks like from here on out: teams won't just use coding agents, they'll build thin harnesses like open-claude-design to orchestrate them across every dev workflow they run.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] "open-claude-design — workflow source." Atomic, GitHub. &lt;a href="https://github.com/flora131/atomic/tree/main/src/sdk/workflows/builtin/open-claude-design" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic/tree/main/src/sdk/workflows/builtin/open-claude-design&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Anthropic, "Claude Design — Anthropic Labs." April 17, 2026. &lt;a href="https://www.anthropic.com/news/claude-design-anthropic-labs" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/claude-design-anthropic-labs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] "Atomic — agent workflow toolkit." GitHub. &lt;a href="https://github.com/flora131/atomic" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] "Atomic workflow architecture." alexlavaee.me, 2026. &lt;a href="https://alexlavaee.me/blog/atomic-workflow" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/atomic-workflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] "Harness engineering: why coding agents need infrastructure." alexlavaee.me, 2026. &lt;a href="https://alexlavaee.me/blog/harness-engineering-why-coding-agents-need-infrastructure" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/harness-engineering-why-coding-agents-need-infrastructure&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>design</category>
      <category>programming</category>
    </item>
    <item>
      <title>Atomic's Workflow SDK: Deterministically Extending Coding Agents</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:35:43 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/atomics-workflow-sdk-deterministically-extending-coding-agents-29ph</link>
      <guid>https://dev.to/mixture-of-experts/atomics-workflow-sdk-deterministically-extending-coding-agents-29ph</guid>
      <description>&lt;p&gt;Coding agents are great at day-to-day work. What they still can't do reliably — and what keeps you babysitting every step — is finish a long-running, complex task while following your team's specific guardrails. After thousands of hours shipping with coding agents, I've landed on what actually helps me amplify what coding agents already do well into long-running, ambiguous tasks and am open-sourcing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Neither a coding agent nor a general framework closes this gap
&lt;/h2&gt;

&lt;p&gt;Coding agents alone can't do it. By design, they ship as strong harnesses built for day-to-day coding — and they're genuinely good at the hard parts of that job: context management, memory, tool orchestration, and sub-agent dispatch inside a session. What they can't reliably do is follow your specific guardrails on long-running, ambiguous, complex work. For example, there's no built-in way to migrate a 300-file React 17→19 upgrade in the dependency order your senior engineers mapped out, run your team's regression gate between each batch, pause for human review on the files you flagged as high-risk, and keep the branch green end-to-end.&lt;/p&gt;

&lt;p&gt;The second you reach for a general agent framework to get that structure, you're wrapping the coding-agent SDK inside their graph nodes. Thousands of lines of net-new code to rebuild a tool loop, permission model, sub-agent dispatcher, and context manager — all things your coding agent already has, except worse.&lt;/p&gt;

&lt;p&gt;Others skip the framework and build a custom harness around the raw model. Same problem at a different layer: you get structure, not your guardrails — the constraints, review bars, and team-specific requirements that actually determine whether the output is usable.&lt;/p&gt;

&lt;p&gt;None of these paths give you an easy way to put gutter guards around the coding agent. Workflows are those guards: guardrails that keep the agent on your team's path through long-running or ambiguous work, so you're not watching every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Inside a session, a coding agent can add a feature, fix a bug, refactor a module. It's fine.&lt;/p&gt;

&lt;p&gt;Outside a session is where the failures live:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call triage: an alert fires; the agent loses the trace context the moment the session resets.&lt;/li&gt;
&lt;li&gt;Complex refactors in large codebases: constraints drift by the third or fourth session.&lt;/li&gt;
&lt;li&gt;Team review standards: every engineer prompts the agent slightly differently, so every engineer gets slightly different results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of your time goes into babysitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Atomic does
&lt;/h2&gt;

&lt;p&gt;Atomic (&lt;a href="https://github.com/flora131/atomic" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic&lt;/a&gt;) is a TypeScript SDK that enhances a coding agent by wrapping configurable, deterministic structure around it. The agent's harness — tool-use, context management, sub-agents, permission model — stays intact and keeps doing what it's good at. Atomic adds the outer pipeline that encodes your specific guardrails, so the agent's execution actually follows them on long-running, ambiguous work.&lt;/p&gt;

&lt;p&gt;A workflow is plain TypeScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineWorkflow&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@bastani/atomic/workflows&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineWorkflow&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;review-and-fix&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;review&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Review the diff against our UX standards.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fix&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Address the findings in &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every ctx.stage is a real coding-agent session in its own tmux pane — Claude Code, Copilot CLI, or opencode, interchangeable with a single flag. Data flows between stages only through explicit transcript reads. Topology — parallel fan-out, serial dependencies — comes from await and Promise.all, not a graph DSL. .compile() freezes the graph, so the only variance between runs is the LLM's output.&lt;/p&gt;
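
&lt;p&gt;To make the topology point concrete, here's a minimal sketch of a parallel fan-out feeding a serial aggregate stage, reusing the ctx.stage shape from the snippet above. The stage names and prompts are invented for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Inside the same .run(async (ctx) =&amp;gt; { ... }) callback as above.
// Fan-out: two scoped review sessions run concurrently, each in its own pane.
const [a11y, spacing] = await Promise.all([
  ctx.stage({ name: "a11y-review" }, {}, {}, async (s) =&amp;gt; {
    await s.session.query("Review the diff for accessibility issues.");
    s.save(s.sessionId);
  }),
  ctx.stage({ name: "spacing-review" }, {}, {}, async (s) =&amp;gt; {
    await s.session.query("Review the diff for spacing-system violations.");
    s.save(s.sessionId);
  }),
]);

// Serial dependency: a plain await after the fan-out. Data crosses stages
// only through explicit transcript reads.
await ctx.stage({ name: "aggregate" }, {}, {}, async (s) =&amp;gt; {
  const reports = [await s.transcript(a11y), await s.transcript(spacing)];
  await s.session.query(
    `Merge the findings at ${reports.map((r) =&amp;gt; r.path).join(" and ")} into one report.`
  );
  s.save(s.sessionId);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;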

&lt;p&gt;That's the entire mental model. You're aligning the coding agent's execution with your team's explicit goals, so long-running and ambiguous work finally becomes tractable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;The non-obvious part of any of this is the shape of the pipeline — which stages, which run parallel, where the human gate goes. Describe your workflow in natural language to Atomic's workflow-creator skill and it hands you a working skeleton in minutes. That's our actual dev loop — not hand-writing topology.&lt;/p&gt;

&lt;p&gt;Example flow: PR opened -&amp;gt; Fan-out into parallel coding-agent sessions (accessibility, spacing, copy, reuse) -&amp;gt; Aggregate findings -&amp;gt; HIL (human-in-the-loop) approval gate -&amp;gt; approve unblocks merge, or changes requested kicks off a Ralph loop (plan -&amp;gt; implement -&amp;gt; review -&amp;gt; debug) that feeds back into Aggregate findings.&lt;/p&gt;

&lt;p&gt;Workflows teams have already built:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;UX review gate on every PR. On pull_request.opened, dispatch a fleet of coding-agent sessions specialized on your design system, each reviewing the diff along a different axis (accessibility, spacing, copy, reuse). Merge is blocked until a human approves.&lt;/li&gt;
&lt;li&gt;50-persona feedback gate pre-PR. Before a feature PR opens, dispatch 50 headless sessions in parallel — each primed with a distinct persona (the skeptical CFO, the power-user admin, the first-time mobile user, the accessibility-dependent reviewer). Feedback rolls into one report with tasks. A human picks what to implement; Atomic's built-in Ralph loop (planner → orchestrator → worker → reviewer → debugger) executes and raises back to the human.&lt;/li&gt;
&lt;li&gt;Support ticket → root cause → draft PR. A webhook drops tickets into the workflow. Agents research the codebase, write the root cause back onto the ticket, and attempt a fix in a sandboxed branch. A human gate reviews the diff and evidence; the PR only opens on approval.&lt;/li&gt;
&lt;li&gt;Production regression triage. A workflow listens to observability alerts, pulls the failing trace, deep-researches the codebase, and dispatches a session to localize the regression against recent commits. High-confidence fix? Draft PR with a repro. Low-confidence? On-call gets a ranked shortlist of suspects instead of a raw stack trace.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every one of these lives in your repo as a TypeScript file. You run it, diff it, fork it, code-review it. Sharing across the team is merging the file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandboxed by default
&lt;/h2&gt;

&lt;p&gt;Workflows run with the coding agent's permission checks disabled — which is how you get one-shot execution without constant approval prompts, and why you should never run them on your host. Atomic ships three devcontainer features on GHCR (Claude, Copilot, opencode) with Bun, the CLI, playwright-cli, and config templates pre-baked. "Try this workflow" is code . plus rebuild-and-reopen-in-container, not an hour of setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Systems, not prompts
&lt;/h2&gt;

&lt;p&gt;Move from prompting to systems thinking. Define the pipeline once. Run it the same way every time. Stop babysitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] "Atomic — open-source TypeScript SDK for coding-agent workflows." GitHub. &lt;a href="https://github.com/flora131/atomic" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] "Atomic example workflows." GitHub. &lt;a href="https://github.com/flora131/atomic/tree/main/.atomic/workflows" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic/tree/main/.atomic/workflows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] "Atomic SDK source." GitHub. &lt;a href="https://github.com/flora131/atomic/tree/main/src/sdk" rel="noopener noreferrer"&gt;https://github.com/flora131/atomic/tree/main/src/sdk&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] "Open Claude Design: a weekend harness built on Atomic." alexlavaee.me, 2026. &lt;a href="https://alexlavaee.me/blog/open-claude-design-atomic-harness" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/open-claude-design-atomic-harness&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] "Harness engineering: why coding agents need infrastructure." alexlavaee.me, 2026. &lt;a href="https://alexlavaee.me/blog/harness-engineering-why-coding-agents-need-infrastructure" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/harness-engineering-why-coding-agents-need-infrastructure&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] "Atomic: automated procedures and memory for AI coding agents." alexlavaee.me, 2025. &lt;a href="https://alexlavaee.me/blog/atomic-workflow" rel="noopener noreferrer"&gt;https://alexlavaee.me/blog/atomic-workflow&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>GPT-5.5: The Honest Take on OpenAI's Response to Opus 4.7</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:30:12 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/gpt-55-the-honest-take-on-openais-response-to-opus-47-3m58</link>
      <guid>https://dev.to/mixture-of-experts/gpt-55-the-honest-take-on-openais-response-to-opus-47-3m58</guid>
      <description>&lt;p&gt;OpenAI released GPT-5.5 today, exactly one week after Anthropic shipped Claude Opus 4.7. The timing is not subtle. Opus 4.7 took the SWE-Bench Verified crown at 87.6% and put Anthropic at the top of most third-party coding leaderboards; GPT-5.5 is the direct response. Worth flagging upfront: SWE-Bench Verified scores at this tier should be read with heavy skepticism. Every frontier lab has plausibly trained on or adjacent to this data, and Anthropic itself has acknowledged memorization signals on related SWE-Bench splits. Treat any Verified or Pro number in this post as a directional signal, not a trustworthy measurement — we include them because they are what the labs report, not because we think they carry much weight.&lt;/p&gt;

&lt;p&gt;The release is interesting for software engineers not because it "wins" — the verdict is more mixed than OpenAI's launch post suggests — but because of the specific benchmarks it wins on, the specific ones it doesn't, and the pricing decision that frames everything else. This post walks through what changed, how OpenAI built and served it, what the numbers actually say relative to Opus 4.7 and Gemini 3.1 Pro, and what the first day of real usage is surfacing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark picture
&lt;/h2&gt;

&lt;p&gt;The cleanest summary: GPT-5.5 is state-of-the-art on a subset of coding and math benchmarks, nominally behind Opus 4.7 on SWE-Bench Pro (a benchmark we'd largely discount given widespread memorization evidence), and behind both Opus 4.7 and Gemini 3.1 Pro on several agent/tool-use workloads.&lt;/p&gt;

&lt;p&gt;Benchmarks (GPT-5.5 -&amp;gt; GPT-5.4 -&amp;gt; Opus 4.7 -&amp;gt; Gemini 3.1 Pro):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp7jsxnmh7j064yo0k6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp7jsxnmh7j064yo0k6e.png" alt=" " width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All figures from OpenAI's release; asterisks and memorization caveats discussed below.&lt;/p&gt;

&lt;p&gt;*We'd argue SWE-Bench Pro (and SWE-Bench Verified) should be heavily discounted at this point. Both benchmarks have known memorization issues: Anthropic's own Opus 4.7 notes flag "evidence of memorization" on the benchmark, and Scale's public leaderboard methodology documents this as a known failure mode. When every frontier lab has plausibly seen the data, the scores tell you more about training set overlap than model capability. Use them as a floor, not a ranking — and weight Terminal-Bench 2.0, OSWorld-Verified, Expert-SWE, and your own task-specific evals far more heavily.&lt;/p&gt;

&lt;p&gt;One real gap deserves attention: Terminal-Bench 2.0. A 13-point lead over Opus 4.7 is the largest single-benchmark gap between today's frontier coding models, and this benchmark is newer and harder to pre-train against than the SWE-Bench family. If your agent workload is long-running terminal sessions — sandboxed CI jobs, reproduction scripts, multi-step shell workflows — GPT-5.5 leads it clearly. On MCP Atlas, Opus 4.7 still edges ahead, which matters more for tool-heavy agent workloads than the SWE-Bench Pro delta does.&lt;/p&gt;

&lt;p&gt;As with the last several releases, the honest framing is that no single model is best at everything. Which model wins depends on which benchmark you pick, which scaffold runs the evaluation, and what your actual workload looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the model was built and served
&lt;/h2&gt;

&lt;p&gt;OpenAI's most concrete technical claim in the release is about serving, not training: GPT-5.5 matches GPT-5.4 per-token latency in production while performing at a higher level of intelligence. For a larger, more capable model, holding latency flat is a non-trivial infrastructure result.&lt;/p&gt;

&lt;p&gt;The stated method has two parts.&lt;/p&gt;

&lt;p&gt;Co-designed with NVIDIA GB200 and GB300 NVL72 systems. OpenAI says GPT-5.5 was trained on and served from Blackwell-class hardware, and that the serving stack was optimized in lockstep with the model. The release post specifically credits Codex and GPT-5.5 itself for helping identify infrastructure optimizations — model-assisted systems work, which is increasingly how frontier labs describe their inference stacks.&lt;/p&gt;

&lt;p&gt;Dynamic load balancing replaced static chunking. Before GPT-5.5, OpenAI split requests on each accelerator into a fixed number of chunks. Codex analyzed weeks of production traffic and wrote custom heuristic algorithms to partition work dynamically based on request shape, reportedly increasing token generation speed by over 20%.&lt;/p&gt;
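
&lt;p&gt;OpenAI hasn't published those heuristics, but the shape of the idea is easy to sketch. In this toy TypeScript illustration (every threshold and target size is invented, not OpenAI's), a static policy splits every request the same way, while a dynamic policy sizes chunks from the request's shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Toy contrast between static and shape-aware request chunking.
interface RequestShape {
  promptTokens: number;          // prefill length
  expectedOutputTokens: number;  // decode-length estimate
}

// Static policy: every request splits into the same fixed number of chunks.
const STATIC_CHUNKS = 4;
function staticPartition(req: RequestShape): number[] {
  const size = Math.ceil(req.promptTokens / STATIC_CHUNKS);
  return Array.from({ length: STATIC_CHUNKS }, () =&amp;gt; size);
}

// Dynamic policy: chunk size follows the request's shape, so short prompts
// are not over-split and long prefills do not monopolize an accelerator.
function dynamicPartition(req: RequestShape): number[] {
  let target = 4_000;                             // prefill-heavy default
  if (req.promptTokens &amp;lt; 2_000) target = Math.max(1, req.promptTokens);
  else if (req.expectedOutputTokens &amp;gt; req.promptTokens) target = 1_000;
  const n = Math.max(1, Math.ceil(req.promptTokens / target));
  return Array.from({ length: n }, () =&amp;gt; Math.ceil(req.promptTokens / n));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;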

&lt;p&gt;This doesn't change anything about how you build with the model, but it's worth internalizing: much of the per-token efficiency story is about the serving system, not the weights. It also explains the pricing: GPT-5.5 is a larger, more expensive model to serve than GPT-5.4, and the 2x API rate hike partially reflects that cost even after the dynamic batching wins.&lt;/p&gt;

&lt;p&gt;One other architectural note worth flagging for engineers planning migrations: the 1M context window is now supported both in Codex (standard) and in the forthcoming API endpoint. Long-context performance looks materially better than GPT-5.4. On OpenAI MRCR v2 8-needle from 512K–1M, GPT-5.5 scores 74.0% vs GPT-5.4's 36.6%. On Graphwalks BFS at 1M tokens, F1 goes from 9.4% (GPT-5.4) to 45.4%. This is probably the largest generational jump in GPT-5.5 and gets less attention than the coding numbers.&lt;/p&gt;

&lt;p&gt;That said: Opus 4.7 still beats GPT-5.5 on several mid-range long-context evaluations. And the honest caveat from prior GPT-5 releases still applies — a 1M window where the last 400K tokens are unreliable is functionally smaller than the marketing suggests. Treat the 1M number as a real improvement over GPT-5.4, not as a license to stop managing context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing, and why it's the most-discussed part of the release
&lt;/h2&gt;

&lt;p&gt;The pricing on gpt-5.5 is $5 per 1M input tokens and $30 per 1M output tokens, with batch and flex at half that rate and priority at 2.5x. gpt-5.5-pro is $30 / $180. Compared to GPT-5.4 ($2.50 / $15), the base model is a flat 2x price increase.&lt;/p&gt;

&lt;p&gt;This dominated the first day's discussion. One Hacker News commenter summarized it bluntly: this is roughly 3x the price of GPT-5.1 released six months earlier. OpenAI's counter-argument, repeated several times in the release and by employees in community threads, is that GPT-5.5 uses meaningfully fewer tokens for the same task — so price per completed unit of work can be lower even when price per million tokens is higher.&lt;/p&gt;

&lt;p&gt;Both things are true. The practical implication depends on where you're running it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Codex and ChatGPT subscriptions, OpenAI says the per-task token reduction is enough that most users get better results with fewer tokens at their existing tier. This matches early reports from subscribers on the HN thread.&lt;/li&gt;
&lt;li&gt;In the API, the math is workload-dependent. If your app sends short prompts and GPT-5.5 produces 30% fewer output tokens than GPT-5.4, you're still paying ~40% more per call (worked through in the sketch after this list). If you're running long-horizon agent loops where GPT-5.5 cuts tokens by half, the net cost can drop.&lt;/li&gt;
&lt;/ul&gt;
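
&lt;p&gt;To see where that ~40% figure comes from, here's the back-of-the-envelope math in TypeScript, using the list prices above and an assumed short-prompt call shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// List prices in $ per 1M tokens (input / output), from the release.
const GPT_5_4 = { input: 2.5, output: 15 };
const GPT_5_5 = { input: 5.0, output: 30 };

function costUSD(price: { input: number; output: number }, inTok: number, outTok: number): number {
  return (inTok / 1e6) * price.input + (outTok / 1e6) * price.output;
}

// Assumed call shape: 300 input tokens; GPT-5.4 emits 1,000 output tokens,
// and GPT-5.5 really does emit 30% fewer output tokens for the same task.
const oldCost = costUSD(GPT_5_4, 300, 1_000); // ≈ $0.01575
const newCost = costUSD(GPT_5_5, 300, 700);   // ≈ $0.02250

console.log(`+${(((newCost / oldCost) - 1) * 100).toFixed(0)}% per call`); // +43%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;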

&lt;p&gt;The cleanest read is: assume API costs go up unless you measure otherwise. The Decoder's coverage put it plainly — "despite the higher price tag, GPT-5.5 is more efficient and needs fewer tokens for comparable tasks" is marketing language for "you'll pay more per token but possibly less per outcome".&lt;/p&gt;

&lt;h2&gt;
  
  
  What early users are actually reporting
&lt;/h2&gt;

&lt;p&gt;Day-one impressions are a mix of genuine enthusiasm from early-access partners and skepticism from the broader community.&lt;/p&gt;

&lt;p&gt;Positive reports. Several early-access partners highlighted strong long-horizon coding results. Dan Shipper (CEO of Every) credited GPT-5.5 with unusual conceptual clarity, pointing to a refactor it produced that matched the solution one of his senior engineers eventually landed on — and that GPT-5.4 had not been able to find. Pietro Schirano (MagicPath) reported GPT-5.5 merging hundreds of changes from a refactor branch into a main branch that had also moved significantly, in a single ~20-minute pass. Michael Truell at Cursor emphasized persistence, noting the model stays on task materially longer before stopping early.&lt;/p&gt;

&lt;p&gt;These are real signals — long-horizon coding and cross-branch reasoning are exactly where Expert-SWE (OpenAI's internal benchmark) shows a 5-point lift over GPT-5.4.&lt;/p&gt;

&lt;p&gt;Skeptical reports. The top Hacker News thread as of this writing flags the opposite failure mode: one developer reported GPT-5.5 refusing to perform a quick, benign subtask that GLM, Kimi, and MiniMax all completed — and dropping OpenAI as a result. Another recurring complaint in the thread is around model "motivation" — GPT-5.5 and GPT-5.4 both yielding control mid-task or declining work that was explicitly requested. Benchmark fatigue was also visible: commenters pushed back on OpenAI's "strongest and fastest model yet" framing as boilerplate launch language.&lt;/p&gt;

&lt;p&gt;Coding verdict emerging. The pragmatic consensus in the HN thread is to hold off on swapping Claude out for coding work until independent SWE-Bench numbers are published and verified, with several commenters calling out Opus 4.7 as still the strongest option for long-horizon refactors. That's an overstatement given GPT-5.5's Expert-SWE lead, but the underlying point — wait for independent evaluations — is correct. Vals.ai and Scale typically publish third-party numbers within a few weeks of release, and those numbers are what to watch for.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares to the alternatives, task by task
&lt;/h2&gt;

&lt;p&gt;Given the mixed benchmark picture, a single "which model is best" answer doesn't exist. Based on first-day evidence, here's a reasonable task-routing view.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-running terminal agents -&amp;gt; GPT-5.5 (+13pt lead over Opus 4.7 on Terminal-Bench 2.0)&lt;/li&gt;
&lt;li&gt;Real GitHub issue resolution (multi-file patches) -&amp;gt; Opus 4.7 (tentative; nominally 64.3% vs 58.6% on SWE-Bench Pro, but memorization caveats mean this ordering is weak — run your own eval before committing)&lt;/li&gt;
&lt;li&gt;MCP tool-heavy agents -&amp;gt; Opus 4.7 (79.1% vs 75.3% on MCP Atlas)&lt;/li&gt;
&lt;li&gt;Deep research / web-browsing agents -&amp;gt; Gemini 3.1 Pro or GPT-5.4 Pro (both lead BrowseComp; GPT-5.5 Pro closes the gap to 90.1% but the Pro pricing is steep)&lt;/li&gt;
&lt;li&gt;Hard math / theorem work -&amp;gt; GPT-5.5 Pro (39.6% on FrontierMath Tier 4 — no close peer)&lt;/li&gt;
&lt;li&gt;Long-context (512K–1M tokens) -&amp;gt; GPT-5.5 (MRCR 8-needle 512K–1M at 74.0% vs GPT-5.4's 36.6% and Opus 4.7's 32.2%)&lt;/li&gt;
&lt;li&gt;Cost-sensitive coding -&amp;gt; GPT-5.4 or Opus 4.6/4.7 (GPT-5.5 is 2x GPT-5.4's API price)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams running coding agents in production, the least controversial take is: Opus 4.7 is still the safer default for multi-file refactor work — though we'd emphasize that the SWE-Bench Pro lead driving that conclusion is the weakest part of the evidence, given the benchmark's memorization problems. GPT-5.5 is the new default for terminal-heavy agent loops and for the long-context workloads where GPT-5.4 was unreliable. If you're already routing between models, GPT-5.5 slots in without changing the overall shape of your stack — and if you care about multi-file coding specifically, the honest advice is to run a real eval on your own codebase rather than trusting the leaderboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this release actually signals
&lt;/h2&gt;

&lt;p&gt;Two things stand out beyond the benchmarks.&lt;/p&gt;

&lt;p&gt;The pace has not slowed. Opus 4.7 shipped April 16. GPT-5.5 shipped April 23. Anthropic's restricted Claude Mythos Preview is already being referenced in OpenAI's comparison tables. If you're planning infrastructure, assume another frontier model drop within 4–8 weeks, and design so that swapping the model behind your scaffold is cheap.&lt;/p&gt;

&lt;p&gt;Pricing is now a product decision, not just a cost. OpenAI doubling the API rate of its flagship while routing cheaper alternatives (GPT-5.4, GPT-4.1) through the same interface is a conscious segmentation. It matches Anthropic's Opus/Sonnet split and Google's Pro/Ultra split. For most engineering workloads, the right question is no longer "what's the best model?" but "what's the cheapest model that clears my quality bar for this specific task?"&lt;/p&gt;

&lt;p&gt;GPT-5.5 doesn't answer that question for you. But it changes the shape of the answer: on terminal agents and long-context work, it's probably worth the premium. On most other shapes of coding work, Opus 4.7 or GPT-5.4 still wins on price-per-quality. As always, measure before migrating.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] OpenAI, "Introducing GPT-5.5." April 23, 2026. &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;https://openai.com/index/introducing-gpt-5-5/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Hacker News, "GPT 5.5 Released in Codex" discussion thread. April 21–23, 2026. &lt;a href="https://news.ycombinator.com/item?id=47858903" rel="noopener noreferrer"&gt;https://news.ycombinator.com/item?id=47858903&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] The Decoder, "OpenAI unveils GPT-5.5, claims a 'new class of intelligence' at double the API price." April 23, 2026. &lt;a href="https://the-decoder.com/openai-unveils-gpt-5-5-claims-a-new-class-of-intelligence-at-double-the-api-price/" rel="noopener noreferrer"&gt;https://the-decoder.com/openai-unveils-gpt-5-5-claims-a-new-class-of-intelligence-at-double-the-api-price/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Hacker News, "GPT-5.5" discussion thread. April 23, 2026. &lt;a href="https://news.ycombinator.com/item?id=47879092" rel="noopener noreferrer"&gt;https://news.ycombinator.com/item?id=47879092&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] VentureBeat, "OpenAI's GPT-5.5 is here, and it's no potato: narrowly beats Anthropic's Claude Mythos Preview on Terminal-Bench 2.0." April 23, 2026. &lt;a href="https://venturebeat.com/technology/openais-gpt-5-5-is-here-and-its-no-potato-narrowly-beats-anthropics-claude-mythos-preview-on-terminal-bench-2-0" rel="noopener noreferrer"&gt;https://venturebeat.com/technology/openais-gpt-5-5-is-here-and-its-no-potato-narrowly-beats-anthropics-claude-mythos-preview-on-terminal-bench-2-0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] Scale Labs, "SWE-Bench Pro Leaderboard." Retrieved April 23, 2026. &lt;a href="https://labs.scale.com/leaderboard/swe_bench_pro_public" rel="noopener noreferrer"&gt;https://labs.scale.com/leaderboard/swe_bench_pro_public&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>coding</category>
      <category>programming</category>
    </item>
    <item>
      <title>Software Quality Has Never Been More Vulnerable</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:25:06 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/software-quality-has-never-been-more-vulnerable-52ol</link>
      <guid>https://dev.to/mixture-of-experts/software-quality-has-never-been-more-vulnerable-52ol</guid>
      <description>&lt;p&gt;Anthropic published a postmortem recently. The document was specific, technical, self-critical, and honest about what their full pre-release pipeline failed to catch. Three separate issues degraded Claude Code between March 4 and April 20. All three were fixed by v2.1.116 on April 20, and usage limits were reset for every subscriber on April 23.&lt;/p&gt;

&lt;p&gt;But the document is also a mirror. The conditions it describes — continuous change across weights, prompts, scaffolds, and caches; evaluation coverage that trails release velocity; internal dogfooding that drifts from external usage; regressions that hide inside normal output variance for weeks — are not conditions unique to one lab or one product. They are the working conditions of the entire AI-assisted software industry right now.&lt;/p&gt;

&lt;p&gt;We are in the era where AI coding has lifted the ceiling on how fast teams can ship, and we have not yet lifted the ceiling on how fast we can verify what we shipped. Software has never been more vulnerable than it is right now, and the Claude Code postmortem is the clearest public evidence we have of why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the postmortem actually covers
&lt;/h2&gt;

&lt;p&gt;Three issues, stacked.&lt;/p&gt;

&lt;p&gt;A reasoning-effort default change (March 4 – April 7). Claude Code's default reasoning effort was switched from high to medium because high made the UI feel frozen. It was a reasonable tradeoff on paper — lower latency for tasks that didn't need deep reasoning. In practice, users felt the capability drop immediately. The team reverted, and the current defaults are xhigh for Opus 4.7 and high for other models.&lt;/p&gt;

&lt;p&gt;A caching bug that cleared reasoning every turn (March 26 – April 10). A prompt caching optimization for idle sessions shipped with a broken header flag. The clear_thinking_20251015 flag was meant to fire once. It fired every turn. The downstream effect was forgetfulness, repetition, and odd tool choices — exactly the pattern users reported. The issue was masked in internal usage by two unrelated concurrent experiments. It was eventually surfaced by back-testing Claude Code Review with Opus 4.7 against the offending pull request; Opus 4.6 had missed it. The fix shipped April 10.&lt;/p&gt;

&lt;p&gt;A verbosity reduction in the system prompt (April 16 – April 20). The prompt added length limits on text between tool calls and on final responses. It passed weeks of eval runs. Broader ablations during the investigation revealed a flat 3% intelligence drop on both Opus 4.6 and 4.7 — small in isolation, real in aggregate. Reverted April 20.&lt;/p&gt;

&lt;p&gt;The postmortem is explicit that each of these passed "multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding." It is also explicit that users, via /feedback and public posts, were the mechanism that surfaced the problems at the speed they did.&lt;/p&gt;

&lt;p&gt;All of that is in the document. Read it. It's a good document.&lt;/p&gt;

&lt;h2&gt;
  
  
  The conditions the document describes are everyone's conditions now
&lt;/h2&gt;

&lt;p&gt;Here is the part of the postmortem that deserves more attention than the individual bugs.&lt;/p&gt;

&lt;p&gt;Anthropic's summary of why detection took time: "each change affected different traffic segments on different schedules. Early reports in March were difficult to distinguish from normal variation, and neither internal usage nor standard evals initially reproduced the issues."&lt;/p&gt;

&lt;p&gt;That is not a description of a broken process. It is a description of the operating environment that every AI-assisted software product now lives in.&lt;/p&gt;

&lt;p&gt;Consider what shipped under the Claude Code surface in those six weeks — a reasoning effort default, a caching optimization, a system prompt edit. None of those are "model releases" in the traditional sense. They are small, continuous tuning knobs that are part of the product every AI-native team ships. And any one of them can independently introduce a regression that looks, to a user, like "the model got worse."&lt;/p&gt;

&lt;p&gt;Now generalize outward. Every team building on frontier models is continuously tuning prompts, swapping models, adjusting temperature and reasoning effort, reworking tool definitions, rebuilding RAG indices, editing agent scaffolds. Most of those teams have a small fraction of Anthropic's eval infrastructure. Most of them have no /feedback channel, no dedicated developer relations account, no mechanism for surfacing regressions from users in hours.&lt;/p&gt;

&lt;p&gt;The Claude Code postmortem is the rare case where the conditions of AI-native software development were written out in public. The conditions themselves are universal.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI coding raised the ceiling on velocity. Verification didn't follow.
&lt;/h2&gt;

&lt;p&gt;The second thing to sit with is that AI-assisted development has changed how fast software can be produced, and this reshapes the risk profile of everything shipped with it.&lt;/p&gt;

&lt;p&gt;A small team in 2026 can land work that would have taken ten engineers in 2022. Coding agents — Claude Code, Codex, Copilot, Cursor, the rest — produce real, shippable code at a throughput that makes the old cadence look obsolete. Labs use their own agents to accelerate their own pipelines; OpenAI's latest release notes credit Codex with infrastructure optimizations on GPT-5.5 itself. The recursion is explicit. Frontier labs are writing more of their software with AI. Product teams are writing more of their software with AI. The ceiling on how much code gets shipped per week, everywhere, has moved up sharply.&lt;/p&gt;

&lt;p&gt;What has not kept pace is the ability to verify the behavior of AI-powered systems at the same velocity. Traditional CI was built around the assumption that software is deterministic, that a green test suite means something stable, and that regressions are rare because the artifact is frozen between releases. None of those assumptions hold cleanly for LLM-powered products. The artifact is not frozen — it's a live composition of weights, prompts, tools, and retrieval state. The tests don't catch behavioral regressions that fall inside output variance. The regression is rare per change, but the change rate is extreme.&lt;/p&gt;

&lt;p&gt;The gap between "how fast we can ship" and "how fast we can verify what we shipped" is larger today than at any point in the history of the industry. That gap is where vulnerability lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  This isn't a lab problem, it's a paradigm problem
&lt;/h2&gt;

&lt;p&gt;There is a tempting misreading of the postmortem that frames it as a story about Anthropic specifically — their process, their evals, their engineers. That's the wrong frame, and it misses the more important point.&lt;/p&gt;

&lt;p&gt;Every part of the document describes a structural condition that generalizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous tuning is now part of the product surface. The verbosity prompt is the clearest example. A single instruction about output length, inside a system prompt, caused a measurable intelligence drop invisible to standard evals. Every AI product team edits system prompts. Every one of them is one prompt change away from a similar effect.&lt;/li&gt;
&lt;li&gt;Output variance masks real regressions. "Difficult to distinguish from normal variation" is not an Anthropic phrase. It is the default state of every LLM-powered product. Noise is loud, and real drift hides inside it.&lt;/li&gt;
&lt;li&gt;Internal usage drifts from external. Staff at any AI lab, and at most AI-native product companies, run builds that are subtly different from what users see — early access to models, experimental flags, different rollout cohorts. "Dogfooding" as a guarantee gets weaker the further internal diverges from external.&lt;/li&gt;
&lt;li&gt;Users are now part of the evaluation loop. Not by choice, and not unique to Claude Code. The fastest regression-detection mechanism for almost every AI product in 2026 is a user noticing something felt off and having a channel to say so.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is an indictment. It's a description. The question worth asking isn't "how did this happen at Anthropic." It's "given these are the conditions everywhere, what should responsible AI-assisted shipping look like?"&lt;/p&gt;

&lt;h2&gt;
  
  
  What this should change about how we ship
&lt;/h2&gt;

&lt;p&gt;A few honest consequences if the framing above is right.&lt;/p&gt;

&lt;p&gt;Treat prompt edits as model edits. The verbosity incident is the proof point. A prompt change is a capability change. If you'd gate a model swap behind a full eval suite, gate a prompt edit the same way. The per-model ablations Anthropic is committing to running on every system prompt change are a good template.&lt;/p&gt;
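
&lt;p&gt;A minimal sketch of what that gate can look like in CI, assuming an invented file layout (prompts/system.md for the prompt, evals/last-run.json written by your eval suite):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createHash } from "node:crypto";
import { readFileSync, existsSync } from "node:fs";

// Hash the current system prompt.
const promptHash = createHash("sha256")
  .update(readFileSync("prompts/system.md"))
  .digest("hex");

// evals/last-run.json records the prompt hash the eval suite last ran against.
const lastRun = existsSync("evals/last-run.json")
  ? JSON.parse(readFileSync("evals/last-run.json", "utf8"))
  : null;

// Fail the build when the prompt changed without a fresh eval run, exactly
// as you would for a model swap.
if (!lastRun || lastRun.promptHash !== promptHash) {
  console.error("System prompt changed without an eval run. Blocking merge.");
  process.exit(1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;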

&lt;p&gt;Budget soak periods into release cadence. Among Anthropic's corrective actions: "soak periods, broader evaluation suites, and gradual rollouts to catch issues earlier." The implicit admission is that the previous cadence didn't leave room for them. Most AI product teams are in the same position, and many have far less room than Anthropic did.&lt;/p&gt;

&lt;p&gt;Close the internal-to-external build gap, however you can. This is hard. Staff getting early access to new models is how labs and product teams move fast. But the further internal builds drift from external, the less your dogfooding tells you. One commitment from the postmortem worth copying: have the people who ship the software actually use the shipped software, in the same configuration users see.&lt;/p&gt;

&lt;p&gt;Build a real user feedback path before you need one. A /feedback command, a dedicated community channel, a developer relations account that actually reads what's posted — the Claude Code postmortem makes clear that these are not nice-to-haves. They are the primary mechanism by which real-world regressions get caught in hours instead of weeks. Most AI products don't have this. The ones that will survive the next cycle of release velocity will.&lt;/p&gt;

&lt;p&gt;For users: keep your own evals. If you're running AI-assisted work that matters, do not rely on any provider's internal quality bar to hold steady through continuous silent changes. Keep a small suite of your own tasks that you re-run periodically. You don't need much — a handful of representative prompts that produce outputs you can compare over time is enough to notice drift early.&lt;/p&gt;
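
&lt;p&gt;What that can look like in practice: a minimal drift-check harness, sketched in TypeScript. The runModel function is a stand-in for whatever provider call you actually use, and the tasks.json / runs/ layout is invented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { readFileSync, writeFileSync, existsSync } from "node:fs";

// Stand-in for your real provider call (SDK, HTTP, or CLI). Run it with the
// same settings your users get.
declare function runModel(prompt: string): Promise&amp;lt;string&amp;gt;;

interface EvalTask { id: string; prompt: string; }

// tasks.json: a handful of representative prompts from your real workload.
const tasks: EvalTask[] = JSON.parse(readFileSync("tasks.json", "utf8"));

async function snapshot(): Promise&amp;lt;void&amp;gt; {
  const results: Record&amp;lt;string, string&amp;gt; = {};
  for (const t of tasks) results[t.id] = await runModel(t.prompt);

  // Keep every run, timestamped, so outputs can be compared over time.
  const stamp = new Date().toISOString().slice(0, 10);
  writeFileSync(`runs/${stamp}.json`, JSON.stringify(results, null, 2));

  // Crude drift signal: flag any task whose output changed vs the baseline.
  // You're looking for drift to eyeball, not exact-match pass/fail.
  if (existsSync("runs/baseline.json")) {
    const base = JSON.parse(readFileSync("runs/baseline.json", "utf8"));
    for (const t of tasks) {
      if (base[t.id] !== results[t.id]) console.log(`changed: ${t.id}`);
    }
  }
}

snapshot();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;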

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The Claude Code postmortem is a good document from a team that did the right thing in publishing it. The story it tells is not about one lab or one product. It's about the working conditions of AI-assisted software development in 2026 — conditions under which everyone is shipping faster than anyone can verify, and real regressions routinely hide inside output variance for weeks before users surface them.&lt;/p&gt;

&lt;p&gt;Software has never been more vulnerable than it is right now. Not because anyone is being careless. Because the ceiling on velocity moved up sharply and the ceiling on verification didn't follow.&lt;/p&gt;

&lt;p&gt;The labs are aware. Anthropic just wrote out in public what the condition looks like. The rest of the industry should read the document as a mirror, not a scoreboard, and act accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Anthropic Engineering. "Claude Code quality issues: postmortem summary." April 23, 2026. &lt;a href="https://www.anthropic.com/engineering/april-23-postmortem" rel="noopener noreferrer"&gt;https://www.anthropic.com/engineering/april-23-postmortem&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>programming</category>
      <category>testing</category>
    </item>
    <item>
      <title>DeepSeek V4: What's Inside, How It Compares, and Where It Actually Wins</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 01:19:04 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/deepseek-v4-whats-inside-how-it-compares-and-where-it-actually-wins-5ba6</link>
      <guid>https://dev.to/mixture-of-experts/deepseek-v4-whats-inside-how-it-compares-and-where-it-actually-wins-5ba6</guid>
      <description>&lt;p&gt;DeepSeek V4 shipped on April 24, 2026 — four days after Moonshot's Kimi K2.6, one day after OpenAI's GPT-5.5. Two MIT-licensed models, both 1M-context: V4-Pro at 1.6T total / 49B active, and V4-Flash at 284B / 13B active.&lt;/p&gt;

&lt;p&gt;The headline number is the price: $3.48 per million output tokens for V4-Pro vs $25 for Claude Opus 4.7 and $30 for GPT-5.5. (DeepSeek is also running a launch promo at 75% off — $0.87/M output — through May 5, 2026, which widens the gap further during the evaluation window.) That's a 7-9x gap at the standard rate, against a model that's within ~5-7 points of the closed frontier on most coding benchmarks. That gap is large enough to make many teams reconsider their model routing decisions.&lt;/p&gt;

&lt;p&gt;But price isn't the complete picture. V4 performs well on some workloads and poorly on others, and integration is more difficult than the marketing suggests. Here's the assessment, engineer reports, and what's new under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where V4 actually wins (and where it doesn't)
&lt;/h2&gt;

&lt;p&gt;Three frontier-class models shipped in nine days, and no single model dominates. The ranking flips depending on the workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-world software engineering (PRs, refactors, multi-repo bug fixes): Opus 4.7 leads on independent evaluations that require reasoning across many files — Vals AI's Vibe Code Benchmark, the Aider Polyglot suite, and contamination-resistant tests like LiveCodeBench. It's the right pick when changes are multi-file and planning is the difficult part. You can run Opus end-to-end (plan + edits), or split the workflow: Opus writes the plan, GPT-5.5 on low or medium reasoning executes the file edits against it. The split is often the better cost-quality tradeoff.&lt;/li&gt;
&lt;li&gt;Terminal / agentic shell: GPT-5.5 leads at 82.7% on Terminal-Bench 2.0, ~15 points ahead of V4-Pro. These workloads involve many small tool calls and shell-output error recovery, and V4 hasn't been RL-trained on them at the same depth.&lt;/li&gt;
&lt;li&gt;Long-horizon autonomous execution (12+ hour runs): Kimi K2.6 is the open-source choice, with its Claw Groups multi-agent coordination and demonstrated runs across 4,000+ tool calls.&lt;/li&gt;
&lt;li&gt;Whole-repo reasoning (hundreds of files, &amp;gt;200K tokens): V4-Pro's 1M context is the only frontier option that's economical to use at full length — its architecture cuts inference cost to roughly a quarter of V3.2's at 1M context. More on why below. The natural fit is the discovery phase of a task: load the whole repo and use V4-Pro for deep research, search, and understanding how a codebase fits together — the analysis pass that feeds into a plan, which you then hand to Opus or GPT-5.5 to execute.&lt;/li&gt;
&lt;li&gt;Cost-per-task at scale: V4-Flash at $0.28/M output is 90-107x cheaper than the closed frontier. Tencent Hy3-preview at ~$0.55/M is in a similar range. For batch and overnight workloads, neither closed model is competitive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One additional entry worth noting: Tencent Hy3-preview is not competing for the largest open coding model. It's a 21B-active model optimized for cost-per-step in real product traffic, with stable agent runs of up to 495 steps in production powering CodeBuddy and WorkBuddy. If you're building product-embedded agents on tight budgets rather than optimizing for benchmark scores, it's worth evaluating. Tencent is direct about the tradeoffs: the release notes describe "weak error recovery capabilities when calling the tool and sensitivity to inference hyperparameters."&lt;/p&gt;

&lt;p&gt;The benchmarks:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;DeepSeek V4-Pro&lt;/th&gt;&lt;th&gt;Claude Opus 4.7&lt;/th&gt;&lt;th&gt;GPT-5.5&lt;/th&gt;&lt;th&gt;Kimi K2.6&lt;/th&gt;&lt;th&gt;Qwen3.6-27B&lt;/th&gt;&lt;th&gt;Tencent Hy3-preview&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Total / Active params&lt;/td&gt;&lt;td&gt;1.6T / 49B&lt;/td&gt;&lt;td&gt;undisclosed&lt;/td&gt;&lt;td&gt;undisclosed&lt;/td&gt;&lt;td&gt;1T / 32B&lt;/td&gt;&lt;td&gt;27B dense&lt;/td&gt;&lt;td&gt;295B / 21B&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Context window&lt;/td&gt;&lt;td&gt;1M&lt;/td&gt;&lt;td&gt;1M&lt;/td&gt;&lt;td&gt;1M&lt;/td&gt;&lt;td&gt;256K&lt;/td&gt;&lt;td&gt;256K (1M YaRN)&lt;/td&gt;&lt;td&gt;256K&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;SWE-Bench Verified&lt;/td&gt;&lt;td&gt;80.6%&lt;/td&gt;&lt;td&gt;87.6%&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;80.2%&lt;/td&gt;&lt;td&gt;77.2%&lt;/td&gt;&lt;td&gt;74.4%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;SWE-Bench Pro&lt;/td&gt;&lt;td&gt;55.4%&lt;/td&gt;&lt;td&gt;64.3%&lt;/td&gt;&lt;td&gt;58.6%&lt;/td&gt;&lt;td&gt;58.6%&lt;/td&gt;&lt;td&gt;53.5%&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;&lt;td&gt;67.9%&lt;/td&gt;&lt;td&gt;69.4%&lt;/td&gt;&lt;td&gt;82.7%&lt;/td&gt;&lt;td&gt;66.7%&lt;/td&gt;&lt;td&gt;59.3%&lt;/td&gt;&lt;td&gt;54.4%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LiveCodeBench&lt;/td&gt;&lt;td&gt;93.5%&lt;/td&gt;&lt;td&gt;84.69%&lt;/td&gt;&lt;td&gt;85.30%&lt;/td&gt;&lt;td&gt;89.6%&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;BrowseComp&lt;/td&gt;&lt;td&gt;83.4%&lt;/td&gt;&lt;td&gt;79.3%&lt;/td&gt;&lt;td&gt;84.4%&lt;/td&gt;&lt;td&gt;83.2%&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;67.1%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Output price ($/M tokens)&lt;/td&gt;&lt;td&gt;$3.48&lt;/td&gt;&lt;td&gt;$25&lt;/td&gt;&lt;td&gt;$30&lt;/td&gt;&lt;td&gt;~$2.50&lt;/td&gt;&lt;td&gt;~$1.56&lt;/td&gt;&lt;td&gt;~$0.55&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;License&lt;/td&gt;&lt;td&gt;MIT (open weights)&lt;/td&gt;&lt;td&gt;Closed API&lt;/td&gt;&lt;td&gt;Closed API&lt;/td&gt;&lt;td&gt;Modified MIT&lt;/td&gt;&lt;td&gt;Apache 2.0&lt;/td&gt;&lt;td&gt;Open weights&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Sources: DeepSeek, VentureBeat, BenchLM, AkitaOnRails, Latent Space, Tencent, Qwen.&lt;/p&gt;

&lt;h2&gt;
  
  
  What engineers actually report
&lt;/h2&gt;

&lt;p&gt;The more useful signal comes from independent reviewers running real tasks, and the picture from the first 72 hours is mixed in informative ways.&lt;/p&gt;

&lt;p&gt;On the positive side, AkitaOnRails ran his RubyLLM benchmark — the same chat-app-against-a-specific-Ruby-library task he's been using to track open-source coding models — and observed V4 move from hallucinating API methods in V3.2 to writing code that compiled and ran on the first try, with Pro producing essentially reference-quality output. Vals AI observed the same pattern on their Vibe Code Benchmark (&lt;a href="https://www.vals.ai/benchmarks/vibe-code" rel="noopener noreferrer"&gt;https://www.vals.ai/benchmarks/vibe-code&lt;/a&gt;), where V4 improved roughly 10x from V3.2 and now leads open-source. DeepSeek's own team, in the release notes, is measured about positioning: V4 is their internal default now, better than Sonnet 4.5 and close to Opus 4.6 in non-thinking mode — but they explicitly stop short of claiming parity with Opus 4.7. The vendor's own framing is more accurate than much of the launch coverage.&lt;/p&gt;

&lt;p&gt;The negative reports cluster on integration. AkitaOnRails couldn't run V4-Pro through OpenCode at all — it kept failing on the thinking-mode handshake — and his broader assessment of DeepSeek launches reflects a consistent pattern: marketing ships earlier than working tool support, the community spends a few weeks reverse-engineering the protocol, and gaps in open-source harnesses tend to persist.&lt;/p&gt;

&lt;p&gt;Cursor's forum is showing similar issues, with open threads reporting V4's context capped at 200K with reasoning_content errors after tool calls (&lt;a href="https://forum.cursor.com/t/deepseek-v4-context-limited-to-200k-reasoning-content-error/159045" rel="noopener noreferrer"&gt;https://forum.cursor.com/t/deepseek-v4-context-limited-to-200k-reasoning-content-error/159045&lt;/a&gt;) and an open feature request for proper reasoning_content compatibility (&lt;a href="https://forum.cursor.com/t/compatibility-with-deepseek-models-design-to-return-reasoning-content-after-tool-calls/158905" rel="noopener noreferrer"&gt;https://forum.cursor.com/t/compatibility-with-deepseek-models-design-to-return-reasoning-content-after-tool-calls/158905&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Local-inference users are also waiting — no community GGUF at launch, llama.cpp support days out, MLX on Apple Silicon trailing by a similar margin. vLLM works on the native FP4/FP8 checkpoints out of the box, but the hardware floor is one H200 141GB or two A100 80GBs for Flash, and four A100s or two H200s to use the full 1M context.&lt;/p&gt;

&lt;p&gt;A useful counterpoint came from Chew Loong Nian, who tested all four V4 tiers across 20 real tasks instead of leaderboard prompts. V4-Pro-Max didn't dominate. Flash won 7 outright at $0.14 per million input tokens, mostly on shorter tasks where the price-quality tradeoff favored it. Pro-Max only pulled clearly ahead when the workload genuinely required it: on three long-context retrieval tasks loading 800K tokens of a real GitHub repo and asking for a function's call graph, Pro hit 3/3 while Flash hit 1/3. That points to the right mental model: V4 is two models with different optimization points, and Pro earns its premium when context is large.&lt;/p&gt;

&lt;p&gt;The practical takeaway: budget for integration work, not just inference. The thinking-mode protocol is non-trivial, OpenCode and Claude Code adapters aren't all working cleanly at launch, and you'll likely maintain your own patches for several weeks. Run V4 in shadow before deploying it to customers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why V4 performs well where it does
&lt;/h2&gt;

&lt;p&gt;Two design choices explain most of V4's profile.&lt;/p&gt;

&lt;p&gt;It only activates 49B of its 1.6T parameters per token. That's the mixture-of-experts approach — only the experts relevant to the current token activate. Combined with running natively in 4-bit weights at inference (real FP4, not simulated quantization), this is how a 1.6T model fits within deployable economics. It's also why V4-Flash exists at 13B active: the same approach scaled further down. The cost gap to closed models comes from MoE plus FP4 plus training-efficiency improvements.&lt;/p&gt;
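
&lt;p&gt;The gating step is simple to illustrate. In this toy sketch (expert count and scores are invented; real MoE routing runs per layer over learned gate logits), we score the experts for a token and keep only the top k:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Toy top-k expert gating: score every expert for the current token, run a
// forward pass through only the k best. 49B-active-of-1.6T is this idea at
// scale; the numbers below are purely illustrative.
function topKExperts(gateScores: number[], k: number): number[] {
  return gateScores
    .map((score, idx) =&amp;gt; ({ idx, score }))
    .sort((a, b) =&amp;gt; b.score - a.score)
    .slice(0, k)
    .map((e) =&amp;gt; e.idx);
}

// 8 experts, activate 2 per token: only experts 3 and 1 run.
console.log(topKExperts([0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.6, 0.15], 2)); // [3, 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;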

&lt;p&gt;It doesn't process the full million-token context. Instead, V4 summarizes long context into compressed blocks and learns which blocks to attend to for a given query. The result is concrete: at 1M context, V4-Pro uses 27% of V3.2's compute and 10% of its memory. That's what makes 1M context economically viable to serve, and why whole-repo reasoning is V4's primary workload.&lt;/p&gt;
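
&lt;p&gt;A toy version of that idea (not DeepSeek's actual mechanism, just its shape): mean-pool fixed-size blocks into summary vectors, score each summary against the query, and attend only into the top-scoring blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Toy block-compressed attention selection. Block size, k, and mean-pooling
// are illustrative stand-ins for the learned compression in the real model.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) =&amp;gt; sum + x * b[i], 0);
}

function meanPool(block: number[][]): number[] {
  const out = new Array(block[0].length).fill(0);
  for (const vec of block)
    for (let i = 0; i &amp;lt; vec.length; i++) out[i] += vec[i] / block.length;
  return out;
}

function selectBlocks(query: number[], tokens: number[][], blockSize: number, k: number): number[][] {
  const blocks: number[][][] = [];
  for (let i = 0; i &amp;lt; tokens.length; i += blockSize)
    blocks.push(tokens.slice(i, i + blockSize));

  // Score each compressed block against the query and keep the top k, so
  // full attention runs over k * blockSize tokens instead of all of them.
  return blocks
    .map((b, i) =&amp;gt; ({ i, score: dot(query, meanPool(b)) }))
    .sort((x, y) =&amp;gt; y.score - x.score)
    .slice(0, k)
    .flatMap((s) =&amp;gt; blocks[s.i]);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;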

&lt;p&gt;The tradeoff: the same compression is why V4 underperforms on terminal/agentic shell tasks. Those workloads are short-context, high-frequency tool calls — there's no million tokens to summarize, and the architectural advantage disappears. V4's weakness there isn't an architectural flaw so much as a training gap: GPT-5.5 has been RL-trained on shell sessions much more heavily, and at short context that training depth is what matters most.&lt;/p&gt;

&lt;p&gt;The technical report has more — including novel work on residual connections that other labs will likely adopt within two release cycles — but for routing decisions, the three points above are the most important.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to watch next
&lt;/h2&gt;

&lt;p&gt;Three things will determine whether V4 becomes a production default, or remains a release that performs well on benchmarks but is difficult to integrate, like several DeepSeek launches before it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool ecosystem catch-up (2-3 weeks). OpenCode, Cursor, Claude Code, Cline, and the long tail of agent harnesses need clean thinking-mode and reasoning_content support. The Cursor forum threads are the leading indicator; if they resolve within a few weeks, V4-Pro becomes a viable production option. If integration drags into May, the practical adoption ceiling stays low.&lt;/li&gt;
&lt;li&gt;The Birkhoff-constrained transformer in other labs. mHC, the residual-connection scheme from the technical report mentioned above, is the architectural idea most likely to spread. Watch Llama 5, Qwen 4, and Mistral's next foundation model for residual-connection changes that reference it.&lt;/li&gt;
&lt;li&gt;Closed-frontier pricing response. With V4-Pro at one-seventh the price of Opus 4.7 and GPT-5.5 at near-comparable coding numbers, sustained pressure on closed-API pricing is the most likely industry move. The question is whether Anthropic and OpenAI hold premium pricing on differentiated workloads (real-world SWE for Anthropic, terminal/agentic for OpenAI) or take broader cuts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The broader context: six months ago, the best open-weights coding model trailed the closed frontier by 15-20 points on SWE-Bench. Today, three open models — DeepSeek V4-Pro, Kimi K2.6, GLM-5 — sit within ~7 points of Claude Opus 4.7. Chinese labs alone have shipped a coding-focused checkpoint roughly every week for the past three months. The open vs closed framing is no longer the most useful one. The more useful framing is which model fits which workload, at what cost, with which reliability profile. V4 changes the answer for several of those workloads. The rest is integration work.&lt;/p&gt;

&lt;p&gt;V3 to V4 is roughly the same step V2 to V3 was, on a similar release cadence. What's different this time is the timing: it arrives at the point where the open frontier has caught up enough to make multi-model routing — V4-Flash for cheap calls, V4-Pro for long-context, Opus 4.7 or GPT-5.5 for the critical path — the default architecture. The teams that establish routing and evaluation infrastructure first will gain a larger advantage than any single model choice confers.&lt;/p&gt;
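
&lt;p&gt;What that routing layer can look like: a minimal sketch with invented thresholds, using the model mix discussed in this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Model = "v4-flash" | "v4-pro" | "opus-4.7" | "gpt-5.5";

interface Task {
  contextTokens: number;
  kind: "terminal" | "multi-file-edit" | "repo-analysis" | "batch";
  criticalPath: boolean; // does this output ship without further review?
}

// Invented thresholds: the point is the shape of the router, not the numbers.
function route(t: Task): Model {
  if (t.kind === "batch") return "v4-flash";        // cost-per-task at scale
  if (t.contextTokens &amp;gt; 200_000) return "v4-pro"; // whole-repo reasoning
  if (t.kind === "terminal") return "gpt-5.5";      // Terminal-Bench 2.0 lead
  if (t.criticalPath) return "opus-4.7";            // multi-file critical path
  return "v4-flash";                                // default to cheap
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;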

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] DeepSeek, "DeepSeek V4 Preview Release." DeepSeek API Docs, April 24, 2026. &lt;a href="https://api-docs.deepseek.com/news/news260424" rel="noopener noreferrer"&gt;https://api-docs.deepseek.com/news/news260424&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] MarkTechPost, "DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts." April 24, 2026. &lt;a href="https://www.marktechpost.com/2026/04/24/deepseek-ai-releases-deepseek-v4-compressed-sparse-attention-and-heavily-compressed-attention-enable-one-million-token-contexts/" rel="noopener noreferrer"&gt;https://www.marktechpost.com/2026/04/24/deepseek-ai-releases-deepseek-v4-compressed-sparse-attention-and-heavily-compressed-attention-enable-one-million-token-contexts/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] AkitaOnRails, "LLM Coding Benchmark (April 2026): GPT 5.5, DeepSeek v4, Kimi v2.6, MiMo, and the State of the Art." April 24, 2026. &lt;a href="https://akitaonrails.com/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/" rel="noopener noreferrer"&gt;https://akitaonrails.com/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Chew Loong Nian, "I Tested All 4 DeepSeek V4 Modes on 20 Real Tasks — The $0.04 Flash Won 7 of Them." Towards AI on Medium, April 2026. &lt;a href="https://medium.com/@chewloongnian/i-tested-all-4-deepseek-v4-modes-on-20-real-tasks-the-0-04-flash-won-7-of-them-0ef0fb5c1771" rel="noopener noreferrer"&gt;https://medium.com/@chewloongnian/i-tested-all-4-deepseek-v4-modes-on-20-real-tasks-the-0-04-flash-won-7-of-them-0ef0fb5c1771&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] VentureBeat, "DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5." April 24, 2026. &lt;a href="https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5" rel="noopener noreferrer"&gt;https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] BenchLM, "Best Chinese LLMs in 2026: DeepSeek V4, Kimi 2.6, GLM-5, Qwen, and Every Model Ranked." April 2026. &lt;a href="https://benchlm.ai/blog/posts/best-chinese-llm" rel="noopener noreferrer"&gt;https://benchlm.ai/blog/posts/best-chinese-llm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[7] Latent Space, "Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6." April 20, 2026. &lt;a href="https://www.latent.space/p/ainews-moonshot-kimi-k26-the-worlds" rel="noopener noreferrer"&gt;https://www.latent.space/p/ainews-moonshot-kimi-k26-the-worlds&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[8] vLLM Blog, "DeepSeek V4 in vLLM: Efficient Long-context Attention." April 2026. &lt;a href="https://vllm.ai/blog/deepseek-v4" rel="noopener noreferrer"&gt;https://vllm.ai/blog/deepseek-v4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[9] DeepSeek-V4-Pro on Hugging Face. &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[10] Hacker News discussion: "DeepSeek v4." April 24, 2026. &lt;a href="https://news.ycombinator.com/item?id=47884971" rel="noopener noreferrer"&gt;https://news.ycombinator.com/item?id=47884971&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[11] Tencent, "Tencent Unveils Hy3 preview; Model Enhances Agent Capabilities and Real-World Usability." April 23, 2026. &lt;a href="https://www.tencent.com/en-us/articles/2202320.html" rel="noopener noreferrer"&gt;https://www.tencent.com/en-us/articles/2202320.html&lt;/a&gt;. Model weights: tencent/Hy3-preview on Hugging Face — &lt;a href="https://huggingface.co/tencent/Hy3-preview" rel="noopener noreferrer"&gt;https://huggingface.co/tencent/Hy3-preview&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[12] Cursor Community Forum, "DeepSeek V4: context limited to 200K + reasoning_content error." April 2026. &lt;a href="https://forum.cursor.com/t/deepseek-v4-context-limited-to-200k-reasoning-content-error/159045" rel="noopener noreferrer"&gt;https://forum.cursor.com/t/deepseek-v4-context-limited-to-200k-reasoning-content-error/159045&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[13] Cursor Community Forum, "Compatibility with DeepSeek models design to return reasoning_content after tool calls." April 2026. &lt;a href="https://forum.cursor.com/t/compatibility-with-deepseek-models-design-to-return-reasoning-content-after-tool-calls/158905" rel="noopener noreferrer"&gt;https://forum.cursor.com/t/compatibility-with-deepseek-models-design-to-return-reasoning-content-after-tool-calls/158905&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[14] Qwen Team, "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model." April 22, 2026. &lt;a href="https://qwen.ai/blog?id=qwen3.6-27b" rel="noopener noreferrer"&gt;https://qwen.ai/blog?id=qwen3.6-27b&lt;/a&gt;. Model weights: Qwen/Qwen3.6-27B on Hugging Face — &lt;a href="https://huggingface.co/Qwen/Qwen3.6-27B" rel="noopener noreferrer"&gt;https://huggingface.co/Qwen/Qwen3.6-27B&lt;/a&gt;. Pricing via OpenRouter — &lt;a href="https://openrouter.ai/qwen/qwen3.6-27b" rel="noopener noreferrer"&gt;https://openrouter.ai/qwen/qwen3.6-27b&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>deepseek</category>
    </item>
    <item>
      <title>The Coding Benchmark We Actually Need</title>
      <dc:creator>Mixture of Experts</dc:creator>
      <pubDate>Thu, 07 May 2026 00:59:12 +0000</pubDate>
      <link>https://dev.to/mixture-of-experts/the-coding-benchmark-we-actually-need-m2m</link>
      <guid>https://dev.to/mixture-of-experts/the-coding-benchmark-we-actually-need-m2m</guid>
      <description>&lt;p&gt;The benchmarks worth caring about measure something a customer would pay for. “Can this agent ship a product that generates revenue” is the question worth asking. “Can this agent reproduce SQLite from memory under adversarial constraints” is not.&lt;/p&gt;

&lt;p&gt;That’s the lens for evaluating coding agents going forward, and ProgramBench[1] is a useful place to ground it: the benchmark gets one key thing right, and the rest of its design deserves scrutiny. The setup: hand a coding agent a compiled binary, the user-facing docs, and a sandbox. Rebuild the program from scratch. Pass all the behavioral tests. No web access. No objdump, strings, or hexdump. No source. Across 200 tasks and 248,000 behavioral tests, every frontier model scored 0% fully resolved[1]. The tasks range from jq on the small end to SQLite, PHP, and FFmpeg on the large end. Claude Opus 4.7 leads the “almost resolved” column at 3.0%. GPT-5.4, Gemini 3.1 Pro, and Haiku 4.5 all sit at 0/0 (zero in both columns).&lt;/p&gt;
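
&lt;p&gt;To make the two score columns concrete, here is a toy grader in Python. The per-task pass counts and the 0.95 cutoff for “almost resolved” are assumptions for illustration; the benchmark defines its own thresholds.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy ProgramBench-style scorer. The 0.95 "almost resolved" cutoff
# is an assumed illustration, not the paper's definition.

def score(results: dict[str, tuple[int, int]]) -&gt; dict[str, float]:
    """results maps task id to (tests_passed, tests_total)."""
    n = len(results)
    fully = sum(1 for p, t in results.values() if p == t)
    almost = sum(1 for p, t in results.values()
                 if p != t and p / t &gt;= 0.95)
    return {"fully_resolved": fully / n, "almost_resolved": almost / n}

# An agent that nails jq but stalls partway through SQLite:
print(score({"jq": (1240, 1240), "sqlite": (31002, 48000)}))
# {'fully_resolved': 0.5, 'almost_resolved': 0.0}
&lt;/code&gt;&lt;/pre&gt;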

&lt;p&gt;The framing is that this is a hard reverse-engineering test, but what it actually measures is memorization, and that’s the wrong thing to be testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ProgramBench measures memorization, not capability
&lt;/h2&gt;

&lt;p&gt;Real reverse-engineering looks like the workflow any dev uses to rebuild something they don’t fully understand: poking at the product to see how it behaves, reading the docs, searching the web for similar projects, pulling up reference implementations and design-system examples, searching for half-remembered error strings, and reading the upstream changelog to figure out why a behavior changed. ProgramBench’s rules forbid all of that. The agent gets a binary it can execute and a manual it can read. That’s it.&lt;/p&gt;

&lt;p&gt;Strip those tools out and what’s left is: produce, from training data alone, a clean-room implementation of FFmpeg that matches the reference on a quarter-million tests. The model is recalling whether it saw enough of the original codebase during pretraining to reconstruct it, when what we actually want to know is whether it can reason about the binary.&lt;/p&gt;

&lt;p&gt;Doing well on this would tell us the model memorized the training set, which isn’t what we’re trying to measure. Doing poorly tells us only that current frontier models can’t perfectly memorize SQLite, which we already knew.&lt;/p&gt;

&lt;p&gt;The benchmark authors will say that’s the point: forbid the obvious tools so the model can’t cheat. But “cheating” here means “using the workflow that real engineers use.” The constraint makes the test cleaner to grade, but it stops the benchmark from measuring anything a customer would pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one part worth keeping: free-form implementation
&lt;/h2&gt;

&lt;p&gt;ProgramBench does get one thing right, and it’s worth calling out because it’s the part worth carrying forward into a better benchmark. The input format. No method signatures to fill in. No class skeletons. No PRD. No natural-language description of the intended file layout. Just: here’s the binary, here’s the manual, build the thing.&lt;/p&gt;

&lt;p&gt;That matters. Most coding benchmarks rely on partial structure to make grading tractable. SWE-Bench[2] hands you a repo plus a failing test. HumanEval gives you a docstring and a function signature. Even the harder agent benchmarks pass in a problem statement that a human has already broken down. ProgramBench is the rare benchmark that forces the model to architect from zero.&lt;/p&gt;
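
&lt;p&gt;To see how much architecture the structured formats give away, consider what a HumanEval-style item looks like (paraphrased here): the signature and docstring fix the interface, and the model only fills in the body.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A HumanEval-style prompt ends at the docstring; everything an
# architect would decide is already decided. (Paraphrased item.)

def has_close_elements(numbers: list[float], threshold: float) -&gt; bool:
    """Return True if any two numbers are closer than threshold."""
    # The model's entire job is the body below.
    return any(threshold &gt; abs(a - b)
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])
&lt;/code&gt;&lt;/pre&gt;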

&lt;p&gt;The free-form input is the right idea. The rest of the design isn’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  A proposal: free-form input, real outcomes, real tools
&lt;/h2&gt;

&lt;p&gt;Here’s the redesign. Keep ProgramBench’s free-form input. Drop the no-tools rule. Replace test pass rates with a metric a customer would actually pay for.&lt;/p&gt;

&lt;p&gt;Take Vending-Bench 2[3]: a year-long simulation where the agent runs a vending machine business starting with $500, negotiates suppliers, manages inventory, and gets scored on the bank balance at year-end. Andon Labs explicitly designed it to measure long-horizon coherence, the failure mode where agents drift, forget, or go bankrupt over thousands of tool calls.&lt;/p&gt;

&lt;p&gt;Now hybridize Vending-Bench’s outcome-based scoring with ProgramBench’s free-form input and SWE-Bench’s real-world software framing. Drop the agent into an empty repo. Give it a market hypothesis and the tools real engineers use, including the web, package managers, debuggers, and the works. Let it ship a SaaS app. Score it on generated ARR after 90 days of simulated operation, with a synthetic customer pool that buys, churns, and files support tickets against whatever the agent builds.&lt;/p&gt;
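
&lt;p&gt;A toy version of the scoring loop, to show the shape of the idea. Every number here (pool size, conversion and churn odds, price) is invented for illustration; a real harness would drive synthetic customers against the agent’s actual deployed app.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy outcome-based scorer: simulate 90 days of a synthetic customer
# pool and report annualized run-rate revenue. All rates are made up.
import random

def simulated_arr(product_quality: float, days: int = 90,
                  pool: int = 1000, monthly_price: float = 29.0) -&gt; float:
    rng = random.Random(0)                   # reproducible runs
    subscribers = 0
    for _ in range(days):
        # better products convert more of the pool and churn less
        signups = sum(0.01 * product_quality &gt; rng.random()
                      for _ in range(pool))
        churned = sum(0.02 * (1 - product_quality) &gt; rng.random()
                      for _ in range(subscribers))
        subscribers += signups - churned
    return subscribers * monthly_price * 12  # ARR at day 90

print(f"score: ${simulated_arr(product_quality=0.7):,.0f} ARR")
&lt;/code&gt;&lt;/pre&gt;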

&lt;p&gt;That benchmark would test what coding agents are actually for: building things that work, in a real environment, with the tools real engineers use, against an outcome a customer would pay for. Memorization helps a little. Architecture, debugging, customer empathy, and long-horizon execution help a lot more. And critically, the score moves with the thing we actually want, real economic value generated, not with how much of the target codebase happened to be in the model’s pretraining data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the 0% actually tells us
&lt;/h2&gt;

&lt;p&gt;ProgramBench’s headline number is a benchmark design choice. Forbid web access, forbid decompilation, forbid source, and you’ve forbidden the workflow. The remaining test measures recall under adversarial constraints, which is interesting research but not a useful signal for production routing decisions, and not a measure of value any customer would pay for.&lt;/p&gt;

&lt;p&gt;Run a coding agent in the environment it’s actually deployed in. Score it on outcomes a customer cares about. The benchmarks that survive the next two years will look more like Vending-Bench than ProgramBench. They will be long-horizon, tool-rich, free-form on the input side, and graded on revenue rather than test pass rates.&lt;/p&gt;

&lt;p&gt;The free-form input idea is worth keeping. Combine it with outcome-based scoring and you have the benchmark we actually need.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] ProgramBench, “Rebuilding programs from scratch: a benchmark for coding agents.” 2026.&lt;/p&gt;

&lt;p&gt;[2] SWE-Bench, “Can Language Models Resolve Real-World GitHub Issues?”&lt;/p&gt;

&lt;p&gt;[3] Andon Labs, “Vending-Bench 2: Long-horizon agent coherence over a one-year simulated business.”&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
