EVAL #010: The AI Coding Agent Wars — 10 Agents, 4 Architectures, 1 Winner (For Now)


By Ultra Dune | EVAL — The AI Tooling Intelligence Report | April 14, 2026


Ten coding agents shipped major milestones in a single week. OpenHands hit 1.0. SWE-agent hit 2.0. Cline hit 4.0. Aider went autonomous. Devin launched v2. OpenAI and Amazon released dedicated coding agents. Bolt.new open-sourced their entire platform. Plandex 2.0 shipped. Roo Code and Goose kept iterating.

This isn't incremental progress. This is an entire product category reaching escape velocity — all at once. Meanwhile, the Llama 4 launch triggered the most dramatic ecosystem-wide tooling sprint we've seen since Llama 2, and there's a benchmark controversy big enough to warrant its own section.

So let's do what EVAL does. Let's evaluate.


The Eval: The AI Coding Agent Taxonomy — Architecture Decides Everything

Why This Matters Right Now

Here's the uncomfortable question every engineering team is facing: there are now at least 10 production-quality AI coding agents. Which one do you actually use?

The answer isn't "whichever one benchmarks highest." The answer depends on how the agent is built — its architecture, its interface model, its relationship with the developer. After testing all 10, I'm convinced the architecture is the thing. The model matters less than you think. The interface matters more than you realize. Let me show you why.

The Four Architectures

Every coding agent on the market fits one of four architecture patterns. Understanding these patterns tells you more about an agent's strengths and limitations than any benchmark score.

1. Code-as-Action (CodeAct)
Representative: OpenHands

OpenHands' CodeAct architecture is the most conceptually interesting. Instead of calling predefined tools through JSON schemas, the agent writes and executes code to accomplish tasks. Need to search a codebase? The agent writes a grep script. Need to refactor a module? The agent writes a transformation script.

This sounds circuitous, but it's surprisingly powerful. Code-as-action gives the agent a universal toolkit — any operation expressible in Python is available. The agent isn't limited to the tool definitions someone hand-wrote. When OpenHands encounters a novel situation, it improvises.

The downside is reliability. Code execution introduces a wider failure surface than calling a typed API. And debugging becomes harder — when the agent writes a script that produces wrong output, the error could be in the script's logic, not just in a tool parameter.

OpenHands 1.0 is the first stable release of this architecture, and it's the one I'd point researchers at. For production use? The failure modes need more guardrailing first.
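
The core loop is easy to sketch. Here's a minimal toy version of code-as-action (an illustration of the idea, not OpenHands' implementation), where the harness executes whatever Python the model emits and feeds the output back as the next observation. The `call_llm` function is a hypothetical stand-in for a real model call.

```python
import io
import contextlib

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; a real agent would hit an
    LLM API here. Returns Python source code as the agent's 'action'."""
    return "print(sum(range(10)))"

def run_code_action(source: str) -> str:
    """Execute model-written code and capture stdout as the observation.
    A production system would sandbox this, not run bare exec()."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()

# One turn of the loop: the model writes code, the harness runs it,
# and the output becomes the next observation.
action = call_llm("Compute the sum of 0..9.")
observation = run_code_action(action)
print(observation)  # → 45
```

The universal-toolkit property falls out of `exec`: any operation expressible in Python is a valid action. So is any bug, which is exactly the reliability tradeoff described above.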

2. Agent-Computer Interface (ACI)
Representative: SWE-agent

Princeton's SWE-agent introduced the most important idea in the coding agent space: the interface you give the agent matters as much as the model running it. They call it ACI — Agent-Computer Interface — and the analogy to HCI (Human-Computer Interaction) is deliberate.

SWE-agent doesn't let the LLM use raw bash commands and standard file operations. Instead, it provides purpose-built tools optimized for how LLMs process information: a file viewer that shows line numbers and context windows, a search tool that returns structured results, an editor that operates on specific line ranges. These tools look redundant — you could do all of this with cat, grep, and sed — but the LLM makes fewer mistakes with the custom interface.
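
To make the idea concrete, here's an illustrative sketch (not SWE-agent's actual tool code) of what a line-aware viewer and a range-based editor look like when designed for an LLM rather than a human:

```python
def open_window(lines, center, radius=2):
    """ACI-style file viewer: return a line-numbered window around a
    1-indexed line, so the model sees exact line numbers instead of
    raw `cat` output."""
    lo = max(0, center - radius - 1)
    hi = min(len(lines), center + radius)
    return "\n".join(f"{i + 1}| {lines[i]}" for i in range(lo, hi))

def edit_range(lines, start, end, replacement):
    """ACI-style editor: replace an inclusive 1-indexed line range,
    the kind of structured edit LLMs get right more often than
    free-form sed invocations."""
    return lines[:start - 1] + replacement + lines[end:]

source = ["import os", "def main():", "    pass", "", "main()"]
print(open_window(source, 2, radius=1))
# 1| import os
# 2| def main():
# 3|     pass
print(edit_range(source, 3, 3, ["    print('hello')"]))
```

The payoff is error rate: when the model quotes a line number, the viewer guarantees that number is real, and the editor rejects nothing silently.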

SWE-agent 2.0 pushes this further. They report solving 45% of real GitHub issues on SWE-bench Verified, putting the project solidly in the top tier. The lesson is clear: if you're building an agent system, invest as much time designing the tool interface as you do choosing the model.

3. Plan-and-Execute
Representatives: Plandex, Devin

Plandex and Devin approach the problem differently: plan first, then execute. The agent creates a detailed plan of changes before writing any code, and the human reviews the plan before execution starts.

Plandex takes this to its logical extreme. Every change goes through a version-controlled sandbox. You see the full diff of what the agent wants to do, accept or reject specific changes, and can roll back anything. It's a git-for-AI-edits approach that appeals to teams who need auditability. Plandex 2.0 adds Llama 4 support and improved planning quality.

Devin 2.0 is the cloud-native version of this pattern. It runs in its own sandboxed environment with a full code editor, browser, and terminal. You assign it tasks through Slack. It plans, executes, and reports back — sometimes hours later. Devin's 2.0 update adds full project management, CI/CD integration, and multi-repo support. The pitch is no longer "AI pair programmer" — it's "AI team member."

The plan-and-execute pattern has one clear advantage: transparency. You know what the agent intends to do before it does it. The disadvantage is speed. Planning takes time, and the human review step creates a bottleneck. For routine tasks, this is overhead. For high-stakes refactors, it's essential.
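
The review gate at the heart of the pattern can be sketched in a few lines. This is a hypothetical illustration, not Plandex's or Devin's internals: the agent proposes changes, and only the ones the human approves reach execution.

```python
from dataclasses import dataclass

@dataclass
class PlannedChange:
    path: str
    description: str

def review(plan, approved_paths):
    """Human review gate: only explicitly approved changes survive to
    the execution phase; everything else is dropped (or rolled back)."""
    return [c for c in plan if c.path in approved_paths]

# The agent proposes a plan; nothing touches the repo until sign-off.
plan = [
    PlannedChange("auth/session.py", "rotate token on refresh"),
    PlannedChange("auth/tests.py", "add rotation test"),
    PlannedChange("README.md", "update auth docs"),
]
to_execute = review(plan, approved_paths={"auth/session.py", "auth/tests.py"})
print([c.path for c in to_execute])  # the README change was rejected
```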

4. React-and-Iterate (Standard Tool-Use Loop)
Representatives: Cline, Aider, Roo Code, Goose, GitHub Copilot, Cursor, Amazon Q Developer

The majority of coding agents share a common architecture: observe the codebase, reason about the task, take an action (edit a file, run a command), observe the result, repeat. This is the ReAct pattern adapted for coding, and it works because it mirrors how developers actually work.
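
Stripped to its skeleton, the loop looks like this. It's a generic sketch of the pattern, not any particular agent's code; the `llm` callable, which returns a tool name and arguments (or "done"), is a hypothetical stand-in.

```python
def react_loop(task, llm, tools, max_steps=10):
    """Observe-reason-act loop: the model picks a tool, the harness runs
    it, and the result becomes the next observation."""
    observation = f"Task: {task}"
    for _ in range(max_steps):
        tool_name, args = llm(observation)     # reason: choose next action
        if tool_name == "done":
            return args                        # model declares completion
        observation = tools[tool_name](*args)  # act, then observe result
    raise RuntimeError("step budget exhausted")

# Scripted "model" for illustration: read a file, then declare success.
steps = iter([("read", ("notes.txt",)), ("done", "summary written")])
result = react_loop(
    "summarize notes",
    llm=lambda obs: next(steps),
    tools={"read": lambda path: f"contents of {path}"},
)
print(result)  # → summary written
```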

Where these agents differentiate is in the details:

Cline 4.0 is the gold standard for IDE-integrated agents with safety controls. Every action — every file edit, every terminal command, every API call — requires explicit human approval by default. Cline was among the first to integrate MCP, making it infinitely extensible through the standard tool protocol. The new 4.0 release adds multi-file editing that's finally reliable and native MCP support that makes external tool servers feel first-class. If you want an agent that's powerful but never does anything without asking, this is it.

Aider's autonomous mode is the week's most interesting evolution. Aider has always been the power-user's terminal-based coding partner — strong git integration, transparent benchmarking, a public LLM leaderboard showing per-model performance. The v0.82 architect mode uses a two-model approach: an "architect" model plans the changes, an "editor" model implements them. Different LLMs can fill each role. The new autonomous mode reduces human intervention, letting Aider chain multiple edit-test-fix cycles without waiting for approval on each step. For experienced developers who trust-but-verify via git history, this hits the sweet spot.
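
The two-model split is simple to express. Here's a sketch of the division of labor (not Aider's actual code; both callables are hypothetical stand-ins):

```python
def two_model_edit(task, architect_llm, editor_llm):
    """Architect/editor split: a stronger model writes the plan, and a
    cheaper or faster one turns it into concrete edits."""
    plan = architect_llm(f"Plan the edits for: {task}")
    return editor_llm(f"Apply this plan as concrete edits:\n{plan}")

# Scripted stand-ins for illustration.
architect = lambda prompt: "1. rename foo to bar in utils.py"
editor = lambda prompt: "edited per plan: " + prompt.splitlines()[-1]
print(two_model_edit("rename foo", architect, editor))
# → edited per plan: 1. rename foo to bar in utils.py
```

The design choice worth noting: because the two roles only communicate through text, each slot can be filled by a different provider, which is what makes the cost-to-quality tuning possible.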

Roo Code forked from Cline and went its own direction with custom modes — define different AI personas for different tasks (code mode, architect mode, review mode). It's Cline for people who want more configurability at the expense of a larger surface area.

Goose (Block's open-source agent) takes a modular approach with pluggable toolkits. Enable or disable capabilities, write custom toolkits for your specific workflow. The corporate backing adds credibility, and the extensibility model is clean.

GitHub Copilot's agent mode and Cursor's background agents represent the commercial endpoint: deeply integrated IDE experiences with strong model backends but less flexibility in model choice or architectural customization.

Amazon Q Developer Agent is unique in the enterprise space: specialized sub-agents for different tasks (/dev for feature development, /transform for code modernization). The legacy code transformation capability — automatically upgrading Java versions across entire codebases — is genuinely differentiated. If you're in the AWS ecosystem, Q is the only agent that understands your infrastructure natively.

The Benchmark Problem

Here's where I get opinionated. SWE-bench Verified is the standard benchmark for coding agents, and the top systems are now resolving 50%+ of real GitHub issues. That's impressive. It's also approaching saturation.

The problem isn't that agents are getting too good. It's that SWE-bench measures a specific kind of coding: fixing isolated bugs in popular Python repos. It doesn't measure greenfield architecture, multi-repo orchestration, build system debugging, or the kind of "figure out what's even wrong" detective work that eats most of an engineer's time.

The agents that perform best in practice aren't always the ones with the highest SWE-bench scores. Aider's architect mode produces clean, reviewable diffs that integrate well with team workflows — that matters more than a benchmark score. Cline's MCP extensibility means it improves over time as you add tools — that compounds in ways benchmarks don't capture.

We need better benchmarks. Until then, the best advice I can give: try three agents for a week on your actual codebase. The one that fits your workflow is the right one — regardless of what the leaderboard says.

The Architecture Decision Tree

Here's my framework for choosing:

You're a solo developer who lives in the terminal → Aider. The architect mode + autonomous execution + transparent git integration is unmatched for the terminal workflow. Run it with Claude Sonnet as architect and Haiku as editor for the best cost-to-quality ratio.

You want IDE-integrated help with full control → Cline 4.0. The approval-gated action model means you're always in the loop. MCP support lets you extend it to any tool. The v4.0 multi-file editing finally makes it reliable for larger changes.

You need enterprise-grade coding at scale → Amazon Q Developer. Especially if you're in the AWS ecosystem. The code transformation agents are genuinely unique. For non-AWS shops, GitHub Copilot's agent mode integrates tightest with the GitHub workflow.

You're building an AI coding product or doing research → OpenHands. The CodeAct architecture is the most flexible and extensible. The tradeoff is that you need more engineering skill to deploy and customize it.

You want asynchronous task delegation → Devin 2.0. It's the only agent that feels like assigning a task to a teammate. The cloud sandbox model means it can work on tasks independently for hours — useful for tedious migration work where you don't want to babysit.

You want maximum customization → Roo Code (IDE) or Goose (terminal). Custom modes and pluggable toolkits, respectively. Best for teams with specific workflow requirements.

The Verdict

The coding agent space just jumped from "interesting experiment" to "which one does my team use." The architectures have converged enough that the top agents all handle standard coding tasks well. The differentiation is in the interaction model: how much control you want, how much you trust the agent, and where in your workflow it sits.

My prediction: within 12 months, the "standard tool-use loop" agents will dominate the market because they're the most adaptable. But the ideas from CodeAct (code-as-universal-tool) and ACI (purpose-built interfaces) will get absorbed into the mainstream agents. SWE-agent's core insight — that the interface matters as much as the model — will become conventional wisdom.

The real winner isn't one agent. It's the developer who picks the right agent for their workflow and gets proficient with it. That advantage compounds every week.


The Changelog

Llama 4 — The Ecosystem Sprint (And The Controversy)

Meta dropped Llama 4 Scout (17B active / 109B total, 16 MoE experts, 10M token context) and Llama 4 Maverick (17B active / 400B total, 128 experts, 1M context). Both natively multimodal. The ecosystem response was instant: vLLM 0.8.4, Ollama 0.6.3, HuggingFace Transformers 4.51.x, llama.cpp b5060+, and KTransformers v0.5 all shipped Llama 4 support within days. This is the fastest ecosystem-wide adoption we've tracked.

The controversy: Meta's Chatbot Arena submission used what appears to be a specially tuned "chat version" of Maverick with different system prompts and possibly different weights than the public release. It rocketed to #1. LMSys flagged it. r/LocalLLaMA and r/MachineLearning erupted. Independent testing shows Scout and Maverick underperform on coding tasks compared to Qwen 2.5 and DeepSeek — despite Meta's benchmark claims.

The architecture is genuinely innovative (MoE with only 17B active params at inference time). The 10M context window on Scout is industry-leading for open models. But the benchmark gaming undermines trust in ways that take quarters to rebuild.

vLLM 0.8.4 — V1 Engine Goes Default

V1 engine now enabled by default for Llama, Qwen2, Mistral, Gemma2, and more architectures. This is the big architectural switchover — V1 has a redesigned scheduler and memory manager built for production throughput. Also: native Llama 4 MoE support with expert parallelism, full prefix caching for multimodal models, chunked prefill optimizations, and speculative decoding improvements (Eagle and Medusa). The 0.8.4.post1 patch followed quickly, fixing Llama 4 inference edge cases. If you're still on 0.8.2 or earlier, this is the release that makes the V1 migration mandatory.

Ollama 0.6.3 — Llama 4 For Everyone

Llama 4 Scout and Maverick support out of the box. Improved memory management for large MoE models (critical — Maverick's 400B total params need careful handling). Better KV cache scheduling, reasoning model improvements for QwQ and early Qwen3 thinking models, plus structured output fixes. Ollama continues to be the "it just works" on-ramp for local inference, and MoE support makes it relevant for models that were previously server-only.

llama.cpp b5060+ — The b5000 Milestone

llama.cpp crossed the b5000 milestone and immediately tackled Llama 4 MoE support in b5060 — GGUF format extensions for mixture-of-experts, new quantization approaches for expert weights, and MoE-aware inference paths. Builds b5070 and b5078 followed with CUDA optimizations, performance improvements, and Llama 4 bugfixes. MoE support in GGUF is harder than standard dense models, but the daily build cadence means issues get fixed fast.

KTransformers v0.5 — Llama 4 on Your Desktop

The community darling of the week. KTransformers v0.5 runs Llama 4 Scout (109B) on a single RTX 4090 with 128GB system RAM through intelligent CPU/GPU expert offloading. Active MoE experts route to GPU, inactive experts stay in CPU memory with Intel AMX/AVX-512 optimized kernels. Smart prefetching predicts which experts will be needed next. Performance: 10-15 tok/s decode on consumer hardware. This is what the MoE architecture was designed to enable — enormous model capacity without enormous inference cost — and KTransformers is the first tool to deliver it cleanly on hardware people actually own.
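
The routing idea amounts to a small cache policy. Here's a toy illustration of the concept (not KTransformers' implementation): experts active for the current token move into a fixed-size GPU set, and the least-recently-used experts fall back to CPU memory.

```python
def route_experts(active_ids, gpu_cache, cpu_store, gpu_capacity=4):
    """Toy LRU offload policy: keep currently-active MoE experts in the
    GPU cache (a list ordered oldest-first), evict the rest to CPU."""
    for expert_id in active_ids:
        if expert_id in gpu_cache:
            gpu_cache.remove(expert_id)       # refresh LRU position
        else:
            if len(gpu_cache) >= gpu_capacity:
                evicted = gpu_cache.pop(0)    # evict least-recently-used
                cpu_store.add(evicted)
            cpu_store.discard(expert_id)      # "upload" expert to GPU
        gpu_cache.append(expert_id)
    return gpu_cache

gpu, cpu = [], {0, 1, 2, 3, 4, 5}
route_experts([1, 2], gpu, cpu)        # token 1 activates experts 1 and 2
route_experts([3, 4, 5], gpu, cpu)     # token 2 fills the GPU, evicts 1
print(gpu, cpu)                        # → [2, 3, 4, 5] {0, 1}
```

Real systems add what a toy can't: predictive prefetching based on routing patterns, so the expert transfer overlaps with computation instead of stalling it.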

Weaviate 1.29 — 10x Faster Keyword Search

BlockMax WAND (BMW) algorithm delivers up to 10x faster BM25 keyword searches. Multi-tenancy improvements with hot/warm/cold/frozen storage tiers for cost-effective shared deployments. Async replication improvements for better consistency. New cursor API for efficient bulk data exports. If you're running hybrid search (vector + keyword), this is a significant upgrade on the BM25 side.

HuggingFace Transformers 4.51.2 / 4.51.3 — Llama 4 Patch Blitz

Two patch releases in one week, both focused on Llama 4 stabilization. Fixes for interleaved video+text inputs, device_map=auto crashes, bitsandbytes quantization issues, FSDP support, and multi-image handling. This is the "we launched day-zero support for a complex new architecture and now we're fixing the edge cases" pattern. Boring but necessary. If you're doing Llama 4 inference through Transformers, pin to 4.51.3.

LangChain 0.3.52 — Python 3.9 Farewell

The release itself is incremental — bug fixes and partner integrations. The notable item is Python 3.9 dropped across the LangChain ecosystem (minimum is now 3.10). If you're still on 3.9, this is your wake-up call. Also: KeyedValues type for structured data passing and parent run tree tracing utility for better debugging in LangSmith.


The Signal

Signal 1: Augment Code Raises $300M — The AI Coding War Chest

Augment Code closed a $300M Series C at a reported $3B valuation. That's a staggering bet on "Copilot isn't enough" — the thesis being that code completion is table stakes, and the real money is in AI that understands entire codebases, team patterns, and development workflows at a deeper level.

The competitive landscape is brutal. GitHub Copilot has distribution (every GitHub user). Cursor has developer love. Cline and Aider have open-source momentum. And now Augment is betting that enterprise-grade, full-codebase AI understanding is a defensible wedge.

The signal: investors believe the coding agent market is still early-innings despite the crowding. A $3B valuation says someone thinks this is a $30B+ category.

Signal 2: The Week 10 Coding Agents Hit Their Milestone

I didn't plan to make the Signal section match the deep-dive, but the data demands it. Count the milestones from a single week: OpenHands 1.0, SWE-agent 2.0, Cline 4.0, Aider v0.82 (autonomous mode), Devin 2.0, Plandex 2.0, OpenAI Codex Agent, Amazon Q Developer Agent, plus Bolt.new open-sourcing its platform and continued iteration from Roo Code and Goose.

This isn't coincidence. This is a category crystallizing. When 10 products in the same space all ship major versions in the same week, it means the market has decided this category is real. The comparison to the "LLM launch party" of early 2024 is apt — except now the bet is on the tooling layer, not the model layer. The value is shifting from "make the model smarter" to "make the developer more effective."

For engineering leaders: the era of optional AI coding assistance is ending. The question is no longer "should we use AI coding tools" but "which agent, what workflow, what guardrails."

Signal 3: Llama 4's Benchmark Controversy Changes The Trust Game

Meta's Chatbot Arena stunt — submitting a specially tuned model variant that wasn't the public release — is more than a PR embarrassment. It damages the credibility of benchmark results industry-wide.

Open-weight models compete on benchmarks. That's more or less the entire marketing strategy. When a major player gets caught gaming, it creates a trust deficit that affects every model release. Qwen 3, DeepSeek V3, Mistral — they all rely on benchmark credibility to establish technical legitimacy.

The silver lining: the community caught it fast. r/LocalLLaMA, independent evaluators, and LMSys all flagged the discrepancy quickly. The immune system works. But Meta needs to be more careful. Open-weight model development depends on good-faith benchmark reporting, and Llama 4's MoE architecture is genuinely innovative enough that it didn't need the help.


EVAL is the weekly AI tooling intelligence report. We eval the tools so you can ship the models.

Subscribe free: buttondown.com/ultradune
Skill Packs for agents: github.com/softwealth/eval-report-skills
Follow: @eval_report

