It's May 2026 and there are a lot of coding models to choose from. Everything below is based on my personal experience running them in real agent loops (Claude Code, Copilot, and OpenCode), backed up by benchmark data and what other people are actually saying on Reddit.
Quick comparison
The benchmark column uses SWE-bench Verified, vendor-reported single-attempt numbers. LMSYS Arena ranks are from arena.ai/leaderboard.
| Model | Released | Context | $/M in | $/M out | SWE-bench Verified | LMSYS rank | Open weights |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Apr 2026 | 1M | $5 | $25 | 87.6% | #1 (thinking) | No |
| GPT-5.5 | Apr 2026 | 1M | $5 | $30 | 88.7% | #7 (high) | No |
| Claude Opus 4.6 | Late 2025 | 1M | $5 | $25 | 80.8% | #3 (thinking) | No |
| Gemini 3.1 Pro | Feb 2026 | 1M | $2-$4 | $12-$18 | 80.6% | #4 | No |
| Kimi K2.6 | Apr 2026 | 256K | $0.16 | $4.00 | 80.2% | #28 | Yes |
| Claude Sonnet 4.6 | Feb 2026 | 1M | $3 | $15 | 79.6% | #23 | No |
| DeepSeek V4-Flash | Apr 2026 | 1M | $0.14 | $0.28 | ~79% | #24 | No |
| Gemini 3 Flash (high) | Dec 2025 | 1M | - | - | 78.0% | - | No |
| Grok 4.3 | 2026 | 1M | $1.25 | $2.50 | ~73% | #34 | No |
| GPT-5.4 | Mar 2026 | 1M | $2.50 | $15 | - | #11 (high) | No |
| GPT-5.4 Mini | Mar 2026 | 400K | $0.75 | $4.50 | 56.2% | - | No |
| Qwen 3.5-max | Jul 2025 | 256K+ | $0.40 | $2.40 | - | #25 | Yes |
| Qwen 3 Coder | Apr 2025+ | 1M | $0.30 | $1.50 | - | - | Yes |
SWE-bench Verified scores are vendor-reported single-attempt where available. Independent reproductions on swebench.com typically land 4-8 points lower. Rows are ordered roughly by SWE-bench score where reported.
Read this first: benchmarks lie (and why that matters)
Before you trust any leaderboard, three things to keep in mind:
1. Training data contamination is real. Models get trained on the internet, and the internet contains the benchmarks. OpenAI publicly stopped reporting on SWE-bench Verified in early 2026 partly because the gap between "scoring well" and "actually being useful" got too large to ignore. Their own write-up is worth reading: Why we no longer evaluate on SWE-bench Verified.
2. The agent harness matters more than the model. On Terminal-Bench 2.0 the same model can swing 30 to 50 percentage points depending on which harness wraps it (Claude Code vs OpenHands vs a homegrown loop). When someone says "model X is best for agents," ask which harness, which tool set, which retry policy.
3. Benchmarks measure narrow tasks. LiveCodeBench and SWE-bench both test contained, well-specified problems. They don't measure: navigating a 200k-LOC repo you've never seen, refactoring without breaking three other files, holding context across a 4-hour session, knowing when to stop and ask. The model that wins your benchmark may not win your Tuesday.
The honest answer to "what's the best coding LLM" in 2026 is "the one that works best inside your specific loop, on your specific stack, at a price you can stomach." Use the rest of this article as a starting shortlist, not a verdict.
The 2026 landscape in one paragraph
Frontier closed models (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) are still the safest bet for "one-shot the hard ticket" work. GPT-5.5 sits at $30 / M output - serious money for long sessions - while Claude Opus 4.7 and Gemini 3.1 Pro come in around $12 to $25 / M output. The interesting movement is below them: DeepSeek V4, Qwen 3 Coder, and Kimi K2.6 are within a few benchmark points of frontier at a fraction of the cost, and the open-weight tier (Qwen 3 Coder, Kimi K2.6, DeepSeek weights) is now good enough that lots of teams run 60 to 80 percent of their agent traffic locally and only escalate the hard 20 percent to a frontier API.
The contenders
GPT-5.5 and GPT-5.4 (OpenAI)
- GPT-5.5 released: April 23, 2026
- GPT-5.4 released: March 5, 2026
- Context: 1M tokens (128K max output)
- Pricing: GPT-5.5: $5 / $30 per M (in/out); GPT-5.4: $2.50 / $15 per M; GPT-5.4 Mini: $0.75 / $4.50 per M (OpenAI pricing)
- Open weights? No
GPT-5.5 is OpenAI's current flagship - expanded to 1M context, stronger at multi-file reasoning, and sitting at Rank 7 (high-reasoning mode) on the LMSYS Arena leaderboard. GPT-5.4 is the mid-tier option at roughly half the output price: still very capable, and the default for most Cursor and Claude Code users who switched from GPT-4. GPT-5.4 Mini is the cheap, fast tier at $4.50 / M output with 400K context - the workhorse for anything that doesn't need the full flagship.
The catch with GPT-5.5 is $30 / M output. A serious multi-file refactor session can cost $5 to $15. Most people reserve it for the truly hard ticket and run GPT-5.4 for everything else.
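To make that "$5 to $15 per session" claim concrete, here's a minimal back-of-the-envelope sketch in Python. The per-step token counts and step count are illustrative assumptions, not measurements - agent loops re-send a lot of context on every turn, which is where the money actually goes.

```python
# Rough cost sketch for one multi-file refactor session on GPT-5.5.
# Token counts below are illustrative assumptions, not measured numbers.
INPUT_PRICE_PER_M = 5.00    # $ per million input tokens
OUTPUT_PRICE_PER_M = 30.00  # $ per million output tokens

steps = 40                       # model turns / tool calls in the session (assumed)
input_tokens_per_step = 60_000   # repo context + history re-sent each turn (assumed)
output_tokens_per_step = 2_500   # diffs, reasoning, tool arguments (assumed)

input_cost = steps * input_tokens_per_step / 1e6 * INPUT_PRICE_PER_M
output_cost = steps * output_tokens_per_step / 1e6 * OUTPUT_PRICE_PER_M

print(f"input:  ${input_cost:.2f}")                 # ~$12.00
print(f"output: ${output_cost:.2f}")                # ~$3.00
print(f"total:  ${input_cost + output_cost:.2f}")   # ~$15.00
```

Note that in this sketch most of the bill is re-sent input, not the headline $30 / M output rate - prompt caching changes the picture, but it's worth running your own numbers.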
What people are saying. A widely-discussed r/ClaudeCode head-to-head from late April 2026 ran GPT-5.5, GPT-5.4 and Claude Opus 4.7 against 56 real coding tasks in their native harnesses. The recurring sentiment in the comments: GPT-5.5 wins the hardest tickets but the price-per-result is brutal; GPT-5.4 at $15/M output is the practical daily driver for most people.
Claude Opus 4.7 and Sonnet 4.6 (Anthropic)
- Opus 4.7 released: April 16, 2026
- Sonnet 4.6 released: February 17, 2026
- Context: 1M tokens
- Pricing: Opus 4.7: $5 / $25 per M; Sonnet 4.6: $3 / $15 per M
- Open weights? No
Claude Opus 4.7 currently holds the #1 spot on the LMSYS Arena leaderboard in thinking mode (arena.ai/leaderboard) and scores 87.6% on SWE-bench Verified (Anthropic launch post, Vellum breakdown) - the top vendor-reported result among all models as of May 2026. It's the strongest coding agent model available in a closed API today. The 4.7 release addressed community complaints about 4.6's habit of over-scoping - it now stays more focused, touches only the files it was asked to, and explains its reasoning more clearly mid-run.
Claude Sonnet 4.6 is the workhorse: SWE-bench Verified 79.6% (Anthropic), same 1M context, at $15 / M output vs Opus 4.7's $25. Most Cursor and Claude Code users run Sonnet 4.6 as their daily driver and reach for Opus 4.7 when they hit a hard problem.
What people are saying. A widely-upvoted r/Anthropic thread on Opus 4.7 sums up community consensus: it's a step back for raw chat on claude.ai, but a clear improvement when driving Claude Code as an agent: better at staying in scope and explaining its reasoning mid-run. See also the r/ClaudeAI roundup of Opus 4.7 best practices from the Claude Code team. On the Sonnet side, the pinned r/ClaudeAI "I trust Sonnet as my daily driver now" thread captures the 4.6 mood: better code at a third the tokens once you write a markdown plan first, and the companion "Sonnet 4.6 is something else" thread piles on with VS Code and IntelliJ workflow recipes. The pattern most people land on: Sonnet 4.6 daily, Opus 4.7 when Sonnet gets stuck.
Gemini 3.1 Pro (Google)
- Released: February 2026
- Context: 1M tokens (1,048,576 to be exact)
- Pricing: $2 / $12 per M (input/output, up to 200K input); $4 / $18 per M (above 200K input) (Google AI pricing)
- Open weights? No
Gemini 3.1 Pro sits at Rank 4 on the LMSYS Arena and scores 80.6% on SWE-bench Verified (single-attempt, DeepMind model card). Its real superpower is the 1M-token context window - you can paste an entire mid-sized monorepo and ask architectural questions. Note: the pricing tiers are sneaky. The moment you cross 200K input tokens (exactly when the big context becomes useful), input price jumps from $2 to $4 per M and output from $12 to $18 per M. Model that cost before you commit.
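Here's a minimal sketch of how that tier break plays out, using the rates from the table above. The assumption (taken from the paragraph above) is that crossing 200K input tokens moves both input and output to the higher rate - verify against Google's pricing page before budgeting around it.

```python
# Sketch of Gemini 3.1 Pro's tiered pricing, using the rates quoted above.
# Assumption: the tier is decided by the input size of the request, and
# crossing 200K input tokens bumps both the input and output rates.
def gemini_31_pro_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # $ per million tokens
    else:
        in_rate, out_rate = 4.00, 18.00
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Just under the threshold vs. a genuinely big-context call:
print(f"${gemini_31_pro_cost(199_000, 8_000):.2f}")   # ~$0.49
print(f"${gemini_31_pro_cost(800_000, 8_000):.2f}")   # ~$3.34
```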
What people are saying. The Gemini 3.1 Pro discussion lives mostly in r/GeminiAI. Common complaint there: tooling support outside Google's own AI Studio still lags Anthropic and OpenAI.
DeepSeek V4 (DeepSeek)
- Released: April / May 2026
- Context: 1M tokens
- Pricing: V4-Flash: $0.14 / $0.28 per M; V4-Pro: $1.74 / $3.48 per M (DeepSeek pricing)
- Open weights? No (V4 API). V3 weights are MIT-licensed; V4 weights expected mid-2026.
DeepSeek V4-Flash is the current production API - the model behind deepseek-chat and deepseek-reasoner endpoints. It supports both thinking and non-thinking modes, 1M context, and already sits at Rank 24 on the LMSYS Arena. At $0.28 / M output it's one of the cheapest capable options available today. V4-Pro is the premium tier at $1.74 / $3.48 per M (currently 75% discounted to $0.435 / $0.87 until end of May 2026).
V3 weights remain MIT-licensed and widely deployed for self-hosted setups. V4 weights are expected mid-2026.
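For anyone who wants to try it, here's a minimal sketch of calling the deepseek-chat endpoint mentioned above through an OpenAI-compatible client. The base URL and model names follow DeepSeek's current convention and the API key is a placeholder; treat them as assumptions and check DeepSeek's docs for the V4-era specifics.

```python
# Minimal sketch: DeepSeek's API is OpenAI-compatible, so the standard
# openai client works with a different base_url. Model names follow the
# deepseek-chat / deepseek-reasoner convention mentioned above - verify
# against DeepSeek's docs before relying on them.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",        # placeholder
    base_url="https://api.deepseek.com",    # assumed, per DeepSeek's docs
)

resp = client.chat.completions.create(
    model="deepseek-chat",                  # non-thinking endpoint; "deepseek-reasoner" for thinking mode
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a unit test for parse_config()."},
    ],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```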
What people are saying. r/opencodeCLI's "DeepSeek V4 Flash is a monster" thread documents tool-call accuracy on large code-change evals at near-zero cost. r/LocalLLaMA's "DeepSeek V4 being 17x cheaper got me to actually..." reports 65% of daily coding work running identically on a model that costs basically electricity. The realist take in the r/GithubCopilot fullstack thread: Flash is more like Haiku - great as a fast tier, not your only model.
Qwen 3 and Qwen 3 Coder (Alibaba)
- Qwen 3 (dense): April 2025 (Qwen3-2504)
- Qwen 3.5 / Qwen3-2507: July 2025
- Context: 256K standard; up to 1M with Qwen 3 Coder
- Pricing (API): Qwen 3.5 Plus $0.40 / $2.40 per M; Qwen 3 Coder $0.30 / $1.50 per M
- Open weights? Yes - Apache 2.0, sizes from 0.6B to 235B MoE
The Qwen family is the open-source story of 2026. Qwen 3 Coder handles the bulk of routine coding tasks at a fraction of frontier cost - and the dense 32B and MoE 235B variants are self-hostable on consumer or prosumer hardware. Qwen3-2507 adds improved instruction following, better tool use, and a thinking / non-thinking mode switch. The LMSYS Arena puts Qwen 3.5-max-preview at Rank 25.
For anyone building a self-hosted coding agent or an on-prem product, Qwen 3 Coder is the obvious starting point.
What people are saying. r/LocalLLaMA's "Qwen Code — a powerful open-source coding agent + NO API key" walks through the standard self-hosted setup (LM Studio + Qwen3-Coder on port 1234, Qwen Code as the agent harness). The "Best way to use Qwen3-Coder for local AI coding" thread is a practical Q&A for tuning context size, quantization, and prompt format. The recurring pattern across both: route the routine 60–80% locally, escalate the rest to a frontier API.
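For reference, the setup those threads describe boils down to pointing any OpenAI-compatible client at LM Studio's local server. A minimal sketch - the model identifier is a placeholder for whatever name LM Studio shows for your loaded Qwen3-Coder build:

```python
# Minimal sketch of the self-hosted setup described above: LM Studio serving
# a Qwen3-Coder build on localhost:1234 via its OpenAI-compatible endpoint.
from openai import OpenAI

local = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local server
    api_key="lm-studio",                  # any non-empty string works locally
)

resp = local.chat.completions.create(
    model="qwen3-coder",  # placeholder: use the model name shown in LM Studio
    messages=[{"role": "user", "content": "Refactor this function to be pure: ..."}],
)
print(resp.choices[0].message.content)
```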
Grok 4.3 (xAI)
- Current version: Grok 4.3 (with a Grok 4.20-beta1 in testing)
- Context: 1M tokens
- Pricing: $1.25 / $2.50 per M (Grok 4.3) - substantially cheaper than its early 2026 pricing (docs.x.ai)
- Open weights? Grok 2 weights are open; Grok 4.x is not
Grok 4.3 is on the LMSYS Arena at Rank 34, with the experimental Grok 4.20-beta1 at Rank 9. At $2.50 / M output it's now one of the more affordable frontier-adjacent options. xAI describes it as excelling at "agentic reasoning, knowledge work, and tool use." The Grok 4.20 beta is showing very promising early results in coding agent loops per community reports.
What people are saying. r/grok's "Anyone see that Grok 4.3 is out and they reduced all the API pricing" thread is the cleanest summary of the price cut. r/DeepSeek's "Grok 4.3 is cheaper than DeepSeek V4 Pro" post argues Grok is now genuinely competitive on raw price-per-call (though cache costs are 50× DeepSeek's). For the 4.20-beta debate, the extended-context benchmark thread on r/singularity is the one to read.
Kimi K2.6 (Moonshot AI)
- Current version: Kimi K2.6 (successor to K2.5 and K2)
- Context: 256K
- Pricing (API): $0.16 / $4.00 per M (Moonshot API)
- Open weights? Yes - modified MIT, ~1T MoE params
Kimi K2.6 scores 80.2% on SWE-bench Verified (Kimi K2.6 tech blog), up sharply from K2.5's 70.8% on the swebench.com independent leaderboard, and currently sits at Rank 28 on the LMSYS Arena. The Kimi family has Deep Research, Sheets, an Agent Swarm mode, and Kimi Code, making it one of the more batteries-included open-weight options. The 1T MoE architecture is theoretically self-hostable and the weights are available.
What people are saying. The r/HowToAIAgent launch thread for K2.6 documents a 12+ hour autonomous run with 4,000+ tool calls optimizing Qwen3.5-0.8B inference in Zig. The long-horizon claim isn't marketing. The r/ArtificialIntelligence head-to-head with Opus 4.7 lands on a familiar split: Claude wins on careful long-context tasks, Kimi wins on raw throughput. The r/LocalLLaMA "About Kimi K2.6" thread is the best general-purpose discussion.
Pick two: speed, quality, cost
After a year of running these models in real loops, this is how I'd reduce the choice:
- Maximum quality, cost no object: Claude Opus 4.7 for agent loops (SWE-bench #1), Gemini 3.1 Pro for big-context architectural work, GPT-5.5 as a tie-breaker.
- Best balance for daily work: Claude Sonnet 4.6 or DeepSeek V4-Flash. Sonnet if you're in the Anthropic ecosystem already; DeepSeek V4-Flash if your CFO is watching.
- Cheapest credible option: Qwen 3 Coder API ($0.30 / $1.50) or DeepSeek V4-Flash ($0.14 / $0.28). Both are genuinely good, not "good for the price."
- Fastest for autocomplete-style use: Grok 4.3 ($2.50 out) or GPT-5.4 Mini ($4.50 out).
- Reasoning on hard bugs: DeepSeek R1 or Claude Opus 4.7 in thinking mode.
None of those answers are "the model with the highest SWE-bench score." That's the point.
Open weights vs closed weights in 2026
The open-weight tier matured faster than most people expected. In May 2026 the realistic open-weight shortlist for coding is:
- Qwen 3 Coder (0.6B to 235B MoE) - Apache 2.0, the default for self-hosted
- Kimi K2.6 - ~1T MoE, modified MIT, excellent long-context Q&A
- DeepSeek V3 weights - MIT, self-hostable while V4 weights are pending
- Qwen 3.5 dense models - Apache 2.0, easy to fine-tune
If you're building an on-prem product, working in a regulated industry, or just don't want your company's source code passing through a third-party API, you can ship a credible coding agent end-to-end on open weights today. You will give up some quality on the hardest tasks; you will gain auditability, predictable cost, and zero rate limits.
The pattern most teams have settled into: route 60-80% of agent traffic to a self-hosted Qwen 3 Coder or Kimi K2.6 setup, escalate the remaining 20-40% to Claude Opus 4.7 or Gemini 3.1 Pro via API. That blend keeps spend predictable while preserving access to frontier quality when you need it.
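What that routing looks like in practice varies by team; here's a minimal sketch of the idea, with a naive difficulty heuristic standing in for whatever escalation signal you actually use (ticket labels, diff size, "local model failed twice"). Endpoints and model names are placeholders.

```python
# Naive sketch of the local-first, escalate-on-hard-tasks routing pattern.
# The heuristic and model names are placeholders, not a recommendation.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
frontier = OpenAI(api_key="YOUR_FRONTIER_API_KEY")  # or any OpenAI-compatible gateway

def looks_hard(task: str, files_touched: int) -> bool:
    # Placeholder heuristic: escalate multi-file or explicitly risky work.
    return files_touched > 3 or "refactor" in task.lower() or "migration" in task.lower()

def run_task(task: str, files_touched: int) -> str:
    client, model = (
        (frontier, "frontier-model-id")   # placeholder frontier model name
        if looks_hard(task, files_touched)
        else (local, "qwen3-coder")       # placeholder local model name
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content
```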
Decision guide: which one should you pick?
You vibe-code in Cursor / Claude Code on a laptop and ship side projects. Claude Sonnet 4.6 is the default. Add DeepSeek V4-Flash as a cheap fallback for the boring 80% of edits.
You're building a customer-facing coding agent. Test on YOUR workload, not benchmarks. Claude Opus 4.7 and Gemini 3.1 Pro are the safest defaults for quality; Qwen 3 Coder is the safest default for unit cost.
You're an indie hacker watching the AWS bill. Qwen 3 Coder API ($0.30 / $1.50) handles 90% of your work. Bring in DeepSeek V4-Flash for the last 10%. You can run a real agent for under $20 a month.
You work in regulated / on-prem land. Self-host Qwen 3 Coder 32B on a single H100 (or a pair of 4090s), keep Kimi K2.6 around for long-doc tasks, never touch a closed API.
You're doing AI research or hard algorithmic work. DeepSeek R1 for deep thinking, Claude Opus 4.7 for writing it up.
Your team is already standardized on one cloud. Use what's nearest. The cost of integration friction usually dwarfs the model-quality difference for typical work.
FAQ
Is GPT-5.5 the "best" coding model? On some benchmarks it's near the top. In real agent loops in May 2026, Claude Opus 4.7 actually leads SWE-bench Verified at 87.6%, and at $30 / M output GPT-5.5 is hard to justify for anything other than the highest-stakes tasks. GPT-5.4 at $15 / M output is the more practical flagship choice for most people.
Can I really self-host a useful coding model? Yes. Qwen 3 Coder 32B on a single high-end GPU is genuinely productive for day-to-day coding. It will not match Claude Opus 4.7 on a hard refactor, but it will handle most of what you do in a day.
What about Llama 4, Mistral Codestral, and the rest? They're fine, but none of them broke into the top tier for coding-agent use in early 2026. They're worth tracking for the next refresh cycle.
Why isn't SWE-bench the gold standard if you're using it? Because numbers can be gamed and contamination is real - that's the whole point of the opening section. We're using SWE-bench Verified as the least bad common signal available in May 2026 while noting it measures narrow, contained tasks. OpenAI's own write-up on why they stopped reporting SWE-bench is worth reading.
How do I keep this list current? Bookmark the LMSYS Arena leaderboard and SWE-bench Verified. Check r/LocalLLaMA every couple of weeks. The model that wins your loop in November 2026 probably hasn't been released yet.
TL;DR
Benchmarks are just a starting filter. The coding-LLM world splits cleanly into three usable tiers: frontier closed (Claude Opus 4.7, GPT-5.5 / 5.4, Gemini 3.1 Pro) for the hard 20%, cheap-and-good (Claude Sonnet 4.6, DeepSeek V4-Flash, Grok 4.3) for daily driving, and open-weight (Qwen 3 Coder, Kimi K2.6, DeepSeek weights) for self-hosting and cost control.
Pick the one that fits your loop. Re-test every quarter. Don't let a leaderboard make the decision for you.