Collin Wilkins

Posted on Feb 27 • Originally published at collinwilkins.com

The AI Coding Model Wars: How Open Source Is Closing the Gap on Proprietary Coding Models

#ai #llm #news #opensource

Originally published at collinwilkins.com

Four major coding models launched in six days. Two proprietary. Two open source. The benchmark gap between the best and worst? Just 2.6 percentage points.

That number is the story of February 2026. There isn't a single model that is clearly winning. What matters now is which model fits your workflow, your budget, and how much you care about keeping your code off someone else's servers.

The week that broke the leaderboard

On February 5, Anthropic released Claude Opus 4.6 and OpenAI shipped Codex 5.3. Same day. Two very different philosophies, both claiming the top spot in coding performance.

Six days later, Zhipu AI dropped GLM-5. A 744-billion parameter open-source model under an MIT license. It scored within 1.6 points of Opus on SWE-bench. At roughly 1/45th the cost.

Then Kimi K2.5 from Moonshot AI. One trillion parameters, open source, agent swarm architecture that can coordinate 100 sub-agents in parallel.

Here's where things stand:

Model	SWE-bench Verified	Input Cost (per MTok)	License
Claude Opus 4.6	79.4%	~$5.00	Proprietary
GLM-5	77.8%	~$0.11	MIT
Codex 5.3	~77.3% (Terminal-Bench leader)	~$1.75	Proprietary
Kimi K2.5	76.8%	Open weight	Open Source

Sources: aifreeapi.com, Interconnects.ai, Winbuzzer

Razor-thin. Two years ago, the gap between the best and fifth-best model on any coding benchmark was 15+ points. Now the top four sit within a few points of each other and the rankings shuffle depending on which benchmark you pick.

Interconnects.ai put it well: workflow fit matters more than leaderboard position. I'd go further. If you're choosing a coding model based on SWE-bench scores alone, you're optimizing for the wrong thing.

The real differences are in how these models work, what they cost, and what you're allowed to do with them.

The proprietary heavyweights

Claude Opus 4.6

Opus 4.6 is the deep thinker. Its headline feature is Agent Teams, the ability to spin up 16+ parallel agents that coordinate on complex tasks. Anthropic demonstrated this by having agent teams build a 100,000-line C compiler across 2,000 sessions (Interconnects.ai).

The philosophy is autonomous. Give it a complex problem, set guardrails, let it work. A 1-million-token context window means it can hold entire codebases in memory, and deep reasoning chains let it plan multi-step refactors that other models lose track of halfway through.

The tradeoff is cost. At ~$5/MTok input, a heavy agentic session gets expensive fast. That C compiler demo reportedly cost $20,000 in API spend. I've run smaller agent workflows that still burned through $50-100 in an afternoon. For enterprise teams where engineer time costs more than API credits, that math works. For a solo dev, it probably doesn't.

Best for: Complex multi-file refactors, architectural changes, enterprise workflows where correctness matters more than cost.

Codex 5.3

Codex takes the opposite approach. Where Opus goes deep and autonomous, Codex goes fast and collaborative.

It leads Terminal-Bench at 77.3%, which measures terminal-based coding tasks closer to how developers actually work than isolated benchmark problems (Interconnects.ai). The real strength is interactive steering: you can redirect it mid-task without breaking context or restarting the conversation.

At ~$1.75/MTok input, that's about 3x cheaper than Opus. The ecosystem around it is mature, with deep integration into VS Code, GitHub Copilot, and the broader OpenAI toolchain.

Every.to described the split well: Opus is the model you set loose on a problem. Codex is the model you pair-program with.

Best for: Fast iteration, interactive development, teams already invested in the OpenAI ecosystem.

The philosophical split

This matters more than the benchmarks.

Opus says: "Tell me the goal, I'll figure it out." That works when the task is complex enough that you'd spend hours on it yourself. It fails when you need tight feedback loops or when the cost of an autonomous run gone sideways exceeds the cost of doing it manually.

Codex says: "Let's work on this together." That works for the daily grind. Writing functions, debugging, building features incrementally. It fails when you need sustained multi-step reasoning across a large surface area.

The model you want depends on how you work, not how it benchmarks. I keep Opus for architecture-level tasks and reach for Codex-class models when I'm iterating fast on implementation. Most days are implementation days.

But the proprietary debate is only half the story. The open-source models that showed up a week later made the whole conversation more interesting.

The open-source challengers

GLM-5

GLM-5 is the model that changed the math.

744 billion parameters in a Mixture-of-Experts architecture. MIT license. 77.8% on SWE-bench Verified, within 1.6 points of Opus 4.6.

At ~$0.11 per million input tokens through Zhipu's API, that's roughly 45x cheaper than Opus for comparable coding performance.

But cost isn't even the most interesting part.

GLM-5 was trained entirely on Huawei Ascend chips, no NVIDIA dependency. It's self-hostable. Because it's MIT-licensed, you can fine-tune it on your proprietary codebase without worrying about licensing terms.

The tooling ecosystem moved fast. Within days of release, GLM-5 was working with Claude Code, OpenCode, and Roo Code as a drop-in backend. Simon Willison noted that it handled agentic coding workflows comparably to proprietary alternatives. Multi-step tasks with tool use. The stuff that actually matters for real development work.

$0.11/MTok for 77.8% SWE-bench performance, MIT-licensed, self-hostable. Read that sentence again if you're still paying $5/MTok for routine coding tasks.

Best for: Budget-conscious teams, self-hosted environments, privacy-sensitive codebases.

Kimi K2.5

K2.5 from Moonshot AI takes a different angle on open source. One trillion total parameters with 32 billion active (another MoE architecture), but the standout feature is the agent swarm system. It can coordinate up to 100 sub-agents making 1,500 tool calls in parallel.

It scores 76.8% on SWE-bench Verified. Slightly below GLM-5 on pure coding benchmarks. But it has two things the others don't: strong frontend/visual understanding and native agent orchestration at a scale that would require serious custom infrastructure to replicate with other models.

If you're building something that involves UI generation, design-to-code workflows, or massive parallel agent tasks, K2.5 is worth evaluating. I haven't tested it as deeply as GLM-5, but the agent swarm capability is genuinely novel.

Best for: Frontend and visual tasks, large-scale agent orchestration, teams experimenting with multi-agent architectures.

Why open source matters now

The performance argument is settled. Open-source models match proprietary ones on coding benchmarks. The remaining arguments are about everything else.

GLM-5 at $0.11/MTok vs Opus at $5/MTok. For teams processing thousands of coding tasks per day, that's the difference between a rounding error and a budget line item. At that ratio, you could run 45 GLM-5 tasks for the cost of one Opus task. The volume math gets absurd fast.

Self-hosted means your code never leaves your infrastructure. For regulated industries, defense contractors, or anyone with strict data residency requirements, this isn't a nice-to-have. It's a hard requirement. I've talked to teams in healthcare and fintech who won't touch any cloud-hosted model for their core codebase. GLM-5 with an MIT license is the first model that gives them frontier-tier coding capability without that tradeoff.

There's a harder question behind the self-hosting argument, though. GLM-5 and Kimi K2.5 both come from Chinese companies — Zhipu AI and Moonshot AI, respectively. China's 2017 National Intelligence Law requires organizations to cooperate with state intelligence work. Multiple governments have already responded: the US banned Chinese AI models from government devices, Australia followed, Taiwan and Italy took similar action. CrowdStrike found that DeepSeek-R1 produces insecure code when prompted with politically sensitive topics. The scrutiny isn't theoretical. It's policy.

The distinction that matters is hosted API versus self-hosted weights. Using Zhipu's API at $0.11/MTok means your code routes through Chinese servers — a non-starter for most enterprises and outright banned in some jurisdictions. Self-hosting the MIT-licensed weights means your data never leaves your infrastructure, and Chinese intelligence law doesn't apply to weights you downloaded and run locally. This is actually the strongest argument for the open-source license. The MIT license isn't just a cost play. It's the escape valve that makes these models usable for teams that would otherwise never touch them.

Fine-tuning on your own codebase means the model learns your patterns, your conventions, your internal APIs. Proprietary models can't offer this. And if Zhipu raises prices or changes terms, you have the weights. You can host them anywhere. Bitdoze noted this portability as a key factor driving enterprise adoption.

The catch is real though. Self-hosting a 744B parameter model requires serious hardware. You're trading API costs for infrastructure costs. For many teams, the managed API at $0.11/MTok is the pragmatic choice anyway. But the option to self-host is what creates competitive pressure on pricing across the board.

When to use what

Skip the "which is best?" question. Wrong frame. The right question is "which is best for this task?"

Use Case	Recommended Model	Why
Complex multi-file refactors	Opus 4.6	Deepest reasoning, Agent Teams, 1M context
Fast iteration and pair programming	Codex 5.3	Speed, interactive steering, mature ecosystem
Budget-conscious / high-volume	GLM-5	Frontier quality at 1/45th the price
Self-hosted / privacy-first	GLM-5 (self-hosted)	MIT license, self-hostable, avoids Chinese API data routing concerns
Frontend / visual / design-to-code	Kimi K2.5	Strong vision capabilities, UI generation
Large-scale agent orchestration	Kimi K2.5	100 sub-agents, 1,500 parallel tool calls
Simple tasks (formatting, linting, boilerplate)	Haiku / GPT-4.1 mini / Flash	Don't overthink it. Cheap and fast wins here.

I wrote about the model selection framework in more detail. The core principle is matching capability to complexity. Using Opus to format a JSON file is like renting a crane to hang a picture frame.

The table above is a starting point. Your actual workflow will be messier. You'll find tasks that fall between tiers, models that surprise you on tasks they weren't "supposed" to handle, and edge cases where the cheap model is actually better because it doesn't overthink. Test on your workload. The table gives you a starting hypothesis.

The multi-model future

The teams getting the best results aren't picking one model. They're routing.

Simple tasks go to cheap, fast models. Complex tasks go to frontier models. Nobody runs a single EC2 instance type for their entire infrastructure. Same principle applies here.

The tooling supports this now. Claude Code, Cursor, Continue, and OpenCode all support model switching or multi-model configurations. You can set your default to a cost-efficient model and escalate when the task warrants it.

What a practical multi-model workflow looks like:

Scaffolding, boilerplate, simple edits → Haiku or GLM-5 (~$0.10-0.25/MTok)
Feature implementation, debugging, test writing → Codex 5.3 or Sonnet (~$1-3/MTok)
Architecture decisions, complex refactors, multi-file changes → Opus 4.6 (~$5/MTok)
Privacy-sensitive codebases → GLM-5 self-hosted (infrastructure cost only)

The cost difference compounds. A team that routes 80% of tasks to a cheap model and 20% to a frontier model might spend 5-10x less than a team that runs everything through Opus. The quality difference on those routine tasks? Negligible. I've tested this across a mix of refactoring, test generation, and boilerplate tasks. The cheap model handles 80% of them fine. The 20% where you need Opus, you really need Opus. But you don't need it for the other 80%.

GLM-5 at $0.11/MTok makes a great default for routine tasks, with Opus as the escalation path for hard problems. Even if you never self-host, even if you stay fully proprietary for your critical work, the existence of GLM-5 at that price point changes the economics of your entire workflow.

What comes next

The competitive picture will keep shifting. New models will launch. Benchmarks will get closer. Pricing will drop. That trend line isn't changing.

But the lesson from February 2026 is already clear. No single model wins everything. Each has a philosophy. Open source isn't "catching up" anymore; it's competitive, and the cost and privacy arguments seal it for many teams. Multi-model workflows are the pragmatic path forward, and the tooling finally supports them without duct tape.

If you're still defaulting to one model for every coding task, you're either overpaying or underperforming. Probably both.

Pick one task you're currently routing to an expensive model. Try it on GLM-5 or a smaller model. Measure the difference. You might be surprised how little you lose.

DEV Community