<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Collin Wilkins</title>
    <description>The latest articles on DEV Community by Collin Wilkins (@cwilkins507).</description>
    <link>https://dev.to/cwilkins507</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3796467%2F441e8a02-402c-4b8e-b897-d40223dbbf8b.jpeg</url>
      <title>DEV Community: Collin Wilkins</title>
      <link>https://dev.to/cwilkins507</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cwilkins507"/>
    <language>en</language>
    <item>
      <title>LLM Gateway Architecture: When You Need One and How to Get Started</title>
      <dc:creator>Collin Wilkins</dc:creator>
      <pubDate>Mon, 06 Apr 2026 11:49:06 +0000</pubDate>
      <link>https://dev.to/cwilkins507/llm-gateway-architecture-when-you-need-one-and-how-to-get-started-1817</link>
      <guid>https://dev.to/cwilkins507/llm-gateway-architecture-when-you-need-one-and-how-to-get-started-1817</guid>
      <description>&lt;p&gt;The monthly cloud invoice came in $12K higher than expected and nobody can explain it. &lt;/p&gt;

&lt;p&gt;Engineering added Opus for a summarization feature... Product had QA testing vision with GPT-4o... the data team switched from Sonnet to a fine-tuned model on Bedrock three weeks ago and forgot to mention it...&lt;/p&gt;

&lt;p&gt;This is the database connection problem, replayed for LLMs. Every service talking directly to an external provider, no abstraction layer, no visibility, no fallback. You solved this for database connections a decade ago with connection pools. The LLM gateway is the same pattern, and most mid-market engineering teams don't have one yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an LLM Gateway Actually Does
&lt;/h2&gt;

&lt;p&gt;An LLM gateway sits between your application code and your model providers. Instead of each service importing the OpenAI SDK or the Anthropic SDK or the Bedrock client and calling providers directly, every request routes through a single layer. Your code talks to the gateway. The gateway talks to the providers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0030q0i94aoz0qx7smql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0030q0i94aoz0qx7smql.png" alt="LLM Gateway Architecture" width="780" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think API gateway (Kong, Envoy), but built for LLM traffic patterns specifically. LLM calls stream responses, bill per token, throw provider-specific errors like Anthropic's 529 overloaded, and can run for 30+ seconds on complex prompts. A generic API gateway doesn't handle any of that well.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The practical value comes down to two things: reliability and cost visibility. Everything else the gateway does supports one of those.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On the reliability side, automatic fallback means Anthropic returns a 529 and the gateway retries on Bedrock. The outage becomes a log entry instead of a P1 incident. Prompt format differences between providers require some compatibility work upfront (system message handling, tool schemas), but once that's configured the failover is hands-off. Your application code calls one unified API regardless of which provider handles the request.&lt;/p&gt;
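
&lt;p&gt;The fallback loop itself is simple. Here's a minimal sketch of the pattern (not LiteLLM's actual internals; the provider names and the use of &lt;code&gt;RuntimeError&lt;/code&gt; as a stand-in for a 529 are illustrative):&lt;/p&gt;

```python
def call_with_fallback(providers, request):
    """providers: ordered list of (name, callable) pairs.

    Try each provider in order; a failure becomes a log entry
    and the request moves on to the next provider.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except RuntimeError as exc:  # stand-in for a 529/overloaded response
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```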

&lt;p&gt;On the cost side, tag every request with team, feature, and environment, and suddenly you can say "the summarization feature costs $2,400/month and 80% of that is the QA environment." That sentence is impossible without the gateway. With it, the answer takes five minutes to pull up. Routing rules send classification to Haiku and generation to Opus from a config file instead of hardcoding model names across repositories. Per-team rate limits and budget caps keep a runaway loop from burning through your monthly allocation in an afternoon.&lt;/p&gt;
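
&lt;p&gt;The aggregation behind that sentence is just a keyed sum. A toy sketch (real gateways persist this to a database; the team, feature, and dollar figures here are made up):&lt;/p&gt;

```python
from collections import defaultdict

# Tag every request with (team, feature, environment) and accumulate cost.
ledger = defaultdict(float)

def record(team, feature, env, cost_usd):
    ledger[(team, feature, env)] += cost_usd

record("support", "summarization", "prod", 0.021)
record("support", "summarization", "qa", 0.084)
record("growth", "lead-scoring", "prod", 0.003)

def feature_cost(feature):
    return sum(c for (t, f, e), c in ledger.items() if f == feature)

def env_share(feature, env):
    in_env = sum(c for (t, f, e), c in ledger.items() if f == feature and e == env)
    return in_env / feature_cost(feature)
```

With tags in place, "80% of summarization spend is the QA environment" is a two-line query instead of an unanswerable question.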

&lt;p&gt;Cost visibility gets the gateway approved. Once the team sees automatic failover survive a provider outage at 2am without a page, nobody proposes removing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do You Actually Need Multiple Providers?
&lt;/h2&gt;

&lt;p&gt;Most teams don't need multiple providers yet. Every major provider ships a model family with tiers designed for exactly this kind of routing. Anthropic has Opus for complex reasoning, Sonnet for everyday code and logic, Haiku for classification and lightweight tasks. OpenAI has a similar spread. Google has Gemini Pro and Flash. One provider, three tiers, handles a surprising percentage of use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4mwa5zq0qyq0zt1g1nh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4mwa5zq0qyq0zt1g1nh.png" alt="Single Provider 3 Levels" width="537" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The price gaps between tiers make this worth doing even without a gateway. As of April 2026, Claude API pricing per million tokens:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/MTok)&lt;/th&gt;
&lt;th&gt;Output ($/MTok)&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;Complex reasoning, coding agents, multi-step tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;Balanced performance, general production workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;High-throughput, simple queries, cost-sensitive apps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Routing a classification task from Opus to Sonnet saves 40%. Routing it to Haiku saves 80%. If half your LLM traffic is simple classification and extraction running on Opus, those numbers compound fast.&lt;/p&gt;
&lt;/blockquote&gt;
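
&lt;p&gt;The arithmetic behind those percentages, using the pricing table above (the token counts are invented for illustration):&lt;/p&gt;

```python
PRICES = {  # $ per million tokens (input, output), from the table above
    "opus-4.6": (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
}

def request_cost(model, input_tokens, output_tokens):
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A hypothetical classification request: 2,000 input tokens, 500 output tokens.
opus = request_cost("opus-4.6", 2000, 500)    # 0.0225
haiku = request_cost("haiku-4.5", 2000, 500)  # 0.0045
savings = 1 - haiku / opus                    # 0.80, the 80% from above
```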

&lt;p&gt;One provider means one API, one SDK, one bill, and one set of auth credentials, so this setup doesn't need a gateway. A model parameter that changes per task is all the routing it takes.&lt;/p&gt;

&lt;p&gt;I run LeadSync this way. Haiku handles lead scoring, Sonnet handles email content generation, and the routing is a config value per task. Same pattern works for agent orchestration: route expensive models to code review and content scoring where errors cost the most, cheaper models to research and classification. None of it requires a gateway because it all runs through one provider.&lt;/p&gt;
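
&lt;p&gt;That "config value per task" pattern is just a lookup table. A sketch of how I'd structure it (the model names are placeholders, not exact API identifiers):&lt;/p&gt;

```python
# Single-provider tier routing as a plain config lookup.
MODEL_BY_TASK = {
    "lead_scoring": "claude-haiku",       # cheap, high-volume classification
    "email_generation": "claude-sonnet",  # balanced content generation
    "code_review": "claude-opus",         # errors here cost the most
}

def model_for(task, default="claude-sonnet"):
    # Unknown tasks fall back to the mid-tier model.
    return MODEL_BY_TASK.get(task, default)
```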

&lt;p&gt;So when does a gateway actually earn its keep? Provider redundancy is the big one — if Anthropic goes down, a gateway fails over to Bedrock or Azure OpenAI automatically. Cost arbitrage matters when Bedrock pricing differs from direct API pricing on the same model. Capability gaps force multi-provider setups when no single provider is best at everything (vision, code generation, long context, and structured output might each have a different best-in-class model). And compliance requirements make multi-provider routing mandatory when European customers' data needs to route through EU-hosted models.&lt;/p&gt;

&lt;p&gt;If none of those apply yet, single-provider routing is the right starting point. Add the gateway when you actually hit the wall.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Add the Gateway
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single provider, fewer than 3 services&lt;/strong&gt; — No gateway needed. Route by model tier in your app config. Revisit when you cross 3 services or $3K/month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3+ services OR $3K+/month LLM spend&lt;/strong&gt; — Centralized gateway. Start with cost tagging and one fallback provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple providers required&lt;/strong&gt; (redundancy, compliance, capability gaps) — Centralized gateway with multi-provider routing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data residency requirements&lt;/strong&gt; — Layer edge routing on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can't answer "what is each team spending per feature per month," you need the gateway regardless of where you fall on this list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Architecture Patterns
&lt;/h2&gt;

&lt;p&gt;The deployment pattern depends on team size, how many services are making LLM calls, and whether you have data residency requirements.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Latency Impact&lt;/th&gt;
&lt;th&gt;Visibility&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sidecar Proxy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gateway runs as a library or sidecar alongside each service&lt;/td&gt;
&lt;td&gt;Minimal (in-process or localhost)&lt;/td&gt;
&lt;td&gt;Per-service only&lt;/td&gt;
&lt;td&gt;Small teams, fewer than 3 services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Centralized Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dedicated service all LLM traffic routes through&lt;/td&gt;
&lt;td&gt;One network hop&lt;/td&gt;
&lt;td&gt;Full cross-service visibility&lt;/td&gt;
&lt;td&gt;Mid-market teams, 3-20 services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edge Routing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gateway at CDN/edge, routing by geography or compliance zone&lt;/td&gt;
&lt;td&gt;Variable by region&lt;/td&gt;
&lt;td&gt;Full with regional breakdown&lt;/td&gt;
&lt;td&gt;Multi-region, data residency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Sidecar proxy&lt;/strong&gt; is the fastest way in. Import LiteLLM as a Python library, point your existing model calls at it, and you have basic routing and fallback working in an afternoon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralized gateway&lt;/strong&gt; is where most mid-market teams should land. Deploy LiteLLM in proxy mode (or Portkey) as a standalone service and point each application at the gateway's URL instead of the provider's. One dashboard shows every team's spend, every model's usage, every feature's cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge routing&lt;/strong&gt; adds geographic or compliance-based routing on top. European requests go to EU-hosted models for GDPR, APAC to the closest region for latency. Most teams don't need this yet. If you don't have data residency requirements, Pattern 2 covers you.&lt;/p&gt;

&lt;p&gt;The decision shortcut: fewer than 3 services, sidecar. Three or more, centralized. Data residency requirements, layer edge routing on top.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing Strategies That Actually Save Money
&lt;/h2&gt;

&lt;p&gt;The gateway gives you routing. The strategy determines how much value you extract from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-based routing&lt;/strong&gt; has the highest impact and the simplest logic. A support ticket classifier doesn't need Opus. Haiku handles it for a fraction of the cost with comparable accuracy on well-defined tasks. The gateway lets you make that distinction in one routing table instead of hunting through application code for hardcoded model names. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability-based routing&lt;/strong&gt; sends vision tasks to models with vision support, long-context requests to large-window models, and structured output requests to models with native JSON mode. Without a gateway this means importing four SDKs and writing provider-specific conditionals that nobody wants to maintain. With a gateway you define the capability map once and application code doesn't care which model handles the request.&lt;/p&gt;
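
&lt;p&gt;Defined once at the gateway, the capability map is a small dictionary; application code asks for a capability, not a provider. Model names below are illustrative, not a recommendation:&lt;/p&gt;

```python
# Capability-based routing: one map, no provider-specific conditionals
# scattered through application code.
CAPABILITY_MAP = {
    "vision": "gpt-4o",
    "long_context": "gemini-pro",
    "structured_output": "gpt-4o",
    "default": "claude-sonnet",
}

def route_by_capability(capability):
    return CAPABILITY_MAP.get(capability, CAPABILITY_MAP["default"])
```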

&lt;p&gt;&lt;strong&gt;Latency-based routing&lt;/strong&gt; sends streaming chat responses to the fastest available provider and batch jobs to the cheapest. The gateway can measure provider performance empirically and shift traffic away from degraded providers before users start complaining. This is where the reliability engineering value shows up, since the gateway is making routing decisions based on real-time performance data rather than static configuration.&lt;/p&gt;
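
&lt;p&gt;A minimal sketch of the empirical part: keep a sliding window of observed latencies per provider and route to the lowest average. This is my own toy version of the idea, not any gateway's implementation:&lt;/p&gt;

```python
from collections import deque

class LatencyRouter:
    def __init__(self, providers, window=20):
        # One bounded sample window per provider.
        self.samples = {p: deque(maxlen=window) for p in providers}

    def observe(self, provider, seconds):
        self.samples[provider].append(seconds)

    def pick(self):
        # Providers with no samples average to 0.0, so new or
        # freshly reset providers get tried first (cheap exploration).
        def avg(p):
            s = self.samples[p]
            return sum(s) / len(s) if s else 0.0
        return min(self.samples, key=avg)
```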

&lt;p&gt;&lt;strong&gt;A/B testing&lt;/strong&gt; routes a percentage of traffic to a new model, compares quality against the baseline, and promotes or rolls back. Without a gateway this means feature flags, comparison infrastructure, and new deployment code. With a gateway you change a routing weight and let it run.&lt;/p&gt;
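
&lt;p&gt;The routing weight itself is just a weighted draw. A sketch (the 90/10 split and model names are examples, not a recommendation):&lt;/p&gt;

```python
import random

def choose_model(weights, rng=random):
    # weights: mapping of model name to traffic share.
    # Changing the rollout percentage is a config change, not a deploy.
    models = list(weights)
    return rng.choices(models, weights=[weights[m] for m in models], k=1)[0]

split = {"claude-sonnet": 0.9, "candidate-model": 0.1}
```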

&lt;p&gt;Most teams combine cost-based with one other strategy. That covers the vast majority of the value.&lt;/p&gt;

&lt;p&gt;Here's what a basic cost-based routing config looks like in LiteLLM proxy mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fast-classify"&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-haiku-4-5-20251001"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate"&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-sonnet-4-20250514"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate"&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock/anthropic.claude-sonnet-4-v1"&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;routing_strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple-shuffle"&lt;/span&gt;
  &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application calls &lt;code&gt;fast-classify&lt;/code&gt; for ticket routing and tagging, &lt;code&gt;generate&lt;/code&gt; for content and reasoning. Two entries for &lt;code&gt;generate&lt;/code&gt; means if the direct Anthropic API fails, the gateway retries on Bedrock automatically. The routing decision lives in this config file, not scattered across your application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build vs. Adopt
&lt;/h2&gt;

&lt;p&gt;Most teams should start with &lt;strong&gt;LiteLLM&lt;/strong&gt; in proxy mode. It's open source, supports 100+ providers through a unified API, runs as a Python library or standalone proxy, and handles cost tracking, fallback, and rate limiting out of the box. SaaS alternatives like Portkey and Helicone exist if you don't want to run the proxy yourself, but the per-request pricing adds up. Building a custom routing layer is almost never justified — routing models by task complexity is a configuration problem, not a software engineering problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting It Into Production
&lt;/h2&gt;

&lt;p&gt;The sequence matters more than the timeline. With AI-assisted scaffolding you can get through this in a few days, but doing the steps out of order is where teams get burned.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deploy the proxy with one service.&lt;/strong&gt; Point a single existing service at LiteLLM without any changes. If something breaks, you want to find out before migrating anything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add cost tags.&lt;/strong&gt; Team, feature, environment on every request. Let baseline data collect. This is where teams have their first real conversation about LLM spend, because the data almost always surfaces something nobody expected — QA running expensive calls around the clock, a retry loop doubling costs on one endpoint, a feature nobody uses still generating hundreds of requests a day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure automatic fallback.&lt;/strong&gt; Primary provider returns a 429 or 529, gateway retries on a secondary. Test by blocking the primary in staging while you're watching, not during an actual outage at 2am.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downgrade one use case.&lt;/strong&gt; Pick a task where you're using an expensive model for something simple and switch it to Haiku-class. Measure quality against your baseline. If it holds (and it usually does for classification and extraction), that's your first real cost savings. If quality drops, switch back and try a different task boundary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roll out and publish the dashboard.&lt;/strong&gt; I know, &lt;em&gt;another&lt;/em&gt; dashboard to worry about, but this is the one that changes spending behavior.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Migrate remaining services and share the cost dashboard with engineering leadership. Teams that can see their LLM costs start optimizing without anyone writing a policy memo.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Goes Wrong
&lt;/h2&gt;

&lt;p&gt;This section matters more than the implementation playbook, because the mistakes are where the real money goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The QA environment is the silent budget killer.&lt;/strong&gt; A test suite running Opus calls against every PR, 24/7, with nobody reviewing the results. The fix takes five minutes once cost tagging by environment is in place, but without it the spend is invisible. This is the single most common cost surprise and it's also the easiest to fix, which makes it a good argument for the gateway all by itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry loops compound faster than you'd expect.&lt;/strong&gt; A service gets a 429 rate limit, retries with exponential backoff, but the backoff ceiling is set too high and the service hammers the same provider with progressively more expensive calls (longer prompts on each retry because context accumulates). Gateway fallback routing eliminates this entirely since the retry goes to a different provider instead of beating on the rate-limited one.&lt;/p&gt;
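
&lt;p&gt;For teams that keep client-side retries anyway, the ceiling is the knob that matters. A minimal sketch of capped exponential backoff (parameter values are illustrative):&lt;/p&gt;

```python
def backoff_delays(attempts, base=1.0, cap=8.0):
    # Without the cap, the delay keeps doubling and retries accumulate
    # behind ever-longer waits; with it, retry timing stays bounded.
    return [min(cap, base * (2 ** i)) for i in range(attempts)]

backoff_delays(6)  # [1.0, 2.0, 4.0, 8.0, 8.0, 8.0]
```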

&lt;p&gt;&lt;strong&gt;Over-engineering the routing logic.&lt;/strong&gt; The first strategy should be simple: expensive model for complex tasks, cheap model for simple tasks, one fallback provider. The teams that get the most value from gateways are the ones that start with simple routing rules and add more only when the cost data shows they need them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating the gateway as a one-time cost savings project.&lt;/strong&gt; Teams deploy the gateway, save 30% through routing, and call it done. They never build the cost dashboard or set up ongoing tagging for new services. Cost savings are great, but the bigger win is permanent visibility into what you're spending, where, and why. That requires treating the gateway as infrastructure, not a project.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about AI infrastructure and engineering every couple weeks. &lt;a href="https://buttondown.com/collinwilkins" rel="noopener noreferrer"&gt;Subscribe to the newsletter&lt;/a&gt; if this was useful.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Claude Code Productivity Paradox</title>
      <dc:creator>Collin Wilkins</dc:creator>
      <pubDate>Wed, 11 Mar 2026 18:41:16 +0000</pubDate>
      <link>https://dev.to/cwilkins507/the-claude-code-productivity-paradox-47go</link>
      <guid>https://dev.to/cwilkins507/the-claude-code-productivity-paradox-47go</guid>
      <description>&lt;p&gt;&lt;em&gt;originally published at &lt;a href="https://collinwilkins.com/articles/claude-code-productivity-paradox" rel="noopener noreferrer"&gt;collinwilkins.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Anthropic surveyed 132 of their own engineers about Claude Code. The numbers looked incredible. 67% more merged PRs per day. Usage jumped from 28% to 59% of daily work. Self-reported productivity gains between 20% and 50%.&lt;/p&gt;

&lt;p&gt;Then someone checked the organizational dashboard. The delivery metrics hadn't moved.&lt;/p&gt;

&lt;p&gt;That's the productivity paradox, and the gap between those two sets of numbers is where it gets interesting.&lt;/p&gt;

&lt;p&gt;I've been running Claude Code as my main dev tool for months now (it's basically replaced my terminal workflow at this point). I've written about &lt;a href="https://collinwilkins.com/articles/context-engineering-ai-coding-tools" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt;, &lt;a href="https://collinwilkins.com/articles/ai-agent-workflow-claude-code" rel="noopener noreferrer"&gt;specialized agents&lt;/a&gt;, and &lt;a href="https://collinwilkins.com/articles/from-vibe-coding-to-agentic-engineering" rel="noopener noreferrer"&gt;agentic orchestration patterns&lt;/a&gt;. All of that assumed AI coding tools deliver net-positive outcomes. Turns out the picture is messier than I thought.&lt;/p&gt;

&lt;h2&gt;
  
  
  The individual numbers are impressive
&lt;/h2&gt;

&lt;p&gt;Those Anthropic survey numbers are all individual metrics — how much each engineer used the tool, how fast they felt, how many PRs they shipped. By any of those measures, the tool was clearly working.&lt;/p&gt;

&lt;p&gt;The solo developer story is even more dramatic. A case study published in February 2026 documented one developer delivering what was scoped as a "4 people x 6 months" project in 2 months, working alone. That's a raw 12x multiplier on person-months (the kind of number that gets screenshotted and passed around without context), and by my rough math, about 3x when you weight for task mix. The breakdown by task type tells the real story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Boilerplate and scaffolding&lt;/td&gt;
&lt;td&gt;~10x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex logic and debugging&lt;/td&gt;
&lt;td&gt;~2x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture and planning&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That distribution matters. The mechanical work got dramatically faster, but the judgment work barely moved.&lt;/p&gt;
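
&lt;p&gt;A quick Amdahl-style calculation shows why the overall multiplier lands near 3x rather than 12x. The task-mix shares below are my own illustrative assumptions, not numbers from the case study:&lt;/p&gt;

```python
def overall_speedup(mix):
    """mix: list of (share_of_original_time, per_task_speedup) pairs."""
    new_time = sum(share / speedup for share, speedup in mix)
    return 1.0 / new_time

mix = [
    (0.4, 10.0),  # boilerplate and scaffolding
    (0.4, 2.0),   # complex logic and debugging
    (0.2, 1.0),   # architecture and planning
]
overall_speedup(mix)  # roughly 2.3x, nowhere near the raw 12x
```

The unaccelerated 20% dominates: even if boilerplate became infinitely fast, this mix would top out around 3.3x.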

&lt;p&gt;This matches my own usage almost exactly. When I'm scaffolding a new module or wiring up boilerplate integrations, Claude Code flies. I can stand up a full project structure in minutes. But the architecture decisions, the "which service owns this data" conversations, the debugging where the root cause is three layers removed from the symptom? Those take the same time they always did.&lt;/p&gt;

&lt;p&gt;Faros AI's analysis confirmed the same shape: 21% more tasks completed, 98% more PRs merged.&lt;/p&gt;

&lt;p&gt;If you stopped reading here, the conclusion is obvious... Ship Claude Code to your whole team and watch the numbers climb!&lt;/p&gt;

&lt;p&gt;Don't stop reading here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The organizational numbers tell a different story
&lt;/h2&gt;

&lt;p&gt;Faros AI measured DORA metrics on the same teams: deployment frequency, lead time, change failure rate, time to restore service. Unchanged. Meanwhile code review times increased 91%. The METR study found experienced developers on familiar codebases took 19% longer on real-world tasks while estimating they were 20% faster. Developers felt faster. The customer deliverables didn't move.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the speedup goes
&lt;/h2&gt;

&lt;p&gt;Code generation got faster but everything downstream of it didn't. Planning, design, prioritization, code review, QA — still run at the same speed. When one stage of the pipeline accelerates and the rest stays flat, you get a pile-up at the next bottleneck, not faster delivery.&lt;/p&gt;

&lt;p&gt;What I'm seeing in a lot of teams right now: AI writes the code, opens a PR, and then another AI tool (or human) on the review side suggests meaningful changes. That leads to additional back-and-forth after the PR is already open — churn that doesn't show up in "PRs merged" but absolutely shows up in cycle time.&lt;/p&gt;

&lt;p&gt;Anthropic's own survey found that more than 50% of their engineers could "fully delegate" only 0-20% of their daily work to Claude Code. These are Anthropic engineers, on Anthropic's own tool, in an environment optimized for exactly this usage. If the people who built the tool can fully hand off at most a fifth of their work, the ceiling for a typical team is lower.&lt;/p&gt;

&lt;p&gt;I'd put myself in that 0-20% bucket too. Most of my Claude Code usage is collaborative, not delegated. I'm reviewing output, re-prompting when it drifts, catching architectural decisions the agent doesn't have context for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "build it because you can" trap
&lt;/h2&gt;

&lt;p&gt;There's a subtler problem that doesn't show up in any of the studies. Because coding delivery sped up, more features feel feasible. A feature that would have taken two sprints now looks like a long afternoon, and that changes the calculus on whether it's worth building.&lt;/p&gt;

&lt;p&gt;It shouldn't. The cost of building a feature was never just the implementation time. It's the maintenance, the cognitive load on the team, the opportunity cost of not building something else, the QA cycles, the documentation, the support burden. AI made the implementation cheaper. It didn't make any of those other costs cheaper.&lt;/p&gt;

&lt;p&gt;What I'm seeing is that the bar for "let's just build it" has dropped. It's easy to prompt a new feature into existence, so naturally the threshold for opening a PR lowers. Teams should keep a high bar and think hard about whether a feature is worth shipping at all, regardless of how fast it can be coded.&lt;/p&gt;

&lt;p&gt;A lot of teams are also freezing hiring or laying people off based on early perceptions of AI development speed. There's a common assumption that AI simply raises the bar for everybody. In my experience, that's not the case. The gains are uneven, task-dependent, and often illusory when you measure end-to-end.&lt;/p&gt;

&lt;p&gt;A Hacker News thread from March 2026 captured the human side of this well. Comments offered the most useful frame: "Do you enjoy the 'micro' of getting bits of code to work, or the 'macro' of building systems that work? If it's the former, you hate AI agents. If it's the latter, you love AI agents." That split is real and doesn't resolve with better tooling. It requires honest conversations about what each person's role becomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for team adoption
&lt;/h2&gt;

&lt;p&gt;Your current metrics are probably measuring the wrong things. If you're an engineering manager trying to figure out whether AI coding tools are working, here's what the data actually says.&lt;/p&gt;

&lt;p&gt;The first thing I'd change is what you're counting. More PRs per developer is a real number. It doesn't mean the team ships better software faster. If review times are climbing and defect rates are flat, the bottleneck moved from writing to reviewing. Measure the bottleneck, not the part that got faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in review infrastructure before scaling AI-generated output.&lt;/strong&gt; The review time increase isn't a tooling problem you fix with a faster CI pipeline. That's a structural problem. If you're rolling out AI coding tools to a team without simultaneously expanding review capacity, you're building pressure on the part of the pipeline least equipped to absorb it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set expectations by task type, not tool type.&lt;/strong&gt; The speedup distribution from that solo dev case study is the most useful number to take away from this data. Boilerplate flies. Architecture doesn't move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track the boring metrics.&lt;/strong&gt; If you're measuring AI tool ROI through surveys and individual PR counts, you're measuring perception. Track cycle time end-to-end. Look at defects per deploy. Pull time-from-commit-to-production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't let the tool lower your refactoring standards.&lt;/strong&gt; When you're iterating on a feature, the original design sometimes calls for a refactor. When the LLM can work around the existing structure, the willingness to do that refactor drops. Fight that. Leave the codebase better than you found it, same as always.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review AI output in a fresh session.&lt;/strong&gt; AI is biased in the code it writes. Common patterns, familiar abstractions, the path of least resistance. The best way to catch those inefficiencies is to review with fresh eyes, outside the context window that produced the code. A thorough human review in a separate session will catch things that in-context review misses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use AI to bulldoze friction.&lt;/strong&gt; The friction you feel during development, the code review pushback, the design debate, the test that keeps failing, that friction exists for a reason. Using AI code generation to power through it faster doesn't remove the underlying problem. It just ships the problem to production. These are the same engineering practices we've always applied. &lt;/p&gt;

&lt;h2&gt;
  
  
  What you're actually measuring
&lt;/h2&gt;

&lt;p&gt;Confusing "more output" with "better outcomes" is how teams make expensive adoption decisions. The teams that get real value from Claude Code won't be the ones that hand it to every developer and watch PR counts climb. They'll redesign their workflow around the new shape of the work — where writing is cheap, reviewing is expensive, judgment calls haven't gotten any easier, and half the features that feel feasible probably aren't worth building.&lt;/p&gt;

&lt;p&gt;I've built my personal workflow around these tools. This piece is about whether they're actually working.&lt;/p&gt;

&lt;p&gt;Both questions matter. Most teams are only asking the first one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Context Engineering for AI</title>
      <dc:creator>Collin Wilkins</dc:creator>
      <pubDate>Tue, 03 Mar 2026 11:52:58 +0000</pubDate>
      <link>https://dev.to/cwilkins507/context-engineering-for-ai-2fof</link>
      <guid>https://dev.to/cwilkins507/context-engineering-for-ai-2fof</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://collinwilkins.com/articles/context-engineering" rel="noopener noreferrer"&gt;collinwilkins.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This one is long: a more detailed follow-up to my &lt;a href="https://collinwilkins.com/articles/context-engineering-ai-coding-tools" rel="noopener noreferrer"&gt;first article on the topic&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Two things before we get into it: &lt;br&gt;
First, you'll walk away with at least one thing you can apply this week to get more consistent results from your AI coding tool. &lt;/p&gt;

&lt;p&gt;Second, the examples throughout use Claude Code — that's my daily stack, not an endorsement. Every principle here applies to Cursor, Copilot, or whatever you're running. &lt;/p&gt;

&lt;p&gt;Let's start with an example that probably happened to you this week...&lt;/p&gt;

&lt;p&gt;You asked your AI coding tool to add a new API endpoint. It generated exactly what you needed: right naming convention, file location, and imports. You closed the task in 15 minutes.&lt;/p&gt;

&lt;p&gt;Next morning, you asked for another endpoint. It used a naming pattern from a framework you dropped three months ago. The file landed in the wrong directory. It imported a library that's no longer in the dependency tree. You spent 40 minutes cleaning it up.&lt;/p&gt;

&lt;p&gt;Then a teammate tried the same tool on the same codebase. Their output matched neither of yours.&lt;/p&gt;

&lt;p&gt;Same model, same codebase, three completely different results. The variable nobody names: what the AI could actually see. Controlling that is a discipline, and most developers aren't practicing it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Debate Everyone's Having Wrong
&lt;/h2&gt;

&lt;p&gt;A few weeks ago, an HN thread along the lines of "Cursor's context is 10X better than Claude Code's" hit the front page with 150+ points and hundreds of comments. Developers trading war stories about which tool retrieves the right files, which one hallucinates project conventions, which one actually understands a large codebase.&lt;/p&gt;

&lt;p&gt;The thread was comparing tool features — how Cursor auto-indexes and retrieves files by semantic similarity versus how Claude Code relies on explicit file reads and instruction routing. Worth knowing. &lt;/p&gt;

&lt;p&gt;But none of it explains why the same tool, same codebase, same developer produces solid output on Tuesday and unshippable output on Thursday. That gap is context engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context engineering&lt;/strong&gt; is the discipline of controlling what information an AI coding tool has access to, how that information is structured, and what instructions govern its behavior. It's distinct from prompt engineering (what you say in a given session) and model selection (which AI you use). You can write perfect prompts and pick the most capable model and still get inconsistent results if the context is wrong.&lt;/p&gt;

&lt;p&gt;I went into this in more detail &lt;a href="https://collinwilkins.com/articles/enterprise-best-practices" rel="noopener noreferrer"&gt;here&lt;/a&gt;: this variability is designed into models, because the same sampling that produces varied output is part of what gives them flexible reasoning.&lt;/p&gt;

&lt;p&gt;Developers who understand this produce more consistent work than any tool comparison would predict. The ones still debating Cursor vs. Claude Code are optimizing the wrong variable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Context Quality Determines Output Quality
&lt;/h2&gt;

&lt;p&gt;Every AI coding tool generates predictions from everything in its context window. That's not just your last message. It includes the files the tool read earlier in the session, the instruction files it loaded at startup, documentation it retrieved, and the full conversation history. You're getting a response to everything the model has seen, not just what you typed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Primacy Problem.&lt;/strong&gt; Models recall information near the beginning and end of their context window better than material buried in the middle. The implication is direct: your most important instructions — naming conventions, anti-patterns, what to never modify — belong at the top of your config files, not tucked into section 7 after a wall of boilerplate. Instructions at line 300 of a bloated CLAUDE.md are functionally invisible no matter how well-written they are. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukcgsqb1p9lcxs0eojz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukcgsqb1p9lcxs0eojz8.png" alt="The Primacy Problem — models recall the beginning and end of their context window better than the middle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ask your AI tool: "What naming conventions does this project use?" If it answers correctly without reading a specific file, your context is working. If it asks for clarification or gives you a generic answer, your context engineering needs work.&lt;/p&gt;

&lt;p&gt;Garbage in, garbage out applies at the context level, not just the prompt level. A well-crafted prompt can't compensate for context that's missing, outdated, or structurally wrong.&lt;/p&gt;

&lt;p&gt;Most developers understand this in the abstract. They just haven't mapped which layer they're actually investing in. That mapping explains almost everything about inconsistent results.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Four Layers of Context
&lt;/h2&gt;

&lt;p&gt;Most developers treat context as a single thing: what they've said so far this session. It's actually four layers with very different durability and very different impact.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;How Long It Lasts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Project structure&lt;/td&gt;
&lt;td&gt;Folder names, file naming, co-location of decisions&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instruction files&lt;/td&gt;
&lt;td&gt;CLAUDE.md, Cursor rules, .github/copilot-instructions.md&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File-level docs&lt;/td&gt;
&lt;td&gt;Comments, type annotations, explicit naming&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session context&lt;/td&gt;
&lt;td&gt;Files read this session, conversation history&lt;/td&gt;
&lt;td&gt;Ephemeral&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb2zal8bj3atdj7xe04r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb2zal8bj3atdj7xe04r.png" alt="The Four Layers of Context — layers 1-3 are permanent, layer 4 resets every session" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most developers spend all their energy on Layer 4. The right prompt for this session. Better instructions in this particular message. Layer 4 resets every session. Everything you engineer there disappears at session end.&lt;/p&gt;

&lt;p&gt;Layers 1-3 are permanent. They work whether you're logged in or not. They benefit every session, every developer on the team, every AI tool that touches the codebase.&lt;/p&gt;

&lt;p&gt;The math is simple: one hour invested in a CLAUDE.md instruction file compounds across every future session. One hour spent crafting a better prompt compounds across exactly one. Fix the bottom layers and the top layer takes care of itself. The best place to start is your instruction file.&lt;/p&gt;
&lt;h2&gt;
  
  
  CLAUDE.md Patterns That Actually Work
&lt;/h2&gt;

&lt;p&gt;Every major AI coding tool has an equivalent to CLAUDE.md. Cursor has &lt;code&gt;.cursor/rules&lt;/code&gt;. GitHub Copilot has &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;. OpenAI and others use AGENTS.md. Same principle across all of them: a file the AI reads at session start that shapes its behavior for everything that follows. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This file is loaded automatically at the start of every session.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most teams write this file wrong. They treat it like a README.&lt;/p&gt;

&lt;p&gt;A README explains your project. A table of contents tells you where everything else lives. Your instruction file should be the second: a navigation layer with pointers to where conventions are documented, not exhaustive documentation of those conventions. When the agent needs your GraphQL design patterns, it gets routed to the right file. The patterns don't live in the root config. The root config tells the agent where to find them.&lt;/p&gt;

&lt;p&gt;Primacy applies here too. Put project overview and critical anti-patterns at the top. This is deliberate architecture, not formatting preference.&lt;/p&gt;

&lt;p&gt;A monolithic root CLAUDE.md that's 800 lines long is context bloat with a reading time penalty every session. Move subdirectory-specific conventions into CLAUDE.md files within those subdirectories. The root file stays lean.&lt;/p&gt;

&lt;p&gt;Most teams skip the exclusion zone entirely. Naming what the AI should never touch matters as much as naming what it can do. Generated code, migration files, lock files, vendor directories — put them on an explicit list. Relying on the model to infer off-limits territory is how you get PRs that modify auto-generated files.&lt;/p&gt;

&lt;p&gt;Add YAML frontmatter to your project documentation. When docs, ADRs, and notes carry structured metadata, they become machine-queryable. Ask the agent for "anything tagged with payment-flow" and it surfaces the right files rather than grepping blindly. That's the closest thing to semantic search without native support.&lt;/p&gt;
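&lt;p&gt;A minimal sketch of what that frontmatter can look like; the field names and values here are illustrative, not a required schema:&lt;/p&gt;

```yaml
---
# Illustrative frontmatter for a doc like docs/decisions/003-payment-retry.md.
# Field names are examples; keep whatever set your team will actually maintain.
title: Payment retry strategy
tags: [payment-flow, reliability]
status: accepted
date: 2026-01-14
---
```

&lt;p&gt;With tags like these in place, "anything tagged with payment-flow" becomes a filter over frontmatter instead of a blind grep.&lt;/p&gt;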

&lt;p&gt;Here's a minimal skeleton that reflects the structure that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project Name -- CLAUDE.md&lt;/span&gt;

&lt;span class="gu"&gt;## Overview&lt;/span&gt;
[2-3 sentences: what this is, what it does, the core stack]

&lt;span class="gu"&gt;## Folder Map&lt;/span&gt;
src/api/        - Route handlers. One file per domain.
src/services/   - Business logic. Stateless functions only.
src/models/     - Prisma schema and type definitions.
docs/           - Project documentation. Read before making architectural changes.
docs/decisions/ - Architecture Decision Records. One file per major decision.

&lt;span class="gu"&gt;## Tech Stack&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Node.js 22 / TypeScript 5
&lt;span class="p"&gt;-&lt;/span&gt; Prisma + PostgreSQL
&lt;span class="p"&gt;-&lt;/span&gt; Next.js 15 App Router

&lt;span class="gu"&gt;## Naming Conventions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Components: PascalCase
&lt;span class="p"&gt;-&lt;/span&gt; Utils and hooks: camelCase
&lt;span class="p"&gt;-&lt;/span&gt; Files: kebab-case
&lt;span class="p"&gt;-&lt;/span&gt; Database tables: snake_case

&lt;span class="gu"&gt;## Do Not Touch&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; /migrations   - auto-generated, never edit manually
&lt;span class="p"&gt;-&lt;/span&gt; /generated    - prisma client output, run &lt;span class="sb"&gt;`npx prisma generate`&lt;/span&gt; to rebuild
&lt;span class="p"&gt;-&lt;/span&gt; src/vendor/   - third-party code, not ours

&lt;span class="gu"&gt;## Anti-Patterns&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; No raw SQL -- use Prisma queries
&lt;span class="p"&gt;-&lt;/span&gt; No &lt;span class="sb"&gt;`any`&lt;/span&gt; types -- use proper types or &lt;span class="sb"&gt;`unknown`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; No default exports -- named exports only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is roughly the structure this vault uses, adapted for a software project. It took many iterations to reach something stable. That's the nature of the document: you evolve it rather than write it once, so version-control it with your other code changes.&lt;/p&gt;

&lt;p&gt;The ROI data on this is concrete. Aakash Gupta's PM OS (news.aakashg.com, Feb 2026) used a well-crafted CLAUDE.md with skills and sub-agents to reduce PRD creation from 4-8 hours to 30 minutes. Harry Zhang called CLAUDE.md the "highest ROI habit" in Claude Code. Faros AI's 2026 measurement of Claude Code usage across engineering teams found roughly 4:1 ROI — cost per PR around $37.50 against 2 hours saved at $75/hour. Not controlled studies. Consistent practitioner reports. The pattern holds across enough setups that dismissing it as anecdote is a mistake.&lt;/p&gt;

&lt;p&gt;A tight instruction file is necessary but not sufficient. It works better when your project structure isn't fighting it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure as Context Contract
&lt;/h2&gt;

&lt;p&gt;Before the AI reads a single instruction file, it's already forming a model of your codebase from its structure. Folder names are documentation. File names are documentation. The way you organize things tells the agent what belongs where, what relates to what, and what conventions you follow — automatically, without you saying a word.&lt;/p&gt;

&lt;p&gt;Most teams don't think about this as context engineering. I've watched this exact disconnect produce mysterious, inconsistent AI output on every large codebase I've touched. The structure is sending signals the team never intended to send.&lt;/p&gt;

&lt;p&gt;Consistent naming matters more than you'd expect. If some components are named &lt;code&gt;UserCard&lt;/code&gt;, some are &lt;code&gt;user-card&lt;/code&gt;, and some are &lt;code&gt;UserCardComponent&lt;/code&gt;, the agent is receiving three different signals about the same thing. It can't infer a convention from contradictions. It produces output that matches whichever form it saw most recently, not the correct form. Three inconsistent names are three opportunities for the wrong suggestion.&lt;/p&gt;
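&lt;p&gt;One way to surface those contradictions before the agent does is a small audit script. A rough sketch; the conventions and the file glob are assumptions, so adjust for your project:&lt;/p&gt;

```python
import re
from pathlib import Path

def classify(filename: str) -> str:
    """Classify a filename's stem into a naming convention."""
    stem = filename.split(".")[0]
    if re.fullmatch(r"[A-Z][a-zA-Z0-9]*", stem):
        return "PascalCase"
    if re.fullmatch(r"[a-z][a-zA-Z0-9]*", stem) and any(c.isupper() for c in stem):
        return "camelCase"
    if re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)+", stem):
        return "kebab-case"
    if re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)+", stem):
        return "snake_case"
    return "other"

def audit(root: str, pattern: str = "*.tsx") -> dict:
    """Count how many files under root follow each convention."""
    counts: dict = {}
    for path in Path(root).rglob(pattern):
        style = classify(path.name)
        counts[style] = counts.get(style, 0) + 1
    return counts
```

&lt;p&gt;If &lt;code&gt;audit("src/components")&lt;/code&gt; returns more than one nonzero bucket, the agent is seeing contradictory signals.&lt;/p&gt;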

&lt;p&gt;Keep tests, docs, and decisions next to the code they describe. A test file two directories away from its source module is context the agent might never retrieve. A test file in the same directory gets read automatically when the source gets opened. Don't make the agent hunt. It won't always find what it's looking for, and you'll pay for that in bad output.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;docs/decisions/&lt;/code&gt; folder earns its keep fast. One file per major architectural choice, written when you make the decision. When the agent is working in the payments layer and a relevant ADR exists, it surfaces the reasoning behind how things are built. Without ADRs, the agent sees the what and invents the why. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A good practice is to keep an architectural map or lookup table in this section, so the AI has a quick reference for getting up to speed on the codebase (every session starts 'new').&lt;/p&gt;
&lt;/blockquote&gt;
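
&lt;p&gt;An ADR doesn't need to be long. A skeleton along these lines works; the numbering and headings are placeholders, not a standard:&lt;/p&gt;

```markdown
# ADR 007: [Decision title]

Status: accepted | Date: YYYY-MM-DD

## Context
[The constraint that forced a choice, not the project history.]

## Decision
[One or two sentences: what we chose.]

## Consequences
[What gets easier, what gets harder, what must not be undone.]
```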

&lt;p&gt;Deeply nested folder hierarchies are a hidden context tax. Every level of nesting increases the probability that relevant files fall outside the context window when the agent is working on something nearby. Flat structure with clear naming outperforms deep hierarchies for AI-assisted work. If your project is necessarily deep, your instruction file routing has to be precise enough to compensate.&lt;/p&gt;

&lt;p&gt;Structure produces consistent context. Even perfect structure can't fix a bloated context window, though. That's where most sessions quietly break down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Your Context Window
&lt;/h2&gt;

&lt;p&gt;These examples use Claude Code mechanics because that's what I work in daily. Every serious AI coding tool has equivalents. The pattern matters more than the specific command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check your window before it checks you.&lt;/strong&gt; Claude Code's &lt;code&gt;/context&lt;/code&gt; command shows token counts for your current session: input tokens used, output tokens, cache status. When input tokens are approaching the model's limit, output quality degrades. Responses get shorter. Suggestions get less precise. Hallucinations increase. By the time you notice the quality drop, you're already in it. Check before starting long tasks, not after.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbkjfzywxka29jwuftb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbkjfzywxka29jwuftb6.png" alt="Context window at 130k/200k tokens used" width="800" height="698"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;130k of 200k tokens used (65%). Messages account for 106.4k tokens — over half the window consumed by conversation history alone. Free space: 35k. This is the threshold where output quality starts slipping.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compact vs. new session.&lt;/strong&gt; The &lt;code&gt;/compact&lt;/code&gt; command summarizes your current session and rebuilds a condensed version. Use it when you're mid-task, need to shed conversation weight, and the working context (decisions made, files read, direction established) is still relevant to where you're going.&lt;/p&gt;

&lt;p&gt;A new session starts clean. Use it when the task is complete, when you're switching domains, or when accumulated context has drifted from what you're actually doing. Old context isn't neutral. It's noise that pulls the model toward decisions you've already discarded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvkv54ejiurg83gp2wvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvkv54ejiurg83gp2wvp.png" alt="/compact operation in progress" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;/compact rebuilds the session: Claude re-reads the key files it needs, restores skills, and collapses the conversation into a condensed summary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tpri3e8whntl5rbixu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tpri3e8whntl5rbixu3.png" alt="After compact: 51k/200k tokens" width="782" height="713"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;After compact: 51k/200k tokens (25%). Messages dropped from 106.4k to 25.4k. Free space jumped from 35k to 116k. The working context survived. The dead weight didn't.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Compact preserves momentum. A new session preserves clarity. Clarity usually wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three modes, three different situations.&lt;/strong&gt; Plan mode (&lt;code&gt;/plan&lt;/code&gt;) makes the AI propose before touching anything. Use it for multi-file changes, anything touching shared infrastructure, or any task where you're not certain what the blast radius is. The proposal step isn't overhead. It's the difference between reviewing a plan and reviewing a broken implementation.&lt;/p&gt;

&lt;p&gt;Accept with edits is the default for most sessions. The AI does the work, you verify.&lt;/p&gt;

&lt;p&gt;Bypass or auto-approve is appropriate only when Layers 1-3 are solid. When the AI knows your conventions, when it has explicit anti-patterns to follow, when the context is tight — that's when giving it autonomous scope makes sense. Better context engineering is how you earn the right to give your agent autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;/think&lt;/code&gt; before complex decisions.&lt;/strong&gt; This command forces the model to reason explicitly before responding. Use it for architecture decisions, hard debugging, anything where the first answer is likely wrong. You're not changing the response. You're changing the quality of the reasoning that produces it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;These three files are your agent's persistent identity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; (user-level): your preferences across all projects — editor config, communication style, how you want code commented&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;./CLAUDE.md&lt;/code&gt; (project-level): project conventions, folder structure, anti-patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AGENTS.md&lt;/code&gt; (root): behavioral rules for specific agents or workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without them, every session starts from zero. The agent has no memory of what you've built, what you've decided, or what you've told it to avoid. With them, the agent isn't starting from zero. It already knows what you're building and what you've decided. That's the difference between a tool you configure once and a tool you re-brief every morning.&lt;/p&gt;

&lt;p&gt;Build that identity well and you'll want to extend it. That's where the context budget comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Budget: Skills vs. MCP
&lt;/h2&gt;

&lt;p&gt;This tradeoff applies to any framework that extends an AI agent's capabilities. It shows up in every serious setup. Most developers don't think about it until sessions start degrading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt; are lightweight instruction files loaded at session start. They tell the agent how to do something: a workflow, a content pattern, a code review checklist, a task it performs the same way every time. You write it once. The context cost is fixed and paid once at session start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; connects the agent to real-time external services. The agent calls a tool mid-session: a database query, a live API call, a current data source. The cost is variable, paid per call, and it stacks. Every MCP call loads tool schemas, the call result, and server responses into the context window.&lt;/p&gt;

&lt;p&gt;This compounds. Three MCP calls per task, ten tasks in a session — that's 30 discrete context injections on top of everything else. A skill-based equivalent, where the workflow is pre-written and the agent follows it, has a fraction of that pressure.&lt;/p&gt;
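&lt;p&gt;Back-of-envelope math makes the pressure visible. The token figures below are purely illustrative assumptions, not measured costs:&lt;/p&gt;

```python
def injected_tokens(calls_per_task: int, tasks: int, tokens_per_call: int) -> int:
    """Total context tokens injected by per-call tool results over a session."""
    return calls_per_task * tasks * tokens_per_call

# Assumption: each MCP call adds ~2,000 tokens of schema plus result.
mcp_total = injected_tokens(calls_per_task=3, tasks=10, tokens_per_call=2000)

# Assumption: an equivalent skill costs ~1,500 tokens, paid once at startup.
skill_total = 1500

print(mcp_total)   # 60000
print(skill_total) # 1500
```

&lt;p&gt;Sixty thousand tokens of per-call injections against a flat 1,500 at session start: that's the window pressure the skill-based equivalent avoids.&lt;/p&gt;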

&lt;p&gt;Use MCP when you genuinely need live data. Current timestamps, a real database query, an API response that changes between calls. The output can't be pre-written because it depends on real-time state.&lt;/p&gt;

&lt;p&gt;Use skills when you have a repeatable workflow. Content patterns, review checklists, tasks the agent performs identically every time. Pre-write it once, reference it for as long as the workflow holds.&lt;/p&gt;

&lt;p&gt;The decision rule: if you can write it down and have it work 90% of the time, write it as a skill. Every unnecessary MCP call is a context tax paid on every execution. You can even use MCP first to get something working quickly, then capture that working output as a skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auditing Your Context Setup
&lt;/h2&gt;

&lt;p&gt;Your AI should answer these questions from context alone, without reading a specific file:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What naming convention does this project use for components?&lt;/li&gt;
&lt;li&gt;What's the tech stack?&lt;/li&gt;
&lt;li&gt;What are the top 3 things I should never change in this codebase?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If it can't answer question 3, you don't have explicit anti-patterns documented. That's the biggest gap in most instruction files, and the first thing to fix.&lt;/p&gt;

&lt;p&gt;Signs your context is bloated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI asks you to clarify things it should already know&lt;/li&gt;
&lt;li&gt;Suggestions don't match project conventions&lt;/li&gt;
&lt;li&gt;Errors reference wrong library versions or deprecated APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If this is happening, &lt;code&gt;/compact&lt;/code&gt; or start a new session; you aren't getting optimal results anyway.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Signs your context is working:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI refers to project conventions without being prompted&lt;/li&gt;
&lt;li&gt;Suggestions match naming patterns on first pass&lt;/li&gt;
&lt;li&gt;It knows where things live without being told&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bloated context problem is almost always a Layer 2 issue. The instruction file grew to document everything, contradicts itself in places, and buries the most critical rules in the middle where recall degrades. Trim ruthlessly. Move subdirectory-specific rules to their subdirectory. Keep the root file focused on what's true across the whole project.&lt;/p&gt;

&lt;p&gt;That's when context engineering stops feeling like maintenance and starts paying for itself.&lt;/p&gt;




&lt;p&gt;The Cursor vs. Claude Code debate will be irrelevant within a year. Something will ship that makes both look dated. The debate restarts around whatever that tool is — same framing, same wrong frame.&lt;/p&gt;

&lt;p&gt;Context engineering won't be irrelevant. The principles — what the AI can see, how it's organized, what contracts you've written to govern its behavior — apply to whatever ships next. You're building fluency with a discipline, not a product.&lt;/p&gt;

&lt;p&gt;Open your instruction file. Find the first section that doesn't exist yet. Anti-patterns list. ADR folder reference. Exclusion zone. Write it this week. One section, one hour, permanent improvement.&lt;/p&gt;

&lt;p&gt;Master the harness. The horse will change.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>automation</category>
    </item>
    <item>
      <title>The AI Coding Model Wars: How Open Source Is Closing the Gap on Proprietary Coding Models</title>
      <dc:creator>Collin Wilkins</dc:creator>
      <pubDate>Fri, 27 Feb 2026 15:01:17 +0000</pubDate>
      <link>https://dev.to/cwilkins507/the-ai-coding-model-wars-how-open-source-is-closing-the-gap-on-proprietary-coding-models-3ca8</link>
      <guid>https://dev.to/cwilkins507/the-ai-coding-model-wars-how-open-source-is-closing-the-gap-on-proprietary-coding-models-3ca8</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://collinwilkins.com/articles/ai-coding-model-wars-2026" rel="noopener noreferrer"&gt;collinwilkins.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Four major coding models launched in six days. Two proprietary. Two open source. The benchmark gap between the best and worst? Just 2.6 percentage points.&lt;/p&gt;

&lt;p&gt;That number is the story of February 2026. There isn't a single model that is clearly winning. What matters now is which model fits your workflow, your budget, and how much you care about keeping your code off someone else's servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The week that broke the leaderboard
&lt;/h2&gt;

&lt;p&gt;On February 5, Anthropic released Claude Opus 4.6 and OpenAI shipped Codex 5.3. Same day. Two very different philosophies, both claiming the top spot in coding performance.&lt;/p&gt;

&lt;p&gt;Six days later, Zhipu AI dropped &lt;a href="https://the-decoder.com/chinese-ai-lab-zhipu-releases-glm-5-under-mit-license-claims-parity-with-top-western-models/" rel="noopener noreferrer"&gt;GLM-5&lt;/a&gt;. A 744-billion parameter open-source model under an MIT license. It scored within 1.6 points of Opus on SWE-bench. At roughly 1/45th the cost.&lt;/p&gt;

&lt;p&gt;Then Kimi K2.5 from Moonshot AI. One trillion parameters, open source, agent swarm architecture that can coordinate 100 sub-agents in parallel.&lt;/p&gt;

&lt;p&gt;Here's where things stand:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;SWE-bench Verified&lt;/th&gt;
&lt;th&gt;Input Cost (per MTok)&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;79.4%&lt;/td&gt;
&lt;td&gt;~$5.00&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5&lt;/td&gt;
&lt;td&gt;77.8%&lt;/td&gt;
&lt;td&gt;~$0.11&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex 5.3&lt;/td&gt;
&lt;td&gt;~77.3% (Terminal-Bench leader)&lt;/td&gt;
&lt;td&gt;~$1.75&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;76.8%&lt;/td&gt;
&lt;td&gt;n/a (open weights)&lt;/td&gt;
&lt;td&gt;Open Source&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sources: &lt;a href="https://www.aifreeapi.com/en/posts/glm-5-vs-opus-4-6-vs-gpt-5-3" rel="noopener noreferrer"&gt;aifreeapi.com&lt;/a&gt;, &lt;a href="https://www.interconnects.ai/p/opus-46-vs-codex-53" rel="noopener noreferrer"&gt;Interconnects.ai&lt;/a&gt;, &lt;a href="https://winbuzzer.com/2026/02/12/zhipu-ai-glm-5-744b-model-rivals-claude-opus-z-ai-platform-xcxwbn/" rel="noopener noreferrer"&gt;Winbuzzer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dzch00hcpu080qu4j5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dzch00hcpu080qu4j5i.png" alt="Performance vs. Cost scatter plot showing all four models" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Razor-thin. Two years ago, the gap between the best and fifth-best model on any coding benchmark was 15+ points. Now the top four sit within a few points of each other and the rankings shuffle depending on which benchmark you pick.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.interconnects.ai/p/opus-46-vs-codex-53" rel="noopener noreferrer"&gt;Interconnects.ai&lt;/a&gt; put it well: workflow fit matters more than leaderboard position. I'd go further. If you're choosing a coding model based on SWE-bench scores alone, you're optimizing for the wrong thing.&lt;/p&gt;

&lt;p&gt;The real differences are in how these models work, what they cost, and what you're allowed to do with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The proprietary heavyweights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Opus 4.6
&lt;/h3&gt;

&lt;p&gt;Opus 4.6 is the deep thinker. Its headline feature is &lt;strong&gt;Agent Teams&lt;/strong&gt;, the ability to spin up 16+ parallel agents that coordinate on complex tasks. Anthropic demonstrated this by having agent teams build a 100,000-line C compiler across 2,000 sessions (&lt;a href="https://www.interconnects.ai/p/opus-46-vs-codex-53" rel="noopener noreferrer"&gt;Interconnects.ai&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The philosophy is autonomy: give it a complex problem, set guardrails, and let it work. A 1-million-token context window means it can hold entire codebases in memory, and deep reasoning chains let it plan multi-step refactors that other models lose track of halfway through.&lt;/p&gt;

&lt;p&gt;The tradeoff is cost. At ~$5/MTok input, a heavy agentic session gets expensive fast. That C compiler demo reportedly cost $20,000 in API spend. I've run smaller agent workflows that still burned through $50-100 in an afternoon. For enterprise teams where engineer time costs more than API credits, that math works. For a solo dev, it probably doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Complex multi-file refactors, architectural changes, enterprise workflows where correctness matters more than cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codex 5.3
&lt;/h3&gt;

&lt;p&gt;Codex takes the opposite approach. Where Opus goes deep and autonomous, Codex goes fast and collaborative.&lt;/p&gt;

&lt;p&gt;It leads Terminal-Bench at 77.3%, a benchmark of terminal-based coding tasks that sits closer to how developers actually work than isolated benchmark problems (&lt;a href="https://www.interconnects.ai/p/opus-46-vs-codex-53" rel="noopener noreferrer"&gt;Interconnects.ai&lt;/a&gt;). The real strength is interactive steering: you can redirect it mid-task without breaking context or restarting the conversation.&lt;/p&gt;

&lt;p&gt;At ~$1.75/MTok input, that's about 3x cheaper than Opus. The ecosystem around it is mature, with deep integration into VS Code, GitHub Copilot, and the broader OpenAI toolchain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://every.to/vibe-check/codex-vs-opus" rel="noopener noreferrer"&gt;Every.to&lt;/a&gt; described the split well: Opus is the model you set loose on a problem. Codex is the model you pair-program with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Fast iteration, interactive development, teams already invested in the OpenAI ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  The philosophical split
&lt;/h3&gt;

&lt;p&gt;This matters more than the benchmarks.&lt;/p&gt;

&lt;p&gt;Opus says: "Tell me the goal, I'll figure it out." That works when the task is complex enough that you'd spend hours on it yourself. It fails when you need tight feedback loops or when the cost of an autonomous run gone sideways exceeds the cost of doing it manually.&lt;/p&gt;

&lt;p&gt;Codex says: "Let's work on this together." That works for the daily grind. Writing functions, debugging, building features incrementally. It fails when you need sustained multi-step reasoning across a large surface area.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model you want depends on how you work, not how it benchmarks.&lt;/strong&gt; I keep Opus for architecture-level tasks and reach for Codex-class models when I'm iterating fast on implementation. Most days are implementation days.&lt;/p&gt;

&lt;p&gt;But the proprietary debate is only half the story. The open-source models that showed up a week later made the whole conversation more interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The open-source challengers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GLM-5
&lt;/h3&gt;

&lt;p&gt;GLM-5 is the model that changed the math.&lt;/p&gt;

&lt;p&gt;744 billion parameters in a Mixture-of-Experts architecture. MIT license. &lt;a href="https://winbuzzer.com/2026/02/12/zhipu-ai-glm-5-744b-model-rivals-claude-opus-z-ai-platform-xcxwbn/" rel="noopener noreferrer"&gt;77.8% on SWE-bench Verified&lt;/a&gt;, within 1.6 points of Opus 4.6.&lt;/p&gt;

&lt;p&gt;At ~$0.11 per million input tokens through Zhipu's API, that's roughly 45x cheaper than Opus for comparable coding performance.&lt;/p&gt;

&lt;p&gt;But cost isn't even the most interesting part.&lt;/p&gt;

&lt;p&gt;GLM-5 was &lt;a href="https://the-decoder.com/chinese-ai-lab-zhipu-releases-glm-5-under-mit-license-claims-parity-with-top-western-models/" rel="noopener noreferrer"&gt;trained entirely on Huawei Ascend chips&lt;/a&gt;, no NVIDIA dependency. It's self-hostable. Because it's MIT-licensed, you can fine-tune it on your proprietary codebase without worrying about licensing terms.&lt;/p&gt;

&lt;p&gt;The tooling ecosystem moved fast. Within days of release, GLM-5 was &lt;a href="https://simonwillison.net/2026/Feb/11/glm-5/" rel="noopener noreferrer"&gt;working with Claude Code, OpenCode, and Roo Code&lt;/a&gt; as a drop-in backend. Simon Willison noted that it handled agentic coding workflows, the multi-step, tool-using tasks that actually matter for real development work, comparably to proprietary alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$0.11/MTok for 77.8% SWE-bench performance, MIT-licensed, self-hostable.&lt;/strong&gt; Read that sentence again if you're still paying $5/MTok for routine coding tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Budget-conscious teams, self-hosted environments, privacy-sensitive codebases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kimi K2.5
&lt;/h3&gt;

&lt;p&gt;K2.5 from Moonshot AI takes a different angle on open source. One trillion total parameters with 32 billion active (another MoE architecture), but the standout feature is the &lt;a href="https://medium.com/data-science-in-your-pocket/kimi-k2-5-best-open-sourced-coding-ai-is-here-00c355772640" rel="noopener noreferrer"&gt;agent swarm system&lt;/a&gt;. It can coordinate up to 100 sub-agents making 1,500 tool calls in parallel.&lt;/p&gt;

&lt;p&gt;It scores 76.8% on SWE-bench Verified. Slightly below GLM-5 on pure coding benchmarks. But it has two things the others don't: strong frontend/visual understanding and native agent orchestration at a scale that would require serious custom infrastructure to replicate with other models.&lt;/p&gt;

&lt;p&gt;If you're building something that involves UI generation, design-to-code workflows, or massive parallel agent tasks, K2.5 is worth evaluating. I haven't tested it as deeply as GLM-5, but the agent swarm capability is genuinely novel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Frontend and visual tasks, large-scale agent orchestration, teams experimenting with multi-agent architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why open source matters now
&lt;/h3&gt;

&lt;p&gt;The performance argument is settled. Open-source models match proprietary ones on coding benchmarks. The remaining arguments are about everything else.&lt;/p&gt;

&lt;p&gt;GLM-5 at $0.11/MTok vs Opus at $5/MTok. For teams processing thousands of coding tasks per day, that's the difference between a rounding error and a budget line item. At that ratio, you could run 45 GLM-5 tasks for the cost of one Opus task. The volume math gets absurd fast.&lt;/p&gt;
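&lt;p&gt;As a sanity check on that ratio, the arithmetic is a one-liner. The prices are the approximate per-million-token input figures cited above, not official rate cards:&lt;/p&gt;

```python
# Approximate per-MTok input prices cited in the article, not vendor rate cards
opus_usd_per_mtok = 5.00
glm_usd_per_mtok = 0.11

ratio = opus_usd_per_mtok / glm_usd_per_mtok
print(round(ratio, 1))  # 45.5 -- about 45 GLM-5 tasks per Opus task
```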

&lt;p&gt;Self-hosted means your code never leaves your infrastructure. For regulated industries, defense contractors, or anyone with strict data residency requirements, this isn't a nice-to-have. It's a hard requirement. I've talked to teams in healthcare and fintech who won't touch any cloud-hosted model for their core codebase. GLM-5 with an MIT license is the first model that gives them frontier-tier coding capability without that tradeoff.&lt;/p&gt;

&lt;p&gt;There's a harder question behind the self-hosting argument, though. GLM-5 and Kimi K2.5 both come from Chinese companies — Zhipu AI and Moonshot AI, respectively. China's &lt;a href="https://en.wikipedia.org/wiki/National_Intelligence_Law_of_the_People%27s_Republic_of_China" rel="noopener noreferrer"&gt;2017 National Intelligence Law&lt;/a&gt; requires organizations to cooperate with state intelligence work. Multiple governments have already responded: the US banned Chinese AI models from government devices, Australia followed, Taiwan and Italy took similar action. CrowdStrike &lt;a href="https://www.crowdstrike.com/en-us/blog/crowdstrike-researchers-identify-hidden-vulnerabilities-ai-coded-software/" rel="noopener noreferrer"&gt;found that DeepSeek-R1 produces insecure code&lt;/a&gt; when prompted with politically sensitive topics. The scrutiny isn't theoretical. It's policy.&lt;/p&gt;

&lt;p&gt;The distinction that matters is hosted API versus self-hosted weights. Using Zhipu's API at $0.11/MTok means your code routes through Chinese servers — a non-starter for most enterprises and outright banned in some jurisdictions. Self-hosting the MIT-licensed weights means your data never leaves your infrastructure, and Chinese intelligence law doesn't apply to weights you downloaded and run locally. This is actually the strongest argument &lt;em&gt;for&lt;/em&gt; the open-source license. The MIT license isn't just a cost play. It's the escape valve that makes these models usable for teams that would otherwise never touch them.&lt;/p&gt;

&lt;p&gt;Fine-tuning on your own codebase means the model learns your patterns, your conventions, your internal APIs. Proprietary models can't offer this. And if Zhipu raises prices or changes terms, you have the weights. You can host them anywhere. &lt;a href="https://www.bitdoze.com/best-open-source-llms-claude-alternative/" rel="noopener noreferrer"&gt;Bitdoze&lt;/a&gt; noted this portability as a key factor driving enterprise adoption.&lt;/p&gt;

&lt;p&gt;The catch is real though. Self-hosting a 744B parameter model requires serious hardware. You're trading API costs for infrastructure costs. For many teams, the managed API at $0.11/MTok is the pragmatic choice anyway. But the &lt;em&gt;option&lt;/em&gt; to self-host is what creates competitive pressure on pricing across the board.&lt;/p&gt;
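&lt;p&gt;For a back-of-envelope sense of that hardware bill: weight memory alone scales as parameter count times bytes per parameter. This ignores KV cache, activations, and MoE serving tricks, so treat it as a floor, not an estimate:&lt;/p&gt;

```python
import math

params = 744e9      # GLM-5's reported total parameter count
gpu_mem_gb = 80     # one 80 GB accelerator, just for scale

for label, bytes_per_param in [("8-bit", 1), ("bf16", 2)]:
    weights_gb = params * bytes_per_param / 1e9
    gpus = math.ceil(weights_gb / gpu_mem_gb)
    print(label, round(weights_gb), "GB of weights, at least", gpus, "GPUs")
# 8-bit: 744 GB, at least 10 GPUs; bf16: 1488 GB, at least 19 GPUs
```

Even quantized to 8 bits, you're into multi-node territory before serving a single request, which is why the managed API stays the pragmatic choice for most teams.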

&lt;h2&gt;
  
  
  When to use what
&lt;/h2&gt;

&lt;p&gt;Skip the "which is best?" question. Wrong frame. The right question is "which is best for &lt;em&gt;this task&lt;/em&gt;?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3f8vg3dnj8mg2rdbwuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3f8vg3dnj8mg2rdbwuv.png" alt="Decision tree for choosing a coding model" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Recommended Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Complex multi-file refactors&lt;/td&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;Deepest reasoning, Agent Teams, 1M context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast iteration and pair programming&lt;/td&gt;
&lt;td&gt;Codex 5.3&lt;/td&gt;
&lt;td&gt;Speed, interactive steering, mature ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget-conscious / high-volume&lt;/td&gt;
&lt;td&gt;GLM-5&lt;/td&gt;
&lt;td&gt;Frontier quality at 1/45th the price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted / privacy-first&lt;/td&gt;
&lt;td&gt;GLM-5 (self-hosted)&lt;/td&gt;
&lt;td&gt;MIT license, self-hostable, avoids Chinese API data routing concerns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend / visual / design-to-code&lt;/td&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;Strong vision capabilities, UI generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large-scale agent orchestration&lt;/td&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;100 sub-agents, 1,500 parallel tool calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simple tasks (formatting, linting, boilerplate)&lt;/td&gt;
&lt;td&gt;Haiku / GPT-4.1 mini / Flash&lt;/td&gt;
&lt;td&gt;Don't overthink it. Cheap and fast wins here.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I wrote about the &lt;a href="https://collinwilkins.com/articles/ai-model-selection" rel="noopener noreferrer"&gt;model selection framework&lt;/a&gt; in more detail. The core principle is matching capability to complexity. Using Opus to format a JSON file is like renting a crane to hang a picture frame.&lt;/p&gt;

&lt;p&gt;The table above is a starting point. Your actual workflow will be messier. You'll find tasks that fall between tiers, models that surprise you on tasks they weren't "supposed" to handle, and edge cases where the cheap model is actually better because it doesn't overthink. Test on your workload. The table gives you a starting hypothesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The multi-model future
&lt;/h2&gt;

&lt;p&gt;The teams getting the best results aren't picking one model. They're routing.&lt;/p&gt;

&lt;p&gt;Simple tasks go to cheap, fast models. Complex tasks go to frontier models. Nobody runs a single EC2 instance type for their entire infrastructure. Same principle applies here.&lt;/p&gt;

&lt;p&gt;The tooling supports this now. Claude Code, Cursor, Continue, and OpenCode all support model switching or multi-model configurations. You can set your default to a cost-efficient model and escalate when the task warrants it.&lt;/p&gt;

&lt;p&gt;What a practical multi-model workflow looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaffolding, boilerplate, simple edits → Haiku or GLM-5 (~$0.10-0.25/MTok)&lt;/li&gt;
&lt;li&gt;Feature implementation, debugging, test writing → Codex 5.3 or Sonnet (~$1-3/MTok)&lt;/li&gt;
&lt;li&gt;Architecture decisions, complex refactors, multi-file changes → Opus 4.6 (~$5/MTok)&lt;/li&gt;
&lt;li&gt;Privacy-sensitive codebases → GLM-5 self-hosted (infrastructure cost only)&lt;/li&gt;
&lt;/ul&gt;
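&lt;p&gt;A minimal version of that routing is just a lookup table. The tier names, model identifiers, and prices below are illustrative placeholders echoing the rough figures above, not any vendor's actual API:&lt;/p&gt;

```python
# Hypothetical routing table: tiers, model names, and per-MTok input
# prices are illustrative, not real API identifiers or rate cards.
ROUTES = {
    "boilerplate":  {"model": "glm-5",     "usd_per_mtok": 0.11},
    "feature":      {"model": "codex-5.3", "usd_per_mtok": 1.75},
    "architecture": {"model": "opus-4.6",  "usd_per_mtok": 5.00},
}

def pick_model(tier):
    """Route a task tier to a model, defaulting to the cheap tier."""
    return ROUTES.get(tier, ROUTES["boilerplate"])

print(pick_model("architecture")["model"])  # opus-4.6
```

Real routers add escalation logic (retry a failed cheap-model run on a stronger model), but the default-cheap, escalate-on-demand shape is the whole idea.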

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feuec2onibbd8250ejjms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feuec2onibbd8250ejjms.png" alt="Cost comparison: single model vs multi-model routing" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost difference compounds. A team that routes 80% of tasks to a cheap model and 20% to a frontier model might spend 5-10x less than a team that runs everything through Opus. The quality difference on those routine tasks? Negligible. I've tested this across a mix of refactoring, test generation, and boilerplate tasks. The cheap model handles 80% of them fine. The 20% where you need Opus, you really need Opus. But you don't need it for the other 80%.&lt;/p&gt;
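&lt;p&gt;The blended-cost math behind that claim, sketched with the rough input prices used throughout this piece (input tokens only; output pricing is ignored):&lt;/p&gt;

```python
cheap, frontier = 0.11, 5.00  # approximate per-MTok input prices from the article

for cheap_share in (0.80, 0.90):
    blended = cheap_share * cheap + (1 - cheap_share) * frontier
    print(cheap_share, round(blended, 2), round(frontier / blended, 1))
# 80/20 split: blended ~$1.09/MTok, ~4.6x cheaper than all-Opus
# 90/10 split: blended ~$0.60/MTok, ~8.3x cheaper
```

The savings multiple is driven almost entirely by how much traffic you can push to the cheap tier, which is why measuring your own task mix matters more than any single price.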

&lt;p&gt;GLM-5 at $0.11/MTok makes a great default for routine tasks, with Opus as the escalation path for hard problems. Even if you never self-host, even if you stay fully proprietary for your critical work, the existence of GLM-5 at that price point changes the economics of your entire workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The competitive picture will keep shifting. New models will launch. Benchmarks will get closer. Pricing will drop. That trend line isn't changing.&lt;/p&gt;

&lt;p&gt;But the lesson from February 2026 is already clear. No single model wins everything. Each has a philosophy. Open source isn't "catching up" anymore; it's competitive, and the cost and privacy arguments seal it for many teams. Multi-model workflows are the pragmatic path forward, and the tooling finally supports them without duct tape.&lt;/p&gt;

&lt;p&gt;If you're still defaulting to one model for every coding task, you're either overpaying or underperforming. Probably both.&lt;/p&gt;

&lt;p&gt;Pick one task you're currently routing to an expensive model. Try it on GLM-5 or a smaller model. Measure the difference. You might be surprised how little you lose.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
      <category>news</category>
    </item>
  </channel>
</rss>
