Route by Task, Not Vendor: The Open-Weight AI Stack

Six months ago, "move your AI workloads to open Chinese models" was a thought experiment you floated in a strategy deck to sound forward-looking. Now it is a procurement story with real invoices attached. The migration is already happening at names you know, and it is not being driven by ideology. It is being driven by arithmetic.

Airbnb moved to Qwen, Alibaba's open-weight family. CEO Brian Chesky described it plainly: "very good, fast and cheap." It powers their support agent. Cursor built its Composer coding model on Moonshot's open weights, which ship as Kimi K2.5. Microsoft has been hosting and testing DeepSeek V4 inside Azure Foundry and Copilot. Shopify, Coinbase, Siemens, and Uber Eats have all reportedly routed real production workloads to Qwen, GLM, Kimi, or DeepSeek.

None of them "switched to the best model." That framing misreads the entire decision. Each of them moved the right task to a cheaper open-weight model sitting within a few points of frontier. The distinction matters more than any benchmark leaderboard, because it inverts the question everyone has been asking.

The question was never "which model is best?"

For three years the industry has treated model selection as a single global decision. You pick the smartest model, you wire everything to it, you feel safe. That instinct is expensive and increasingly wrong. The real question is narrower and far more useful: which task actually needs the best model?

Look at what production traffic is actually made of. The overwhelming majority of it is the boring 80% — extraction, classification, summarization, routing, simple tool calls, reformatting, deduplication. This is plumbing. It does not require a model that can reason through a novel proof or design a distributed system. It requires a model that is competent, fast, and cheap.

Frontier models are priced for the hard 20% — the genuinely difficult reasoning, the long-horizon planning, the cases where an extra few points of quality translate into measurable business value. That is what you are paying a premium for. When you send the easy 80% through a frontier API, you are paying that premium on every request that never needed it.

Paying frontier prices for the easy 80% is one of the biggest sources of AI budget waste in production today.
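
To make the arithmetic concrete, here is a minimal sketch of the blended bill under an 80/20 split. The per-request prices and the request volume are placeholder assumptions chosen for illustration, not quotes from any vendor; swap in the numbers from your own invoices.

```python
# Hypothetical figures for illustration only; replace with your own invoices.
FRONTIER_COST_PER_REQUEST = 0.010  # assumed dollars per request on a frontier API
OPEN_COST_PER_REQUEST = 0.001      # assumed dollars per request on an open-weight model
EASY_SHARE = 0.80                  # the "boring 80%" of traffic


def monthly_cost(requests: int, easy_share: float, route_easy_to_open: bool) -> float:
    """Return monthly spend in dollars for a given routing policy."""
    easy = requests * easy_share
    hard = requests - easy
    easy_price = OPEN_COST_PER_REQUEST if route_easy_to_open else FRONTIER_COST_PER_REQUEST
    # Hard requests always go to the frontier model in this sketch.
    return easy * easy_price + hard * FRONTIER_COST_PER_REQUEST


requests = 1_000_000
all_frontier = monthly_cost(requests, EASY_SHARE, route_easy_to_open=False)
routed = monthly_cost(requests, EASY_SHARE, route_easy_to_open=True)
savings_fraction = 1 - routed / all_frontier

print(f"all frontier: ${all_frontier:,.0f}")
print(f"routed:       ${routed:,.0f}")
print(f"saved:        {savings_fraction:.0%}")
```

Under these assumed prices the routed bill is 28% of the all-frontier bill. The exact ratio depends entirely on your own price gap and traffic mix, which is why the split is a parameter and not a constant.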

Route by task, not by vendor

The architecture that follows is not exotic. It is a routing table. You classify the task, then you send it to the cheapest model that clears the quality bar for that task. In practice, a stack that holds up in production looks something like this:

Reasoning — GLM or Kimi, which now sit close enough to frontier that the gap rarely shows up in real workloads.
Code — Kimi Code or Qwen Coder for the bulk of generation and refactoring.
Agents and tool calls — GLM, which handles structured tool invocation reliably at a fraction of closed-API cost.
Bulk processing — MiMo, where you are grinding through volume and latency-per-dollar dominates.
Images and video — fine-tuned LTX plus Wan, tuned to your own domain.
Local workhorse — Qwen3.6-35B-A3B, the model that runs on your own hardware and quietly handles the daily grind.
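
The list above can be sketched as a literal routing table. This is a minimal illustration, not a production router: the `Task` categories, the `pick_model` helper, and the idea of falling back to the local workhorse are assumptions made for the example, and the model strings are labels taken from the list, not real API model identifiers.

```python
from enum import Enum


class Task(Enum):
    REASONING = "reasoning"
    CODE = "code"
    AGENT = "agent"
    BULK = "bulk"
    MEDIA = "media"
    GENERAL = "general"


# Candidates per task in preference order. Labels only, not API model ids.
ROUTES = {
    Task.REASONING: ("GLM", "Kimi"),
    Task.CODE: ("Kimi Code", "Qwen Coder"),
    Task.AGENT: ("GLM",),
    Task.BULK: ("MiMo",),
    Task.MEDIA: ("LTX + Wan",),
    Task.GENERAL: ("Qwen3.6-35B-A3B",),
}


def pick_model(task: Task, unavailable: frozenset = frozenset()) -> str:
    """Return the first available candidate for a task.

    Falls back to the local workhorse when every candidate for the task
    is unavailable (an assumption of this sketch, not a rule from the text).
    """
    for model in ROUTES[task] + ROUTES[Task.GENERAL]:
        if model not in unavailable:
            return model
    raise RuntimeError(f"no model available for task {task.value!r}")


print(pick_model(Task.REASONING))
print(pick_model(Task.REASONING, unavailable=frozenset({"GLM"})))
print(pick_model(Task.BULK, unavailable=frozenset({"MiMo"})))
```

Keeping the table as plain data means a routing change is a one-line edit that can be reviewed like any other config change, and the fallback path is exercised by the same function that serves normal traffic.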

Almost all of these are open-weight, self-hostable, close to frontier, and a fraction of the cost. The same principle explains why capability is commoditizing while cost becomes the frontier: when the models converge on quality, differentiation moves to how efficiently you deploy them.

Savings are the visible win. Ownership is the real one.

The cost delta is what gets the CFO's attention, and it is real. But savings are not the point. The point is that you own the stack. When the core of your system runs on weights you hold, nobody can switch you off. Nobody can revoke your access on their timeline. Nobody can see your data, dictate your pricing, or quietly reshape your roadmap by changing theirs.

That is a business-continuity property, not a line item. It is the same argument that sits underneath the question of who owns your harness, the orchestration layer that actually knows how your company works. Open weights are the only models nobody outside your walls can turn off.

Clearing up the "my data goes to China" reflex

There is a reflexive objection worth killing directly. "Chinese model equals my data goes to China" is simply wrong for open weights. Open weights run on your infrastructure. The weights may originate in a lab in Hangzhou or Beijing, but the weights are a static artifact — a file of numbers. When you self-host them, your data never leaves your servers. It goes to your GPUs, not theirs.

This is why the real boundary is not American versus Chinese. It is open-weight versus closed. A closed American API can log your prompts, change its terms, and go dark on a government's schedule. A set of open weights running in your own datacenter cannot do any of those things, regardless of which country trained it. The nationality of the training run is a distraction; the deployment topology is the actual security boundary.
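
If deployment topology is the boundary, it is worth checking in configuration. The sketch below is a crude lint that flags any inference endpoint whose host is not a literal loopback or private IP address. The function name and the example URLs are hypothetical, and this is a config sanity check under those assumptions, not a security control: it does not resolve hostnames or inspect where traffic actually flows.

```python
import ipaddress
from urllib.parse import urlsplit


def stays_on_our_network(endpoint_url: str) -> bool:
    """True if the endpoint host is a literal loopback or private IP address."""
    host = urlsplit(endpoint_url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames are rejected here; resolving DNS is out of scope for this sketch.
        return False
    return address.is_loopback or address.is_private


# Hypothetical endpoints for illustration.
endpoints = {
    "local-workhorse": "http://10.0.4.12:8000/v1",
    "dev-box": "http://127.0.0.1:8000/v1",
    "closed-api": "https://api.example.com/v1",
}

for name, url in endpoints.items():
    label = "internal" if stays_on_our_network(url) else "leaves the network"
    print(f"{name}: {label}")
```

A check like this belongs in CI next to the routing config, so that a workload meant to stay in-house cannot be silently repointed at an external API.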

When to still pay for closed

None of this means closed models are obsolete. They still lead on some cinematic, high-stakes workloads where the last few points of quality genuinely move the needle. The discipline is to pay for a closed model when the quality gap creates measurable business value — not by default, not out of habit, and not because it is the name everyone recognizes.

The rule is simple to state and harder to enforce: open-source first, self-host the core, pay for frontier only where it creates value you cannot get elsewhere. Enforcing it means building a routing layer, maintaining evals per task, and resisting the temptation to route everything to the smartest model because it is easier.
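
One way to enforce the rule is to let per-task eval scores, not habit, pick the model. The following is a minimal sketch under stated assumptions: the `Candidate` fields, the candidate names, the costs, and the scores are all invented for illustration, and a real setup would load scores from your own eval runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_requests: float  # assumed dollars
    eval_score: float            # score on this task's own eval set, 0.0 to 1.0
    open_weight: bool


def choose(candidates: list, quality_bar: float) -> Candidate:
    """Open-source first: cheapest open-weight model that clears the bar,
    else the cheapest closed model that clears it."""
    passing = [c for c in candidates if c.eval_score >= quality_bar]
    open_passing = [c for c in passing if c.open_weight]
    pool = open_passing or passing
    if not pool:
        raise ValueError("no candidate clears the quality bar")
    return min(pool, key=lambda c: c.cost_per_1k_requests)


# Hypothetical candidates and scores for a single task.
candidates = [
    Candidate("open-small", 1.0, 0.91, open_weight=True),
    Candidate("open-large", 3.0, 0.94, open_weight=True),
    Candidate("closed-frontier", 10.0, 0.97, open_weight=False),
]

for bar in (0.90, 0.93, 0.96):
    print(f"bar {bar:.2f}: {choose(candidates, bar).name}")
```

The closed model is selected only when the bar is set above what any open-weight candidate reaches, which turns "pay for frontier only where it creates value" into a threshold that someone has to justify per task.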

Key takeaways

The right question is not "which model is best?" but "which task actually needs the best?"
Most production traffic is the boring 80% — extraction, classification, routing — and frontier pricing on it is pure waste.
Route by task, not vendor: match each workload to the cheapest model that clears its quality bar.
Open weights self-hosted mean your data never leaves your servers, whatever the model's country of origin.
The real boundary is open-weight versus closed, not American versus Chinese.
Pay for closed models only where the quality gap creates business value you cannot get otherwise.

The race stopped being about the smartest model. It became about architecture that still works when today's smartest model is unavailable, unaffordable, or switched off. I write more about that shift across my essays on execution and AI infrastructure. Build the routing table now, while it is still a competitive edge rather than table stakes.