DEV Community: 欧阳石景

GLM-5.2 Made It Official: 9 of the Top 10 Open-Source LLMs Are Chinese

欧阳石景 — Wed, 17 Jun 2026 11:55:09 +0000

9 of the world's top 10 open-source LLMs are now Chinese. After GLM-5.2 landed
this week, the only non-Chinese model still in the top 10 is Llama. If you've been
putting off integrating these models because the access path looks weird from the US
or EU, this is the post I wish I'd had eighteen months ago.

The 9-of-10 moment

I keep two browser tabs open all day: aider.chat/docs/leaderboards and the HuggingFace Open LLM Leaderboard. As of this week:

Kimi K2.6 sits at the top of Aider Polyglot at 53.9 — currently SOTA for open-weights coding.
DeepSeek-R1 0528 is the reasoning king on math, agentic loops, and reasoning_effort-style chains.
Qwen3-Max / Qwen3-VL owns multimodal and multilingual.
GLM-5.2 (released this week by Zhipu) brings 1M+ context plus genuinely usable tool calling.

Add DeepSeek-V4, Kimi-Linear, GLM-4.6, Qwen3-VL and Yi-Lightning and you've got nine of the global top ten. The tenth is Llama 4 405B. Mistral, Falcon, the various European fine-tunes — all dropped off this quarter or are downstream of a Chinese base.

That's not a culture-war headline. It's a routing problem. Every indie stack that wants to be on the open-weights frontier in 2026 needs a clean path to four Chinese model families. Most don't have one yet.

The Reddit receipts

Before I get into code, the thing that's kept me writing about this is that the same complaint shows up on r/LocalLLaMA every week. Search for "Chinese model" + "outside China" and you'll find variations of:

"Tried signing up for DeepSeek directly — needs a +86 phone, my Visa got rejected, ended up paying 5.5% to OpenRouter just to call R1."
— paraphrased from a recurring r/LocalLLaMA thread on access friction

"Why is there no clean DeepSeek API proxy that just takes PayPal and gives me an OpenAI-compatible URL?"
— r/LocalLLaMA, Q1 2026, in a thread that hit 400+ upvotes

"OpenRouter is fine for trying things, but the markup compounds at scale and the failover doesn't work the way the docs imply."
— recurring r/LocalLLaMA + OpenRouter Discord complaint, see GitHub Issues #45663 and #50389

I'm not cherry-picking. The category demand is real, the friction is real, and the pattern that fixes it is genuinely small. Let me show you.

haotokai, in one sentence

I run haotokai.com. It's the gateway I built after running into the access frictions above too many times. Here's the line on the homepage:

The cheapest way to access DeepSeek, Kimi K2 and Qwen from outside China — one base URL, OpenAI-compatible, no markup, no subscription.

Five upstream channels: Kimi (K2.6), DeepSeek (R1 0528), Qwen (Qwen3 family), Zhipu (GLM-4.6, GLM-5.2 rolling out this week), and iFlytek Spark. Token prices are pass-through — DeepSeek-R1 is on the list at $0.55 / 1M input tokens, the published direct rate, with no card surcharge layered on top. PayPal works, $1 minimum top-up, no subscription tier.

If you're already on OpenRouter and bleeding the 5.5% card fee, this is the OpenRouter alternative DeepSeek users keep asking for in those threads. If you've never used a gateway and you've just been writing four separate clients for four separate providers, this is the upgrade path.

The pattern itself works against any OpenAI-compatible aggregator — litellm, openrouter, requesty, others. I'm using haotokai in the examples because it's mine, but the diff is the same.

The migration is one line

If your code already speaks OpenAI's Chat Completions schema, and almost everyone's does, switching is a base_url change plus a new key and model string. Here's the diff:

  from openai import OpenAI

  client = OpenAI(
-     api_key="sk-or-v1-xxxxxxxx",
-     base_url="https://openrouter.ai/api/v1",
+     api_key="sk-haotokai-xxxxxxxx",
+     base_url="https://api.haotokai.com/v1",
  )

  resp = client.chat.completions.create(
-     model="anthropic/claude-3.5-sonnet",
+     model="deepseek-reasoner",
      messages=[{"role": "user", "content": "Plan a 7-day Tokyo trip."}],
  )

That's the whole story. Streaming works. Tool calling works. JSON mode works. If you were previously calling Claude through OpenRouter and the 5.5% fee finally got under your skin, the swap to a Chinese-model gateway is the same shape of diff.

Verifying with curl, before you wire anything in

The first thing I do with any new gateway is hit it with raw curl. No SDK, no framework, no abstractions in the way. If this works, OpenAI Python will work. If this doesn't, no SDK is going to save you.

curl https://api.haotokai.com/v1/chat/completions \
  -H "Authorization: Bearer $HAOTOKAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {"role": "user", "content": "In one paragraph: why does long context matter for agentic loops?"}
    ],
    "max_tokens": 200
  }'

Now swap the model string and the same call hits a different upstream:

`model` value	Maps to	Best for
`deepseek-reasoner`	DeepSeek-R1 0528	Reasoning, math, agentic chains
`kimi-k2`	Moonshot Kimi K2.6	Coding (53.9 Aider Polyglot SOTA)
`qwen3-max`	Alibaba Qwen3-Max	Multimodal, multilingual, vision
`glm-5.2`	Zhipu GLM-5.2	1M+ context, tool calling

Same path, same JSON envelope, same auth. That's "OpenAI-compatible" in practice. It's also why getting international access to the Kimi K2 API through a thin proxy doesn't require learning a single new concept.

A real Python script: 4 models, one cost report

The thing I actually use this for, daily, is comparing models on a fixed prompt. Here's the script: it asks the same question to four Chinese open-source frontier models in parallel, then prints the answer plus per-call cost so I can see what a Qwen call from outside China actually costs versus the OpenRouter quote.

import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["HAOTOKAI_API_KEY"],
    base_url="https://api.haotokai.com/v1",
)

PROMPT = "In 60 words: when GLM-5.2 ships with 1M context, what's the first indie use case I should try?"

# Pass-through prices on haotokai (USD per 1M tokens, input / output).
PRICES = {
    "deepseek-reasoner": (0.55, 2.19),
    "kimi-k2":           (0.27, 1.10),
    "qwen3-max":         (0.30, 1.20),
    "glm-5.2":           (0.50, 1.50),
}

def ask(model: str):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=140,
    )
    in_tok  = resp.usage.prompt_tokens
    out_tok = resp.usage.completion_tokens
    p_in, p_out = PRICES[model]
    cost = (in_tok * p_in + out_tok * p_out) / 1_000_000
    return model, cost, resp.choices[0].message.content.strip()

with ThreadPoolExecutor(max_workers=4) as ex:
    for model, cost, answer in ex.map(ask, PRICES):
        print(f"\n[{model}]  ${cost:.6f}")
        print(answer)

Total bill for one run: under a tenth of a cent. Run the same script through OpenRouter and you'll add 5.5% on top of token cost plus whatever per-provider markup OpenRouter is routing to that hour. On a single dev session, invisible. On a CI pipeline grading 50,000 candidate prompts a week, it isn't.

When to use which model

I keep this cheatsheet pinned in Notion. It's deliberately lossy:

Reasoning, math, agentic loops → deepseek-reasoner. R1 0528 is the best open-weights reasoning model I've used. The token economics are aggressive enough that you can afford to let it think.
Coding, polyglot refactors, monorepo passes → kimi-k2. K2.6's 2M-token context window plus the Aider Polyglot 53.9 score means you can throw an entire repo at it and ask for a coherent rename. No other open model is close on that workflow yet.
Multimodal, multilingual, vision → qwen3-max or qwen3-vl. Qwen is the easiest "just works" model for image input, OCR-style tasks, and any non-English language work. If your product has Spanish, Arabic, Vietnamese, or Indonesian users, this is the one that doesn't fall apart on idioms.
Long context, tool use, structured output → glm-5.2. The new arrival. 1M+ context plus better tool-calling fidelity than GLM-4.6. Already swapping it in for legal-doc review and "summarize this 800-page transcript" patterns.

You don't need a router for this. You don't need an "Auto" model. You just need to know the four shapes and pick.
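In code, "know the four shapes and pick" can be as small as a lookup table. This is a minimal sketch: the model strings come from the cheatsheet above, while the task labels ("reasoning", "coding", and so on) and the pick_model helper are names I made up for illustration.

```python
# Task shape -> model string, transcribed from the cheatsheet above.
# The keys are illustrative labels, not anything the gateway defines.
MODEL_FOR = {
    "reasoning":    "deepseek-reasoner",
    "coding":       "kimi-k2",
    "multimodal":   "qwen3-max",
    "long_context": "glm-5.2",
}

def pick_model(task: str) -> str:
    """Return the model string for a task shape, or fail loudly."""
    try:
        return MODEL_FOR[task]
    except KeyError:
        raise ValueError(
            f"unknown task shape {task!r}; expected one of {sorted(MODEL_FOR)}"
        ) from None

print(pick_model("coding"))
```

The returned string goes straight into the model= argument of the chat completion call; nothing else in the request changes.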

Why the OpenRouter alternative space is real, in three bullets

The split is structural. 9-of-10 isn't a Q2 2026 anomaly — it's been trending that way for six straight quarters. Any indie stack that wants to stay on the frontier needs a clean path to those weights.
Indie economics ≠ enterprise economics. When you're a one-person team, you don't need 400 models, an SLA, or SOC 2. You need four model families at pass-through prices, $1 minimum top-up, PayPal at checkout. That product is cheaper to build and run than the everything-store, which is why a focused gateway can credibly undercut the big aggregators on the four channels it covers.
OpenAI compatibility commoditizes the gateway. Your switching cost is approximately zero — exactly the diff above. Pick the cheapest gateway that covers your four models, switch the day a cheaper one ships. That's good for users, painful for any gateway that thinks it owns its customers.

Try it

If any of this resonated, the offer is small and concrete: sign up at haotokai.com, get $1 in trial credit, and run the curl above against glm-5.2, deepseek-reasoner, kimi-k2, and qwen3-max. That's a few hundred calls each from one key. No card. PayPal works. The base URL is https://api.haotokai.com/v1.

If you stick around, great. If you don't, you've at least seen the OpenAI-compatible base-URL pattern in action and you'll have your own opinion on whether the 9-of-10 moment is worth wiring into your stack this week. My read: GLM-5.2 just nudged the question from "should I be using Chinese open-source models?" to "what's my excuse for not having one in production by Friday?"

The 5.5% Tax of OpenRouter — and Why I Built an Alternative

欧阳石景 — Mon, 15 Jun 2026 13:05:41 +0000

9 of the world's top 10 open-source LLMs are now Chinese. After GLM-5.2 landed, the
only non-Chinese model still in the top 10 is Llama. If your gateway taxes every call
by 5.5%, you are paying that tax to route to models you could reach without it.

The 5.5% you didn't notice you were paying

OpenRouter's pitch is fair: one key, 400+ models, transparent pricing.

But read the billing page closely. Every credit-card top-up adds 5.5% (minimum $0.80). Crypto top-ups add 5%. Token prices look "at cost" — until you check community benchmarks and notice DeepSeek-R1 routinely runs ~15% above what direct providers charge, depending on which underlying provider OpenRouter routes to that hour.

For a hobby project, this is invisible. For anyone running a real workload, the math gets uncomfortable fast.

$10,000 / month on inference  ->  ~$550 / month in routing tax
$100,000 / month                ->  ~$5,500 / month
$1,000,000 / month              ->  ~$55,000 / month

That's not a rounding error. That's an engineer's salary. And you're paying it to do something you could do with a Cloudflare Worker and ten lines of Go.
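If you want to rerun that arithmetic with your own spend, here is a small sketch. It uses only the two numbers quoted above (5.5% with a $0.80 minimum) and assumes, for simplicity, that a month's spend is paid as a single card top-up. The function names are mine.

```python
CARD_FEE_RATE = 0.055  # 5.5% card top-up fee quoted above
CARD_FEE_MIN = 0.80    # minimum fee in USD quoted above

def card_fee(top_up_usd: float) -> float:
    """Fee on a single card top-up: 5.5%, with a $0.80 floor."""
    return max(top_up_usd * CARD_FEE_RATE, CARD_FEE_MIN)

def effective_rate(top_up_usd: float) -> float:
    """Fee as a fraction of the top-up amount."""
    return card_fee(top_up_usd) / top_up_usd

for monthly in (10_000, 100_000, 1_000_000):
    print(f"${monthly:>9,} / month -> ${card_fee(monthly):>9,.0f} in fees")

# Small top-ups hit the floor, so their effective rate is above 5.5%.
print(f"$5 top-up: {effective_rate(5):.1%} effective")
```

The floor is the part that is easy to miss: under this assumption, any top-up below roughly $14.55 pays more than 5.5% in effective terms.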

I built haotokai as an OpenRouter alternative because I'm one of the people who ran that math, and the answer was: this needs a smaller, cheaper, narrower tool.

Reddit has been complaining for a year

This isn't me being clever. The biggest LLM-infra subreddits have been writing the same comment in different words for twelve months:

"Great for trying out niche models, but the 5.5% fee stings at scale."
— r/LocalLLaMA, on OpenRouter pricing

"I use OpenRouter for experimentation, then move to direct API or a BYOK router for production."
— r/LocalLLaMA, in a thread about cost optimization for indie hackers

"It's not actually routing — you still pick the model yourself."
— r/OpenAI, recurring complaint about the "Auto Router" framing

Those quotes are not cherry-picked outliers. Search "OpenRouter fee" on r/LocalLLaMA or the OpenRouter Discord and you will find the same thread, monthly, since the start of 2025. The product is good. The tax is real. Both things are true.

There's a second, quieter complaint that matters more for production:

"Provider returned an error from OpenRouter does not trigger model failover."
— OpenRouter GitHub Issue #45663

"Rate limit errors surfaced to user instead of auto-failover."
— OpenRouter GitHub Issue #50389

So you're paying 5.5% for routing — but a lot of users are reporting the routing doesn't fail over the way they expected. That's the gap I started building into.

What the 5.5% is actually paying for

To be fair to Alex and the OpenRouter team, that 5.5% covers some real things: card processing, fraud risk, chargebacks, frontend, dashboards, the leaderboard, Auto Router, and a Discord with 50K+ members.

But it also covers 400 models I will never call, a marketplace I don't need, and a level of "everything to everyone" that I, personally, am not the customer for.

I'm an indie hacker. I call four models, ever:

DeepSeek-V4 / DeepSeek-R1 for code and reasoning
Kimi K2 for long-context document work (2M context window, no joke)
Qwen3 for multilingual tasks
GLM-4.6 as a backup reasoning model

That's it. Four Chinese open-source families. If you look at the mid-2026 leaderboards, those families account for 9 of the global top 10. The fact that they're all built outside the U.S. is, technically, irrelevant: they sit at the top of the same evals everyone else ranks on.

So the question I asked myself, in October 2025, was: what does an OpenRouter alternative look like if it only has to do those four families well?

haotokai, in one sentence

Here's the positioning I landed on, and the line I keep on the homepage:

The cheapest way to access DeepSeek, Kimi K2 and Qwen from outside China — one base URL, OpenAI-compatible, no markup, no subscription.

That's it. No 400 models. No leaderboard. No Auto Router. No subscription tier. No 5.5% card fee. Pay-as-you-go from $1.

The trade is honest: you give up breadth, you get pass-through pricing on the four model families that, frankly, do most indie work in 2026.

If you want OpenAI, Claude, and Gemini in the same key, OpenRouter is still the right tool. If you want a DeepSeek API proxy that doesn't tax you for the privilege, an OpenRouter alternative is what you want — and you can probably guess where I think you should look.

One line of code to switch

The migration is the same as moving between any two OpenAI-compatible providers. You change the base URL. That's the entire story.

  from openai import OpenAI

  client = OpenAI(
-     api_key="sk-or-v1-xxxxxxxx",
-     base_url="https://openrouter.ai/api/v1",
+     api_key="sk-haotokai-xxxxxxxx",
+     base_url="https://api.haotokai.com/v1",
  )

Streaming works. Tool calling works. JSON mode works. Same Chat Completions schema, same error envelope. If your code worked against OpenRouter, it will work against haotokai once the key and base URL are swapped.

Verifying it with curl

Before wiring it into your stack, hit it with curl. This is the test I run on every gateway I evaluate:

curl https://api.haotokai.com/v1/chat/completions \
  -H "Authorization: Bearer $HAOTOKAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-reasoner",
    "messages": [
      {"role": "user", "content": "In one sentence: why does a 5.5% routing fee compound badly at scale?"}
    ],
    "max_tokens": 120
  }'

If you get a 200 with a normal choices[0].message.content, you're done. The same curl, against the same path, with the same JSON, works for kimi-k2, qwen3-max, and glm-4.6; only the model string changes. That's the whole UX promise of OpenAI-compatible gateways, and it's why international access to the Kimi K2 API shouldn't require relearning anything.
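The "normal choices[0].message.content" check is easy to script if you save the curl output. Below is a rough sketch that assumes the response follows the standard Chat Completions shape, with failures reported under a top-level "error" key. The extract_content helper is my own name, and the sample body is hand-written for the demo, not captured gateway output.

```python
import json

def extract_content(body: str) -> str:
    """Return choices[0].message.content from a Chat Completions JSON body."""
    data = json.loads(body)
    if "error" in data:
        raise RuntimeError(f"gateway returned an error: {data['error']}")
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response has no choices")
    content = choices[0].get("message", {}).get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("choices[0].message.content is missing or empty")
    return content

# Hand-written sample body, only here so the sketch runs offline.
sample = (
    '{"choices": [{"message": {"role": "assistant", '
    '"content": "Fees scale linearly with spend."}}]}'
)
print(extract_content(sample))
```

Running the same check against each model string gives a quick pass/fail per upstream before any SDK is involved.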

A real Python demo: 4 models, one cost report

The thing I actually use this for, daily, is comparing models on a fixed prompt. Here's the script: it asks the same question to four Chinese open-source frontier models in parallel, then prints the answer plus the per-call cost, so you can see what a Qwen call from outside China actually costs versus the OpenRouter quote.

import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["HAOTOKAI_API_KEY"],
    base_url="https://api.haotokai.com/v1",
)

PROMPT = "In 50 words: when is a 5.5% routing fee actually worth paying?"

# Pass-through prices on haotokai (USD per 1M tokens, input / output).
PRICES = {
    "deepseek-reasoner": (0.55, 2.19),
    "kimi-k2":           (0.27, 1.10),
    "qwen3-max":         (0.30, 1.20),
    "glm-4.6":           (0.50, 1.50),
}

def ask(model: str):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=120,
    )
    in_tok  = resp.usage.prompt_tokens
    out_tok = resp.usage.completion_tokens
    p_in, p_out = PRICES[model]
    cost = (in_tok * p_in + out_tok * p_out) / 1_000_000
    return model, cost, resp.choices[0].message.content.strip()

with ThreadPoolExecutor(max_workers=4) as ex:
    for model, cost, answer in ex.map(ask, PRICES):
        print(f"\n[{model}]  ${cost:.6f}")
        print(answer)

Run that on a Tuesday afternoon and the total bill comes in under a tenth of a cent. Run the same script through OpenRouter and add 5.5% on top of token cost plus the gateway's per-provider markup. On a single dev session that gap is invisible. On a CI pipeline that grades 50,000 candidate prompts a week, it isn't.

Why I think the OpenRouter alternative space is real

Three reasons, in order of how much they matter to me:

The model market split is permanent. 9 of the open-source top 10 are Chinese in 2026. That isn't going to flip in 2027. Any global stack needs a clean, OpenAI-shaped path to those models, and OpenRouter is one of the few tools that ships that path today — it just charges 5.5% for it.
Indie economics are different from enterprise economics. When you're a one-person team shipping a product, you don't need 400 models, you need the four you actually use, at pass-through prices, with a $1 minimum top-up. A cheap gateway for indie hackers is a different product from an enterprise gateway with SOC 2 and SSO, and both should exist.
OpenAI compatibility commoditizes the gateway. If everyone speaks the same Chat Completions schema, switching gateways is a one-line diff. That's good for users and bad for any gateway that thinks it owns its customers. The right response, as a gateway, is to charge less and stay narrow — not charge more and add features people didn't ask for.

That's the bet behind haotokai. It is the OpenRouter alternative for people who only care about DeepSeek, Kimi K2, Qwen, and GLM, and who would rather not pay a 5.5% tax to get there. It's also, today, the cleanest OpenRouter alternative DeepSeek users have for pass-through pricing on R1 specifically — DeepSeek-R1 is on the list at $0.55 / 1M input tokens, the published direct rate, with no card surcharge layered on top.

Try it

If any of the above resonated, the offer is small and concrete: sign up at haotokai.com, top up nothing, and you get $1 in free trial credit — enough to hit DeepSeek-R1, Kimi K2, Qwen3, and GLM-4.6 a few hundred times each from the same key. No card. PayPal works. The base URL is https://api.haotokai.com/v1, the SDK is whatever you already use, and the migration from OpenRouter is the diff above.

If you stick around after the free credit, great. If you don't, you've at least seen what the 5.5% was actually buying you — and you'll never look at an aggregator's billing page the same way again.

That alone, I'd argue, is worth ten minutes.

8 of the World's Top-10 Open-Source LLMs Are Chinese. Here's How to Use Them All with One OpenAI-Compatible Key.

欧阳石景 — Fri, 12 Jun 2026 12:24:17 +0000

Mid-2026 leaderboards: Kimi K2.6 leads at 53.9. The closest non-Chinese model trails by 14+ points.
If you've been ignoring this side of the model market, you're leaving capability on the table.

The leaderboard reality nobody talks about

Walk into any infra channel in San Francisco today and the model picker is still
GPT-4o, Claude, Llama-405. Meanwhile the global open-source leaderboards quietly
flipped: 8 of the top 10 spots now belong to Chinese labs. Moonshot's Kimi K2.6
sits at the top with a 14-point lead. DeepSeek-R1 still beats most closed
reasoning models on math and code. Qwen, GLM, Yi keep landing in benchmarks people
run anyway.

The gap between "top of leaderboard" and "what your team actually calls" is now embarrassingly wide.

Why most teams skip this layer

Talk to anyone who tried to wire up two of these directly. The list of frictions is always the same:

Sign-up walls. Most native dashboards still require a Chinese phone number (+86) or a domestic ID.
5 dashboards, 5 currencies. Some bill in CNY, some in USD, some in both. Reconciling a monthly invoice is a side quest.
5 different SDKs, each subtly off-spec from OpenAI. Streaming frames differ. Function calling differs. Even error JSON differs.
Region instability. When a model goes down at one provider, you have no fallback unless you wrote it yourself.

The result: even teams that want to use Kimi or DeepSeek end up shipping with whatever
their existing OpenAI key can reach.

The three-layer architecture (the missing middle)

This week at the Trusted Token Cloud Service Symposium in Beijing, Prof. Zheng Weimin
(Chinese Academy of Engineering, Tsinghua) framed token infrastructure as three layers:

Producers   →   Aggregators   →   Schedulers
(model labs)    (gateways)        (your app)

The middle layer is what's been missing for non-Chinese teams. It's also what
haotokai does: normalize all those producers behind a
single OpenAI-compatible endpoint, settle in USD only, and route around outages
automatically.

Disclosure: I run haotokai. This post is biased. The leaderboard isn't.

Three lines of code, six frontier models

If you're already using the openai SDK, here's the entire migration:

- base_url = "https://api.openai.com/v1"
- api_key  = "sk-openai-xxxxx"
- model    = "gpt-4o"
+ base_url = "https://api.haotokai.com/v1"
+ api_key  = "sk-haotokai-xxxxx"
+ model    = "kimi-k2"   # or deepseek-reasoner, qwen-max, glm-4.5, ...

Same SDK. Same streaming. Same function calling. Different gateway.

Python

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["HAOTOKAI_API_KEY"],
    base_url="https://api.haotokai.com/v1",
)

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Why is the middle layer eating the stack?"}],
)
print(resp.choices[0].message.content)

Node

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.HAOTOKAI_API_KEY,
  baseURL: "https://api.haotokai.com/v1",
});

const resp = await client.chat.completions.create({
  model: "kimi-k2",
  messages: [{ role: "user", content: "Summarize this 100k-token doc..." }],
});
console.log(resp.choices[0].message.content);

The killer demo: 4 models, one call

This is the kind of thing that's painful with 4 vendors but trivial with one gateway:

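import os
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Same client setup as in the Python example above.
client = OpenAI(
    api_key=os.environ["HAOTOKAI_API_KEY"],
    base_url="https://api.haotokai.com/v1",
)

MODELS = ["deepseek-chat", "qwen-max", "glm-4.5", "moonshot-v1-128k"]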

def ask(m):
    return m, client.chat.completions.create(
        model=m,
        messages=[{"role": "user",
                   "content": "What's the most under-appreciated trait of a great engineer?"}],
        max_tokens=80,
    ).choices[0].message.content

with ThreadPoolExecutor(max_workers=4) as ex:
    for m, ans in ex.map(ask, MODELS):
        print(f"[{m}] {ans}")

Four frontier Chinese models, side-by-side, in one Python file. Try doing that with native SDKs.

What you actually save

	Direct (5 vendors)	One gateway
Endpoints to manage	5+	1
API keys to rotate	5+	1
Billing currencies	USD / CNY / mixed	USD only
Sign-up phone requirement	mostly +86	none
Switch models in prod	rewrite SDK calls	change a string
DeepSeek-R1 price	$0.55 / 1M tokens	~50% cheaper
Failover when one drops	manual	automatic

Try it

📦 GitHub: openai-sdk-examples — clone-and-run examples in Python, Node, and curl
🌐 Sign up: https://www.haotokai.com (PayPal accepted, no Chinese phone required, welcome credit on signup)
📖 Long read: The Three-Layer Architecture of AI Tokens

If 80% of the open-source frontier is built in one country, and you're routing around that
because the sign-up form asked for a +86 number, that's a worse engineering trade than people admit.

The gateway layer fixes that — and the gateway is OpenAI-compatible, so the migration is
genuinely three lines.

The Three-Layer Architecture of AI Tokens: Why the Middle Is Eating the Stack

欧阳石景 — Wed, 10 Jun 2026 12:27:23 +0000

Something interesting is happening in the way smart people talk about AI infrastructure.

For the past two years, the conversation was about models — which one is biggest, which one writes the best code, which one will reach AGI first. That conversation hasn't gone away, but at recent AI infrastructure summits a different framing has been quietly taking over. Industry experts and academic researchers have started describing the token economy as a three-layer stack, not unlike the way we eventually came to think about cloud computing.

The framing goes like this:

Layer 1 — Producers. The model labs that actually train and serve frontier LLMs.
Layer 2 — Aggregators. The middleware that normalizes APIs, pools capacity, and bills users.
Layer 3 — Schedulers. The intelligence that routes each request to the right model at the right price.

If you build with AI today, you almost certainly live in Layer 1 — talking directly to one or two model providers. And if you've felt the pain of vendor lock-in, capacity outages, or surprise bills, the three-layer framing explains exactly why that pain exists and where it's going to be solved.

Spoiler: it's going to be solved in the middle. This post is about why.

The Single-Model Era Is Quietly Ending

In 2023, the typical AI app was a wrapper around gpt-3.5-turbo. In 2024, it was a wrapper around gpt-4 with a fallback to gpt-3.5 for cost. That was the entire architecture.

Look at a production AI app shipped in 2026 and the picture has fundamentally changed. A real example from a B2B SaaS team I spoke with last month:

Customer-facing chat: DeepSeek V3 for general turns, GPT-4o only on escalation
Internal RAG over Chinese documents: Qwen 2.5-72B
Long-document summarization: Kimi K2 (because of its million-token context)
Structured extraction: GLM-4-Flash (cheap and reliable)
Coding agent: Claude 3.5 Sonnet
Embeddings: a self-hosted open model

Six models. Six different APIs. Six different billing dashboards. Six different rate-limit policies. Six different ways to get paged at 3 a.m.

This is not because the team is over-engineering. It's because no single model is best at everything anymore, and the price-performance gap between models has gotten so wide that picking the wrong one for a task can multiply your bill by 30x. A request that costs $0.0003 on DeepSeek can cost $0.01 on GPT-4o for output that's qualitatively identical for the task at hand.

If you're still building "the OpenAI app," you're building yesterday's architecture. The multi-model app is the new default, and the multi-model app needs a different kind of infrastructure underneath it.

The Three-Layer Architecture, Properly Explained

Let me unpack the three layers in a way that makes sense if you've ever shipped code.

Layer 1: Producers — The Token Factories

Producers are the labs that train frontier models and operate the inference clusters that turn prompts into tokens. OpenAI, Anthropic, Google, Meta, DeepSeek, Moonshot, Zhipu, Alibaba's Qwen team, Mistral — these are all producers.

Producers compete on three things:

Capability — benchmark scores, reasoning depth, context length, multimodality.
Unit economics — cost per token, throughput per GPU.
Specialization — Chinese-language quality, coding ability, long-context recall, function calling.

What producers don't compete on is consistency. Every producer's API is subtly different. Authentication differs. Streaming formats differ. Function-calling schemas differ. Even the meaning of temperature drifts between vendors. This is not malice; it's just the natural state of a market where every player is moving at maximum speed.

Producers also can't afford to optimize for your workload. Their job is to keep the GPUs hot. Your job is to keep your users happy. Those goals are not always aligned.

Layer 2: Aggregators — The Universal Translators

The aggregator's job is to make the producer layer look like a single, well-behaved system.

A real aggregator does at least seven things:

Protocol normalization. One request schema (typically the OpenAI Chat Completions format) maps to every backend model.
Identity and billing. One API key, one wallet, one invoice — instead of six accounts in six countries with six different KYC processes.
Capacity pooling. Aggregators buy commitments from multiple producers and resell on demand, so individual developers don't have to predict their own usage.
Geographic accessibility. Producers in mainland China, Europe, and the US each have their own access rules. An aggregator can be the only practical way for a developer in, say, Brazil to use a Chinese model legally and reliably.
Payment flexibility. Most developers globally can't easily pay for, say, a DeepSeek API. Aggregators accept PayPal, cards, crypto — whatever the market actually uses.
Observability. Logs, latency metrics, error rates, and spend dashboards in one place.
Compatibility shimming. When a backend producer changes their schema (and they always do), the aggregator absorbs the breakage so your code doesn't.
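To make protocol normalization and compatibility shimming a bit more concrete, here is a minimal sketch of the adapter idea: one OpenAI-style request dict comes in, and a per-model adapter rewrites it for the backend. Everything named here is hypothetical. The model names, the "model_id" / "prompt_messages" / "max_output_tokens" fields, and the helper functions are invented for illustration and do not describe any real provider's API.

```python
from typing import Callable, Dict

# An adapter maps an OpenAI-style Chat Completions request (a dict)
# to the payload a particular backend expects.
Adapter = Callable[[dict], dict]

def passthrough(req: dict) -> dict:
    """For backends that already accept the Chat Completions schema."""
    return dict(req)

def to_hypothetical_backend(req: dict) -> dict:
    """Invented backend that uses different field names."""
    out = {
        "model_id": req["model"],
        "prompt_messages": req["messages"],
    }
    if "max_tokens" in req:
        out["max_output_tokens"] = req["max_tokens"]
    return out

# Hypothetical model names; a real aggregator would keep this registry
# up to date as backends change their schemas.
ADAPTERS: Dict[str, Adapter] = {
    "model-a": passthrough,
    "model-b": to_hypothetical_backend,
}

def normalize(req: dict) -> dict:
    """Pick the adapter for the requested model and apply it."""
    adapter = ADAPTERS.get(req["model"])
    if adapter is None:
        raise KeyError(f"no adapter registered for model {req['model']!r}")
    return adapter(req)

request = {
    "model": "model-b",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 50,
}
print(normalize(request))
```

When a backend changes its schema, only its adapter changes; the caller keeps sending the same request shape, which is the shimming point in the list above.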

If this list sounds familiar, it should. Stripe did this for payment processors. Cloudflare did this for origin servers. Twilio did this for telcos. In every case, the "boring" middle layer ended up being more strategically important — and often more valuable — than the producers it sat in front of.

Layer 3: Schedulers — The Routing Brain

Schedulers sit on top of the aggregator and decide, on a per-request basis, which model should handle the call.

A good scheduler considers:

Task type (reasoning vs. summarization vs. extraction vs. translation)
Required quality tier (is this customer-facing or background?)
Current price per million tokens for each candidate model
Current health and latency of each model
Fallback policy if the first choice fails

Today, the scheduler is usually a few hundred lines of code inside your application. In a couple of years, it will look more like a managed service, much the way Kubernetes eventually swallowed everyone's bespoke deployment scripts.
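For a sense of what those "few hundred lines" boil down to, here is a stripped-down sketch of the selection step. It filters candidates by task type, quality tier, and health, then orders them by price so that the rest of the list doubles as the fallback policy. Latency is left out for brevity. The Candidate fields, model names, tiers, and prices are all made up for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    model: str
    tasks: frozenset      # task types this model is considered good at
    tier: int             # quality tier, higher is better
    price_per_m: float    # current USD per 1M output tokens
    healthy: bool = True  # result of the latest health check

def schedule(task: str, min_tier: int, candidates: List[Candidate]) -> List[str]:
    """Return models to try in order: eligible and healthy, cheapest first."""
    eligible = [
        c for c in candidates
        if task in c.tasks and c.tier >= min_tier and c.healthy
    ]
    return [c.model for c in sorted(eligible, key=lambda c: c.price_per_m)]

# Illustrative pool; every value here is invented.
pool = [
    Candidate("model-a", frozenset({"reasoning", "extraction"}), tier=3, price_per_m=2.19),
    Candidate("model-b", frozenset({"extraction", "summarization"}), tier=2, price_per_m=1.10),
    Candidate("model-c", frozenset({"extraction"}), tier=1, price_per_m=0.10),
    Candidate("model-d", frozenset({"extraction"}), tier=2, price_per_m=0.50, healthy=False),
]

print(schedule("extraction", min_tier=2, candidates=pool))
```

In this pool the cheapest model is excluded for quality and the next cheapest for health, which is exactly the kind of decision that is tedious to hand-maintain across many call sites.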

Why the Middle Layer Eats the Stack

Here's the part that I think gets undersold. In a three-layer architecture, the middle layer is structurally the most strategic place to be — and the place most independent developers and startups should be paying attention to.

1. The middle layer is where lock-in dies

The biggest hidden tax in AI development right now is switching cost. Re-integrating a new model takes a week. Re-integrating five new models takes a quarter. Most teams just don't do it, and they overpay forever as a result.

An aggregator normalizes the interface. Once you're behind one, switching from GPT-4o to DeepSeek V3 is a string change, not a sprint.

2. The middle layer is where economics work

Producers price for their best customers — typically large enterprises with predictable, high-volume commits. Solo developers and small startups pay rack rate. Aggregators sit between the two: they negotiate volume rates with producers and resell in small chunks to long-tail developers. The arbitrage funds everyone in the middle.

This is exactly why AWS exists. EC2 isn't cheaper than running your own server because Amazon has some special source of cheap electricity. It's cheaper because Amazon buys capacity at industrial scale and sells it to you in per-minute increments.

3. The middle layer is where reliability lives

No single producer has 100% uptime. Anyone who's been on Anthropic during a capacity squeeze, or on OpenAI during a launch day, knows this in their bones. The only durable answer is multi-provider failover — and you can't do multi-provider failover until you have a unified interface to fail over with. That's the middle layer.
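As a rough sketch of why the unified interface matters, here is what failover reduces to once every provider can be called the same way. The `complete_with_failover` helper and the two stand-in providers are hypothetical; in a real setup each callable would wrap an actual API client, and you would catch only retryable errors (timeouts, 429s, 5xx) rather than every exception.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, reply) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # sketch only: narrow this to retryable errors in practice
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers so the sketch runs without any network access.
def flaky_provider(prompt: str) -> str:
    raise TimeoutError("capacity squeeze")


def backup_provider(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = complete_with_failover(
    "hello",
    [("primary", flaky_provider), ("backup", backup_provider)],
)
print(used, reply)  # backup echo: hello
```

The loop is trivial precisely because both providers share one call signature. Without that, every fallback path needs its own request builder, error mapping, and response parser.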

4. The middle layer is where new geographies open up

The most underrated story in AI right now is that the price-performance frontier has shifted. The cheapest token that meets the quality bar for many real tasks is no longer made in California. It's made in Hangzhou and Beijing. DeepSeek V3 is roughly 30x cheaper than GPT-4o on output tokens and ties or beats it on a large fraction of coding and reasoning tasks. Qwen 2.5 is genuinely competitive with Claude for many enterprise use cases. GLM-4 ships an extremely cheap "Flash" tier that's well suited to structured extraction.

Most non-Chinese developers have never used these models. Not because they're inferior — they often aren't — but because the access path is hard: foreign credit cards don't always work, KYC is in a foreign language, payment limits are restrictive, and the regional latency from outside Asia can be brutal without proper routing.

This is, structurally, an aggregator problem. Solve it once for everybody.

5. The middle layer is where the standards will eventually live

One of the consistent points at recent infrastructure conferences is that the AI industry has a standards gap. There's no equivalent of TCP/IP, or POSIX, or even OpenAPI for how a model should expose itself to the world. We're in the pre-standardization era, which is exactly when middleware companies create de facto standards.

The Chat Completions schema — invented by OpenAI, adopted by everyone else because it was already there — is the first such standard. There will be more. They will almost certainly emerge from the aggregator layer, because that's where the pressure to standardize is highest.

What a Production-Grade Middle Layer Actually Looks Like

If you've never used an aggregator, here's what working with one feels like in practice.

from openai import OpenAI

# One API key. Every model.
client = OpenAI(
    api_key="YOUR_HAOTOKAI_KEY",
    base_url="https://api.haotokai.com/v1"
)

# Cheap, fast, Chinese-language-strong
qwen_reply = client.chat.completions.create(
    model="qwen2.5-72b-instruct",
    messages=[{"role": "user", "content": "Summarize this doc..."}]
)

# Long-context: 128k-token window
full_book_text = "..."  # placeholder: substitute your own long document
kimi_reply = client.chat.completions.create(
    model="moonshot-v1-128k",
    messages=[{"role": "user", "content": full_book_text}]
)

# Reasoning-heavy task
deepseek_reply = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Design a sharding scheme..."}]
)

# Structured extraction at near-zero cost
glm_reply = client.chat.completions.create(
    model="glm-4-flash",
    messages=[{"role": "user", "content": "Extract all invoice line items..."}]
)

Same SDK. Same request shape. Same billing wallet. Same observability. No new authentication, no new error handling, no new rate-limit logic.

That's the whole point. The middle layer's job is to disappear.

Where Haotokai Fits

This is the part where I should be transparent: Haotokai is a Layer-2 aggregator, and it's the product I work on. The reason we built it is exactly the thesis of this post — the middle layer is where most developers' real pain lives, and there wasn't a good option for developers outside China who wanted clean access to the Chinese model ecosystem.

Concretely, Haotokai gives you:

One OpenAI-compatible endpoint across DeepSeek (V3, R1), Qwen 2.5, GLM-4, Kimi (Moonshot), Spark (iFlytek), and more.
Pricing that mirrors the source providers, so the cheap Chinese models stay cheap — typically 60–90% below GPT-4o-class pricing.
PayPal, card, and crypto payment, so you don't need a Chinese bank account to use Chinese tokens.
One dashboard, one wallet, one invoice for everything you spend across providers.
Drop-in compatibility with the OpenAI SDK and anything built on top of it (LangChain, LlamaIndex, Vercel AI SDK, etc.).
$20 in free credit to try every model side by side before you commit.

If you're already running a multi-model setup, Haotokai consolidates the integration mess. If you're a single-model shop curious about the price-performance frontier outside the US labs, it's probably the lowest-friction way to experiment.

The Honest Counter-Arguments

I'd be wasting your time if I didn't address the obvious objections.

"Aggregators are just middlemen taking a cut."
Mathematically, yes: there's a markup. Practically, the markup is small (usually 5–15%), and it's dwarfed by the savings from being able to route to cheaper models. If switching 70% of your traffic to a model that's 10x cheaper saves you roughly 63% on your bill, a 10% middleware fee on top still leaves you close to 60% ahead.
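The arithmetic behind that claim is easy to check yourself. This is a back-of-the-envelope sketch under two simplifying assumptions of mine: cost scales linearly with traffic share, and the aggregator markup applies to the entire bill. The `blended_cost` helper is hypothetical, not part of any SDK.

```python
def blended_cost(share_moved: float, price_ratio: float, markup: float = 0.0) -> float:
    """Relative bill after rerouting, where 1.0 is the original bill.

    share_moved: fraction of traffic moved to the cheaper model
    price_ratio: cheaper model's price as a fraction of the original (0.10 = 10x cheaper)
    markup:      aggregator fee applied to the whole bill (0.10 = 10%)
    """
    base = (1.0 - share_moved) + share_moved * price_ratio
    return base * (1.0 + markup)


before_fee = blended_cost(0.70, 0.10)               # 0.30 + 0.07 = 0.37
after_fee = blended_cost(0.70, 0.10, markup=0.10)   # 0.37 * 1.10 = 0.407

print(f"savings before fee: {1 - before_fee:.1%}")  # savings before fee: 63.0%
print(f"savings after fee:  {1 - after_fee:.1%}")   # savings after fee:  59.3%
```

Plug in your own traffic split and price ratio. The fee only wipes out the savings when the cheaper model is barely cheaper or almost none of your traffic can move.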

"I'm worried about another point of failure."
Reasonable concern, but in practice a well-run aggregator improves reliability because it can fail over between producers automatically. Single-producer setups have no fallback. Multi-producer setups behind an aggregator have several.

"What about data privacy?"
Pick an aggregator that doesn't log prompts and doesn't train on your data, and the privacy posture is essentially the same as going direct. For workloads that need dedicated compliance (HIPAA, SOC 2, regional data residency), stick with a producer that offers those certifications. For everything else, the aggregator is fine.

"I'll just build my own routing layer."
You can, and many teams do. The question is whether routing is your business. For Stripe, payment routing is the business. For Cloudflare, traffic routing is the business. For your AI startup, the chatbot or the agent or the document tool is the business. Build the differentiated thing; rent the boring infrastructure.

What to Take Away

The three-layer framing for AI tokens isn't a marketing slide. It's a useful description of where the industry is actually heading, and once you see it you can't unsee it.

Producers will keep training better models and competing on capability.
Schedulers will become a managed service category over the next 2–3 years.
Aggregators in the middle will quietly become the place where most developers actually live.

If you're building a serious AI application today, the highest-leverage architectural decision you can make is to stop talking to producers directly and start talking to a normalized middle layer. It's the same lesson the web learned with CDNs, mobile learned with cross-platform SDKs, and payments learned with Stripe. The middle is where the leverage is.

The single-model era is over. The multi-model era needs a middle layer. That middle layer is the next critical piece of AI infrastructure.

If you want to try the multi-model approach hands-on, you can grab a free API key on Haotokai and run requests against DeepSeek, Qwen, GLM, Kimi and more in under five minutes. $20 in free credit, no card required to start.

Curious which model to start with? Our Best Chinese AI Models 2026 comparison is a good first read.