<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 欧阳石景</title>
    <description>The latest articles on DEV Community by 欧阳石景 (@shijing).</description>
    <link>https://dev.to/shijing</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3976195%2F6d78a0ea-e601-40c1-bdf2-8b36abc75cfb.png</url>
      <title>DEV Community: 欧阳石景</title>
      <link>https://dev.to/shijing</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shijing"/>
    <language>en</language>
    <item>
      <title>The Three-Layer Architecture of AI Tokens: Why the Middle Is Eating the Stack</title>
      <dc:creator>欧阳石景</dc:creator>
      <pubDate>Wed, 10 Jun 2026 12:27:23 +0000</pubDate>
      <link>https://dev.to/shijing/the-three-layer-architecture-of-ai-tokens-why-the-middle-is-eating-the-stack-3m7m</link>
      <guid>https://dev.to/shijing/the-three-layer-architecture-of-ai-tokens-why-the-middle-is-eating-the-stack-3m7m</guid>
      <description>&lt;p&gt;Something interesting is happening in the way smart people talk about AI infrastructure.&lt;/p&gt;

&lt;p&gt;For the past two years, the conversation was about &lt;em&gt;models&lt;/em&gt; — which one is biggest, which one writes the best code, which one will reach AGI first. That conversation hasn't gone away, but at recent AI infrastructure summits a different framing has been quietly taking over. Industry experts and academic researchers have started describing the token economy as a &lt;strong&gt;three-layer stack&lt;/strong&gt;, not unlike the way we eventually came to think about cloud computing.&lt;/p&gt;

&lt;p&gt;The framing goes like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 — Producers.&lt;/strong&gt; The model labs that actually train and serve frontier LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 — Aggregators.&lt;/strong&gt; The middleware that normalizes APIs, pools capacity, and bills users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3 — Schedulers.&lt;/strong&gt; The intelligence that routes each request to the right model at the right price.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you build with AI today, you almost certainly live in Layer 1 — talking directly to one or two model providers. And if you've felt the pain of vendor lock-in, capacity outages, or surprise bills, the three-layer framing explains exactly why that pain exists and where it's going to be solved.&lt;/p&gt;

&lt;p&gt;Spoiler: it's going to be solved in the middle. This post is about why.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Single-Model Era Is Quietly Ending
&lt;/h2&gt;

&lt;p&gt;In 2023, the typical AI app was a wrapper around &lt;code&gt;gpt-3.5-turbo&lt;/code&gt;. In 2024, it was a wrapper around &lt;code&gt;gpt-4&lt;/code&gt; with a fallback to &lt;code&gt;gpt-3.5&lt;/code&gt; for cost. That was the entire architecture.&lt;/p&gt;

&lt;p&gt;Look at a production AI app shipped in 2026 and the picture has fundamentally changed. A real example from a B2B SaaS team I spoke with last month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer-facing chat:&lt;/strong&gt; DeepSeek V3 for general turns, GPT-4o only on escalation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal RAG over Chinese documents:&lt;/strong&gt; Qwen 2.5-72B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-document summarization:&lt;/strong&gt; Kimi K2 (because of its million-token context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured extraction:&lt;/strong&gt; GLM-4-Flash (cheap and reliable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding agent:&lt;/strong&gt; Claude 3.5 Sonnet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings:&lt;/strong&gt; a self-hosted open model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six models. Six different APIs. Six different billing dashboards. Six different rate-limit policies. Six different ways to get paged at 3 a.m.&lt;/p&gt;

&lt;p&gt;This is not because the team is over-engineering. It's because &lt;strong&gt;no single model is best at everything anymore&lt;/strong&gt;, and the price-performance gap between models has gotten so wide that picking the wrong one for a task can multiply your bill by 30x. A request that costs $0.0003 on DeepSeek can cost $0.01 on GPT-4o for output that's qualitatively identical for the task at hand.&lt;/p&gt;

&lt;p&gt;If you're still building "the OpenAI app," you're building yesterday's architecture. The multi-model app is the new default, and the multi-model app needs a different kind of infrastructure underneath it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three-Layer Architecture, Properly Explained
&lt;/h2&gt;

&lt;p&gt;Let me unpack the three layers in a way that makes sense if you've ever shipped code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Producers — The Token Factories
&lt;/h3&gt;

&lt;p&gt;Producers are the labs that train frontier models and operate the inference clusters that turn prompts into tokens. OpenAI, Anthropic, Google, Meta, DeepSeek, Moonshot, Zhipu, Alibaba's Qwen team, Mistral — these are all producers.&lt;/p&gt;

&lt;p&gt;Producers compete on three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability&lt;/strong&gt; — benchmark scores, reasoning depth, context length, multimodality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit economics&lt;/strong&gt; — cost per token, throughput per GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialization&lt;/strong&gt; — Chinese-language quality, coding ability, long-context recall, function calling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What producers &lt;em&gt;don't&lt;/em&gt; compete on is consistency. Every producer's API is subtly different. Authentication differs. Streaming formats differ. Function-calling schemas differ. Even the meaning of &lt;code&gt;temperature&lt;/code&gt; drifts between vendors. This is not malice; it's just the natural state of a market where every player is moving at maximum speed.&lt;/p&gt;

&lt;p&gt;Producers also can't afford to optimize for &lt;em&gt;your&lt;/em&gt; workload. Their job is to keep the GPUs hot. Your job is to keep your users happy. Those goals are not always aligned.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Aggregators — The Universal Translators
&lt;/h3&gt;

&lt;p&gt;The aggregator's job is to make the producer layer look like a single, well-behaved system.&lt;/p&gt;

&lt;p&gt;A real aggregator does at least seven things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Protocol normalization.&lt;/strong&gt; One request schema (typically the OpenAI Chat Completions format) maps to every backend model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity and billing.&lt;/strong&gt; One API key, one wallet, one invoice — instead of six accounts in six countries with six different KYC processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity pooling.&lt;/strong&gt; Aggregators buy commitments from multiple producers and resell on demand, so individual developers don't have to predict their own usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geographic accessibility.&lt;/strong&gt; Producers in mainland China, Europe, and the US each have their own access rules. An aggregator can be the only practical way for a developer in, say, Brazil to use a Chinese model legally and reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment flexibility.&lt;/strong&gt; Most developers globally can't easily pay for, say, a DeepSeek API. Aggregators accept PayPal, cards, crypto — whatever the market actually uses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability.&lt;/strong&gt; Logs, latency metrics, error rates, and spend dashboards in one place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility shimming.&lt;/strong&gt; When a backend producer changes their schema (and they always do), the aggregator absorbs the breakage so your code doesn't.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If this list sounds familiar, it should. Stripe did this for payment processors. Cloudflare did this for origin servers. Twilio did this for telcos. In every case, the "boring" middle layer ended up being more strategically important — and often more valuable — than the producers it sat in front of.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Schedulers — The Routing Brain
&lt;/h3&gt;

&lt;p&gt;Schedulers sit on top of the aggregator and decide, on a per-request basis, &lt;em&gt;which&lt;/em&gt; model should handle the call.&lt;/p&gt;

&lt;p&gt;A good scheduler considers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task type (reasoning vs. summarization vs. extraction vs. translation)&lt;/li&gt;
&lt;li&gt;Required quality tier (is this customer-facing or background?)&lt;/li&gt;
&lt;li&gt;Current price per million tokens for each candidate model&lt;/li&gt;
&lt;li&gt;Current health and latency of each model&lt;/li&gt;
&lt;li&gt;Fallback policy if the first choice fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, the scheduler is usually a few hundred lines of code &lt;em&gt;inside your application&lt;/em&gt;. In a couple of years, it will look more like a managed service, much the way Kubernetes eventually swallowed everyone's bespoke deployment scripts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Middle Layer Eats the Stack
&lt;/h2&gt;

&lt;p&gt;Here's the part that I think gets undersold. In a three-layer architecture, the middle layer is structurally the most strategic place to be — and the place most independent developers and startups should be paying attention to.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The middle layer is where lock-in dies
&lt;/h3&gt;

&lt;p&gt;The biggest hidden tax in AI development right now is &lt;strong&gt;switching cost&lt;/strong&gt;. Re-integrating a new model takes a week. Re-integrating five new models takes a quarter. Most teams just don't do it, and they overpay forever as a result.&lt;/p&gt;

&lt;p&gt;An aggregator normalizes the interface. Once you're behind one, switching from GPT-4o to DeepSeek V3 is a string change, not a sprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The middle layer is where economics work
&lt;/h3&gt;

&lt;p&gt;Producers price for their &lt;em&gt;best&lt;/em&gt; customers — typically large enterprises with predictable, high-volume commits. Solo developers and small startups pay rack rate. Aggregators sit between the two: they negotiate volume rates with producers and resell in small chunks to long-tail developers. The arbitrage funds everyone in the middle.&lt;/p&gt;

&lt;p&gt;This is exactly why AWS exists. EC2 isn't cheaper than running your own server because Amazon has cheaper electricity. It's cheaper because Amazon buys electricity at industrial scale and sells it to you in minute increments.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The middle layer is where reliability lives
&lt;/h3&gt;

&lt;p&gt;No single producer has 100% uptime. Anyone who's been on Anthropic during a capacity squeeze, or on OpenAI during a launch day, knows this in their bones. The only durable answer is multi-provider failover — and you can't do multi-provider failover until you have a unified interface to fail over with. That's the middle layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The middle layer is where new geographies open up
&lt;/h3&gt;

&lt;p&gt;The most underrated story in AI right now is that the price-performance frontier has shifted. The cheapest token that meets quality bar for many real tasks is no longer made in California. It's made in Hangzhou, in Beijing. DeepSeek V3 is roughly 30x cheaper than GPT-4o on output tokens and ties or beats it on a large fraction of coding and reasoning tasks. Qwen 2.5 is genuinely competitive with Claude for many enterprise use cases. GLM-4 ships an extremely cheap "Flash" tier that's perfect for structured extraction.&lt;/p&gt;

&lt;p&gt;Most non-Chinese developers have never used these models. Not because they're inferior — they often aren't — but because &lt;strong&gt;the access path is hard&lt;/strong&gt;: foreign credit cards don't always work, KYC is in a foreign language, payment limits are restrictive, and the regional latency from outside Asia can be brutal without proper routing.&lt;/p&gt;

&lt;p&gt;This is, structurally, an aggregator problem. Solve it once for everybody.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The middle layer is where the standards will eventually live
&lt;/h3&gt;

&lt;p&gt;One of the consistent points at recent infrastructure conferences is that the AI industry has a &lt;strong&gt;standards gap&lt;/strong&gt;. There's no equivalent of TCP/IP, or POSIX, or even OpenAPI for how a model should expose itself to the world. We're in the pre-standardization era, which is exactly when middleware companies create de facto standards.&lt;/p&gt;

&lt;p&gt;The Chat Completions schema — invented by OpenAI, adopted by everyone else because it was already there — is the first such standard. There will be more. They will almost certainly emerge from the aggregator layer, because that's where the pressure to standardize is highest.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Production-Grade Middle Layer Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;If you've never used an aggregator, here's what working with one feels like in practice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# One API key. Every model.
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_HAOTOKAI_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.haotokai.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Cheap, fast, Chinese-language-strong
&lt;/span&gt;&lt;span class="n"&gt;qwen_reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-72b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this doc...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Long-context — million-token window
&lt;/span&gt;&lt;span class="n"&gt;kimi_reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moonshot-v1-128k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_book_text&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reasoning-heavy task
&lt;/span&gt;&lt;span class="n"&gt;deepseek_reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Design a sharding scheme...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Structured extraction at near-zero cost
&lt;/span&gt;&lt;span class="n"&gt;glm_reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm-4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract all invoice line items...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same SDK. Same request shape. Same billing wallet. Same observability. No new authentication, no new error handling, no new rate-limit logic.&lt;/p&gt;

&lt;p&gt;That's the whole point. The middle layer's job is to disappear.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Haotokai Fits
&lt;/h2&gt;

&lt;p&gt;This is the part where I should be transparent: &lt;a href="https://www.haotokai.com" rel="noopener noreferrer"&gt;Haotokai&lt;/a&gt; is a Layer-2 aggregator, and it's the product I work on. The reason we built it is exactly the thesis of this post — the middle layer is where most developers' real pain lives, and there wasn't a good option for developers outside China who wanted clean access to the Chinese model ecosystem.&lt;/p&gt;

&lt;p&gt;Concretely, Haotokai gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One OpenAI-compatible endpoint&lt;/strong&gt; across DeepSeek (V3, R1), Qwen 2.5, GLM-4, Kimi (Moonshot), Spark (iFlytek), and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing that mirrors the source providers&lt;/strong&gt;, so the cheap Chinese models stay cheap — typically 60–90% below GPT-4o-class pricing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PayPal, card, and crypto payment&lt;/strong&gt;, so you don't need a Chinese bank account to use Chinese tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One dashboard, one wallet, one invoice&lt;/strong&gt; for everything you spend across providers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop-in compatibility&lt;/strong&gt; with the OpenAI SDK and anything built on top of it (LangChain, LlamaIndex, Vercel AI SDK, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$20 in free credit&lt;/strong&gt; to try every model side by side before you commit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're already running a multi-model setup, Haotokai consolidates the integration mess. If you're a single-model shop curious about the price-performance frontier outside the US labs, it's probably the lowest-friction way to experiment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Counter-Arguments
&lt;/h2&gt;

&lt;p&gt;I'd be wasting your time if I didn't address the obvious objections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Aggregators are just middlemen taking a cut."&lt;/strong&gt;&lt;br&gt;
Mathematically, yes — there's a markup. Practically, the markup is small (usually 5–15%), and it's dwarfed by the savings from being able to route to cheaper models. If switching 70% of your traffic to a model that's 10x cheaper saves you 65% on your bill, a 10% middleware fee is rounding error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I'm worried about another point of failure."&lt;/strong&gt;&lt;br&gt;
Reasonable concern, but in practice a well-run aggregator &lt;em&gt;improves&lt;/em&gt; reliability because it can fail over between producers automatically. Single-producer setups have no fallback. Multi-producer setups behind an aggregator have several.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What about data privacy?"&lt;/strong&gt;&lt;br&gt;
Pick an aggregator that doesn't log prompts and doesn't train on your data, and the privacy posture is essentially the same as going direct. For workloads that need dedicated compliance (HIPAA, SOC 2, regional data residency), stick with a producer that offers those certifications. For everything else, the aggregator is fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I'll just build my own routing layer."&lt;/strong&gt;&lt;br&gt;
You can, and many teams do. The question is whether routing is your business. For Stripe, payment routing is the business. For Cloudflare, traffic routing is the business. For your AI startup, the chatbot or the agent or the document tool is the business. Build the differentiated thing; rent the boring infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Take Away
&lt;/h2&gt;

&lt;p&gt;The three-layer framing for AI tokens isn't a marketing slide. It's a useful description of where the industry is actually heading, and once you see it you can't unsee it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Producers will keep training better models and competing on capability.&lt;/li&gt;
&lt;li&gt;Schedulers will become a managed service category over the next 2–3 years.&lt;/li&gt;
&lt;li&gt;Aggregators in the middle will quietly become the place where most developers actually live.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building a serious AI application today, the highest-leverage architectural decision you can make is to stop talking to producers directly and start talking to a normalized middle layer. It's the same lesson the web learned with CDNs, that mobile learned with cross-platform SDKs, and that payments learned with Stripe. The middle is where the leverage is.&lt;/p&gt;

&lt;p&gt;The single-model era is over. The multi-model era needs a middle layer. That middle layer is the next critical piece of AI infrastructure.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to try the multi-model approach hands-on, you can &lt;a href="https://www.haotokai.com" rel="noopener noreferrer"&gt;grab a free API key on Haotokai&lt;/a&gt; and run requests against DeepSeek, Qwen, GLM, Kimi and more in under five minutes. $20 in free credit, no card required to start.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Curious which model to start with? Our &lt;a href="https://blog.haotokai.com/best-chinese-ai-models-2026.html" rel="noopener noreferrer"&gt;Best Chinese AI Models 2026&lt;/a&gt; comparison is a good first read.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
