Lately, "token economy" is hot. The thesis: every business model in AI will eventually converge on one unit of account, the token. I buy it. But one premise keeps getting skipped: a token is not a standardized commodity.
Water has standard units. Electricity has standard units. Money, obviously. Tokens don't. They're more like gasoline: 92, 95, and 98 octane are different fuels, priced differently, for different engines. Adding them up by the liter and reporting one number means nothing.
Most contradictions in AI today come down to this.
I. Intelligence Has Tiers
Roughly four.
Top tier. Overseas: OpenAI GPT-5.5, Anthropic Claude Opus 4.7. China: Zhipu GLM-5.1, Moonshot Kimi K2.6, DeepSeek V4-Pro. Xiaomi MiMo-V2.5-Pro is a bit controversial, but usage and data are climbing, so I'll count it. These range from hundreds of billions to over a trillion parameters. Demand is almost unlimited; willingness to pay is fierce. Prices rise, quotas tighten, prices rise again—users keep pouring in. Zhipu's 2025 annual report showed GLM Coding Plan token calls up 15× in six months, with paying developers past 240,000. That's the real demand curve for top-tier tokens.
Mid-tier. This is the awkward gap. MiniMax M2.7, DeepSeek V4-Flash, Xiaomi MiMo-V2.5 standard—these are about it. Moderate size, an order of magnitude cheaper, theoretically the best value. But almost no one is seriously building here. I'll explain why later.
Low-mid. Mostly open source. Alibaba Qwen 3.6 leads, with both 35B-A3B (MoE) and 27B dense versions open. Google Gemma 4 is here too, from E2B to 31B.
On-device. A few billion parameters, or even sub-billion, fitting in a phone or a consumer GPU.
The first imbalance is right here. Top tier is a bloodbath. Mid-tier is empty. Low-mid and on-device are noisy but lack clear scenarios.
II. Speed Is Another Dimension
Tiers are only half the token story.
The other half is speed, measured in output tokens per second (TPS). GPT-5.5 at 30 TPS versus 200 TPS is a completely different experience.
Here are the 2026 numbers from Artificial Analysis, a commonly cited benchmark:
| Speed tier | Model | Output speed (tokens/s) |
|---|---|---|
| Flagship Standard | GPT-5.5 (high) | ~68 |
| Flagship Standard | Claude Opus 4.7 | ~48 |
| Flagship Standard | DeepSeek V4-Pro | ~48 |
| Flagship Standard | Kimi K2.6 | ~33 |
| High Speed | DeepSeek V4-Flash | ~126 |
| High Speed | Gemini 3.5 Flash | ~203 |
| Ultra High Speed | GLM-5.1 High-Speed Edition | 400 (official) |
| Ultra High Speed | Cerebras running Kimi K2.6 | 981 |
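To make the table concrete, here is a rough sketch of how long a fixed-length response takes to stream at each listed speed. The 2,000-token response length is an assumption chosen for illustration, and time-to-first-token is ignored, so treat the results as lower bounds on the wait.

```python
# Output speeds (tokens/s) copied from the table above.
OUTPUT_TPS = {
    "GPT-5.5 (high)": 68,
    "Claude Opus 4.7": 48,
    "DeepSeek V4-Pro": 48,
    "Kimi K2.6": 33,
    "DeepSeek V4-Flash": 126,
    "Gemini 3.5 Flash": 203,
    "GLM-5.1 High-Speed Edition": 400,
    "Cerebras running Kimi K2.6": 981,
}

RESPONSE_TOKENS = 2_000  # assumed response length, for illustration only


def seconds_to_stream(tokens: int, tps: float) -> float:
    """Time to emit `tokens` output tokens at a steady `tps`, ignoring time-to-first-token."""
    return tokens / tps


stream_times = {
    model: seconds_to_stream(RESPONSE_TOKENS, tps) for model, tps in OUTPUT_TPS.items()
}

# Print from fastest to slowest.
for model, secs in sorted(stream_times.items(), key=lambda kv: kv[1]):
    print(f"{model:<28} {secs:6.1f} s")
```

Under that assumption, the same Kimi K2.6 response takes about a minute at ~33 TPS and about two seconds at 981 TPS.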
I wrote a post earlier, "A Model 5× Faster Is No Longer the Same Model." The argument: a 5× speedup unlocks product forms that literally didn't exist before. This isn't "slightly faster." It's a different species.
The market already prices this. Anthropic Opus Fast: 2.5× speed, 6× price. OpenAI Priority Tier: 2.5× price. Look at the Opus ratio: price rises faster than speed, so each unit of speed costs about 2.4× more. That's not greed. It's a pricing signal: there's a real cohort willing to pay multiples for speed.
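The "price rises faster than speed" point is simple arithmetic. A minimal sketch, using only the Opus Fast multiples quoted above (the function name is made up for illustration):

```python
def price_per_unit_speed(price_multiple: float, speed_multiple: float) -> float:
    """How much more each unit of speed costs on the fast tier versus the standard tier."""
    return price_multiple / speed_multiple


# Opus Fast, as quoted above: 6x the price for 2.5x the speed.
opus_fast = price_per_unit_speed(price_multiple=6.0, speed_multiple=2.5)
print(f"Opus Fast: {opus_fast:.1f}x the price per unit of speed")
```

A ratio of 1.0 would mean speed is priced linearly. Anything above 1.0 is a premium paid for latency itself.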
Intelligence tier × speed tier. Stack them and you get a matrix. The token in each cell is a different product.
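One way to picture the matrix is to key every usage record by its cell rather than adding everything into one total. The sketch below is illustrative only: the `TokenUsage` type, the tier labels, and every token count are hypothetical, invented to show the bookkeeping.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    intelligence_tier: str  # e.g. "top", "mid", "low-mid", "on-device"
    speed_tier: str         # e.g. "standard", "high", "ultra-high"
    tokens: int


# Hypothetical monthly usage records, invented purely for illustration.
USAGE = [
    TokenUsage("top", "standard", 400_000_000),
    TokenUsage("top", "ultra-high", 50_000_000),
    TokenUsage("mid", "high", 300_000_000),
    TokenUsage("on-device", "standard", 250_000_000),
    TokenUsage("top", "standard", 100_000_000),
]


def tokens_by_cell(records):
    """Sum tokens per (intelligence tier, speed tier) cell instead of one grand total."""
    cells = defaultdict(int)
    for r in records:
        cells[(r.intelligence_tier, r.speed_tier)] += r.tokens
    return dict(cells)


cells = tokens_by_cell(USAGE)
headline_total = sum(cells.values())

print(f"headline total: {headline_total:,} tokens")
for (intel, speed), n in sorted(cells.items()):
    print(f"{intel:<10} {speed:<11} {n:>13,}")
```

The headline total hides the fact that the four cells are four different products with different prices and different hardware behind them.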
III. Two Demand Tracks, Worlds Apart in Willingness to Pay
Who's burning top-tier tokens? Two main tracks.
First: coding agents. The fastest-growing, highest-burn category worldwide. On the surface, a coding agent writes code to solve problems. In practice, people use them for everything; the work just happens to get done through "writing code."
Second: consumer agents. The Claude app, ChatGPT app, Microsoft Copilot, and Zhipu's new AutoClaw (Claw Plan). AutoClaw launched March 2026 and hit 400,000 subscriptions in 20 days. Under the hood it's a coding agent wrapped in a non-technical shell, letting ordinary people "hire an AI employee."
The two tracks have very different willingness to pay.
Coding agent users demand peak intelligence—Opus 4.7, GPT-5.5 tier. Anything less fails. The work is valuable; time saved is valuable. They'll pay for top-tier tokens continuously. Stickiness is another story: when a better model drops, they switch immediately.
Consumer agent users differ. Their tasks are lower-value, they're price-sensitive, and they don't need absolute peak intelligence. A "mid-tier smarts, good value, acceptable speed" model fits them perfectly. The problem: that tier is nearly empty right now, with no real supply. That's how DeepSeek V4, with its extreme cost-performance, quickly captured this segment. I've noticed many friends around me switching to DeepSeek.
Demand looks like this, so model companies follow the money. That's why top-tier model providers keep screaming about compute shortages while almost no one is building mid-tier models.
IV. The Supply-Side Mismatch: Scarce Cards and Idle Racks at the Same Time
Demand misalignment carries over to the compute market.
The top-tier compute shortage is obvious.
Jensen Huang personally confirmed NVIDIA's Blackwell series (B200/GB200) is "sold out through mid-2026," with new enterprise orders facing 8–16 week lead times. Meta's annual CapEx is expected to pass $100 billion, and Microsoft is spending nearly $35 billion in a single quarter, all scrambling for these chips. In China, the frenzy is over B300 and H200: a B300 server costs ¥7 million and you still can't get one, and monthly rent has been pushed to ¥130,000–200,000. H200 was cleared for sale in China in January 2026; the first batch of 5,000–10,000 modules was snapped up by top vendors immediately, with cluster delivery pushed to Q2 2027. The older H100 has cooled. No one is fighting for it now.
Domestic top-tier chips are even more extreme. Huawei's latest Ascend 950PR only began mass production in March 2026, yet the full-year plan of 750,000 units was completely locked up: ByteDance (350,000), Alibaba (200,000), Tencent/Baidu (100,000), and government and enterprise IT innovation (100,000), with orders pushed to 2027. Each chip costs roughly $16,000 and delivers 1.56 PFLOPS at FP4, officially claimed at 2.87× H20 single-card performance. This is the first time in domestic AI chip history that an entire year's production was bought out. When DeepSeek V4 was open-sourced, it shipped day-zero support for eight domestic chips, listing Ascend NPUs alongside NVIDIA GPUs in the technical report. GLM-5 was trained entirely on Ascend + MindSpore, with support for seven domestic chips. This is about positioning: anchoring top models on domestic chips addresses both a technical problem and a supply problem.
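A quick sanity check on those allocation figures, as a sketch. It assumes the Tencent/Baidu number is a combined figure (the only reading under which the four lines sum to the plan) and applies the rough per-chip price uniformly, so the dollar figure is an order-of-magnitude estimate rather than a reported number.

```python
# Figures as reported in the paragraph above.
ALLOCATION_UNITS = {
    "ByteDance": 350_000,
    "Alibaba": 200_000,
    "Tencent/Baidu": 100_000,  # assumed to be a combined figure
    "Government and enterprise IT innovation": 100_000,
}
PLANNED_UNITS = 750_000
PRICE_PER_CHIP_USD = 16_000  # "roughly", so the result is only an order of magnitude

total_units = sum(ALLOCATION_UNITS.values())
implied_value_usd = total_units * PRICE_PER_CHIP_USD

print(f"allocated: {total_units:,} of {PLANNED_UNITS:,} planned units")
print(f"implied order value: ~${implied_value_usd / 1e9:.0f} billion")
for buyer, units in ALLOCATION_UNITS.items():
    print(f"{buyer:<40} {units / total_units:6.1%}")
```

The four allocations account for the whole plan, and at the quoted price the year's output implies on the order of $12 billion in chips.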
The hidden side is massive idle capacity in low-to-mid-range compute.
PPIO founder Yao Xin has said some domestic GPU AI compute centers have idle rates up to 80%. 36Kr reported some centers at only 10–20% utilization. Xinhua put it more bluntly: "General-purpose compute is relatively oversupplied; AI compute is relatively scarce"—an admission of structural mismatch. Prices reflect this: A100 prices crashed over 50%, RTX 4090 hourly rent dropped to ¥1–2, and the 5090 is around ¥2.5.
But the low-to-mid-range mismatch has two distinct bottlenecks.
Mid-tier datacenter cards (H20, L20, Huawei 910B, etc.) are stuck on infrastructure. Inference frameworks optimize for top-tier cards far more than for these. KV cache management, MoE expert parallelism, FP8/FP4 precision support: none of the critical paths is mature here. The hardware exists and the demand exists, but you can't deliver a top-tier experience.
Consumer PCIe cards (4090, 5090, 4090 48GB mods) face the opposite problem. The hardware can run inference: vLLM already supports the 5090 (it needs CUDA 12.8 and a fallback to FlashAttention 2, but it's usable enough). What's missing is good models designed for these cards. The 70B dense tier is obsolete: as of May 2026, the top six open-source models are all MoE, and dense has virtually disappeared at the flagship level. MoE total parameter counts routinely exceed 100B, which won't fit on consumer cards, while distilled small models can't match top quality. No one is supplying new, high-quality models tailored to 24GB/32GB/48GB VRAM limits.
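The VRAM constraint can be sketched with back-of-the-envelope arithmetic. The assumptions here are mine, not the article's: weights are quantized to 4 bits, all MoE experts must stay resident in VRAM (no offloading), 1 GB is 10^9 bytes, and KV cache and activation memory are ignored, which makes the estimate optimistic. The "100B MoE" entry is a generic stand-in for the ">100B" class, not a specific model.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


CARDS_GB = {"24GB": 24, "32GB": 32, "48GB": 48}

# Total parameter counts in billions. For MoE, total (not active) parameters set the footprint.
MODELS_B = {
    "27B dense": 27,
    "35B-A3B MoE": 35,
    "70B dense": 70,
    "100B MoE": 100,
}

BITS = 4  # assumed quantization level

# For each model, list the cards whose VRAM can hold the quantized weights.
fits = {
    model: [card for card, vram in CARDS_GB.items() if weight_gb(params, BITS) <= vram]
    for model, params in MODELS_B.items()
}

for model, cards in fits.items():
    need = weight_gb(MODELS_B[model], BITS)
    print(f"{model:<12} needs {need:5.1f} GB -> fits on: {', '.join(cards) or 'none'}")
```

Even under these generous assumptions, a 100B-total MoE needs 50 GB for weights alone and misses the 48GB mod, while a 70B dense model only fits on the largest card.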
So the picture is: 4090/5090 prices are absurdly cheap compared to datacenter cards, yet the mid-tier models you can actually run are still old stock like Llama 3.3 70B from late 2024. Individual developers experimenting locally, small-team PoCs, and privacy-sensitive on-prem deployments can get by. But for enterprise-grade mid-tier inference on these cards, no newly optimized models exist.
The issue isn't "total compute is insufficient." It's "compute can't align with demand."
Outsiders used to quote compute in "petaflops." That was always shaky; in the AI inference era it's nearly useless. Whether a compute unit can serve top-tier models depends on interconnect, memory bandwidth, FP4/FP8 support, and KV cache management. A hundred older cards can't match one top-tier card's single-stream speed.
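Why a hundred older cards don't add up to one fast stream can be sketched with a common rule of thumb: single-stream decoding tends to be bound by memory bandwidth, because each generated token has to read the active weights once. The bandwidth figures and model size below are placeholders, not vendor specifications. The sketch also assumes the model fits on one card and that the older cards run as independent replicas. Tensor parallelism can pool bandwidth across cards, but then interconnect becomes the limit.

```python
def single_stream_tps_ceiling(bandwidth_gb_s: float,
                              active_params_billion: float,
                              bytes_per_param: float) -> float:
    """Upper bound on one stream's decode speed if every token reads all active weights once."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Placeholder numbers, not vendor specifications.
OLD_CARD_BW_GB_S = 1_000
NEW_CARD_BW_GB_S = 8_000
ACTIVE_PARAMS_B = 40   # active parameters per token, in billions
BYTES_PER_PARAM = 1.0  # e.g. 8-bit weights

old_tps = single_stream_tps_ceiling(OLD_CARD_BW_GB_S, ACTIVE_PARAMS_B, BYTES_PER_PARAM)
new_tps = single_stream_tps_ceiling(NEW_CARD_BW_GB_S, ACTIVE_PARAMS_B, BYTES_PER_PARAM)

# One hundred old cards as independent replicas: more streams, same per-stream speed.
REPLICAS = 100
fleet_throughput = REPLICAS * old_tps

print(f"old card, one stream: {old_tps:.0f} TPS")
print(f"new card, one stream: {new_tps:.0f} TPS")
print(f"{REPLICAS} old cards: {fleet_throughput:.0f} TPS in aggregate, "
      f"still {old_tps:.0f} TPS per stream")
```

Aggregate throughput scales with card count, but the speed any single user sees does not, which is why a flat "petaflops" total says little about which token products a fleet can actually sell.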
You get a strange picture: top model providers scrambling for chips, while last-gen cards in datacenters can't be rented out even at a discount. Scarcity and glut, side by side.
V. The Market Will Correct the Mismatch, But It Takes Time
This mismatch won't last. The two bottlenecks will be pushed by two different market forces.
The infrastructure gap for mid-tier datacenter cards will be driven by engineering priorities. Inference frameworks follow the money. Once mid-tier model demand grows, top frameworks like vLLM, SGLang, and TensorRT-LLM will eventually be forced to prioritize H20, L20, and 910B optimization. Not glamorous, but inevitable.
The model supply gap for consumer cards is being pushed by distillation and small MoE. DeepSeek V4 has already been distilled into a ~9B version; the Qwen series has been working on this too. Once someone actually delivers "runs in 32GB VRAM, quality close to top-tier," idle 4090s and 5090s will immediately find work.
Another track is deep binding between domestic chips and domestic models. DeepSeek and Zhipu are both pursuing it; technically it's proven feasible. Once it fully works, the low-to-mid-range compute market will reshuffle structurally.
I'm fairly optimistic this will happen—it just takes time. Maybe a few quarters, maybe a year or two. For those who catch the rhythm, there's a structural window.
VI. Don't Reduce Tokens to a Single Number
Back to the opening line. "Token economy" is a fine term, but it's far less intuitive than selling water or electricity.
It's more like a gas station. Gasoline looks like one product but is sold in grades; tokens likewise look like one unit but are really an intelligence × speed matrix. Layer on the supply-side compute tier mismatch, and you have the real cause behind today's apparently contradictory industry phenomena: why model companies are scrambling for chips, why some AI compute centers sit idle, why the fast tier can charge 6×, and why mid-tier intelligence models are slow to arrive.
Next time you see "we've deployed N petaflops" or "we produce X trillion tokens per month," pause and ask: which intelligence tier, which speed tier, which demand track?
A token is not one thing.
References
Model Versions and Positioning
- OpenAI GPT-5.5 Instant Release
- Claude Opus 4.7 — Anthropic
- Zhipu GLM-5.1 Technical Docs
- Moonshot Kimi K2.6 Release
- DeepSeek V4 — API Docs
- Simon Willison: DeepSeek V4—almost on the frontier
- Xiaomi MiMo-V2.5-Pro Official
- MiniMax M2.5 Release
- Qwen 3.6-35B-A3B — ModelScope
- Google Gemma 4 Release
Speed Data
- Artificial Analysis — GPT-5.5 (high)
- Artificial Analysis — Claude Opus 4.7
- Artificial Analysis — DeepSeek V4 Pro
- Artificial Analysis — DeepSeek V4 Flash
- Artificial Analysis — Kimi K2.6
- Artificial Analysis — Gemini 3.5 Flash
- Zhipu GLM-5.1 High-Speed 400 tokens/s Report (IT Home)
- Cerebras Running Kimi K2.6 at 981 tokens/s
- Claude Opus Fast Mode: 2.5× Speed / 6× Price (Groundy)
- OpenAI Priority Processing Official Docs
Zhipu Products and Financial Reports
- Zhipu 2025 Annual Report (Sina Finance)
- QbitAI: Zhipu Financial Report Analysis
- TMTPost: Zhipu Financial Report Analysis
- AutoClaw / Claw Plan Launch (Jiemian News)
Compute Market
- NVIDIA Blackwell sold out through mid-2026 (FinancialContent)
- Domestic B300 Servers at ¥7 Million, Still Unavailable (Sina Finance)
- H200 Sales Ban Lifted in China: Buy or Rent? (Zhihu)
- 2026 Q1 GPU Rental Market Deep Dive
- High-End GPU Supply-Demand Mismatch Drives Compute Rental Boom (WallstreetCN)
- Huawei Ascend 950PR in Mass Production + Orders Pushed to 2027 (East Money)
- Huawei Ascend AI Chip Three-Year Roadmap: 950PR / 950DT / 960 / 970 (OSCHINA)
- DeepSeek V4 Fully Switches to Huawei Ascend 950PR (CSDN)
- PPIO Yao Xin on AI Compute Center Idle Rates (Leiphone)
- 36Kr: AI Compute Center Utilization Only 10–20%
- Xinhua: General Compute Oversupply, AI Compute Shortage
- A100 Price Trend Report
- RTX 4090 Hourly Rental Price Range ¥1.45–2.29 (Sohu, March 2026)
- RTX 5090 Compute at ¥2.5/Card/Hour (Gongji Compute)
- RTX 4090 48GB Mod Review (Mornai)
- 2026 LLM Landscape: MoE Drives Dense to Extinction at the Flagship Level (QubitTool)
- vLLM Deployment Guide on RTX 5090 (GitHub)
- Using a Modded 4090 for a Year: Great for Dev, Disaster for Production (Zhihu)
- DeepSeek V4 Day-0 Support for Eight Domestic Chips
- GLM-5 Supports Seven Domestic Chips (Guancha)