guanjiawei

Posted on May 26 • Originally published at guanjiawei.ai

A Token Is Not a Thing

#ai #token #infra #compute

Lately, "token economy" is hot. Every business model in AI will eventually converge on one unit of account: the token. I buy that thesis. But one premise keeps getting skipped—a token is not a standardized commodity.

Water has standard units. Electricity has standard units. Money, obviously. Tokens don't. They're more like gasoline: 92, 95, and 98 octane are different fuels, priced differently, for different engines. Adding them up by the liter and reporting one number means nothing.

Most contradictions in AI today come down to this.

I. Intelligence Has Tiers

Roughly four.

Top tier. Overseas: OpenAI GPT-5.5, Anthropic Claude Opus 4.7. China: Zhipu GLM-5.1, Moonshot Kimi K2.6, DeepSeek V4-Pro. Xiaomi MiMo-V2.5-Pro is a bit controversial, but usage and data are climbing, so I'll count it. These range from hundreds of billions to over a trillion parameters. Demand is almost unlimited; willingness to pay is fierce. Prices rise, quotas tighten, prices rise again—users keep pouring in. Zhipu's 2025 annual report showed GLM Coding Plan token calls up 15× in six months, with paying developers past 240,000. That's the real demand curve for top-tier tokens.

Mid-tier. This is the awkward gap. MiniMax M2.7, DeepSeek V4-Flash, Xiaomi MiMo-V2.5 standard—these are about it. Moderate size, an order of magnitude cheaper, theoretically the best value. But almost no one is seriously building here. I'll explain why later.

Low-mid. Mostly open source. Alibaba Qwen 3.6 leads, with both 35B-A3B (MoE) and 27B dense versions open. Google Gemma 4 is here too, from E2B to 31B.

On-device. A few billion parameters, or even sub-billion, fitting in a phone or a consumer GPU.

The first imbalance is right here. Top tier is a bloodbath. Mid-tier is empty. Low-mid and on-device are noisy but lack clear scenarios.

II. Speed Is Another Dimension

Tiers are only half the token story.

The other half is speed. GPT-5.5 at 30 tokens per second (TPS) versus 200 TPS is a completely different experience.

Here are the 2026 numbers from Artificial Analysis, a commonly cited benchmark:

Tier	Model	Output TPS
Flagship Standard	GPT-5.5 (high)	~68
Flagship Standard	Claude Opus 4.7	~48
Flagship Standard	DeepSeek V4-Pro	~48
Flagship Standard	Kimi K2.6	~33
High Speed	DeepSeek V4-Flash	~126
High Speed	Gemini 3.5 Flash	~203
Ultra High Speed	GLM-5.1 High-Speed Edition	400 (official)
Ultra High Speed	Cerebras running Kimi K2.6	981
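To make the table concrete, here is a minimal sketch that converts each output speed into the wait for one response. The 2,000-token response length is an assumption chosen for illustration, not a figure from the benchmark, and the calculation ignores time-to-first-token and network latency.

```python
# Output speeds (tokens per second) copied from the table above.
OUTPUT_TPS = {
    "GPT-5.5 (high)": 68,
    "Claude Opus 4.7": 48,
    "DeepSeek V4-Pro": 48,
    "Kimi K2.6": 33,
    "DeepSeek V4-Flash": 126,
    "Gemini 3.5 Flash": 203,
    "GLM-5.1 High-Speed Edition": 400,
    "Cerebras running Kimi K2.6": 981,
}

# Assumption: a 2,000-token response. This length is illustrative only.
RESPONSE_TOKENS = 2_000


def seconds_to_generate(tokens: int, tps: float) -> float:
    """Pure decode time; ignores time-to-first-token and network latency."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return tokens / tps


def wait_times(tokens: int = RESPONSE_TOKENS) -> dict[str, float]:
    """Seconds each entry in the table needs to emit `tokens` output tokens."""
    return {name: seconds_to_generate(tokens, tps) for name, tps in OUTPUT_TPS.items()}


if __name__ == "__main__":
    # Print fastest first so the spread is easy to see.
    for name, secs in sorted(wait_times().items(), key=lambda kv: kv[1]):
        print(f"{name:<30} {secs:6.1f} s")
```

Under that assumed length, the slowest flagship entry takes about a minute and the fastest entry about two seconds, which is the gap between waiting on a result and working interactively.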

I wrote a post earlier, A Model 5× Faster Is No Longer the Same Model. The argument: a 5× speedup unlocks product forms that literally didn't exist before. This isn't slightly faster. It's a different species.

The market already prices this. Anthropic Opus Fast: 2.5× speed, 6× price. OpenAI Priority Tier: 2.5× price. Look at those ratios—price rises faster than speed. Not greed. It's a pricing signal. There's a real cohort willing to pay multiples for speed.
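The "price rises faster than speed" point is simple arithmetic, sketched below. The function name is hypothetical. Only the Opus Fast multiples quoted above are used; the Priority Tier is left out because the text gives its price multiple but not its speed multiple.

```python
def price_per_unit_speed(price_multiple: float, speed_multiple: float) -> float:
    """Cost of each unit of speed, relative to the standard tier (1.0 = same)."""
    if speed_multiple <= 0:
        raise ValueError("speed_multiple must be positive")
    return price_multiple / speed_multiple


# Multiples quoted in the text for Anthropic Opus Fast: 6x price, 2.5x speed.
opus_fast = price_per_unit_speed(price_multiple=6.0, speed_multiple=2.5)

if __name__ == "__main__":
    print(f"Opus Fast pays {opus_fast:.1f}x per unit of speed")
```

A value above 1.0 means the fast tier is a premium product, not the same product delivered sooner at a proportional price.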

Intelligence tier × speed tier. Stack them and you get a matrix. The token in each cell is a different product.
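One way to picture the matrix is to keep token counts keyed by cell instead of collapsing them into one total. The sketch below does that; the `Usage` record, the tier labels, and all token counts are hypothetical placeholders, not measured data.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Usage(NamedTuple):
    intelligence_tier: str  # hypothetical labels, e.g. "top", "mid"
    speed_tier: str         # hypothetical labels, e.g. "standard", "fast"
    tokens: int


def tokens_by_cell(records: Iterable[Usage]) -> dict[tuple[str, str], int]:
    """Sum tokens per (intelligence tier, speed tier) cell."""
    cells: dict[tuple[str, str], int] = defaultdict(int)
    for r in records:
        cells[(r.intelligence_tier, r.speed_tier)] += r.tokens
    return dict(cells)


# Hypothetical usage log; the numbers are placeholders for illustration.
records = [
    Usage("top", "standard", 1_000_000),
    Usage("top", "fast", 200_000),
    Usage("mid", "standard", 5_000_000),
    Usage("top", "standard", 500_000),
]

if __name__ == "__main__":
    cells = tokens_by_cell(records)
    for (tier, speed), tokens in sorted(cells.items()):
        print(f"{tier:>4} x {speed:<8} {tokens:>10,}")
    # The single headline number hides which cells the tokens came from.
    print(f"headline total: {sum(cells.values()):,}")
```

The headline total is the same whether the tokens are mostly mid-tier or mostly top-tier, which is why the single number says so little.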

III. Two Demand Tracks, Worlds Apart in Willingness to Pay

Who's burning top-tier tokens? Two main tracks.

First: coding agents. The fastest-growing, highest-burn category worldwide. On the surface, a coding agent writes code to solve problems. In practice, people use them for everything. The work just happens to get done through "writing code."

Second: consumer agents. The Claude app, ChatGPT app, Microsoft Copilot, and Zhipu's new AutoClaw (Claw Plan). AutoClaw launched March 2026 and hit 400,000 subscriptions in 20 days. Under the hood it's a coding agent wrapped in a non-technical shell, letting ordinary people "hire an AI employee."

The two tracks have very different willingness to pay.

Coding agent users demand peak intelligence—Opus 4.7, GPT-5.5 tier. Anything less fails. The work is valuable; time saved is valuable. They'll pay for top-tier tokens continuously. Stickiness is another story: when a better model drops, they switch immediately.

Consumer agent users differ. Their tasks are lower-value, they're price-sensitive, and they don't need absolute peak intelligence. A "mid-tier smarts, good value, acceptable speed" model fits them perfectly. The problem: that tier has had almost no real supply. So when DeepSeek V4 arrived with extreme cost-performance, it quickly captured this segment. I've noticed many friends around me switching to DeepSeek.

Demand looks like this, so model companies follow the money. That's why top-tier model providers keep screaming about compute shortages while almost no one builds mid-tier models.

IV. The Supply-Side Mismatch: Scarce Cards and Idle Racks at the Same Time

Demand misalignment carries over to the compute market.

Top-tier compute shortage is obvious.

Jensen Huang personally confirmed NVIDIA's Blackwell series (B200/GB200) is "sold out through mid-2026," with new enterprise orders facing 8–16 week lead times. Meta's annual CapEx is expected to pass $100 billion, and Microsoft is spending nearly $35 billion in a single quarter, all scrambling for these chips. In China, the frenzy is over B300 and H200: a B300 server costs ¥7 million and you still can't get one, and monthly rent has been pushed to ¥130,000–200,000. H200 was cleared for sale in China in January 2026; the first batch of 5,000–10,000 modules was snapped up by top vendors immediately, and cluster delivery has slipped to Q2 2027. The older H100 has cooled. No one is fighting for it now.

Domestic top-tier chips are even more extreme. Huawei's latest Ascend 950PR only began mass production in March 2026, yet the full-year plan of 750,000 units was completely locked up: ByteDance (350,000), Alibaba (200,000), Tencent/Baidu (100,000), and government and enterprise IT innovation (100,000), with orders pushed into 2027. Each chip costs roughly $16,000 and delivers 1.56 PFLOPS at FP4, officially claimed to be 2.87× the single-card performance of the H20. This is the first time in domestic AI chip history that an entire year's production was bought out. When DeepSeek V4 was open-sourced, it shipped day-zero support for eight domestic chips, listing Ascend NPUs alongside NVIDIA GPUs in the technical report. GLM-5 was trained entirely on Ascend + MindSpore, with support for seven domestic chips. This is about positioning: anchoring top models on domestic chips is both a technical problem and a supply problem.

The hidden side is massive idle capacity in low-to-mid-range compute.

PPIO founder Yao Xin has said some domestic GPU AI compute centers have idle rates up to 80%. 36Kr reported some centers at only 10–20% utilization. Xinhua put it more bluntly: "General-purpose compute is relatively oversupplied; AI compute is relatively scarce"—an admission of structural mismatch. Prices reflect this: A100 prices crashed over 50%, RTX 4090 hourly rent dropped to ¥1–2, and the 5090 is around ¥2.5.

But the low-to-mid-range mismatch has two distinct bottlenecks.

Mid-tier datacenter cards (H20, L20, Huawei 910B, etc.) are stuck on infrastructure. Inference frameworks are optimized for top-tier cards far more than for these. KV cache management, MoE expert parallelism, FP8/FP4 precision support: none of the critical paths is mature here. The hardware exists and the demand exists, but you can't serve a top-tier experience.

Consumer PCIe cards (4090, 5090, 4090 48GB mods) face the opposite problem. The hardware can run models; vLLM already supports the 5090 (it needs CUDA 12.8 and a fallback to FlashAttention 2, which is usable enough). What's missing is good models designed for these cards. The 70B dense tier is obsolete: as of May 2026, the top six open-source models are all MoE, and dense has virtually disappeared at the flagship level. MoE total parameters routinely exceed 100B, which won't fit on consumer cards; distilled small models can't match top quality. No one is supplying new, high-quality models tailored to 24GB/32GB/48GB VRAM limits.
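A rough back-of-the-envelope check of the VRAM claim, as a sketch under stated assumptions: it counts model weights only (no KV cache, activations, or framework overhead), assumes 4-bit quantization by default, treats 1 GB as 10^9 bytes, and assumes all MoE experts stay resident in VRAM with no offloading. The function names are hypothetical.

```python
# VRAM sizes of the consumer cards named in the text, in GB.
CARD_VRAM_GB = (24, 32, 48)


def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB: parameters x bits / 8 bytes."""
    return params_billion * bits_per_weight / 8


def cards_that_fit(params_billion: float, bits_per_weight: int = 4) -> list[int]:
    """VRAM sizes with room for the weights; strict '<' leaves some headroom."""
    need = weight_gb(params_billion, bits_per_weight)
    return [vram for vram in CARD_VRAM_GB if need < vram]


if __name__ == "__main__":
    # Total parameter counts in billions. For MoE the total matters here,
    # not the active count, because every expert is assumed resident.
    for label, params in [("100B MoE (total)", 100), ("70B dense", 70),
                          ("35B-A3B MoE (total)", 35), ("27B dense", 27)]:
        fits = cards_that_fit(params)
        print(f"{label:<20} {weight_gb(params, 4):5.1f} GB at 4-bit -> fits: {fits}")
```

Even at 4-bit, a 100B-total MoE needs about 50 GB for weights alone, which is over the 48GB mod before any KV cache is allocated. A 70B dense model squeezes onto the 48GB card only, and the 27B to 35B class fits all three sizes.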

So the picture is: 4090/5090 prices are absurdly cheap compared to datacenter cards, yet the mid-tier models you can actually run are still old stock like Llama 3.3 70B from late 2024. Individual developers experimenting locally, small-team PoCs, and privacy-sensitive on-prem deployments can get by. But for enterprise-grade mid-tier inference on these cards, no newly optimized models exist.

The issue isn't "total compute is insufficient." It's "compute can't align with demand."

Outsiders used to quote compute in "petaflops." That was always shaky; in the AI inference era it's nearly useless. Whether a compute unit can serve top-tier models depends on interconnect, memory bandwidth, FP4/FP8 support, and KV cache management. A hundred older cards can't match one top-tier card's single-stream speed.
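The single-stream point can be sketched with the usual memory-bandwidth estimate for decoding: each generated token requires reading the active weights once, so per-stream speed is capped at roughly bandwidth divided by active weight bytes. All numbers below are hypothetical round figures, not specifications of any real card, and the sketch assumes each card runs its own model replica. Splitting one model across cards can raise per-stream speed, but then interconnect becomes the limit.

```python
def single_stream_tps_bound(mem_bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on decode tokens/s for one stream on one card."""
    if active_weight_gb <= 0:
        raise ValueError("active_weight_gb must be positive")
    return mem_bandwidth_gb_s / active_weight_gb


def aggregate_tps(per_stream_tps: float, replicas: int) -> float:
    """Total tokens/s across independent replicas; per-stream speed is unchanged."""
    return per_stream_tps * replicas


# Hypothetical inputs for illustration only.
OLD_CARD_BW_GB_S = 1_000   # assumed bandwidth of an older card
NEW_CARD_BW_GB_S = 8_000   # assumed bandwidth of a top-tier card
ACTIVE_WEIGHT_GB = 20      # assumed bytes read per generated token

old_tps = single_stream_tps_bound(OLD_CARD_BW_GB_S, ACTIVE_WEIGHT_GB)
new_tps = single_stream_tps_bound(NEW_CARD_BW_GB_S, ACTIVE_WEIGHT_GB)

if __name__ == "__main__":
    print(f"one old card, one stream:  {old_tps:.0f} tok/s")
    print(f"one new card, one stream:  {new_tps:.0f} tok/s")
    print(f"100 old cards, aggregate:  {aggregate_tps(old_tps, 100):.0f} tok/s")
    print(f"100 old cards, per stream: {old_tps:.0f} tok/s (unchanged)")
```

Under these assumptions a hundred older cards deliver plenty of aggregate throughput, yet every individual user still sees the older card's per-stream speed. That is why a single petaflops total does not tell you which speed tier a cluster can sell.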

You get a strange picture: top model providers scrambling for chips, while last-gen cards in datacenters can't be rented out even at a discount. Scarcity and glut, side by side.

V. The Market Will Correct the Mismatch, But It Takes Time

This mismatch won't last. The two bottlenecks will be pushed by two different market forces.

The infrastructure gap for mid-tier datacenter cards will be driven by engineering priorities. Inference frameworks follow the money. Once mid-tier model demand grows, top frameworks like vLLM, SGLang, and TensorRT-LLM will eventually be forced to prioritize H20, L20, and 910B optimization. Not glamorous, but inevitable.

The model supply gap for consumer cards is being pushed by distillation and small MoE. DeepSeek V4 has already been distilled into a ~9B version; the Qwen series has been working on this too. Once someone actually delivers "runs in 32GB VRAM, quality close to top-tier," idle 4090s and 5090s will immediately find work.

Another track is deep binding between domestic chips and domestic models. DeepSeek and Zhipu are both pursuing it; technically it's proven feasible. Once it fully works, the low-to-mid-range compute market will reshuffle structurally.

I'm fairly optimistic this will happen—it just takes time. Maybe a few quarters, maybe a year or two. For those who catch the rhythm, there's a structural window.

VI. Don't Reduce Tokens to a Single Number

Back to the opening line. "Token economy" is a fine term, but it's far less intuitive than selling water or electricity.

It's more like a gas station. Gasoline looks like one product but is sold in grades; tokens look like one product but are actually an intelligence × speed matrix. Layer on the supply-side compute tier mismatch, and you have the real cause behind today's apparently contradictory industry phenomena: why model companies are scrambling for chips, why some AI compute centers sit idle, why a fast tier can charge 6×, and why mid-tier models are slow to arrive.

Next time you see "we've deployed N petaflops" or "we produce X trillion tokens per month," pause and ask: which intelligence tier, which speed tier, which demand tier.

A token is not a thing.