Chinese labs cut LLM API prices six times in the first half of 2026, and three of those cuts were declared permanent. DeepSeek V4-Pro now costs $0.87 per million output tokens. Xiaomi MiMo V2.5 flattened its long-context tier to $3 per million output tokens. Alibaba’s Qwen3 Max ships at $3.90. Moonshot’s Kimi K2.6 holds its cache-hit rate at $0.07. Zhipu’s GLM-5 sits at $3.20 output. Use this breakdown to choose, test, and route workloads across the top five Chinese frontier APIs in May 2026.
TL;DR
- Cheapest output tokens: DeepSeek V4-Pro at $0.87/MTok, roughly 34x below GPT-5.5.
- Cheapest 1M-context option: Xiaomi MiMo V2.5 Pro at $3/MTok output, flat across input length.
- Best general production balance: Alibaba Qwen3 Max at $3.90/MTok output with 262K context.
- Cheap cache hits on an agent-focused model: Moonshot Kimi K2.6 at $0.07/MTok cached, useful for long stable prompts. DeepSeek V4-Pro’s $0.003625/MTok cache-hit rate is lower still.
- Reasoning-heavy workloads: Zhipu GLM-5 at $3.20/MTok output with 200K context.
- Practical takeaway: route by workload. Do not pick one model for everything unless your workload is very narrow.
How the 2026 Chinese LLM price war unfolded
The price drops started in Q4 2025 and accelerated in Q2 2026:
- Q4 2025: DeepSeek V3.2 launches at $0.28/MTok input, undercutting US frontier prices by an order of magnitude. Kimi K2.6 follows with tiered context-aware pricing and a $0.07/MTok cache-hit rate.
- March 2026: Xiaomi unveils MiMo V2-Pro on OpenRouter with competitive tier-based rates.
- April 2026: DeepSeek V4 launches with a 75% promotional discount scheduled to expire May 31.
- May 22, 2026: DeepSeek makes the 75% discount permanent. V4-Pro stays at $0.435 input / $0.87 output.
- May 27, 2026: Xiaomi makes MiMo V2.5 pricing permanent at $1 input / $3 output, removing the long-context multiplier.
The cuts target different developer pain points:
- DeepSeek: raw cost-per-token.
- MiMo: long-context workloads that other models price out.
- Qwen: production stability and broad capability.
- Kimi: coding agents and repeated prompt-prefix workflows.
- GLM: structured reasoning and chain-of-thought-heavy tasks.
At a glance: top 5 Chinese LLM APIs in May 2026
| Model | Input ($/MTok) | Output ($/MTok) | Cache hit | Context | Best at |
|---|---|---|---|---|---|
| DeepSeek V4-Pro | $0.435 | $0.87 | $0.003625 | 128K | Cheapest per token, coding |
| Xiaomi MiMo V2.5 Pro | $1.00 | $3.00 | $0.20 | 1M | Long-document RAG, repo agents |
| Alibaba Qwen3 Max | $0.78 | $3.90 | $0.156 | 262K | Production balance |
| Moonshot Kimi K2.6 | $0.16–$2.00 tiered | ~$2.50 | $0.07 | 128K | Long system prompts, coding agents |
| Zhipu GLM-5 | $1.00 | $3.20 | Provider-defined | 200K | Structured reasoning |
How to read the table:
- Use flat-rate models for predictable billing. DeepSeek and MiMo are easier to model in production because pricing does not jump across context tiers.
- Benchmark cache-hit pricing separately. Kimi K2.6 and DeepSeek V4-Pro are outliers for repeated prefixes. If your agent reuses a stable system prompt, your effective input cost can be much lower than list input pricing.
- Do not ignore context limits. MiMo V2.5 is the only 1M-context option in this set. If your prompt regularly exceeds 300K tokens, the practical choice narrows quickly.
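The cache-hit point above is easier to judge with a little arithmetic. Here is a minimal sketch that blends the list and cache-hit input rates from the table for a given cached share. The function name and the 80% cached share are illustrative assumptions, and real billing depends on each provider's caching rules.

```javascript
// Effective input price ($/MTok) when a share of input tokens is served from cache.
// Rates come from the comparison table; cachedShare is an assumed workload property.
function effectiveInputRate(listRate, cacheHitRate, cachedShare) {
  if (cachedShare < 0 || cachedShare > 1) {
    throw new RangeError("cachedShare must be between 0 and 1");
  }
  return cachedShare * cacheHitRate + (1 - cachedShare) * listRate;
}

// Example: 80% of input tokens hit the cache (an assumption, not a measurement).
const deepseekRate = effectiveInputRate(0.435, 0.003625, 0.8);
const qwenRate = effectiveInputRate(0.78, 0.156, 0.8);
console.log(deepseekRate.toFixed(4), qwenRate.toFixed(4));
```

Measure your real cached share before trusting the result, since a prefix that changes slightly between calls may not hit the cache at all.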
Selection workflow
Before picking a model, classify your workload:
- Measure input/output ratio.
  - Output-heavy: code generation, content generation, agent chains.
  - Input-heavy: RAG, summarization, document analysis.
- Measure context size.
  - Under 128K: all five are possible.
  - 128K–200K: Qwen, GLM, or MiMo.
  - 200K–262K: Qwen or MiMo.
  - Above 262K: MiMo is the only option in this set.
- Check prompt stability.
  - Stable system prompt: prioritize cache-hit pricing.
  - Highly variable prompt: prioritize normal input/output rates.
- Run your own eval.
  - Use 50–100 real prompts.
  - Score correctness, latency, tool-call validity, and cost.
  - Do not rely only on public benchmarks.
A simple routing rule can look like this:
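function selectModel({ inputTokens, outputHeavy, stablePrefix, reasoningHeavy, multilingual }) {
  // Context guards follow the limits in the comparison table.
  if (inputTokens > 262_000) return "xiaomi-mimo-v2.5-pro"; // only 1M-context option
  if (reasoningHeavy && inputTokens <= 200_000) return "zhipu-glm-5";
  if (stablePrefix && inputTokens <= 128_000) return "moonshot-kimi-k2.6";
  if (multilingual) return "alibaba-qwen3-max";
  if (outputHeavy && inputTokens <= 128_000) return "deepseek-v4-pro";
  return "alibaba-qwen3-max"; // default for mixed traffic up to 262K
}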
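The rule above hard-codes priorities. A complementary sketch is to filter by context window first and then pick the cheapest model at list price. The prices and context limits below are copied from the comparison table. The model IDs are the labels used in this article rather than verified API identifiers, and counting expected output tokens against the context window is an assumption.

```javascript
// Prices ($/MTok) and context limits copied from the comparison table.
// Kimi K2.6 is left out because its input price is tiered by context band.
const MODELS = [
  { id: "deepseek-v4-pro", input: 0.435, output: 0.87, context: 128_000 },
  { id: "mimo-v2.5-pro", input: 1.0, output: 3.0, context: 1_000_000 },
  { id: "qwen3-max", input: 0.78, output: 3.9, context: 262_000 },
  { id: "glm-5", input: 1.0, output: 3.2, context: 200_000 },
];

// List-price cost in dollars for one request, ignoring cache hits.
function requestCost(model, inputTokens, outputTokens) {
  return (inputTokens * model.input + outputTokens * model.output) / 1_000_000;
}

// Cheapest model whose context window fits prompt plus expected output, or null.
function cheapestFit(inputTokens, outputTokens) {
  const fits = MODELS.filter((m) => inputTokens + outputTokens <= m.context);
  if (fits.length === 0) return null;
  return fits.reduce((best, m) =>
    requestCost(m, inputTokens, outputTokens) < requestCost(best, inputTokens, outputTokens)
      ? m
      : best
  );
}

console.log(cheapestFit(20_000, 2_000).id);  // deepseek-v4-pro
console.log(cheapestFit(150_000, 2_000).id); // qwen3-max
console.log(cheapestFit(600_000, 2_000).id); // mimo-v2.5-pro
```

Price alone ignores quality, so treat the result as a shortlist for your eval rather than a final choice.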
DeepSeek: the cheapest per token
Models: V4-Pro ($0.435 input / $0.87 output / $0.003625 cache hit, 128K context), V4-Flash ($0.14 / $0.28).
DeepSeek V4-Pro is the price floor of the Chinese frontier-tier shelf. The May 22 permanent cut put output tokens at $0.87/MTok, roughly 34x below GPT-5.5 and 17x below Claude Opus 4.7. Cache-hit pricing at $0.003625/MTok is the lowest first-party rate from any major lab. Pricing is confirmed against DeepSeek’s official pricing page.
Use DeepSeek V4-Pro when
- Your workload is output-heavy.
- You generate code, agent steps, reports, or content at scale.
- Your prompts fit inside 128K context.
- You can accept a small quality gap versus more expensive frontier models.
- You reuse stable 5K–10K-token system prompts and can benefit from prompt caching.
Avoid DeepSeek V4-Pro when
- Your prompts exceed 128K tokens.
- You need sub-second time-to-first-token.
- Your workload depends on long-document retrieval beyond the context window.
Implementation tip
For cost-sensitive generation, route only the final answer or code-generation step to DeepSeek:
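// Top-level await requires an ES module (or wrap this in an async function).
const response = await fetch("https://api.deepseek.com/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.DEEPSEEK_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "deepseek-v4-pro",
    messages: [
      { role: "system", content: "You are a concise coding assistant." },
      { role: "user", content: "Write a TypeScript function to validate an email." }
    ]
  })
});
// Fail loudly on HTTP errors instead of treating an error body as a completion.
if (!response.ok) throw new Error(`DeepSeek request failed: ${response.status}`);
// The response follows the OpenAI Chat Completions shape.
const data = await response.json();
console.log(data.choices[0].message.content);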
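To see what a call like this actually costs, you can price the `usage` object from the response. The sketch below uses the V4-Pro rates quoted above. It assumes the usage object exposes `prompt_cache_hit_tokens` and `prompt_cache_miss_tokens` alongside the standard `prompt_tokens` and `completion_tokens`; verify those field names against the provider's current API reference before relying on them.

```javascript
// V4-Pro rates in $/MTok, from the pricing quoted above.
const V4_PRO = { cacheMiss: 0.435, cacheHit: 0.003625, output: 0.87 };

// Estimate the dollar cost of one call from the response's `usage` object.
// Assumption: usage may carry prompt_cache_hit_tokens / prompt_cache_miss_tokens.
// When they are absent, every prompt token is priced at the cache-miss rate.
function estimateCost(usage, rates = V4_PRO) {
  const hit = usage.prompt_cache_hit_tokens ?? 0;
  const miss = usage.prompt_cache_miss_tokens ?? (usage.prompt_tokens ?? 0) - hit;
  const out = usage.completion_tokens ?? 0;
  return (hit * rates.cacheHit + miss * rates.cacheMiss + out * rates.output) / 1_000_000;
}

// Example with made-up token counts: a 9K prompt with 8K served from cache.
const cost = estimateCost({
  prompt_tokens: 9_000,
  prompt_cache_hit_tokens: 8_000,
  prompt_cache_miss_tokens: 1_000,
  completion_tokens: 500,
});
console.log(cost.toFixed(6));
```

Logging this per request gives you the real cost column for your eval instead of a list-price estimate.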
Xiaomi MiMo: the cheapest 1M-context option
Models: MiMo V2.5 Pro ($1.00 input / $3.00 output / $0.20 cache, 1M context), MiMo V2 Flash (~$0.10 / ~$0.40, 256K context).
Xiaomi’s May 27 permanent cut flattened MiMo V2.5 pricing across context windows. The old long-context tiers charged steep multipliers above 256K input tokens. The new pricing applies the same $1/$3 rate whether you send 5K or 950K tokens. The official price-update notice labels the cut “permanent.”
Use MiMo V2.5 Pro when
- You need 300K–1M tokens of context.
- You process large documents, full repositories, or multi-document bundles.
- Predictable long-context billing matters more than minimum per-token price.
- You want to avoid chunking and retrieval complexity for some workloads.
Avoid MiMo V2.5 Pro when
- Your prompts fit under 128K and cost is the main constraint.
- You need very low latency.
- You are building short-prompt chat where DeepSeek is cheaper.
Implementation tip
Use MiMo for long-context branches only:
function shouldUseMiMo(inputTokens) {
  return inputTokens > 300_000;
}
Then keep short requests on cheaper models:
const model = shouldUseMiMo(inputTokens)
  ? "mimo-v2.5-pro"
  : inputTokens > 128_000
    ? "qwen3-max" // DeepSeek V4-Pro tops out at 128K context
    : "deepseek-v4-pro";
The 1M context window plus competitive cache rate gives MiMo a unique place in the market. Until DeepSeek extends context beyond 128K or Alibaba flattens Qwen’s pricing, MiMo owns the cheap-and-long quadrant.
For deeper coverage:
- How Much Does It Cost to Use Xiaomi MiMo V2.5 in 2026
- MiMo V2-Pro & Omni pricing
- Xiaomi MiMo Orbit free 100T token program
Alibaba Qwen: the production workhorse
Models: Qwen3 Max ($0.78 input / $3.90 output / $0.156 cache, 262K context). Newer Qwen 3.7 Max at $2.50/MTok input with 1M context is in early rollout. Rates verified against pricepertoken’s Qwen3 Max sheet.
Qwen3 Max is Alibaba’s flagship and one of the most-deployed Chinese models in international production. It is not the cheapest option: it is about 1.8x DeepSeek V4-Pro on input and 4.5x on output. The tradeoff is broader tooling support, OpenAI-compatible usage, Anthropic-protocol drop-in support, Alibaba Cloud enterprise hosting, and a 262K context window.
Use Qwen3 Max when
- You need strong general-purpose production quality.
- You serve multilingual users, especially Mandarin and Asian-language-heavy traffic.
- You need 200K–262K context.
- You care about enterprise hosting, SLA, or cloud-region options.
Avoid Qwen3 Max when
- Your workload is output-heavy and cost-sensitive.
- Your prompts fit in DeepSeek’s context window and DeepSeek quality is sufficient.
- You do not need the enterprise ecosystem.
Implementation tip
Use Qwen as the default fallback for mixed traffic:
function routeGeneralRequest(request) {
  if (request.outputHeavy && request.inputTokens < 128_000) {
    return "deepseek-v4-pro";
  }
  if (request.inputTokens > 300_000) {
    return "mimo-v2.5-pro";
  }
  return "qwen3-max";
}
Moonshot Kimi: the coding specialist
Models: Kimi K2.6 with context-tiered input pricing ($0.16 to $2.00/MTok across 8K, 32K, 64K, and 128K bands), $0.07/MTok cache-hit floor, output rates around $2.50/MTok in the middle band.
Kimi K2.6 is strongest when your workload reuses a large prefix. The $0.07/MTok cache-hit rate makes repeated system prompts, stable few-shot examples, and long-running agent instructions much cheaper after caching works.
Use Kimi K2.6 when
- You are building coding agents.
- You reuse a large stable system prompt.
- You need strong tool-call format compliance.
- You have long-running chat sessions with repeated instructions.
Avoid Kimi K2.6 when
- Your prompt prefix changes every request.
- You need highly predictable billing.
- Your traffic frequently crosses tier boundaries at 32K, 64K, or 128K input tokens.
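Before committing to tiered pricing, it helps to know how your traffic spreads across the bands. Here is a minimal sketch that buckets prompt sizes into the 8K, 32K, 64K, and 128K bands named above. The exact boundaries (for example 8,000 versus 8,192 tokens) are an assumption, and the sample traffic is made up.

```javascript
// Band upper bounds in tokens, following the 8K/32K/64K/128K bands named above.
// Exact boundaries are an assumption; check the provider's pricing page.
const BANDS = [8_000, 32_000, 64_000, 128_000];

// Smallest band that fits the prompt, or null if it exceeds the largest band.
function bandFor(inputTokens) {
  return BANDS.find((limit) => inputTokens <= limit) ?? null;
}

// Count how many requests fall into each band, plus those that fit none.
function bandHistogram(tokenCounts) {
  const counts = new Map(BANDS.map((limit) => [limit, 0]));
  let overflow = 0;
  for (const tokens of tokenCounts) {
    const band = bandFor(tokens);
    if (band === null) overflow += 1;
    else counts.set(band, counts.get(band) + 1);
  }
  return { counts, overflow };
}

const sample = [3_000, 7_500, 12_000, 40_000, 70_000, 130_000]; // made-up traffic
const { counts, overflow } = bandHistogram(sample);
console.log(Object.fromEntries(counts), overflow);
```

If most requests sit just above a boundary, trimming the prompt slightly may matter more than switching providers.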
Implementation tip
Keep your system prompt stable and put request-specific data later in the prompt. This improves the chance of cache hits.
const messages = [
  {
    role: "system",
    content: STATIC_AGENT_INSTRUCTIONS // keep this byte-stable across calls
  },
  {
    role: "user",
    content: dynamicUserTask
  }
];
Zhipu GLM: the reasoning challenger
Models: GLM-5 ($1.00 input / $3.20 output, 200K context), GLM-5.1 ($0.98 / $3.08, 200K context). Rates verified against Z.AI’s official pricing overview.
Zhipu’s GLM-5 launched with a 30% price increase over GLM-4.7, then GLM-5.1 arrived at a marginal discount. The positioning is clear: GLM is not the cheapest model in this set, but it is designed for structured reasoning and chain-of-thought-heavy tasks.
Use GLM-5 when
- You need math, formal reasoning, or structured analysis.
- Wrong answers are expensive.
- You are building financial analysis, legal summarization, or scientific reasoning flows.
- Your multi-step agent workflows benefit from clean reasoning traces.
Avoid GLM-5 when
- You optimize primarily for cost.
- Your workload is simple summarization or content generation.
- Strong reasoning does not materially improve the output.
Implementation tip
Route only the hard tail to GLM:
function routeByDifficulty(task) {
  if (task.requiresFormalReasoning || task.domainRisk === "high") {
    return "glm-5";
  }
  return "deepseek-v4-pro";
}
For deeper coverage:
- GLM-5 vs DeepSeek V3 vs GPT-5: speed, cost, and practical developer comparison
- GLM-5.1 vs Claude, GPT, Gemini, DeepSeek
Cheapest per workload: buyer’s matrix
| Workload | Winner | Why |
|---|---|---|
| Code generation, output-heavy | DeepSeek V4-Pro | $0.87/MTok output is the lowest |
| Long-document RAG over 300K context | Xiaomi MiMo V2.5 Pro | Only flat-priced 1M-context option |
| Coding agent with stable system prompt | Kimi K2.6 | $0.07/MTok cache-hit floor |
| Multilingual customer support | Alibaba Qwen3 Max | Strongest non-English performance |
| Math, formal reasoning, structured analysis | Zhipu GLM-5 | Best chain-of-thought quality |
Three practical routing patterns:
1. Two-model routing
Send most easy traffic to DeepSeek and reserve another model for the hard tail.
function routeTwoModel(request) { // hypothetical wrapper name
  if (request.isRoutine && request.inputTokens < 128_000) {
    return "deepseek-v4-pro";
  }
  return "qwen3-max";
}
2. Long-context segmentation
Split by context length.
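function routeByContext(inputTokens) { // hypothetical wrapper name
  if (inputTokens > 300_000) {
    return "mimo-v2.5-pro";
  }
  // DeepSeek V4-Pro tops out at 128K context, so mid-size prompts go to Qwen3 Max.
  return inputTokens > 128_000 ? "qwen3-max" : "deepseek-v4-pro";
}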
3. Cache prefix consolidation
Make repeated prompt sections identical across requests:
const CACHEABLE_PREFIX = `
You are an internal code review agent.
Follow the same review rubric for every request.
Return JSON only.
`;
Avoid injecting timestamps, request IDs, or user-specific metadata into the cacheable prefix unless required.
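One way to enforce this is a small lint step that flags values likely to change on every request. The sketch below scans a prefix for ISO timestamps, UUIDs, and Unix epoch seconds. The pattern list and function name are illustrative assumptions, not an exhaustive check.

```javascript
// Patterns that usually change per request and therefore break prefix caching.
// The list is illustrative, not exhaustive.
const VOLATILE_PATTERNS = {
  isoTimestamp: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/,
  uuid: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i,
  unixEpoch: /\b1[0-9]{9}\b/,
};

// Return the names of the volatile patterns found in a prompt prefix.
function findVolatileTokens(prefix) {
  return Object.entries(VOLATILE_PATTERNS)
    .filter(([, pattern]) => pattern.test(prefix))
    .map(([name]) => name);
}

const stablePrefix = "You are an internal code review agent. Return JSON only.";
const unstablePrefix = `${stablePrefix} Current time: 2026-05-27T10:15:00Z`;
console.log(findVolatileTokens(stablePrefix));   // []
console.log(findVolatileTokens(unstablePrefix)); // [ 'isoTimestamp' ]
```

Running this in CI against your prompt templates catches accidental cache-busting before it reaches your bill.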
Quality and benchmark notes
Pricing only matters if the model is good enough for your workload.
Per Artificial Analysis, the five models in this comparison cluster within 5 to 10 percentage points of each other on most public benchmarks. The important differences are in the workload tails:
- DeepSeek V4-Pro: strong on coding (SWE-bench Pro around 55%) and reasoning (GPQA around 90%), with a slight gap to GPT-5.5 on long-horizon agent tasks.
- MiMo V2.5 Pro: strong on long-context retrieval, with over 95% needle accuracy at 800K, and middle-of-pack on coding.
- Qwen3 Max: best non-English performance and strong general production quality.
- Kimi K2.6: strongest tool-call format compliance, especially for parallel tool calls.
- GLM-5: best chain-of-thought reasoning quality in this set.
Run your own 100-sample eval before committing. Public benchmarks are directional. Your production prompts are the real benchmark.
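Once the eval has run, the results need to collapse into a few comparable numbers per model. Here is a minimal sketch of that aggregation. The record shape (`model`, `correct`, `latencyMs`, `costUsd`) is a hypothetical format, and the sample values are made up to show the output shape only.

```javascript
// Summarize eval results per model. Each result is a hypothetical record:
// { model: string, correct: boolean, latencyMs: number, costUsd: number }
function summarize(results) {
  const byModel = new Map();
  for (const r of results) {
    const s = byModel.get(r.model) ?? { n: 0, correct: 0, latencyMs: 0, costUsd: 0 };
    s.n += 1;
    s.correct += r.correct ? 1 : 0;
    s.latencyMs += r.latencyMs;
    s.costUsd += r.costUsd;
    byModel.set(r.model, s);
  }
  return [...byModel].map(([model, s]) => ({
    model,
    accuracy: s.correct / s.n,
    meanLatencyMs: s.latencyMs / s.n,
    // Cost per correct answer penalizes cheap models that are often wrong.
    costPerCorrectUsd: s.correct > 0 ? s.costUsd / s.correct : Infinity,
  }));
}

// Made-up numbers, for shape only.
const summary = summarize([
  { model: "deepseek-v4-pro", correct: true, latencyMs: 900, costUsd: 0.002 },
  { model: "deepseek-v4-pro", correct: false, latencyMs: 1100, costUsd: 0.002 },
  { model: "glm-5", correct: true, latencyMs: 1500, costUsd: 0.006 },
]);
console.log(summary);
```

Cost per correct answer is often a better routing signal than raw token price, because retries and wrong answers are not free.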
Testing all five with Apidog
A multi-model production deploy needs a multi-model test harness. Apidog can test all five APIs from one workspace because all five accept OpenAI Chat Completions-style request bodies, with minor provider-specific quirks.
Use this workflow:
- Create one environment per provider:
  - api.deepseek.com
  - platform.xiaomimimo.com
  - Alibaba Cloud Model Studio
  - api.moonshot.cn
  - open.bigmodel.cn
- Import the OpenAI Chat Completion schema once. Use the same request body shape, then switch the base URL per environment.
- Run the same scenario across all five models. Track:
  - response correctness
  - latency
  - output token count
  - tool-call validity
  - total cost
- Validate tool calls with JSON Schema. This catches provider-specific streaming and tool_calls formatting quirks.
Example validation target:
{
  "type": "object",
  "required": ["tool_calls"],
  "properties": {
    "tool_calls": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id", "type", "function"],
        "properties": {
          "id": { "type": "string" },
          "type": { "const": "function" },
          "function": {
            "type": "object",
            "required": ["name", "arguments"],
            "properties": {
              "name": { "type": "string" },
              "arguments": { "type": "string" }
            }
          }
        }
      }
    }
  }
}
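The schema checks shape, but it cannot tell whether the `arguments` string is itself valid JSON. As a sketch, the hand-rolled check below mirrors the schema and adds that one step. The function name is hypothetical, and a JSON Schema validator would normally do the structural part for you.

```javascript
// Hand-rolled check mirroring the JSON Schema above, plus one extra step:
// `arguments` must itself parse as JSON. Returns a list of problems (empty = valid).
function validateToolCalls(message) {
  if (message === null || typeof message !== "object" || !Array.isArray(message.tool_calls)) {
    return ["tool_calls must be an array"];
  }
  const problems = [];
  message.tool_calls.forEach((call, i) => {
    const at = `tool_calls[${i}]`;
    if (call === null || typeof call !== "object") {
      problems.push(`${at} must be an object`);
      return;
    }
    if (typeof call.id !== "string") problems.push(`${at}.id must be a string`);
    if (call.type !== "function") problems.push(`${at}.type must be "function"`);
    const fn = call.function;
    if (fn === null || typeof fn !== "object") {
      problems.push(`${at}.function must be an object`);
      return;
    }
    if (typeof fn.name !== "string") problems.push(`${at}.function.name must be a string`);
    if (typeof fn.arguments !== "string") {
      problems.push(`${at}.function.arguments must be a string`);
      return;
    }
    try {
      JSON.parse(fn.arguments);
    } catch {
      problems.push(`${at}.function.arguments is not valid JSON`);
    }
  });
  return problems;
}

const good = {
  tool_calls: [{ id: "call_1", type: "function", function: { name: "lookup", arguments: "{\"q\":\"x\"}" } }],
};
const bad = {
  tool_calls: [{ id: "call_2", type: "function", function: { name: "lookup", arguments: "{q:" } }],
};
console.log(validateToolCalls(good)); // []
console.log(validateToolCalls(bad));
```

Truncated or malformed argument strings are a common failure mode in streamed tool calls, so this check earns its place in a regression suite.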
Download Apidog, import your test cases, and you can build a five-way comparison quickly.
Where the price war goes next
The pricing floor moved twice in May. Further moves are likely before Q3 closes:
- Qwen response: Alibaba has rarely been first to cut, but consistently follows within weeks. Expect a Qwen3 Max revision or Qwen 3.8 announcement by July.
- GLM response: Zhipu’s 30% increase on GLM-5 looks increasingly contrarian. A GLM-5.2 with a structural cut is plausible.
- Kimi structural simplification: Tiered context pricing is going out of fashion. Moonshot may flatten K2.6 to match MiMo’s structure.
Next steps
- Pick your top three production workloads.
- Map each workload to the buyer’s matrix.
- Run a 100-sample eval across the likely models.
- Normalize your system prompts so cache prefixes are stable.
- Wire an Apidog regression suite across all five providers.
The price floor is still moving. Build your LLM stack so model swaps and routing changes take hours, not weeks.
