ai, deepseek, machinelearning

桑凯 — Sat, 30 May 2026 14:15:57 +0000

title: The Rise of China's LLMs: A Complete History from 2017 to 2026
published: true
description: From Wu Dao 2.0 (1.75T params) to DeepSeek V3 ($5.6M training cost): the full story of how Chinese AI labs went from "cheap copy" to genuine competitors.
tags: ai, deepseek, machinelearning, llm, china
cover_image:

I've been following the AI space since the GPT-2 days, and one thing that consistently surprises people is how far Chinese AI labs have come in just a few years. Most Western developers still think "Chinese AI" = "cheap copy." The reality is far more interesting.

Let me walk through the full timeline.

2017–2020: The Foundation Years
The story starts, as most LLM stories do, with the Transformer paper ("Attention Is All You Need") from Google in 2017. None of its eight authors was working in China, but Chinese labs picked up the architecture quickly, as the next two years show.

2018: BERT-era awakening

When Google released BERT in late 2018, Chinese tech giants jumped in immediately:

Baidu released ERNIE 1.0 in March 2019, beating Google's own BERT on several Chinese NLP benchmarks. ERNIE masked whole entities and phrases during pretraining instead of individual tokens, a form of knowledge integration that BERT didn't have.
Alibaba released its own pretrained models for e-commerce NLP.
Tencent followed with its own family of pretrained models.
But none of these were "large" by today's standards. The parameter counts were in the hundreds of millions, not billions.

2019: GPT-2 triggers the race

OpenAI's GPT-2 (1.5B parameters) made it clear that scaling worked. Chinese labs realized they needed to think bigger. But there was a problem on the horizon: US export controls on Chinese tech companies were starting to tighten, and the supply of high-end NVIDIA GPUs would eventually be caught up in them.

This constraint would later become a feature, not a bug — but we'll get to that.

2021: The Year Everything Changed
June 2021: Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 2.0

This was the moment that shocked the global AI community. Wu Dao 2.0 had 1.75 trillion parameters, ten times the size of GPT-3 (175B). It was trained on a Chinese-made supercomputer and could generate text and images, and even write poetry.

The Western press mostly ignored it. Those who paid attention dismissed it as "impressive but not practical." In hindsight, this was the first major signal that China was serious about foundation models.

Key Wu Dao 2.0 stats:

1.75T parameters (sparse MoE architecture)
Trained on 4.9 TB of text data
1,000+ GPUs (NVIDIA A100, obtained before export restrictions tightened)
Could generate text, images, and video
2021 to 2022: Zhipu AI and GLM-130B

Tsinghua University spin-off Zhipu AI built on the General Language Model (GLM) architecture that Tsinghua researchers published in 2021, and released the 130B-parameter GLM-130B in August 2022. This was significant because it was the first Chinese LLM to explicitly target bilingual (English and Chinese) performance.

2022: The Calm Before the Storm
While OpenAI was quietly training GPT-4 and Anthropic was working on Claude, Chinese labs were making incremental progress:

Alibaba released Tongyi Qianwen (Qwen) 7B and 14B
Baidu launched ERNIE 3.0 Titan (260B parameters)
Huawei released PanGu-Σ (1.085T parameters, MoE)
Tencent open-sourced its Hunyuan model family
None of these made global headlines. The performance gap with GPT-3.5 was real — Chinese models were roughly 6-12 months behind in benchmark scores.

Then ChatGPT launched in November 2022.

2023: China's "200 Models" Era
ChatGPT's launch sent shockwaves through China. Within weeks, over 200 Chinese companies announced LLM projects. The government fast-tracked approval for commercial LLM deployment.

Key events in 2023:

March: Baidu ERNIE Bot

Baidu launched ERNIE Bot, China's first public-facing ChatGPT competitor. The launch was rough: the demo was pre-recorded and the actual product had obvious quality issues. Critics called it "embarrassing." But Baidu iterated fast.

August: Alibaba open-sources Qwen

Alibaba surprised everyone by open-sourcing Qwen-7B in August, with Qwen-14B following in September, under a permissive license. The global open-source community took notice.

August: China approves commercial LLMs

The Chinese government approved 8 LLMs for public commercial use, including Baidu ERNIE, Alibaba Qwen, and Zhipu GLM. This was the starting gun for the AI application boom.

November: DeepSeek enters the chat

DeepSeek, a hedge-fund-backed AI lab, released its first general-purpose model, DeepSeek LLM 67B. It was trained on a relatively modest budget ($12M estimated) and achieved performance comparable to LLaMA 2 70B.

2024: The Open-Source Revolution
This was the year Chinese models stopped being "behind."

May: DeepSeek V2

DeepSeek V2 paired a Mixture-of-Experts (MoE) design with a game-changing innovation: Multi-head Latent Attention (MLA). MLA compresses keys and values into a small latent vector, which cut KV cache usage by about 90% and made inference dramatically cheaper.

236B total parameters, 21B active per token
Training cost: ~$10M (vs GPT-4's estimated $100M+)
API pricing: $0.14/M input tokens
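
To see why a smaller KV cache matters, here is a rough back-of-the-envelope sketch in Python. It compares the per-token cache of standard multi-head attention with a compressed latent cache of the kind MLA uses. The layer count, head count, and dimensions are illustrative assumptions, not DeepSeek's published configuration, and the function names are made up for this example.

```python
def kv_cache_bytes_per_token(n_layers, n_heads, head_dim, bytes_per_value=2):
    """Standard multi-head attention: one key and one value vector per head, per layer."""
    return n_layers * 2 * n_heads * head_dim * bytes_per_value


def latent_cache_bytes_per_token(n_layers, latent_dim, rope_dim, bytes_per_value=2):
    """Latent-style cache: one compressed vector plus a small positional key per layer."""
    return n_layers * (latent_dim + rope_dim) * bytes_per_value


# Illustrative dimensions only (assumptions, not a real model's config).
N_LAYERS, N_HEADS, HEAD_DIM = 60, 32, 128
LATENT_DIM, ROPE_DIM = 512, 64

full = kv_cache_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM)
latent = latent_cache_bytes_per_token(N_LAYERS, LATENT_DIM, ROPE_DIM)
reduction = 1 - latent / full

print(f"full KV cache: {full / 1024:.1f} KiB per token")
print(f"latent cache:  {latent / 1024:.1f} KiB per token")
print(f"reduction:     {reduction:.1%}")
```

The exact saving depends on the baseline you compare against (plain multi-head attention, grouped-query attention, and so on), so treat the percentage as a ballpark. The practical effect is that the same GPU memory can hold more concurrent requests or longer contexts.
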
June: Qwen2 series

Alibaba released Qwen2 in sizes from 0.5B to 72B. The 72B model was competitive with LLaMA 3 70B. All of them were released as open source.

December: DeepSeek V3

This was the bombshell. DeepSeek V3:

671B total parameters (37B active)
Trained on 2,048 NVIDIA H800 GPUs for 2.788M GPU hours
Total training cost: $5.576M
Performance: Comparable to GPT-4o and Claude 3.5 Sonnet
API pricing: $0.40/M input, $1.60/M output
To put that training cost in perspective:

| Model | Estimated Training Cost |
| --- | --- |
| GPT-4 | $100M+ |
| Gemini Ultra | $200M+ |
| Llama 3 405B | ~$30M |
| DeepSeek V3 | $5.6M |
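
The DeepSeek V3 headline cost is simple arithmetic on the GPU-hour figure quoted earlier. The sketch below assumes a rental rate of $2 per GPU-hour, which is the rate implied by dividing $5.576M by 2.788M GPU hours. That is a rental-price assumption and not a record of what was actually spent, and as far as I can tell it covers only the final training run, not research, ablations, or salaries. The function names are hypothetical.

```python
GPU_HOURS = 2.788e6              # from the article
N_GPUS = 2048                    # from the article
USD_PER_GPU_HOUR = 2.00          # assumption: implied by $5.576M / 2.788M GPU hours
TOTAL_PARAMS_B = 671             # from the article
ACTIVE_PARAMS_B = 37             # from the article


def training_cost_usd(gpu_hours, usd_per_gpu_hour):
    """Rental-style cost estimate: GPU hours times an hourly rate."""
    return gpu_hours * usd_per_gpu_hour


def wall_clock_days(gpu_hours, n_gpus):
    """Days of training if every GPU runs in parallel for the whole job."""
    return gpu_hours / n_gpus / 24


cost = training_cost_usd(GPU_HOURS, USD_PER_GPU_HOUR)
days = wall_clock_days(GPU_HOURS, N_GPUS)
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"estimated cost:   ${cost / 1e6:.3f}M")
print(f"wall-clock time:  {days:.1f} days on {N_GPUS} GPUs")
print(f"active per token: {active_fraction:.1%} of parameters")
```

The last line is the MoE effect in one number: only a small slice of the 671B parameters does work for any given token, which is a large part of why both training and inference come out cheaper than a dense model of the same size.
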
2025: The Chinese Model Explosion
January: DeepSeek R1

DeepSeek released R1, an open reasoning model rivaling OpenAI o1. Cost: $1.10/M input vs o1's $15/M. That's 93% cheaper.

April: Qwen3 (235B)

Alibaba released Qwen3, a 235B MoE model with 128K context. It matched GPT-4o on MMLU, HumanEval, and multilingual benchmarks.

July: Kimi K2

Moonshot AI released K2, a 1T-parameter MoE model. It led the Chatbot Arena leaderboard for several weeks and was particularly strong at long-context tasks (up to 1M tokens).

Where We Are Today (May 2026)
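| Model | Params (Active) | Input ($/M tokens) | Output ($/M tokens) | MMLU | HumanEval |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | ~1.7T (?) | $10.00 | $30.00 | 88.7 | 90.2 |
| Claude 3.5 Sonnet | n/a | $3.00 | $15.00 | 88.3 | 92.0 |
| DeepSeek V3 | 671B (37B) | $0.40 | $1.60 | 88.5 | 90.5 |
| Qwen3-235B | 235B (22B) | $0.50 | $2.00 | 88.0 | 89.8 |
| Kimi K2 | 1T (32B) | $0.50 | $2.00 | 89.1 | 91.2 |
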
The benchmark gap has essentially closed. On some tasks (math, coding, long context), Chinese models actually lead.
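
To turn the per-token prices above into something you can compare directly, here is a small Python sketch that costs out one workload on each model. The prices are copied from the table; the workload (3M input tokens, 1M output tokens) and the function name are assumptions made up for illustration, so swap in your own traffic mix.

```python
# (input $/M tokens, output $/M tokens), copied from the article's table
PRICES = {
    "GPT-4o": (10.00, 30.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "DeepSeek V3": (0.40, 1.60),
    "Qwen3-235B": (0.50, 2.00),
    "Kimi K2": (0.50, 2.00),
}


def request_cost_usd(model, input_tokens, output_tokens):
    """Cost of a workload given per-million-token input and output prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: three input tokens for every output token.
IN_TOK, OUT_TOK = 3_000_000, 1_000_000

baseline = request_cost_usd("GPT-4o", IN_TOK, OUT_TOK)
for model in PRICES:
    cost = request_cost_usd(model, IN_TOK, OUT_TOK)
    print(f"{model:<18} ${cost:>6.2f}  {cost / baseline:6.1%} of GPT-4o")
```

The ratio shifts with the input/output mix because output tokens are priced higher across the board, so a chat-heavy workload and a summarization-heavy one will not see the same savings.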

What Drove This?
Three factors Western developers should understand:

1. The compute constraint became an innovation driver. When US chip restrictions limited access to NVIDIA H100/B200, Chinese labs had to optimize every last flop. They developed more efficient architectures (MoE, MLA), better training algorithms (FP8 mixed precision), and clever infrastructure hacks (DeepSeek's "DualPipe" algorithm).
2. Massive domestic talent pool. China produces roughly 500,000 engineering graduates per year. Top labs (DeepSeek, Zhipu, Moonshot) recruit from Tsinghua, PKU, and Zhejiang University, all of which have world-class CS programs.
3. Government + VC funding. Chinese AI labs received over $50B in total funding between 2021 and 2025. The government designated AI as a strategic priority and provided subsidies, data center access, and fast-track regulatory approval.

The Takeaway
Chinese LLMs are no longer "just catching up." They've become the cost-effective option in a market where Western models keep getting more expensive. On the table's numbers, DeepSeek V3 lands within a fraction of a point of GPT-4o on MMLU (88.5 vs 88.7) at roughly 4 to 5% of the price.

The Chinese AI story isn't about geopolitical competition. It's about what happens when brilliant engineers face resource constraints and decide to innovate their way out instead of throwing money at the problem.

Sources: DeepSeek technical reports (arXiv), BAAI publications, Alibaba Qwen papers, Moonshot AI blog, Artificial Analysis, llm-stats, public API pricing pages. All data current as of May 2026.

Next up in this series: Why Western AI models are so expensive — and whether the pricing is justified.

DEV Community: 桑凯
