keeper

Posted on May 20

DeepSeek V4 on Huawei's Ascend 950: A Real Stress Test for China's AI Chip Ecosystem

#ai #hardware #machinelearning #llm

In April 2026, DeepSeek released V4 — a 1.6 trillion parameter MoE model — and for the first time, the technical report listed Huawei's Ascend NPU alongside NVIDIA in its validated hardware list. This is the story of what that means for the actual supply chain, the bottlenecks that remain, and where this is heading.

The Validation That Changed Everything

When DeepSeek released V4 on April 24, 2026, most of the attention went to the model's benchmark scores matching GPT-5 and Claude Opus. But a quieter — and arguably more consequential — event was buried in the fine print:

DeepSeek V4 is the first top-tier model to fully validate inference on Huawei's Ascend 950PR chip.

This wasn't an "it compiles" checkbox validation. The DeepSeek team:

Rewrote 200+ core CUDA operators for Huawei's CANN Next framework
Ran 100,000+ test cases for precision alignment
Invested approximately 30 person-years of engineering effort
Delayed the product launch by over 2 months specifically to complete the port
Started from an initial port that ran at 1/35th of target performance, then optimized it back to parity
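
Precision alignment of this kind usually means feeding identical inputs to a reference operator and to the ported one, then checking that the outputs agree within a tolerance. The following is a minimal sketch in plain Python, not DeepSeek's actual test harness: the choice of softmax as the operator, the float32 rounding that stands in for the ported kernel, the function names, and the tolerance are all illustrative assumptions.

```python
import math
import random
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def softmax_reference(xs):
    """Reference operator: numerically stable softmax in float64."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_ported(xs):
    """Hypothetical stand-in for a ported kernel: same math, but inputs and
    intermediate results are rounded to float32 at each step."""
    xs32 = [to_float32(x) for x in xs]
    m = max(xs32)
    exps = [to_float32(math.exp(to_float32(x - m))) for x in xs32]
    total = 0.0
    for e in exps:
        total = to_float32(total + e)
    return [to_float32(e / total) for e in exps]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


def run_alignment_cases(num_cases=1000, width=64, atol=1e-5, seed=0):
    """Count random test cases where the ported output differs from the
    reference by more than `atol` (an assumed tolerance)."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(num_cases):
        xs = [rng.uniform(-10.0, 10.0) for _ in range(width)]
        if max_abs_error(softmax_reference(xs), softmax_ported(xs)) > atol:
            failures += 1
    return failures


if __name__ == "__main__":
    print("failing cases:", run_alignment_cases())
```

A real suite would repeat this per operator, per data type, and per input shape, which is how the case count climbs into the hundreds of thousands.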

The results on the 950PR are genuine:

Metric	Improvement vs. NVIDIA H20
FP4 compute	2.87x faster (1.56 PFLOPS vs ~0.54 PFLOPS)
MoE inference speed	1.5-1.73x (general), up to 1.96x (RL rollout)
Multi-modal generation	+60% faster
HBM capacity	128GB vs 96GB

A caveat: the H20 is NVIDIA's "China special," deliberately cut down to comply with U.S. export controls. This doesn't mean the 950PR beats the H100 or B200. But it does mean that for inference workloads on Chinese soil, domestic hardware is now a credible alternative, not a consolation prize.

The Chip That Does It: Ascend 950's Dual-Architecture Strategy

What makes the 950 interesting isn't just the raw spec sheet; it's the architectural cleverness. Huawei recognized that LLM inference has two fundamentally different phases, and designed two chip variants that share the same compute die but pair it with different memory:

Variant	Phase	Memory	Bandwidth	Shipping
950PR	Prefill + Recommendation	HiBL 1.0 (128GB)	1.6 TB/s	✅ Now (mass production since March 2026)
950DT	Decode + Training	HiZQ 2.0 (144GB)	4 TB/s	Q4 2026

Prefill (reading the entire input and computing the KV cache) is compute-bound: it needs raw FLOPs more than memory bandwidth, so cheaper, lower-bandwidth HBM works fine. Decode (generating one token at a time) is memory-bandwidth-bound: the bottleneck is how fast weights can be fed to the compute units. Here, 4 TB/s of bandwidth makes a real difference.
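
The compute-bound versus bandwidth-bound split can be made concrete with back-of-envelope roofline arithmetic. The sketch below uses the bandwidth and FP4 figures quoted in this article. The amount of weight data streamed per token (`ACTIVE_WEIGHT_GB`) is a hypothetical placeholder, since the article does not give V4's active parameter count, and the model ignores KV-cache traffic, batching, and multi-chip sharding.

```python
# Hypothetical: gigabytes of weights read from memory per generated token.
ACTIVE_WEIGHT_GB = 40.0


def decode_tokens_per_second(bandwidth_tb_s: float, active_weight_gb: float) -> float:
    """Upper bound on single-stream decode speed when every generated token
    has to stream the active weights from memory once."""
    bytes_per_second = bandwidth_tb_s * 1e12
    bytes_per_token = active_weight_gb * 1e9
    return bytes_per_second / bytes_per_token


def ridge_point_flops_per_byte(peak_pflops: float, bandwidth_tb_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte) above which a kernel is
    compute-bound instead of bandwidth-bound in a simple roofline model."""
    return (peak_pflops * 1e15) / (bandwidth_tb_s * 1e12)


if __name__ == "__main__":
    for name, bw in [("950PR", 1.6), ("950DT", 4.0)]:
        ceiling = decode_tokens_per_second(bw, ACTIVE_WEIGHT_GB)
        print(f"{name}: decode ceiling ~{ceiling:.0f} tokens/s per stream")
    print(f"950PR ridge point: ~{ridge_point_flops_per_byte(1.56, 1.6):.0f} FLOPs/byte")
```

Under these assumptions the decode ceiling scales linearly with bandwidth (2.5x from 1.6 to 4 TB/s), while the 950PR's high ridge point shows why it suits prefill, where each byte of weights fetched is reused across many input tokens.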

The 950DT's 4 TB/s HiZQ 2.0 memory puts it in the same league as NVIDIA's H200 (141GB / 4.8 TB/s). It won't be available until Q4 2026, but that's when the training-side gap starts to close.

The Self-Developed HBM Bet

HBM accounts for roughly 50% of an AI chip's cost. Huawei's decision to develop its own HBM — HiBL (low-cost/Budget Line) and HiZQ (high-performance) — isn't just about supply chain security. It enables customization that off-the-shelf HBM can't provide.

The local HBM supply chain is making real progress:

Milestone	Status	Timeline
CXMT (长鑫存储) HBM3 samples delivered to Huawei	✅	Done
CXMT Shanghai packaging fab	🟡 Construction	End of 2026
CXMT HBM3E development	🟡 In progress	Target 2027
CXMT HBM3 mass production	❌ Not started — no volume orders yet	Delayed

The bottleneck: CXMT's HBM3 is still in testing. Raw materials only support sample runs, not mass production. The Huawei alliance is also working with Fujian Jinhua (福建晋华) and Wuhan Xinxin (武汉新芯) as secondary foundries, but these are supplementary capacity, not primary sources.

The pragmatic reality: HiBL 1.0 and HiZQ 2.0 are likely "self-developed" at the packaging and controller level, not at the DRAM die level. Huawei takes available DRAM dies, packages them with proprietary 2.5D stacking, and adds custom controllers. This is why HiBL 1.0 tops out at 1.6 TB/s: its bandwidth is bounded by the dies Huawei can source, not by its design ambition.

The Five Bottlenecks That Limit Delivery

HBM gets the headlines, but it's not the only constraint. Here are all five, ranked by severity:

🔴 1. Advanced Manufacturing (SMIC)

The hardest bottleneck. SMIC's N+2 (equivalent to 7nm, using DUV multipatterning since EUV is unavailable) has a monthly capacity of approximately 35,000-38,000 12-inch wafers. At ~92% yield, that translates to roughly 750,000 Ascend 950 chips per year.
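
The relationship between wafer starts and chip output is simple multiplication, and it helps to see which inputs the article supplies and which it leaves open. In this rough sketch the wafer capacity, yield, and 750K target come from the text; the share of N+2 capacity allocated to Ascend and the gross chips per wafer are hypothetical parameters that the article does not state.

```python
def chips_per_year(wafers_per_month: float,
                   ascend_share: float,
                   gross_chips_per_wafer: float,
                   yield_rate: float) -> float:
    """Good chips per year from a fab line.

    ascend_share and gross_chips_per_wafer are assumptions to be filled in;
    the article gives only capacity, yield, and the final chip count.
    """
    return wafers_per_month * 12 * ascend_share * gross_chips_per_wafer * yield_rate


def implied_good_chips_per_wafer(chips_per_year_target: float,
                                 wafers_per_month: float) -> float:
    """Good chips per wafer start implied by a yearly output figure, if the
    entire monthly capacity were counted."""
    return chips_per_year_target / (wafers_per_month * 12)


if __name__ == "__main__":
    for wpm in (35_000, 38_000):
        implied = implied_good_chips_per_wafer(750_000, wpm)
        print(f"{wpm} wafers/month -> {implied:.2f} good chips per wafer start")
```

Counted against the whole line, 750K chips works out to fewer than two good chips per wafer start, which suggests the figure assumes only part of N+2 capacity goes to Ascend, that each packaged chip consumes several dies, or both.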

750K sounds like a lot, but it has to serve the entire Chinese AI market. NVIDIA ships millions of H100/B200 units annually, so the capacity gap is several-fold.

SMIC plans to double capacity to 70,000 wafers/month during 2026, but without EUV, each generation becomes exponentially harder. The 950DT uses the same N+2 process. The absolute ceiling of domestic advanced manufacturing will remain the binding constraint through at least 2028.

🟡 2. Advanced Packaging

Ascend 950 requires 2.5D Chiplet packaging (2 compute dies + 2 I/O dies + HBM). This isn't a "nice to have" — without it, you can't assemble the chip.

Supplier	Status
JCET (长电科技) — Dongguan HBM base	Running at full capacity
Tongfu Micro (通富微电) — SJ1/SJ lines	Fully loaded, emergency expansion
QuLiang Electronics (渠梁电子)	Accelerating expansion

Packaging capacity is the tightest short-term bottleneck. New capacity from JCET and Tongfu's expansion won't meaningfully add supply until 2027. This is why "advanced packaging stocks" are the hottest semiconductor theme on China's A-share market in 2026.

🟡 3. Interconnect: Making 8,192 Cards Work as One Computer

The Atlas 950 SuperNode (8,192 cards, 160 cabinets, 1,000 square meters) requires a new interconnect protocol — Lingqu 2.0 / UnifiedBus. The predecessor (Lingqu 1.0) was validated on 384-card Atlas 900 systems (300+ deployed). Scaling from 384 to 8,192 is a leap in complexity:

Full optical interconnect between cabinets
16 PB/s total bandwidth (10x global internet peak traffic)
All-liquid cooling at MW-scale per cluster
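
The scale of that jump is easier to judge per card. A quick sketch using only the figures quoted above (8,192 cards, 16 PB/s aggregate, 384 cards in the previous generation); it assumes the aggregate bandwidth is spread evenly across cards, which the article does not confirm.

```python
TOTAL_BANDWIDTH_PB_S = 16   # aggregate interconnect bandwidth quoted above
CARDS = 8192                # Atlas 950 SuperNode
PREVIOUS_CARDS = 384        # Atlas 900 systems validated with Lingqu 1.0


def per_card_bandwidth_tb_s(total_pb_s: float, cards: int) -> float:
    """Average interconnect bandwidth per card, assuming an even split."""
    return total_pb_s * 1000 / cards


def scale_factor(new_cards: int, old_cards: int) -> float:
    """How many times larger the new cluster is than the old one."""
    return new_cards / old_cards


if __name__ == "__main__":
    per_card = per_card_bandwidth_tb_s(TOTAL_BANDWIDTH_PB_S, CARDS)
    print(f"~{per_card:.2f} TB/s per card")
    print(f"~{scale_factor(CARDS, PREVIOUS_CARDS):.1f}x more cards than the 384-card system")
```

That is just under 2 TB/s of interconnect per card on average, across a cluster roughly 21 times larger than the configuration Lingqu 1.0 was validated on.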

Delivery is scheduled for Q4 2026. The engineering risk is real, but Huawei's track record with Lingqu 1.0 (proven at scale) suggests this is a schedule risk, not a technology risk.

🟢 4. Software Ecosystem (CANN)

CANN was fully open-sourced in December 2025. DeepSeek V4's successful port is the single biggest validation event to date. But the developer count gap is stark: ~87,000 CANN developers vs. ~3 million CUDA developers.

Huawei's strategy is "CUDA-to-CANN automated conversion tools" combined with PyTorch compatibility layers. This works for standard model architectures. Edge cases still require manual operator rewriting: the same kind of work that cost DeepSeek roughly 30 person-years.

For large enterprises with dedicated ML teams, this is doable. For smaller teams, it's a barrier.

🟢 5. Cooling and Power

Per-chip TDP is ~310W. At supernode scale, total power draw is in megawatts. Full liquid cooling is mandatory, and green power alignment adds infrastructure complexity. This is solvable — the technology exists — but deployment speed varies across data center operators.
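
The megawatt figure follows directly from the per-chip TDP. A small sketch: the 310W TDP and 8,192-card count are from this article, while the overhead factor covering cooling, networking, and host systems is a hypothetical placeholder, not a measured value.

```python
CARDS = 8192
TDP_WATTS = 310
OVERHEAD_FACTOR = 1.5  # hypothetical multiplier for cooling, networking, hosts


def chip_power_mw(cards: int, tdp_watts: float) -> float:
    """Total accelerator power in megawatts at full TDP."""
    return cards * tdp_watts / 1e6


def facility_power_mw(chip_mw: float, overhead_factor: float) -> float:
    """Chip power scaled by an assumed facility overhead factor."""
    return chip_mw * overhead_factor


if __name__ == "__main__":
    chips = chip_power_mw(CARDS, TDP_WATTS)
    print(f"accelerators alone: ~{chips:.2f} MW")
    print(f"with assumed overhead: ~{facility_power_mw(chips, OVERHEAD_FACTOR):.2f} MW")
```

The accelerators alone come to about 2.5 MW for one supernode, before any overhead is counted.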

The Long-Term Outlook: An Honest Assessment

The trajectory is real

Huawei has a clear 3-generation roadmap:

Generation	FP8	FP4	Memory BW	Expected
950 (PR + DT)	1 PFLOPS	2 PFLOPS	4 TB/s	2026
960	2 PFLOPS	4 PFLOPS	~8 TB/s	Q4 2027
970	4 PFLOPS	8 PFLOPS	~12-16 TB/s	Q4 2028

Each generation roughly doubles specs. Revenue is projected to reach $12 billion in 2026 (up 60% from $7.5B in 2025). The business is scaling.
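
The doubling claim and the revenue growth figure can both be checked from the numbers quoted above. A short sketch using only the roadmap table's FP8 and FP4 columns and the two revenue figures; memory bandwidth is left out because the table gives it as approximate ranges.

```python
# FP8 / FP4 peak compute per generation, as listed in the roadmap table.
ROADMAP = {
    "950": {"fp8_pflops": 1, "fp4_pflops": 2},
    "960": {"fp8_pflops": 2, "fp4_pflops": 4},
    "970": {"fp8_pflops": 4, "fp4_pflops": 8},
}


def generation_ratios(roadmap: dict, key: str) -> list:
    """Ratio of each generation's spec to the previous generation's."""
    values = [spec[key] for spec in roadmap.values()]
    return [later / earlier for earlier, later in zip(values, values[1:])]


def growth_percent(previous: float, current: float) -> float:
    """Year-over-year growth as a percentage."""
    return (current / previous - 1) * 100


if __name__ == "__main__":
    print("FP8 ratios:", generation_ratios(ROADMAP, "fp8_pflops"))
    print("FP4 ratios:", generation_ratios(ROADMAP, "fp4_pflops"))
    print(f"revenue growth: {growth_percent(7.5, 12):.0f}%")
```

Both compute columns double at each step, and $7.5B to $12B is indeed a 60% increase.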

But it needs context

Dimension	2026 Reality	2028 Target
Supply-demand gap	🔴 Severe (750K chips vs. demand 2-3x that)	🟡 Improving but not balanced
Performance vs. NVIDIA	🟡 950DT ≈ H200 on memory (which is last gen for NVIDIA)	🟡 960 ≈ 70-80% of 2027 NVIDIA
Process node	🔴 7nm (no EUV)	🔴 Still 7nm — Chiplet mitigates but can't eliminate
Market share (China inference)	~20%	40-50% (projected)

The honest assessment: Ascend will not "catch up" to NVIDIA in absolute terms. The process gap (7nm DUV vs 3nm EUV+) is physical and cannot be willed away. But it doesn't need to catch up. The Chinese AI chip market is structurally bifurcating:

Ascend takes ~50% of domestic demand
NVIDIA holds the high end through H20 and smuggled/cloud-accessible H100
Other domestic players (Cambricon, Moore Threads, Biren) split the remainder

For anyone building AI products for the Chinese market: this is not a question of "whether to switch." It's "when to switch." For anyone building for global markets: unaffected — continue with CUDA.

Two technology worlds are solidifying: CUDA World and CANN World.

What DeepSeek V4 on Ascend 950 really proved

Before April 2026, Huawei could say "our chips work." After April 2026, DeepSeek proved it with a 1.6T-parameter model, real production traffic, and actual cost numbers. The credibility gap is closed.

The remaining bottlenecks are all physical or temporal: more chips, more packaging lines, more fab capacity, more time for the ecosystem to mature. None of these have a quick fix. But they also don't depend on any single breakthrough — they're a production scaling problem, and production scaling responds to money and time.

China's AI chip ecosystem just passed its most important stress test. The bottlenecks that remain are hard, but they're the kind of hard that follows linear progress curves — not the binary win/lose of "can this even work."

DEV Community