In April 2026, DeepSeek released V4 — a 1.6 trillion parameter MoE model — and for the first time, the technical report listed Huawei's Ascend NPU alongside NVIDIA in its validated hardware list. This is the story of what that means for the actual supply chain, the bottlenecks that remain, and where this is heading.
The Validation That Changed Everything
When DeepSeek released V4 on April 24, 2026, most of the attention went to the model's benchmark scores matching GPT-5 and Claude Opus. But a quieter — and arguably more consequential — event was buried in the fine print:
DeepSeek V4 is the first top-tier model to fully validate inference on Huawei's Ascend 950PR chip.
This wasn't an "it compiles" checkbox validation. The DeepSeek team:
- Rewrote 200+ core CUDA operators for Huawei's CANN Next framework
- Ran 100,000+ test cases for precision alignment
- Invested approximately 30 person-years of engineering effort
- Delayed the product launch by over 2 months specifically to complete the port
- Optimized the initial port, which ran at 1/35th of target performance, back to parity with that target
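To make "precision alignment" concrete, here is a minimal sketch of what a single test case could look like. It is not DeepSeek's harness: `softmax_ported` is a hypothetical stand-in that simulates a ported operator by rounding every step to float32, and the tolerances are illustrative assumptions.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def softmax_reference(xs):
    """Reference softmax in float64, stabilised by subtracting the max."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_ported(xs):
    """Hypothetical ported operator: same math, float32 rounding at each step."""
    xs32 = [to_float32(x) for x in xs]
    m = max(xs32)
    exps = [to_float32(math.exp(to_float32(x - m))) for x in xs32]
    total = 0.0
    for e in exps:
        total = to_float32(total + e)
    return [to_float32(e / total) for e in exps]


def is_aligned(reference, candidate, rtol=1e-5, atol=1e-7):
    """Element-wise check: |candidate - reference| <= atol + rtol * |reference|."""
    return all(abs(c - r) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))


# One illustrative test case; the tolerances above are assumptions.
inputs = [0.1, 2.5, -3.0, 7.25]
print(is_aligned(softmax_reference(inputs), softmax_ported(inputs)))
```

A real suite would run checks like this per operator, per dtype, and per input shape, which is how the case count climbs into the hundreds of thousands.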
The results on the 950PR are genuine:
| Metric | Ascend 950PR vs. NVIDIA H20 |
|---|---|
| FP4 compute | 2.87x faster (1.56 PFLOPS vs ~0.5) |
| MoE inference speed | 1.5-1.73x (general), up to 1.96x (RL rollout) |
| Multi-modal generation | 60% faster |
| HBM capacity | 112GB vs 96GB |
A caveat: the H20 is NVIDIA's "China special", deliberately cut down to comply with US export controls. Beating it doesn't mean the 950PR beats the H100 or B200. But it does mean that for inference workloads on Chinese soil, domestic hardware is now a credible alternative, not a consolation prize.
The Chip That Does It: Ascend 950's Dual-Architecture Strategy
What makes the 950 interesting isn't just the raw spec sheet; it's the architectural cleverness. Huawei recognized that LLM inference has two fundamentally different phases, and designed two variants that share the same compute die but pair it with different memory:
| Variant | Phase | Memory | Bandwidth | Shipping |
|---|---|---|---|---|
| 950PR | Prefill + Recommendation | HiBL 1.0 (128GB) | 1.6 TB/s | ✅ Now (mass production since March 2026) |
| 950DT | Decode + Training | HiZQ 2.0 (144GB) | 4 TB/s | Q4 2026 |
Prefill (reading the entire input and computing the KV cache) is compute-bound: it needs raw FLOPs, not memory bandwidth, so cheaper, lower-bandwidth HBM works fine. Decode (generating one token at a time) is memory-bandwidth-bound: the bottleneck is how fast you can feed weights to the compute units. Here, 4 TB/s of bandwidth makes a real difference.
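The prefill/decode split can be sanity-checked with a back-of-the-envelope roofline calculation. The sketch below is a simplification, not a model of the real chip: it counts only weight traffic for a single matrix multiply (activations and KV-cache reads are ignored), assumes 1 byte per FP8 weight, and plugs in the 1 PFLOPS FP8 and 1.6 / 4 TB/s figures quoted in this article.

```python
def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs per byte above which a kernel is compute-bound (roofline model)."""
    return peak_flops / mem_bandwidth


def matmul_intensity(tokens: int, bytes_per_weight: float) -> float:
    """Approximate FLOPs per byte of weight traffic.

    A d x d weight matrix costs 2*d*d FLOPs per token and d*d*bytes_per_weight
    bytes to read once, so d cancels out of the ratio.
    """
    return 2.0 * tokens / bytes_per_weight


def bound(tokens, bytes_per_weight, peak_flops, mem_bandwidth) -> str:
    """Classify a workload as compute-bound or memory-bound."""
    intensity = matmul_intensity(tokens, bytes_per_weight)
    if intensity >= ridge_point(peak_flops, mem_bandwidth):
        return "compute-bound"
    return "memory-bound"


PEAK_FP8 = 1e15   # 1 PFLOPS FP8, from the roadmap figures in this article
BW_950PR = 1.6e12  # 1.6 TB/s
BW_950DT = 4e12    # 4 TB/s

# Prefill: thousands of prompt tokens share one pass over the weights.
print(bound(4096, 1.0, PEAK_FP8, BW_950PR))
# Decode: one new token per sequence, here with a batch of 32 sequences.
print(bound(32, 1.0, PEAK_FP8, BW_950DT))
```

Under these assumptions the ridge point is 625 FLOPs/byte on the 950PR and 250 on the 950DT. A 4,096-token prefill sits far above either, while decode stays below both until batch sizes reach the hundreds, which is why bandwidth rather than FLOPs sets decode speed.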
The 950DT's 4 TB/s HiZQ 2.0 memory puts it in the same league as NVIDIA's H200 (141GB / 4.8 TB/s). It won't be available until Q4 2026, but that's when the training-side gap starts to close.
The Self-Developed HBM Bet
HBM accounts for roughly 50% of an AI chip's cost. Huawei's decision to develop its own HBM — HiBL (low-cost/Budget Line) and HiZQ (high-performance) — isn't just about supply chain security. It enables customization that off-the-shelf HBM can't provide.
The local HBM supply chain is making real progress:
| Milestone | Status | Timeline |
|---|---|---|
| CXMT (长鑫存储) HBM3 samples delivered to Huawei | ✅ | Done |
| CXMT Shanghai packaging fab | 🟡 Construction | End of 2026 |
| CXMT HBM3E development | 🟡 In progress | Target 2027 |
| CXMT HBM3 mass production | ❌ Not started — no volume orders yet | Delayed |
The bottleneck: CXMT's HBM3 is still in testing. Raw materials only support sample runs, not mass production. The Huawei alliance is also working with Fujian Jinhua (福建晋华) and Wuhan Xinxin (武汉新芯) as secondary foundries, but these are supplementary capacity, not primary sources.
The pragmatic reality: HiBL 1.0 and HiZQ 2.0 are likely "self-developed" at the packaging and controller level, not at the DRAM die level. Huawei takes available DRAM dies, packages them with proprietary 2.5D stacking, and adds custom controllers. This would explain why HiBL 1.0 tops out at 1.6 TB/s: its bandwidth is bounded by the dies Huawei can source, not by its design ambition.
The Five Bottlenecks That Limit Delivery
HBM gets the headlines, but it's not the only constraint. Here are all five, ranked by severity:
🔴 1. Advanced Manufacturing (SMIC)
The hardest bottleneck. SMIC's N+2 (equivalent to 7nm, using DUV multipatterning since EUV is unavailable) has a monthly capacity of approximately 35,000-38,000 12-inch wafers. At ~92% yield, that translates to roughly 750,000 Ascend 950 chips per year.
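The wafer-to-chip arithmetic is worth spelling out, because the answer depends heavily on inputs the figures above don't state. The sketch below first derives what the quoted numbers imply per wafer, then shows a forward estimate. In that forward estimate, the share of N+2 capacity allocated to Ascend and the gross dies per wafer are hypothetical placeholders, not reported figures; the 2 compute dies per chip come from the packaging description later in this article.

```python
def implied_chips_per_wafer(chips_per_year: float, wafers_per_month: float) -> float:
    """Finished chips per wafer start if the whole monthly capacity fed one product."""
    return chips_per_year / (wafers_per_month * 12)


def chips_per_year(wafers_per_month, ascend_share, gross_dies_per_wafer,
                   yield_rate, compute_dies_per_chip):
    """Forward estimate: wafers -> good compute dies -> assembled chips."""
    good_dies = (wafers_per_month * 12 * ascend_share
                 * gross_dies_per_wafer * yield_rate)
    return good_dies / compute_dies_per_chip


# Implied by the quoted figures alone: roughly 1.64 to 1.79 chips per wafer start.
print(implied_chips_per_wafer(750_000, 38_000))
print(implied_chips_per_wafer(750_000, 35_000))

# Forward estimate with HYPOTHETICAL inputs (10% share, 40 gross dies per wafer).
print(chips_per_year(35_000, 0.10, 40, 0.92, 2))
```

Fewer than two chips per wafer start is far below what a 12-inch wafer can physically hold, so the 750,000 figure presumably reflects only the slice of N+2 capacity that Ascend actually gets, with the rest going to other products.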
750K sounds like a lot, but it has to serve the entire Chinese AI market, while NVIDIA ships millions of H100/B200 units annually. That is a severalfold gap in volume before any per-chip performance difference is counted.
SMIC plans to double capacity to 70,000 wafers/month during 2026, but without EUV, each further shrink needs more DUV multipatterning passes, which raises cost and cuts yield. The 950DT uses the same N+2 process. Domestic advanced-node capacity will remain the binding constraint through at least 2028.
🟡 2. Advanced Packaging
Ascend 950 requires 2.5D Chiplet packaging (2 compute dies + 2 I/O dies + HBM). This isn't a "nice to have" — without it, you can't assemble the chip.
| Supplier | Status |
|---|---|
| JCET (长电科技) — Dongguan HBM base | Running at full capacity |
| Tongfu Micro (通富微电) — SJ1/SJ lines | Fully loaded, emergency expansion |
| QuLiang Electronics (渠梁电子) | Accelerating expansion |
Packaging capacity is the tightest short-term bottleneck. New capacity from JCET and Tongfu's expansion won't meaningfully add supply until 2027. This is why "advanced packaging stocks" are the hottest semiconductor theme on China's A-share market in 2026.
🟡 3. Interconnect: Making 8,192 Cards Work as One Computer
The Atlas 950 SuperNode (8,192 cards, 160 cabinets, 1,000 square meters) requires a new interconnect protocol — Lingqu 2.0 / UnifiedBus. The predecessor (Lingqu 1.0) was validated on 384-card Atlas 900 systems (300+ deployed). Scaling from 384 to 8,192 is a leap in complexity:
- Full optical interconnect between cabinets
- 16 PB/s total bandwidth (10x global internet peak traffic)
- All-liquid cooling at MW-scale per cluster
Delivery is scheduled for Q4 2026. The engineering risk is real, but Huawei's track record with Lingqu 1.0 (proven at scale) suggests this is a schedule risk, not a technology risk.
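A quick calculation shows why the jump from 384 to 8,192 cards is more than a 21x problem. This sketch assumes the 16 PB/s figure is an aggregate spread evenly across all cards and uses decimal petabytes; both are assumptions, since the headline number doesn't say how it is measured.

```python
PB = 1e15  # decimal petabyte (assumption)


def per_card_bandwidth(total_bytes_per_s: float, cards: int) -> float:
    """Even share of the aggregate interconnect bandwidth per card."""
    return total_bytes_per_s / cards


def card_pairs(cards: int) -> int:
    """Number of distinct card-to-card pairs the fabric may need to route."""
    return cards * (cards - 1) // 2


old_cards, new_cards = 384, 8192
print(new_cards / old_cards)                          # growth in card count
print(card_pairs(new_cards) / card_pairs(old_cards))  # growth in card pairs
print(per_card_bandwidth(16 * PB, new_cards) / 1e12)  # TB/s per card
```

Card count grows about 21x, but the number of card pairs grows about 456x (from 73,536 to 33,550,336), and the even-share bandwidth works out to roughly 1.95 TB/s per card.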
🟢 4. Software Ecosystem (CANN)
CANN was fully open-sourced in December 2025. DeepSeek V4's successful port is the single biggest validation event to date. But the developer count gap is stark: ~87,000 CANN developers vs. ~3 million CUDA developers.
Huawei's strategy is "CUDA-to-CANN automated conversion tools" combined with PyTorch compatibility layers. This works for standard model architectures. Edge cases still require manual operator rewriting, the same kind of work that cost DeepSeek roughly 30 person-years.
For large enterprises with dedicated ML teams, this is doable. For smaller teams, it's a barrier.
🟢 5. Cooling and Power
Per-chip TDP is ~310W. At supernode scale, total power draw is in megawatts. Full liquid cooling is mandatory, and green power alignment adds infrastructure complexity. This is solvable — the technology exists — but deployment speed varies across data center operators.
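For a sense of scale, here is the accelerator-only power budget for one 8,192-card supernode at the quoted ~310W TDP. It deliberately leaves out host CPUs, networking, and cooling overhead, for which this article gives no figures, so treat it as a floor.

```python
def accelerator_power_mw(cards: int, tdp_watts: float) -> float:
    """Total accelerator power in megawatts, excluding all other components."""
    return cards * tdp_watts / 1e6


print(accelerator_power_mw(8192, 310))  # about 2.54 MW for the chips alone
```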
The Long-Term Outlook: An Honest Assessment
The trajectory is real
Huawei has a clear 3-generation roadmap:
| Generation | FP8 | FP4 | Memory BW | Expected |
|---|---|---|---|---|
| 950 (PR + DT) | 1 PFLOPS | 2 PFLOPS | 1.6 TB/s (PR) / 4 TB/s (DT) | 2026 |
| 960 | 2 PFLOPS | 4 PFLOPS | ~8 TB/s | Q4 2027 |
| 970 | 4 PFLOPS | 8 PFLOPS | ~12-16 TB/s | Q4 2028 |
Each generation roughly doubles specs. Revenue hit $12 billion in 2026 (up 60% from $7.5B in 2025). The business is scaling.
But it needs context
| Dimension | 2026 Reality | 2028 Target |
|---|---|---|
| Supply-demand gap | 🔴 Severe (750K chips against demand 2-3x that) | 🟡 Improving but not balanced |
| Performance vs. NVIDIA | 🟡 950DT ≈ H200 (already last gen for NVIDIA) | 🟡 960 ≈ 70-80% of 2027 NVIDIA |
| Process node | 🔴 7nm (no EUV) | 🔴 Still 7nm — Chiplet mitigates but can't eliminate |
| Market share (China inference) | ~20% | 40-50% (projected) |
The honest assessment: Ascend will not "catch up" to NVIDIA in absolute terms. The process gap (7nm DUV vs 3nm EUV+) is physical and cannot be willed away. But it doesn't need to catch up. The Chinese AI chip market is structurally bifurcating:
- Ascend takes ~50% of domestic demand
- NVIDIA holds the high end through H20 and smuggled/cloud-accessible H100
- Other domestic players (Cambricon, Moore Threads, Biren) split the remainder
For anyone building AI products for the Chinese market: this is not a question of "whether to switch." It's "when to switch." For anyone building for global markets: unaffected — continue with CUDA.
Two technology worlds are solidifying: CUDA World and CANN World.
What DeepSeek V4 on Ascend 950 really proved
Before April 2026, Huawei could say "our chips work." After April 2026, DeepSeek proved it with a 1.6T-parameter model, real production traffic, and actual cost numbers. The credibility gap is closed.
The remaining bottlenecks are all physical or temporal: more chips, more packaging lines, more fab capacity, more time for the ecosystem to mature. None of these have a quick fix. But they also don't depend on any single breakthrough — they're a production scaling problem, and production scaling responds to money and time.
China's AI chip ecosystem just passed its most important stress test. The bottlenecks that remain are hard, but they're the kind of hard that follows linear progress curves — not the binary win/lose of "can this even work."