DeepSeek V4 Pro Review 2026: MIT 1.6T MoE for Self-Hosters

#deepseek #llm #moe #localllm

This article was originally published on aifoss.dev

TL;DR: DeepSeek V4 Pro is a 1.6T-parameter (49B active) MoE model released April 24, 2026 under the MIT license, with full open weights on HuggingFace. The benchmarks are frontier-class, but at ~862 GB of weights it is a datacenter model — no consumer rig runs it. For self-hosters, the open-source story that actually matters is V4-Flash (284B / 13B active) and the MIT license that lets you deploy either commercially.

| | V4 Pro (self-hosted) | V4 Pro (API) | V4-Flash (self-hosted) |
|---|---|---|---|
| Best for | Sovereign datacenter inference | Frontier quality, zero ops | Real single-node self-hosting |
| Min hardware | 8× H100 / 4× H200 (FP8) | API only | 1× A100 80GB (FP8) |
| Weights size | ~862 GB | n/a | ~158 GB |
| License | MIT | Proprietary endpoint | MIT |
| Context | 1M tokens | 1M tokens | 1M tokens |
| Cost | Hardware + power | $0.435 / $0.87 per 1M tokens (input / output) | Hardware + power |

Honest take: If you have a GPU cluster and a compliance reason, self-host V4 Pro. Everyone else should run V4-Flash locally or hit the V4 Pro API: at $0.87 per million output tokens, paying for Pro is cheaper than the electricity you would burn approximating it with a heavily quantized copy on undersized hardware.
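To put the API column in concrete terms, here is a rough cost sketch. It uses the two per-1M-token prices from the table above and assumes the first figure ($0.435) is the input rate. The monthly token volumes are hypothetical placeholders, and `api_cost` is just an illustrative helper name.

```bash
# Rough API cost estimate from the per-1M-token prices quoted above.
# Assumption: $0.435 is the input rate and $0.87 is the output rate.
INPUT_PRICE_PER_M=0.435    # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M=0.87    # USD per 1M output tokens

# api_cost INPUT_TOKENS OUTPUT_TOKENS -> prints the USD cost with 2 decimals
api_cost() {
  awk -v in_tok="$1" -v out_tok="$2" \
      -v in_price="$INPUT_PRICE_PER_M" -v out_price="$OUTPUT_PRICE_PER_M" \
      'BEGIN { printf "%.2f\n", in_tok / 1000000 * in_price + out_tok / 1000000 * out_price }'
}

# Hypothetical monthly workload: 2B input tokens, 200M output tokens
api_cost 2000000000 200000000   # prints 1044.00
```

Comparing that figure against a quote for rented GPU hours at the same throughput is a quick way to test the "pay or self-host" call for your own volumes.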

What DeepSeek V4 Pro Is

DeepSeek released the V4 series on April 24, 2026. There are two open-weight checkpoints: V4-Pro (1.6T total parameters, ~49B activated per token) and V4-Flash (284B total, ~13B activated). Both are Mixture-of-Experts models, both ship under the MIT license, and both support a context window of up to 1 million tokens.

V4-Pro was pre-trained on 33T tokens. The headline architectural change is a hybrid attention scheme combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). The point of that combination is long-context efficiency: at a 1M-token context, DeepSeek reports V4-Pro needs roughly 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2. That is the difference between a 1M context that is a marketing number and one you can actually serve.

Key specs as of release:

Parameters: 1.6T total / ~49B active (MoE) for Pro; 284B / ~13B for Flash
Context: up to 1,000,000 tokens
License: MIT — commercial use, fine-tuning, redistribution, no revenue cap
Released: April 24, 2026 by DeepSeek
HuggingFace: deepseek-ai/DeepSeek-V4-Pro, deepseek-ai/DeepSeek-V4-Flash
Serving: requires vLLM ≥ 0.7.0 or SGLang ≥ 0.4.4
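Because these checkpoints run to hundreds of gigabytes, it is worth checking free disk space before pulling either repository. The sketch below is one way to do that. `has_space` is a hypothetical helper, the 160 GB threshold is a rounded-up version of the ~158 GB V4-Flash figure, and the download command assumes the `huggingface-cli` tool from the `huggingface_hub` package is installed.

```bash
# has_space DIR REQUIRED_GB -> exit 0 if DIR's filesystem has that much free space
has_space() {
  # df -Pk reports available space in 1024-byte blocks in column 4
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  # Convert decimal gigabytes to 1024-byte blocks
  required_kb=$(( $2 * 1000000000 / 1024 ))
  [ "$avail_kb" -ge "$required_kb" ]
}

# Usage (not run here; the target directory is a placeholder):
# if has_space "$HOME/models" 160; then
#   huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
#     --local-dir "$HOME/models/DeepSeek-V4-Flash"
# else
#   echo "not enough free space for the checkpoint" >&2
# fi
```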

The MIT license is the load-bearing detail. A 1.6T model that beats most of the closed frontier on coding, released with weights you can legally deploy commercially and fine-tune, is a different category of thing than an API you rent. Whether you can run it is a separate question — and the honest answer for almost everyone is "not on your own hardware."

The $10B Open-Source Bet, In Plain Terms

The reason this release got outsized attention isn't just the benchmarks. In May 2026, DeepSeek's first outside funding round started closing. Reporting varies on the exact figure: Bloomberg framed it as a ~70 billion yuan (~$10 billion) raise in valuation terms, while CNBC and The Information later pegged the actual money raised closer to $7–7.4 billion, with founder Liang Wenfeng reportedly committing around 20 billion yuan of his own capital. Don't anchor on a single dollar figure — the reports genuinely disagree, and the round had not officially closed as of early June 2026.

What's consistent across every report is the strategic stance: Liang told investors DeepSeek will keep releasing open-source models rather than pivot to short-term commercialization. For a self-hoster, that pledge is worth more than the exact size of the round. It signals that V4-Pro's MIT weights are a deliberate strategy, not a one-off, which lowers the risk of building a private stack on the DeepSeek line and having the rug pulled in a future "open-weights but non-commercial" relicense.

Benchmark Reality Check

DeepSeek positions V4-Pro-Max (the maximum-reasoning-effort mode) as a frontier coding model. Independent and aggregator sources report the following for V4-Pro; treat these secondary-source numbers as directional rather than gospel:

| Benchmark | V4-Pro (reported) | What it measures |
|---|---|---|
| SWE-bench Verified | 80.6% | Real GitHub issue resolution |
| LiveCodeBench | 93.5% | Competitive coding, contamination-resistant |
| Codeforces | 3206 (rating) | Algorithmic problem solving |
| GPQA Diamond | 90.1% | Graduate-level science reasoning |
| MMLU-Pro | 87.5% | Broad knowledge, harder MMLU variant |

If those hold up, V4-Pro sits in the same conversation as the top closed models on coding — at open weights and a fraction of the API price. The number that matters most for the "should I pay or self-host" decision is SWE-bench Verified: 80.6% is genuinely strong, and it's the kind of agentic coding workload where a 1M context plus cheap cached input changes how you'd structure a coding agent.

One caveat worth stating plainly: aggregator sites are not the model card, and "Max" reasoning mode inflates latency and token cost. Benchmark a representative slice of your own workload before you treat any of these as a procurement decision.

Can You Actually Self-Host It?

This is where the romance meets the spec sheet. V4-Pro's released weights are roughly 862 GB, and a full BF16 copy would be around 3.2 TB (1.6T parameters × 2 bytes each). The realistic deployment targets:

FP8: ~500 GB minimum — at least 4× H200 (141 GB each) or 8× H100 (80 GB each).
INT4: a 4× H100 cluster (320 GB) becomes viable, with measurable quality loss on reasoning and math.
Consumer GPUs: not happening for Pro. A single RTX 4090 holds 24 GB. You would need dozens, and the interconnect would be the bottleneck long before VRAM.
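The GPU counts above can be sanity-checked with a weights-only calculation. The sketch below takes the quoted footprint and per-GPU memory as inputs. Rounding up to a power of two is an assumption that reflects a common tensor-parallel constraint, and `min_gpus` and `next_pow2` are illustrative helper names.

```bash
# Weights-only sizing: smallest GPU count whose combined VRAM covers a footprint.
# Ignores KV cache, activations, and interconnect, so treat results as a floor.

# min_gpus FOOTPRINT_GB PER_GPU_GB -> prints ceil(footprint / per_gpu)
min_gpus() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# next_pow2 N -> smallest power of two >= N (assumed tensor-parallel constraint)
next_pow2() {
  n=1
  while [ "$n" -lt "$1" ]; do
    n=$(( n * 2 ))
  done
  echo "$n"
}

# Footprint (500 GB) and per-GPU sizes are the figures quoted in the list above.
echo "FP8 on H200:     $(next_pow2 "$(min_gpus 500 141)") GPUs"
echo "FP8 on H100:     $(next_pow2 "$(min_gpus 500 80)") GPUs"
echo "FP8 on RTX 4090: $(min_gpus 500 24) GPUs before any rounding"
```

With those inputs the helpers give 4 H200s and 8 H100s, in line with the list, and 21 RTX 4090s on VRAM alone, before interconnect is even considered.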

For the cluster-class hardware Pro demands, renting is almost always the right first move. A few hours on rented H100/H200 nodes via RunPod costs less than the depreciation on a single owned card, and you can validate the deployment before committing capital. If you're sizing a permanent local build for big MoE models, the GPU-server tradeoffs are covered in more depth at runaihome.com.

V4-Flash is the model self-hosters will actually run. Its FP8 instruct checkpoint is ~158 GB. Add ~10 GB for a full 1M-token KV cache and runtime overhead, and you're budgeting roughly 170–175 GB — a single 8×24 GB rig, a 2× H100 node, or a single A100 80GB at reduced context. Unsloth published GGUF quants for V4-Flash within about 48 hours of release; community reports put Q4_K_M as the sweet spot, fitting on 1× 80 GB or 2× 48 GB while staying close to FP8 quality. Aggressive INT4 (GGUF/AWQ/GPTQ) can squeeze Flash to ~80 GB — potentially 4× RTX 4090 — but the quality drop on math and complex instruction-following is real, not theoretical.

A Minimal vLLM Deployment

Here's a realistic single-node V4-Flash launch on an 8×80 GB box. The flags matter more than usual for a model this size: tensor parallelism across all GPUs, an explicit context cap, and the trust-remote-code flag for the custom attention.

```bash
# Requires vLLM >= 0.7.0
pip install "vllm>=0.7.0"

# Serve V4-Flash with an OpenAI-compatible endpoint
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --port 8000
```

Expected startup log (abbreviated):

```text
INFO server_args.py: Using FP8 weights, 158.2 GiB total
INFO worker.py: TP=8, KV cache allocated for 131072 tokens
INFO api_server.py: Started server on http://0.0.0.0:8000
```
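Loading a checkpoint this large can take a while, so it helps to wait for the server before sending traffic. This is a minimal polling sketch, assuming `curl` is installed and that the server exposes a `/health` route on the port used above. `wait_for_server` and its default timeout values are illustrative choices, not part of vLLM.

```bash
# wait_for_server URL [TIMEOUT_S] [INTERVAL_S]
# Polls URL until it returns a successful response; returns 1 on timeout.
wait_for_server() {
  url=$1
  timeout_s=${2:-600}
  interval_s=${3:-5}
  waited=0
  # --fail makes curl exit non-zero on HTTP errors as well as connection errors
  while ! curl --silent --fail --output /dev/null "$url"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "server not ready after ${timeout_s}s" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$(( waited + interval_s ))
  done
  echo "server ready after ~${waited}s"
}

# Usage (not run here): wait_for_server http://localhost:8000/health 900 10
```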

Then it's a drop-in OpenAI client:


bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Refactor this