정상록

DeepSeek V4-Pro: 1.6T MIT-Licensed Open Weights at 88% Less Than Opus 4.7

While OpenAI builds a desktop super-app, Anthropic doubles down on enterprise lock-in with Opus 4.7, and Google pushes its Gemini 3.1-Pro paid tier, DeepSeek went in the exact opposite direction. On April 24, 2026, they released V4-Pro and V4-Flash on Hugging Face under the MIT license. 1.6 trillion parameters. Currently the largest open-weight model ever published.

If you've been watching the open-source vs frontier gap close in slow motion since Llama 405B, this is the moment that gap collapsed for coding workloads.

The Numbers That Matter

Total parameters:   1.6T
Active per token:   49B (MoE)
Context window:     1M tokens (native, not retrofitted)
Pretrain corpus:    33T tokens
License:            MIT (commercial use OK)
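
Quick back-of-envelope math on what those specs imply for hosting. The bytes-per-parameter figures are my own assumptions (FP8 vs the FP4 path discussed below), and this counts weights only, not KV cache or activations:

# Weight storage and MoE sparsity from the published specs
total_params  = 1.6e12   # 1.6T total
active_params = 49e9     # 49B active per token

print(f"active fraction: {active_params / total_params:.1%}")   # ~3.1%
print(f"weights at FP8:  {total_params * 1.0 / 1e12:.1f} TB")   # 1 byte/param (assumed)
print(f"weights at FP4:  {total_params * 0.5 / 1e12:.1f} TB")   # 0.5 byte/param (assumed)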

Pricing vs Big 3

Model              Input ($/M)   Output ($/M)
V4-Pro             $1.74         $3.48
GPT-5.5            ~$10          $30
Claude Opus 4.7    ~$15          $25
V4-Flash           $0.14         $0.28

V4-Pro input is 88% cheaper than Opus 4.7's (output is about 86% cheaper). V4-Flash input is roughly 1.4% the cost of GPT-5.5 input. Fortune's startup desk wrote that this "embarrasses Western AI labs' pricing pages."

This isn't a loss-leader. It's structural — and that's the engineering story worth your time.
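
To sanity-check the percentages and see what they mean for a real bill, here's the arithmetic from the table above in Python. The prices come from the table; the monthly token volumes are made up for illustration:

# Prices from the table above: (input $/M tokens, output $/M tokens)
PRICES = {
    "deepseek-v4-pro":   (1.74, 3.48),
    "deepseek-v4-flash": (0.14, 0.28),
    "gpt-5.5":           (10.00, 30.00),
    "claude-opus-4.7":   (15.00, 25.00),
}

def monthly_cost(model, input_mtok, output_mtok):
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

# Hypothetical workload: 500M input tokens + 100M output tokens per month
for model in PRICES:
    print(f"{model:<18} ${monthly_cost(model, 500, 100):>9,.2f}")

On that hypothetical mix, V4-Pro lands around $1.2K/month versus roughly $10K for Opus 4.7.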

How They Got There — Hybrid Attention

The headline architectural change is CSA + HCA interleaved attention (Compressed Sparse Attention + Heavily Compressed Attention). Two attention variants stacked in alternating layers.

The result at 1M context, vs DeepSeek V3.2:

  • Single-token inference FLOPs: 27% of V3.2
  • KV cache memory: 10% of V3.2

V4-Flash is even more aggressive: 10% FLOPs, 7% KV cache.

For anyone who has tried to run long-context production workloads, this is the difference between "interesting demo" and "we can actually serve this." The bottleneck for 1M context has always been KV cache memory growing linearly with sequence length. Compress that to a tenth, and a single GPU can serve sessions that previously needed entire boxes.
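
For intuition on the memory side, here's a naive per-head KV-cache sizing formula. The layer/head/dim numbers are placeholders, not published V4-Pro specs, so treat this as a shape-of-the-problem sketch rather than real capacity planning:

# Naive KV cache size: 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Placeholder GQA-style config, NOT published V4-Pro numbers
baseline = kv_cache_gib(n_layers=60, n_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"1M-token session, naive cache: {baseline:.0f} GiB")         # ~229 GiB
print(f"same session at 10% of that:   {baseline * 0.10:.0f} GiB")  # fits on a single GPU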

Add to that:

  • FP4 quantization-aware training (precision baked in from pretraining, not bolted on after; see the sketch below)
  • A new optimizer (details in the partial tech report)
  • Reworked residual connections

The architecture doesn't just scale — it scales efficiently.
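
DeepSeek hasn't published the details of its FP4 recipe, so the sketch below is the generic "fake quantization" forward pass used in quantization-aware training, on an int4 grid rather than a true FP4 format. It's only meant to show mechanically what "precision baked in from pretraining" means: the model trains against its own quantization error instead of meeting it for the first time after training.

import numpy as np

def fake_quant_int4(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor fake quantization onto the int4 grid [-8, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale  # dequantize: the forward pass "sees" the quantization error

w = np.random.randn(256, 256).astype(np.float32)
w_q = fake_quant_int4(w)
print("max abs error:", float(np.abs(w - w_q).max()))  # bounded by ~scale/2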

Where V4-Pro Wins

Coding (#1 across the board):

LiveCodeBench:   93.5%   (vs Gemini 91.7%, Claude 88.8%)
Codeforces:      3206    (#1)
BrowseComp:      83.4%   (vs Opus 4.7 at 79.3%)
Terminal-Bench:  67.9%   (close to Opus 4.7's 69.4%)

If your stack is heavy on code generation, code review, or autonomous coding agents, V4-Pro is now the cost-performance leader by a wide margin.

Where V4-Pro Loses

Knowledge reasoning and complex agent workflows still belong to the Big 3:

GPQA Diamond:     90.1%   (Opus 4.7: 94.2%)
SWE-bench Pro:    55.4%   (Opus 4.7: 64.3%)
HLE:              behind frontier

The honest summary: coding is the blade, knowledge reasoning is still catching up.

What This Means for Self-Hosting

Three concrete changes:

1. Data-sovereign workloads finally have a frontier-class option

If your org can't send data to external APIs (healthcare, finance, public sector, legal), Llama 405B was the previous best self-host coding option — and it lagged frontier by a meaningful margin. V4-Pro closes that gap on the workloads where it matters most.

2. Token-cost-sensitive products can rebuild on V4-Flash

If you're a SaaS startup paying $X0K/month for Haiku or 4o-mini at scale, V4-Flash at $0.14/M input is roughly 1/10th the cost. Self-hosted, the marginal cost drops to whatever your GPUs cost to run.

3. The hardware story is about to get interesting

DeepSeek hinted at Huawei Ascend 950 integration. If that lands, the implication is "frontier-class model on non-NVIDIA silicon at lower TCO" — which would be the first credible break in NVIDIA dependency for inference.

Try It

# Pull from Hugging Face
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

# Or use the API
curl https://api.deepseek.com/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
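
If the endpoint stays OpenAI-compatible the way DeepSeek's earlier APIs were (an assumption on my part; check the current docs), the Python equivalent looks like this:

# pip install openai
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)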

Open Questions

  1. Will the GA release close the GPQA / SWE-bench gap with Opus 4.7?
  2. How quickly will domain-specific fine-tunes outperform Big 3 in verticals?
  3. Does this pressure OpenAI / Anthropic to release older model weights?

If you're running self-host benchmarks on actual workloads, drop your numbers in the comments. Curious what the production-side picture looks like.

