TL;DR — DeepSeek shipped V4 preview on the same day as OpenAI's GPT-5.5. Features include 1.6T-parameter Pro, 284B Flash, 1M context on both, Apache 2.0 weights on Hugging Face, and API pricing of $1.74 / $3.48 per million tokens for Pro—significantly less expensive than Opus 4.7, GPT-5.5, or Kimi K2.6. ofox will support it at first opportunity.
## What DeepSeek shipped
From the official announcement on April 24 2026:
- Two variants: `deepseek-v4-pro` (1.6T total parameters, 49B activated) and `deepseek-v4-flash` (284B total, 13B activated). Both are MoE.
- 1M-token context on both, max output 384K.
- Dual modes: Thinking / Non-Thinking, with three effort levels (`high`, `max`, plus non-think). See the thinking-mode docs.
- Open source, Apache 2.0 — weights on Hugging Face.
- API live today. Same `base_url`; just change the model ID. Both the OpenAI ChatCompletions and Anthropic protocols are supported.
- Deprecation: `deepseek-chat` and `deepseek-reasoner` retire July 24, 2026. They currently route to `deepseek-v4-flash`.
The timing is deliberate. OpenAI shipped GPT-5.5 the same day. DeepSeek needed a launch window where "open-source 1M-context MoE at a fraction of the cost" would not be buried under a closed-source price hike. Shipping simultaneously allowed both to split the news cycle.
## Architecture — the part that actually matters
V4 introduces a hybrid attention mechanism: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA). Combined with Manifold-Constrained Hyper-Connections (mHC) for residual signal propagation and the Muon optimizer for training stability, the efficiency gains at 1M context are:
- 27% of V3.2's single-token inference FLOPs
- 10% of V3.2's KV cache
This is the headline efficiency claim. Long-context inference was historically the main cost barrier for open models serving 1M windows; V4 cuts KV-cache requirements by roughly an order of magnitude. The model was pre-trained on 32T+ tokens in FP4 + FP8 mixed precision: MoE experts at FP4, most other parameters at FP8.
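To make the cache reduction concrete, here is a back-of-envelope sizing sketch for a single 1M-token session. The baseline per-token cache footprint used below is an illustrative assumption, not a published V3.2 figure; only the "10% of V3.2's KV cache" ratio comes from the announcement.

```python
# Back-of-envelope KV-cache sizing at a 1M-token context window.
# ASSUMPTION: the 70 KB/token baseline is hypothetical; the 10% ratio
# is the announced V4-vs-V3.2 figure.

CONTEXT_TOKENS = 1_000_000
BASELINE_KV_BYTES_PER_TOKEN = 70 * 1024   # hypothetical V3.2-class footprint
V4_RATIO = 0.10                           # V4 keeps ~10% of V3.2's cache

baseline_gb = CONTEXT_TOKENS * BASELINE_KV_BYTES_PER_TOKEN / 1024**3
v4_gb = baseline_gb * V4_RATIO

print(f"baseline: {baseline_gb:.1f} GB, V4: {v4_gb:.1f} GB per 1M-token session")
```

Under that assumption, a cache that would have needed a dedicated accelerator's worth of memory fits comfortably alongside other sessions — which is what makes serving 1M windows at open-model prices plausible.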
The Flash variant is not a trimmed Pro — it is a separately trained MoE at 284B / 13B activated. Flash-Max (max thinking effort) approaches Pro-level reasoning on most benchmarks with substantially lower serving cost.
## The Arena Code numbers
Arena AI's live code leaderboard placed V4-Pro Thinking at #3 among open models, ahead of prior DeepSeek releases by a substantial margin:
| Rank | Model | Elo |
|---|---|---|
| 1 | GLM-5.1 | 1,534 |
| 2 | Kimi-K2.6 | 1,529 |
| 3 | DeepSeek-V4 Pro (Thinking) | 1,456 |
| 4 | GLM-4.7 | 1,440 |
| 12 | DeepSeek-V3.2 (Thinking) | 1,368 |
The V3.2 → V4-Pro jump is 88 Elo, enough to carry DeepSeek from #12 to #3 on the current board. This is a genuine generational advance, not an incremental refresh.
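To translate that 88-point gap into something tangible, the standard logistic Elo formula gives the implied head-to-head preference rate (Arena's Bradley-Terry fit behaves similarly for a single pairwise comparison):

```python
# Expected head-to-head win rate implied by an Elo gap,
# using the standard logistic Elo formula.

def expected_score(elo_a: int, elo_b: int) -> float:
    """Probability that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# V4-Pro (1456) vs V3.2 (1368): an 88-point gap
p = expected_score(1456, 1368)
print(f"{p:.3f}")  # ~0.624: V4-Pro preferred in roughly 62% of pairwise votes
```

A 62/38 split may sound modest, but on a crowded leaderboard where adjacent models differ by single-digit Elo, it is a decisive margin.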
## Full benchmark grid — vs K2.6, GLM-5.1, Opus 4.6, GPT-5.4, Gemini 3.1 Pro
DeepSeek published comprehensive head-to-head comparisons against top open and closed models.
The honest assessment, benchmark by benchmark:
**Where V4-Pro wins outright:**
| Benchmark | V4-Pro Max | K2.6 Thinking | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| Chinese-SimpleQA | 84.4 | 75.9 | 76.2 | 76.8 | 85.9 |
| LiveCodeBench | 93.5 | 89.6 | 88.8 | — | 91.7 |
| Codeforces (rating) | 3206 | — | — | 3168 | 3052 |
| HMMT 2026 Feb | 95.2 | 92.7 | 96.2 | 97.7 | 94.7 |
| IMOAnswerBench | 89.8 | 86.0 | 75.3 | 91.4 | 81.0 |
| MCPAtlas Public | 73.6 | 66.6 | 73.8 | 67.2 | 69.2 |
Codeforces 3206 is significant. It edges out GPT-5.4 (xHigh) at 3168, putting an open model into competitive-programming territory that closed frontier models have traditionally dominated.
**Where V4-Pro loses to K2.6:**
| Benchmark | V4-Pro | K2.6 Thinking |
|---|---|---|
| SWE Pro (resolved) | 55.4 | 58.6 |
| SWE Multilingual | 76.2 | 76.7 |
| HLE w/tools | 48.2 | 54.0 |
| GPQA Diamond | 90.1 | 90.5 |
SWE-Bench Pro is the most consequential metric for "fix a real GitHub issue" scenarios. K2.6's 58.6 versus V4-Pro's 55.4 is a 3.2-point gap: modest, but consistent with the Arena Code leaderboard, where K2.6 holds a 73-Elo edge.
**Where V4-Pro trails the closed frontier:**
- MRCR 1M (long-context retrieval): 83.5 vs Opus 4.6's 92.9. Opus remains the long-context leader.
- CorpusQA 1M: 62.0 vs Opus 71.7. The pattern persists.
- GDPval-AA (Elo): 1554 vs GPT-5.4's 1674 and Opus 4.6's 1619. Knowledge-work economic value still favors proprietary models.
- HLE (no tools): 37.7 vs Gemini 3.1 Pro's 44.4.
**Flash-Max holds up:**
V4-Flash-Max scores 86.2 on MMLU-Pro (Pro: 87.5), 91.6 on LiveCodeBench (Pro: 93.5), and 52.6 on SWE-Pro (Pro: 55.4). On most tasks the quality gap between Flash and Pro is narrow, while Flash is dramatically cheaper to serve.
## Pricing — where V4 really changes the calculus
From the DeepSeek pricing documentation:
| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| `deepseek-v4-flash` | $0.14 / M | $0.028 / M | $0.28 / M |
| `deepseek-v4-pro` | $1.74 / M | $0.145 / M | $3.48 / M |
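The miss/hit split means your effective input price depends on how repetitive your prompts are. A quick sketch of the blended rate, using the prices above — the 60% cache hit rate is an illustrative assumption, not a measured figure:

```python
# Effective input price per million tokens, given a cache hit rate.
# ASSUMPTION: the 60% hit rate is illustrative; prices are from the
# pricing table.

def blended_input_price(miss_price: float, hit_price: float, hit_rate: float) -> float:
    """Weighted average of cache-miss and cache-hit input pricing."""
    return (1 - hit_rate) * miss_price + hit_rate * hit_price

pro = blended_input_price(1.74, 0.145, 0.60)     # -> $0.783 / M
flash = blended_input_price(0.14, 0.028, 0.60)   # -> $0.0728 / M
print(f"v4-pro: ${pro:.3f}/M, v4-flash: ${flash:.4f}/M")
```

For agent loops that resend the same system prompt and tool schemas on every turn, hit rates can run high, pulling the effective Pro input price well under a dollar per million.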
Comparison to the frontier:
| Model | Input | Output |
|---|---|---|
| DeepSeek V4-Pro | $1.74 | $3.48 |
| Kimi K2.6 (non-think) | $1.40 | $5.60 |
| GPT-5.5 | $5.00 | $30.00 |
| Claude Opus 4.7 | $15.00 | $75.00 |
V4-Pro output is $3.48 versus GPT-5.5's $30, an 8.6× reduction. Against Opus 4.7 the advantage is 21×. Flash, at $0.28 per million output tokens, is close to a rounding error.
This is the most significant story of the release. You can deploy a 1M-context, Codeforces-3200-tier reasoning model in production for the budget previously required for a mid-tier chat endpoint.
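To see what the table means for a real budget, here is a quick daily-cost comparison. The workload volumes (10M input + 2M output tokens per day) are illustrative assumptions, and cache discounts are ignored:

```python
# Daily cost of a representative workload across the comparison table.
# ASSUMPTION: 10M input / 2M output tokens per day is illustrative;
# prices ($/M tokens) come from the table above. Cache discounts ignored.

PRICES = {  # model: (input $/M, output $/M)
    "deepseek-v4-pro": (1.74, 3.48),
    "kimi-k2.6":       (1.40, 5.60),
    "gpt-5.5":         (5.00, 30.00),
    "claude-opus-4.7": (15.00, 75.00),
}

def daily_cost(model: str, input_m: float = 10, output_m: float = 2) -> float:
    """Total daily spend in dollars for the given token volumes."""
    inp, out = PRICES[model]
    return input_m * inp + output_m * out

for model in PRICES:
    print(f"{model:16s} ${daily_cost(model):8.2f}/day")
```

Under those volumes the spread is roughly $24/day on V4-Pro versus $110 on GPT-5.5 and $300 on Opus 4.7 — the difference between a line item and a budget meeting.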
## Community takes
First-day reactions from open-source and research communities:
- "Apache 2.0 matters." V3 was MIT; V4 shifts to Apache 2.0, which adds an explicit patent grant. For commercial deployments this is the material change.
- "Chinese SimpleQA is a wake-up call." 84.4 on Chinese-SimpleQA surpasses every proprietary model except Gemini 3.1 Pro. For Chinese-first applications this represents the first open-weight option achieving genuine parity with leading closed models.
- "SWE-Pro is closer than the Arena board suggests." K2.6 leads by 3 points on SWE-Pro, but V4-Pro leads on LiveCodeBench and Codeforces. Short-form code generation and long-horizon codebase resolution are different competencies, and the wins split cleanly along that line.
- "The 1M context is real, but not Opus-level." MRCR and CorpusQA demonstrate Opus 4.6 continues to dominate long-context retrieval. V4's advantage is efficiency (10% KV cache), not superior absolute retrieval capability.
## Access via ofox (coming soon)
ofox currently serves `deepseek/deepseek-v3.2`. V4-Pro and V4-Flash are being added as soon as possible; expect them on the model list shortly.
For immediate V4 access, you can call DeepSeek's API directly:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Port this Rust service to Go, preserving concurrency semantics"}],
    extra_body={"thinking": {"type": "enabled"}}  # enable Thinking mode
)
print(response.choices[0].message.content)
```
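Since V4 exposes both a Thinking toggle and effort levels, a small helper that builds the request body makes the mode switching explicit. The `{"thinking": {"type": ..., "effort": ...}}` shape and the `"high"`/`"max"` effort values below are inferred from the announcement's effort levels, not confirmed field names — treat this as a sketch:

```python
# Build a ChatCompletions-style request body for V4's modes.
# ASSUMPTION: the "thinking" field shape and the "high"/"max" effort
# values are inferred from the announcement, not confirmed field names.

def v4_request(prompt: str, thinking: bool = True, effort: str = "high") -> dict:
    """Return a request body dict; pass it via extra_body/kwargs when calling."""
    body = {
        "model": "deepseek-v4-pro",
        "messages": [{"role": "user", "content": prompt}],
    }
    if thinking:
        body["thinking"] = {"type": "enabled", "effort": effort}
    else:
        body["thinking"] = {"type": "disabled"}
    return body

# Max-effort reasoning for a hard task; non-think for cheap chat.
hard = v4_request("Prove this scheduler is deadlock-free", effort="max")
cheap = v4_request("Summarize this changelog", thinking=False)
```

Routing cheap traffic to non-think (or to `deepseek-v4-flash`) and reserving max effort for hard problems is where the pricing gap above compounds.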
Once ofox integrates V4 into the aggregator, migration is a one-line change: same ofox key, same https://api.ofox.ai/v1 base URL, just swap the model ID to `deepseek/deepseek-v4-pro` or `deepseek/deepseek-v4-flash`. Register at ofox.ai and one credential covers V4 on release alongside GPT-5.5, Claude, Gemini, and Kimi K2.6.
## Should you switch?
- **Switch to V4-Pro** if you run Kimi K2.6 for Chinese-heavy applications, competitive-programming-style code generation, or Codeforces-grade reasoning. The Chinese SimpleQA and Codeforces numbers justify the move.
- **Switch to V4-Flash** if you are paying $1-2 per million output tokens for anything. Flash-Max trails Pro by 1-3 points on most knowledge benchmarks while costing 12× less on output.
- **Stay on K2.6** if your workload is SWE-Bench-style codebase resolution, agent tool calls under high concurrency, or anything where the Arena Code delta (K2.6 +73 Elo) matches your task.
- **Stay on the closed frontier (GPT-5.5 / Opus 4.7)** if you need long-context retrieval over millions of tokens (Opus still dominates MRCR), GDPval-grade knowledge work (GPT-5.4 still leads), or agentic terminal workflows (GPT-5.5's Terminal-Bench 82.7% stands alone).
## Related reading
- Kimi K2.6 Released: 256K Context, Native Video, Beats Claude Opus 4.6
- GPT-5.5 Released: First Fully Retrained Base Model Since GPT-4.5
- GLM 5 API: Pricing, Pony-Alpha, and Zhipu's New Frontier
Originally published on ofox.ai/blog.