Owen

Posted on • Originally published at ofox.ai

DeepSeek V4 Released: Open-Source 1.6T MoE, 1M Context, Apache 2.0 — and It's Already on the API

TL;DR — DeepSeek shipped the V4 preview on the same day as OpenAI's GPT-5.5: a 1.6T-parameter Pro, a 284B Flash, 1M context on both, Apache 2.0 weights on Hugging Face, and API pricing of $1.74 / $3.48 per million tokens for Pro, significantly cheaper than Opus 4.7, GPT-5.5, or Kimi K2.6. ofox will add support as soon as possible.

What DeepSeek shipped

From the official announcement on April 24 2026:

  • Two variants: deepseek-v4-pro (1.6T total parameters, 49B activated) and deepseek-v4-flash (284B total, 13B activated). Both are MoE.
  • 1M-token context on both, max output 384K.
  • Dual modes: Thinking / Non-Thinking, with three effort levels (high, max, plus non-think). See thinking mode docs.
  • Open source, Apache 2.0 — weights on Hugging Face.
  • API live today. Same base_url, change model ID. Both OpenAI ChatCompletions and Anthropic protocols supported.
  • Deprecation: deepseek-chat and deepseek-reasoner retire July 24 2026. They currently route to deepseek-v4-flash.
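Per the announcement, switching variants is just a model-ID change on the existing endpoint, with thinking effort carried as a request field. Here is a minimal sketch of what the request bodies might look like; the `thinking` field's shape mirrors DeepSeek's V3-era convention and is an assumption, not a confirmed V4 schema, so check the thinking-mode docs before relying on it:

```python
# Sketch: request payloads for the two V4 variants on the same endpoint.
# The "thinking" field shape is assumed from DeepSeek's V3-era convention,
# not confirmed for V4.

BASE_URL = "https://api.deepseek.com"

def build_request(model, prompt, effort=None):
    """Build a ChatCompletions-style payload; effort=None means non-thinking."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if effort is not None:
        # Assumed encoding of the high/max effort levels from the announcement.
        body["thinking"] = {"type": "enabled", "effort": effort}
    return body

pro = build_request("deepseek-v4-pro", "Summarize this RFC", effort="high")
flash = build_request("deepseek-v4-flash", "Summarize this RFC")  # non-think

print(pro["model"], "thinking" in pro)      # deepseek-v4-pro True
print(flash["model"], "thinking" in flash)  # deepseek-v4-flash False
```

Since `deepseek-chat` and `deepseek-reasoner` currently route to `deepseek-v4-flash`, existing callers are already on V4-era serving; the deprecation just forces the explicit ID by July 24 2026.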

The timing is deliberate. OpenAI shipped GPT-5.5 the same day. DeepSeek needed a launch window where "open-source 1M-context MoE at a fraction of the cost" would not be buried under a closed-source price hike. Shipping simultaneously allowed both to split the news cycle.

Architecture — the part that actually matters

V4 introduces a hybrid attention mechanism: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA). Combined with Manifold-Constrained Hyper-Connections (mHC) for residual signal propagation and the Muon optimizer for training stability, the efficiency gains at 1M context are:

  • 27% of V3.2's single-token inference FLOPs
  • 10% of V3.2's KV cache

This represents the primary efficiency narrative. Long-context inference was historically the main cost barrier for open models serving 1M windows; V4 reduces KV cache requirements by roughly an order of magnitude. The model was pre-trained on 32T+ tokens using FP4 + FP8 mixed precision — MoE experts at FP4, most other parameters at FP8.
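To see why a 10× KV-cache cut matters at 1M context, here is a back-of-the-envelope sketch. The layer count, KV-head count, head dimension, and FP8-cache byte width below are illustrative assumptions, not published V4 specs; only the 10% ratio comes from the announcement:

```python
# Back-of-the-envelope KV-cache sizing at 1M context.
# ALL architecture numbers below are illustrative assumptions, not published
# V4 specs; only the "10% of V3.2's KV cache" ratio is from the announcement.

CONTEXT = 1_000_000       # tokens
LAYERS = 64               # assumed
KV_HEADS = 8              # assumed (grouped / compressed KV)
HEAD_DIM = 128            # assumed
BYTES_PER_VALUE = 1       # FP8 cache, assumed

def kv_cache_gb(context, layers, kv_heads, head_dim, bytes_per_value):
    # Factor of 2 covers both keys and values.
    return 2 * context * layers * kv_heads * head_dim * bytes_per_value / 1e9

baseline = kv_cache_gb(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
v4_like = baseline * 0.10  # the announced 10% ratio

print(f"baseline-style cache: {baseline:.1f} GB per 1M-token sequence")
print(f"at 10% (V4 claim):    {v4_like:.1f} GB per 1M-token sequence")
```

Under these assumptions a single 1M-token sequence drops from the hundred-gigabyte range to the low tens of gigabytes, which is the difference between multi-node serving and a single accelerator's memory budget.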

The Flash variant is not a trimmed Pro — it is a separately trained MoE at 284B / 13B activated. Flash-Max (max thinking effort) approaches Pro-level reasoning on most benchmarks with substantially lower serving cost.

The Arena Code numbers

Arena AI's live code leaderboard placed V4-Pro Thinking at #3 among open models, ahead of prior DeepSeek releases by a substantial margin:

| Rank | Model | Elo |
|---|---|---|
| 1 | GLM-5.1 | 1,534 |
| 2 | Kimi-K2.6 | 1,529 |
| 3 | DeepSeek-V4 Pro (Thinking) | 1,456 |
| 4 | GLM-4.7 | 1,440 |
| 12 | DeepSeek-V3.2 (Thinking) | 1,368 |

The V3.2 → V4-Pro jump is 88 Elo, the same gap that separates #3 from #12 on the current board. That is a genuine generational advance, not an incremental refresh.

Full benchmark grid — vs K2.6, GLM-5.1, Opus 4.6, GPT-5.4, Gemini 3.1 Pro

DeepSeek published comprehensive head-to-head comparisons against top open and closed models.

The honest assessment, benchmark by benchmark:

Where V4-Pro wins outright:

| Benchmark | V4-Pro Max | K2.6 Thinking | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| Chinese-SimpleQA | 84.4 | 75.9 | 76.2 | 76.8 | 85.9 |
| LiveCodeBench | 93.5 | 89.6 | 88.8 | 91.7 | — |
| Codeforces (rating) | 3206 | — | — | 3168 | 3052 |
| HMMT 2026 Feb | 95.2 | 92.7 | 96.2 | 97.7 | 94.7 |
| IMOAnswerBench | 89.8 | 86.0 | 75.3 | 91.4 | 81.0 |
| MCPAtlas Public | 73.6 | 66.6 | 73.8 | 67.2 | 69.2 |

(— = no score published for that cell.)

Codeforces 3206 is the standout. It edges past GPT-5.4 (xHigh) at 3168, in competitive-programming territory that closed frontier models have traditionally owned.

Where V4-Pro loses to K2.6:

| Benchmark | V4-Pro | K2.6 Thinking |
|---|---|---|
| SWE Pro (resolved) | 55.4 | 58.6 |
| SWE Multilingual | 76.2 | 76.7 |
| HLE w/ tools | 48.2 | 54.0 |
| GPQA Diamond | 90.1 | 90.5 |

SWE-Bench Pro is the most consequential metric for "fix a real GitHub issue" scenarios. K2.6's 58.6 versus V4-Pro's 55.4 is a gap of roughly 3 points, modest but consistent with the Arena Code leaderboard, where K2.6 holds a 73-Elo lead.

Where V4-Pro trails the closed frontier:

  • MRCR 1M (long-context retrieval): 83.5 vs Opus 4.6's 92.9. Opus remains the long-context leader.
  • CorpusQA 1M: 62.0 vs Opus 71.7. The pattern persists.
  • GDPval-AA (Elo): 1554 vs GPT-5.4's 1674 and Opus 4.6's 1619. Knowledge-work economic value still favors proprietary models.
  • HLE (no tools): 37.7 vs Gemini 3.1 Pro's 44.4.

Flash-Max holds up:

V4-Flash-Max achieves 86.2 on MMLU-Pro (Pro at 87.5), 91.6 on LiveCodeBench (Pro at 93.5), and 52.6 on SWE-Pro (Pro at 55.4). On most tasks the quality gap between Flash and Pro remains narrow — while Flash commands dramatically reduced serving cost.

Pricing — where V4 really changes the calculus

From the DeepSeek pricing documentation:

| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| deepseek-v4-flash | $0.14 / M | $0.028 / M | $0.28 / M |
| deepseek-v4-pro | $1.74 / M | $0.145 / M | $3.48 / M |
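Because cache hits are priced roughly an order of magnitude below misses, the effective input price depends heavily on your hit rate. A small calculator using only the published prices from the table above (the 70% hit rate and token counts are just an example workload, not anything DeepSeek publishes):

```python
# Blended per-request cost from the published V4 prices.
# Prices are $ per 1M tokens; the cache hit rate is an example input.

PRICES = {  # (input_miss, input_hit, output)
    "deepseek-v4-flash": (0.14, 0.028, 0.28),
    "deepseek-v4-pro": (1.74, 0.145, 3.48),
}

def cost_usd(model, input_tokens, output_tokens, hit_rate=0.0):
    miss, hit, out = PRICES[model]
    blended_input = hit_rate * hit + (1 - hit_rate) * miss
    return (input_tokens * blended_input + output_tokens * out) / 1e6

# Example: 200K-token context, 4K output, 70% of input served from cache.
print(f"pro:   ${cost_usd('deepseek-v4-pro', 200_000, 4_000, 0.7):.4f}")
print(f"flash: ${cost_usd('deepseek-v4-flash', 200_000, 4_000, 0.7):.4f}")
```

Under those example numbers a long-context Pro request lands around fourteen cents, and the same request on Flash around a cent and a half, which is the practical meaning of the cache-hit column.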

Comparison to the frontier:

| Model | Input | Output |
|---|---|---|
| DeepSeek V4-Pro | $1.74 | $3.48 |
| Kimi K2.6 (non-think) | $1.40 | $5.60 |
| GPT-5.5 | $5.00 | $30.00 |
| Claude Opus 4.7 | $15.00 | $75.00 |

V4-Pro output is $3.48 versus GPT-5.5's $30.00, an 8.6× reduction. Against Opus 4.7 the advantage is about 21×. Flash, at $0.28 output, is close to free by comparison.

This is the most significant story of the release. You can deploy a 1M-context, Codeforces-3200-tier reasoning model in production for the budget previously required for a mid-tier chat endpoint.

Community takes

First-day reactions from open-source and research communities:

  • "Apache 2.0 matters." V3 was MIT; V4 shifts to Apache 2.0, providing enterprises enhanced patent protection. For commercial deployments this constitutes the material change.
  • "Chinese SimpleQA is a wake-up call." 84.4 on Chinese-SimpleQA surpasses every proprietary model except Gemini 3.1 Pro. For Chinese-first applications this represents the first open-weight option achieving genuine parity with leading closed models.
  • "SWE-Pro is closer than the Arena board suggests." K2.6 leads by about 3 points on SWE-Pro, but V4-Pro leads on LiveCodeBench and Codeforces. Short-form code generation and long-horizon codebase resolution are different competencies, and the two models split cleanly along that line.
  • "The 1M context is real, but not Opus-level." MRCR and CorpusQA demonstrate Opus 4.6 continues to dominate long-context retrieval. V4's advantage is efficiency (10% KV cache), not superior absolute retrieval capability.

Access via ofox (coming soon)

ofox currently serves deepseek/deepseek-v3.2. V4-Pro and V4-Flash are being added as quickly as possible; expect them on the model list shortly.

For immediate V4 access, you can call DeepSeek's API directly:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Port this Rust service to Go, preserving concurrency semantics"}],
    extra_body={"thinking": {"type": "enabled"}}
)
print(response.choices[0].message.content)
```

Once ofox integrates V4 into the aggregator, migration requires a single line — same ofox key, same https://api.ofox.ai/v1 base URL, just deepseek/deepseek-v4-pro or deepseek/deepseek-v4-flash. Register at ofox.ai and one credential will support V4 upon release alongside GPT-5.5, Claude, Gemini, Kimi K2.6, and competitors.
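The swap above can be expressed as a pair of configurations; a minimal sketch, assuming the ofox endpoint and model IDs quoted in this post (availability pending the rollout):

```python
# Sketch of the migration: only base_url and the model ID change; everything
# else is unchanged OpenAI-protocol traffic. Endpoints and IDs are the ones
# quoted in this post, pending ofox's actual V4 rollout.

def client_config(provider):
    configs = {
        "deepseek": {
            "base_url": "https://api.deepseek.com",
            "model": "deepseek-v4-pro",
        },
        "ofox": {
            "base_url": "https://api.ofox.ai/v1",
            "model": "deepseek/deepseek-v4-pro",  # or deepseek/deepseek-v4-flash
        },
    }
    return configs[provider]

print(client_config("ofox")["model"])  # deepseek/deepseek-v4-pro
```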

Should you switch?

Switch to V4-Pro if running Kimi K2.6 for Chinese-heavy applications, competitive-programming-style code generation, or Codeforces-grade reasoning. The Chinese SimpleQA and Codeforces benchmarks justify the transition.

Switch to V4-Flash if operating anything in the $1-2 per million output token range. Flash-Max's reasoning trails Pro by 1-3 points on most knowledge benchmarks, while costing 12× less on output.

Stay on K2.6 if your workload involves SWE-Bench-style codebase resolution, agent tool calls under high concurrency, or situations where the Arena Code delta (K2.6 +73 Elo) aligns with your task.

Stay on closed frontier (GPT-5.5 / Opus 4.7) if tasks demand long-context retrieval over millions of tokens (Opus MRCR still dominates), GDPval-grade knowledge work (GPT-5.4 still leads), or agentic terminal workflows (GPT-5.5 Terminal-Bench 82.7% stands alone).

Originally published on ofox.ai/blog.
