Owen

Posted on • Originally published at ofox.ai

DeepSeek V4 Released: Open-Source 1.6T MoE, 1M Context, Apache 2.0 — and It's Already on the API

TL;DR — DeepSeek shipped the V4 preview on the same day as OpenAI's GPT-5.5: a 1.6T-parameter Pro, a 284B Flash, 1M context on both, Apache 2.0 weights on Hugging Face, and API pricing of $1.74 / $3.48 per million tokens for Pro, significantly cheaper than Opus 4.7, GPT-5.5, or Kimi K2.6. ofox will add support as soon as possible.

What DeepSeek shipped

From the official announcement on April 24 2026:

  • Two variants: deepseek-v4-pro (1.6T total parameters, 49B activated) and deepseek-v4-flash (284B total, 13B activated). Both are MoE.
  • 1M-token context on both, max output 384K.
  • Dual modes: Thinking / Non-Thinking, with three effort levels (high, max, plus non-think). See thinking mode docs.
  • Open source, Apache 2.0 — weights on Hugging Face.
  • API live today. Same base_url, change model ID. Both OpenAI ChatCompletions and Anthropic protocols supported.
  • Deprecation: deepseek-chat and deepseek-reasoner retire July 24 2026. They currently route to deepseek-v4-flash.
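Per the announcement, switching variants is just a model-ID change on the existing endpoint, with thinking effort carried as a request field. Here is a minimal sketch of what the request bodies might look like; the `thinking` field's shape mirrors DeepSeek's V3-era convention and is an assumption, not a confirmed V4 schema, so check the thinking-mode docs before relying on it:

```python
# Sketch: request payloads for the two V4 variants on the same endpoint.
# The "thinking" field shape is assumed from DeepSeek's V3-era convention,
# not confirmed for V4.

BASE_URL = "https://api.deepseek.com"

def build_request(model, prompt, effort=None):
    """Build a ChatCompletions-style payload; effort=None means non-thinking."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if effort is not None:
        # Assumed encoding of the high/max effort levels from the announcement.
        body["thinking"] = {"type": "enabled", "effort": effort}
    return body

pro = build_request("deepseek-v4-pro", "Summarize this RFC", effort="high")
flash = build_request("deepseek-v4-flash", "Summarize this RFC")  # non-think

print(pro["model"], "thinking" in pro)      # deepseek-v4-pro True
print(flash["model"], "thinking" in flash)  # deepseek-v4-flash False
```

Since `deepseek-chat` and `deepseek-reasoner` currently route to `deepseek-v4-flash`, existing callers are already on V4-era serving; the deprecation just forces the explicit ID by July 24 2026.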

The timing is deliberate. OpenAI shipped GPT-5.5 the same day. DeepSeek needed a launch window where "open-source 1M-context MoE at a fraction of the cost" would not be buried under a closed-source price hike. Shipping simultaneously allowed both to split the news cycle.

Architecture — the part that actually matters

V4 introduces a hybrid attention mechanism: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA). Combined with Manifold-Constrained Hyper-Connections (mHC) for residual signal propagation and the Muon optimizer for training stability, the efficiency gains at 1M context are:

  • 27% of V3.2's single-token inference FLOPs
  • 10% of V3.2's KV cache

This represents the primary efficiency narrative. Long-context inference was historically the main cost barrier for open models serving 1M windows; V4 reduces KV cache requirements by roughly an order of magnitude. The model was pre-trained on 32T+ tokens using FP4 + FP8 mixed precision — MoE experts at FP4, most other parameters at FP8.
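To see why a 10× KV-cache cut matters at 1M context, here is a back-of-the-envelope sketch. The layer count, KV-head count, head dimension, and FP8-cache byte width below are illustrative assumptions, not published V4 specs; only the 10% ratio comes from the announcement:

```python
# Back-of-the-envelope KV-cache sizing at 1M context.
# ALL architecture numbers below are illustrative assumptions, not published
# V4 specs; only the "10% of V3.2's KV cache" ratio is from the announcement.

CONTEXT = 1_000_000       # tokens
LAYERS = 64               # assumed
KV_HEADS = 8              # assumed (grouped / compressed KV)
HEAD_DIM = 128            # assumed
BYTES_PER_VALUE = 1       # FP8 cache, assumed

def kv_cache_gb(context, layers, kv_heads, head_dim, bytes_per_value):
    # Factor of 2 covers both keys and values.
    return 2 * context * layers * kv_heads * head_dim * bytes_per_value / 1e9

baseline = kv_cache_gb(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
v4_like = baseline * 0.10  # the announced 10% ratio

print(f"baseline-style cache: {baseline:.1f} GB per 1M-token sequence")
print(f"at 10% (V4 claim):    {v4_like:.1f} GB per 1M-token sequence")
```

Under these assumptions a single 1M-token sequence drops from the hundred-gigabyte range to the low tens of gigabytes, which is the difference between multi-node serving and a single accelerator's memory budget.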

The Flash variant is not a trimmed Pro — it is a separately trained MoE at 284B / 13B activated. Flash-Max (max thinking effort) approaches Pro-level reasoning on most benchmarks with substantially lower serving cost.

The Arena Code numbers

Arena AI's live code leaderboard placed V4-Pro Thinking at #3 among open models, ahead of prior DeepSeek releases by a substantial margin:

| Rank | Model | Elo |
|---|---|---|
| 1 | GLM-5.1 | 1,534 |
| 2 | Kimi-K2.6 | 1,529 |
| 3 | DeepSeek-V4 Pro (Thinking) | 1,456 |
| 4 | GLM-4.7 | 1,440 |
| 12 | DeepSeek-V3.2 (Thinking) | 1,368 |

The V3.2 → V4-Pro jump is 88 Elo, the same gap that separates #3 from #12 on the current board. That is a genuine generational advance, not an incremental refresh.

Full benchmark grid — vs K2.6, GLM-5.1, Opus 4.6, GPT-5.4, Gemini 3.1 Pro

DeepSeek published comprehensive head-to-head comparisons against top open and closed models.

The honest assessment, benchmark by benchmark:

Where V4-Pro wins outright:

| Benchmark | V4-Pro Max | K2.6 Thinking | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| Chinese-SimpleQA | 84.4 | 75.9 | 76.2 | 76.8 | 85.9 |
| LiveCodeBench | 93.5 | 89.6 | 88.8 | 91.7 | — |
| Codeforces (rating) | 3206 | — | — | 3168 | 3052 |
| HMMT 2026 Feb | 95.2 | 92.7 | 96.2 | 97.7 | 94.7 |
| IMOAnswerBench | 89.8 | 86.0 | 75.3 | 91.4 | 81.0 |
| MCPAtlas Public | 73.6 | 66.6 | 73.8 | 67.2 | 69.2 |

(— = no score published for that cell.)

Codeforces 3206 is the standout. It edges past GPT-5.4 (xHigh) at 3168, in competitive-programming territory that closed frontier models have traditionally owned.

Where V4-Pro loses to K2.6:

| Benchmark | V4-Pro | K2.6 Thinking |
|---|---|---|
| SWE Pro (resolved) | 55.4 | 58.6 |
| SWE Multilingual | 76.2 | 76.7 |
| HLE w/ tools | 48.2 | 54.0 |
| GPQA Diamond | 90.1 | 90.5 |

SWE-Bench Pro is the most consequential metric for "fix a real GitHub issue" scenarios. K2.6's 58.6 versus V4-Pro's 55.4 is a gap of roughly 3 points, modest but consistent with the Arena Code leaderboard, where K2.6 holds a 73-Elo lead.

Where V4-Pro trails the closed frontier:

  • MRCR 1M (long-context retrieval): 83.5 vs Opus 4.6's 92.9. Opus remains the long-context leader.
  • CorpusQA 1M: 62.0 vs Opus 71.7. The pattern persists.
  • GDPval-AA (Elo): 1554 vs GPT-5.4's 1674 and Opus 4.6's 1619. Knowledge-work economic value still favors proprietary models.
  • HLE (no tools): 37.7 vs Gemini 3.1 Pro's 44.4.

Flash-Max holds up:

V4-Flash-Max achieves 86.2 on MMLU-Pro (Pro at 87.5), 91.6 on LiveCodeBench (Pro at 93.5), and 52.6 on SWE-Pro (Pro at 55.4). On most tasks the quality gap between Flash and Pro remains narrow — while Flash commands dramatically reduced serving cost.

Pricing — where V4 really changes the calculus

From the DeepSeek pricing documentation:

| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| deepseek-v4-flash | $0.14 / M | $0.028 / M | $0.28 / M |
| deepseek-v4-pro | $1.74 / M | $0.145 / M | $3.48 / M |
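Because cache hits are priced roughly an order of magnitude below misses, the effective input price depends heavily on your hit rate. A small calculator using only the published prices from the table above (the 70% hit rate and token counts are just an example workload, not anything DeepSeek publishes):

```python
# Blended per-request cost from the published V4 prices.
# Prices are $ per 1M tokens; the cache hit rate is an example input.

PRICES = {  # (input_miss, input_hit, output)
    "deepseek-v4-flash": (0.14, 0.028, 0.28),
    "deepseek-v4-pro": (1.74, 0.145, 3.48),
}

def cost_usd(model, input_tokens, output_tokens, hit_rate=0.0):
    miss, hit, out = PRICES[model]
    blended_input = hit_rate * hit + (1 - hit_rate) * miss
    return (input_tokens * blended_input + output_tokens * out) / 1e6

# Example: 200K-token context, 4K output, 70% of input served from cache.
print(f"pro:   ${cost_usd('deepseek-v4-pro', 200_000, 4_000, 0.7):.4f}")
print(f"flash: ${cost_usd('deepseek-v4-flash', 200_000, 4_000, 0.7):.4f}")
```

Under those example numbers a long-context Pro request lands around fourteen cents, and the same request on Flash around a cent and a half, which is the practical meaning of the cache-hit column.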

Comparison to the frontier:

| Model | Input | Output |
|---|---|---|
| DeepSeek V4-Pro | $1.74 | $3.48 |
| Kimi K2.6 (non-think) | $1.40 | $5.60 |
| GPT-5.5 | $5.00 | $30.00 |
| Claude Opus 4.7 | $15.00 | $75.00 |

V4-Pro output is $3.48 versus GPT-5.5's $30.00, an 8.6× reduction. Against Opus 4.7 the advantage is about 21×. Flash, at $0.28 output, is close to free by comparison.

This is the most significant story of the release. You can deploy a 1M-context, Codeforces-3200-tier reasoning model in production for the budget previously required for a mid-tier chat endpoint.

Community takes

First-day reactions from open-source and research communities:

  • "Apache 2.0 matters." V3 was MIT; V4 shifts to Apache 2.0, providing enterprises enhanced patent protection. For commercial deployments this constitutes the material change.
  • "Chinese SimpleQA is a wake-up call." 84.4 on Chinese-SimpleQA surpasses every proprietary model except Gemini 3.1 Pro. For Chinese-first applications this represents the first open-weight option achieving genuine parity with leading closed models.
  • "SWE-Pro is closer than the Arena board suggests." K2.6 leads by about 3 points on SWE-Pro, but V4-Pro leads on LiveCodeBench and Codeforces. Short-form code generation and long-horizon codebase resolution are different competencies, and the two models split cleanly along that line.
  • "The 1M context is real, but not Opus-level." MRCR and CorpusQA demonstrate Opus 4.6 continues to dominate long-context retrieval. V4's advantage is efficiency (10% KV cache), not superior absolute retrieval capability.

Access via ofox (coming soon)

ofox currently serves deepseek/deepseek-v3.2. V4-Pro and V4-Flash are being added as quickly as possible; expect them on the model list shortly.

For immediate V4 access, you can call DeepSeek's API directly:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Port this Rust service to Go, preserving concurrency semantics"}],
    extra_body={"thinking": {"type": "enabled"}}
)
print(response.choices[0].message.content)
```

Once ofox integrates V4 into the aggregator, migration requires a single line — same ofox key, same https://api.ofox.ai/v1 base URL, just deepseek/deepseek-v4-pro or deepseek/deepseek-v4-flash. Register at ofox.ai and one credential will support V4 upon release alongside GPT-5.5, Claude, Gemini, Kimi K2.6, and competitors.
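The swap above can be expressed as a pair of configurations; a minimal sketch, assuming the ofox endpoint and model IDs quoted in this post (availability pending the rollout):

```python
# Sketch of the migration: only base_url and the model ID change; everything
# else is unchanged OpenAI-protocol traffic. Endpoints and IDs are the ones
# quoted in this post, pending ofox's actual V4 rollout.

def client_config(provider):
    configs = {
        "deepseek": {
            "base_url": "https://api.deepseek.com",
            "model": "deepseek-v4-pro",
        },
        "ofox": {
            "base_url": "https://api.ofox.ai/v1",
            "model": "deepseek/deepseek-v4-pro",  # or deepseek/deepseek-v4-flash
        },
    }
    return configs[provider]

print(client_config("ofox")["model"])  # deepseek/deepseek-v4-pro
```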

Should you switch?

Switch to V4-Pro if running Kimi K2.6 for Chinese-heavy applications, competitive-programming-style code generation, or Codeforces-grade reasoning. The Chinese SimpleQA and Codeforces benchmarks justify the transition.

Switch to V4-Flash if operating anything in the $1-2 per million output token range. Flash-Max's reasoning trails Pro by 1-3 points on most knowledge benchmarks, while costing 12× less on output.

Stay on K2.6 if your workload involves SWE-Bench-style codebase resolution, agent tool calls under high concurrency, or situations where the Arena Code delta (K2.6 +73 Elo) aligns with your task.

Stay on closed frontier (GPT-5.5 / Opus 4.7) if tasks demand long-context retrieval over millions of tokens (Opus MRCR still dominates), GDPval-grade knowledge work (GPT-5.4 still leads), or agentic terminal workflows (GPT-5.5 Terminal-Bench 82.7% stands alone).

Originally published on ofox.ai/blog.
