After much anticipation and three delays, the "shining star of domestic AI," DeepSeek, has released its latest iteration: DeepSeek V4.
While the rest of the industry was busy launching new models and boasting about benchmarks, DeepSeek kept to its own rhythm. Then, last week, DeepSeek V4 quietly arrived.
The DeepSeek V4 series includes DeepSeek-V4-Pro (1.6T total parameters, 49B active) and DeepSeek-V4-Flash (284B total parameters, 13B active). Both models natively support an ultra-long context window of one million tokens. Through deep architectural improvements, they have achieved a significant breakthrough in long-text reasoning efficiency.
Hybrid Attention Architecture: Solving Long-Context Bottlenecks
When processing ultra-long contexts, traditional attention mechanisms face computational complexity that grows quadratically with sequence length. DeepSeek V4 introduces a Hybrid Attention Architecture that tackles this with two complementary compression strategies.
This hybrid architecture consists of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses the Key-Value Cache (KV Cache) for every 4 tokens into a single entry and uses a sparse attention strategy, allowing each query token to focus on only a few compressed KV entries. HCA takes a more aggressive approach, compressing every 128 tokens into one entry while maintaining dense attention.
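To make the two strategies concrete, here is a minimal sketch of the compression step. The article does not specify the compression operator, so the mean pooling used below is purely illustrative; only the block sizes (4 for CSA, 128 for HCA) come from the text:

# Minimal sketch: compressing a KV cache by pooling fixed-size token blocks.
# Assumption: mean pooling stands in for the unspecified compression operator.
import numpy as np

def compress_kv(kv: np.ndarray, block: int) -> np.ndarray:
    """Pool every `block` consecutive tokens of a (seq_len, dim) KV tensor
    into a single entry, as in CSA (block=4) or HCA (block=128)."""
    seq_len, dim = kv.shape
    pad = (-seq_len) % block                 # pad so seq_len divides evenly
    kv = np.pad(kv, ((0, pad), (0, 0)))
    return kv.reshape(-1, block, dim).mean(axis=1)

kv_cache = np.random.randn(4096, 128).astype(np.float32)  # toy sequence
csa_cache = compress_kv(kv_cache, 4)     # (1024, 128): sparse attention picks a few of these
hca_cache = compress_kv(kv_cache, 128)   # (32, 128): dense attention over all of these

Under these block sizes, CSA keeps a quarter of the original KV entries and HCA fewer than 1%, which is where the VRAM savings described below come from.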
This hybrid design performs exceptionally well in million-token scenarios. Compared to the previous DeepSeek-V3.2, DeepSeek-V4-Pro's per-token inference computation has dropped to 27%, and its KV cache VRAM usage has been slashed to just 10%. For developers with limited hardware resources, this efficiency boost significantly lowers the barrier to entry for ultra-long-text applications.
Architectural Optimization: mHC Links and Muon Optimizer
Beyond the attention mechanism, DeepSeek V4 also upgrades its underlying training stability and convergence speed.
The model introduces manifold-constrained Hyper-Connection (mHC) technology, an upgrade over traditional residual connections. By constraining residual mappings to specific manifolds, mHC enhances signal propagation stability across multi-layer networks, ensuring the model's expressive power even as parameter scales expand.
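Hyper-connections generalize the single residual stream into several parallel streams mixed by a learnable matrix; the manifold constraint keeps that mixing well behaved. The toy sketch below illustrates only this general idea, using an orthogonal (norm-preserving) projection as a stand-in constraint, since the article does not detail the actual manifold used:

# Toy contrast between a plain residual connection and a hyper-connection
# that mixes n parallel residual streams through a learnable matrix H.
# The orthogonal projection via QR is an illustrative stand-in for the
# (unspecified) manifold constraint; do not read it as DeepSeek's method.
import numpy as np

def residual(x, f):
    return x + f(x)                          # classic single-stream residual

def hyper_connection(xs, f, H):
    """xs: (n, dim) parallel residual streams; H: (n, n) mixing matrix."""
    Q, _ = np.linalg.qr(H)                   # constrain mixing to be norm-preserving
    mixed = Q @ xs                           # stable mixing across the n streams
    return mixed + f(mixed.mean(axis=0))     # add the block output back to every stream

xs = np.random.randn(4, 64)                  # 4 residual streams, width 64
out = hyper_connection(xs, lambda v: 0.1 * v, np.random.randn(4, 4))
print(out.shape)                             # (4, 64)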
For the optimization algorithm, DeepSeek V4 adopts the Muon optimizer in place of the commonly used AdamW for most modules. Muon orthogonalizes its momentum updates via Newton-Schulz iteration, providing faster convergence and stronger training stability. To prevent numerical explosion in attention scores, the team applied RMSNorm directly to the query and key inputs, discarding the traditional QK-Clip technique.
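For readers unfamiliar with Muon, the heart of the method is the orthogonalization step. The sketch below follows the quintic Newton-Schulz iteration and coefficients from the public open-source Muon implementation; DeepSeek's internal variant may differ:

# Sketch of the Newton-Schulz iteration Muon uses to orthogonalize a momentum
# matrix G, approximating U @ V.T from the SVD G = U @ S @ V.T without ever
# computing the SVD. Coefficients follow the public Muon implementation.
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + 1e-7)       # scale so all singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # pushes singular values toward 1
    return X.T if transposed else X

G = np.random.randn(64, 32)                    # a toy gradient-momentum matrix
O = newton_schulz_orthogonalize(G)
print(np.linalg.svd(O, compute_uv=False)[:4])  # singular values now close to 1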
Infrastructure Support: TileLang and FP4 Training
Efficient models require strong infrastructure. DeepSeek V4 uses TileLang, a domain-specific language (DSL) for kernel development. By replacing hundreds of fragmented operators with fused kernels, it ensures operational efficiency while improving development flexibility.
To address VRAM concerns, DeepSeek V4 introduced FP4 quantization-aware training in its later stages. Both MoE (Mixture of Experts) weights and the QK path of the CSA indexer are implemented with FP4 quantization. Notably, the dequantization process from FP4 to FP8 is lossless, allowing the model to reuse existing FP8 training frameworks while achieving nearly a 2x speedup during deployment.
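The "lossless" claim is easy to sanity-check from the number formats themselves. Assuming FP4 means the common E2M1 layout and FP8 means E4M3 (the article does not name the exact formats), every FP4 value is exactly representable in FP8, so dequantization is pure widening with no rounding:

# Sketch: why FP4 -> FP8 dequantization can be lossless. Format assumptions
# (E2M1 for FP4, E4M3 for FP8) are mine; the article does not specify them.
import math

def e2m1_values():
    """All non-negative values of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit)."""
    vals = {0.0}
    for exp in range(4):                  # 2 exponent bits
        for man in range(2):              # 1 mantissa bit
            if exp == 0:                  # subnormal: 0.m * 2^0
                vals.add(man * 0.5)
            else:                         # normal: 1.m * 2^(exp - 1)
                vals.add((1 + man * 0.5) * 2.0 ** (exp - 1))
    return sorted(vals)

def exact_in_e4m3(x):
    """True if x fits FP8 E4M3 exactly: <= 4 significand bits, exponent in range."""
    if x == 0.0:
        return True
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= m < 1
    return (m * 16).is_integer() and -9 <= e <= 9

print(e2m1_values())                                 # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(all(exact_in_e4m3(v) for v in e2m1_values()))  # True: widening loses nothing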
Training Data and Performance
DeepSeek V4 was pre-trained on over 32T tokens. For post-training, the team used a two-stage paradigm: first, independently cultivating expert models in fields like math, code, and creative writing, then integrating these specialized abilities into a unified model via Online Policy Distillation (OPD).
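DeepSeek has not published the exact OPD objective, but the general shape of on-policy distillation is well known: the student samples trajectories, a teacher (here, one of the expert models) scores them token by token, and the student minimizes a per-token divergence toward the teacher. A minimal sketch, where the reverse-KL choice and the tensor shapes are my assumptions:

# Minimal sketch of an on-policy distillation loss: per-token reverse KL from
# the student's distribution to an expert teacher's, computed on tokens the
# student itself sampled. The exact OPD objective is unpublished; this only
# illustrates the generic technique.
import torch
import torch.nn.functional as F

def opd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Both tensors: (batch, seq_len, vocab), scored on the same student-sampled tokens."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(student || teacher), averaged over batch and sequence positions
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()

student = torch.randn(2, 8, 100, requires_grad=True)   # toy logits
teacher = torch.randn(2, 8, 100)
loss = opd_loss(student, teacher)
loss.backward()                       # gradients flow only through the student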
In benchmarks, DeepSeek-V4-Pro is highly competitive. In the knowledge-focused SimpleQA test, it outperformed many leading open-source models, and in the MRCR 1M long-context retrieval task it maintained stable, high recall even at the million-token level.
DeepSeek V4 shines just as brightly on programming and agent tasks. In rankings like LiveCodeBench and SWE-bench Verified, the Pro version is now capable of going head-to-head with top-tier closed-source models.
Flexible Inference Modes
DeepSeek V4 offers three inference modes to suit different scenarios:
- Non-think Mode: Provides fast, intuitive responses—perfect for daily conversations or low-risk decision-making.
- Think High Mode: Enables step-by-step logical analysis. It is slightly slower but more accurate, making it suitable for complex problems.
- Think Max Mode: By injecting specific system prompts and extending the thinking token length, this mode pushes the model's reasoning limits to handle boundary cases.
While DeepSeek-V4-Pro focuses on the performance ceiling—being highly competitive in programming, math, and STEM—DeepSeek-V4-Flash focuses on speed and cost. Despite having fewer active parameters, the Flash version's reasoning capability approaches the Pro version in most scenarios, especially for daily tasks and basic agent applications.
Detailed Pricing
Is DeepSeek V4 the most cost-effective large model on the market right now? I would argue yes; judge for yourself:
DeepSeek-V4-Pro
- Input (Cache Hit): 1 RMB / million tokens
- Input (Cache Miss): 12 RMB / million tokens
- Output: 24 RMB / million tokens
DeepSeek-V4-Flash
- Input (Cache Hit): 0.2 RMB / million tokens
- Input (Cache Miss): 1 RMB / million tokens
- Output: 2 RMB / million tokens
According to official data, this pricing is 1/20th to 1/40th that of its competitors. The extremely low cache-hit price means massive savings for developers who repeatedly query against the same long context, such as a large document background.
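To see what that means in practice, here is a quick back-of-the-envelope calculation using the listed prices (the 80% cache-hit ratio is an illustrative assumption):

# Back-of-the-envelope API cost in RMB, using the listed per-million-token
# prices. The 80% cache-hit ratio below is an illustrative assumption.
PRICES = {  # model: (cache hit, cache miss, output), RMB per million tokens
    "deepseek-v4-pro":   (1.0, 12.0, 24.0),
    "deepseek-v4-flash": (0.2, 1.0, 2.0),
}

def cost_rmb(model, input_tokens, output_tokens, cache_hit_ratio=0.8):
    hit, miss, out = PRICES[model]
    return (input_tokens * cache_hit_ratio * hit
            + input_tokens * (1 - cache_hit_ratio) * miss
            + output_tokens * out) / 1e6

# e.g. re-querying a 1M-token document background with a short answer:
print(cost_rmb("deepseek-v4-pro", 1_000_000, 2_000))    # ~3.25 RMB per call
print(cost_rmb("deepseek-v4-flash", 1_000_000, 2_000))  # ~0.36 RMB per call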
Usage and API Guide
Users can currently experience DeepSeek V4 through multiple channels.
Web and Mobile
Visit the official chat platform at chat.deepseek.com or use the official DeepSeek App. The platform has integrated Expert Mode and Instant Mode, supporting full-text reading of up to a million words. It is now possible to perform precise analysis on dozens of deep reports or entire project background documents.
API Integration
For us developers, the API is where the action is. The DeepSeek API is compatible with OpenAI and Anthropic formats. With a simple configuration change, you can quickly migrate existing apps to DeepSeek V4.
Inference Mode Example (Python)
DeepSeek V4 supports controlling thinking depth via parameters. Before you start, make sure your Python environment is ready. If not, you can use ServBay for a one-click Python environment installation.
Here is a code example to access deepseek-v4-pro with Deep Thinking mode enabled:
import os
from openai import OpenAI

# Install the OpenAI SDK first: pip3 install openai
client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a professional technical document analyst."},
        {"role": "user", "content": "Please analyze the core architectural design of this project."},
    ],
    stream=False,
    # Configuration for Deep Thinking mode
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)

print(response.choices[0].message.content)
Integration Tips
- Full-Text Reading: Leverage the 1M context window to input entire books, multiple industry reports, or complete codebases directly as context.
- Parameter Tuning: For API developers, it is suggested to set temperature to 1.0 and top_p to 1.0. If using Think Max mode for extremely complex logic, it is recommended to reserve at least 384K tokens of the context window for best results; a minimal example follows this list.
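Reusing the client from the earlier example, the suggested sampling settings look like this (only temperature and top_p come from the tips above; the rest mirrors the official example):

# Suggested sampling settings applied to the earlier client object.
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Summarize this report."}],
    temperature=1.0,          # suggested default
    top_p=1.0,                # suggested default
    reasoning_effort="high",  # Deep Thinking, as in the official example
)
print(response.choices[0].message.content)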
Summary
The release of DeepSeek V4 has raised the bar for the cost-performance ratio of domestic large models. Whether it’s the Pro version for ultimate performance or the Flash version for speed and economy, the innovation in the underlying architecture has effectively solved the long-text reasoning bottleneck.
For users dealing with deep analysis, long document parsing, or complex code logic, DeepSeek V4 is undoubtedly the most cost-effective choice currently on the market.