DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026
TL;DR Summary
- DeepSeek-V3 is a 671B parameter Mixture-of-Experts model with only 37B activated per token — rivaling GPT-4o and Claude 3.5 Sonnet on benchmarks
- Trained on 14.8 trillion tokens using innovative FP8 mixed precision — only 2.664M H800 GPU hours for full pre-training, with zero irrecoverable loss spikes
- 104k GitHub stars, MIT license, commercial use allowed — open weights available on Hugging Face
- 8 inference backends supported: SGLang, LMDeploy, TensorRT-LLM, vLLM, LightLLM, AMD GPU, Huawei Ascend NPU, and the reference demo
- Knowledge distilled from DeepSeek-R1 reasoning model into V3, improving reasoning while maintaining output style control
Direct Answer Block
DeepSeek-V3 is a 671B-parameter Mixture-of-Experts language model that activates only 37B parameters per token using 256 experts with 8 active per forward pass. It's open-source (MIT code license, model agreement for weights), commercially usable, and deployable locally via 8 inference backends including SGLang, vLLM, and TensorRT-LLM on both NVIDIA and AMD GPUs.
Introduction
The AI model market has a dirty secret: most frontier models lock you into API subscriptions, vendor infrastructure, and per-token pricing that scales with your usage. DeepSeek-V3 breaks that model — literally and commercially. It's a 671B-parameter Mixture-of-Experts architecture that activates only 37B parameters per token, making it efficient enough to deploy on your own hardware. With 104k GitHub stars, benchmark scores competitive with GPT-4o and Claude 3.5 Sonnet, and MIT-licensed code, it represents the leading edge of what open-source AI can achieve in 2026.
How does DeepSeek-V3's Mixture-of-Experts architecture activate only 37B of 671B parameters per token?
DeepSeek-V3 uses 256 experts with 8 active per token in a Mixture-of-Experts (MoE) architecture. This means only 37B of the 671B total parameters are activated for any given token prediction — a 5.5% activation ratio.
The architecture builds on two innovations validated in DeepSeek-V2:
Multi-head Latent Attention (MLA)
MLA compresses the key-value cache into a low-dimensional latent space, dramatically reducing memory usage during inference. This is what makes the 128K context window practical — standard attention would require prohibitive KV-cache memory at this scale.
Auxiliary-loss-free load balancing
Traditional MoE models use an auxiliary loss term to encourage balanced expert utilization — but this creates a tradeoff between load balance and model quality. DeepSeek-V3 pioneers a strategy that achieves load balancing without degrading performance. The model learns to distribute tokens across experts naturally, without the quality penalty that auxiliary losses impose.
According to the DeepSeek-V3 technical report (arXiv:2412.19437): "We pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing."
Multi-Token Prediction (MTP)
DeepSeek-V3 trains with a multi-token prediction objective — predicting multiple future tokens at each position rather than just the next one. This improves model quality and can be used for speculative decoding during inference to accelerate generation. The MTP module weights add 14B parameters to the 671B main model (685B total on Hugging Face), but MTP support is still under active community development.
The training was remarkably stable: "Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks." This is unusual for models of this scale and speaks to the quality of the FP8 training framework.
How does FP8 mixed precision training work — and why did it take 2.664M GPU hours with zero loss spikes?
FP8 (8-bit floating point) training represents a significant departure from the industry-standard BF16/FP16 approach. DeepSeek-V3 is, according to the paper, the first extremely large-scale model to validate the feasibility and effectiveness of FP8 training.
The key innovations:
FP8 mixed precision framework: Not all operations use FP8. The framework selectively applies FP8 to matrix multiplications and attention computations where precision loss is minimal, while keeping sensitive operations (normalization, softmax) in higher precision. This achieves the speed of FP8 with the stability of FP16.
Full computation-communication overlap: In cross-node MoE training, the communication bottleneck between nodes often leaves GPUs idle. DeepSeek-V3 co-designed algorithms, frameworks, and hardware to nearly achieve full overlap — computation continues while communication happens, dramatically improving efficiency.
"Through co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, nearly achieving full computation-communication overlap." — DeepSeek-V3 Technical Report
The full pre-training cost of 2.664M H800 GPU hours on 14.8T tokens is remarkably economical for a model of this capability. For context, this is roughly 1/10th to 1/20th of the estimated training cost of comparable closed-source frontier models. The subsequent fine-tuning stages (SFT + RL) required only an additional 0.1M GPU hours.
How does DeepSeek-V3 compare to GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.1 405B on code, math, and reasoning benchmarks?
DeepSeek-V3 dominates open-source models and is competitive with closed-source frontier models. Here are the key comparisons from the published benchmark tables:
Code benchmarks
| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 Sonnet | LLaMA 3.1 405B |
|---|---|---|---|---|
| HumanEval-Mul (Pass@1) | 82.6 | 80.5 | 81.7 | 77.2 |
| LiveCodeBench (Pass@1) | 37.6 | 34.2 | 32.8 | 30.1 |
| Codeforces (Percentile) | 51.6 | 23.6 | 20.3 | 25.3 |
| SWE Verified (Resolved) | 42.0 | 38.8 | 50.8 | 24.5 |
| Aider-Polyglot (Acc.) | 49.6 | 16.0 | 45.3 | 5.8 |
DeepSeek-V3 is the strongest open-source coding model and leads on competitive programming benchmarks (Codeforces percentile: 51.6 vs GPT-4o's 23.6).
Math benchmarks
| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 Sonnet | LLaMA 3.1 405B |
|---|---|---|---|---|
| AIME 2024 (Pass@1) | 39.2 | 9.3 | 16.0 | 23.3 |
| MATH-500 (EM) | 90.2 | 74.6 | 78.3 | 73.8 |
| CNMO 2024 (Pass@1) | 43.2 | 10.8 | 13.1 | 6.8 |
DeepSeek-V3 is in a different tier on math — the AIME gap (39.2 vs 9.3 for GPT-4o) is a 4x improvement. This is largely attributed to the knowledge distillation from DeepSeek-R1's long Chain-of-Thought reasoning.
General benchmarks
| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMLU (EM) | 88.5 | 87.2 | 88.3 |
| MMLU-Redux (EM) | 89.1 | 88.0 | 88.9 |
| DROP (3-shot F1) | 91.6 | 83.7 | 88.3 |
| GPQA-Diamond (Pass@1) | 59.1 | 49.9 | 65.0 |
On standard academic benchmarks, DeepSeek-V3 leads or ties in most categories. Claude 3.5 Sonnet holds the edge on GPQA-Diamond (graduate-level reasoning). On open-ended generation (Arena-Hard: 85.5, AlpacaEval 2.0: 70.0), DeepSeek-V3 convincingly leads all compared models.
How do you run DeepSeek-V3 locally — and which of the 8 inference backends should you choose?
DeepSeek-V3 can be deployed locally through eight inference backends. Here's how to choose:
| Backend | Best for | GPU Support | Key Features |
|---|---|---|---|
| SGLang (recommended) | Production serving | NVIDIA, AMD | MLA optimizations, DP Attention, FP8, Torch Compile, multi-node TP |
| LMDeploy (recommended) | Offline + online deployment | NVIDIA | Pipeline processing, PyTorch integration |
| TensorRT-LLM (recommended) | Maximum performance | NVIDIA | BF16, INT4/8 quantization, FP8 coming soon |
| vLLM (recommended) | Standard serving | NVIDIA, AMD | Tensor + pipeline parallelism, FP8 + BF16 |
| LightLLM | Multi-node deployment | NVIDIA | FP8/BF16, PD-disaggregation |
| AMD GPU | AMD hardware | AMD | Via SGLang, BF16 + FP8 |
| Huawei Ascend NPU | Ascend hardware | Ascend | Via MindIE, BF16 |
| DeepSeek-Infer Demo | Learning/experimentation | NVIDIA | Reference implementation, Linux + Python 3.10 only |
Quick start with SGLang (recommended):
# See full instructions at:
# https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
Model weights conversion (FP8 to BF16):
cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
System requirements: Linux with Python 3.10 only. Mac and Windows are not supported natively (use cloud deployment or WSL on Windows). Multi-node GPU setup required for the full model — this is a 671B parameter model, not a laptop deployment. The mini model runs on smaller setups; the full model requires multiple H800/H100 GPUs.
Note: Hugging Face's Transformers library does not yet directly support DeepSeek-V3. Use one of the inference backends listed above.
How does Multi-Token Prediction (MTP) accelerate inference through speculative decoding?
Multi-Token Prediction is a training objective where the model predicts multiple future tokens at each position, rather than just the next one. During inference, this enables speculative decoding:
- The model makes a "fast" prediction of the next few tokens using the MTP heads
- A verification pass confirms these tokens against the main model
- Accepted tokens are committed; rejected tokens trigger re-generation
The MTP module adds 14B parameters (separate from the 671B main model weights). The technical report states that MTP "can also be used for speculative decoding for inference acceleration." Community support for MTP in inference backends is still under active development — SGLang tracks progress at github.com/sgl-project/sglang/issues/2591.
The practical benefit: for latency-sensitive applications (chat, code completion), MTP speculative decoding can significantly reduce the wall-clock time per response by generating multiple tokens per forward pass rather than one at a time.
How did DeepSeek distill reasoning capabilities from R1 into V3 — and what does it mean for open-source model quality?
The distillation from DeepSeek-R1 is one of the most technically interesting aspects of DeepSeek-V3. The approach:
- DeepSeek-R1 is a long Chain-of-Thought reasoning model — it thinks step-by-step, verifies its work, and reflects on errors before producing final answers
- The verification and reflection patterns from R1's reasoning traces are extracted
- These patterns are distilled into DeepSeek-V3 through the post-training pipeline, which "elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3"
"Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain a control over the output style and length of DeepSeek-V3." — DeepSeek-V3 Technical Report
The key distinction: this is not making V3 generate long Chain-of-Thought traces. It's distilling the cognitive patterns (verify assumptions, reflect on contradictions, break down multi-step problems) while maintaining V3's standard output style and length. The result is improved reasoning (visible in the AIME 2024 and MATH-500 scores) without the verbosity and latency cost of full CoT.
This distillation approach is a model for the open-source community: you can take a specialized reasoning model's capabilities and inject them into a general-purpose model through post-training, without changing the model architecture or inference characteristics.
Frequently Asked Questions
Q: Can DeepSeek-V3 run on a single consumer GPU?
No. The full 671B model requires multiple H800/H100 GPUs across nodes. Even with only 37B active per token, the total model must be loaded into memory. For single-GPU setups, consider quantized variants or smaller models from the DeepSeek family.
Q: Is DeepSeek-V3 free for commercial use?
The code is MIT licensed (free for any use). The model weights have a separate Model License that permits commercial use. Check the LICENSE-MODEL file in the repository for specific terms.
Q: How does DeepSeek-V3 compare to DeepSeek-R1?
R1 is a reasoning-specialized model that generates long Chain-of-Thought traces. V3 is a general-purpose model with R1's reasoning patterns distilled in. V3 is faster, more efficient, and better for general tasks. R1 is stronger on tasks requiring explicit step-by-step reasoning.
Q: Why is FP8 training significant?
FP8 uses 8-bit floating point (vs the standard 16-bit), halving memory requirements and doubling theoretical throughput for matrix operations. Previous attempts at FP8 training at scale resulted in instability. DeepSeek-V3's successful FP8 pre-training at 671B parameters validates the approach for future large-scale models.
Q: Does DeepSeek-V3 support function calling and tool use?
The base and chat models support standard prompting patterns. Tool use capabilities depend on the inference backend and prompting approach — SGLang and vLLM support OpenAI-compatible API serving with function calling.
Q: What's the difference between the Base and Chat models?
Base is the pre-trained model (14.8T tokens, no fine-tuning). Chat is the instruction-tuned model with SFT and RL post-training, including the R1 reasoning distillation. Use Chat for conversational and task-oriented applications; use Base for fine-tuning on domain-specific data.
Glossary
- Mixture-of-Experts (MoE): An architecture where only a subset of model parameters (experts) are activated per token, enabling larger total models with lower per-token compute
- FP8 Mixed Precision: Training using 8-bit floating point for most operations while keeping critical computations at higher precision — DeepSeek-V3 is the first extremely large model to validate this
- Multi-head Latent Attention (MLA): An attention mechanism that compresses the KV-cache into a low-dimensional latent space, enabling long context windows (128K) with manageable memory
- Multi-Token Prediction (MTP): Training objective predicting multiple future tokens per position, enabling speculative decoding for faster inference
- Auxiliary-loss-free load balancing: A strategy for MoE models that balances expert utilization without the quality penalty of traditional load-balancing loss terms
- Speculative decoding: An inference acceleration technique where a faster "draft" model predicts multiple tokens that are then verified by the main model
Author
Ramsis Hammadi — AI/ML engineer specializing in GenAI, LLM engineering, and automation. Full bio →

Top comments (0)