Ramsis Hammadi

Posted on May 20 • Edited on May 21

DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026

#ai #deepseek #llm #webdev

DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026

TL;DR Summary

DeepSeek-V3 is a 671B parameter Mixture-of-Experts model with only 37B activated per token — rivaling GPT-4o and Claude 3.5 Sonnet on benchmarks
Trained on 14.8 trillion tokens using innovative FP8 mixed precision — only 2.664M H800 GPU hours for full pre-training, with zero irrecoverable loss spikes
104k GitHub stars, MIT license, commercial use allowed — open weights available on Hugging Face
8 inference backends supported: SGLang, LMDeploy, TensorRT-LLM, vLLM, LightLLM, AMD GPU, Huawei Ascend NPU, and the reference demo
Knowledge distilled from DeepSeek-R1 reasoning model into V3, improving reasoning while maintaining output style control

Direct Answer Block

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts language model that activates only 37B parameters per token using 256 experts with 8 active per forward pass. It's open-source (MIT code license, model agreement for weights), commercially usable, and deployable locally via 8 inference backends including SGLang, vLLM, and TensorRT-LLM on both NVIDIA and AMD GPUs.

Introduction

The AI model market has a dirty secret: most frontier models lock you into API subscriptions, vendor infrastructure, and per-token pricing that scales with your usage. DeepSeek-V3 breaks that model — literally and commercially. It's a 671B-parameter Mixture-of-Experts architecture that activates only 37B parameters per token, making it efficient enough to deploy on your own hardware. With 104k GitHub stars, benchmark scores competitive with GPT-4o and Claude 3.5 Sonnet, and MIT-licensed code, it represents the leading edge of what open-source AI can achieve in 2026.

How does DeepSeek-V3's Mixture-of-Experts architecture activate only 37B of 671B parameters per token?

DeepSeek-V3 uses 256 experts with 8 active per token in a Mixture-of-Experts (MoE) architecture. This means only 37B of the 671B total parameters are activated for any given token prediction — a 5.5% activation ratio.

The architecture builds on two innovations validated in DeepSeek-V2:

Multi-head Latent Attention (MLA)

MLA compresses the key-value cache into a low-dimensional latent space, dramatically reducing memory usage during inference. This is what makes the 128K context window practical — standard attention would require prohibitive KV-cache memory at this scale.

Auxiliary-loss-free load balancing

Traditional MoE models use an auxiliary loss term to encourage balanced expert utilization — but this creates a tradeoff between load balance and model quality. DeepSeek-V3 pioneers a strategy that achieves load balancing without degrading performance. The model learns to distribute tokens across experts naturally, without the quality penalty that auxiliary losses impose.

According to the DeepSeek-V3 technical report (arXiv:2412.19437): "We pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing."

Multi-Token Prediction (MTP)

DeepSeek-V3 trains with a multi-token prediction objective — predicting multiple future tokens at each position rather than just the next one. This improves model quality and can be used for speculative decoding during inference to accelerate generation. The MTP module weights add 14B parameters to the 671B main model (685B total on Hugging Face), but MTP support is still under active community development.

The training was remarkably stable: "Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks." This is unusual for models of this scale and speaks to the quality of the FP8 training framework.

How does FP8 mixed precision training work — and why did it take 2.664M GPU hours with zero loss spikes?

FP8 (8-bit floating point) training represents a significant departure from the industry-standard BF16/FP16 approach. DeepSeek-V3 is, according to the paper, the first extremely large-scale model to validate the feasibility and effectiveness of FP8 training.

The key innovations:

FP8 mixed precision framework: Not all operations use FP8. The framework selectively applies FP8 to matrix multiplications and attention computations where precision loss is minimal, while keeping sensitive operations (normalization, softmax) in higher precision. This achieves the speed of FP8 with the stability of FP16.

Full computation-communication overlap: In cross-node MoE training, the communication bottleneck between nodes often leaves GPUs idle. DeepSeek-V3 co-designed algorithms, frameworks, and hardware to nearly achieve full overlap — computation continues while communication happens, dramatically improving efficiency.

"Through co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, nearly achieving full computation-communication overlap." — DeepSeek-V3 Technical Report

The full pre-training cost of 2.664M H800 GPU hours on 14.8T tokens is remarkably economical for a model of this capability. For context, this is roughly 1/10th to 1/20th of the estimated training cost of comparable closed-source frontier models. The subsequent fine-tuning stages (SFT + RL) required only an additional 0.1M GPU hours.

How does DeepSeek-V3 compare to GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.1 405B on code, math, and reasoning benchmarks?

DeepSeek-V3 dominates open-source models and is competitive with closed-source frontier models. Here are the key comparisons from the published benchmark tables:

Code benchmarks

Benchmark	DeepSeek-V3	GPT-4o	Claude 3.5 Sonnet	LLaMA 3.1 405B
HumanEval-Mul (Pass@1)	82.6	80.5	81.7	77.2
LiveCodeBench (Pass@1)	37.6	34.2	32.8	30.1
Codeforces (Percentile)	51.6	23.6	20.3	25.3
SWE Verified (Resolved)	42.0	38.8	50.8	24.5
Aider-Polyglot (Acc.)	49.6	16.0	45.3	5.8

DeepSeek-V3 is the strongest open-source coding model and leads on competitive programming benchmarks (Codeforces percentile: 51.6 vs GPT-4o's 23.6).

Math benchmarks

Benchmark	DeepSeek-V3	GPT-4o	Claude 3.5 Sonnet	LLaMA 3.1 405B
AIME 2024 (Pass@1)	39.2	9.3	16.0	23.3
MATH-500 (EM)	90.2	74.6	78.3	73.8
CNMO 2024 (Pass@1)	43.2	10.8	13.1	6.8

DeepSeek-V3 is in a different tier on math — the AIME gap (39.2 vs 9.3 for GPT-4o) is a 4x improvement. This is largely attributed to the knowledge distillation from DeepSeek-R1's long Chain-of-Thought reasoning.

General benchmarks

Benchmark	DeepSeek-V3	GPT-4o	Claude 3.5 Sonnet
MMLU (EM)	88.5	87.2	88.3
MMLU-Redux (EM)	89.1	88.0	88.9
DROP (3-shot F1)	91.6	83.7	88.3
GPQA-Diamond (Pass@1)	59.1	49.9	65.0

On standard academic benchmarks, DeepSeek-V3 leads or ties in most categories. Claude 3.5 Sonnet holds the edge on GPQA-Diamond (graduate-level reasoning). On open-ended generation (Arena-Hard: 85.5, AlpacaEval 2.0: 70.0), DeepSeek-V3 convincingly leads all compared models.

How do you run DeepSeek-V3 locally — and which of the 8 inference backends should you choose?

DeepSeek-V3 can be deployed locally through eight inference backends. Here's how to choose:

Backend	Best for	GPU Support	Key Features
SGLang (recommended)	Production serving	NVIDIA, AMD	MLA optimizations, DP Attention, FP8, Torch Compile, multi-node TP
LMDeploy (recommended)	Offline + online deployment	NVIDIA	Pipeline processing, PyTorch integration
TensorRT-LLM (recommended)	Maximum performance	NVIDIA	BF16, INT4/8 quantization, FP8 coming soon
vLLM (recommended)	Standard serving	NVIDIA, AMD	Tensor + pipeline parallelism, FP8 + BF16
LightLLM	Multi-node deployment	NVIDIA	FP8/BF16, PD-disaggregation
AMD GPU	AMD hardware	AMD	Via SGLang, BF16 + FP8
Huawei Ascend NPU	Ascend hardware	Ascend	Via MindIE, BF16
DeepSeek-Infer Demo	Learning/experimentation	NVIDIA	Reference implementation, Linux + Python 3.10 only

Quick start with SGLang (recommended):

# See full instructions at:
# https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

Model weights conversion (FP8 to BF16):

cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights

System requirements: Linux with Python 3.10 only. Mac and Windows are not supported natively (use cloud deployment or WSL on Windows). Multi-node GPU setup required for the full model — this is a 671B parameter model, not a laptop deployment. The mini model runs on smaller setups; the full model requires multiple H800/H100 GPUs.

Note: Hugging Face's Transformers library does not yet directly support DeepSeek-V3. Use one of the inference backends listed above.

How does Multi-Token Prediction (MTP) accelerate inference through speculative decoding?

Multi-Token Prediction is a training objective where the model predicts multiple future tokens at each position, rather than just the next one. During inference, this enables speculative decoding:

The model makes a "fast" prediction of the next few tokens using the MTP heads
A verification pass confirms these tokens against the main model
Accepted tokens are committed; rejected tokens trigger re-generation

The MTP module adds 14B parameters (separate from the 671B main model weights). The technical report states that MTP "can also be used for speculative decoding for inference acceleration." Community support for MTP in inference backends is still under active development — SGLang tracks progress at github.com/sgl-project/sglang/issues/2591.

The practical benefit: for latency-sensitive applications (chat, code completion), MTP speculative decoding can significantly reduce the wall-clock time per response by generating multiple tokens per forward pass rather than one at a time.

How did DeepSeek distill reasoning capabilities from R1 into V3 — and what does it mean for open-source model quality?

The distillation from DeepSeek-R1 is one of the most technically interesting aspects of DeepSeek-V3. The approach:

DeepSeek-R1 is a long Chain-of-Thought reasoning model — it thinks step-by-step, verifies its work, and reflects on errors before producing final answers
The verification and reflection patterns from R1's reasoning traces are extracted
These patterns are distilled into DeepSeek-V3 through the post-training pipeline, which "elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3"

"Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain a control over the output style and length of DeepSeek-V3." — DeepSeek-V3 Technical Report

The key distinction: this is not making V3 generate long Chain-of-Thought traces. It's distilling the cognitive patterns (verify assumptions, reflect on contradictions, break down multi-step problems) while maintaining V3's standard output style and length. The result is improved reasoning (visible in the AIME 2024 and MATH-500 scores) without the verbosity and latency cost of full CoT.

This distillation approach is a model for the open-source community: you can take a specialized reasoning model's capabilities and inject them into a general-purpose model through post-training, without changing the model architecture or inference characteristics.

Frequently Asked Questions

Q: Can DeepSeek-V3 run on a single consumer GPU?

No. The full 671B model requires multiple H800/H100 GPUs across nodes. Even with only 37B active per token, the total model must be loaded into memory. For single-GPU setups, consider quantized variants or smaller models from the DeepSeek family.

Q: Is DeepSeek-V3 free for commercial use?

The code is MIT licensed (free for any use). The model weights have a separate Model License that permits commercial use. Check the LICENSE-MODEL file in the repository for specific terms.

Q: How does DeepSeek-V3 compare to DeepSeek-R1?

R1 is a reasoning-specialized model that generates long Chain-of-Thought traces. V3 is a general-purpose model with R1's reasoning patterns distilled in. V3 is faster, more efficient, and better for general tasks. R1 is stronger on tasks requiring explicit step-by-step reasoning.

Q: Why is FP8 training significant?

FP8 uses 8-bit floating point (vs the standard 16-bit), halving memory requirements and doubling theoretical throughput for matrix operations. Previous attempts at FP8 training at scale resulted in instability. DeepSeek-V3's successful FP8 pre-training at 671B parameters validates the approach for future large-scale models.

Q: Does DeepSeek-V3 support function calling and tool use?

The base and chat models support standard prompting patterns. Tool use capabilities depend on the inference backend and prompting approach — SGLang and vLLM support OpenAI-compatible API serving with function calling.

Q: What's the difference between the Base and Chat models?

Base is the pre-trained model (14.8T tokens, no fine-tuning). Chat is the instruction-tuned model with SFT and RL post-training, including the R1 reasoning distillation. Use Chat for conversational and task-oriented applications; use Base for fine-tuning on domain-specific data.

Glossary

Mixture-of-Experts (MoE): An architecture where only a subset of model parameters (experts) are activated per token, enabling larger total models with lower per-token compute
FP8 Mixed Precision: Training using 8-bit floating point for most operations while keeping critical computations at higher precision — DeepSeek-V3 is the first extremely large model to validate this
Multi-head Latent Attention (MLA): An attention mechanism that compresses the KV-cache into a low-dimensional latent space, enabling long context windows (128K) with manageable memory
Multi-Token Prediction (MTP): Training objective predicting multiple future tokens per position, enabling speculative decoding for faster inference
Auxiliary-loss-free load balancing: A strategy for MoE models that balances expert utilization without the quality penalty of traditional load-balancing loss terms
Speculative decoding: An inference acceleration technique where a faster "draft" model predicts multiple tokens that are then verified by the main model

Author

Ramsis Hammadi — AI/ML engineer specializing in GenAI, LLM engineering, and automation. Full bio →

DEV Community

DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026

DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026

TL;DR Summary

Direct Answer Block

Introduction

How does DeepSeek-V3's Mixture-of-Experts architecture activate only 37B of 671B parameters per token?

Multi-head Latent Attention (MLA)

Auxiliary-loss-free load balancing

Multi-Token Prediction (MTP)

How does FP8 mixed precision training work — and why did it take 2.664M GPU hours with zero loss spikes?

How does DeepSeek-V3 compare to GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.1 405B on code, math, and reasoning benchmarks?

Code benchmarks

Math benchmarks

General benchmarks

How do you run DeepSeek-V3 locally — and which of the 8 inference backends should you choose?

Quick start with SGLang (recommended):

Model weights conversion (FP8 to BF16):

How does Multi-Token Prediction (MTP) accelerate inference through speculative decoding?

How did DeepSeek distill reasoning capabilities from R1 into V3 — and what does it mean for open-source model quality?

Frequently Asked Questions

Q: Can DeepSeek-V3 run on a single consumer GPU?

Q: Is DeepSeek-V3 free for commercial use?

Q: How does DeepSeek-V3 compare to DeepSeek-R1?

Q: Why is FP8 training significant?

Q: Does DeepSeek-V3 support function calling and tool use?

Q: What's the difference between the Base and Chat models?

Glossary

Author

Top comments (0)