Hassann

Originally published at apidog.com

How to Run DeepSeek V4 Locally?

DeepSeek V4 dropped on April 23, 2026, with MIT-licensed weights on Hugging Face. That single license choice opens frontier AI to any team that wants to run models on its own hardware. V4-Flash (284B total, 13B active) fits on two H100s at FP8. V4-Pro (1.6T total, 49B active) requires a cluster but matches GPT-5.5 and Claude Opus 4.6 on code and reasoning workloads.


This guide walks through local deployment: hardware requirements, quantization, vLLM and SGLang setup, tool-use configuration, and a validation workflow in Apidog to confirm your local server before sending production traffic.

For product overview, see what is DeepSeek V4. For hosted API usage, see how to use the DeepSeek V4 API. For cost details, see DeepSeek V4 API pricing.

TL;DR

  • V4-Flash: Runs on 2 × H100 80GB at FP8, or 1 × H100 at INT4. Weights ≈ 500GB (FP8).
  • V4-Pro: Needs 16+ H100s at FP8 for production throughput.
  • vLLM: Fastest path to OpenAI-compatible server. vllm>=0.9.0 adds V4 support.
  • SGLang: Alternative for better tool-use and structured-output features.
  • Quantization: AWQ INT4 or GPTQ INT4 fits V4-Flash on a single 80GB card (~5% quality loss).
  • Use Apidog to test http://localhost:8000/v1 and reuse your hosted API collections.

Who should self-host

Self-hosting V4 is right for:

  1. Compliance-bound teams: Health, finance, legal, or defense use-cases where data cannot leave the network. MIT-licensed open weights mean no usage agreement and no cross-border data flows.
  2. Large stable workloads: Above roughly 100B tokens/month (the break-even point computed below), dedicated hardware beats API costs. Example: V4-Pro API = $1.74/M input + $3.48/M output.
  3. Fine-tuning and research: Base checkpoints are for further pre-training/domain adaptation. MIT license allows redistribution of tuned models.

Not for: Prototypers, teams without GPU ops experience, or workloads < $200/month on the hosted API—operational overhead will outweigh cost savings at small scale.

Hardware requirements

DeepSeek V4 uses FP4 + FP8 mixed precision, so VRAM needs are lower than raw parameter counts suggest.

Variant  | Total params | Active params | FP8 VRAM | INT4 VRAM | Minimum cards
V4-Flash | 284B         | 13B           | ~500GB   | ~140GB    | 2 × H100 80GB (FP8) or 1 × H100 (INT4)
V4-Pro   | 1.6T         | 49B           | ~2.4TB   | ~700GB    | 16 × H100 80GB (FP8) or 8 × H100 (INT4)

Notes:

  • MoE memory is total, not active: All experts must fit in VRAM, not just the active subset (a quick arithmetic sketch follows these notes).
  • H200 and MI300X: 141GB/192GB cards need fewer GPUs.
  • Consumer GPUs: Not supported. Even at INT4, V4-Flash needs ~140GB, far beyond any single consumer card (an RTX 5090 tops out at 32GB).
  • Apple Silicon: M3/M4 Max with 128GB unified memory can run V4-Flash at high quantization, but only for dev, not deployment.
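
A rough back-of-envelope check makes the table concrete. The sketch below counts weight memory only (params × bytes per param); deployed figures differ because of KV cache, activations, quantization details, and engine overhead.

# Back-of-envelope VRAM estimate for MoE weights (weights only; real
# deployments add KV cache, activations, and engine overhead on top).
BYTES_PER_PARAM = {"fp8": 1.0, "int4": 0.5, "bf16": 2.0}

def weight_memory_gb(total_params: float, dtype: str) -> float:
    # MoE: ALL experts live in VRAM, so use total params, not active params.
    return total_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("V4-Flash", 284e9), ("V4-Pro", 1.6e12)]:
    for dtype in ("fp8", "int4"):
        print(f"{name} @ {dtype}: ~{weight_memory_gb(params, dtype):,.0f} GB weights")
# V4-Flash @ fp8: ~284 GB of weights; the table's ~500GB adds runtime overhead.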

Step 1: Download the weights

Official Hugging Face repos:

  • deepseek-ai/DeepSeek-V4-Flash
  • deepseek-ai/DeepSeek-V4-Pro

Download example:

pip install -U "huggingface_hub[cli]"
huggingface-cli login

huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./models/deepseek-v4-flash \
  --local-dir-use-symlinks False
  • Reserve ~500GB disk for V4-Flash, several TBs for V4-Pro.
  • ModelScope is faster for users in China.
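
Prefer scripting the download? huggingface_hub's Python API does the same transfer as the CLI example above. A minimal sketch:

# Python equivalent of the CLI download above, using huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",
    local_dir="./models/deepseek-v4-flash",
    max_workers=8,  # parallel shard downloads; tune for your connection
)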

Step 2: Pick a serving engine

Two main options:

  • vLLM: High throughput, OpenAI-compatible, largest community—recommended for most teams.
  • SGLang: Better for tool-use, structured output, and long context. Use if you need advanced function calling.

Both support V4 as of their April 2026 releases.

Step 3: Serve V4-Flash with vLLM

pip install "vllm>=0.9.0"

vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --max-model-len 1048576 \
  --dtype auto \
  --enable-prefix-caching \
  --port 8000

Flags:

  • --tensor-parallel-size 2: Splits model across 2 H100s. Raise for more GPUs.
  • --max-model-len 1048576: Full 1M-token context window. Reduce to save VRAM.
  • --enable-prefix-caching: Enables fast repeated prefixes (mirrors hosted API cache).
  • --dtype auto: Uses FP8 mixed precision.

Server runs OpenAI-compatible endpoints at http://localhost:8000/v1.
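
Before pointing anything real at it, run a quick smoke test to confirm the endpoint speaks the OpenAI protocol. A minimal sketch with the openai Python client (the model name must match what vLLM registered, here the Hugging Face repo ID):

# Smoke-test the local vLLM server via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Say 'local server up' and nothing else."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)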

Step 4: Serve V4-Pro with vLLM

Requires a cluster:

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --max-model-len 524288 \
  --enable-prefix-caching \
  --port 8000
  • --max-model-len 524288 (512K) fits on a 16-H100 box; increase if VRAM allows.
  • Use both pipeline and tensor parallelism for multi-node setups.

Step 5: Serve with SGLang (the tool-use alternative)

pip install "sglang[all]>=0.4.0"

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4-Flash \
  --tp 2 \
  --context-length 1048576 \
  --port 30000
  • OpenAI-compatible endpoint at http://localhost:30000/v1
  • SGLang's frontend DSL gives finer control over function calling and structured output; a tool-calling sketch follows below.
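
Through the OpenAI-compatible endpoint, tool use follows the standard tools schema. A sketch (get_weather and its schema are hypothetical, defined only for this example; exact tool-call behavior depends on the model's chat template):

# Tool-calling sketch against the local SGLang server on port 30000.
# get_weather is a made-up tool used only to illustrate the request shape.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # expect one get_weather call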

Step 6: Quantize for a single-GPU box

INT4 quantization allows V4-Flash on a single 80GB GPU with minimal quality drop.

AWQ (recommended)

pip install autoawq

python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = './models/deepseek-v4-flash'
out_path = './models/deepseek-v4-flash-awq'

# Load the full-precision checkpoint, quantize weights to 4-bit with
# group size 128 (AWQ's usual setting), then save the result.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config={'w_bit': 4, 'q_group_size': 128})
model.save_quantized(out_path)
tokenizer.save_pretrained(out_path)
"

GPTQ

pip install auto-gptq
# Follow the GPTQ quantization recipe; similar pattern to AWQ.
  • Serve quantized checkpoints with vLLM using --quantization awq or --quantization gptq, as in the sketch below.
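
The same quantization flag exists in vLLM's offline Python API, which makes a quick single-GPU sanity check easy before standing up the full server. A minimal sketch, assuming the AWQ output path from the step above:

# Offline sanity check of the AWQ checkpoint with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./models/deepseek-v4-flash-awq",
    quantization="awq",       # matches the --quantization awq server flag
    tensor_parallel_size=1,   # single 80GB card
    max_model_len=32768,      # short context for a quick smoke test
)
out = llm.generate(["Write a haiku about GPUs."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)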

Step 7: Test with Apidog

Always validate your local server before sending production traffic.

Apidog Validation

  1. Download Apidog.
  2. Create a collection targeting http://localhost:8000/v1/chat/completions.
  3. Paste in your standard test prompt (same as hosted API).
  4. Run a 500K-token context test to confirm KV cache stability.
  5. Run a tool-calling flow end-to-end before connecting agent loops.

Your hosted DeepSeek V4 API collections work locally—just change the base URL.

Observability and monitoring

Track these from day one:

  1. Tokens per second: Both prompt and generation. vLLM exposes /metrics in Prometheus format.
  2. GPU utilization: Use nvidia-smi or DCGM. Sustained <70% means batch size is likely too small.
  3. KV cache hit rate: With --enable-prefix-caching, vLLM reports this. Falling rates signal prompt churn.
  4. Request latency (p50/p95/p99): Use tracing. High p99 with stable p50 means some requests are stalling the queue.

Send all four to Grafana or your existing observability stack.
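
vLLM's /metrics endpoint is plain Prometheus text, so you can sanity-check it before wiring up Grafana. A minimal sketch (the vllm: metric prefix matches recent vLLM releases, but confirm the exact names against your build's /metrics output):

# Quick look at vLLM's Prometheus metrics without a full monitoring stack.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    # Throughput counters; exact metric names vary by vLLM version.
    if line.startswith(("vllm:prompt_tokens_total", "vllm:generation_tokens_total")):
        print(line)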

Fine-tuning V4 Base checkpoints

Base checkpoints are for continued pre-training and SFT. Standard SFT (with LoRA):

pip install "torch>=2.6" transformers accelerate peft trl

# Standard SFT with LoRA on V4-Flash-Base
trl sft \
  --model_name_or_path deepseek-ai/DeepSeek-V4-Flash-Base \
  --dataset_name your-org/your-sft-set \
  --output_dir ./models/v4-flash-custom \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2e-5 \
  --bf16 true \
  --use_peft true \
  --lora_r 64 \
  --lora_alpha 128
  • Full-parameter tuning on V4-Pro is for research labs. LoRA adapters on V4-Flash-Base provide substantial quality gains at practical compute cost; a sketch for loading the trained adapter follows below.
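
After training, the output directory holds a LoRA adapter rather than full weights. A sketch for loading and optionally merging it with peft (paths match the training command above; merging a model this size needs roughly the same multi-GPU memory as inference):

# Load the trained LoRA adapter on top of the base checkpoint (peft).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash-Base", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "./models/v4-flash-custom")

# Optional: fold the adapter into the base weights for vanilla serving.
merged = model.merge_and_unload()
merged.save_pretrained("./models/v4-flash-custom-merged")
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Flash-Base")
tok.save_pretrained("./models/v4-flash-custom-merged")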

Common pitfalls

  1. OOM at startup: Usually --max-model-len is too high or --tensor-parallel-size too low. Lower context or increase parallelism.
  2. Slow first request: vLLM compiles kernels lazily. Warm up with a dummy request.
  3. Tool-use parsing errors: DeepSeek's tool-call encoding differs from OpenAI's. Use SDK versions with explicit V4 support.
  4. FP8 errors on old GPUs: A100s lack FP8 support. Use BF16 and expect 2x VRAM needs.

When self-hosting pays off

Break-even vs. hosted DeepSeek V4 pricing:

  • V4-Flash at 200B input + 20B output/month: ~$33.6K on API. 8 × H100 box rents ≈ $20K/month. Self-hosting saves ~40%.
  • V4-Pro at 500B input + 50B output/month: ~$1.04M on API. 16 × H100 cluster rents ≈ $35K/month. Self-hosting saves >95%.

Break-even for V4-Flash: ≈ 100B tokens/month. Below that, hosted API is cheaper and simpler.
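
The arithmetic behind these figures is straightforward. The sketch below reproduces the V4-Pro numbers from the published per-token prices (V4-Flash prices are not stated in this guide, so only Pro is computed):

# Reproduce the V4-Pro break-even arithmetic from the published prices.
INPUT_PER_M, OUTPUT_PER_M = 1.74, 3.48  # $/M tokens, V4-Pro hosted API

api_cost = 500_000 * INPUT_PER_M + 50_000 * OUTPUT_PER_M  # 500B in + 50B out
cluster_rent = 35_000                                     # 16 x H100, $/month

print(f"API:  ${api_cost:,.0f}/month")               # ~$1.04M
print(f"Self: ${cluster_rent:,.0f}/month")
print(f"Savings: {1 - cluster_rent / api_cost:.0%}")  # >95%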

FAQ

Can I run V4-Flash on a single A100?

Yes, with heavy quantization and reduced context: INT4 on an 80GB A100 runs at roughly 5–15 tok/s. An H100 is much faster.

Does V4 support LoRA fine-tuning?

Yes. Use Base checkpoints with TRL or Axolotl pipelines. MoE routing doesn't impact LoRA.

Is the local server OpenAI-compatible?

Yes. Both vLLM and SGLang expose /v1/chat/completions and /v1/completions with OpenAI request shape. The hosted API guide applies to localhost.

How do I enable thinking mode locally?

Pass thinking_mode: "thinking" or "thinking_max" in the request body. vLLM and SGLang forward the flag.
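
With the openai Python client, a non-standard field like thinking_mode goes through extra_body. A minimal sketch (whether the engine forwards the field depends on your vLLM/SGLang version):

# Request thinking mode from the local server via a non-standard body field.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"thinking_mode": "thinking"},  # or "thinking_max"
)
print(resp.choices[0].message.content)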

Can I stream from a local V4 server?

Yes. Set stream: true as you would for OpenAI or hosted DeepSeek API.
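
A minimal streaming sketch with the same client:

# Stream tokens from the local server as they are generated.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Count to ten."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)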

Cheapest way to experiment before buying hardware?

Rent a single H100 on RunPod or Lambda, run V4-Flash at INT4, and benchmark with your prompts. $10–$30 is enough for a real-world throughput check.
