How to Run DeepSeek Locally in 2026: Ollama, LM Studio & vLLM Setup Guide
DeepSeek's models are open-source under the MIT license — so you can run them on your own hardware: no API key, no monthly bill, and your data never leaves your machine.
Here's a complete guide to running DeepSeek locally in 2026, covering three methods depending on your setup.
Which Model Should You Run?
Before picking a deployment method, pick a model size:
| Model | Active Params | VRAM (Q4 quant) | Sweet Spot For |
|---|---|---|---|
| R1 Distill 7B | 7B | ~5 GB | RTX 3060, M2 Pro |
| R1 Distill 14B | 14B | ~10 GB | RTX 3090, M2 Max ← recommended |
| R1 Distill 32B | 32B | ~22 GB | RTX 4090, A100 40G |
| V3 / V4 Full | 671B–1.6T | 400+ GB | Multi-GPU server |
For most developers: R1 Distill 14B with Q4 quantization. Runs on a single RTX 3090 or Apple M2 Max, competitive reasoning quality, fast enough for interactive dev work.
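The VRAM column follows a simple rule of thumb, sketched below. The 4-bit weight math is standard for Q4 quantization, but the ~40% overhead factor (KV cache, activations, runtime buffers) is my own approximation, not an official figure:

```python
# Back-of-the-envelope VRAM estimate for a Q4-quantized model:
# 4 bits (0.5 bytes) per parameter for the weights, plus roughly
# 40% overhead for KV cache, activations, and runtime buffers.
def estimate_vram_gb(params_billion: float, bits_per_param: int = 4,
                     overhead: float = 1.4) -> float:
    weights_gb = params_billion * bits_per_param / 8
    return round(weights_gb * overhead, 1)

for size in (7, 14, 32):
    print(f"{size}B @ Q4 ≈ {estimate_vram_gb(size)} GB")
```

With these assumptions the estimates land close to the table's ~5 / ~10 / ~22 GB figures; real usage varies with context length, since the KV cache grows with it.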
Method 1: Ollama (Easiest)
Ollama handles download, quantization, and serving in one command. Works on macOS, Linux, and Windows.
Install
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com
```
Pull and Run
```bash
ollama run deepseek-r1:7b       # ~4.5 GB download
ollama run deepseek-r1:14b      # ~9 GB download ← recommended
ollama run deepseek-r1:32b      # ~20 GB download
ollama run deepseek-v3:latest   # ~220 GB — only for high-end setups
```
First run downloads the model; subsequent runs start instantly.
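To see which models are already on disk, run `ollama list` — or query the REST API's `GET /api/tags` endpoint, which is part of Ollama's documented API. A stdlib-only sketch (the helper names are my own):

```python
import json
from urllib.request import urlopen

def parse_model_names(tags_json: str) -> list[str]:
    """Pull model names out of an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

def list_local_models(host: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models are already pulled."""
    with urlopen(f"{host}/api/tags") as resp:
        return parse_model_names(resp.read().decode())

if __name__ == "__main__":
    print(list_local_models())  # e.g. ['deepseek-r1:14b']
```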
Use via Python (OpenAI-compatible)
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
)

response = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Explain the A2A protocol in 3 sentences"}]
)
print(response.choices[0].message.content)
```
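R1's reasoning outputs can run long, so streaming makes interactive use much nicer. The same endpoint supports the standard `stream=True` flag of the openai client; a sketch (the `join_deltas` helper is my own):

```python
def join_deltas(chunks) -> str:
    """Concatenate the text deltas from a streamed chat completion."""
    return "".join(c.choices[0].delta.content or "" for c in chunks)

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model="deepseek-r1:14b",
        messages=[{"role": "user", "content": "Explain KV caching in one sentence"}],
        stream=True,
    )
    for chunk in stream:
        # Each chunk carries a small text delta; print tokens as they arrive
        print(chunk.choices[0].delta.content or "", end="", flush=True)
```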
Use with LangChain
```python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1:14b")
result = llm.invoke("What's the difference between MCP and A2A?")
print(result.content)
```
Use with CrewAI
```python
from crewai import LLM

llm = LLM(
    model="ollama/deepseek-r1:14b",
    base_url="http://localhost:11434"
)
```
Method 2: LM Studio (GUI, Windows-Friendly)
LM Studio is a desktop app — no CLI needed. Best for non-technical users or anyone who wants a ChatGPT-like interface locally.
- Download from lmstudio.ai
- Open → Discover tab → search `deepseek`
- Select `DeepSeek-R1-Distill-Qwen-14B-GGUF` → Download
- Load the model → chat directly
- Optional: enable Local Server (port 1234) for API access
```python
# LM Studio local server — same API pattern as Ollama
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-14b",
    messages=[{"role": "user", "content": "..."}]
)
```
Method 3: Docker + vLLM (Production Grade)
For high-throughput production workloads on a GPU server:
```bash
docker pull vllm/vllm-openai:latest

# Run DeepSeek R1 14B (needs ~30 GB VRAM)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --tensor-parallel-size 1
```
API available at http://localhost:8000/v1 — OpenAI-compatible.
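Because the wire format is OpenAI-compatible, you can hit the server with nothing but the standard library. A minimal sketch — the helper names are my own; the payload keys follow the OpenAI chat completions schema:

```python
import json
from urllib.request import Request, urlopen

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> bytes:
    """Assemble an OpenAI-style chat completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST to a vLLM server and return the assistant's reply text."""
    req = Request(
        f"{base_url}/chat/completions",
        data=build_chat_payload("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", prompt),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain tensor parallelism in two sentences"))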
For multi-GPU (e.g., 2x A100 for 32B):
```bash
docker run --runtime nvidia --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2
```
Bonus: Direct from Hugging Face
If you need the raw weights for fine-tuning or custom inference:
```bash
pip install huggingface_hub

# Download model weights
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --local-dir ./models/deepseek-r1-14b
```

```python
# Load with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "./models/deepseek-r1-14b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
```
Quick Comparison
| Method | Setup Difficulty | Performance | Best For |
|---|---|---|---|
| Ollama | ⭐ Easy | Good | Dev, experimentation |
| LM Studio | ⭐ Easy | Good | Non-technical users, Windows |
| vLLM + Docker | ⭐⭐⭐ Hard | Best | Production, high throughput |
| HF Transformers | ⭐⭐⭐ Hard | Best | Fine-tuning, custom inference |
FAQ
Is DeepSeek open source?
Yes — MIT license. Download, modify, commercialize freely. Weights on Hugging Face and ModelScope.
Does Apple Silicon GPU acceleration work?
Yes. Ollama and LM Studio both support Apple Metal — significantly faster than CPU-only.
Can I use it offline after downloading?
Yes — model runs fully offline. Only the initial download requires internet.
How does local DeepSeek compare to the API?
The weights are the same, so at full precision quality matches the API. Quantized local builds (Q4/Q8) give up a small amount of quality in exchange for fitting on consumer hardware. Speed depends entirely on your hardware.
Find DeepSeek and 400+ AI agent tools, frameworks, and LLM APIs at AgDex.ai — the curated directory for AI builders.