DEV Community

Agdex AI
How to Run DeepSeek Locally in 2026: Ollama, LM Studio & vLLM Setup Guide

DeepSeek's models are open-source under the MIT license, which means you can run them on your own hardware: no API key, no monthly fees, and your data never leaves your machine.

Here's a complete guide to running DeepSeek locally in 2026, covering three methods depending on your setup.

Which Model Should You Run?

Before picking a deployment method, pick a model size:

| Model | Active Params | VRAM (Q4 quant) | Sweet Spot For |
|---|---|---|---|
| R1 Distill 7B | 7B | ~5 GB | RTX 3060, M2 Pro |
| R1 Distill 14B | 14B | ~10 GB | RTX 3090, M2 Max ← recommended |
| R1 Distill 32B | 32B | ~22 GB | RTX 4090, A100 40G |
| V3 / V4 Full | 671B–1.6T | 400+ GB | Multi-GPU server |

For most developers: R1 Distill 14B with Q4 quantization. Runs on a single RTX 3090 or Apple M2 Max, competitive reasoning quality, fast enough for interactive dev work.
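As a rough rule of thumb, a Q4-quantized model needs about half a byte per parameter for its weights, plus headroom for the KV cache and activations. A quick back-of-the-envelope sketch (the 20% overhead factor is an assumption, not a benchmark):

```python
def estimate_vram_gb(params_billions: float, bits_per_param: int = 4,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate for serving a quantized model.

    Weights take params * bits / 8 bytes; the overhead factor (an
    assumed ~20%) covers KV cache and activations.
    """
    weight_gb = params_billions * bits_per_param / 8
    return round(weight_gb * overhead, 1)
```

`estimate_vram_gb(14)` lands around 8–9 GB, in the same ballpark as the ~10 GB in the table above; real usage grows with context length.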


Method 1: Ollama (Easiest)

Ollama handles download, quantization, and serving in one command. Works on macOS, Linux, and Windows.

Install

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com
```

Pull and Run

```bash
ollama run deepseek-r1:7b     # ~4.5 GB download
ollama run deepseek-r1:14b    # ~9 GB download  ← recommended
ollama run deepseek-r1:32b    # ~20 GB download
ollama run deepseek-v3:latest # ~220 GB — only for high-end setups
```

First run downloads the model; subsequent runs start instantly.

Use via Python (OpenAI-compatible)

Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
)

response = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Explain the A2A protocol in 3 sentences"}]
)
print(response.choices[0].message.content)
```
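One wrinkle with the R1 distills: the model emits its chain-of-thought inside `<think>...</think>` tags before the final answer. If you only want the answer, strip that block first (a small helper of my own, not part of any official SDK):

```python
import re

def strip_reasoning(text: str) -> str:
    """Remove the <think>...</think> reasoning block that R1-style
    models prepend, leaving only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

Usage: `answer = strip_reasoning(response.choices[0].message.content)`.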

Use with LangChain

```python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1:14b")
result = llm.invoke("What's the difference between MCP and A2A?")
```

Use with CrewAI

```python
from crewai import LLM

llm = LLM(
    model="ollama/deepseek-r1:14b",
    base_url="http://localhost:11434"
)
```

Method 2: LM Studio (GUI, Windows-Friendly)

LM Studio is a desktop app — no CLI needed. Best for non-technical users or anyone who wants a ChatGPT-like interface locally.

  1. Download from lmstudio.ai
  2. Open → Discover tab → search deepseek
  3. Select DeepSeek-R1-Distill-Qwen-14B-GGUF → Download
  4. Load the model → chat directly
  5. Optional: enable Local Server (port 1234) for API access
```python
# LM Studio local server — same API pattern as Ollama
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"
)
response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-14b",
    messages=[{"role": "user", "content": "..."}]
)
```

Method 3: Docker + vLLM (Production Grade)

For high-throughput production workloads on a GPU server:

```bash
docker pull vllm/vllm-openai:latest

# Run DeepSeek R1 14B (needs ~30 GB VRAM)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --tensor-parallel-size 1
```

API available at http://localhost:8000/v1 — OpenAI-compatible.
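vLLM can take minutes to load large weights before it accepts requests, so it's worth polling for readiness before sending traffic. A minimal sketch against the `/v1/models` endpoint (the helper name, timeouts, and poll interval are my own choices):

```python
import json
import time
import urllib.request

def wait_for_vllm(base_url: str = "http://localhost:8000",
                  timeout_s: int = 300) -> bool:
    """Poll /v1/models until the server reports a loaded model,
    or give up after timeout_s seconds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as resp:
                if json.load(resp).get("data"):
                    return True
        except OSError:
            pass  # server not listening yet; retry
        time.sleep(2)
    return False
```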

For multi-GPU (e.g., 2x A100 for 32B):

```bash
docker run --runtime nvidia --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2
```

Bonus: Direct from Hugging Face

If you need the raw weights for fine-tuning or custom inference:

```bash
pip install huggingface_hub

# Download model weights
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --local-dir ./models/deepseek-r1-14b

# Load with Transformers and run a quick smoke-test generation
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = './models/deepseek-r1-14b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')
inputs = tokenizer('Hello, DeepSeek!', return_tensors='pt').to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
"
```

Quick Comparison

| Method | Setup Difficulty | Performance | Best For |
|---|---|---|---|
| Ollama | ⭐ Easy | Good | Dev, experimentation |
| LM Studio | ⭐ Easy | Good | Non-technical users, Windows |
| vLLM + Docker | ⭐⭐⭐ Hard | Best | Production, high throughput |
| HF Transformers | ⭐⭐⭐ Hard | Best | Fine-tuning, custom inference |

FAQ

Is DeepSeek open source?
Yes — MIT license. Download, modify, commercialize freely. Weights on Hugging Face and ModelScope.

Does Apple Silicon GPU acceleration work?
Yes. Ollama and LM Studio both support Apple Metal — significantly faster than CPU-only.

Can I use it offline after downloading?
Yes — model runs fully offline. Only the initial download requires internet.

How does local DeepSeek compare to the API?
The weights are the same, so a full-precision local deployment matches the API exactly. Quantized local builds (Q4/Q8) trade a small amount of quality for lower memory use. Speed depends on your hardware.


Find DeepSeek and 400+ AI agent tools, frameworks, and LLM APIs at AgDex.ai — the curated directory for AI builders.
