DEV Community

Agdex AI
How to Run DeepSeek Locally in 2026: Ollama, LM Studio & vLLM Setup Guide

DeepSeek's models are open-source under the MIT license, which means you can run them on your own hardware: no API key, no monthly fees, and your data never leaves your machine.

Here's a complete guide to running DeepSeek locally in 2026, covering three methods depending on your setup.

Which Model Should You Run?

Before picking a deployment method, pick a model size:

| Model | Active Params | VRAM (Q4 quant) | Sweet Spot For |
|---|---|---|---|
| R1 Distill 7B | 7B | ~5 GB | RTX 3060, M2 Pro |
| R1 Distill 14B | 14B | ~10 GB | RTX 3090, M2 Max ← recommended |
| R1 Distill 32B | 32B | ~22 GB | RTX 4090, A100 40G |
| V3 / V4 Full | 671B–1.6T | 400+ GB | Multi-GPU server |

For most developers: R1 Distill 14B with Q4 quantization. Runs on a single RTX 3090 or Apple M2 Max, competitive reasoning quality, fast enough for interactive dev work.
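As a rough rule of thumb, a Q4-quantized model needs about half a byte per parameter for its weights, plus headroom for the KV cache and activations. A quick back-of-the-envelope sketch (the 20% overhead factor is an assumption, not a benchmark):

```python
def estimate_vram_gb(params_billions: float, bits_per_param: int = 4,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate for serving a quantized model.

    Weights take params * bits / 8 bytes; the overhead factor (an
    assumed ~20%) covers KV cache and activations.
    """
    weight_gb = params_billions * bits_per_param / 8
    return round(weight_gb * overhead, 1)
```

`estimate_vram_gb(14)` lands around 8–9 GB, in the same ballpark as the ~10 GB in the table above; real usage grows with context length.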


Method 1: Ollama (Easiest)

Ollama handles download, quantization, and serving in one command. Works on macOS, Linux, and Windows.

Install

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com
```

Pull and Run

```bash
ollama run deepseek-r1:7b     # ~4.5 GB download
ollama run deepseek-r1:14b    # ~9 GB download  ← recommended
ollama run deepseek-r1:32b    # ~20 GB download
ollama run deepseek-v3:latest # ~220 GB — only for high-end setups
```

First run downloads the model; subsequent runs start instantly.

Use via Python (OpenAI-compatible)

Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
)

response = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Explain the A2A protocol in 3 sentences"}]
)
print(response.choices[0].message.content)
```
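One wrinkle with the R1 distills: the model emits its chain-of-thought inside `<think>...</think>` tags before the final answer. If you only want the answer, strip that block first (a small helper of my own, not part of any official SDK):

```python
import re

def strip_reasoning(text: str) -> str:
    """Remove the <think>...</think> reasoning block that R1-style
    models prepend, leaving only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

Usage: `answer = strip_reasoning(response.choices[0].message.content)`.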

Use with LangChain

```python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1:14b")
result = llm.invoke("What's the difference between MCP and A2A?")
```

Use with CrewAI

```python
from crewai import LLM

llm = LLM(
    model="ollama/deepseek-r1:14b",
    base_url="http://localhost:11434"
)
```

Method 2: LM Studio (GUI, Windows-Friendly)

LM Studio is a desktop app — no CLI needed. Best for non-technical users or anyone who wants a ChatGPT-like interface locally.

  1. Download from lmstudio.ai
  2. Open → Discover tab → search deepseek
  3. Select DeepSeek-R1-Distill-Qwen-14B-GGUF → Download
  4. Load the model → chat directly
  5. Optional: enable Local Server (port 1234) for API access
```python
# LM Studio local server — same API pattern as Ollama
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"
)
response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-14b",
    messages=[{"role": "user", "content": "..."}]
)
```

Method 3: Docker + vLLM (Production Grade)

For high-throughput production workloads on a GPU server:

```bash
docker pull vllm/vllm-openai:latest

# Run DeepSeek R1 14B (needs ~30 GB VRAM)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --tensor-parallel-size 1
```

API available at http://localhost:8000/v1 — OpenAI-compatible.
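vLLM can take minutes to load large weights before it accepts requests, so it's worth polling for readiness before sending traffic. A minimal sketch against the `/v1/models` endpoint (the helper name, timeouts, and poll interval are my own choices):

```python
import json
import time
import urllib.request

def wait_for_vllm(base_url: str = "http://localhost:8000",
                  timeout_s: int = 300) -> bool:
    """Poll /v1/models until the server reports a loaded model,
    or give up after timeout_s seconds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as resp:
                if json.load(resp).get("data"):
                    return True
        except OSError:
            pass  # server not listening yet; retry
        time.sleep(2)
    return False
```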

For multi-GPU (e.g., 2x A100 for 32B):

```bash
docker run --runtime nvidia --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2
```

Bonus: Direct from Hugging Face

If you need the raw weights for fine-tuning or custom inference:

```bash
pip install huggingface_hub

# Download model weights
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --local-dir ./models/deepseek-r1-14b

# Load with Transformers and run a quick smoke-test generation
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = './models/deepseek-r1-14b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')
inputs = tokenizer('Hello, DeepSeek!', return_tensors='pt').to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
"
```

Quick Comparison

| Method | Setup Difficulty | Performance | Best For |
|---|---|---|---|
| Ollama | ⭐ Easy | Good | Dev, experimentation |
| LM Studio | ⭐ Easy | Good | Non-technical users, Windows |
| vLLM + Docker | ⭐⭐⭐ Hard | Best | Production, high throughput |
| HF Transformers | ⭐⭐⭐ Hard | Best | Fine-tuning, custom inference |

FAQ

Is DeepSeek open source?
Yes — MIT license. Download, modify, commercialize freely. Weights on Hugging Face and ModelScope.

Does Apple Silicon GPU acceleration work?
Yes. Ollama and LM Studio both support Apple Metal — significantly faster than CPU-only.

Can I use it offline after downloading?
Yes — model runs fully offline. Only the initial download requires internet.

How does local DeepSeek compare to the API?
The weights are the same, so a full-precision local deployment matches the API exactly. Quantized local builds (Q4/Q8) trade a small amount of quality for lower memory use. Speed depends on your hardware.


Find DeepSeek and 400+ AI agent tools, frameworks, and LLM APIs at AgDex.ai — the curated directory for AI builders.
