Agdex AI
How to Run DeepSeek Locally in 2026: Ollama, LM Studio & vLLM Setup Guide

DeepSeek's models are open source under the MIT license, which means you can run them on your own hardware: no API key, no monthly cost, and no data leaving your machine.

Here's a complete guide to running DeepSeek locally in 2026, covering three methods depending on your setup.

Which Model Should You Run?

Before picking a deployment method, pick a model size:

| Model | Active Params | VRAM (Q4 quant) | Sweet Spot For |
|---|---|---|---|
| R1 Distill 7B | 7B | ~5 GB | RTX 3060, M2 Pro |
| R1 Distill 14B | 14B | ~10 GB | RTX 3090, M2 Max ← recommended |
| R1 Distill 32B | 32B | ~22 GB | RTX 4090, A100 40G |
| V3 / V4 Full | 671B–1.6T | 400+ GB | Multi-GPU server |

For most developers: R1 Distill 14B with Q4 quantization. Runs on a single RTX 3090 or Apple M2 Max, competitive reasoning quality, fast enough for interactive dev work.
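If your GPU isn't in the table, a back-of-the-envelope estimate gets you close. The sketch below is a rule of thumb, not a vendor formula: it assumes roughly 0.5 bytes per parameter at 4-bit, ~20% runtime overhead, and ~1 GB of headroom for KV cache and buffers (all assumed figures).

```python
def estimate_q4_vram_gb(params_billion: float) -> float:
    """Rough VRAM needed to serve a Q4-quantized model.

    Rule of thumb: ~0.5 bytes/parameter at 4-bit, plus ~20%
    runtime overhead and ~1 GB for KV cache and buffers.
    """
    weights_gb = params_billion * 0.5   # 4-bit ≈ 0.5 bytes per param
    return weights_gb * 1.2 + 1.0       # overhead + cache headroom

for size in (7, 14, 32):
    print(f"{size}B → ~{estimate_q4_vram_gb(size):.0f} GB VRAM")
```

The numbers land within a gigabyte or two of the table above; longer context windows push the KV cache term higher.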


Method 1: Ollama (Easiest)

Ollama handles download, quantization, and serving in one command. Works on macOS, Linux, and Windows.

Install

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com
```

Pull and Run

```bash
ollama run deepseek-r1:7b      # ~4.5 GB download
ollama run deepseek-r1:14b     # ~9 GB download ← recommended
ollama run deepseek-r1:32b     # ~20 GB download
ollama run deepseek-v3:latest  # ~220 GB — only for high-end setups
```

First run downloads the model; subsequent runs start instantly.
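To verify the server is up without opening the REPL, you can hit Ollama's REST API directly (this assumes Ollama is running on its default port, 11434):

```bash
# List locally installed models
curl http://localhost:11434/api/tags

# One-off, non-streaming generation
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "Say hello in one word.",
  "stream": false
}'
```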

Use via Python (OpenAI-compatible)

Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works; Ollama ignores it
)

response = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Explain the A2A protocol in 3 sentences"}]
)
print(response.choices[0].message.content)
```
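Note that R1-series models emit their chain of thought inside `<think>...</think>` tags before the final answer. If you only want the answer, a small helper (hypothetical, not part of any SDK) can split the two:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate an R1-style <think>...</think> block from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        return "", text.strip()            # no reasoning block present
    reasoning = match.group(1).strip()     # the model's chain of thought
    answer = text[match.end():].strip()    # everything after </think>
    return reasoning, answer

raw = "<think>A2A is an agent protocol...</think>\nA2A lets agents talk."
thoughts, answer = split_reasoning(raw)
print(answer)  # → A2A lets agents talk.
```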

Use with LangChain

```python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1:14b")
result = llm.invoke("What's the difference between MCP and A2A?")
```

Use with CrewAI

```python
from crewai import LLM

llm = LLM(
    model="ollama/deepseek-r1:14b",
    base_url="http://localhost:11434"
)
```


Method 2: LM Studio (GUI, Windows-Friendly)

LM Studio is a desktop app — no CLI needed. Best for non-technical users or anyone who wants a ChatGPT-like interface locally.

  1. Download from lmstudio.ai
  2. Open → Discover tab → search "deepseek"
  3. Select DeepSeek-R1-Distill-Qwen-14B-GGUF → Download
  4. Load the model → chat directly
  5. Optional: enable Local Server (port 1234) for API access

```python
# LM Studio local server — same API pattern as Ollama
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # any string works
)
response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-14b",
    messages=[{"role": "user", "content": "..."}]
)
print(response.choices[0].message.content)
```


Method 3: Docker + vLLM (Production Grade)

For high-throughput production workloads on a GPU server:

```bash
docker pull vllm/vllm-openai:latest

# Run DeepSeek R1 14B (needs ~30 GB VRAM)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --tensor-parallel-size 1
```

API available at http://localhost:8000/v1 — OpenAI-compatible.
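A quick smoke test against the server (assumes the container above is running and the model has finished loading):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    "messages": [{"role": "user", "content": "ping"}]
  }'
```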

For multi-GPU (e.g., 2x A100 for 32B):

```bash
docker run --runtime nvidia --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2
```


Bonus: Direct from Hugging Face

If you need the raw weights for fine-tuning or custom inference:

```bash
pip install huggingface_hub

# Download model weights
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --local-dir ./models/deepseek-r1-14b
```

Load with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "./models/deepseek-r1-14b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
```


Quick Comparison

| Method | Setup Difficulty | Performance | Best For |
|---|---|---|---|
| Ollama | ⭐ Easy | Good | Dev, experimentation |
| LM Studio | ⭐ Easy | Good | Non-technical users, Windows |
| vLLM + Docker | ⭐⭐⭐ Hard | Best | Production, high throughput |
| HF Transformers | ⭐⭐⭐ Hard | Best | Fine-tuning, custom inference |

FAQ

Is DeepSeek open source?
Yes — MIT license. Download, modify, commercialize freely. Weights on Hugging Face and ModelScope.

Does Apple Silicon GPU acceleration work?
Yes. Ollama and LM Studio both support Apple Metal — significantly faster than CPU-only.

Can I use it offline after downloading?
Yes — model runs fully offline. Only the initial download requires internet.

How does local DeepSeek compare to the API?
Full-precision local inference uses the same weights as the API, so quality matches. Quantized local versions (Q4/Q8) trade a small amount of quality for a much smaller memory footprint. Speed depends on your hardware.
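The memory savings behind that trade-off are simple arithmetic. A sketch, counting raw weight storage only (activations and KV cache excluded):

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage for a model at a given numeric precision."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

for bits, name in [(16, "FP16"), (8, "Q8"), (4, "Q4")]:
    print(f"14B @ {name}: {weights_gb(14, bits):.0f} GB")
# 14B @ FP16: 28 GB
# 14B @ Q8: 14 GB
# 14B @ Q4: 7 GB
```

Going from FP16 to Q4 cuts weight storage 4x, which is what puts a 14B model within reach of a single consumer GPU.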


Find DeepSeek and 400+ AI agent tools, frameworks, and LLM APIs at AgDex.ai — the curated directory for AI builders.
