This article was originally published on runaihome.com
DeepSeek R1 is a reasoning model — it "thinks out loud" before answering, producing measurably better results on math, coding, and logic than instruction-tuned chat models of similar size. The full 671B version is server territory. The six distilled variants are not. They range from 1.1GB on disk to 43GB, cover every consumer GPU tier from an 8GB RTX 3060 up to dual RTX 3090s, and even the 7B distill surpasses QwQ-32B-Preview on AIME 2024 math benchmarks — a larger model, outrun by something that fits in 8GB of VRAM.
The decision you're making isn't "should I run DeepSeek R1 locally." It's "which distilled size won't bottleneck my hardware."
Why distilled models exist (and why 671B isn't on this list)
The full DeepSeek-R1 is a Mixture-of-Experts (MoE) model with 671 billion total parameters. At Q4_K_M quantization, the GGUF weights come in at roughly 404GB — you need either a multi-GPU server with 512GB of VRAM or one of the extreme quantizations (Unsloth's IQ1_S at 131GB, for comparison). That's not a home lab setup.
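As a rough sanity check on those file sizes, a quantized model's size is approximately its parameter count times the average bits stored per weight. The sketch below works backwards from the figures quoted above. The helper name is hypothetical, and it assumes decimal gigabytes and ignores file metadata.

```python
def effective_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Back out the average bits per weight from file size (decimal GB) and parameter count."""
    total_bits = size_gb * 1e9 * 8
    return total_bits / (params_billion * 1e9)

# Figures quoted in the text: 404 GB (Q4_K_M) and 131 GB (IQ1_S) for 671B parameters.
q4_bits = effective_bits_per_weight(404, 671)
iq1_bits = effective_bits_per_weight(131, 671)
print(f"Q4_K_M: ~{q4_bits:.2f} bits/weight")
print(f"IQ1_S:  ~{iq1_bits:.2f} bits/weight")
```

Under these assumptions, Q4_K_M averages a little under 5 bits per weight, which is why a "4-bit" 671B model still lands above 400GB.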
Instead, DeepSeek trained six smaller dense models by fine-tuning existing open-source checkpoints on 800,000 samples of R1's reasoning chains. The base models are Qwen2.5 (for the 1.5B, 7B, 14B, and 32B variants) and Llama 3 (for the 8B and 70B). The distillation process transfers R1's chain-of-thought reasoning style into a much smaller architecture — the 7B distill still hits 55.5% on AIME 2024 math problems, a score that would have been considered strong frontier-model performance a year ago.
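In practice, the distills emit their chain of thought inside `<think>...</think>` tags ahead of the final answer. A minimal sketch for separating the two, assuming the raw text contains at most one such block; the function name is hypothetical, and some runtimes may already strip or separate the reasoning for you.

```python
import re
from typing import Tuple

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat everything as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```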
VRAM and disk requirements at a glance
All sizes shown at Q4_K_M quantization, which is the Ollama default and the recommended starting point for home inference. VRAM figures include headroom for KV cache at normal context lengths.
| Model | Ollama tag | Download size | Min VRAM | Comfortable VRAM |
|---|---|---|---|---|
| R1 Distill Qwen 1.5B | deepseek-r1:1.5b | 1.1 GB | 4 GB | 6 GB |
| R1 Distill Qwen 7B | deepseek-r1:7b | 4.7 GB | 6 GB | 8 GB |
| R1 Distill Llama 8B | deepseek-r1:8b | 5.2 GB | 6 GB | 8 GB |
| R1 Distill Qwen 14B | deepseek-r1:14b | 9.0 GB | 12 GB | 16 GB |
| R1 Distill Qwen 32B | deepseek-r1:32b | 20 GB | 24 GB | 24 GB |
| R1 Distill Llama 70B | deepseek-r1:70b | 43 GB | 40 GB | 48 GB |
The 32B fits in a 24GB card (RTX 3090 or RTX 4090), leaving about 4GB for the KV cache. The 70B does not fit in any single consumer GPU; it needs dual 24GB cards or Apple Silicon with ≥64GB unified memory.
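The table can be turned into a small lookup that picks the largest tag for a given amount of VRAM. This is only a sketch: the numbers are copied from the table above, and the function name and the tie-breaking rule (the later row wins when two models share a threshold) are illustrative choices.

```python
from typing import Optional

# (ollama_tag, download_gb, min_vram_gb, comfortable_vram_gb), copied from the table
MODELS = [
    ("deepseek-r1:1.5b", 1.1, 4, 6),
    ("deepseek-r1:7b", 4.7, 6, 8),
    ("deepseek-r1:8b", 5.2, 6, 8),
    ("deepseek-r1:14b", 9.0, 12, 16),
    ("deepseek-r1:32b", 20, 24, 24),
    ("deepseek-r1:70b", 43, 40, 48),
]

def largest_fit(vram_gb: float, comfortable: bool = True) -> Optional[str]:
    """Return the largest tag whose VRAM threshold is at or below vram_gb."""
    column = 3 if comfortable else 2
    fits = [row for row in MODELS if row[column] <= vram_gb]
    # MODELS is ordered smallest to largest, so the last match is the largest.
    return fits[-1][0] if fits else None

for vram in (8, 12, 16, 24):
    print(vram, "GB ->", largest_fit(vram))
```

Passing `comfortable=False` uses the Min VRAM column instead, which is how a 12GB card ends up with the 14B.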
Tier 1: 1.5B — testing and edge devices
The 1.5B distill is based on Qwen2.5-Math-1.5B, a math-specialized base, which gives it disproportionate strength on numerical problems: 28.9% on AIME 2024 and 83.9% on MATH-500 from a model that weighs 1.1GB.
That math score (83.9% on MATH-500) matches what full-scale GPT-3.5-class models produced two years ago. For general conversation and creative writing, the 1.5B will disappoint — the context window fills quickly with its reasoning chains and it often loses track of multi-step instructions. But for a math homework assistant or a low-power edge inference setup, it's remarkable that it runs at all in 4GB of VRAM.
If you're on a laptop with an RTX 4060 (8GB) and want to experiment with R1's reasoning behavior without committing, the 1.5B is your zero-risk entry point. It also runs fully on CPU with 16GB of system RAM at 5–10 tokens/sec — tolerable for occasional use.
Tier 2: 7B and 8B — the 8GB sweet spot
Two distills occupy this tier and they use different base models, which matters.
R1-Distill-Qwen-7B (based on Qwen2.5-Math-7B): 55.5% on AIME 2024, 92.8% on MATH-500. The Qwen2.5-Math base gives it exceptional numerical reasoning for its size. At 4.7GB download and 6-8GB VRAM, this fits on an RTX 3060 with room to spare.
R1-Distill-Llama-8B (based on Llama-3.1-8B): 50.4% on AIME 2024, 89.1% on MATH-500. Marginally weaker on math benchmarks than the 7B despite being larger, because the Llama base is a generalist rather than math-specialized. The advantage: the Llama-3.1-8B base is better on general instruction following and multi-turn conversation. For coding tasks and mixed-domain use, the 8B often feels more capable in practice even though the math benchmark says otherwise.
For most 8GB GPU owners, the 7B is the better choice on math benchmarks and the 8B is the better choice for mixed, everyday use. Run both and see which answers feel more coherent for your use cases; the performance difference on real tasks is smaller than the benchmark gap suggests.
On an RTX 4090, either model runs at roughly 80–100 tok/s. On an RTX 3060 (12GB) or RTX 4060 (8GB), expect 40–60 tok/s. Both are fast enough that the chain-of-thought does not feel slow: the output arrives at least as fast as you can read it.
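To check these throughput figures on your own card, one option is to read the timing counters Ollama returns with each response. The sketch below assumes a local Ollama server on its default port and uses the `eval_count` and `eval_duration` fields of the `/api/generate` response (the duration is in nanoseconds). The helper names and the prompt are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumed)

def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request to a local Ollama server and return the JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def tokens_per_second(stats: dict) -> float:
    """Generation speed from Ollama's counters; eval_duration is in nanoseconds."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

# Example usage (requires a running server with the model already pulled):
# reply = generate("deepseek-r1:7b", "What is 17 * 23?")
# print(f"{tokens_per_second(reply):.1f} tok/s")
```

Because the reasoning tokens count toward `eval_count`, the figure reflects the full chain of thought plus the answer.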
Tier 3: 14B — the 16GB upgrade
The 14B distill (based on Qwen2.5-14B) is where reasoning quality makes a noticeable jump: 69.7% on AIME 2024 and 93.9% on MATH-500. It scores within 3 percentage points of the 32B on math, at half the VRAM.
At Q4_K_M, the 14B weights are 9.0GB on disk. An RTX 4060 Ti 16GB runs it fully in VRAM with 7GB of headroom for context. An RTX 4070 Super (12GB) can run it with tight headroom — the model weights fit at ~9GB, but you'll need to cap your context length to avoid OOM errors. A 16GB card is the comfortable minimum.
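How far the context has to be capped on a 12GB card depends on the size of the KV cache. The estimate below is a sketch under stated assumptions: it takes the Qwen2.5-14B architecture to be 48 layers with 8 key/value heads of dimension 128 (verify these against the model's config file before relying on them) and assumes an unquantized 16-bit cache. Runtime compute buffers are not included.

```python
def kv_cache_gib(
    context_tokens: int,
    n_layers: int = 48,       # assumed for Qwen2.5-14B; check the model config
    n_kv_heads: int = 8,      # assumed (grouped-query attention)
    head_dim: int = 128,      # assumed
    bytes_per_value: int = 2, # 16-bit cache, no KV quantization
) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # One key and one value vector per layer, per KV head, per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30

for ctx in (4096, 8192, 16384, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.2f} GiB")
```

With about 9GB of weights on a 12GB card, roughly 3GB remains before runtime overhead, so under these assumptions a context in the 8K range is near the practical limit.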
Benchmark performance on RTX 4090: 58.6 tok/s for the 14B distill running under Ollama. On RTX 3090 (24GB, 936 GB/s bandwidth): roughly 45–50 tok/s — the 3090 is close enough to the 4090 in memory bandwidth that the gap is small for this model size.
This is the tier most home lab builders should target if they have 16GB VRAM. The quality jump from 7B to 14B is larger than any hardware upgrade you'll make for the same money. You're getting frontier-2024 reasoning capability in a $300–400 GPU config.
Tier 4: 32B — the 24GB card's reason to exist
The 32B distill (based on Qwen2.5-32B) is the practical ceiling for single-card consumer inference. At 20GB download and ~20GB VRAM for the weights, it fits in a 24GB RTX 3090 or RTX 4090 with about 4GB left for the KV cache.
Quality numbers: 72.6% on AIME 2024, 94.3% on MATH-500, 57.2% on LiveCodeBench. The MATH-500 score is within about 3 percentage points of the full 671B R1 model (97.3%); the gap on AIME 2024 is wider. For coding tasks, it performs at a level that was competitive with GPT-4 at launch.
Performance on RTX 4090 (1008 GB/s bandwidth): approximately 38–42 tok/s for the 32B at Q4_K_M under Ollama. With a reasoning model, this matters more than it would for a chat model — R1's thinking chains can run 500–2000 tokens before the actual answer appears. At 40 tok/s, a 1000-token reasoning chain takes about 25 seconds, then the answer follows immediately. Most users find this tolerable for tasks where quality matters more than latency.
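The wait before the answer appears is simply reasoning tokens divided by throughput. A minimal sketch using the figures quoted above (500–2000 reasoning tokens at about 40 tok/s); the function name is hypothetical.

```python
def reasoning_wait_seconds(reasoning_tokens: int, tok_per_s: float) -> float:
    """Seconds spent generating the reasoning chain before the answer starts."""
    return reasoning_tokens / tok_per_s

for tokens in (500, 1000, 2000):
    wait = reasoning_wait_seconds(tokens, 40)
    print(f"{tokens:>4} reasoning tokens at 40 tok/s -> {wait:.1f} s")
```

At the quoted speed, the 500–2000 token range works out to between 12.5 and 50 seconds of thinking before the first word of the answer.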
Performance on RTX 3090 (936 GB/s bandwidth): 28–35 tok/s. The 3090 has about 7% less memory bandwidth than the 4090 (936 vs. 1008 GB/s), so its inference speed lands in the same general range. If you already have a 3090, the 32B is its intended workload.
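A common rule of thumb, used here as an assumption rather than a measurement, is that single-stream decoding is limited by memory bandwidth: each generated token requires reading roughly all of the weights once, so bandwidth divided by weight size gives a ceiling on tok/s. The sketch applies that to the bandwidth figures quoted above and the ~20GB of 32B weights.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough ceiling on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb

WEIGHTS_32B_GB = 20  # Q4_K_M size quoted in the text
ceiling_4090 = bandwidth_bound_tok_s(1008, WEIGHTS_32B_GB)
ceiling_3090 = bandwidth_bound_tok_s(936, WEIGHTS_32B_GB)
print(f"RTX 4090 ceiling: {ceiling_4090:.1f} tok/s")
print(f"RTX 3090 ceiling: {ceiling_3090:.1f} tok/s")
print(f"3090/4090 bandwidth ratio: {936 / 1008:.2f}")
```

The measured ranges quoted for both cards sit below these ceilings, which is expected: the estimate ignores KV cache reads, kernel overhead, and compute limits.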
The 32B is not the right pick for fast-turnaround coding assistance where you want sub-2-second first-token responses. For that use case, drop to the 7B or 14B. The 32B earns its VRAM for tasks where you want the best answer you can get locally: research, complex math, architecture decisions, long-document analysis.
Tier 5: 70B — dual-GPU or Mac Studio territory
The 70B distill (based on Llama-3.3-70B-Instruct) needs 40GB+ of VRAM. On consumer hardware, that means either dual RTX 3090s (