Want to run the latest open-source LLMs on your own hardware? Here's exactly what you need for each popular model family.
## Quick Reference: VRAM Requirements
| Model | FP16 | Q8 | Q4_K_M | Min GPU |
|---|---|---|---|---|
| Llama 3.1 8B | 16 GB | 8.5 GB | 5 GB | RTX 3060 12GB |
| Llama 3.1 70B | 140 GB | 70 GB | 40 GB | 2× RTX 3090 |
| Llama 3.1 405B | 810 GB | 405 GB | 228 GB | 5× A100 80GB |
| Qwen2.5 7B | 14 GB | 7.5 GB | 4.5 GB | RTX 3060 8GB |
| Qwen2.5 14B | 28 GB | 14 GB | 8.5 GB | RTX 4060 Ti 16GB |
| Qwen2.5 32B | 64 GB | 32 GB | 18 GB | RTX 3090 24GB |
| Qwen2.5 72B | 144 GB | 72 GB | 41 GB | 2× RTX 3090 |
| Mistral Small 24B | 48 GB | 24 GB | 14 GB | RTX 4080 16GB |
| Mistral Large 123B | 246 GB | 123 GB | 69 GB | 4× RTX 3090 |
| DeepSeek V3 671B | 1,340 GB | 670 GB | 376 GB | 5× A100 80GB |
| DeepSeek R1 671B | 1,340 GB | 670 GB | 376 GB | 5× A100 80GB |
| Phi-3.5 Mini 3.8B | 7.6 GB | 4 GB | 2.5 GB | RTX 3060 8GB |
| Gemma 2 27B | 54 GB | 27 GB | 16 GB | RTX 4080 16GB |
For any model, you can calculate exact VRAM needs at the VRAM calculator on gpuark.com.
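The table follows a simple rule of thumb: weight memory is just parameter count times bytes per weight. A minimal sketch of that calculation (the ~8.5 and ~4.8 effective bits/weight for Q8_0 and Q4_K_M are approximations, and KV cache comes on top):

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate: parameters × bits/8 bytes each.

    KV cache and activation overhead come on top of this (often another
    10-20%), so treat the result as a floor, not a budget.
    """
    return params_billions * bits_per_weight / 8

# Llama 3.1 8B
print(weight_vram_gb(8, 16))   # FP16   -> 16.0 GB
print(weight_vram_gb(8, 8.5))  # Q8_0   -> 8.5 GB (~8.5 bits/weight)
print(weight_vram_gb(8, 4.8))  # Q4_K_M -> 4.8 GB (~4.8 bits/weight)
```

This is why the Q4_K_M column is roughly 30% of the FP16 column across every row.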
## Model-by-Model Deep Dive

### Llama 3.1 — The All-Rounder
Meta's Llama 3.1 comes in 8B, 70B, and 405B sizes. The 8B is perfect for getting started:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Llama 3.1 8B (auto-downloads ~4.7GB)
ollama run llama3.1

# Or the 70B if you have the VRAM
ollama run llama3.1:70b
```
**8B at Q4_K_M:** Fits on any 8GB+ GPU. Great for coding, summarization, general chat. Not competitive with GPT-4 on complex reasoning.

**70B at Q4_K_M:** This is where Llama 3.1 really shines — competitive with GPT-4 on many benchmarks. Needs ~40GB VRAM, so two 3090s or a single A100 80GB.

**405B:** Research-grade. Needs 5+ A100 80GB at Q4. Not practical for most individuals.
### DeepSeek V3 / R1 — The MoE Giants
DeepSeek V3 (671B) uses Mixture of Experts — only ~37B parameters active per token, but all 671B must fit in memory. This means:
- At Q4_K_M: ~376 GB VRAM minimum
- Realistic minimum: 5× A100 80GB (400 GB total)
- On consumer hardware: not feasible for the full model
But distilled versions of DeepSeek R1 exist:
- DeepSeek-R1-7B: 4.5 GB at Q4 — runs on any modern GPU
- DeepSeek-R1-14B: 8.5 GB at Q4 — RTX 4060 Ti
- DeepSeek-R1-32B: 18 GB at Q4 — RTX 3090
- DeepSeek-R1-70B: 40 GB at Q4 — 2× RTX 3090
The distilled 32B is arguably the best reasoning model you can run on a single consumer GPU.
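Picking the right distill is just a lookup against the sizes above. A small sketch (the Ollama tags are the real `deepseek-r1` ones; the 15% headroom for KV cache is an assumption):

```python
# Q4 sizes (GB) of the DeepSeek-R1 distills, from the list above
DISTILLS = [
    ("deepseek-r1:7b", 4.5),
    ("deepseek-r1:14b", 8.5),
    ("deepseek-r1:32b", 18.0),
    ("deepseek-r1:70b", 40.0),
]

def best_distill(vram_gb, headroom=1.15):
    """Largest distill whose Q4 weights plus ~15% KV-cache headroom fit."""
    fitting = [tag for tag, size in DISTILLS if size * headroom <= vram_gb]
    return fitting[-1] if fitting else None

print(best_distill(24))  # RTX 3090 -> deepseek-r1:32b
print(best_distill(12))  # 12 GB card -> deepseek-r1:14b
```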
### Qwen2.5 — Best for Coding
Alibaba's Qwen2.5 series excels at code generation. The -Coder variants are particularly strong:
```bash
# Qwen2.5-Coder-14B — best coding model for 16GB GPUs
ollama run qwen2.5-coder:14b

# Qwen2.5-32B — strong general model for 24GB GPUs
ollama run qwen2.5:32b
```
Qwen2.5-Coder-14B at Q4_K_M (~8.5 GB) is the sweet spot for developer use. It handles Python, JavaScript, Rust, and Go with impressive accuracy and fits on a 12GB card.
### Mistral — Efficient and Fast

Mistral models are known for their strong quality-to-size ratio:
```bash
# Mistral Small 24B — best quality under 16GB
ollama run mistral-small

# Mistral Large 123B — needs serious hardware
ollama run mistral-large
```
Mistral Small 24B at Q4_K_M (~14 GB) is the best general-purpose model for 16GB GPUs. Solid reasoning, good instruction following, fast.
## GPU Setup Recommendations

### Beginner Setup (~$400)
- GPU: RTX 4060 Ti 16GB
- Models: Qwen2.5-14B, Mistral-Small-24B (Q4), Llama 3.1 8B (Q8)
- Software: Ollama + Open WebUI
### Enthusiast Setup (~$700)
- GPU: Used RTX 3090 24GB
- Models: Qwen2.5-32B, DeepSeek-R1-32B, any 34B model
- Software: Ollama or ExLlamaV2 + TabbyAPI
### Power User Setup (~$1,400)
- GPUs: 2× Used RTX 3090 (48GB total)
- Models: Llama 3.1 70B, Qwen2.5-72B, Mixtral 8x22B
- Software: llama.cpp with `--tensor-split 24,24`
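llama.cpp's `--tensor-split` takes relative ratios, so values proportional to each card's VRAM also work for mismatched pairs. A hypothetical helper (not part of llama.cpp) to compute them:

```python
def tensor_split(vram_per_gpu):
    """Relative --tensor-split ratios: each GPU takes a share of the
    model proportional to its VRAM (hypothetical helper)."""
    total = sum(vram_per_gpu)
    return [round(v / total, 2) for v in vram_per_gpu]

print(tensor_split([24, 24]))  # [0.5, 0.5], same as --tensor-split 24,24
print(tensor_split([24, 16]))  # [0.6, 0.4] for a 3090 + 4080 pair
```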
### Prosumer Setup (~$2,000)
- GPU: RTX 4090 + used RTX 3090
- Models: Same as above, faster inference
- Software: ExLlamaV2 with tensor parallelism
## Performance Tips

### 1. Use the right quantization
Q4_K_M for most models. Go Q5 or Q6 only if VRAM allows — the quality gain is marginal but measurable on reasoning.
### 2. Optimize KV cache
```bash
# llama.cpp: limit context to what you need
llama-server -m model.gguf -c 4096  # instead of default 8192+
```
Halving context length saves significant VRAM.
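The cache size is easy to compute by hand: 2 tensors (K and V) × layers × KV heads × head dimension × context length × bytes per element. A sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads via GQA, head dim 128):

```python
def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache size: K and V tensors for every layer at every position."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Llama 3.1 8B with an FP16 cache
print(round(kv_cache_gb(32, 8, 128, 8192), 2))  # 1.07 GB at 8k context
print(round(kv_cache_gb(32, 8, 128, 4096), 2))  # 0.54 GB at 4k
```

The saving is linear in context length. For models without GQA (full multi-head attention) the cache is several times larger, which is where trimming context really pays off.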
### 3. Flash Attention
Requires compute capability 8.0+ (RTX 30-series or newer). Enabled by default in most frameworks, it reduces attention memory usage for long contexts from O(n²) to O(n).
### 4. CPU offloading for oversized models
```bash
# llama.cpp: offload only some layers to GPU
llama-server -m model.gguf -ngl 20  # 20 layers on GPU, rest on CPU
```
Slower but lets you run models that don't fully fit. Expect ~2-5 tok/s for CPU layers vs ~30+ for GPU.
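The slowdown is easy to estimate: per-token time is the sum of per-layer times, so even a minority of CPU layers dominates. A rough model (the 30 and 3 tok/s baselines are illustrative assumptions, not benchmarks):

```python
def offload_tps(gpu_layers, total_layers, gpu_tps=30.0, cpu_tps=3.0):
    """Blended tokens/sec when some layers run on CPU.

    If the whole model on GPU runs at gpu_tps, each GPU layer costs
    1/(gpu_tps * total_layers) seconds per token; CPU layers likewise.
    """
    cpu_layers = total_layers - gpu_layers
    t = (gpu_layers / (gpu_tps * total_layers)
         + cpu_layers / (cpu_tps * total_layers))
    return 1 / t

# 20 of 40 layers on GPU (-ngl 20)
print(round(offload_tps(20, 40), 1))  # ~5.5 tok/s
```

Half the layers on CPU cuts throughput to roughly a sixth, not a half, which is why offloading is a last resort rather than a free lunch.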
## Conclusion
The local LLM ecosystem has matured enormously. For most developers:
- Start with Ollama — zero-friction setup
- Get at least 16GB VRAM — opens up 24B models
- 24GB (RTX 3090) is the sweet spot — runs everything up to 34B comfortably
- Two GPUs if you need 70B+ — pipeline parallelism just works
The quality gap between local 32B models and cloud GPT-4 has narrowed significantly, especially for coding and domain-specific tasks. For many workflows, local is now good enough.
What's your local LLM setup? Drop your GPU + favorite model in the comments!