Want to run the latest open-source LLMs on your own hardware? Here's exactly what you need for each popular model family.
## Quick Reference: VRAM Requirements
| Model | FP16 | Q8 | Q4_K_M | Min GPU |
|---|---|---|---|---|
| Llama 3.1 8B | 16 GB | 8.5 GB | 5 GB | RTX 3060 12GB |
| Llama 3.1 70B | 140 GB | 70 GB | 40 GB | 2× RTX 3090 |
| Llama 3.1 405B | 810 GB | 405 GB | 228 GB | 5× A100 80GB |
| Qwen2.5 7B | 14 GB | 7.5 GB | 4.5 GB | RTX 3060 8GB |
| Qwen2.5 14B | 28 GB | 14 GB | 8.5 GB | RTX 4060 Ti 16GB |
| Qwen2.5 32B | 64 GB | 32 GB | 18 GB | RTX 3090 24GB |
| Qwen2.5 72B | 144 GB | 72 GB | 41 GB | 2× RTX 3090 |
| Mistral Small 24B | 48 GB | 24 GB | 14 GB | RTX 4080 16GB |
| Mistral Large 123B | 246 GB | 123 GB | 69 GB | 4× RTX 3090 |
| DeepSeek V3 671B | 1,340 GB | 670 GB | 376 GB | 5× A100 80GB |
| DeepSeek R1 671B | 1,340 GB | 670 GB | 376 GB | 5× A100 80GB |
| Phi-3.5 Mini 3.8B | 7.6 GB | 4 GB | 2.5 GB | RTX 3060 8GB |
| Gemma 2 27B | 54 GB | 27 GB | 16 GB | RTX 4080 16GB |
For any model, you can calculate exact VRAM needs at the VRAM calculator on gpuark.com.
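The table follows a simple rule of thumb: weight memory is just parameter count times bytes per weight. A minimal sketch of that calculation (the ~8.5 and ~4.8 effective bits/weight for Q8_0 and Q4_K_M are approximations, and KV cache comes on top):

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate: parameters × bits/8 bytes each.

    KV cache and activation overhead come on top of this (often another
    10-20%), so treat the result as a floor, not a budget.
    """
    return params_billions * bits_per_weight / 8

# Llama 3.1 8B
print(weight_vram_gb(8, 16))   # FP16   -> 16.0 GB
print(weight_vram_gb(8, 8.5))  # Q8_0   -> 8.5 GB (~8.5 bits/weight)
print(weight_vram_gb(8, 4.8))  # Q4_K_M -> 4.8 GB (~4.8 bits/weight)
```

This is why the Q4_K_M column is roughly 30% of the FP16 column across every row.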
## Model-by-Model Deep Dive

### Llama 3.1 — The All-Rounder
Meta's Llama 3.1 comes in 8B, 70B, and 405B sizes. The 8B is perfect for getting started:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Llama 3.1 8B (auto-downloads ~4.7GB)
ollama run llama3.1

# Or the 70B if you have the VRAM
ollama run llama3.1:70b
```
**8B at Q4_K_M:** Fits on any 8GB+ GPU. Great for coding, summarization, general chat. Not competitive with GPT-4 on complex reasoning.

**70B at Q4_K_M:** This is where Llama 3.1 really shines — competitive with GPT-4 on many benchmarks. Needs ~40GB VRAM, so two 3090s or a single A100 80GB.

**405B:** Research-grade. Needs 5+ A100 80GB at Q4. Not practical for most individuals.
### DeepSeek V3 / R1 — The MoE Giants
DeepSeek V3 (671B) uses Mixture of Experts — only ~37B parameters active per token, but all 671B must fit in memory. This means:
- At Q4_K_M: ~376 GB VRAM minimum
- Realistic minimum: 5× A100 80GB (400 GB total)
- On consumer hardware: not feasible for the full model
But distilled versions of DeepSeek R1 exist:
- DeepSeek-R1-7B: 4.5 GB at Q4 — runs on any modern GPU
- DeepSeek-R1-14B: 8.5 GB at Q4 — RTX 4060 Ti
- DeepSeek-R1-32B: 18 GB at Q4 — RTX 3090
- DeepSeek-R1-70B: 40 GB at Q4 — 2× RTX 3090
The distilled 32B is arguably the best reasoning model you can run on a single consumer GPU.
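Picking the right distill is just a lookup against the sizes above. A small sketch (the Ollama tags are the real `deepseek-r1` ones; the 15% headroom for KV cache is an assumption):

```python
# Q4 sizes (GB) of the DeepSeek-R1 distills, from the list above
DISTILLS = [
    ("deepseek-r1:7b", 4.5),
    ("deepseek-r1:14b", 8.5),
    ("deepseek-r1:32b", 18.0),
    ("deepseek-r1:70b", 40.0),
]

def best_distill(vram_gb, headroom=1.15):
    """Largest distill whose Q4 weights plus ~15% KV-cache headroom fit."""
    fitting = [tag for tag, size in DISTILLS if size * headroom <= vram_gb]
    return fitting[-1] if fitting else None

print(best_distill(24))  # RTX 3090 -> deepseek-r1:32b
print(best_distill(12))  # 12 GB card -> deepseek-r1:14b
```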
### Qwen2.5 — Best for Coding
Alibaba's Qwen2.5 series excels at code generation. The -Coder variants are particularly strong:
```bash
# Qwen2.5-Coder-14B — best coding model for 16GB GPUs
ollama run qwen2.5-coder:14b

# Qwen2.5-32B — strong general model for 24GB GPUs
ollama run qwen2.5:32b
```
Qwen2.5-Coder-14B at Q4_K_M (~8.5 GB) is the sweet spot for developer use. It handles Python, JavaScript, Rust, and Go with impressive accuracy and fits on a 12GB card.
### Mistral — Efficient and Fast

Mistral models are known for their strong quality-to-size ratio:
```bash
# Mistral Small 24B — best quality under 16GB
ollama run mistral-small

# Mistral Large 123B — needs serious hardware
ollama run mistral-large
```
Mistral Small 24B at Q4_K_M (~14 GB) is the best general-purpose model for 16GB GPUs. Solid reasoning, good instruction following, fast.
## GPU Setup Recommendations

### Beginner Setup (~$400)
- GPU: RTX 4060 Ti 16GB
- Models: Qwen2.5-14B, Mistral-Small-24B (Q4), Llama 3.1 8B (Q8)
- Software: Ollama + Open WebUI
### Enthusiast Setup (~$700)
- GPU: Used RTX 3090 24GB
- Models: Qwen2.5-32B, DeepSeek-R1-32B, any 34B model
- Software: Ollama or ExLlamaV2 + TabbyAPI
### Power User Setup (~$1,400)
- GPUs: 2× Used RTX 3090 (48GB total)
- Models: Llama 3.1 70B, Qwen2.5-72B, Mixtral 8x22B
- Software: llama.cpp with `--tensor-split 24,24`
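llama.cpp's `--tensor-split` takes relative ratios, so values proportional to each card's VRAM also work for mismatched pairs. A hypothetical helper (not part of llama.cpp) to compute them:

```python
def tensor_split(vram_per_gpu):
    """Relative --tensor-split ratios: each GPU takes a share of the
    model proportional to its VRAM (hypothetical helper)."""
    total = sum(vram_per_gpu)
    return [round(v / total, 2) for v in vram_per_gpu]

print(tensor_split([24, 24]))  # [0.5, 0.5], same as --tensor-split 24,24
print(tensor_split([24, 16]))  # [0.6, 0.4] for a 3090 + 4080 pair
```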
### Prosumer Setup (~$2,000)
- GPU: RTX 4090 + used RTX 3090
- Models: Same as above, faster inference
- Software: ExLlamaV2 with tensor parallelism
## Performance Tips

### 1. Use the right quantization
Q4_K_M for most models. Go Q5 or Q6 only if VRAM allows — the quality gain is marginal but measurable on reasoning.
### 2. Optimize KV cache
```bash
# llama.cpp: limit context to what you need
llama-server -m model.gguf -c 4096  # instead of default 8192+
```
Halving context length saves significant VRAM.
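The cache size is easy to compute by hand: 2 tensors (K and V) × layers × KV heads × head dimension × context length × bytes per element. A sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads via GQA, head dim 128):

```python
def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache size: K and V tensors for every layer at every position."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Llama 3.1 8B with an FP16 cache
print(round(kv_cache_gb(32, 8, 128, 8192), 2))  # 1.07 GB at 8k context
print(round(kv_cache_gb(32, 8, 128, 4096), 2))  # 0.54 GB at 4k
```

The saving is linear in context length. For models without GQA (full multi-head attention) the cache is several times larger, which is where trimming context really pays off.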
### 3. Flash Attention
Requires compute capability 8.0+ (RTX 30-series or newer). Enabled by default in most frameworks, it reduces attention memory usage for long contexts from O(n²) to O(n).
### 4. CPU offloading for oversized models
```bash
# llama.cpp: offload only some layers to GPU
llama-server -m model.gguf -ngl 20  # 20 layers on GPU, rest on CPU
```
Slower but lets you run models that don't fully fit. Expect ~2-5 tok/s for CPU layers vs ~30+ for GPU.
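The slowdown is easy to estimate: per-token time is the sum of per-layer times, so even a minority of CPU layers dominates. A rough model (the 30 and 3 tok/s baselines are illustrative assumptions, not benchmarks):

```python
def offload_tps(gpu_layers, total_layers, gpu_tps=30.0, cpu_tps=3.0):
    """Blended tokens/sec when some layers run on CPU.

    If the whole model on GPU runs at gpu_tps, each GPU layer costs
    1/(gpu_tps * total_layers) seconds per token; CPU layers likewise.
    """
    cpu_layers = total_layers - gpu_layers
    t = (gpu_layers / (gpu_tps * total_layers)
         + cpu_layers / (cpu_tps * total_layers))
    return 1 / t

# 20 of 40 layers on GPU (-ngl 20)
print(round(offload_tps(20, 40), 1))  # ~5.5 tok/s
```

Half the layers on CPU cuts throughput to roughly a sixth, not a half, which is why offloading is a last resort rather than a free lunch.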
## Conclusion
The local LLM ecosystem has matured enormously. For most developers:
- Start with Ollama — zero-friction setup
- Get at least 16GB VRAM — opens up 24B models
- 24GB (RTX 3090) is the sweet spot — runs everything up to 34B comfortably
- Two GPUs if you need 70B+ — pipeline parallelism just works
The quality gap between local 32B models and cloud GPT-4 has narrowed significantly, especially for coding and domain-specific tasks. For many workflows, local is now good enough.
What's your local LLM setup? Drop your GPU + favorite model in the comments!