Max Vyaznikov
Running DeepSeek, Llama 3, and Qwen Locally: Complete GPU Requirements Guide

Want to run the latest open-source LLMs on your own hardware? Here's exactly what you need for each popular model family.

Quick Reference: VRAM Requirements

| Model | FP16 | Q8 | Q4_K_M | Min GPU |
|---|---|---|---|---|
| Llama 3.1 8B | 16 GB | 8.5 GB | 5 GB | RTX 3060 12GB |
| Llama 3.1 70B | 140 GB | 70 GB | 40 GB | 2× RTX 3090 |
| Llama 3.1 405B | 810 GB | 405 GB | 228 GB | 8× A100 80GB |
| Qwen2.5 7B | 14 GB | 7.5 GB | 4.5 GB | RTX 3060 8GB |
| Qwen2.5 14B | 28 GB | 14 GB | 8.5 GB | RTX 4060 Ti 16GB |
| Qwen2.5 32B | 64 GB | 32 GB | 18 GB | RTX 3090 24GB |
| Qwen2.5 72B | 144 GB | 72 GB | 41 GB | 2× RTX 3090 |
| Mistral Small 24B | 48 GB | 24 GB | 14 GB | RTX 4080 16GB |
| Mistral Large 123B | 246 GB | 123 GB | 69 GB | 4× RTX 3090 |
| DeepSeek V3 671B | 1,340 GB | 670 GB | 376 GB | 5× A100 80GB |
| DeepSeek R1 671B | 1,340 GB | 670 GB | 376 GB | 5× A100 80GB |
| Phi-3.5 Mini 3.8B | 7.6 GB | 4 GB | 2.5 GB | RTX 3060 8GB |
| Gemma 2 27B | 54 GB | 27 GB | 16 GB | RTX 4080 16GB |

For any model, you can calculate exact VRAM needs at the VRAM calculator on gpuark.com.
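The numbers in the table follow a simple rule of thumb: weight memory ≈ parameter count × bits per parameter ÷ 8, plus some headroom for the KV cache and runtime buffers. A minimal sketch in Python (the 20% overhead factor and the ~4.85 bits/param figure for Q4_K_M are approximations, not exact values):

```python
def estimate_vram_gb(params_billions: float, bits_per_param: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: raw weight size plus a fudge factor
    for KV cache and runtime buffers."""
    weights_gb = params_billions * bits_per_param / 8
    return weights_gb * overhead

# Llama 3.1 8B at FP16: 8 * 16 / 8 = 16 GB of weights alone
print(estimate_vram_gb(8, 16, overhead=1.0))   # 16.0, matches the table

# Q4_K_M averages roughly 4.85 bits/param; with overhead this lands
# near the table's 5 GB figure
print(round(estimate_vram_gb(8, 4.85), 1))
```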

Model-by-Model Deep Dive

Llama 3.1 — The All-Rounder

Meta's Llama 3.1 comes in 8B, 70B, and 405B sizes. The 8B is perfect for getting started:

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Llama 3.1 8B (auto-downloads ~4.7GB)
ollama run llama3.1

# Or the 70B if you have the VRAM
ollama run llama3.1:70b
```

8B at Q4_K_M: Fits on any 8GB+ GPU. Great for coding, summarization, general chat. Not competitive with GPT-4 on complex reasoning.

70B at Q4_K_M: This is where Llama 3.1 really shines — competitive with GPT-4 on many benchmarks. Needs ~40GB VRAM, so two 3090s or a single A100 80GB.

405B: Research-grade. Needs 5+ A100 80GB at Q4. Not practical for most individuals.

DeepSeek V3 / R1 — The MoE Giants

DeepSeek V3 (671B) uses Mixture of Experts — only ~37B parameters active per token, but all 671B must fit in memory. This means:

  • At Q4_K_M: ~376 GB VRAM minimum
  • Realistic minimum: 5× A100 80GB (400 GB total)
  • On consumer hardware: not feasible for the full model

The good news: distilled DeepSeek-R1 variants (built on Qwen and Llama bases) exist:

  • DeepSeek-R1-7B: 4.5 GB at Q4 — runs on any modern GPU
  • DeepSeek-R1-14B: 8.5 GB at Q4 — RTX 4060 Ti
  • DeepSeek-R1-32B: 18 GB at Q4 — RTX 3090
  • DeepSeek-R1-70B: 40 GB at Q4 — 2× RTX 3090

The distilled 32B is arguably the best reasoning model you can run on a single consumer GPU.
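The distills are available as Ollama tags (tag names here follow the Ollama model library's naming scheme; check the library page for the exact tags your version offers):

```shell
# DeepSeek-R1 distills on Ollama (Q4_K_M by default)
ollama run deepseek-r1:7b
ollama run deepseek-r1:14b
ollama run deepseek-r1:32b   # the single-GPU sweet spot on a 24GB card
ollama run deepseek-r1:70b   # needs ~40 GB across two GPUs
```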

Qwen2.5 — Best for Coding

Alibaba's Qwen2.5 series excels at code generation. The -Coder variants are particularly strong:

```shell
# Qwen2.5-Coder-14B — best coding model for 16GB GPUs
ollama run qwen2.5-coder:14b

# Qwen2.5-32B — strong general model for 24GB GPUs
ollama run qwen2.5:32b
```

Qwen2.5-Coder-14B at Q4_K_M (~8.5 GB) is the sweet spot for developer use. It handles Python, JavaScript, Rust, Go with impressive accuracy and fits on a 12GB card.

Mistral — Efficient and Fast

Mistral models are known for good quality-to-size ratio:

```shell
# Mistral Small 24B — best quality under 16GB
ollama run mistral-small

# Mistral Large 123B — needs serious hardware
ollama run mistral-large
```

Mistral Small 24B at Q4_K_M (~14 GB) is the best general-purpose model for 16GB GPUs. Solid reasoning, good instruction following, fast.

GPU Setup Recommendations

Beginner Setup (~$400)

  • GPU: RTX 4060 Ti 16GB
  • Models: Qwen2.5-14B, Mistral-Small-24B (Q4), Llama 3.1 8B (Q8)
  • Software: Ollama + Open WebUI
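Open WebUI pairs with a local Ollama daemon in one Docker command. This invocation mirrors the Open WebUI README's quick start; adjust the host port and volume name to taste:

```shell
# Serve Open WebUI on http://localhost:3000, backed by local Ollama
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```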

Enthusiast Setup (~$700)

  • GPU: Used RTX 3090 24GB
  • Models: Qwen2.5-32B, DeepSeek-R1-32B, any 34B model
  • Software: Ollama or ExLlamaV2 + TabbyAPI

Power User Setup (~$1,400)

  • GPUs: 2× Used RTX 3090 (48GB total)
  • Models: Llama 3.1 70B, Qwen2.5-72B, Mixtral 8x22B
  • Software: llama.cpp with --tensor-split 24,24
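With two 3090s, the `--tensor-split` invocation looks like this. The values are proportions, so `24,24` means an even split; the model filename is a placeholder for whatever GGUF you downloaded:

```shell
# Split a 70B Q4 model evenly across two 24GB GPUs
llama-server -m llama-3.1-70b-q4_k_m.gguf \
  --tensor-split 24,24 \
  -ngl 99 \
  -c 4096
```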

Prosumer Setup (~$2,000)

  • GPU: RTX 4090 + used RTX 3090
  • Models: Same as above, faster inference
  • Software: ExLlamaV2 with tensor parallelism

Performance Tips

1. Use the right quantization

Q4_K_M for most models. Go Q5 or Q6 only if VRAM allows — the quality gain is marginal but measurable on reasoning.
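With Ollama, the quantization is part of the model tag, so comparing quants is a one-line change (tags follow the Ollama library's naming scheme; verify they exist for your model on the library page):

```shell
ollama run llama3.1:8b-instruct-q4_K_M   # default-quality quant, ~5 GB
ollama run llama3.1:8b-instruct-q8_0     # higher fidelity, ~8.5 GB
```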

2. Optimize KV cache

```shell
# llama.cpp: limit context to what you need
llama-server -m model.gguf -c 4096  # instead of the model's full context window
```

Halving context length saves significant VRAM.

3. Flash Attention

Requires compute capability 8.0+ (RTX 3000 series or newer). Enabled by default in most frameworks. Reduces attention memory usage for long contexts from O(n²) to O(n).
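In llama.cpp, flash attention has historically been opt-in via a flag (defaults have changed across versions, so check `llama-server --help` for your build):

```shell
# Enable flash attention to cut KV-cache memory at long context
llama-server -m model.gguf -fa -c 16384
```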

4. CPU offloading for oversized models

```shell
# llama.cpp: offload only some layers to GPU
llama-server -m model.gguf -ngl 20  # 20 layers on GPU, rest on CPU
```

Slower but lets you run models that don't fully fit. Expect ~2-5 tok/s for CPU layers vs ~30+ for GPU.

Conclusion

The local LLM ecosystem has matured enormously. For most developers:

  1. Start with Ollama — zero-friction setup
  2. Get at least 16GB VRAM — opens up 24B models
  3. 24GB (RTX 3090) is the sweet spot — runs everything up to 34B comfortably
  4. Two GPUs if you need 70B+ — pipeline parallelism just works

The quality gap between local 32B models and cloud GPT-4 has narrowed significantly, especially for coding and domain-specific tasks. For many workflows, local is now good enough.


What's your local LLM setup? Drop your GPU + favorite model in the comments!
