TL;DR: VRAM matters more than GPU power. Most people overestimate what they need—and underestimate what actually runs well.
The confusing part about local LLMs
If you’ve tried running models locally (Ollama, llama.cpp, LM Studio, etc.), you’ve probably asked:
- “Can my GPU run this model?”
- “Why does it technically load but run painfully slow?”
- “Do I need 24GB VRAM for everything?”
The answers online are inconsistent.
So instead of relying on benchmarks, I started tracking what actually works in real setups.
🧠 The simple rule most people miss
If it doesn’t fit comfortably in VRAM, it doesn’t really “run”.
Yes, you can offload layers to the CPU or spill into system RAM, but the experience degrades quickly.
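As a rough back-of-the-envelope check, a quantized model's weights take about parameter count × bits per weight, plus some headroom for the KV cache and runtime. Here's a minimal Python sketch; the bits-per-weight value and the ~2 GB overhead are rough assumptions, not exact figures for any specific runtime:

```python
def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: does a quantized model fit in VRAM?

    Assumptions (ballpark, not exact for any runtime):
      - weights take params * bits_per_weight / 8 bytes
      - ~2 GB extra for KV cache, activations, and runtime overhead
    """
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb <= vram_gb

# Example: a 13B model at ~4.5 bits/weight on a 12 GB card
print(fits_in_vram(13, 4.5, 12))   # ~7.3 GB weights + 2 GB overhead -> True
print(fits_in_vram(34, 4.5, 12))   # ~19 GB weights alone -> False
```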
📊 Practical VRAM breakdown
Here’s a simplified version of what consistently works:
🟢 Under 8GB (S-tier)
- 7B models (quantized)
Good for:
- basic chat
- light coding help
Limitations:
- struggles with longer context
- slower responses
🟡 8–16GB (M-tier)
- 7B → very smooth
- 13B → usable but sometimes tight
👉 This is where most consumer GPUs sit.
🟠 16–24GB (L-tier)
- 13B → comfortable
- 34B → possible with quantization
👉 This is the sweet spot for serious local use.
🔴 24GB+ (XL-tier)
- 34B → usable
- 70B → technically possible, but often inefficient
👉 At this level, cloud often makes more sense unless you specifically need local.
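If you're not sure which tier your card falls into, you can query total VRAM first. A minimal sketch for NVIDIA GPUs, assuming `nvidia-smi` is on your PATH (AMD and Apple Silicon users will need a different tool):

```python
import subprocess

# Query total VRAM per GPU via nvidia-smi (NVIDIA-only assumption).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    text=True,
)
for i, line in enumerate(out.strip().splitlines()):
    print(f"GPU {i}: {int(line)} MiB total VRAM")
```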
⚙️ What actually matters (more than people think)
1. VRAM > raw GPU performance
A faster GPU doesn’t help if the model barely fits.
2. Quantization changes everything
Q4 vs Q5 can be the difference between:
- "fits in VRAM and runs fine"
- "spills out of VRAM and becomes unusable"
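To put rough numbers on that, here's a small sketch comparing approximate weight footprints at a few quantization levels. The bits-per-weight values are ballpark figures for common GGUF-style quants, not exact numbers for any specific file:

```python
# Approximate bits per weight for common quantization levels (ballpark
# figures; real GGUF files vary slightly by quant variant).
QUANT_BITS = {"Q8": 8.5, "Q5": 5.7, "Q4": 4.9, "Q3": 3.9}

def weights_gb(params_billions: float, bits: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * 1e9 * bits / 8 / 1e9

for name, bits in QUANT_BITS.items():
    print(f"13B @ {name}: ~{weights_gb(13, bits):.1f} GB of weights")
# On an 8 GB card, Q4 (~8.0 GB) is already tight and Q5 (~9.3 GB) won't fit,
# before counting KV cache and runtime overhead.
```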
3. Model size ≠ better experience
In many real-world setups:
- 13B models feel better than 70B, simply because they're faster and more responsive
💡 What you should actually choose
If you’re deciding today:
- Casual use → 7B
- Daily use / coding / workflows → 13B
- Larger than that → consider cloud
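If you prefer that decision as code, here's a small sketch that maps VRAM to the model-size suggestion used in this post. The thresholds mirror the tiers above; they're rules of thumb, not hard limits:

```python
def suggest_model_size(vram_gb: float) -> str:
    """Map VRAM to a model-size suggestion based on this post's tiers.

    These cutoffs are rules of thumb from the breakdown above,
    not hard technical limits.
    """
    if vram_gb < 8:
        return "7B (quantized)"
    if vram_gb < 16:
        return "7B comfortably, 13B if quantized"
    if vram_gb < 24:
        return "13B comfortably, 34B with quantization"
    return "34B locally; consider cloud for 70B+"

print(suggest_model_size(12))  # "7B comfortably, 13B if quantized"
```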
📦 Where this comes from
I’ve been collecting patterns from:
- community setups
- repeated VRAM constraints
- consistent performance ranges
👉 Dataset (still evolving):
https://github.com/airdropkalami/awesome-gpu-for-llm
🔗 More detailed breakdowns
If you want deeper guides:
- https://bestgpuforllm.com/articles/how-much-vram-for-llm/
- https://bestgpuforllm.com/articles/best-gpu-for-ollama/
Final thought
Most people don’t need a bigger GPU.
They need:
- the right model size
- the right quantization
- and realistic expectations
If you’re running local LLMs, what GPU + model combo has worked best for you?