Thurmon Demich

How Much VRAM Do You *Actually* Need for Local LLMs?

TL;DR: VRAM matters more than GPU power. Most people overestimate what they need—and underestimate what actually runs well.


The confusing part about local LLMs

If you’ve tried running models locally (Ollama, llama.cpp, LM Studio, etc.), you’ve probably asked:

  • “Can my GPU run this model?”
  • “Why does it technically load but run painfully slow?”
  • “Do I need 24GB VRAM for everything?”

The answers online are inconsistent.

So instead of relying on benchmarks, I started tracking what actually works in real setups.


🧠 The simple rule most people miss

If it doesn’t fit comfortably in VRAM, it doesn’t really “run”.

Yes, you can offload layers to the CPU or spill into system RAM, but the experience degrades quickly.


📊 Practical VRAM breakdown

Here’s a simplified version of what consistently works:

🟢 Under 8GB (S-tier)

  • 7B models (quantized)
  • Good for:

    • basic chat
    • light coding help
  • Limitations:

    • struggles with longer context
    • slower responses

🟡 8–16GB (M-tier)

  • 7B → very smooth
  • 13B → usable but sometimes tight

👉 This is where most consumer GPUs sit.


🟠 16–24GB (L-tier)

  • 13B → comfortable
  • 34B → possible with quantization

👉 This is the sweet spot for serious local use.


🔴 24GB+ (XL-tier)

  • 34B → usable
  • 70B → technically possible, but often inefficient

👉 At this level, cloud often makes more sense unless you specifically need local.
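The tiers above follow from simple arithmetic: a model's weights take roughly (parameters × bits per weight ÷ 8) bytes, plus some headroom for the KV cache and runtime. Here's a minimal sketch of that rule of thumb; the function name and the flat 1.5 GB overhead are my own assumptions, not a framework API, and real usage varies with context length.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weight size plus a flat allowance for
    KV cache and runtime overhead (assumed 1.5 GB here)."""
    weights_gb = params_billion * bits_per_weight / 8  # GB ~ params * bytes/weight
    return weights_gb + overhead_gb

# A 13B model at ~4.5 effective bits/weight (Q4-ish, incl. quant metadata):
print(round(estimate_vram_gb(13, 4.5), 1))  # ~8.8 GB -> lands in the 8-16GB tier
```

This is why 13B sits in the "usable but sometimes tight" zone on an 8–16GB card, and why 7B (~5.4 GB by the same math) runs so comfortably there.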


⚙️ What actually matters (more than people think)

1. VRAM > raw GPU performance

A faster GPU doesn’t help if the model barely fits.


2. Quantization changes everything

Q4 vs Q5 can be the difference between:

  • “runs fine”
  • “completely unusable”
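That cliff is mostly about fitting: one step up in quantization can push the same model past your VRAM. A quick fit check, using approximate effective bits-per-weight values for Q4_K_M and Q5_K_M (assumed figures; check your actual GGUF file sizes):

```python
def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus a flat runtime/KV-cache overhead
    (assumed 1.5 GB) fit within the given VRAM."""
    return params_billion * bits_per_weight / 8 + overhead_gb <= vram_gb

# A 13B model on a 10 GB card:
print(fits_in_vram(13, 4.85, 10.0))  # ~Q4_K_M -> True, runs fine
print(fits_in_vram(13, 5.69, 10.0))  # ~Q5_K_M -> False, spills and crawls
```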

3. Model size ≠ better experience

In many real-world setups:

  • 13B models often feel better than 70B, simply because they’re faster and more responsive

💡 What you should actually choose

If you’re deciding today:

  • Casual use → 7B
  • Daily use / coding / workflows → 13B
  • Larger than that → consider cloud
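The decision table above can be sketched as a tiny chooser. The function name and the exact strings are illustrative only; the thresholds are the rough tier boundaries from this post, not hard limits.

```python
def suggest_model(vram_gb: float) -> str:
    """Map available VRAM to the rough tiers above (guidelines, not rules)."""
    if vram_gb < 8:
        return "7B quantized"
    if vram_gb < 16:
        return "7B (smooth) or 13B (tight)"
    if vram_gb < 24:
        return "13B comfortably, 34B quantized"
    return "34B locally; for 70B+, consider cloud"

print(suggest_model(12))  # -> "7B (smooth) or 13B (tight)"
```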

📦 Where this comes from

I’ve been collecting patterns from:

  • community setups
  • repeated VRAM constraints
  • consistent performance ranges

👉 Dataset (still evolving):
https://github.com/airdropkalami/awesome-gpu-for-llm



Final thought

Most people don’t need a bigger GPU.

They need:

  • the right model size
  • the right quantization
  • and realistic expectations

If you’re running local LLMs, what GPU + model combo has worked best for you?
