Thurmon Demich

How Much VRAM Do You *Actually* Need for Local LLMs?

TL;DR: VRAM matters more than GPU power. Most people overestimate what they need—and underestimate what actually runs well.


The confusing part about local LLMs

If you’ve tried running models locally (Ollama, llama.cpp, LM Studio, etc.), you’ve probably asked:

  • “Can my GPU run this model?”
  • “Why does it technically load but run painfully slow?”
  • “Do I need 24GB VRAM for everything?”

The answers online are inconsistent.

So instead of relying on benchmarks, I started tracking what actually works in real setups.


🧠 The simple rule most people miss

If it doesn’t fit comfortably in VRAM, it doesn’t really “run”.

Yes, you can offload layers to the CPU or spill into system RAM, but the experience degrades quickly.


📊 Practical VRAM breakdown

Here’s a simplified version of what consistently works:

🟢 Under 8GB (S-tier)

  • 7B models (quantized)
  • Good for:

    • basic chat
    • light coding help
  • Limitations:

    • struggles with longer context
    • slower responses

🟡 8–16GB (M-tier)

  • 7B → very smooth
  • 13B → usable but sometimes tight

👉 This is where most consumer GPUs sit.


🟠 16–24GB (L-tier)

  • 13B → comfortable
  • 34B → possible with quantization

👉 This is the sweet spot for serious local use.


🔴 24GB+ (XL-tier)

  • 34B → usable
  • 70B → technically possible, but often inefficient

👉 At this level, cloud often makes more sense unless you specifically need local.
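The tiers above follow from simple arithmetic: a model's weights take roughly (parameters × bits per weight ÷ 8) bytes, plus some headroom for the KV cache and runtime. Here's a minimal sketch of that rule of thumb; the function name and the flat 1.5 GB overhead are my own assumptions, not a framework API, and real usage varies with context length.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weight size plus a flat allowance for
    KV cache and runtime overhead (assumed 1.5 GB here)."""
    weights_gb = params_billion * bits_per_weight / 8  # GB ~ params * bytes/weight
    return weights_gb + overhead_gb

# A 13B model at ~4.5 effective bits/weight (Q4-ish, incl. quant metadata):
print(round(estimate_vram_gb(13, 4.5), 1))  # ~8.8 GB -> lands in the 8-16GB tier
```

This is why 13B sits in the "usable but sometimes tight" zone on an 8–16GB card, and why 7B (~5.4 GB by the same math) runs so comfortably there.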


⚙️ What actually matters (more than people think)

1. VRAM > raw GPU performance

A faster GPU doesn’t help if the model barely fits.


2. Quantization changes everything

Q4 vs Q5 can be the difference between:

  • “runs fine”
  • “completely unusable”
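That cliff is mostly about fitting: one step up in quantization can push the same model past your VRAM. A quick fit check, using approximate effective bits-per-weight values for Q4_K_M and Q5_K_M (assumed figures; check your actual GGUF file sizes):

```python
def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus a flat runtime/KV-cache overhead
    (assumed 1.5 GB) fit within the given VRAM."""
    return params_billion * bits_per_weight / 8 + overhead_gb <= vram_gb

# A 13B model on a 10 GB card:
print(fits_in_vram(13, 4.85, 10.0))  # ~Q4_K_M -> True, runs fine
print(fits_in_vram(13, 5.69, 10.0))  # ~Q5_K_M -> False, spills and crawls
```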

3. Model size ≠ better experience

In many real-world setups:

  • 13B models often feel better than 70B, simply because they’re faster and more responsive

💡 What you should actually choose

If you’re deciding today:

  • Casual use → 7B
  • Daily use / coding / workflows → 13B
  • Larger than that → consider cloud
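The decision table above can be sketched as a tiny chooser. The function name and the exact strings are illustrative only; the thresholds are the rough tier boundaries from this post, not hard limits.

```python
def suggest_model(vram_gb: float) -> str:
    """Map available VRAM to the rough tiers above (guidelines, not rules)."""
    if vram_gb < 8:
        return "7B quantized"
    if vram_gb < 16:
        return "7B (smooth) or 13B (tight)"
    if vram_gb < 24:
        return "13B comfortably, 34B quantized"
    return "34B locally; for 70B+, consider cloud"

print(suggest_model(12))  # -> "7B (smooth) or 13B (tight)"
```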

📦 Where this comes from

I’ve been collecting patterns from:

  • community setups
  • repeated VRAM constraints
  • consistent performance ranges

👉 Dataset (still evolving):
https://github.com/airdropkalami/awesome-gpu-for-llm



Final thought

Most people don’t need a bigger GPU.

They need:

  • the right model size
  • the right quantization
  • and realistic expectations

If you’re running local LLMs, what GPU + model combo has worked best for you?
