TL;DR: Most GPU advice for LLMs is either outdated or too generic. I started collecting real-world data (VRAM, model fit, tokens/sec), and the patterns are surprisingly consistent.
Why I built this
If you’ve tried running LLMs locally (Ollama, llama.cpp, vLLM), you’ve probably hit this problem:
- “Can my GPU run this model?”
- “Why does 13B barely fit, and why is it so slow?”
- “Do I really need 24GB VRAM?”
The answers online are all over the place.
So I started putting together a small dataset based on:
- community benchmarks
- VRAM constraints reported consistently across setups
- repeated patterns across setups
👉 Repo: https://github.com/airdropkalami/awesome-gpu-for-llm
The pattern is simpler than people think
After aggregating the data, a few rules kept showing up.
🧠 Practical rules (that actually work)
- 7B models → run on most GPUs
- 13B models → need ~16GB for comfortable use
- 34B models → require 24GB-class GPUs
- 70B models → usually better on cloud
These aren’t theoretical — they show up consistently across different frameworks.
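To make the rules concrete, here's a tiny Python sketch that encodes them as a lookup. The thresholds are just the rough figures above; the 70B row has no hard number in my data, so the 48GB there is purely my placeholder assumption:

```python
# Rough rule-of-thumb lookup for Q4-quantized models, based on the
# thresholds above. Ballpark figures, not exact requirements.
RULES = [
    (7,  8,  "runs on most GPUs"),
    (13, 16, "needs ~16GB for comfortable use"),
    (34, 24, "wants a 24GB-class GPU"),
    (70, 48, "usually better on cloud"),  # 48GB is a placeholder assumption
]

def recommend(model_size_b: float, vram_gb: float) -> str:
    """Quick verdict for a model size (billions of params) on a given GPU."""
    for size, needed, note in RULES:
        if model_size_b <= size:
            verdict = "fits" if vram_gb >= needed else "tight / won't fit"
            return f"{model_size_b}B on {vram_gb}GB: {verdict} ({note})"
    return f"{model_size_b}B: beyond single consumer GPUs"

print(recommend(13, 8))   # 13B on 8GB: tight / won't fit (needs ~16GB ...)
print(recommend(13, 24))  # 13B on 24GB: fits (...)
```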
VRAM matters more than compute
One thing became obvious quickly:
If a model doesn’t fit in VRAM, it either won’t load or falls back to CPU offloading at a fraction of the speed.
You can optimize speed, but you can’t “optimize around” missing VRAM.
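Here's a back-of-the-envelope sketch of why fit dominates: at Q4 the weights alone cost roughly half a byte per parameter, and the KV cache plus runtime overhead sit on top. The KV-cache and overhead constants below are assumptions for a typical llama.cpp-style setup, not measured values:

```python
def estimate_vram_gb(params_b: float, quant_bits: int = 4,
                     ctx_len: int = 4096, overhead_gb: float = 1.0) -> float:
    """Very rough VRAM estimate: quantized weights + KV cache + overhead.

    The KV-cache term assumes roughly 0.5 GB per 4k context for a 7B model,
    scaled linearly with model size. That's a crude assumption; real usage
    depends on architecture, context length, and backend.
    """
    weights_gb = params_b * quant_bits / 8                 # 7B @ Q4 ~= 3.5 GB
    kv_cache_gb = 0.5 * (params_b / 7) * (ctx_len / 4096)  # assumed scaling
    return weights_gb + kv_cache_gb + overhead_gb

for size in (7, 13, 34, 70):
    print(f"{size}B @ Q4, 4k ctx: ~{estimate_vram_gb(size):.1f} GB")
# ~5.0, ~8.4, ~20.4, ~41.0 GB, roughly in line with the thresholds above
```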
Real-world dataset (simplified)
Here’s a small snapshot from what I collected:
| GPU | VRAM | Model | Quant | tok/s | Fit |
|---|---|---|---|---|---|
| RTX 4060 | 8GB | 7B | Q4 | ~35 | ✅ |
| RTX 4060 | 8GB | 13B | Q4 | ~18 | ⚠️ |
| RTX 4090 | 24GB | 13B | Q4 | ~45 | ✅ |
| RTX 4090 | 24GB | 34B | Q4 | ~25 | ⚠️ |
| RTX 3090 | 24GB | 34B | Q4 | ~22 | ⚠️ |
👉 Full dataset (updated):
https://github.com/airdropkalami/awesome-gpu-for-llm/blob/main/benchmark/dataset.md
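If you want to poke at the snapshot programmatically, a minimal sketch like this works. The rows are the same rough community numbers as the table, and `clean_fits` is just a hypothetical helper name, not part of the repo:

```python
# The table above as data, so it can be filtered. Values are rough
# community numbers from the snapshot, not lab-grade benchmarks.
ROWS = [
    {"gpu": "RTX 4060", "vram": 8,  "model": "7B",  "quant": "Q4", "tok_s": 35, "fit": "ok"},
    {"gpu": "RTX 4060", "vram": 8,  "model": "13B", "quant": "Q4", "tok_s": 18, "fit": "tight"},
    {"gpu": "RTX 4090", "vram": 24, "model": "13B", "quant": "Q4", "tok_s": 45, "fit": "ok"},
    {"gpu": "RTX 4090", "vram": 24, "model": "34B", "quant": "Q4", "tok_s": 25, "fit": "tight"},
    {"gpu": "RTX 3090", "vram": 24, "model": "34B", "quant": "Q4", "tok_s": 22, "fit": "tight"},
]

def clean_fits(model: str):
    """All GPUs in the snapshot that run `model` without a caveat."""
    return [r["gpu"] for r in ROWS if r["model"] == model and r["fit"] == "ok"]

print(clean_fits("13B"))  # ['RTX 4090']
```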
What surprised me
1. 13B is the real sweet spot
Not 7B. Not 70B.
13B gives the best balance of:
- quality
- speed
- hardware cost
2. 24GB is a hard ceiling for most users
Once you go beyond that:
- cost explodes
- scaling becomes inefficient
- cloud often makes more sense
3. Benchmarks don’t reflect real usage
A lot of GPU comparisons focus on:
- FLOPS
- synthetic benchmarks
But for LLMs:
VRAM > everything else
When NOT to buy a GPU
This is where most people overspend.
If you:
- want to run 70B models
- only experiment occasionally
- don’t specifically need local inference
👉 you’re better off using cloud GPUs.
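A quick way to sanity-check this is simple break-even math. Both prices below are placeholder assumptions; swap in your actual GPU and cloud rates:

```python
# Rough break-even between buying a GPU and renting cloud hours.
GPU_PRICE_USD = 1800        # e.g. a 24GB card (assumed price)
CLOUD_PER_HOUR_USD = 1.20   # e.g. a 24GB-class cloud instance (assumed rate)

breakeven_hours = GPU_PRICE_USD / CLOUD_PER_HOUR_USD
print(f"Break-even after ~{breakeven_hours:.0f} GPU-hours")
# ~1500 hours: at a few hours of experimenting per week,
# that's years of usage before the card pays for itself.
```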
If you want a deeper breakdown
I wrote more detailed guides here:
- https://bestgpuforllm.com/articles/best-gpu-for-ollama/
- https://bestgpuforllm.com/articles/how-much-vram-for-llm/
Why I’m sharing this
I’m still expanding the dataset, and I’m trying to keep it:
- practical (not theoretical)
- consistent across setups
- easy to use for decision making
If you have real benchmark data or setups, feel free to contribute.
Final thought
The “best GPU” isn’t the fastest.
It’s the one that:
- fits your model
- matches your budget
- and actually works in your setup
If you’re building with local LLMs, I’d love to know what GPU + model combo you’re running.