Thurmon Demich

I Built a GPU Dataset for LLM Inference — Here’s What I Learned

TL;DR: Most GPU advice for LLMs is either outdated or too generic. I started collecting real-world data (VRAM, model fit, tokens/sec), and the patterns are surprisingly consistent.


Why I built this

If you’ve tried running LLMs locally (Ollama, llama.cpp, vLLM), you’ve probably hit this problem:

  • “Can my GPU run this model?”
  • “Why does 13B barely fit and still run so slow?”
  • “Do I really need 24GB VRAM?”

The answers online are all over the place.

So I started putting together a small dataset based on:

  • community benchmarks
  • consistent VRAM constraints
  • repeated patterns across setups

👉 Repo: https://github.com/airdropkalami/awesome-gpu-for-llm


The pattern is simpler than people think

After aggregating the data, a few rules kept showing up.

🧠 Practical rules (that actually work)

  • 7B models → run on most GPUs
  • 13B models → need ~16GB for comfortable use
  • 34B models → require 24GB-class GPUs
  • 70B models → usually better on cloud

These aren’t theoretical — they show up consistently across different frameworks.
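If you want to sanity-check those numbers yourself, here's a rough back-of-the-envelope sketch. The bytes-per-weight values and the 1.2× overhead factor (KV cache, activations, framework buffers) are my own assumptions, not part of the dataset:

```python
# Rough VRAM estimate: weights ≈ params × bytes-per-weight, plus ~20% overhead
# for KV cache, activations and framework buffers (the 1.2x factor is an
# assumption, not a measured value).
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q4": 0.55}  # Q4 ≈ 4.5 bits/weight

def estimated_vram_gb(params_billions: float, quant: str = "q4",
                      overhead: float = 1.2) -> float:
    return params_billions * BYTES_PER_WEIGHT[quant] * overhead

for size in (7, 13, 34, 70):
    print(f"{size}B @ Q4 -> ~{estimated_vram_gb(size):.1f} GB")
# 7B  -> ~4.6 GB   (fits on most 8GB cards)
# 13B -> ~8.6 GB   (comfortable with 16GB once you add context)
# 34B -> ~22.4 GB  (needs a 24GB-class card)
# 70B -> ~46.2 GB  (beyond a single consumer GPU)
```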


VRAM matters more than compute

One thing became obvious quickly:

If it doesn’t fit in VRAM, it doesn’t run.

You can optimize speed, but you can’t “optimize around” missing VRAM.
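A minimal fit check looks like this. It queries free VRAM with nvidia-smi and compares it against the estimate from the sketch above (assumes an NVIDIA GPU with nvidia-smi on PATH):

```python
# Compare free VRAM (reported by nvidia-smi) against the rough estimate above.
# Assumes nvidia-smi is on PATH and estimated_vram_gb() from the previous sketch.
import subprocess

def free_vram_gb(gpu_index: int = 0) -> float:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return float(out[gpu_index]) / 1024  # nvidia-smi reports MiB

if free_vram_gb() < estimated_vram_gb(13, "q4"):
    print("13B Q4 probably won't fit; expect offloading or an OOM error")
```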


Real-world dataset (simplified)

Here’s a small snapshot from what I collected:

| GPU      | VRAM | Model | Quant | tok/s | Fit |
|----------|------|-------|-------|-------|-----|
| RTX 4060 | 8GB  | 7B    | Q4    | ~35   | ✅  |
| RTX 4060 | 8GB  | 13B   | Q4    | ~18   | ⚠️  |
| RTX 4090 | 24GB | 13B   | Q4    | ~45   | ✅  |
| RTX 4090 | 24GB | 34B   | Q4    | ~25   | ⚠️  |
| RTX 3090 | 24GB | 34B   | Q4    | ~22   | ⚠️  |

👉 Full dataset (updated):
https://github.com/airdropkalami/awesome-gpu-for-llm/blob/main/benchmark/dataset.md
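If you want to query a snapshot like this programmatically, here's one way to filter it by VRAM budget. The benchmarks.csv file and column names are hypothetical; the repo keeps the data as a markdown table, so export it however you like:

```python
# Filter benchmark rows by VRAM budget. File name and column names are
# hypothetical; adapt them to however you export the dataset from the repo.
import csv

def combos_that_fit(path: str, budget_gb: int):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["vram_gb"]) <= budget_gb and row["fit"] != "⚠️":
                yield f'{row["gpu"]}: {row["model"]} {row["quant"]} @ {row["tok_s"]} tok/s'

for combo in combos_that_fit("benchmarks.csv", budget_gb=24):
    print(combo)
```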


What surprised me

1. 13B is the real sweet spot

Not 7B. Not 70B.

13B gives the best balance of:

  • quality
  • speed
  • hardware cost

2. 24GB is a hard ceiling for most users

Once you go beyond that:

  • cost explodes
  • scaling becomes inefficient
  • cloud often makes more sense

3. Benchmarks don’t reflect real usage

A lot of GPU comparisons focus on:

  • FLOPS
  • synthetic benchmarks

But for LLMs:

VRAM > everything else

When NOT to buy a GPU

This is where most people overspend.

If you:

  • want to run 70B models
  • only experiment occasionally
  • don’t need local inference

👉 you’re better off using cloud GPUs.
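A quick break-even calculation makes the point. The prices below are placeholders, not quotes from any provider:

```python
# Back-of-the-envelope break-even: how many cloud GPU-hours the price of a
# local card buys you. All prices are illustrative placeholders.
LOCAL_GPU_COST = 1800        # e.g. a 24GB-class card, USD
CLOUD_RATE_PER_HOUR = 0.80   # e.g. a rented 24GB-class GPU, USD/hour

break_even_hours = LOCAL_GPU_COST / CLOUD_RATE_PER_HOUR
print(f"Break-even after ~{break_even_hours:.0f} GPU-hours")  # ~2250 hours

# At 2 hours of inference a day, that's roughly three years of occasional use,
# which is why renting usually wins if you only experiment now and then.
```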


If you want a deeper breakdown

I wrote more detailed guides here:


Why I’m sharing this

I’m still expanding the dataset, and I’m trying to keep it:

  • practical (not theoretical)
  • consistent across setups
  • easy to use for decision making

If you have real benchmark data or setups, feel free to contribute.


Final thought

The “best GPU” isn’t the fastest.

It’s the one that:

  • fits your model
  • matches your budget
  • and actually works in your setup

If you’re building with local LLMs, I’d love to know what GPU + model combo you’re running.
