Thurmon Demich

I Built a GPU Dataset for LLM Inference — Here’s What I Learned

TL;DR: Most GPU advice for LLMs is either outdated or too generic. I started collecting real-world data (VRAM, model fit, tokens/sec), and the patterns are surprisingly consistent.


Why I built this

If you’ve tried running LLMs locally (Ollama, llama.cpp, vLLM), you’ve probably hit this problem:

  • “Can my GPU run this model?”
  • “Why does 13B barely fit and still run so slow?”
  • “Do I really need 24GB VRAM?”

The answers online are all over the place.

So I started putting together a small dataset based on:

  • community benchmarks
  • consistent VRAM constraints
  • repeated patterns across setups

👉 Repo: https://github.com/airdropkalami/awesome-gpu-for-llm


The pattern is simpler than people think

After aggregating the data, a few rules kept showing up.

🧠 Practical rules (that actually work)

  • 7B models → run on most GPUs
  • 13B models → need ~16GB for comfortable use
  • 34B models → require 24GB-class GPUs
  • 70B models → usually better on cloud

These aren’t theoretical — they show up consistently across different frameworks.
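If you want to sanity-check those numbers yourself, here's a rough back-of-the-envelope sketch. The bytes-per-weight values and the 1.2× overhead factor (KV cache, activations, framework buffers) are my own assumptions, not part of the dataset:

```python
# Rough VRAM estimate: weights ≈ params × bytes-per-weight, plus ~20% overhead
# for KV cache, activations and framework buffers (the 1.2x factor is an
# assumption, not a measured value).
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q4": 0.55}  # Q4 ≈ 4.5 bits/weight

def estimated_vram_gb(params_billions: float, quant: str = "q4",
                      overhead: float = 1.2) -> float:
    return params_billions * BYTES_PER_WEIGHT[quant] * overhead

for size in (7, 13, 34, 70):
    print(f"{size}B @ Q4 -> ~{estimated_vram_gb(size):.1f} GB")
# 7B  -> ~4.6 GB   (fits on most 8GB cards)
# 13B -> ~8.6 GB   (comfortable with 16GB once you add context)
# 34B -> ~22.4 GB  (needs a 24GB-class card)
# 70B -> ~46.2 GB  (beyond a single consumer GPU)
```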


VRAM matters more than compute

One thing became obvious quickly:

If it doesn’t fit in VRAM, it doesn’t run.

You can optimize speed, but you can’t “optimize around” missing VRAM.
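A minimal fit check looks like this. It queries free VRAM with nvidia-smi and compares it against the estimate from the sketch above (assumes an NVIDIA GPU with nvidia-smi on PATH):

```python
# Compare free VRAM (reported by nvidia-smi) against the rough estimate above.
# Assumes nvidia-smi is on PATH and estimated_vram_gb() from the previous sketch.
import subprocess

def free_vram_gb(gpu_index: int = 0) -> float:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return float(out[gpu_index]) / 1024  # nvidia-smi reports MiB

if free_vram_gb() < estimated_vram_gb(13, "q4"):
    print("13B Q4 probably won't fit; expect offloading or an OOM error")
```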


Real-world dataset (simplified)

Here’s a small snapshot from what I collected:

| GPU      | VRAM | Model | Quant | tok/s | Fit |
|----------|------|-------|-------|-------|-----|
| RTX 4060 | 8GB  | 7B    | Q4    | ~35   | ✅  |
| RTX 4060 | 8GB  | 13B   | Q4    | ~18   | ⚠️  |
| RTX 4090 | 24GB | 13B   | Q4    | ~45   | ✅  |
| RTX 4090 | 24GB | 34B   | Q4    | ~25   | ⚠️  |
| RTX 3090 | 24GB | 34B   | Q4    | ~22   | ⚠️  |

👉 Full dataset (updated):
https://github.com/airdropkalami/awesome-gpu-for-llm/blob/main/benchmark/dataset.md
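If you want to query a snapshot like this programmatically, here's one way to filter it by VRAM budget. The benchmarks.csv file and column names are hypothetical; the repo keeps the data as a markdown table, so export it however you like:

```python
# Filter benchmark rows by VRAM budget. File name and column names are
# hypothetical; adapt them to however you export the dataset from the repo.
import csv

def combos_that_fit(path: str, budget_gb: int):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["vram_gb"]) <= budget_gb and row["fit"] != "⚠️":
                yield f'{row["gpu"]}: {row["model"]} {row["quant"]} @ {row["tok_s"]} tok/s'

for combo in combos_that_fit("benchmarks.csv", budget_gb=24):
    print(combo)
```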


What surprised me

1. 13B is the real sweet spot

Not 7B. Not 70B.

13B gives the best balance of:

  • quality
  • speed
  • hardware cost

2. 24GB is a hard ceiling for most users

Once you go beyond that:

  • cost explodes
  • scaling becomes inefficient
  • cloud often makes more sense

3. Benchmarks don’t reflect real usage

A lot of GPU comparisons focus on:

  • FLOPS
  • synthetic benchmarks

But for LLMs:

VRAM > everything else

When NOT to buy a GPU

This is where most people overspend.

If you:

  • want to run 70B models
  • only experiment occasionally
  • don’t need local inference

👉 you’re better off using cloud GPUs.
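A quick break-even calculation makes the point. The prices below are placeholders, not quotes from any provider:

```python
# Back-of-the-envelope break-even: how many cloud GPU-hours the price of a
# local card buys you. All prices are illustrative placeholders.
LOCAL_GPU_COST = 1800        # e.g. a 24GB-class card, USD
CLOUD_RATE_PER_HOUR = 0.80   # e.g. a rented 24GB-class GPU, USD/hour

break_even_hours = LOCAL_GPU_COST / CLOUD_RATE_PER_HOUR
print(f"Break-even after ~{break_even_hours:.0f} GPU-hours")  # ~2250 hours

# At 2 hours of inference a day, that's roughly three years of occasional use,
# which is why renting usually wins if you only experiment now and then.
```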


If you want a deeper breakdown

I wrote more detailed guides here:


Why I’m sharing this

I’m still expanding the dataset, and I’m trying to keep it:

  • practical (not theoretical)
  • consistent across setups
  • easy to use for decision making

If you have real benchmark data or setups, feel free to contribute.


Final thought

The “best GPU” isn’t the fastest.

It’s the one that:

  • fits your model
  • matches your budget
  • and actually works in your setup

If you’re building with local LLMs, I’d love to know what GPU + model combo you’re running.
