Choosing the right GPU for ML is confusing. Marketing specs don't tell you what matters for training and inference. Here's what actually counts.
The Four Specs That Matter
1. VRAM (Most Important)
VRAM determines what models you can run. No amount of compute power helps if your model doesn't fit in memory.
| VRAM | What Fits (Inference) | What Fits (Training) |
|---|---|---|
| 8 GB | 7B at Q4 | 7B QLoRA |
| 12 GB | 13B at Q4 | 7B QLoRA comfortably |
| 16 GB | 24B at Q4 | 13B QLoRA |
| 24 GB | 34B at Q5 | 13B full fine-tune, 34B QLoRA |
| 48 GB | 70B at Q4 | 34B full fine-tune |
| 80 GB | 70B at Q8 (FP16 needs ~140 GB) | 70B QLoRA |
Rule of thumb: buy the most VRAM you can afford. You can't upgrade VRAM later.
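As a back-of-the-envelope check, you can estimate inference VRAM from parameter count and quantization level. A rough sketch — the 20% overhead for KV cache and activations is an assumption, and it grows with context length:

```python
def vram_needed_gb(params_billion: float, bits_per_weight: float,
                   overhead: float = 1.2) -> float:
    """Rough inference VRAM estimate: weight bytes plus ~20% headroom
    for KV cache and activations (assumed; grows with context length)."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb * overhead

print(round(vram_needed_gb(7, 4.5), 1))   # 7B at Q4 (~4.5 bits effective): 4.7 -> fits in 8 GB
print(round(vram_needed_gb(70, 4.5), 1))  # 70B at Q4: 47.2 -> needs a 48 GB setup
```

Run it for any model you're eyeing before you buy — the 20% headroom assumption is conservative for short contexts and optimistic for very long ones.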
2. Memory Bandwidth
For LLM inference, throughput is limited by how fast you can read model weights from VRAM. This is the memory bandwidth spec.
| GPU | Bandwidth | Llama 8B Q4 tok/s |
|---|---|---|
| RTX 4060 | 272 GB/s | ~35 |
| RTX 4070 | 504 GB/s | ~60 |
| RTX 3090 | 936 GB/s | ~85 |
| RTX 4090 | 1,008 GB/s | ~105 |
| A100 80GB | 2,039 GB/s | ~180 |
| H100 | 3,350 GB/s | ~300 |
Higher bandwidth means faster token generation. This is why a 3090 (936 GB/s) feels faster for LLMs than a newer 4070 Ti (504 GB/s).
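You can estimate this ceiling yourself: each generated token has to read every weight from VRAM once, so peak single-stream tokens/sec is roughly bandwidth divided by model size. A hedged sketch — the 4.5 GB weight size for Llama 8B Q4 is an assumption, and real throughput lands at maybe 40-60% of the ceiling once kernel overheads and KV-cache reads are counted:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical single-stream decode ceiling: every generated token
    reads the full set of model weights from VRAM once."""
    return bandwidth_gb_s / model_gb

# Llama 8B at Q4 is roughly 4.5 GB of weights (assumed size):
print(round(decode_ceiling_tok_s(936, 4.5)))   # RTX 3090 ceiling: 208
print(round(decode_ceiling_tok_s(1008, 4.5)))  # RTX 4090 ceiling: 224
```

The table's ~85 and ~105 tok/s are well below these ceilings, which is normal — but the ratio between any two GPUs tracks the bandwidth ratio closely.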
3. Tensor Cores
Tensor Cores accelerate matrix multiplication — the core operation in neural networks. They matter most for training.
| Generation | CC | Supported Precisions |
|---|---|---|
| 1st (Volta) | 7.0 | FP16 |
| 2nd (Turing) | 7.5 | FP16, INT8, INT4 |
| 3rd (Ampere) | 8.x | FP16, BF16, TF32, INT8 |
| 4th (Ada/Hopper) | 8.9 / 9.0 | FP16, BF16, TF32, FP8, INT8 |
| 5th (Blackwell) | 10.0 | All above + FP4 |
BF16 support (Ampere+) is especially important — it's the default training precision for modern models and avoids the NaN issues that FP16 can cause.
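In practice this means choosing the training dtype from the compute capability pair that `torch.cuda.get_device_capability()` returns. A minimal pure-Python sketch, assuming the CC thresholds in the table above:

```python
def pick_training_dtype(major: int, minor: int) -> str:
    """Map CUDA compute capability to a sensible training precision.
    BF16 needs CC 8.0+ (Ampere); FP16 tensor cores need CC 7.0+ (Volta)."""
    if (major, minor) >= (8, 0):
        return "bf16"  # wide dynamic range, no gradient scaling needed
    if (major, minor) >= (7, 0):
        return "fp16"  # pair with gradient scaling to avoid NaN losses
    return "fp32"      # no usable tensor cores; train in full precision

print(pick_training_dtype(8, 6))  # RTX 3090 (Ampere) -> bf16
print(pick_training_dtype(7, 5))  # RTX 2080 Ti (Turing) -> fp16
```

Tuple comparison handles the major/minor split correctly, so (8, 6) and (8, 9) both land on BF16 while (7, 5) falls back to FP16.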
4. CUDA Compute Capability
CC determines what frameworks and features your GPU supports. As of 2026:
- Minimum CC 5.0 for PyTorch/TensorFlow
- CC 7.0+ for Tensor Cores
- CC 8.0+ for Flash Attention, BF16
- CC 8.9+ for FP8
You can look up any GPU's compute capability at gpuark.com.
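Those thresholds are easy to encode as a lookup — pass in the (major, minor) pair for your card. A sketch assuming the gates listed above hold for your framework version:

```python
def supported_features(major: int, minor: int) -> list[str]:
    """Feature gates by CUDA compute capability (thresholds as listed above)."""
    cc = (major, minor)
    feats = []
    if cc >= (5, 0):
        feats.append("PyTorch/TensorFlow")
    if cc >= (7, 0):
        feats.append("Tensor Cores")
    if cc >= (8, 0):
        feats += ["Flash Attention", "BF16"]
    if cc >= (8, 9):
        feats.append("FP8")
    return feats

print(supported_features(8, 6))  # RTX 3090: everything except FP8
```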
GPU Recommendations by Budget
Under $400: RTX 4060 Ti 16GB
- 16 GB VRAM — runs 24B models at Q4
- CC 8.9 (Ada Lovelace) — all modern features
- 165W TDP — low power
- Limitation: 128-bit bus, 288 GB/s bandwidth (slow for LLMs)
$500-700: Used RTX 3090
- 24 GB VRAM — the sweet spot
- CC 8.6 — BF16, Flash Attention, everything you need
- 936 GB/s bandwidth — fast LLM inference
- 350W TDP — needs a beefy PSU
- Best value in ML GPUs right now
$1,500-1,800: RTX 4090
- 24 GB VRAM (same as 3090)
- 2× training throughput vs 3090
- Better power efficiency
- CC 8.9 — FP8 support
$3,000-5,000: Used A100 40GB/80GB
- Professional GPU with ECC memory
- 80GB version fits 70B at FP16
- 2 TB/s bandwidth
- NVLink support for multi-GPU
- Best for research labs and startups
Common Mistakes
"More CUDA cores = better for ML"
Not quite. Core counts don't compare across generations, and cores usually aren't the bottleneck anyway. The 3090 (10,496 cores) beats the newer 4070 (5,888 cores) for ML, but what makes the difference is its 24 GB VRAM and 936 GB/s bandwidth, not the core count.
"I need the latest generation"
The RTX 3090 (2020) is still one of the best ML GPUs in 2026. Unless you specifically need FP8 or newer features, older high-end cards often beat newer mid-range ones.
"Gaming benchmarks predict ML performance"
Gaming and ML stress different parts of the GPU: games care about rasterization, clocks, and cache, while ML cares about VRAM, bandwidth, and tensor throughput. A GPU that's 20% faster in games might be 50% slower for training if it has less VRAM or lower bandwidth.
"I'll just use the cloud"
Cloud GPUs cost $1-4/hour. If you train regularly, a $700 used 3090 pays for itself in ~3-6 months compared to cloud rentals.
Quick Decision Matrix
| Priority | Best Choice | Why |
|---|---|---|
| Max VRAM per $ | Used RTX 3090 | 24GB at ~$650 |
| Training speed | RTX 4090 | 2× faster than 3090 |
| Inference tok/s | RTX 3090 or 4090 | Best bandwidth at consumer price |
| LLM 70B+ | 2× Used 3090 | 48GB for ~$1,300 |
| Professional | A100 80GB | 80GB, NVLink, ECC |
Building an ML rig? Drop your budget and use case in the comments — happy to help pick components!