DEV Community

Max Vyaznikov


A Developer's Guide to Choosing a GPU for Machine Learning in 2025-2026

#ai

Choosing the right GPU for ML is confusing. Marketing specs don't tell you what matters for training and inference. Here's what actually counts.

The Four Specs That Matter

1. VRAM (Most Important)

VRAM determines what models you can run. No amount of compute power helps if your model doesn't fit in memory.

| VRAM | What Fits (Inference) | What Fits (Training) |
|-------|----------------------|----------------------|
| 8 GB  | 7B at Q4  | 7B QLoRA |
| 12 GB | 13B at Q4 | 7B QLoRA comfortably |
| 16 GB | 24B at Q4 | 13B QLoRA |
| 24 GB | 34B at Q5 | 13B full fine-tune, 34B QLoRA |
| 48 GB | 70B at Q4 | 34B full fine-tune |
| 80 GB | 70B at Q8 | 70B QLoRA |

Rule of thumb: buy the most VRAM you can afford. You can't upgrade VRAM later.
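A quick way to sanity-check the table: weight memory is parameter count times bytes per parameter. The ~20% overhead factor for KV cache and activations is a loose assumption, and `vram_gb` is a hypothetical helper, not a library function:

```python
def vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """Rough inference VRAM estimate: weights plus ~20% overhead
    for KV cache and activations (a simplifying assumption)."""
    weight_gb = params_b * bits / 8  # billions of params * bytes per param
    return weight_gb * overhead

# 7B model at 4-bit quantization -> comfortably inside 8 GB
print(round(vram_gb(7, 4), 1))
```

The same arithmetic shows why 70B at FP16 (70 × 2 bytes = 140 GB of weights alone) needs two 80 GB cards, not one.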

2. Memory Bandwidth

For LLM inference, throughput is limited by how fast you can read model weights from VRAM. This is the memory bandwidth spec.

| GPU | Bandwidth | Llama 8B Q4 tok/s |
|-----|-----------|-------------------|
| RTX 4060  | 272 GB/s   | ~35  |
| RTX 4070  | 504 GB/s   | ~60  |
| RTX 3090  | 936 GB/s   | ~85  |
| RTX 4090  | 1,008 GB/s | ~105 |
| A100 80GB | 2,039 GB/s | ~180 |
| H100      | 3,350 GB/s | ~300 |
Higher bandwidth = faster token generation. This is why a 3090 feels faster for LLMs than a 4070 Ti despite being older.
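You can estimate this ceiling yourself: each generated token reads every weight once, so bandwidth divided by model size bounds tok/s. The 4.5 GB figure for Llama 8B at Q4 is an approximation, and real throughput lands well below the bound because of kernel overheads and the KV cache:

```python
def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Bandwidth-bound decode ceiling: generating one token requires
    streaming all model weights from VRAM once."""
    return bandwidth_gb_s / model_gb

# RTX 3090 with Llama 8B at Q4 (~4.5 GB of weights, an assumption)
print(round(max_tokens_per_s(936, 4.5)))
```

The ceiling comes out around 208 tok/s versus the ~85 measured above; real systems reach maybe 40-50% of the theoretical bound, but the ranking between GPUs tracks bandwidth closely.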

3. Tensor Cores

Tensor Cores accelerate matrix multiplication — the core operation in neural networks. They matter most for training.

| Generation | CC | Supported Precisions |
|------------|-----|----------------------|
| 1st (Volta)     | 7.0  | FP16 |
| 2nd (Turing)    | 7.5  | FP16, INT8, INT4 |
| 3rd (Ampere)    | 8.x  | FP16, BF16, TF32, INT8 |
| 4th (Ada)       | 8.9  | FP16, BF16, TF32, FP8, INT8 |
| 5th (Blackwell) | 10.0 | All above + FP4 |

BF16 support (Ampere+) is especially important — it's the default training precision for modern models and avoids the NaN issues that FP16 can cause.
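You can see the range difference without a GPU by round-tripping values through Python's `struct` module (a simulation of the two formats, not how training frameworks actually cast). BF16 keeps FP32's 8-bit exponent, so it trades precision for range, which is exactly what gradient values need:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision ('e' format, max ~65504)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_bf16(x: float) -> float:
    """Simulate BF16 by truncating an FP32 bit pattern to its top 16 bits."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

# A loss-scale-sized value: fine in BF16, outside FP16's range
print(to_bf16(1e5))  # close to 1e5 (BF16 keeps FP32's exponent range)
try:
    print(to_fp16(1e5))
except (OverflowError, struct.error):
    print("FP16 overflow")  # 1e5 > 65504, FP16's maximum
```

This is why FP16 training needs loss scaling while BF16 usually doesn't: large intermediate values simply don't fit in FP16.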

4. CUDA Compute Capability

CC determines what frameworks and features your GPU supports. As of 2026:

  • Minimum CC 5.0 for PyTorch/TensorFlow
  • CC 7.0+ for Tensor Cores
  • CC 8.0+ for Flash Attention, BF16
  • CC 8.9+ for FP8

You can look up any GPU's compute capability at gpuark.com.
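If you already know a card's CC, a simple lookup reproduces the list above. On a machine with the GPU installed, `torch.cuda.get_device_capability()` returns this same `(major, minor)` tuple; the `features` helper here is an illustrative sketch, not an exhaustive support matrix:

```python
def features(cc: tuple[int, int]) -> list[str]:
    """Map a CUDA compute capability to the feature tiers listed above."""
    feats = []
    if cc >= (5, 0):
        feats.append("PyTorch/TensorFlow")
    if cc >= (7, 0):
        feats.append("Tensor Cores")
    if cc >= (8, 0):
        feats.append("BF16 + Flash Attention")
    if cc >= (8, 9):
        feats.append("FP8")
    return feats

print(features((8, 6)))  # RTX 3090: everything except FP8
```

Tuple comparison handles the version ordering for free: `(8, 6) >= (8, 0)` but `(8, 6) < (8, 9)`.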

GPU Recommendations by Budget

Under $400: RTX 4060 Ti 16GB

  • 16 GB VRAM — runs 24B models at Q4
  • CC 8.9 (Ada Lovelace) — all modern features
  • 165W TDP — low power
  • Limitation: 128-bit bus, 288 GB/s bandwidth (slow for LLMs)

$500-700: Used RTX 3090

  • 24 GB VRAM — the sweet spot
  • CC 8.6 — BF16, Flash Attention, everything you need
  • 936 GB/s bandwidth — fast LLM inference
  • 350W TDP — needs a beefy PSU
  • Best value in ML GPUs right now

$1,500-1,800: RTX 4090

  • 24 GB VRAM (same as 3090)
  • 2× training throughput vs 3090
  • Better power efficiency
  • CC 8.9 — FP8 support

$3,000-5,000: Used A100 40GB/80GB

  • Professional GPU with ECC memory
  • 80GB version fits 70B at Q8 (FP16 would need ~140 GB)
  • 2 TB/s bandwidth
  • NVLink support for multi-GPU
  • Best for research labs and startups

Common Mistakes

"More CUDA cores = better for ML"

Not always. A 4070 (5,888 cores) vs 3090 (10,496 cores) — the 3090 is better for ML despite the 4070 being newer. VRAM and bandwidth matter more.

"I need the latest generation"

The RTX 3090 (2020) is still one of the best ML GPUs in 2026. Unless you specifically need FP8 or newer features, older high-end cards often beat newer mid-range ones.

"Gaming benchmarks predict ML performance"

Gaming uses completely different GPU capabilities. A GPU that's 20% faster in games might be 50% slower for training if it has less VRAM or lower bandwidth.

"I'll just use the cloud"

Cloud GPUs cost $1-4/hour. If you train regularly, a $700 used 3090 pays for itself in ~3-6 months compared to cloud rentals.
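The break-even math is straightforward. The price, rate, and monthly hours below are example figures, and the sketch deliberately ignores electricity costs and resale value:

```python
def breakeven_months(gpu_price: float, cloud_rate: float,
                     hours_per_month: float) -> float:
    """Months until a purchased GPU beats cloud rental
    (ignoring electricity and resale, a simplification)."""
    return gpu_price / (cloud_rate * hours_per_month)

# $700 used 3090 vs $1.50/hr cloud, 100 training hours per month
print(round(breakeven_months(700, 1.50, 100), 1))
```

At 100 hours a month that works out to under five months; even at half that usage you break even within a year.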

Quick Decision Matrix

| Priority | Best Choice | Why |
|----------|-------------|-----|
| Max VRAM per $  | Used RTX 3090     | 24GB at ~$650 |
| Training speed  | RTX 4090          | 2× faster than 3090 |
| Inference tok/s | RTX 3090 or 4090  | Best bandwidth at consumer price |
| LLM 70B+        | 2× Used 3090      | 48GB for ~$1,300 |
| Professional    | A100 80GB         | 80GB, NVLink, ECC |

Building an ML rig? Drop your budget and use case in the comments — happy to help pick components!
