Choosing the right GPU for ML is confusing. Marketing specs don't tell you what matters for training and inference. Here's what actually counts.
The Four Specs That Matter
1. VRAM (Most Important)
VRAM determines what models you can run. No amount of compute power helps if your model doesn't fit in memory.
| VRAM | What Fits (Inference) | What Fits (Training) |
|---|---|---|
| 8 GB | 7B at Q4 | 7B QLoRA |
| 12 GB | 13B at Q4 | 7B QLoRA comfortably |
| 16 GB | 24B at Q4 | 13B QLoRA |
| 24 GB | 34B at Q5 | 13B full fine-tune, 34B QLoRA |
| 48 GB | 70B at Q4 | 34B full fine-tune |
| 80 GB | 70B at Q8 (FP16 needs ~140 GB) | 70B QLoRA |
Rule of thumb: buy the most VRAM you can afford. You can't upgrade VRAM later.
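As a back-of-the-envelope check, you can estimate inference VRAM from parameter count and quantization level. A rough sketch — the 20% overhead for KV cache and activations is an assumption, and it grows with context length:

```python
def vram_needed_gb(params_billion: float, bits_per_weight: float,
                   overhead: float = 1.2) -> float:
    """Rough inference VRAM estimate: weight bytes plus ~20% headroom
    for KV cache and activations (assumed; grows with context length)."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb * overhead

print(round(vram_needed_gb(7, 4.5), 1))   # 7B at Q4 (~4.5 bits effective): 4.7 -> fits in 8 GB
print(round(vram_needed_gb(70, 4.5), 1))  # 70B at Q4: 47.2 -> needs a 48 GB setup
```

Run it for any model you're eyeing before you buy — the 20% headroom assumption is conservative for short contexts and optimistic for very long ones.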
2. Memory Bandwidth
For LLM inference, throughput is limited by how fast you can read model weights from VRAM. This is the memory bandwidth spec.
| GPU | Bandwidth | Llama 8B Q4 tok/s |
|---|---|---|
| RTX 4060 | 272 GB/s | ~35 |
| RTX 4070 | 504 GB/s | ~60 |
| RTX 3090 | 936 GB/s | ~85 |
| RTX 4090 | 1,008 GB/s | ~105 |
| A100 80GB | 2,039 GB/s | ~180 |
| H100 | 3,350 GB/s | ~300 |
Higher bandwidth means faster token generation. This is why a 3090 (936 GB/s) feels faster for LLMs than a newer 4070 Ti (504 GB/s).
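You can estimate this ceiling yourself: each generated token has to read every weight from VRAM once, so peak single-stream tokens/sec is roughly bandwidth divided by model size. A hedged sketch — the 4.5 GB weight size for Llama 8B Q4 is an assumption, and real throughput lands at maybe 40-60% of the ceiling once kernel overheads and KV-cache reads are counted:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical single-stream decode ceiling: every generated token
    reads the full set of model weights from VRAM once."""
    return bandwidth_gb_s / model_gb

# Llama 8B at Q4 is roughly 4.5 GB of weights (assumed size):
print(round(decode_ceiling_tok_s(936, 4.5)))   # RTX 3090 ceiling: 208
print(round(decode_ceiling_tok_s(1008, 4.5)))  # RTX 4090 ceiling: 224
```

The table's ~85 and ~105 tok/s are well below these ceilings, which is normal — but the ratio between any two GPUs tracks the bandwidth ratio closely.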
3. Tensor Cores
Tensor Cores accelerate matrix multiplication — the core operation in neural networks. They matter most for training.
| Generation | CC | Supported Precisions |
|---|---|---|
| 1st (Volta) | 7.0 | FP16 |
| 2nd (Turing) | 7.5 | FP16, INT8, INT4 |
| 3rd (Ampere) | 8.x | FP16, BF16, TF32, INT8 |
| 4th (Ada/Hopper) | 8.9 / 9.0 | FP16, BF16, TF32, FP8, INT8 |
| 5th (Blackwell) | 10.0 | All above + FP4 |
BF16 support (Ampere+) is especially important — it's the default training precision for modern models and avoids the NaN issues that FP16 can cause.
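In practice this means choosing the training dtype from the compute capability pair that `torch.cuda.get_device_capability()` returns. A minimal pure-Python sketch, assuming the CC thresholds in the table above:

```python
def pick_training_dtype(major: int, minor: int) -> str:
    """Map CUDA compute capability to a sensible training precision.
    BF16 needs CC 8.0+ (Ampere); FP16 tensor cores need CC 7.0+ (Volta)."""
    if (major, minor) >= (8, 0):
        return "bf16"  # wide dynamic range, no gradient scaling needed
    if (major, minor) >= (7, 0):
        return "fp16"  # pair with gradient scaling to avoid NaN losses
    return "fp32"      # no usable tensor cores; train in full precision

print(pick_training_dtype(8, 6))  # RTX 3090 (Ampere) -> bf16
print(pick_training_dtype(7, 5))  # RTX 2080 Ti (Turing) -> fp16
```

Tuple comparison handles the major/minor split correctly, so (8, 6) and (8, 9) both land on BF16 while (7, 5) falls back to FP16.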
4. CUDA Compute Capability
CC determines what frameworks and features your GPU supports. As of 2026:
- Minimum CC 5.0 for PyTorch/TensorFlow
- CC 7.0+ for Tensor Cores
- CC 8.0+ for Flash Attention, BF16
- CC 8.9+ for FP8
You can look up any GPU's compute capability at gpuark.com.
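Those thresholds are easy to encode as a lookup — pass in the (major, minor) pair for your card. A sketch assuming the gates listed above hold for your framework version:

```python
def supported_features(major: int, minor: int) -> list[str]:
    """Feature gates by CUDA compute capability (thresholds as listed above)."""
    cc = (major, minor)
    feats = []
    if cc >= (5, 0):
        feats.append("PyTorch/TensorFlow")
    if cc >= (7, 0):
        feats.append("Tensor Cores")
    if cc >= (8, 0):
        feats += ["Flash Attention", "BF16"]
    if cc >= (8, 9):
        feats.append("FP8")
    return feats

print(supported_features(8, 6))  # RTX 3090: everything except FP8
```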
GPU Recommendations by Budget
Under $400: RTX 4060 Ti 16GB
- 16 GB VRAM — runs 24B models at Q4
- CC 8.9 (Ada Lovelace) — all modern features
- 165W TDP — low power
- Limitation: 128-bit bus, 288 GB/s bandwidth (slow for LLMs)
$500-700: Used RTX 3090
- 24 GB VRAM — the sweet spot
- CC 8.6 — BF16, Flash Attention, everything you need
- 936 GB/s bandwidth — fast LLM inference
- 350W TDP — needs a beefy PSU
- Best value in ML GPUs right now
$1,500-1,800: RTX 4090
- 24 GB VRAM (same as 3090)
- 2× training throughput vs 3090
- Better power efficiency
- CC 8.9 — FP8 support
$3,000-5,000: Used A100 40GB/80GB
- Professional GPU with ECC memory
- 80GB version fits 70B at FP16
- 2 TB/s bandwidth
- NVLink support for multi-GPU
- Best for research labs and startups
Common Mistakes
"More CUDA cores = better for ML"
Not quite. Core counts don't compare across generations, and cores usually aren't the bottleneck anyway. The 3090 (10,496 cores) beats the newer 4070 (5,888 cores) for ML, but what makes the difference is its 24 GB VRAM and 936 GB/s bandwidth, not the core count.
"I need the latest generation"
The RTX 3090 (2020) is still one of the best ML GPUs in 2026. Unless you specifically need FP8 or newer features, older high-end cards often beat newer mid-range ones.
"Gaming benchmarks predict ML performance"
Gaming and ML stress different parts of the GPU: games care about rasterization, clocks, and cache, while ML cares about VRAM, bandwidth, and tensor throughput. A GPU that's 20% faster in games might be 50% slower for training if it has less VRAM or lower bandwidth.
"I'll just use the cloud"
Cloud GPUs cost $1-4/hour. If you train regularly, a $700 used 3090 pays for itself in ~3-6 months compared to cloud rentals.
Quick Decision Matrix
| Priority | Best Choice | Why |
|---|---|---|
| Max VRAM per $ | Used RTX 3090 | 24GB at ~$650 |
| Training speed | RTX 4090 | 2× faster than 3090 |
| Inference tok/s | RTX 3090 or 4090 | Best bandwidth at consumer price |
| LLM 70B+ | 2× Used 3090 | 48GB for ~$1,300 |
| Professional | A100 80GB | 80GB, NVLink, ECC |
Building an ML rig? Drop your budget and use case in the comments — happy to help pick components!