DEV Community: Max Vyaznikov

20 Years of GPUs in Numbers: How FLOPS and TDP Grew, and Who Led the NVIDIA vs AMD Duel (+ open dataset of 13,500 GPUs)

Max Vyaznikov — Tue, 26 May 2026 01:11:13 +0000

We run a GPU catalog and have built up a database of 13,566 GPUs — from the GeForce 256 (1999) to Blackwell and the MI355X (2025). At some point it got interesting to look not at "which card is faster," but at how the whole industry shifted: how much FLOPS grew, where TDP hit a wall, and who led the NVIDIA-vs-AMD race in different years.

Below is a breakdown from our own data. Two things I'll put on the table right away: the methodology (what I measured and how, where the data is noisy) and an open dataset at the end of the article — grab it and dig in with us 😊

TL;DR

Peak FP32 of the flagship grew ~400× in 19 years: 0.3 TFLOPS (GeForce 8800 GTX, 2006) → 126 TFLOPS (Blackwell, 2025). It's an almost perfectly straight line on a semi-log scale.
TDP crept up slowly (155 → 300 W over 2006–2020), then exploded in the datacenter: 700 W (H100), 1000 W (MI325X / B200), 1400 W (MI355X, 2025).
Yet performance per watt grew ~100× — they "draw more," but "do far more per watt." The main driver is the process node (90 nm → 3 nm) plus architecture.
The NVIDIA/AMD duel by peak FP32 moved in waves: AMD led in the early 2010s (GCN era) and again in 2023–24 (Instinct MI300/MI325), NVIDIA in 2016–2020 (the AI pivot) and in 2025 (Blackwell). But "raw FP32" is a misleading metric — more on that below.

Methodology

What these TFLOPS are and why they're "theoretical." Every FP32 number in this article is the theoretical peak that vendors compute with the formula:

FP32 TFLOPS = (shader ALUs / CUDA cores) × boost clock (Hz) × 2 / 10^12

The ×2 is because an FMA (fused multiply-add) does a multiply and an add in one cycle — two operations. This is a ceiling, not real-world throughput: in practice you reach noticeably less — typically 60–90% on well-optimized compute-bound kernels and a fraction of that on memory-bound ones — because memory bandwidth, SM occupancy, instruction mix, and the fact that boost clocks don't hold under sustained load and thermal limits all get in the way. Theory diverging from practice is normal. The theoretical peak is valuable for a different reason: it's computed by one formula across every card and generation, so it's a fair comparable yardstick for a historical look — that's what spec sheets list, and what we use. Real performance is measured with benchmarks (they're a separate table in the dataset).
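To make the formula concrete, here is a minimal Python sketch. The function name `peak_fp32_tflops` is mine, and the boost clocks (1,695 MHz for the RTX 3090, 2,520 MHz for the RTX 4090) are the commonly listed reference values, assumed here rather than pulled from our database.

```python
def peak_fp32_tflops(shader_alus: int, boost_clock_hz: float) -> float:
    """Theoretical FP32 peak: ALUs x clock x 2 (one FMA counts as 2 ops), in TFLOPS."""
    return shader_alus * boost_clock_hz * 2 / 1e12

# Reference boost clocks are assumptions (vendor reference specs).
rtx_3090 = peak_fp32_tflops(10_496, 1.695e9)
rtx_4090 = peak_fp32_tflops(16_384, 2.520e9)
print(f"RTX 3090: {rtx_3090:.1f} TFLOPS")
print(f"RTX 4090: {rtx_4090:.1f} TFLOPS")
```

Partner boards with factory overclocks will land slightly higher, which is one reason the same chip can show up with several FP32 values in a spec database.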

The source is our specification database. "Flagship of the year" = the card with the maximum fp32_performance released that year, tracked separately for NVIDIA and AMD.
For the TDP/efficiency curves I excluded dual-GPU cards (GTX 295, HD 6990, R9 295X2, etc.) — otherwise TDP and FLOPS double up and break the trend.
Where the data is noisy: vendor is filled in for ~2,360 of 13,566 cards (the rest are mostly OEM partner-board variants). Medians use the labeled subset; flagship peaks are fully labeled. And FP16/tensor performance is not directly comparable between vendors — because of structured sparsity. Starting with Ampere (A100), NVIDIA quotes tensor FP16/BF16 in its spec sheets with sparsity already applied — that's 2× the dense value (the feature processes sparse matrices twice as fast). Our database stores exactly this "sparse" figure for such cards. AMD has no equivalent spec line — those are dense. So NVIDIA's raw FP16 column (A100+) has to be halved to compare fairly with AMD: A100 = 624 (sparse) → 312 dense, H100 = 1979 → ~990 dense. The "AI inflection" part below relies on these dense-normalized numbers.

1. FLOPS: an almost perfectly straight exponential

Peak FP32 of the single flagship by year (NVIDIA):

Year	Flagship	FP32, TFLOPS
2006	GeForce 8800 GTX	0.3
2010	GeForce GTX 580	1.6
2013	GeForce GTX 780 Ti	5.3
2016	Quadro P6000	12.6
2017	Tesla V100	15.7
2020	RTX A6000	38.7
2022	L40S	91.6
2025	RTX PRO 6000 Blackwell	126.0

≈400× in 19 years is a CAGR of about 37% per year. On a semi-log scale the line is almost straight: a classic exponential that has only recently started bending on the "desktop" segment and moved into the datacenter.
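The growth rate can be checked directly from the two endpoints of the table. This is a small sketch; the helper name `cagr` is illustrative.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.37 means 37% per year)."""
    return (end / start) ** (1 / years) - 1

# 8800 GTX (2006, 0.3 TFLOPS) -> RTX PRO 6000 Blackwell (2025, 126 TFLOPS)
growth = 126.0 / 0.3
rate = cagr(0.3, 126.0, 2025 - 2006)
print(f"{growth:.0f}x overall, {rate:.1%} per year")
```

Because the rate is compounded, a small change in either endpoint barely moves it, which is why rounding the total growth to "≈400×" still gives about 37%.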

2. TDP: a quiet climb, then a datacenter explosion

Year	Card	TDP, W
2006	GeForce 8800 GTX	155
2010	GTX 580	244
2017	Tesla V100	250
2020	RTX A6000	300
2022	H100 SXM	700
2024	MI325X / B200	1000
2025	MI355X	1400

For a decade and a half the flagship TDP stayed in a 150–300 W band. The break comes after 2020, and it's entirely datacenter-driven: AI accelerators (SXM/OAM modules) shot up to 700–1400 W because they're cooled by liquid in a rack, not by a fan in a case. The desktop ceiling separately hit ~450–600 W (RTX 4090/5090).

There's a curious gap if you look at NVIDIA's consumer flagships separately: the GeForce flagship sat at exactly 250 W for seven years (2013–2019) — GTX 780 Ti, Titan X, 1080 Ti, 2080 Ti — and only broke that ceiling with the RTX 3090 (350 W, 2020), then 4090 (450 W) and 5090 (575 W). Datacenter accelerators, by contrast, went to 700–1400 W almost immediately. It looks like what capped gaming TDP wasn't the silicon so much as the market — cases, PSUs, and buyer habits; in a rack there are no such limits, and watts grew without looking back. (This is interpretation: the spec stores watts, not intentions — but a 250 W plateau across seven generations shows up clearly in the data.)

3. Performance per watt: this is where the progress is

If you only look at TDP, it feels like "everything's getting worse, cards guzzle power." But FP32 per watt tells the opposite story:

Year	Flagship	TFLOPS/W
2006	8800 GTX	0.002
2013	GTX 780 Ti	0.021
2016	Quadro P6000	0.051
2020	RTX A6000	0.129
2022	L40S	0.262
2025	RTX PRO 6000 Blackwell	0.21

~100× in efficiency. Peak "classic" efficiency lands in 2022 (Ada/L40S); the 2024–25 datacenter cards sometimes lose on TFLOPS/W because they deliberately trade efficiency for absolute compute density in the rack. The main drivers of efficiency gains are the process node (90 nm → 3 nm) and architectural improvements, not clocks.
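The efficiency column is just FP32 divided by TDP. As a sketch, here it is recomputed for the four cards that appear in both the FLOPS and the TDP tables above (all values copied from those tables):

```python
# (year, card, peak FP32 in TFLOPS, TDP in W), copied from the two tables above
cards = [
    (2006, "GeForce 8800 GTX", 0.3, 155),
    (2010, "GeForce GTX 580", 1.6, 244),
    (2017, "Tesla V100", 15.7, 250),
    (2020, "RTX A6000", 38.7, 300),
]

efficiency = {name: tflops / tdp_w for _, name, tflops, tdp_w in cards}
for name, value in efficiency.items():
    print(f"{name:18s} {value:.3f} TFLOPS/W")

gain = efficiency["RTX A6000"] / efficiency["GeForce 8800 GTX"]
print(f"2006 -> 2020 gain: {gain:.0f}x")
```

Most of the ~100× is already in place by 2020; the remaining factor comes from the Ada and Blackwell rows of the efficiency table.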

4. The NVIDIA vs AMD duel

If you mark, year by year, whose single flagship had the higher FP32:

Period	Leader	Context
2007–2008	AMD	FireStream 9170/9270
2010–2013	AMD	TeraScale → GCN: HD 6970, HD 7970 GHz, R9 290X
2014	NVIDIA	Titan Black (5.6) vs FirePro W9100 (5.2)
2015	AMD	Fury X (8.6)
2016–2020	NVIDIA	Pascal → Ampere, the AI pivot
2021	AMD	Instinct MI250X (47.9)
2022	NVIDIA	L40S / Hopper
2023–2024	AMD	Instinct MI300A/MI325X (81.7)
2025	NVIDIA	Blackwell (126)

The picture is wavy, and I included it mostly for the intrigue — to give AMD at least a fighting chance. Because on raw FP32, AMD took the lead regularly — in the GCN era and again on recent Instinct parts. But raw FP32 is exactly the deceptive metric for today's world. The AI era is won not on FP32, but on software and FP16/BF16/FP8. Here NVIDIA, with tensor cores (since V100, 2017) and the CUDA ecosystem, built a moat that the FP32 numbers alone don't reveal: V100 delivered ~125 TFLOPS tensor-FP16, A100 ~312, H100 ~990 (vendor public data). In other words, the "FP32 duel" is about the past — the GPU as a graphics accelerator; the real battle has moved to a plane FP32 doesn't measure.

So, here's one more chart — the FP16 duel, where NVIDIA is consistently ahead. And once you layer the AI software stack on top of that…

5. What else the data shows

Process node: 90 nm (2006) → 28 nm (a 2012–2015 plateau, the "stuck node") → 16/12/7 → 3 nm (MI355X, 2025).
Flagship VRAM: 0.77 GB (8800 GTX) → 12–24 GB (mid-2010s) → 48 GB (A6000) → 192–288 GB (MI300/MI355X). Memory grows even faster than compute — because AI models are bottlenecked on it.
The "stuck" 28 nm: for four years (2012–2015) the industry sat on one node — and that's exactly when AMD held parity/leadership on FP32. As soon as the process-node sprint resumed and tensor cores appeared, the advantage swung to NVIDIA.

Open dataset — take it

We've published a cleaned dump of our GPU spec database for anyone who wants to dig in themselves:

📦 Download: gpuark.com/datasets — the files gpuark-gpu-specs.csv, gpuark-benchmarks.csv, gpuark-gpu-dataset.sqlite, or everything in a single gpuark-gpu-dataset.tar.gz archive.

13,566 GPUs (fields: vendor, manufacturer, release date, architecture, process node, transistors, clocks, memory size and type, bus, FP16/FP32/FP64/BF16/TF32/INT8, TDP, NVLink, CUDA SM, and more) + 993 third-party benchmark results (join on gpu_id).
Formats: CSV (Excel/pandas) and SQLite (ready-made SQL) — two tables, gpu_specs and benchmarks.
License: CC BY 4.0 (attribution to gpuark.com).
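As a starting point, here is a hedged sketch of the "flagship of the year" query from the methodology. It runs against a tiny in-memory stand-in rather than the real file. The table name `gpu_specs` and the `fp32_performance` field are from this article; the other column names (`name`, `vendor`, `release_date`) and the release dates in the toy rows are assumptions for illustration, so check the actual schema before reusing it.

```python
import sqlite3

# In-memory stand-in for gpuark-gpu-dataset.sqlite (column names partly assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gpu_specs (name TEXT, vendor TEXT, release_date TEXT, fp32_performance REAL)"
)
conn.executemany(
    "INSERT INTO gpu_specs VALUES (?, ?, ?, ?)",
    [
        ("Titan Black", "NVIDIA", "2014-02-18", 5.6),
        ("FirePro W9100", "AMD", "2014-03-26", 5.2),
        ("Hypothetical midrange card", "NVIDIA", "2014-06-01", 2.0),
    ],
)

# "Flagship of the year" = highest fp32_performance per vendor and release year.
# SQLite fills the bare `name` column from the row that holds the MAX().
query = """
    SELECT vendor, substr(release_date, 1, 4) AS year, name, MAX(fp32_performance)
    FROM gpu_specs
    WHERE vendor IS NOT NULL AND fp32_performance IS NOT NULL
    GROUP BY vendor, year
    ORDER BY year, vendor
"""
flagships = conn.execute(query).fetchall()
for row in flagships:
    print(row)
```

To run it on the real data, replace `":memory:"` with the path to the downloaded SQLite file and drop the `CREATE`/`INSERT` statements. The `vendor IS NOT NULL` filter matters because most rows have no vendor label.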

If you'd rather explore interactively before downloading, the same data powers the GPU comparison tool on the site.

Takeaways

FLOPS grew as an almost perfect exponential (~37%/yr) — but the "free" growth is over; from here we pay with TDP and a move into the rack.
Real progress is measured not in watts and not in raw FP32, but in performance per watt (×100) — and that rides on the process node.
AMD fought and led on the "raw" numbers more often than people think; but the AI era was defined by tensor + software, not FP32.

The data is open — if you find something in it we missed, let me know.

Running DeepSeek, Llama 3, and Qwen Locally: Complete GPU Requirements Guide

Max Vyaznikov — Thu, 12 Mar 2026 05:09:12 +0000

Want to run the latest open-source LLMs on your own hardware? Here's exactly what you need for each popular model family.

Quick Reference: VRAM Requirements

Model	FP16	Q8	Q4_K_M	Min GPU
Llama 3.1 8B	16 GB	8.5 GB	5 GB	RTX 3060 12GB
Llama 3.1 70B	140 GB	70 GB	40 GB	2× RTX 3090
Llama 3.1 405B	810 GB	405 GB	228 GB	3× A100 80GB
Qwen2.5 7B	14 GB	7.5 GB	4.5 GB	RTX 3060 8GB
Qwen2.5 14B	28 GB	14 GB	8.5 GB	RTX 4060 Ti 16GB
Qwen2.5 32B	64 GB	32 GB	18 GB	RTX 3090 24GB
Qwen2.5 72B	144 GB	72 GB	41 GB	2× RTX 3090
Mistral Small 24B	48 GB	24 GB	14 GB	RTX 4080 16GB
Mistral Large 123B	246 GB	123 GB	69 GB	4× RTX 3090
DeepSeek V3 671B	1,340 GB	670 GB	376 GB	5× A100 80GB
DeepSeek R1 671B	1,340 GB	670 GB	376 GB	5× A100 80GB
Phi-3.5 Mini 3.8B	7.6 GB	4 GB	2.5 GB	RTX 3060 8GB
Gemma 2 27B	54 GB	27 GB	16 GB	RTX 4080 16GB

For any model, you can calculate exact VRAM needs at the VRAM calculator on gpuark.com.

Model-by-Model Deep Dive

Llama 3.1 — The All-Rounder

Meta's Llama 3.1 comes in 8B, 70B, and 405B sizes. The 8B is perfect for getting started:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Llama 3.1 8B (auto-downloads ~4.7GB)
ollama run llama3.1

# Or the 70B if you have the VRAM
ollama run llama3.1:70b

8B at Q4_K_M: Fits on any 8GB+ GPU. Great for coding, summarization, general chat. Not competitive with GPT-4 on complex reasoning.

70B at Q4_K_M: This is where Llama 3.1 really shines — competitive with GPT-4 on many benchmarks. Needs ~40GB VRAM, so two 3090s or a single A100 80GB.

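405B: Research-grade. At Q4 the weights alone take ~228 GB, so you need at least 3× A100 80GB (240 GB), and realistically 4 to leave room for the KV cache. Not practical for most individuals.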

DeepSeek V3 / R1 — The MoE Giants

DeepSeek V3 (671B) uses Mixture of Experts — only ~37B parameters active per token, but all 671B must fit in memory. This means:

At Q4_K_M: ~376 GB VRAM minimum
Realistic minimum: 5× A100 80GB (400 GB total)
On consumer hardware: not feasible for the full model
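The GPU count above is a simple ceiling division over the total weight size. A minimal sketch (the helper name `gpus_needed` is illustrative):

```python
import math

def gpus_needed(model_gb: float, gpu_vram_gb: float = 80.0) -> int:
    """Smallest number of identical GPUs whose combined VRAM holds the weights."""
    return math.ceil(model_gb / gpu_vram_gb)

# Memory is driven by total parameters (671B), not the ~37B active per token.
q4_weights_gb = 376  # Q4_K_M size quoted above
count = gpus_needed(q4_weights_gb)
headroom_gb = count * 80 - q4_weights_gb
print(f"{count}x A100 80GB, {headroom_gb} GB left for KV cache and buffers")
```

The headroom is what the KV cache and runtime buffers have to live in, so long contexts may need one more card than the bare minimum.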

But: DeepSeek R1 distilled versions exist:

DeepSeek-R1-7B: 4.5 GB at Q4 — runs on any modern GPU
DeepSeek-R1-14B: 8.5 GB at Q4 — RTX 4060 Ti
DeepSeek-R1-32B: 18 GB at Q4 — RTX 3090
DeepSeek-R1-70B: 40 GB at Q4 — 2× RTX 3090

The distilled 32B is arguably the best reasoning model you can run on a single consumer GPU.

Qwen2.5 — Best for Coding

Alibaba's Qwen2.5 series excels at code generation. The -Coder variants are particularly strong:

# Qwen2.5-Coder-14B — best coding model for 16GB GPUs
ollama run qwen2.5-coder:14b

# Qwen2.5-32B — strong general model for 24GB GPUs
ollama run qwen2.5:32b

Qwen2.5-Coder-14B at Q4_K_M (~8.5 GB) is the sweet spot for developer use. It handles Python, JavaScript, Rust, Go with impressive accuracy and fits on a 12GB card.

Mistral — Efficient and Fast

Mistral models are known for good quality-to-size ratio:

# Mistral Small 24B — best quality under 16GB
ollama run mistral-small

# Mistral Large 123B — needs serious hardware
ollama run mistral-large

Mistral Small 24B at Q4_K_M (~14 GB) is the best general-purpose model for 16GB GPUs. Solid reasoning, good instruction following, fast.

GPU Setup Recommendations

Beginner Setup (~$400)

GPU: RTX 4060 Ti 16GB
Models: Qwen2.5-14B, Mistral-Small-24B (Q4), Llama 3.1 8B (Q8)
Software: Ollama + Open WebUI

Enthusiast Setup (~$700)

GPU: Used RTX 3090 24GB
Models: Qwen2.5-32B, DeepSeek-R1-32B, any 34B model
Software: Ollama or ExLlamaV2 + TabbyAPI

Power User Setup (~$1,400)

GPUs: 2× Used RTX 3090 (48GB total)
Models: Llama 3.1 70B and Qwen2.5-72B at Q4; Mixtral 8x22B (~80 GB at Q4) only with CPU offload
Software: llama.cpp with --tensor-split 24,24

Prosumer Setup (~$2,000)

GPU: RTX 4090 + used RTX 3090
Models: Same as above, faster inference
Software: ExLlamaV2 with tensor parallelism

Performance Tips

1. Use the right quantization

Q4_K_M for most models. Go Q5 or Q6 only if VRAM allows — the quality gain is marginal but measurable on reasoning.

2. Optimize KV cache

# llama.cpp: limit context to what you need
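llama-server -m model.gguf -c 4096  # cap context at 4096 tokens instead of the model's full window

The KV cache grows linearly with context length, so halving the context roughly halves the KV cache's share of VRAM.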

3. Flash Attention

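Flash Attention 2 requires CC 8.0+ (RTX 3000 series and newer) and is enabled by default in most frameworks. It cuts attention memory for long contexts from O(n²) to O(n) in sequence length, because the full attention matrix is never materialized.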

4. CPU offloading for oversized models

# llama.cpp: offload only some layers to GPU
llama-server -m model.gguf -ngl 20  # 20 layers on GPU, rest on CPU

Slower but lets you run models that don't fully fit. Expect ~2-5 tok/s for CPU layers vs ~30+ for GPU.

Conclusion

The local LLM ecosystem has matured enormously. For most developers:

Start with Ollama — zero-friction setup
Get at least 16GB VRAM — opens up 24B models
24GB (RTX 3090) is the sweet spot — runs everything up to 34B comfortably
Two GPUs if you need 70B+ — pipeline parallelism just works

The quality gap between local 32B models and cloud GPT-4 has narrowed significantly, especially for coding and domain-specific tasks. For many workflows, local is now good enough.

What's your local LLM setup? Drop your GPU + favorite model in the comments!

A Developer's Guide to Choosing a GPU for Machine Learning in 2025-2026

Max Vyaznikov — Thu, 12 Mar 2026 05:04:11 +0000

Choosing the right GPU for ML is confusing. Marketing specs don't tell you what matters for training and inference. Here's what actually counts.

The Four Specs That Matter

1. VRAM (Most Important)

VRAM determines what models you can run. No amount of compute power helps if your model doesn't fit in memory.

VRAM	What Fits (Inference)	What Fits (Training)
8 GB	7B at Q4	7B QLoRA
12 GB	13B at Q4	7B QLoRA comfortably
16 GB	24B at Q4	13B QLoRA
24 GB	34B at Q5	7B LoRA (16-bit base), 34B QLoRA
48 GB	70B at Q4	13B LoRA (16-bit base), 70B QLoRA (tight)
80 GB	70B at Q8 (tight)	70B QLoRA comfortably

Rule of thumb: buy the most VRAM you can afford. You can't upgrade VRAM later.

2. Memory Bandwidth

For LLM inference, throughput is limited by how fast you can read model weights from VRAM. This is the memory bandwidth spec.

GPU	Bandwidth	Llama 8B Q4 tok/s
RTX 4060	272 GB/s	~35
RTX 4070	504 GB/s	~60
RTX 3090	936 GB/s	~85
RTX 4090	1,008 GB/s	~105
A100 80GB	2,039 GB/s	~180
H100	3,350 GB/s	~300

Higher bandwidth = faster token generation. This is why a 3090 feels faster for LLMs than a 4070 Ti despite being older.
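A rough way to see why bandwidth dominates: during generation every new token has to read all the weights once, so bandwidth divided by model size is a hard ceiling on tokens per second. The sketch below assumes about 5 GB of weights for Llama 8B at Q4 (a rounded figure used for illustration) and ignores the KV cache and kernel overhead.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if every token requires reading all weights once."""
    return bandwidth_gb_s / weights_gb

weights_gb = 5.0  # Llama 8B at Q4, rounded; an assumption for illustration
gpus = {"RTX 4060": 272, "RTX 4070": 504, "RTX 3090": 936, "RTX 4090": 1008}
ceilings = {name: decode_ceiling_tok_s(bw, weights_gb) for name, bw in gpus.items()}
for name, ceiling in ceilings.items():
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

The measured numbers in the table sit well below these ceilings, as expected for an upper bound, but the ordering of the cards is the same.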

3. Tensor Cores

Tensor Cores accelerate matrix multiplication — the core operation in neural networks. They matter most for training.

Generation	CC	Supported Precisions
1st (Volta)	7.0	FP16
2nd (Turing)	7.5	FP16, INT8, INT4
3rd (Ampere)	8.x	FP16, BF16, TF32, INT8
4th (Ada / Hopper)	8.9 / 9.0	FP16, BF16, TF32, FP8, INT8
5th (Blackwell)	10.0 (datacenter), 12.0 (RTX 50 series)	All above + FP4

BF16 support (Ampere+) is especially important. It's the default training precision for modern models, and because it keeps FP32's exponent range it avoids the overflow-to-NaN issues that FP16 can cause.

4. CUDA Compute Capability

CC determines what frameworks and features your GPU supports. As of 2026:

Minimum CC 5.0 for PyTorch/TensorFlow
CC 7.0+ for Tensor Cores
CC 8.0+ for Flash Attention, BF16
CC 8.9+ for FP8

You can look up any GPU's compute capability at gpuark.com.

GPU Recommendations by Budget

Under $400: RTX 4060 Ti 16GB

16 GB VRAM — runs 24B models at Q4
CC 8.9 (Ada Lovelace) — all modern features
165W TDP — low power
Limitation: 128-bit bus, 288 GB/s bandwidth (slow for LLMs)

$500-700: Used RTX 3090

24 GB VRAM — the sweet spot
CC 8.6 — BF16, Flash Attention, everything you need
936 GB/s bandwidth — fast LLM inference
350W TDP — needs a beefy PSU
Best value in ML GPUs right now

$1,500-1,800: RTX 4090

24 GB VRAM (same as 3090)
2× training throughput vs 3090
Better power efficiency
CC 8.9 — FP8 support

$3,000-5,000: Used A100 40GB/80GB

Professional GPU with ECC memory
80GB version fits 70B at Q8 (~70 GB, tight); FP16 (~140 GB) needs two cards
2 TB/s bandwidth
NVLink support for multi-GPU
Best for research labs and startups

Common Mistakes

"More CUDA cores = better for ML"

Not always. A 4070 (5,888 cores) vs 3090 (10,496 cores) — the 3090 is better for ML despite the 4070 being newer. VRAM and bandwidth matter more.

"I need the latest generation"

The RTX 3090 (2020) is still one of the best ML GPUs in 2026. Unless you specifically need FP8 or newer features, older high-end cards often beat newer mid-range ones.

"Gaming benchmarks predict ML performance"

Gaming uses completely different GPU capabilities. A GPU that's 20% faster in games might be 50% slower for training if it has less VRAM or lower bandwidth.

"I'll just use the cloud"

Cloud GPUs cost $1-4/hour. If you train regularly, a $700 used 3090 pays for itself in ~3-6 months compared to cloud rentals.
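The payback period depends entirely on how many hours a day the card is busy. A back-of-the-envelope sketch, where the 6 hours/day workload is an assumption and electricity and speed differences between the cloud GPU and the 3090 are ignored:

```python
def break_even_hours(gpu_price_usd: float, cloud_usd_per_hour: float) -> float:
    """GPU-hours of cloud rental that cost as much as buying the card."""
    return gpu_price_usd / cloud_usd_per_hour

def break_even_months(gpu_price_usd: float, cloud_usd_per_hour: float,
                      hours_per_day: float) -> float:
    """Months until the purchase pays off at a given daily usage (30-day months)."""
    return break_even_hours(gpu_price_usd, cloud_usd_per_hour) / (hours_per_day * 30)

# $700 card vs. a $1/hour cloud GPU; 6 hours/day of use is an assumed workload.
hours = break_even_hours(700, 1.0)
months = break_even_months(700, 1.0, hours_per_day=6)
print(f"{hours:.0f} GPU-hours, about {months:.1f} months at 6 h/day")
```

At higher cloud prices or heavier daily use the break-even point moves earlier; at an hour a day it takes closer to two years.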

Quick Decision Matrix

Priority	Best Choice	Why
Max VRAM per $	Used RTX 3090	24GB at ~$650
Training speed	RTX 4090	2× faster than 3090
Inference tok/s	RTX 3090 or 4090	Best bandwidth at consumer price
LLM 70B+	2× Used 3090	48GB for ~$1,300
Professional	A100 80GB	80GB, NVLink, ECC

Building an ML rig? Drop your budget and use case in the comments — happy to help pick components!

RTX 4090 vs RTX 3090 for AI/ML: Is the Upgrade Worth It?

Max Vyaznikov — Thu, 12 Mar 2026 05:03:04 +0000

The RTX 3090 and RTX 4090 are the two most popular consumer GPUs for AI/ML work. Both have 24GB VRAM, but the price gap is massive. Let's break down when each one makes sense.

Specs Comparison

Spec	RTX 3090	RTX 4090
Architecture	Ampere (CC 8.6)	Ada Lovelace (CC 8.9)
VRAM	24 GB GDDR6X	24 GB GDDR6X
Memory Bandwidth	936 GB/s	1,008 GB/s
CUDA Cores	10,496	16,384
Tensor Cores	328 (3rd gen)	512 (4th gen)
TDP	350W	450W
FP16 Tensor	142 TFLOPS	330 TFLOPS
New Price (2026)	Discontinued	~$1,800
Used Price (2026)	~$600-700	~$1,400-1,500

For a detailed side-by-side with all specifications, see the RTX 4090 vs RTX 3090 comparison page on gpuark.com.

Training Performance

The 4090 is roughly 1.7-2× faster for training due to:

56% more CUDA cores
4th gen Tensor Cores (better FP8, BF16 throughput)
Higher clock speeds
Better power efficiency

Real-world training benchmarks:

Task	RTX 3090	RTX 4090	Speedup
ResNet-50 (BS=64)	780 img/s	1,420 img/s	1.82×
BERT fine-tune (BS=32)	145 samples/s	268 samples/s	1.85×
Stable Diffusion training	2.1 it/s	3.8 it/s	1.81×
LLaMA 7B LoRA (r=16)	1.4 it/s	2.6 it/s	1.86×

Inference Performance (LLMs)

For LLM inference, the gap narrows because it's memory-bandwidth bound:

Task	RTX 3090	RTX 4090	Speedup
Llama 3.1 8B Q4 (tok/s)	85	105	1.24×
Llama 3.1 70B Q4 (tok/s)	doesn't fit	doesn't fit	—
Mistral 7B Q4 (prompt)	1,200 tok/s	1,800 tok/s	1.50×

Memory bandwidth difference is only 8% (936 vs 1,008 GB/s), so for pure token generation the 4090 advantage is modest.

The Real Decision

Buy a 4090 if:

Training throughput is your bottleneck (research, frequent fine-tuning)
You need FP8 features (CC 8.9 vs 8.6)
Power efficiency matters (performance per watt is much better)
You want one powerful card, not multi-GPU hassle

Buy a used 3090 (or two) if:

VRAM is your bottleneck (most LLM use cases)
Budget matters — two 3090s = 48GB for ~$1,300 vs one 4090 = 24GB for ~$1,500
You primarily do inference
You want to run 34B+ models

The multi-GPU argument

Two used 3090s give you 48GB total VRAM for less than one 4090:

Can run Llama 3.1 70B at Q4_K_M
Pipeline parallelism with llama.cpp works out of the box
Training with FSDP/DeepSpeed ZeRO-3 across both cards

The catch: inter-GPU communication over PCIe is slower than a single card's internal bandwidth. For training, expect ~1.5-1.7× scaling (not 2×). For inference with pipeline parallelism, the latency penalty is minimal.

Power Consumption

Often overlooked but significant. The costs below correspond to roughly $0.12/kWh with the card at full TDP around the clock:

Config	TDP	Annual electricity (24/7)
1× RTX 3090	350W	~$370/year
1× RTX 4090	450W	~$475/year
2× RTX 3090	700W	~$740/year

If running 24/7 as an inference server, the 4090's better perf/watt matters. For occasional use, it doesn't.
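To redo the table for your own electricity price, the arithmetic is a one-liner. The $0.12/kWh default below is an assumption back-solved from the table, and full TDP around the clock is a worst case.

```python
HOURS_PER_YEAR = 24 * 365

def annual_cost_usd(tdp_watts: float, usd_per_kwh: float = 0.12) -> float:
    """Yearly electricity cost for a card drawing its full TDP around the clock."""
    return tdp_watts / 1000 * HOURS_PER_YEAR * usd_per_kwh

configs = {"1x RTX 3090": 350, "1x RTX 4090": 450, "2x RTX 3090": 700}
costs = {name: annual_cost_usd(watts) for name, watts in configs.items()}
for name, cost in costs.items():
    print(f"{name}: ~${cost:.0f}/year")
```

Real inference servers idle much of the time, so actual bills tend to be lower than these figures.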

Bottom Line

The RTX 3090 at $600-700 used is the best value proposition in ML hardware right now. The 4090 is a better card in every metric except price-per-VRAM-GB, but the 3090 gives you 80% of the capability at 40% of the price.

If you're VRAM-limited (and you probably are if you're running LLMs), two 3090s beat one 4090 every time.

Running ML workloads on consumer GPUs? Share your setup in the comments!

CUDA Compute Capability: What It Is and Why It Matters for ML Engineers

Max Vyaznikov — Thu, 12 Mar 2026 03:45:21 +0000

If you've ever seen an error like "CUDA error: no kernel image is available for execution on the device" or "minimum required Cuda capability is 3.5" — you've run into Compute Capability issues. Here's everything you need to know.

What Is Compute Capability?

CUDA Compute Capability (CC) is a version number assigned to every NVIDIA GPU that identifies its architecture and supported feature set. It's NOT a performance score.

Format: Major.Minor

Major = GPU architecture generation
Minor = incremental improvements within that generation

GeForce GTX 1080  → CC 6.1 (Pascal)
GeForce RTX 3090  → CC 8.6 (Ampere)
GeForce RTX 4090  → CC 8.9 (Ada Lovelace)
H100              → CC 9.0 (Hopper)
RTX 5090          → CC 12.0 (Blackwell, consumer; datacenter B200 is CC 10.0)

Why It Matters

1. Framework compatibility

Modern ML frameworks have minimum CC requirements:

Framework	Minimum CC	What's excluded
PyTorch 2.x	3.7–5.0 (depends on the CUDA build)	Older Kepler; all Kepler (including K80) on CUDA 12 builds
TensorFlow 2.15+	5.0	Kepler
JAX latest	5.2	Kepler, first-gen Maxwell (CC 5.0)
Flash Attention 2	8.0	Everything before Ampere

If your GPU's CC is below the minimum, the framework will not use it — you'll silently fall back to CPU or get a hard error.

2. Feature availability

Each CC level unlocks hardware features:

CC	Architecture	Key ML Features
5.0-5.2	Maxwell	Basic CUDA, cuDNN
6.0-6.1	Pascal	FP16 compute, unified memory
7.0	Volta	Tensor Cores (1st gen), WMMA
7.5	Turing	INT8/INT4 Tensor Cores, mixed precision
8.0	Ampere	BF16, TF32, sparse Tensor Cores, 3rd gen
8.6	Ampere (consumer)	Same features, fewer SMs
8.9	Ada Lovelace	FP8, 4th gen Tensor Cores
9.0	Hopper	Transformer Engine, FP8 matmul, DPX
10.0 / 12.0	Blackwell (datacenter / RTX 50 series)	5th gen Tensor Cores, FP4
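The feature thresholds from the tables in this section can be turned into a small lookup. This is a sketch: the dictionary keys are illustrative labels, not names from any library.

```python
# Minimum compute capability per feature, taken from the tables in this section.
MIN_CC = {
    "tensor_cores": (7, 0),
    "bf16": (8, 0),
    "flash_attention_2": (8, 0),
    "fp8": (8, 9),
}

def supported_features(cc: tuple[int, int]) -> list[str]:
    """Return the features whose minimum CC is at or below the given (major, minor)."""
    return [name for name, minimum in MIN_CC.items() if cc >= minimum]

print(supported_features((7, 5)))  # Turing
print(supported_features((8, 6)))  # RTX 3090
print(supported_features((8, 9)))  # RTX 4090
```

Comparing `(major, minor)` tuples rather than floats keeps the ordering correct if a minor version ever reaches two digits, and it matches what `torch.cuda.get_device_capability()` returns.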

3. Compilation targets

When you compile CUDA code (or when PyTorch ships prebuilt binaries), it targets specific CC versions:

# Compile for multiple architectures
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_86,code=sm_86 \
     -gencode arch=compute_89,code=sm_89 \
     my_kernel.cu

PyTorch wheels on PyPI typically include CC 5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 8.9, 9.0. If your GPU isn't covered, you may need to build from source.

How to Check Your GPU's CC

nvidia-smi (easiest, no CUDA toolkit needed)

nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Output: 8.6

Python (PyTorch)

import torch
major, minor = torch.cuda.get_device_capability()
print(f"Compute Capability: {major}.{minor}")

Python (TensorFlow)

import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    details = tf.config.experimental.get_device_details(gpu)
    print(details.get('compute_capability'))

C++ (CUDA Runtime)

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("CC: %d.%d\n", prop.major, prop.minor);

Lookup table

Don't have the GPU installed yet? The CUDA Compute Capability table on gpuark.com covers every NVIDIA GPU from Kepler to Blackwell.

Common CC-Related Errors and Fixes

"no kernel image is available for execution on the device"

Your PyTorch/TensorFlow binary wasn't compiled for your GPU's CC. Fix:

# Install PyTorch with the right CUDA version
pip install torch --index-url https://download.pytorch.org/whl/cu124

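Or build from source for your CC. PyPI has no source distribution of torch, so build from a checkout of the PyTorch repository:

git clone --recursive https://github.com/pytorch/pytorch && cd pytorch
TORCH_CUDA_ARCH_LIST="8.6" pip install -v --no-build-isolation -e .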

"minimum required Cuda capability is X.X"

Your GPU is too old for the framework version. Options:

Use an older framework version
Upgrade your GPU
Use CPU mode: CUDA_VISIBLE_DEVICES="" python train.py

Flash Attention requires CC ≥ 8.0

Flash Attention 2 only works on Ampere (RTX 3000) and newer. For older GPUs:

# Use xformers instead (supports CC ≥ 6.0)
pip install xformers
# Or use PyTorch's built-in SDPA
from torch.nn.functional import scaled_dot_product_attention

Practical Advice for GPU Shopping

When buying a GPU for ML:

Minimum CC 7.5 (Turing) for mixed precision training: it is the first GeForce generation with Tensor Cores (Volta, CC 7.0, introduced them on datacenter cards)
CC 8.0+ (Ampere) strongly recommended — BF16, Flash Attention, much better ML performance
CC 8.9 (Ada) for bleeding-edge features like FP8 quantization-aware training
VRAM matters more than CC in most cases — a 3090 (CC 8.6, 24GB) beats a 4070 (CC 8.9, 12GB) for LLMs

CC tells you what features your GPU supports. VRAM tells you how big a model fits. Both matter, but for LLM inference, VRAM is usually the bottleneck.

What GPU are you running your ML workloads on? Have you hit CC compatibility issues? Let me know in the comments!

How Much VRAM Do You Actually Need to Run LLMs Locally?

Max Vyaznikov — Thu, 12 Mar 2026 03:44:13 +0000

Running large language models locally has become increasingly practical — but figuring out exactly how much VRAM you need can be confusing. Here's a concrete breakdown.

The Simple Formula

For inference (running a model, not training):

VRAM ≈ Parameters × Bytes per Weight + KV Cache + Overhead

Where bytes per weight depends on quantization:

Precision	Bytes/Param	Example: 7B model
FP32	4.0	28 GB
FP16/BF16	2.0	14 GB
INT8 (Q8)	1.0	7 GB
INT4 (Q4_K_M)	0.56	~4 GB
INT4 (Q4_0)	0.5	3.5 GB

Add 10-20% overhead for KV cache (more for longer contexts) and runtime buffers.
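The formula and the bytes-per-weight table can be combined into a quick estimator. This is a minimal sketch: the function names are mine, and the flat 15% overhead is simply the midpoint of the 10-20% range, so long contexts will need more.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4_k_m": 0.56, "q4_0": 0.5}

def inference_vram_gb(params_billions: float, precision: str,
                      overhead: float = 0.15) -> float:
    """Weights plus a flat overhead fraction for KV cache and runtime buffers."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)

def fits(params_billions: float, precision: str, gpu_vram_gb: float) -> bool:
    """True if the estimate is within the GPU's VRAM."""
    return inference_vram_gb(params_billions, precision) <= gpu_vram_gb

print(f"7B  Q4_K_M: {inference_vram_gb(7, 'q4_k_m'):.1f} GB")
print(f"70B Q4_K_M: {inference_vram_gb(70, 'q4_k_m'):.1f} GB")
print("70B Q4_K_M on 48 GB:", fits(70, "q4_k_m", 48))
```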

Practical VRAM Requirements by Model

Here's what you can actually run on common GPUs:

8 GB VRAM (RTX 4060, RTX 3070)

Llama 3.1 8B at Q4_K_M ✅
Qwen2.5 7B at Q4_K_M ✅
Mistral 7B at Q5_K_M ✅
Phi-3.5 Mini (3.8B) at Q8 ✅
13B models at Q4 ⚠️ (tight, short context only)

12 GB VRAM (RTX 4070, RTX 3060 12GB)

13B models at Q4_K_M ✅
Llama 3.1 8B at Q8 ✅
Qwen2.5-Coder 14B at Q4_K_M ✅
20B models at Q4 ⚠️

16 GB VRAM (RTX 4080, RTX 5070 Ti)

Mistral Small 24B at Q4_K_M ✅
Qwen2.5-Coder 14B at Q6_K ✅
20B models at Q5-Q6 ✅
34B models at Q4 ⚠️

24 GB VRAM (RTX 3090, RTX 4090)

Llama 3.1 70B at Q4_K_M ⚠️ (with partial offload)
34B models at Q5-Q6 ✅
Qwen2.5 32B at Q5_K_M ✅
DeepSeek-Coder-V2-Lite 16B at Q8 ✅ (FP16 needs ~32 GB)
Mistral Small 24B at Q6_K ✅ (Q8 is ~24 GB of weights alone)

48 GB VRAM (2× RTX 3090, A6000)

Llama 3.1 70B at Q4_K_M ✅
DeepSeek V3 671B: not enough, even at Q2
Mixtral 8x22B at Q4 ⚠️ (~80 GB of weights; needs heavy CPU offload)

The Quantization Sweet Spot

Q4_K_M is the most popular quantization for local inference and for good reason:

Quality: ~1-2% degradation vs FP16 on most benchmarks
Size: ~28% of the FP16 size (0.56 vs 2 bytes per parameter), or ~56% of Q8
Speed: Fastest on most consumer GPUs (memory-bandwidth bound)

Going lower (Q3, Q2) introduces noticeable quality degradation, especially on reasoning tasks. Going higher (Q6, Q8) gives marginal quality improvement but costs significantly more VRAM.

What About Training?

Training needs much more memory than inference:

Training VRAM ≈ Model weights + Gradients + Optimizer states + Activations

For full fine-tuning with Adam optimizer at FP32:

Weights: 4 bytes/param
Gradients: 4 bytes/param
Adam states: 8 bytes/param
Total: ~16 bytes/param (before activations)

A 7B model needs ~112 GB for full FP32 training. That's why techniques like LoRA (which only trains ~1-2% of parameters) and QLoRA (quantized base + LoRA) are so popular:

QLoRA fine-tuning of 7B: ~6-8 GB VRAM
QLoRA fine-tuning of 13B: ~10-12 GB VRAM
QLoRA fine-tuning of 70B: ~40-48 GB VRAM
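The 16 bytes/param budget for full FP32 fine-tuning is easy to check for other model sizes. A small sketch (the function name is illustrative; activations are deliberately left out, as in the breakdown above):

```python
def full_finetune_gb(params_billions: float, weight_bytes: int = 4,
                     grad_bytes: int = 4, adam_bytes: int = 8) -> float:
    """Weights + gradients + Adam states in GB, before activations."""
    return params_billions * (weight_bytes + grad_bytes + adam_bytes)

print(f"7B full FP32 fine-tune: {full_finetune_gb(7):.0f} GB before activations")
print(f"13B: {full_finetune_gb(13):.0f} GB, 70B: {full_finetune_gb(70):.0f} GB")
```

Compared with the QLoRA figures in the list, the gap is more than an order of magnitude, which is why full fine-tuning stays in multi-GPU territory.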

KV Cache: The Hidden VRAM Consumer

When generating long texts, the KV cache grows with context length:

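KV cache ≈ 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element

Without grouped-query attention, num_kv_heads × head_dim equals the hidden size. Llama 3.1 8B uses grouped-query attention with 8 KV heads of dimension 128 across 32 layers, which gives:

For Llama 3.1 8B at FP16 with 8K context: ~1 GB
For Llama 3.1 8B at FP16 with 128K context: ~16 GB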

This is why you might load a model fine but run out of memory during long conversations.
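The two Llama 3.1 8B figures can be reproduced with a few lines. Note that this sketch uses the KV width (num_kv_heads × head_dim), which for grouped-query attention models is smaller than the hidden size; the layer and head counts below are the published Llama 3.1 8B configuration values, stated here as my assumption rather than taken from this article.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_length: int, bytes_per_element: int = 2) -> float:
    """KV cache size in GiB: keys and values (factor 2) for every layer and token."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_length * bytes_per_element)
    return total_bytes / 2**30

# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
for context in (8_192, 131_072):
    size = kv_cache_gib(32, 8, 128, context)
    print(f"{context:>7} tokens: {size:.1f} GiB")
```

Plugging in the full hidden size (4096) instead of the KV width would give four times these values, which is what older models without grouped-query attention actually pay.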

Tools for Estimating

Rather than doing this math by hand every time, there's a VRAM calculator that estimates memory requirements — plug in the model size, quantization level, and context length to see if it fits your GPU.

Bottom Line

Budget	Best GPU	What You Can Run
~$300	RTX 4060 8GB	7-8B models at Q4
~$400	RTX 4060 Ti 16GB	Up to 24B at Q4
~$600	Used RTX 3090 24GB	Up to 34B at Q5, 70B at Q3 with partial offload
~$1200	2× Used RTX 3090	70B at Q4, most models comfortably
~$1800	RTX 4090 24GB	Same models as the 3090; ~1.2× faster generation, ~1.8× faster training

The most cost-effective option for serious local LLM use in 2025-2026 is still a used RTX 3090 — 24 GB of VRAM at a fraction of the 4090 price.

What's your local LLM setup? Drop a comment with your GPU and favorite model!