Max Vyaznikov

Posted on May 26

20 Years of GPUs in Numbers: How FLOPS and TDP Grew, and Who Led the NVIDIA vs AMD Duel (+ open dataset of 13,500 GPUs)

#gpu #machinelearning #hardware #datascience

We run a GPU catalog and have built up a database of 13,566 GPUs, from the GeForce 256 (1999) to Blackwell and the MI355X (2025). At some point the interesting question stopped being "which card is faster" and became how the whole industry shifted: how much FLOPS grew, where TDP hit a wall, and who led the NVIDIA-vs-AMD race in different years.

Below is a breakdown from our own data. Two things I'll put on the table right away: the methodology (what I measured and how, where the data is noisy) and an open dataset at the end of the article — grab it and dig in with us 😊

TL;DR

Peak FP32 of the flagship grew ~400× in 19 years: 0.3 TFLOPS (GeForce 8800 GTX, 2006) → 126 TFLOPS (Blackwell, 2025). It's an almost perfectly straight line on a semi-log scale.
TDP crept up slowly (155 → 300 W over 2006–2020), then exploded in the datacenter: 700 W (H100), 1000 W (MI325X / B200), 1400 W (MI355X, 2025).
Yet performance per watt grew ~100×: cards draw more power, but do far more work per watt. The main driver is the process node (90 nm → 3 nm) plus architecture.
The NVIDIA/AMD duel by peak FP32 moved in waves: AMD led in the early 2010s (GCN era) and again in 2023–24 (Instinct MI300/MI325), NVIDIA in 2016–2020 (the AI pivot) and in 2025 (Blackwell). But "raw FP32" is a misleading metric — more on that below.

Methodology

What these TFLOPS are and why they're "theoretical." Every FP32 number in this article is the theoretical peak that vendors compute with the formula:

FP32 TFLOPS = shader ALUs (CUDA cores on NVIDIA) × boost clock (Hz) × 2 / 10^12

The ×2 is there because an FMA (fused multiply-add) does a multiply and an add in one cycle, which counts as two operations. This is a ceiling, not real-world throughput. In practice you reach noticeably less: typically 60–90% on well-optimized compute-bound kernels and a fraction of that on memory-bound ones. Memory bandwidth, SM occupancy, and instruction mix all get in the way, and boost clocks don't hold under sustained load and thermal limits. Theory diverging from practice is normal. The theoretical peak is valuable for a different reason: it's computed by one formula across every card and generation, so it's a fair, comparable yardstick for a historical look. That's what spec sheets list, and that's what we use. Real performance is measured with benchmarks (they're a separate table in the dataset).
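As a minimal sketch, the formula above translates directly into a small helper. The function name and the example inputs (16,384 ALUs at a 2.52 GHz boost clock) are illustrative assumptions, not values read from the dataset.

```python
def peak_fp32_tflops(shader_alus: int, boost_clock_hz: float) -> float:
    """Theoretical peak FP32 in TFLOPS: ALUs x boost clock x 2 ops per FMA."""
    return shader_alus * boost_clock_hz * 2 / 1e12


# Illustrative inputs (an assumption for this example, not a dataset row):
# 16,384 shader ALUs at a 2.52 GHz boost clock.
example = peak_fp32_tflops(16_384, 2.52e9)
print(f"{example:.1f} TFLOPS theoretical peak")
```

This is the spec-sheet ceiling only; it says nothing about sustained clocks or memory-bound workloads.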

The source is our specification database. "Flagship of the year" = the card with the maximum fp32_performance released that year, tracked separately for NVIDIA and AMD.
For the TDP/efficiency curves I excluded dual-GPU cards (GTX 295, HD 6990, R9 295X2, etc.) — otherwise TDP and FLOPS double up and break the trend.
Where the data is noisy: vendor is filled in for ~2,360 of 13,566 cards (the rest are mostly OEM partner-board variants). Medians use the labeled subset; flagship peaks are fully labeled. FP16/tensor performance is also not directly comparable between vendors, because of structured sparsity. Starting with Ampere (A100), NVIDIA quotes tensor FP16/BF16 in its spec sheets with sparsity already applied, which is 2× the dense value (the feature processes suitably sparse matrices twice as fast). Our database stores exactly this "sparse" figure for such cards. AMD has no equivalent spec line, so its figures are dense. NVIDIA's raw FP16 column (A100+) therefore has to be halved to compare fairly with AMD: A100 = 624 (sparse) → 312 dense, H100 = 1979 → ~990 dense. The "AI inflection" part below relies on these dense-normalized numbers.
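The dense normalization described above can be sketched as follows. The function name and the `quoted_with_sparsity` flag are hypothetical: the dataset may not carry such a flag, in which case the caller has to decide which cards (A100 and later NVIDIA parts) it applies to.

```python
def dense_fp16_tflops(vendor: str, spec_fp16_tflops: float,
                      quoted_with_sparsity: bool) -> float:
    """Return a dense FP16 figure that is comparable across vendors.

    Halves the spec value only for NVIDIA cards whose spec-sheet number
    already includes the 2x structured-sparsity factor.
    """
    if vendor == "NVIDIA" and quoted_with_sparsity:
        return spec_fp16_tflops / 2
    return spec_fp16_tflops


# Spec-sheet figures quoted in the article.
a100_dense = dense_fp16_tflops("NVIDIA", 624, quoted_with_sparsity=True)
h100_dense = dense_fp16_tflops("NVIDIA", 1979, quoted_with_sparsity=True)
print(a100_dense, h100_dense)
```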

1. FLOPS: an almost perfectly straight exponential

Peak FP32 of the single flagship by year (NVIDIA):

Year	Flagship	FP32, TFLOPS
2006	GeForce 8800 GTX	0.3
2010	GeForce GTX 580	1.6
2013	GeForce GTX 780 Ti	5.3
2016	Quadro P6000	12.6
2017	Tesla V100	15.7
2020	RTX A6000	38.7
2022	L40S	91.6
2025	RTX PRO 6000 Blackwell	126.0

≈400× in 19 years is a CAGR of about 37% per year. On a semi-log scale the line is almost straight: a classic exponential that has only recently started to bend in the desktop segment, while the growth has moved into the datacenter.
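For anyone who wants to check the growth rate, here is a minimal sketch of the CAGR arithmetic using the first and last rows of the table above. The doubling-time calculation is my own addition for illustration.

```python
import math


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Endpoints from the table: 0.3 TFLOPS (2006) and 126.0 TFLOPS (2025).
growth = cagr(0.3, 126.0, 2025 - 2006)

# Years needed to double at that constant rate.
doubling_years = math.log(2) / math.log(1 + growth)

print(f"CAGR: {growth:.1%}, doubling every {doubling_years:.1f} years")
```

With these endpoints the rate comes out at roughly 37% per year, which corresponds to peak FP32 doubling a little more often than every two and a half years.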

2. TDP: a quiet climb, then a datacenter explosion

Year	Card	TDP, W
2006	GeForce 8800 GTX	155
2010	GTX 580	244
2017	Tesla V100	250
2020	RTX A6000	300
2022	H100 SXM	700
2024	MI325X / B200	1000
2025	MI355X	1400

For a decade and a half the flagship TDP stayed in a 150–300 W band. The break comes after 2020, and it's entirely datacenter-driven: AI accelerators (SXM/OAM modules) shot up to 700–1400 W because they're cooled by liquid in a rack, not by a fan in a case. The desktop ceiling separately hit ~450–600 W (RTX 4090/5090).

There's a curious gap if you look at NVIDIA's consumer flagships separately: the GeForce flagship sat at exactly 250 W for seven years (2013–2019) — GTX 780 Ti, Titan X, 1080 Ti, 2080 Ti — and only broke that ceiling with the RTX 3090 (350 W, 2020), then 4090 (450 W) and 5090 (575 W). Datacenter accelerators, by contrast, went to 700–1400 W almost immediately. It looks like what capped gaming TDP wasn't the silicon so much as the market — cases, PSUs, and buyer habits; in a rack there are no such limits, and watts grew without looking back. (This is interpretation: the spec stores watts, not intentions — but a 250 W plateau across seven generations shows up clearly in the data.)

3. Performance per watt: this is where the progress is

If you only look at TDP, it feels like "everything's getting worse, cards guzzle power." But FP32 per watt tells the opposite story:

Year	Flagship	TFLOPS/W
2006	8800 GTX	0.002
2013	GTX 780 Ti	0.021
2016	Quadro P6000	0.051
2020	RTX A6000	0.129
2022	L40S	0.262
2025	RTX PRO 6000 Blackwell	0.21

~100× in efficiency. Peak "classic" efficiency lands in 2022 (Ada/L40S); the 2024–25 datacenter cards sometimes lose on TFLOPS/W because they deliberately trade efficiency for absolute compute density in the rack. The main drivers of efficiency gains are the process node (90 nm → 3 nm) and architectural improvements, not clocks.
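The TFLOPS/W column is just peak FP32 divided by TDP. A minimal sketch, restricted to the three cards for which both numbers appear in this article:

```python
def tflops_per_watt(fp32_tflops: float, tdp_watts: float) -> float:
    """Peak FP32 efficiency: theoretical TFLOPS divided by board TDP."""
    return fp32_tflops / tdp_watts


# (year, card, peak FP32 in TFLOPS, TDP in W), all taken from the text above.
flagships = [
    (2006, "GeForce 8800 GTX", 0.3, 155),
    (2013, "GeForce GTX 780 Ti", 5.3, 250),
    (2020, "RTX A6000", 38.7, 300),
]

efficiency = {
    name: round(tflops_per_watt(tflops, tdp), 3)
    for _, name, tflops, tdp in flagships
}

for name, value in efficiency.items():
    print(f"{name}: {value} TFLOPS/W")
```

Because both inputs are spec-sheet values, this measures theoretical efficiency, not measured power draw under load.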

4. The NVIDIA vs AMD duel

If you mark, year by year, whose single flagship had the higher FP32:

Period	Leader	Context
2007–2008	AMD	FireStream 9170/9270
2010–2013	AMD	TeraScale → GCN: HD 6970, HD 7970 GHz, R9 290X
2014	NVIDIA	Titan Black (5.6) vs FirePro W9100 (5.2)
2015	AMD	Fury X (8.6)
2016–2020	NVIDIA	Pascal → Ampere, the AI pivot
2021	AMD	Instinct MI250X (47.9)
2022	NVIDIA	L40S / Hopper
2023–2024	AMD	Instinct MI300A/MI325X (81.7)
2025	NVIDIA	Blackwell (126)
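The year-by-year marking in the table above can be sketched like this: take each vendor's best card per year, then compare the two. The function name and the tuple layout are assumptions for illustration; the two 2014 flagship values come from the table, and the third row is a made-up toy entry that only shows lower-tier cards being ignored.

```python
from collections import defaultdict


def leaders_by_year(cards):
    """Map year -> vendor whose single best card has the highest FP32.

    `cards` is an iterable of (year, vendor, name, fp32_tflops).
    Years where only one vendor has data are skipped.
    """
    best = defaultdict(dict)  # year -> vendor -> (name, fp32_tflops)
    for year, vendor, name, fp32_tflops in cards:
        current = best[year].get(vendor)
        if current is None or fp32_tflops > current[1]:
            best[year][vendor] = (name, fp32_tflops)

    leaders = {}
    for year in sorted(best):
        per_vendor = best[year]
        if len(per_vendor) < 2:
            continue  # nothing to compare against
        leaders[year] = max(per_vendor, key=lambda v: per_vendor[v][1])
    return leaders


cards = [
    (2014, "NVIDIA", "Titan Black", 5.6),
    (2014, "AMD", "FirePro W9100", 5.2),
    (2014, "AMD", "toy mid-range card", 2.0),  # hypothetical, not a real card
]
leaders = leaders_by_year(cards)
print(leaders)
```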

The picture is wavy, and I included it mostly for the intrigue, to give AMD at least a fighting chance: on raw FP32, AMD took the lead regularly, in the GCN era and again on recent Instinct parts. But raw FP32 is exactly the metric that misleads today. The AI era is won not on FP32, but on software and FP16/BF16/FP8. Here NVIDIA, with tensor cores (since V100, 2017) and the CUDA ecosystem, built a moat that the FP32 numbers alone don't reveal: V100 delivered ~125 TFLOPS of dense tensor FP16, A100 ~312, H100 ~990 (vendor public data). In other words, the "FP32 duel" is about the past, the GPU as a graphics accelerator; the real battle has moved to a plane FP32 doesn't measure.

So, here's one more chart — the FP16 duel, where NVIDIA is consistently ahead. And once you layer the AI software stack on top of that…

5. What else the data shows

Process node: 90 nm (2006) → 28 nm (a 2012–2015 plateau, the "stuck node") → 16/12/7 → 3 nm (MI355X, 2025).
Flagship VRAM: 0.77 GB (8800 GTX) → 12–24 GB (mid-2010s) → 48 GB (A6000) → 192–288 GB (MI300/MI355X). That is ≈375×, roughly in step with compute (≈400×); the recent jump is driven by AI models, which are bottlenecked on memory.
The "stuck" 28 nm: for four years (2012–2015) the industry sat on one node — and that's exactly when AMD held parity/leadership on FP32. As soon as the process-node sprint resumed and tensor cores appeared, the advantage swung to NVIDIA.

Open dataset — take it

We've published a cleaned dump of our GPU spec database for anyone who wants to dig in themselves:

📦 Download: gpuark.com/datasets — the files gpuark-gpu-specs.csv, gpuark-benchmarks.csv, gpuark-gpu-dataset.sqlite, or everything in a single gpuark-gpu-dataset.tar.gz archive.

13,566 GPUs (fields: vendor, manufacturer, release date, architecture, process node, transistors, clocks, memory size and type, bus, FP16/FP32/FP64/BF16/TF32/INT8, TDP, NVLink, CUDA SM, and more) + 993 third-party benchmark results (join on gpu_id).
Formats: CSV (Excel/pandas) and SQLite (ready-made SQL) — two tables, gpu_specs and benchmarks.
License: CC BY 4.0 (attribution to gpuark.com).
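A minimal sketch of the join on gpu_id, using an in-memory SQLite database with toy rows so it runs without the download. The table names gpu_specs and benchmarks and the columns gpu_id and fp32_performance are mentioned in this article; the columns name, benchmark_name, and score are assumptions, so check the real schema before reusing the query.

```python
import sqlite3

# Toy stand-in for gpuark-gpu-dataset.sqlite; column names other than
# gpu_id and fp32_performance are assumed, and all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gpu_specs (
    gpu_id INTEGER PRIMARY KEY,
    name TEXT,
    fp32_performance REAL
);
CREATE TABLE benchmarks (
    gpu_id INTEGER,
    benchmark_name TEXT,
    score REAL
);
INSERT INTO gpu_specs VALUES (1, 'toy-card-a', 10.0), (2, 'toy-card-b', 20.0);
INSERT INTO benchmarks VALUES (1, 'toy-bench', 7.5), (2, 'toy-bench', 12.0);
""")

# Put the theoretical peak next to the measured result for each card.
rows = conn.execute("""
SELECT s.name, s.fp32_performance, b.score
FROM gpu_specs AS s
JOIN benchmarks AS b ON b.gpu_id = s.gpu_id
ORDER BY s.fp32_performance DESC
""").fetchall()

for name, peak, score in rows:
    print(name, peak, score)

conn.close()
```

To run it against the real file, replace ":memory:" with the path to the downloaded SQLite database and drop the CREATE/INSERT statements.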

If you'd rather explore interactively before downloading, the same data powers the GPU comparison tool on the site.

Takeaways

FLOPS grew as an almost perfect exponential (~37%/yr) — but the "free" growth is over; from here we pay with TDP and a move into the rack.
Real progress is measured not in watts and not in raw FP32, but in performance per watt (×100) — and that rides on the process node.
AMD fought and led on the "raw" numbers more often than people think; but the AI era was defined by tensor + software, not FP32.

The data is open — if you find something in it we missed, let me know.

DEV Community