We run a GPU catalog and have built up a database of 13,566 GPUs — from the GeForce 256 (1999) to Blackwell and the MI355X (2025). At some point it got interesting to look not at "which card is faster," but at how the whole industry shifted: how much FLOPS grew, where TDP hit a wall, and who led the NVIDIA-vs-AMD race in different years.
Below is a breakdown from our own data. Two things I'll put on the table right away: the methodology (what I measured and how, where the data is noisy) and an open dataset at the end of the article — grab it and dig in with us 😊
TL;DR
- Peak FP32 of the flagship grew ~400× in 19 years: 0.3 TFLOPS (GeForce 8800 GTX, 2006) → 126 TFLOPS (Blackwell, 2025). It's an almost perfectly straight line on a semi-log scale.
- TDP crept up slowly (155 → 300 W over 2006–2020), then exploded in the datacenter: 700 W (H100), 1000 W (MI325X / B200), 1400 W (MI355X, 2025).
- Yet performance per watt grew ~100× — they "draw more," but "do far more per watt." The main driver is the process node (90 nm → 3 nm) plus architecture.
- The NVIDIA/AMD duel by peak FP32 moved in waves: AMD led in the early 2010s (late TeraScale and GCN) and again in 2023–24 (Instinct MI300/MI325), NVIDIA in 2016–2020 (the AI pivot) and in 2025 (Blackwell). But "raw FP32" is a misleading metric; more on that below.
Methodology
- What these TFLOPS are and why they're "theoretical." Every FP32 number in this article is the theoretical peak that vendors compute with the formula:
FP32 TFLOPS = shader ALUs (CUDA cores / stream processors) × boost clock (Hz) × 2 / 10^12
The ×2 is there because an FMA (fused multiply-add) counts as two operations, a multiply and an add, issued in one cycle. This is a ceiling, not real-world throughput. In practice you reach noticeably less, typically 60–90% on well-optimized compute-bound kernels and a fraction of that on memory-bound ones: memory bandwidth, SM occupancy, and instruction mix all get in the way, and boost clocks don't hold under sustained load and thermal limits. Theory diverging from practice is normal. The theoretical peak is valuable for a different reason: it's computed by one formula across every card and generation, so it's a fair, comparable yardstick for a historical look. It's what spec sheets list, and what we use. Real performance is measured with benchmarks (they're a separate table in the dataset).
- The source is our specification database. "Flagship of the year" = the card with the maximum `fp32_performance` released that year, tracked separately for NVIDIA and AMD.
- For the TDP/efficiency curves I excluded dual-GPU cards (GTX 295, HD 6990, R9 295X2, etc.); otherwise TDP and FLOPS double up and break the trend.
- Where the data is noisy: `vendor` is filled in for ~2,360 of 13,566 cards (the rest are mostly OEM partner-board variants). Medians use the labeled subset; flagship peaks are fully labeled. FP16/tensor performance is also not directly comparable between vendors, because of structured sparsity. Starting with Ampere (A100), NVIDIA quotes tensor FP16/BF16 in its spec sheets with sparsity already applied, which is 2× the dense value (the feature processes sparse matrices twice as fast). Our database stores exactly this "sparse" figure for such cards. AMD has no equivalent spec line, so its figures are dense. NVIDIA's raw FP16 column (A100+) therefore has to be halved to compare fairly with AMD: A100 = 624 (sparse) → 312 dense, H100 = 1979 → ~990 dense. The "AI inflection" part below relies on these dense-normalized numbers.
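The peak formula from the methodology is easy to check by hand. Here is a minimal Python sketch: the function name is mine, and the core counts and clocks for the GTX 580 and GTX 780 Ti are the commonly published specs (an assumption, not values pulled from the dataset).

```python
def peak_fp32_tflops(shader_alus: int, boost_clock_hz: float) -> float:
    """Theoretical peak: each ALU issues one FMA (2 FLOPs) per clock."""
    return shader_alus * boost_clock_hz * 2 / 1e12


# Assumed public specs: (ALU count, clock in Hz).
gtx_580 = peak_fp32_tflops(512, 1.544e9)      # 1544 MHz shader clock
gtx_780_ti = peak_fp32_tflops(2880, 0.928e9)  # 928 MHz boost clock

print(f"GTX 580:    {gtx_580:.2f} TFLOPS")     # 1.58
print(f"GTX 780 Ti: {gtx_780_ti:.2f} TFLOPS")  # 5.35
```

Rounded to one decimal, both values match the 1.6 and 5.3 TFLOPS listed for these cards in the flagship table.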
1. FLOPS: an almost perfectly straight exponential
Peak FP32 of the single flagship by year (NVIDIA):
| Year | Flagship | FP32, TFLOPS |
|---|---|---|
| 2006 | GeForce 8800 GTX | 0.3 |
| 2010 | GeForce GTX 580 | 1.6 |
| 2013 | GeForce GTX 780 Ti | 5.3 |
| 2016 | Quadro P6000 | 12.6 |
| 2017 | Tesla V100 | 15.7 |
| 2020 | RTX A6000 | 38.7 |
| 2022 | L40S | 91.6 |
| 2025 | RTX PRO 6000 Blackwell | 126.0 |
≈400× in 19 years is a CAGR of about 37% per year. On a semi-log scale the line is almost straight: a classic exponential that has only recently started to bend in the "desktop" segment, while the growth has moved into the datacenter.
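Both claims, the ~37% CAGR and the "almost straight line on a semi-log scale", can be checked from the eight table rows alone. The sketch below computes the endpoint CAGR and then fits a least-squares line through (year, log10 TFLOPS). It uses only the rounded table values, so treat the results as approximate.

```python
import math

# (year, peak FP32 in TFLOPS) from the NVIDIA flagship table above
flagships = [
    (2006, 0.3), (2010, 1.6), (2013, 5.3), (2016, 12.6),
    (2017, 15.7), (2020, 38.7), (2022, 91.6), (2025, 126.0),
]

# Endpoint CAGR: (last / first) ** (1 / years) - 1
(first_year, first), (last_year, last) = flagships[0], flagships[-1]
growth = last / first
cagr = growth ** (1 / (last_year - first_year)) - 1
doubling_years = math.log(2) / math.log(1 + cagr)

# Least-squares line through (year, log10(TFLOPS)): a straight line
# here means exponential growth in linear terms.
xs = [year for year, _ in flagships]
ys = [math.log10(tflops) for _, tflops in flagships]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx                     # log10 units per year
fit_rate = 10 ** slope - 1            # annual growth implied by the fit
r_squared = sxy ** 2 / (sxx * syy)    # 1.0 would be a perfect line

print(f"growth: {growth:.0f}x, CAGR: {cagr:.1%}, doubling every {doubling_years:.1f} years")
print(f"fit: {fit_rate:.1%} per year, R^2 = {r_squared:.3f}")
```

The fit uses every row rather than just the two endpoints, so it is less sensitive to which card happens to be the flagship in 2006 or 2025.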
2. TDP: a quiet climb, then a datacenter explosion
| Year | Card | TDP, W |
|---|---|---|
| 2006 | GeForce 8800 GTX | 155 |
| 2010 | GTX 580 | 244 |
| 2017 | Tesla V100 | 250 |
| 2020 | RTX A6000 | 300 |
| 2022 | H100 SXM | 700 |
| 2024 | MI325X / B200 | 1000 |
| 2025 | MI355X | 1400 |
For a decade and a half the flagship TDP stayed in a 150–300 W band. The break comes after 2020, and it's entirely datacenter-driven: AI accelerators (SXM/OAM modules) shot up to 700–1400 W because they're cooled by liquid in a rack, not by a fan in a case. The desktop ceiling separately hit ~450–600 W (RTX 4090/5090).
There's a curious gap if you look at NVIDIA's consumer flagships separately: the GeForce flagship sat at exactly 250 W for seven years (2013–2019: GTX 780 Ti, Titan X, 1080 Ti, 2080 Ti) and only broke that ceiling with the RTX 3090 (350 W, 2020), then the 4090 (450 W) and 5090 (575 W). Datacenter accelerators, by contrast, went to 700–1400 W almost immediately. It looks like what capped gaming TDP wasn't the silicon so much as the market: cases, PSUs, and buyer habits. In a rack there are no such limits, and watts grew without looking back. (This is interpretation: the spec stores watts, not intentions. But a 250 W plateau across seven years and four architectures shows up clearly in the data.)
3. Performance per watt: this is where the progress is
If you only look at TDP, it feels like "everything's getting worse, cards guzzle power." But FP32 per watt tells the opposite story:
| Year | Flagship | TFLOPS/W |
|---|---|---|
| 2006 | 8800 GTX | 0.002 |
| 2013 | GTX 780 Ti | 0.021 |
| 2016 | Quadro P6000 | 0.051 |
| 2020 | RTX A6000 | 0.129 |
| 2022 | L40S | 0.262 |
| 2025 | RTX PRO 6000 Blackwell | 0.21 |
~100× in efficiency. Peak "classic" efficiency lands in 2022 (Ada/L40S); the 2024–25 datacenter cards sometimes lose on TFLOPS/W because they deliberately trade efficiency for absolute compute density in the rack. The main drivers of efficiency gains are the process node (90 nm → 3 nm) and architectural improvements, not clocks.
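The efficiency column is just peak FP32 divided by TDP, so it can be recomputed from the two earlier tables. In this sketch the TDPs for the Quadro P6000 (250 W), L40S (350 W), and RTX PRO 6000 Blackwell (600 W) are assumptions back-derived from the TFLOPS/W values, since the TDP table does not list those cards.

```python
# (year, card, peak FP32 TFLOPS, TDP in W)
cards = [
    (2006, "GeForce 8800 GTX", 0.3, 155),
    (2013, "GeForce GTX 780 Ti", 5.3, 250),
    (2016, "Quadro P6000", 12.6, 250),             # TDP assumed
    (2020, "RTX A6000", 38.7, 300),
    (2022, "L40S", 91.6, 350),                     # TDP assumed
    (2025, "RTX PRO 6000 Blackwell", 126.0, 600),  # TDP assumed
]

# Efficiency in TFLOPS per watt for each card
efficiency = {name: tflops / tdp for _, name, tflops, tdp in cards}

baseline = efficiency["GeForce 8800 GTX"]
most_efficient = max(efficiency, key=efficiency.get)
gain_to_2025 = efficiency["RTX PRO 6000 Blackwell"] / baseline
gain_to_peak = efficiency[most_efficient] / baseline

for year, name, _, _ in cards:
    print(f"{year}  {name:<24} {efficiency[name]:.3f} TFLOPS/W")
print(f"2006 -> 2025: {gain_to_2025:.0f}x; 2006 -> {most_efficient}: {gain_to_peak:.0f}x")
```

Because the inputs are the rounded table values, the last digit can differ slightly from the table (the P6000 comes out at 0.050 rather than 0.051).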
4. The NVIDIA vs AMD duel
If you mark, year by year, whose single flagship had the higher FP32:
| Period | Leader | Context |
|---|---|---|
| 2007–2008 | AMD | FireStream 9170/9270 |
| 2010–2013 | AMD | TeraScale → GCN: HD 6970, HD 7970 GHz, R9 290X |
| 2014 | NVIDIA | Titan Black (5.6) vs FirePro W9100 (5.2) |
| 2015 | AMD | Fury X (8.6) |
| 2016–2020 | NVIDIA | Pascal → Ampere, the AI pivot |
| 2021 | AMD | Instinct MI250X (47.9) |
| 2022 | NVIDIA | L40S / Hopper |
| 2023–2024 | AMD | Instinct MI300A/MI325X (81.7) |
| 2025 | NVIDIA | Blackwell (126) |
The picture is wavy, and I included it mostly for the intrigue, to give AMD at least a fighting chance: on raw FP32, AMD took the lead regularly, in the GCN era and again on recent Instinct parts. But raw FP32 is exactly the deceptive metric for today's world. The AI era is won not on FP32, but on software and FP16/BF16/FP8. Here NVIDIA, with tensor cores (since V100, 2017) and the CUDA ecosystem, built a moat that the FP32 numbers alone don't reveal: V100 delivered ~125 TFLOPS tensor-FP16, A100 ~312, H100 ~990 (dense values, vendor public data). In other words, the "FP32 duel" is about the past, the GPU as a graphics accelerator; the real battle has moved to a plane FP32 doesn't measure.
So, here's one more chart — the FP16 duel, where NVIDIA is consistently ahead. And once you layer the AI software stack on top of that…
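For the FP16 comparison to be fair, NVIDIA's spec-sheet figures have to be put on a dense basis first. A minimal sketch of that normalization, assuming (as the methodology states) that Ampere-and-later NVIDIA tensor figures are stored with sparsity applied, and that the V100 figure is already dense because it predates the feature. The function and variable names are mine.

```python
def dense_tflops(spec_sheet_tflops: float, quoted_with_sparsity: bool) -> float:
    """Halve a figure quoted with structured sparsity (2x) to get the dense value."""
    return spec_sheet_tflops / 2 if quoted_with_sparsity else spec_sheet_tflops


# (card, spec-sheet tensor FP16 TFLOPS, quoted with sparsity?)
spec_sheet = [
    ("V100", 125.0, False),  # pre-Ampere: already a dense figure
    ("A100", 624.0, True),
    ("H100", 1979.0, True),
]

dense = {card: dense_tflops(value, sparse) for card, value, sparse in spec_sheet}

for card, value in dense.items():
    print(f"{card}: {value:.1f} dense TFLOPS")
```

Only after this step is it meaningful to put the NVIDIA column next to AMD's, which the database stores as dense.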
5. What else the data shows
- Process node: 90 nm (2006) → 28 nm (a 2012–2015 plateau, the "stuck node") → 16/12/7 → 3 nm (MI355X, 2025).
- Flagship VRAM: 0.77 GB (8800 GTX) → 12–24 GB (mid-2010s) → 48 GB (A6000) → 192–288 GB (MI300/MI355X). That is roughly 375× from 0.77 GB to 288 GB, about the same pace as peak FP32 (~400×), because AI models are bottlenecked on memory.
- The "stuck" 28 nm: for four years (2012–2015) the industry sat on one node — and that's exactly when AMD held parity/leadership on FP32. As soon as the process-node sprint resumed and tensor cores appeared, the advantage swung to NVIDIA.
Open dataset — take it
We've published a cleaned dump of our GPU spec database for anyone who wants to dig in themselves:
📦 Download: gpuark.com/datasets — the files gpuark-gpu-specs.csv, gpuark-benchmarks.csv, gpuark-gpu-dataset.sqlite, or everything in a single gpuark-gpu-dataset.tar.gz archive.
- 13,566 GPUs (fields: vendor, manufacturer, release date, architecture, process node, transistors, clocks, memory size and type, bus, FP16/FP32/FP64/BF16/TF32/INT8, TDP, NVLink, CUDA SM, and more) + 993 third-party benchmark results (join on `gpu_id`).
- Formats: CSV (Excel/pandas) and SQLite (ready-made SQL), two tables: `gpu_specs` and `benchmarks`.
- License: CC BY 4.0 (attribution to gpuark.com).
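As a starting point for the SQLite file, here is a sketch of the "flagship of the year" query and the `gpu_id` join. It runs against a tiny in-memory stand-in rather than the real file. The table names and the `gpu_id`, `vendor`, and `fp32_performance` columns come from the description above; `name`, `release_year`, and `benchmark_name` are hypothetical column names, and the benchmark rows and the unlabeled board are placeholders, so check the actual schema before reusing this.

```python
import sqlite3

# In-memory stand-in for gpuark-gpu-dataset.sqlite
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gpu_specs (
        gpu_id INTEGER PRIMARY KEY,
        name TEXT,                 -- hypothetical column name
        vendor TEXT,               -- NULL for unlabeled cards
        release_year INTEGER,      -- hypothetical column name
        fp32_performance REAL      -- peak FP32, TFLOPS
    );
    CREATE TABLE benchmarks (
        gpu_id INTEGER,
        benchmark_name TEXT        -- hypothetical column name
    );
""")
conn.executemany("INSERT INTO gpu_specs VALUES (?, ?, ?, ?, ?)", [
    (1, "GeForce GTX Titan Black", "NVIDIA", 2014, 5.6),
    (2, "FirePro W9100", "AMD", 2014, 5.2),
    (3, "Radeon R9 Fury X", "AMD", 2015, 8.6),
    (4, "Unlabeled partner board", None, 2014, 5.0),  # placeholder row
])
conn.executemany("INSERT INTO benchmarks VALUES (?, ?)", [
    (1, "bench_a"), (1, "bench_b"), (2, "bench_a"),   # placeholder rows
])

# Flagship per (year, vendor). SQLite fills the bare `name` column
# from the row that holds MAX(); other engines need a window function.
flagships = conn.execute("""
    SELECT release_year, vendor, name, MAX(fp32_performance)
    FROM gpu_specs
    WHERE vendor IS NOT NULL
    GROUP BY release_year, vendor
    ORDER BY release_year, vendor
""").fetchall()

# Benchmark results available per card, joined on gpu_id
bench_counts = conn.execute("""
    SELECT s.name, COUNT(b.gpu_id)
    FROM gpu_specs AS s
    LEFT JOIN benchmarks AS b ON b.gpu_id = s.gpu_id
    GROUP BY s.gpu_id
    ORDER BY s.gpu_id
""").fetchall()

for row in flagships:
    print(row)
print(bench_counts)
```

To run it against the download, replace `":memory:"` with the path to the `.sqlite` file and drop the `CREATE`/`INSERT` statements. The `WHERE vendor IS NOT NULL` filter mirrors the labeled-subset caveat from the methodology.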
If you'd rather explore interactively before downloading, the same data powers the GPU comparison tool on the site.
Takeaways
- FLOPS grew as an almost perfect exponential (~37%/yr) — but the "free" growth is over; from here we pay with TDP and a move into the rack.
- Real progress is measured not in watts and not in raw FP32, but in performance per watt (×100) — and that rides on the process node.
- AMD fought and led on the "raw" numbers more often than people think; but the AI era was defined by tensor + software, not FP32.
The data is open — if you find something in it we missed, let me know.




