Max Vyaznikov

20 Years of GPUs in Numbers: How FLOPS and TDP Grew, and Who Led the NVIDIA vs AMD Duel (+ open dataset of 13,500 GPUs)

We run a GPU catalog and have built up a database of 13,566 GPUs — from the GeForce 256 (1999) to Blackwell and the MI355X (2025). At some point it got interesting to look not at "which card is faster," but at how the whole industry shifted: how much FLOPS grew, where TDP hit a wall, and who led the NVIDIA-vs-AMD race in different years.

Below is a breakdown from our own data. Two things I'll put on the table right away: the methodology (what I measured and how, where the data is noisy) and an open dataset at the end of the article — grab it and dig in with us 😊

TL;DR

  • Peak FP32 of the flagship grew ~400× in 19 years: 0.3 TFLOPS (GeForce 8800 GTX, 2006) → 126 TFLOPS (Blackwell, 2025). It's an almost perfectly straight line on a semi-log scale.
  • TDP crept up slowly (155 → 300 W over 2006–2020), then exploded in the datacenter: 700 W (H100), 1000 W (MI325X / B200), 1400 W (MI355X, 2025).
  • Yet performance per watt grew ~100×: cards draw more power, but do far more work per watt. The main driver is the process node (90 nm → 3 nm) plus architecture.
  • The NVIDIA/AMD duel by peak FP32 moved in waves: AMD led in the early 2010s (GCN era) and again in 2023–24 (Instinct MI300/MI325), NVIDIA in 2016–2020 (the AI pivot) and in 2025 (Blackwell). But "raw FP32" is a misleading metric — more on that below.

Methodology

  • What these TFLOPS are and why they're "theoretical." Every FP32 number in this article is the theoretical peak that vendors compute with the formula:
FP32 TFLOPS = shader ALUs (CUDA cores or stream processors) × boost clock (Hz) × 2 / 10^12

The ×2 is there because each ALU can issue one FMA (fused multiply-add) per cycle, and an FMA counts as two operations: a multiply and an add. This is a ceiling, not real-world throughput. In practice you reach noticeably less: typically 60–90% on well-optimized compute-bound kernels and a fraction of that on memory-bound ones. Memory bandwidth, SM occupancy, instruction mix, and boost clocks that don't hold under sustained load and thermal limits all get in the way. Theory diverging from practice is normal. The theoretical peak is valuable for a different reason: it's computed by one formula across every card and generation, so it's a fair, comparable yardstick for a historical look. That's what spec sheets list, and what we use. Real performance is measured with benchmarks (they're a separate table in the dataset).
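As a quick illustration, the formula can be written as a small Python function. This is a sketch, not the catalog's actual code: the function name is made up, and the core counts and clocks below are public spec-sheet values for the Tesla V100 (SXM2) and GeForce 8800 GTX, not figures taken from the article.

```python
def peak_fp32_tflops(shader_alus: int, boost_clock_hz: float) -> float:
    """Theoretical peak FP32: one FMA (2 FLOPs) per ALU per clock."""
    return shader_alus * boost_clock_hz * 2 / 1e12


# Assumed inputs (public spec-sheet values, not from the article's dataset).
v100 = peak_fp32_tflops(5120, 1.53e9)  # Tesla V100 SXM2, boost clock
g80 = peak_fp32_tflops(128, 1.35e9)    # GeForce 8800 GTX, shader clock

print(f"Tesla V100:       {v100:.1f} TFLOPS")
print(f"GeForce 8800 GTX: {g80:.2f} TFLOPS")
```

Both results line up with the flagship table (15.7 and roughly 0.3 TFLOPS), which is the point of using one formula for every generation.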

  • The source is our specification database. "Flagship of the year" = the card with the maximum fp32_performance released that year, tracked separately for NVIDIA and AMD.
  • For the TDP/efficiency curves I excluded dual-GPU cards (GTX 295, HD 6990, R9 295X2, etc.) — otherwise TDP and FLOPS double up and break the trend.
  • Where the data is noisy: vendor is filled in for only ~2,360 of 13,566 cards (the rest are mostly OEM partner-board variants). Medians use the labeled subset; flagship peaks are fully labeled.
  • FP16/tensor performance is not directly comparable between vendors because of structured sparsity. Starting with Ampere (A100), NVIDIA quotes tensor FP16/BF16 in its spec sheets with sparsity already applied, which is 2× the dense value (the feature processes sparse matrices twice as fast). Our database stores exactly this "sparse" figure for such cards. AMD has no equivalent spec line, so its figures are dense. NVIDIA's raw FP16 column (A100 and later) therefore has to be halved for a fair comparison with AMD: A100 = 624 (sparse) → 312 dense, H100 = 1979 → ~990 dense. The "AI inflection" part below relies on these dense-normalized numbers.
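The dense normalization described above is a one-line rule. Here is a minimal sketch of it; the function name and the `quoted_with_sparsity` flag are hypothetical (the dataset may encode this differently), and the third entry is a placeholder added only to show the pass-through case.

```python
def dense_tensor_tflops(quoted_tflops: float, quoted_with_sparsity: bool) -> float:
    """Halve figures quoted with 2:1 structured sparsity; pass dense ones through."""
    return quoted_tflops / 2 if quoted_with_sparsity else quoted_tflops


# (quoted tensor FP16 TFLOPS, quoted with sparsity?)
quoted = {
    "A100": (624.0, True),                # from the article
    "H100": (1979.0, True),               # from the article
    "dense-quoted card": (100.0, False),  # placeholder, not a real card
}

dense = {name: dense_tensor_tflops(*spec) for name, spec in quoted.items()}
for name, value in dense.items():
    print(f"{name}: {value:.1f} dense TFLOPS")
```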

1. FLOPS: an almost perfectly straight exponential

Peak FP32 of the single flagship by year (NVIDIA):

Year  Flagship                FP32, TFLOPS
2006  GeForce 8800 GTX        0.3
2010  GeForce GTX 580         1.6
2013  GeForce GTX 780 Ti      5.3
2016  Quadro P6000            12.6
2017  Tesla V100              15.7
2020  RTX A6000               38.7
2022  L40S                    91.6
2025  RTX PRO 6000 Blackwell  126.0

≈400× in 19 years is a CAGR of about 37% per year. On a semi-log scale the line is almost straight: a classic exponential that has only recently started to bend in the desktop segment, while the growth has moved into the datacenter.
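The growth rate can be checked two ways from the table: from the endpoints alone, and from a least-squares line through the logarithm of every row. The sketch below does both in plain Python. The data points are the table's; the function names are illustrative.

```python
import math

# (year, peak FP32 TFLOPS) from the NVIDIA flagship table.
FLAGSHIPS = [(2006, 0.3), (2010, 1.6), (2013, 5.3), (2016, 12.6),
             (2017, 15.7), (2020, 38.7), (2022, 91.6), (2025, 126.0)]


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two endpoints."""
    return (end / start) ** (1 / years) - 1


def log_linear_fit(points):
    """Least-squares line through (year, log10(value)); returns (slope, r_squared)."""
    xs = [year for year, _ in points]
    ys = [math.log10(value) for _, value in points]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


endpoint_growth = cagr(0.3, 126.0, 2025 - 2006)
slope, r_squared = log_linear_fit(FLAGSHIPS)
fitted_growth = 10 ** slope - 1
doubling_years = math.log(2) / math.log(1 + endpoint_growth)

print(f"endpoint CAGR:  {endpoint_growth:.1%}")
print(f"fitted growth:  {fitted_growth:.1%} per year, R^2 = {r_squared:.3f}")
print(f"doubling time:  {doubling_years:.1f} years")
```

An R² close to 1 on the log-scale fit is what "almost straight on a semi-log scale" means numerically.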

FP32 of NVIDIA and AMD flagships by year (log scale)

2. TDP: a quiet climb, then a datacenter explosion

Year  Card              TDP, W
2006  GeForce 8800 GTX  155
2010  GTX 580           244
2017  Tesla V100        250
2020  RTX A6000         300
2022  H100 SXM          700
2024  MI325X / B200     1000
2025  MI355X            1400

For a decade and a half the flagship TDP stayed in a 150–300 W band. The break comes after 2020, and it's entirely datacenter-driven: AI accelerators (SXM/OAM modules) shot up to 700–1400 W because they're cooled by liquid in a rack, not by a fan in a case. The desktop ceiling separately hit ~450–600 W (RTX 4090/5090).

There's a curious gap if you look at NVIDIA's consumer flagships separately: the GeForce flagship sat at exactly 250 W for seven years (2013–2019: GTX 780 Ti, Titan X, 1080 Ti, 2080 Ti) and only broke that ceiling with the RTX 3090 (350 W, 2020), followed by the 4090 (450 W) and 5090 (575 W). Datacenter accelerators, by contrast, went to 700–1400 W almost immediately. It looks like what capped gaming TDP wasn't the silicon so much as the market (cases, PSUs, and buyer habits); in a rack there are no such limits, and watts grew without looking back. (This is interpretation: the spec stores watts, not intentions. But a 250 W plateau lasting seven years shows up clearly in the data.)

TDP of flagships — desktop vs datacenter modules

3. Performance per watt: this is where the progress is

If you only look at TDP, it feels like "everything's getting worse, cards guzzle power." But FP32 per watt tells the opposite story:

Year  Flagship                FP32 TFLOPS/W
2006  8800 GTX                0.002
2013  GTX 780 Ti              0.021
2016  Quadro P6000            0.051
2020  RTX A6000               0.129
2022  L40S                    0.262
2025  RTX PRO 6000 Blackwell  0.210

~100× in efficiency. Peak "classic" efficiency lands in 2022 (Ada/L40S); the 2024–25 datacenter cards sometimes lose on TFLOPS/W because they deliberately trade efficiency for absolute compute density in the rack. The main drivers of efficiency gains are the process node (90 nm → 3 nm) and architectural improvements, not clocks.
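The efficiency column is simply peak FP32 divided by TDP. A short sketch that recomputes it, with one caveat: the TDPs for the 8800 GTX, GTX 780 Ti and RTX A6000 appear in the article, while the 350 W (L40S) and 600 W (RTX PRO 6000 Blackwell) values are assumptions taken from public spec sheets. Small differences from the table (the Quadro P6000 row, for example) come from rounding the TFLOPS inputs.

```python
# (year, card, peak FP32 TFLOPS, TDP in watts)
CARDS = [
    (2006, "8800 GTX", 0.3, 155),
    (2013, "GTX 780 Ti", 5.3, 250),
    (2020, "RTX A6000", 38.7, 300),
    (2022, "L40S", 91.6, 350),                     # TDP assumed
    (2025, "RTX PRO 6000 Blackwell", 126.0, 600),  # TDP assumed
]

# TFLOPS per watt for each card.
efficiency = {card: tflops / tdp for _, card, tflops, tdp in CARDS}

for (year, card, _, _) in CARDS:
    print(f"{year}  {card:<24} {efficiency[card]:.3f} TFLOPS/W")

# Improvement from the first to the last row of the table.
gain = efficiency["RTX PRO 6000 Blackwell"] / efficiency["8800 GTX"]
print(f"2006 -> 2025 gain: about {gain:.0f}x")
```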

TFLOPS per watt by year (NVIDIA and AMD)

4. The NVIDIA vs AMD duel

If you mark, year by year, whose single flagship had the higher FP32:

Period     Leader  Context
2007–2008  AMD     FireStream 9170/9270
2010–2013  AMD     TeraScale → GCN: HD 6970, HD 7970 GHz, R9 290X
2014       NVIDIA  Titan Black (5.6) vs FirePro W9100 (5.2)
2015       AMD     Fury X (8.6)
2016–2020  NVIDIA  Pascal → Ampere, the AI pivot
2021       AMD     Instinct MI250X (47.9)
2022       NVIDIA  L40S / Hopper
2023–2024  AMD     Instinct MI300A/MI325X (81.7)
2025       NVIDIA  Blackwell (126)

The picture is wavy, and I included it mostly for the intrigue, to give AMD at least a fighting chance. On raw FP32, AMD took the lead regularly: in the GCN era and again on recent Instinct parts. But raw FP32 is exactly the deceptive metric for today's world. The AI era is won not on FP32, but on software and FP16/BF16/FP8. Here NVIDIA, with tensor cores (since V100, 2017) and the CUDA ecosystem, built a moat that the FP32 numbers alone don't reveal: V100 delivered ~125 TFLOPS of tensor FP16, A100 ~312, H100 ~990 (dense figures, vendor public data). In other words, the "FP32 duel" is about the past, when the GPU was a graphics accelerator; the real battle has moved to a plane FP32 doesn't measure.
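For readers who want to reproduce the leadership table, the rule is: take each vendor's highest-FP32 single-GPU card per year, then compare the two. A minimal sketch of that rule follows. The Titan Black, FirePro W9100 and Fury X figures are from the table; every row marked "placeholder" is invented to show the selection step and is not from the dataset.

```python
from collections import defaultdict

# (year, vendor, card, peak FP32 TFLOPS)
CARDS = [
    (2014, "NVIDIA", "Titan Black", 5.6),
    (2014, "NVIDIA", "placeholder mid-range card", 3.0),
    (2014, "AMD", "FirePro W9100", 5.2),
    (2014, "AMD", "placeholder mid-range card", 2.5),
    (2015, "AMD", "Fury X", 8.6),
    (2015, "NVIDIA", "placeholder flagship", 6.0),
]


def leaders_by_year(cards):
    """Return {year: (vendor, card, tflops)} for the top flagship each year."""
    flagships = defaultdict(dict)  # year -> vendor -> best (card, tflops)
    for year, vendor, card, tflops in cards:
        best = flagships[year].get(vendor)
        if best is None or tflops > best[1]:
            flagships[year][vendor] = (card, tflops)
    leaders = {}
    for year, per_vendor in flagships.items():
        # Compare the vendors' flagships by their TFLOPS value.
        vendor, (card, tflops) = max(per_vendor.items(), key=lambda kv: kv[1][1])
        leaders[year] = (vendor, card, tflops)
    return leaders


leaders = leaders_by_year(CARDS)
for year in sorted(leaders):
    vendor, card, tflops = leaders[year]
    print(f"{year}: {vendor} ({card}, {tflops} TFLOPS)")
```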

NVIDIA vs AMD FP32 leadership timeline

So, here's one more chart — the FP16 duel, where NVIDIA is consistently ahead. And once you layer the AI software stack on top of that…

AI inflection — peak tensor FP16 (dense) vs FP32 by year (log scale)

5. What else the data shows

  • Process node: 90 nm (2006) → 28 nm (a 2012–2015 plateau, the "stuck node") → 16/12/7 → 3 nm (MI355X, 2025).
  • Flagship VRAM: 0.77 GB (8800 GTX) → 12–24 GB (mid-2010s) → 48 GB (A6000) → 192–288 GB (MI300/MI355X). That is roughly 375×, so memory has grown almost as fast as compute (~400×), because AI models are bottlenecked on it.
  • The "stuck" 28 nm: for four years (2012–2015) the industry sat on one node — and that's exactly when AMD held parity/leadership on FP32. As soon as the process-node sprint resumed and tensor cores appeared, the advantage swung to NVIDIA.

Open dataset — take it

We've published a cleaned dump of our GPU spec database for anyone who wants to dig in themselves:

📦 Download: gpuark.com/datasets — the files gpuark-gpu-specs.csv, gpuark-benchmarks.csv, gpuark-gpu-dataset.sqlite, or everything in a single gpuark-gpu-dataset.tar.gz archive.

  • 13,566 GPUs (fields: vendor, manufacturer, release date, architecture, process node, transistors, clocks, memory size and type, bus, FP16/FP32/FP64/BF16/TF32/INT8, TDP, NVLink, CUDA SM, and more) + 993 third-party benchmark results (join on gpu_id).
  • Formats: CSV (Excel/pandas) and SQLite (ready-made SQL) — two tables, gpu_specs and benchmarks.
  • License: CC BY 4.0 (attribution to gpuark.com).
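To give a feel for working with the SQLite file, here is a sketch that runs against a tiny in-memory stand-in rather than the real download. The table names, `gpu_id` and `fp32_performance` come from the article; every other column name (`name`, `vendor`, `release_year`, `benchmark`, `score`) and the TFLOPS unit are assumptions, so check the real schema first (for example with `PRAGMA table_info(gpu_specs)`).

```python
import sqlite3

# In-memory stand-in for gpuark-gpu-dataset.sqlite with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gpu_specs (
        gpu_id INTEGER PRIMARY KEY,
        name TEXT,
        vendor TEXT,              -- NULL for most rows in the real data
        release_year INTEGER,     -- assumed; the real field is a release date
        fp32_performance REAL     -- assumed to be in TFLOPS
    );
    CREATE TABLE benchmarks (gpu_id INTEGER, benchmark TEXT, score REAL);
""")
conn.executemany("INSERT INTO gpu_specs VALUES (?, ?, ?, ?, ?)", [
    (1, "GeForce GTX Titan Black", "NVIDIA", 2014, 5.6),
    (2, "FirePro W9100", "AMD", 2014, 5.2),
    (3, "unlabeled partner board", None, 2014, 5.6),
    (4, "Radeon R9 Fury X", "AMD", 2015, 8.6),
])
# Placeholder benchmark row, not a real result.
conn.execute("INSERT INTO benchmarks VALUES (1, 'placeholder-benchmark', 100.0)")

# Flagship FP32 per vendor per year, skipping rows without a vendor label.
flagships = conn.execute("""
    SELECT release_year, vendor, MAX(fp32_performance)
    FROM gpu_specs
    WHERE vendor IS NOT NULL
    GROUP BY release_year, vendor
    ORDER BY release_year, vendor
""").fetchall()

# Specs joined to benchmark results on gpu_id.
with_scores = conn.execute("""
    SELECT s.name, b.benchmark, b.score
    FROM gpu_specs AS s
    JOIN benchmarks AS b ON b.gpu_id = s.gpu_id
""").fetchall()

print(flagships)
print(with_scores)
```

Against the real file, replacing `":memory:"` with the path to gpuark-gpu-dataset.sqlite and dropping the `CREATE`/`INSERT` statements should be enough once the column names are adjusted.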

If you'd rather explore interactively before downloading, the same data powers the GPU comparison tool on the site.

Takeaways

  1. FLOPS grew as an almost perfect exponential (~37%/yr) — but the "free" growth is over; from here we pay with TDP and a move into the rack.
  2. Real progress is measured not in watts and not in raw FP32, but in performance per watt (×100) — and that rides on the process node.
  3. AMD fought and led on the "raw" numbers more often than people think; but the AI era was defined by tensor + software, not FP32.

The data is open — if you find something in it we missed, let me know.
