<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Max Vyaznikov</title>
    <description>The latest articles on DEV Community by Max Vyaznikov (@maxvyaznikov).</description>
    <link>https://dev.to/maxvyaznikov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3819321%2F53229506-3d43-4511-a7f6-bb2f58e84931.png</url>
      <title>DEV Community: Max Vyaznikov</title>
      <link>https://dev.to/maxvyaznikov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maxvyaznikov"/>
    <language>en</language>
    <item>
      <title>20 Years of GPUs in Numbers: How FLOPS and TDP Grew, and Who Led the NVIDIA vs AMD Duel (+ open dataset of 13,500 GPUs)</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Tue, 26 May 2026 01:11:13 +0000</pubDate>
      <link>https://dev.to/maxvyaznikov/20-years-of-gpus-in-numbers-how-flops-tdp-grew-and-who-led-the-nvidia-vs-amd-race-open-370n</link>
      <guid>https://dev.to/maxvyaznikov/20-years-of-gpus-in-numbers-how-flops-tdp-grew-and-who-led-the-nvidia-vs-amd-race-open-370n</guid>
      <description>&lt;p&gt;We run a GPU catalog and have built up a database of &lt;strong&gt;13,566 GPUs&lt;/strong&gt; — from the GeForce 256 (1999) to Blackwell and the MI355X (2025). At some point it got interesting to look not at "which card is faster," but at how the whole industry shifted: how much FLOPS grew, where TDP hit a wall, and who led the NVIDIA-vs-AMD race in different years.&lt;/p&gt;

&lt;p&gt;Below is a breakdown from our own data. Two things I'll put on the table right away: the &lt;strong&gt;methodology&lt;/strong&gt; (what I measured and how, where the data is noisy) and an &lt;strong&gt;open dataset&lt;/strong&gt; at the end of the article — grab it and dig in with us 😊&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Peak &lt;strong&gt;FP32 of the flagship grew ~400×&lt;/strong&gt; in 19 years: 0.3 TFLOPS (GeForce 8800 GTX, 2006) → 126 TFLOPS (Blackwell, 2025). It's an almost perfectly straight line on a semi-log scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TDP&lt;/strong&gt; crept up slowly (155 → 300 W over 2006–2020), then &lt;strong&gt;exploded in the datacenter&lt;/strong&gt;: 700 W (H100), 1000 W (MI325X / B200), &lt;strong&gt;1400 W&lt;/strong&gt; (MI355X, 2025).&lt;/li&gt;
&lt;li&gt;Yet &lt;strong&gt;performance per watt grew ~100×&lt;/strong&gt; — they "draw more," but "do far more per watt." The main driver is the process node (90 nm → 3 nm) plus architecture.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;NVIDIA/AMD duel by peak FP32&lt;/strong&gt; moved in waves: AMD led in the early 2010s (GCN era) and again in 2023–24 (Instinct MI300/MI325), NVIDIA in 2016–2020 (the AI pivot) and in 2025 (Blackwell). But "raw FP32" is a misleading metric — more on that below.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What these TFLOPS are and why they're "theoretical."&lt;/strong&gt; Every FP32 number in this article is the &lt;em&gt;theoretical peak&lt;/em&gt; that vendors compute with the formula:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FP32 TFLOPS = shader ALUs (CUDA cores or stream processors) × boost clock (Hz) × 2 / 10^12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ×2 is there because an FMA (fused multiply-add) counts as two operations, a multiply and an add, issued as a single instruction. This is a &lt;strong&gt;ceiling, not real-world throughput&lt;/strong&gt;. In practice you reach noticeably less: typically 60–90% on well-optimized compute-bound kernels and a fraction of that on memory-bound ones. Memory bandwidth, SM occupancy, instruction mix, and boost clocks that don't hold under sustained load and thermal limits all get in the way. &lt;strong&gt;Theory diverging from practice is normal.&lt;/strong&gt; The theoretical peak is valuable for a different reason: it's computed by one formula across every card and generation, so it's a fair &lt;em&gt;comparable&lt;/em&gt; yardstick for a historical look. It's what spec sheets list, and what we use. Real performance is measured with benchmarks (they're a separate table in the dataset).&lt;/p&gt;
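&lt;p&gt;As a quick sanity check, the formula can be sketched in a few lines of Python. The GTX 580 inputs below (512 CUDA cores, 1544 MHz shader clock) are commonly published spec-sheet values rather than numbers taken from this article, so treat them as assumptions; the function name is made up for illustration.&lt;/p&gt;

```python
def peak_fp32_tflops(shader_alus, clock_hz):
    """Theoretical peak FP32: each ALU retires one FMA (2 FLOPs) per clock."""
    return shader_alus * clock_hz * 2 / 1e12


# Assumed public spec-sheet inputs for the GeForce GTX 580.
gtx_580 = peak_fp32_tflops(shader_alus=512, clock_hz=1544e6)
print(f"GTX 580 peak FP32: {gtx_580:.2f} TFLOPS")
```

&lt;p&gt;Under those assumed inputs the result rounds to the 1.6 TFLOPS listed for the GTX 580 in the flagship table.&lt;/p&gt;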

&lt;ul&gt;
&lt;li&gt;The source is our specification database. &lt;strong&gt;"Flagship of the year"&lt;/strong&gt; = the card with the maximum &lt;code&gt;fp32_performance&lt;/code&gt; released that year, tracked separately for NVIDIA and AMD.&lt;/li&gt;
&lt;li&gt;For the TDP/efficiency curves I &lt;strong&gt;excluded dual-GPU cards&lt;/strong&gt; (GTX 295, HD 6990, R9 295X2, etc.) — otherwise TDP and FLOPS double up and break the trend.&lt;/li&gt;
&lt;li&gt;Where the data is noisy: &lt;code&gt;vendor&lt;/code&gt; is filled in for ~2,360 of 13,566 cards (the rest are mostly OEM partner-board variants). Medians use the labeled subset; flagship peaks are fully labeled. And &lt;strong&gt;FP16/tensor performance is not directly comparable between vendors — because of structured sparsity.&lt;/strong&gt; Starting with Ampere (A100), NVIDIA quotes tensor FP16/BF16 in its spec sheets &lt;strong&gt;with sparsity already applied — that's 2× the dense value&lt;/strong&gt; (the feature processes sparse matrices twice as fast). Our database stores exactly this "sparse" figure for such cards. AMD has no equivalent spec line — those are dense. So NVIDIA's raw FP16 column (A100+) has to be halved to compare fairly with AMD: A100 = 624 (sparse) → &lt;strong&gt;312 dense&lt;/strong&gt;, H100 = 1979 → &lt;strong&gt;~990 dense&lt;/strong&gt;. The "AI inflection" part below relies on these dense-normalized numbers.&lt;/li&gt;
&lt;/ul&gt;
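&lt;p&gt;The dense normalization described in the last bullet amounts to halving the sparse figure for A100-and-later NVIDIA parts. A minimal sketch, using only the two values quoted above; the dictionary and function names are hypothetical and not part of the dataset:&lt;/p&gt;

```python
# Sparse tensor FP16 figures (TFLOPS) as quoted in the text above.
SPARSE_TENSOR_FP16 = {"A100": 624.0, "H100": 1979.0}


def dense_fp16(sparse_tflops):
    """Undo the 2x structured-sparsity factor to get the dense figure."""
    return sparse_tflops / 2


dense = {name: dense_fp16(value) for name, value in SPARSE_TENSOR_FP16.items()}
for name, value in dense.items():
    print(f"{name}: {value:.1f} TFLOPS dense")
```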

&lt;h2&gt;
  
  
  1. FLOPS: an almost perfectly straight exponential
&lt;/h2&gt;

&lt;p&gt;Peak FP32 of the single flagship by year (NVIDIA):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Flagship&lt;/th&gt;
&lt;th&gt;FP32, TFLOPS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2006&lt;/td&gt;
&lt;td&gt;GeForce 8800 GTX&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2010&lt;/td&gt;
&lt;td&gt;GeForce GTX 580&lt;/td&gt;
&lt;td&gt;1.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2013&lt;/td&gt;
&lt;td&gt;GeForce GTX 780 Ti&lt;/td&gt;
&lt;td&gt;5.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016&lt;/td&gt;
&lt;td&gt;Quadro P6000&lt;/td&gt;
&lt;td&gt;12.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;td&gt;Tesla V100&lt;/td&gt;
&lt;td&gt;15.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;RTX A6000&lt;/td&gt;
&lt;td&gt;38.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;td&gt;L40S&lt;/td&gt;
&lt;td&gt;91.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025&lt;/td&gt;
&lt;td&gt;RTX PRO 6000 Blackwell&lt;/td&gt;
&lt;td&gt;126.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;≈400× in 19 years is a CAGR of about &lt;strong&gt;37% per year&lt;/strong&gt;. On a semi-log scale the line is almost straight: a classic exponential. Only recently has it started to bend in the desktop segment, while the growth has moved into the datacenter.&lt;/p&gt;
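&lt;p&gt;The growth rate can be re-derived from the first and last rows of the table. A minimal sketch, assuming the standard compound annual growth rate formula; the variable names are illustrative:&lt;/p&gt;

```python
# Endpoints taken from the flagship table above.
start_tflops = 0.3    # GeForce 8800 GTX, 2006
end_tflops = 126.0    # RTX PRO 6000 Blackwell, 2025
years = 2025 - 2006

# CAGR = (end / start) ** (1 / years) - 1
growth_factor = end_tflops / start_tflops
cagr = growth_factor ** (1 / years) - 1
print(f"growth: {growth_factor:.0f}x, CAGR: {cagr:.1%}")
```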

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuvh1uco7gr9dsvkdnso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuvh1uco7gr9dsvkdnso.png" alt="FP32 of NVIDIA and AMD flagships by year (log scale)" width="799" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. TDP: a quiet climb, then a datacenter explosion
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Card&lt;/th&gt;
&lt;th&gt;TDP, W&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2006&lt;/td&gt;
&lt;td&gt;GeForce 8800 GTX&lt;/td&gt;
&lt;td&gt;155&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2010&lt;/td&gt;
&lt;td&gt;GTX 580&lt;/td&gt;
&lt;td&gt;244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;td&gt;Tesla V100&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;RTX A6000&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;td&gt;H100 SXM&lt;/td&gt;
&lt;td&gt;700&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;MI325X / B200&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025&lt;/td&gt;
&lt;td&gt;MI355X&lt;/td&gt;
&lt;td&gt;1400&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a decade and a half the flagship TDP stayed in a 150–300 W band. The break comes after 2020, and it's entirely &lt;strong&gt;datacenter-driven&lt;/strong&gt;: AI accelerators (SXM/OAM modules) shot up to 700–1400 W because they're cooled by liquid in a rack, not by a fan in a case. The desktop ceiling separately hit ~450–600 W (RTX 4090/5090).&lt;/p&gt;

&lt;p&gt;There's a curious gap if you look at NVIDIA's &lt;strong&gt;consumer&lt;/strong&gt; flagships separately: the GeForce flagship &lt;strong&gt;sat at exactly 250 W for seven years (2013–2019)&lt;/strong&gt; across the GTX 780 Ti, Titan X, 1080 Ti, and 2080 Ti, and only broke that ceiling with the RTX 3090 (350 W, 2020), then the 4090 (450 W) and 5090 (575 W). Datacenter accelerators, by contrast, went to 700–1400 W almost immediately. It looks like what capped gaming TDP wasn't the silicon so much as the market: cases, PSUs, and buyer habits. In a rack there are no such limits, and watts grew without looking back. (This is interpretation: the spec stores watts, not intentions. But a seven-year plateau at 250 W shows up clearly in the data.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbww32v790hfdkxubxud5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbww32v790hfdkxubxud5.png" alt="TDP of flagships — desktop vs datacenter modules" width="799" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Performance per watt: this is where the progress is
&lt;/h2&gt;

&lt;p&gt;If you only look at TDP, it feels like "everything's getting worse, cards guzzle power." But FP32 per watt tells the opposite story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Flagship&lt;/th&gt;
&lt;th&gt;TFLOPS/W&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2006&lt;/td&gt;
&lt;td&gt;8800 GTX&lt;/td&gt;
&lt;td&gt;0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2013&lt;/td&gt;
&lt;td&gt;GTX 780 Ti&lt;/td&gt;
&lt;td&gt;0.021&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016&lt;/td&gt;
&lt;td&gt;Quadro P6000&lt;/td&gt;
&lt;td&gt;0.051&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;RTX A6000&lt;/td&gt;
&lt;td&gt;0.129&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;td&gt;L40S&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.262&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025&lt;/td&gt;
&lt;td&gt;RTX PRO 6000 Blackwell&lt;/td&gt;
&lt;td&gt;0.21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;~100× in efficiency.&lt;/strong&gt; Peak "classic" efficiency lands in 2022 (Ada/L40S); the 2024–25 datacenter cards sometimes lose on TFLOPS/W because they deliberately trade efficiency for absolute compute density in the rack. The main drivers of efficiency gains are the &lt;strong&gt;process node (90 nm → 3 nm)&lt;/strong&gt; and architectural improvements, not clocks.&lt;/p&gt;
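&lt;p&gt;Each TFLOPS/W figure is simply peak FP32 divided by TDP. A small sketch using two cards whose FP32 and TDP values both appear in the tables above; the helper name is hypothetical:&lt;/p&gt;

```python
def tflops_per_watt(fp32_tflops, tdp_watts):
    """Efficiency as theoretical peak FP32 per watt of TDP."""
    return fp32_tflops / tdp_watts


# FP32 and TDP values from the tables in sections 1 and 2.
gtx_8800 = tflops_per_watt(0.3, 155)     # GeForce 8800 GTX, 2006
rtx_a6000 = tflops_per_watt(38.7, 300)   # RTX A6000, 2020
gain = rtx_a6000 / gtx_8800

print(f"8800 GTX:  {gtx_8800:.3f} TFLOPS/W")
print(f"RTX A6000: {rtx_a6000:.3f} TFLOPS/W")
print(f"gain 2006 to 2020: {gain:.0f}x")
```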

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh0glfht2rbkrbzl6ja3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh0glfht2rbkrbzl6ja3.png" alt="TFLOPS per watt by year (NVIDIA and AMD)" width="799" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The NVIDIA vs AMD duel
&lt;/h2&gt;

&lt;p&gt;If you mark, year by year, whose single flagship had the higher FP32:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period&lt;/th&gt;
&lt;th&gt;Leader&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2007–2008&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AMD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FireStream 9170/9270&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2010–2013&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AMD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TeraScale → GCN: HD 6970, HD 7970 GHz, R9 290X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2014&lt;/td&gt;
&lt;td&gt;NVIDIA&lt;/td&gt;
&lt;td&gt;Titan Black (5.6) vs FirePro W9100 (5.2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2015&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AMD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fury X (8.6)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016–2020&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;NVIDIA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pascal → Ampere, the AI pivot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AMD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instinct MI250X (47.9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;td&gt;NVIDIA&lt;/td&gt;
&lt;td&gt;L40S / Hopper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2023–2024&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AMD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instinct MI300A/MI325X (81.7)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;NVIDIA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blackwell (126)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The picture is wavy, and I included it mostly for the intrigue, to give AMD at least a fighting chance: on &lt;strong&gt;raw FP32&lt;/strong&gt;, AMD took the lead regularly, in the GCN era and again on recent Instinct parts. But raw FP32 is exactly the deceptive metric for today's world. The AI era is won not on FP32, but on software and FP16/BF16/FP8. Here NVIDIA, with tensor cores (since V100, 2017) and the CUDA ecosystem, built a moat that the FP32 numbers alone don't reveal: V100 delivered ~125 TFLOPS of dense tensor FP16, A100 ~312, H100 ~990 (vendor public data). In other words, the "FP32 duel" is about the past, when the GPU was a graphics accelerator; the real battle has moved to a plane FP32 doesn't measure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fis4h6w6aejv38jl5e50s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fis4h6w6aejv38jl5e50s.png" alt="NVIDIA vs AMD FP32 leadership timeline" width="800" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, here's one more chart — the FP16 duel, where NVIDIA is consistently ahead. And once you layer the AI software stack on top of that…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkmvaregx9e72u3knxgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkmvaregx9e72u3knxgf.png" alt="AI inflection — peak tensor FP16 (dense) vs FP32 by year (log scale)" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. What else the data shows
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process node:&lt;/strong&gt; 90 nm (2006) → 28 nm (a 2012–2015 plateau, the "stuck node") → 16/12/7 → &lt;strong&gt;3 nm&lt;/strong&gt; (MI355X, 2025).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flagship VRAM:&lt;/strong&gt; 0.77 GB (8800 GTX) → 12–24 GB (mid-2010s) → 48 GB (A6000) → &lt;strong&gt;192–288 GB&lt;/strong&gt; (MI300/MI355X). That is roughly 370×, close to the ~400× growth in peak FP32: memory has had to keep pace with compute because AI models are bottlenecked on it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "stuck" 28 nm:&lt;/strong&gt; for four years (2012–2015) the industry sat on one node — and that's exactly when AMD held parity/leadership on FP32. As soon as the process-node sprint resumed and tensor cores appeared, the advantage swung to NVIDIA.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Open dataset — take it
&lt;/h2&gt;

&lt;p&gt;We've published a &lt;strong&gt;cleaned dump of our GPU spec database&lt;/strong&gt; for anyone who wants to dig in themselves:&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;Download:&lt;/strong&gt; &lt;a href="https://gpuark.com/datasets/" rel="noopener noreferrer"&gt;&lt;strong&gt;gpuark.com/datasets&lt;/strong&gt;&lt;/a&gt; — the files &lt;code&gt;gpuark-gpu-specs.csv&lt;/code&gt;, &lt;code&gt;gpuark-benchmarks.csv&lt;/code&gt;, &lt;code&gt;gpuark-gpu-dataset.sqlite&lt;/code&gt;, or everything in a single &lt;code&gt;gpuark-gpu-dataset.tar.gz&lt;/code&gt; archive.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;13,566 GPUs&lt;/strong&gt; (fields: vendor, manufacturer, release date, architecture, process node, transistors, clocks, memory size and type, bus, FP16/FP32/FP64/BF16/TF32/INT8, TDP, NVLink, CUDA SM, and more) + &lt;strong&gt;993 third-party benchmark results&lt;/strong&gt; (join on &lt;code&gt;gpu_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Formats: &lt;strong&gt;CSV&lt;/strong&gt; (Excel/pandas) and &lt;strong&gt;SQLite&lt;/strong&gt; (ready-made SQL) — two tables, &lt;code&gt;gpu_specs&lt;/code&gt; and &lt;code&gt;benchmarks&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;License: CC BY 4.0 (attribution to gpuark.com).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd rather explore interactively before downloading, the same data powers the &lt;a href="https://gpuark.com/en/compare/" rel="noopener noreferrer"&gt;GPU comparison tool&lt;/a&gt; on the site.&lt;/p&gt;
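&lt;p&gt;For the SQLite file, a "flagship of the year" query could look roughly like the sketch below. It runs against an in-memory stand-in table so it works without the download. Only &lt;code&gt;gpu_specs&lt;/code&gt;, &lt;code&gt;vendor&lt;/code&gt;, and &lt;code&gt;fp32_performance&lt;/code&gt; are names mentioned in this article; the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;release_year&lt;/code&gt; columns and the "lower-tier" row are assumptions for illustration, so check the real schema first.&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for the published database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gpu_specs ("
    "name TEXT, vendor TEXT, release_year INTEGER, fp32_performance REAL)"
)
rows = [
    ("Titan Black", "NVIDIA", 2014, 5.6),
    ("Hypothetical lower-tier card", "NVIDIA", 2014, 1.0),
    ("FirePro W9100", "AMD", 2014, 5.2),
    ("Fury X", "AMD", 2015, 8.6),
    ("Unlabeled partner board", None, 2015, 9.9),
]
conn.executemany("INSERT INTO gpu_specs VALUES (?, ?, ?, ?)", rows)

# With a single MAX() aggregate, SQLite fills the bare `name` column
# from the row that holds the maximum value in each group.
query = """
SELECT release_year, vendor, name, MAX(fp32_performance)
FROM gpu_specs
WHERE vendor IS NOT NULL
GROUP BY release_year, vendor
ORDER BY release_year, vendor
"""
flagships = conn.execute(query).fetchall()
for row in flagships:
    print(row)
```

&lt;p&gt;Filtering on &lt;code&gt;vendor IS NOT NULL&lt;/code&gt; mirrors the methodology note that only part of the catalog carries a vendor label.&lt;/p&gt;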

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;FLOPS grew as an almost perfect exponential (~37%/yr) — but the "free" growth is over; from here we pay with TDP and a move into the rack.&lt;/li&gt;
&lt;li&gt;Real progress is measured not in watts and not in raw FP32, but in &lt;strong&gt;performance per watt&lt;/strong&gt; (×100) — and that rides on the process node.&lt;/li&gt;
&lt;li&gt;AMD fought and led on the "raw" numbers more often than people think; but the AI era was defined by tensor + software, not FP32.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The data is open — if you find something in it we missed, let me know.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>machinelearning</category>
      <category>hardware</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Running DeepSeek, Llama 3, and Qwen Locally: Complete GPU Requirements Guide</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:09:12 +0000</pubDate>
      <link>https://dev.to/maxvyaznikov/running-deepseek-llama-3-and-qwen-locally-complete-gpu-requirements-guide-6fd</link>
      <guid>https://dev.to/maxvyaznikov/running-deepseek-llama-3-and-qwen-locally-complete-gpu-requirements-guide-6fd</guid>
      <description>&lt;p&gt;Want to run the latest open-source LLMs on your own hardware? Here's exactly what you need for each popular model family.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference: VRAM Requirements
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;FP16&lt;/th&gt;
&lt;th&gt;Q8&lt;/th&gt;
&lt;th&gt;Q4_K_M&lt;/th&gt;
&lt;th&gt;Min GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 8B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;8.5 GB&lt;/td&gt;
&lt;td&gt;5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 70B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;140 GB&lt;/td&gt;
&lt;td&gt;70 GB&lt;/td&gt;
&lt;td&gt;40 GB&lt;/td&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 405B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;810 GB&lt;/td&gt;
&lt;td&gt;405 GB&lt;/td&gt;
&lt;td&gt;228 GB&lt;/td&gt;
&lt;td&gt;3× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 7B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;7.5 GB&lt;/td&gt;
&lt;td&gt;4.5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 14B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28 GB&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;8.5 GB&lt;/td&gt;
&lt;td&gt;RTX 4060 Ti 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 32B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 GB&lt;/td&gt;
&lt;td&gt;32 GB&lt;/td&gt;
&lt;td&gt;18 GB&lt;/td&gt;
&lt;td&gt;RTX 3090 24GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 72B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;144 GB&lt;/td&gt;
&lt;td&gt;72 GB&lt;/td&gt;
&lt;td&gt;41 GB&lt;/td&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mistral Small 24B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;48 GB&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;RTX 4080 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mistral Large 123B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;246 GB&lt;/td&gt;
&lt;td&gt;123 GB&lt;/td&gt;
&lt;td&gt;69 GB&lt;/td&gt;
&lt;td&gt;4× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V3 671B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,340 GB&lt;/td&gt;
&lt;td&gt;670 GB&lt;/td&gt;
&lt;td&gt;376 GB&lt;/td&gt;
&lt;td&gt;5× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek R1 671B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,340 GB&lt;/td&gt;
&lt;td&gt;670 GB&lt;/td&gt;
&lt;td&gt;376 GB&lt;/td&gt;
&lt;td&gt;5× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phi-3.5 Mini 3.8B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.6 GB&lt;/td&gt;
&lt;td&gt;4 GB&lt;/td&gt;
&lt;td&gt;2.5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 2 27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;54 GB&lt;/td&gt;
&lt;td&gt;27 GB&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;RTX 4080 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For any model, you can calculate exact VRAM needs at the &lt;a href="https://gpuark.com/en/vram-calculator/" rel="noopener noreferrer"&gt;VRAM calculator on gpuark.com&lt;/a&gt;.&lt;/p&gt;
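&lt;p&gt;The table values follow a simple rule of thumb: weight memory is the parameter count times bits per weight, divided by eight. A rough sketch follows. The ~4.5 bits per weight for Q4_K_M is inferred from the table's own numbers and is an assumption, not a format specification, and the estimate ignores KV cache and runtime overhead. The function names are illustrative.&lt;/p&gt;

```python
import math


def weights_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def gpus_needed(total_gb, per_gpu_gb):
    """Smallest number of identical GPUs whose combined VRAM holds the weights."""
    return math.ceil(total_gb / per_gpu_gb)


Q4_BITS = 4.5  # assumed effective bits per weight, inferred from the table

llama_70b_fp16 = weights_gb(70, 16)
llama_70b_q4 = weights_gb(70, Q4_BITS)
deepseek_q4 = weights_gb(671, Q4_BITS)

print(f"Llama 70B FP16: {llama_70b_fp16:.0f} GB")
print(f"Llama 70B Q4:   {llama_70b_q4:.1f} GB, "
      f"{gpus_needed(llama_70b_q4, 24)} x 24 GB cards")
print(f"DeepSeek 671B Q4: {deepseek_q4:.0f} GB, "
      f"{gpus_needed(deepseek_q4, 80)} x 80 GB cards")
```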

&lt;h2&gt;
  
  
  Model-by-Model Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Llama 3.1 — The All-Rounder
&lt;/h3&gt;

&lt;p&gt;Meta's Llama 3.1 comes in 8B, 70B, and 405B sizes. The 8B is perfect for getting started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Run Llama 3.1 8B (auto-downloads ~4.7GB)&lt;/span&gt;
ollama run llama3.1

&lt;span class="c"&gt;# Or the 70B if you have the VRAM&lt;/span&gt;
ollama run llama3.1:70b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;8B at Q4_K_M&lt;/strong&gt;: Fits on any 8GB+ GPU. Great for coding, summarization, general chat. Not competitive with GPT-4 on complex reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;70B at Q4_K_M&lt;/strong&gt;: This is where Llama 3.1 really shines — competitive with GPT-4 on many benchmarks. Needs ~40GB VRAM, so two 3090s or a single A100 80GB.&lt;/p&gt;

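&lt;p&gt;&lt;strong&gt;405B&lt;/strong&gt;: Research-grade. The Q4 weights alone are ~228 GB, so you need at least 3× A100 80GB (240 GB total), and more once long-context KV cache is added. Not practical for most individuals.&lt;/p&gt;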

&lt;h3&gt;
  
  
  DeepSeek V3 / R1 — The MoE Giants
&lt;/h3&gt;

&lt;p&gt;DeepSeek V3 (671B) uses &lt;strong&gt;Mixture of Experts&lt;/strong&gt; — only ~37B parameters active per token, but all 671B must fit in memory. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At Q4_K_M: ~376 GB VRAM minimum&lt;/li&gt;
&lt;li&gt;Realistic minimum: &lt;strong&gt;5× A100 80GB&lt;/strong&gt; (400 GB total)&lt;/li&gt;
&lt;li&gt;On consumer hardware: not feasible for the full model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt;: DeepSeek R1 distilled versions exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-7B&lt;/strong&gt;: 4.5 GB at Q4 — runs on any modern GPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-14B&lt;/strong&gt;: 8.5 GB at Q4 — RTX 4060 Ti&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-32B&lt;/strong&gt;: 18 GB at Q4 — RTX 3090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-70B&lt;/strong&gt;: 40 GB at Q4 — 2× RTX 3090&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The distilled 32B is arguably the best reasoning model you can run on a single consumer GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen2.5 — Best for Coding
&lt;/h3&gt;

&lt;p&gt;Alibaba's Qwen2.5 series excels at code generation. The -Coder variants are particularly strong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Qwen2.5-Coder-14B — best coding model for 16GB GPUs&lt;/span&gt;
ollama run qwen2.5-coder:14b

&lt;span class="c"&gt;# Qwen2.5-32B — strong general model for 24GB GPUs&lt;/span&gt;
ollama run qwen2.5:32b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen2.5-Coder-14B&lt;/strong&gt; at Q4_K_M (~8.5 GB) is the sweet spot for developer use. It handles Python, JavaScript, Rust, Go with impressive accuracy and fits on a 12GB card.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistral — Efficient and Fast
&lt;/h3&gt;

&lt;p&gt;Mistral models are known for good quality-to-size ratio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Mistral Small 24B — best quality under 16GB&lt;/span&gt;
ollama run mistral-small

&lt;span class="c"&gt;# Mistral Large 123B — needs serious hardware&lt;/span&gt;
ollama run mistral-large
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mistral Small 24B&lt;/strong&gt; at Q4_K_M (~14 GB) is the best general-purpose model for 16GB GPUs. Solid reasoning, good instruction following, fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Setup Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Beginner Setup (~$400)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 4060 Ti 16GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Qwen2.5-14B, Mistral-Small-24B (Q4), Llama 3.1 8B (Q8)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: Ollama + Open WebUI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enthusiast Setup (~$700)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: Used RTX 3090 24GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Qwen2.5-32B, DeepSeek-R1-32B, any 34B model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: Ollama or ExLlamaV2 + TabbyAPI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Power User Setup (~$1,400)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPUs&lt;/strong&gt;: 2× Used RTX 3090 (48GB total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Llama 3.1 70B, Qwen2.5-72B; Mixtral 8x22B (~141B total parameters) only with partial CPU offload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: llama.cpp with &lt;code&gt;--tensor-split 24,24&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prosumer Setup (~$2,000)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPUs&lt;/strong&gt;: RTX 4090 + used RTX 3090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Same as above, faster inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: ExLlamaV2 with tensor parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance Tips
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use the right quantization
&lt;/h3&gt;

&lt;p&gt;Q4_K_M for most models. Go Q5 or Q6 only if VRAM allows — the quality gain is marginal but measurable on reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Optimize KV cache
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# llama.cpp: limit context to what you need&lt;/span&gt;
llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-c&lt;/span&gt; 4096  &lt;span class="c"&gt;# instead of default 8192+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



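&lt;p&gt;The KV cache grows linearly with context length, so halving the context halves the cache. On large models or long contexts that saving can be several gigabytes of VRAM.&lt;/p&gt;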
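&lt;p&gt;To put a rough number on it, the KV cache size can be estimated from the model shape. The sketch below assumes the commonly published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) and an FP16 cache; those figures are not from this article, and real runtimes add their own overhead.&lt;/p&gt;

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys plus values for every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
cache_8k = kv_cache_bytes(32, 8, 128, context_len=8192)
cache_4k = kv_cache_bytes(32, 8, 128, context_len=4096)

GIB = 1024 ** 3
print(f"8192 tokens: {cache_8k / GIB:.2f} GiB")
print(f"4096 tokens: {cache_4k / GIB:.2f} GiB")
```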

&lt;h3&gt;
  
  
  3. Flash Attention
&lt;/h3&gt;

&lt;p&gt;Requires compute capability 8.0+ (RTX 3000 series and newer). Enabled by default in most frameworks. Reduces attention memory for long contexts from O(n²) to O(n), where n is the context length.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. CPU offloading for oversized models
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# llama.cpp: offload only some layers to GPU&lt;/span&gt;
llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 20  &lt;span class="c"&gt;# 20 layers on GPU, rest on CPU&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slower, but it lets you run models that don't fully fit. Expect ~2–5 tok/s when most layers run on the CPU vs ~30+ tok/s with everything on the GPU.&lt;/p&gt;
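&lt;p&gt;A starting value for &lt;code&gt;-ngl&lt;/code&gt; can be guessed by dividing usable VRAM by the average size of one layer. This is only a back-of-the-envelope sketch: it assumes all layers are the same size, reserves a hypothetical 2 GB for KV cache and buffers, and uses the ~40 GB Q4 figure for a 70B model with an assumed 80 layers. The helper is not part of llama.cpp.&lt;/p&gt;

```python
import math


def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# 70B at Q4 (~40 GB, assumed 80 layers) on a single 24 GB card.
ngl = layers_that_fit(model_gb=40, n_layers=80, vram_gb=24)
print(f"try: llama-server -m model.gguf -ngl {ngl}")
```

&lt;p&gt;Treat the result as a first guess and lower it if the server runs out of memory at your chosen context length.&lt;/p&gt;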

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The local LLM ecosystem has matured enormously. For most developers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with Ollama&lt;/strong&gt; — zero-friction setup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get at least 16GB VRAM&lt;/strong&gt; — opens up 24B models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24GB (RTX 3090) is the sweet spot&lt;/strong&gt; — runs everything up to 34B comfortably&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two GPUs if you need 70B+&lt;/strong&gt; — pipeline parallelism just works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The quality gap between local 32B models and cloud GPT-4 has narrowed significantly, especially for coding and domain-specific tasks. For many workflows, local is now good enough.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your local LLM setup? Drop your GPU + favorite model in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Developer's Guide to Choosing a GPU for Machine Learning in 2025-2026</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:04:11 +0000</pubDate>
      <link>https://dev.to/maxvyaznikov/a-developers-guide-to-choosing-a-gpu-for-machine-learning-in-2025-2026-5d4f</link>
      <guid>https://dev.to/maxvyaznikov/a-developers-guide-to-choosing-a-gpu-for-machine-learning-in-2025-2026-5d4f</guid>
      <description>&lt;p&gt;Choosing the right GPU for ML is confusing. Marketing specs don't tell you what matters for training and inference. Here's what actually counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Specs That Matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. VRAM (Most Important)
&lt;/h3&gt;

&lt;p&gt;VRAM determines &lt;strong&gt;what models you can run&lt;/strong&gt;. No amount of compute power helps if your model doesn't fit in memory.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;What Fits (Inference)&lt;/th&gt;
&lt;th&gt;What Fits (Training)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;7B at Q4&lt;/td&gt;
&lt;td&gt;7B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12 GB&lt;/td&gt;
&lt;td&gt;13B at Q4&lt;/td&gt;
&lt;td&gt;7B QLoRA comfortably&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;24B at Q4&lt;/td&gt;
&lt;td&gt;13B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;34B at Q5&lt;/td&gt;
&lt;td&gt;13B LoRA (8-bit base), 34B QLoRA (tight)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;48 GB&lt;/td&gt;
&lt;td&gt;70B at Q4&lt;/td&gt;
&lt;td&gt;34B LoRA (8-bit base), 70B QLoRA (tight)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;70B at Q8 (FP16 needs ~140 GB)&lt;/td&gt;
&lt;td&gt;70B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb&lt;/strong&gt;: buy the most VRAM you can afford. You can't upgrade VRAM later.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Memory Bandwidth
&lt;/h3&gt;

&lt;p&gt;For LLM inference, throughput is limited by how fast you can read model weights from VRAM. This is the &lt;strong&gt;memory bandwidth&lt;/strong&gt; spec.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Bandwidth&lt;/th&gt;
&lt;th&gt;Llama 8B Q4 tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4060&lt;/td&gt;
&lt;td&gt;272 GB/s&lt;/td&gt;
&lt;td&gt;~35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4070&lt;/td&gt;
&lt;td&gt;504 GB/s&lt;/td&gt;
&lt;td&gt;~60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090&lt;/td&gt;
&lt;td&gt;936 GB/s&lt;/td&gt;
&lt;td&gt;~85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;1,008 GB/s&lt;/td&gt;
&lt;td&gt;~105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A100 80GB&lt;/td&gt;
&lt;td&gt;2,039 GB/s&lt;/td&gt;
&lt;td&gt;~180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H100&lt;/td&gt;
&lt;td&gt;3,350 GB/s&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Higher bandwidth = faster token generation. This is why a 3090 feels faster for LLMs than a 4070 despite being a generation older.&lt;/p&gt;
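
&lt;p&gt;You can sanity-check the bandwidth argument with a back-of-the-envelope ceiling: every generated token has to read all the weights once, so decode speed can't exceed bandwidth divided by model size. The sketch below uses the bandwidth figures from the table. The function name and the 0.56 bytes per parameter (an average for a Q4_K_M-style quantization) are assumptions for illustration.&lt;/p&gt;

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float,
                         bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed.

    Each generated token reads every weight once, so the ceiling is
    memory bandwidth divided by the size of the weights.
    """
    weights_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weights_gb


# Bandwidth in GB/s from the table; 8B model at an assumed 0.56 bytes/param.
for name, bw in [("RTX 4060", 272), ("RTX 3090", 936), ("RTX 4090", 1008)]:
    print(f"{name}: at most ~{decode_ceiling_tok_s(bw, 8, 0.56):.0f} tok/s")
```

&lt;p&gt;Measured numbers sit well below these ceilings, which is expected: real runs also read the KV cache and pay kernel-launch and sampling overhead. The ranking between cards is what the estimate is good for.&lt;/p&gt;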

&lt;h3&gt;
  
  
  3. Tensor Cores
&lt;/h3&gt;

&lt;p&gt;Tensor Cores accelerate matrix multiplication — the core operation in neural networks. They matter most for &lt;strong&gt;training&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;CC&lt;/th&gt;
&lt;th&gt;Supported Precisions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st (Volta)&lt;/td&gt;
&lt;td&gt;7.0&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd (Turing)&lt;/td&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;td&gt;FP16, INT8, INT4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd (Ampere)&lt;/td&gt;
&lt;td&gt;8.0-8.7&lt;/td&gt;
&lt;td&gt;FP16, BF16, TF32, INT8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th (Ada)&lt;/td&gt;
&lt;td&gt;8.9&lt;/td&gt;
&lt;td&gt;FP16, BF16, TF32, FP8, INT8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5th (Blackwell)&lt;/td&gt;
&lt;td&gt;10.0 (data center), 12.0 (GeForce)&lt;/td&gt;
&lt;td&gt;All above + FP4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;BF16 support (Ampere+)&lt;/strong&gt; is especially important: it's the default training precision for modern models, and because it keeps FP32's exponent range it avoids the overflow-driven NaN issues that FP16 can cause.&lt;/p&gt;
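
&lt;p&gt;The range difference is easy to see without a GPU. The sketch below round-trips a value through FP16 using Python's &lt;code&gt;struct&lt;/code&gt; module and simulates BF16 by keeping the top 16 bits of the FP32 encoding. Two simplifications to keep in mind: Python raises &lt;code&gt;OverflowError&lt;/code&gt; where a GPU would produce &lt;code&gt;inf&lt;/code&gt;, and real hardware rounds to nearest instead of truncating. The helper names are made up for this example.&lt;/p&gt;

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (struct format 'e')."""
    return struct.unpack("!e", struct.pack("!e", x))[0]


def to_bf16(x: float) -> float:
    """Simulate BF16 by keeping the top 16 bits of the float32 encoding.

    Real hardware rounds to nearest; truncation is a simplification.
    """
    top_two_bytes = struct.pack("!f", x)[:2]
    return struct.unpack("!f", top_two_bytes + b"\x00\x00")[0]


value = 70000.0  # just above the FP16 maximum of 65504
try:
    print("FP16:", to_fp16(value))
except OverflowError as err:
    print("FP16 overflow:", err)
print("BF16:", to_bf16(value))  # stays finite, with reduced precision
```

&lt;p&gt;BF16 trades mantissa bits for exponent bits, so the value survives but loses precision. For gradients and activations that is usually the better trade.&lt;/p&gt;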

&lt;h3&gt;
  
  
  4. CUDA Compute Capability
&lt;/h3&gt;

&lt;p&gt;CC determines what frameworks and features your GPU supports. As of 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum CC 5.0&lt;/strong&gt; for PyTorch/TensorFlow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 7.0+&lt;/strong&gt; for Tensor Cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.0+&lt;/strong&gt; for Flash Attention, BF16&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.9+&lt;/strong&gt; for FP8&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can look up any GPU's compute capability at &lt;a href="https://gpuark.com/en/cuda-compute-capability/" rel="noopener noreferrer"&gt;gpuark.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Recommendations by Budget
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Under $400: RTX 4060 Ti 16GB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;16 GB VRAM — runs 24B models at Q4&lt;/li&gt;
&lt;li&gt;CC 8.9 (Ada Lovelace) — all modern features&lt;/li&gt;
&lt;li&gt;165W TDP — low power&lt;/li&gt;
&lt;li&gt;Limitation: 128-bit bus, 288 GB/s bandwidth (slow for LLMs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $500-700: Used RTX 3090
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;24 GB VRAM&lt;/strong&gt; — the sweet spot&lt;/li&gt;
&lt;li&gt;CC 8.6 — BF16, Flash Attention, everything you need&lt;/li&gt;
&lt;li&gt;936 GB/s bandwidth — fast LLM inference&lt;/li&gt;
&lt;li&gt;350W TDP — needs a beefy PSU&lt;/li&gt;
&lt;li&gt;Best value in ML GPUs right now&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $1,500-1,800: RTX 4090
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;24 GB VRAM (same as 3090)&lt;/li&gt;
&lt;li&gt;2× training throughput vs 3090&lt;/li&gt;
&lt;li&gt;Better power efficiency&lt;/li&gt;
&lt;li&gt;CC 8.9 — FP8 support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $3,000-5,000: Used A100 40GB/80GB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Professional GPU with ECC memory&lt;/li&gt;
&lt;li&gt;80GB version fits 70B at Q8 (FP16 weights alone need ~140 GB)&lt;/li&gt;
&lt;li&gt;2 TB/s bandwidth&lt;/li&gt;
&lt;li&gt;NVLink support for multi-GPU&lt;/li&gt;
&lt;li&gt;Best for research labs and startups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "More CUDA cores = better for ML"
&lt;/h3&gt;

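&lt;p&gt;Not on its own. The 3090 (10,496 cores) does beat the newer 4070 (5,888 cores) for ML, but the reason is its 24 GB of VRAM and 936 GB/s of bandwidth, not the core count. Core counts also aren't comparable across architectures. VRAM and bandwidth matter more.&lt;/p&gt;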

&lt;h3&gt;
  
  
  "I need the latest generation"
&lt;/h3&gt;

&lt;p&gt;The RTX 3090 (2020) is still one of the best ML GPUs in 2026. Unless you specifically need FP8 or newer features, older high-end cards often beat newer mid-range ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Gaming benchmarks predict ML performance"
&lt;/h3&gt;

&lt;p&gt;Games stress a different mix of GPU resources than ML does. A GPU that's 20% faster in games might be 50% slower for training if it has less VRAM or lower bandwidth.&lt;/p&gt;

&lt;h3&gt;
  
  
  "I'll just use the cloud"
&lt;/h3&gt;

&lt;p&gt;Cloud GPUs cost $1-4/hour. If you train regularly, a $700 used 3090 pays for itself in ~3-6 months compared to cloud rentals.&lt;/p&gt;
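
&lt;p&gt;Here is the break-even arithmetic as a small sketch, so you can plug in your own numbers. The 350 W draw and $0.12/kWh electricity price are assumptions for illustration, and the function name is made up for this example.&lt;/p&gt;

```python
def breakeven_hours(gpu_price: float, cloud_rate: float,
                    watts: float = 350.0, price_per_kwh: float = 0.12) -> float:
    """GPU-hours after which buying beats renting.

    Each local hour saves the cloud rate but costs electricity.
    The 350 W draw and $0.12/kWh are illustrative assumptions.
    """
    local_cost_per_hour = watts / 1000.0 * price_per_kwh
    return gpu_price / (cloud_rate - local_cost_per_hour)


for rate in (1.0, 2.0, 4.0):
    hours = breakeven_hours(700.0, rate)
    print(f"${rate:.0f}/h cloud: break-even after ~{hours:.0f} GPU-hours")
```

&lt;p&gt;At the cheap end of cloud pricing, the break-even lands at roughly 730 GPU-hours, which is about 4-8 hours of training per day over 3-6 months. At higher hourly rates it comes much sooner.&lt;/p&gt;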

&lt;h2&gt;
  
  
  Quick Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max VRAM per $&lt;/td&gt;
&lt;td&gt;Used RTX 3090&lt;/td&gt;
&lt;td&gt;24GB at ~$650&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training speed&lt;/td&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;2× faster than 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference tok/s&lt;/td&gt;
&lt;td&gt;RTX 3090 or 4090&lt;/td&gt;
&lt;td&gt;Best bandwidth at consumer price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM 70B+&lt;/td&gt;
&lt;td&gt;2× Used 3090&lt;/td&gt;
&lt;td&gt;48GB for ~$1,300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Professional&lt;/td&gt;
&lt;td&gt;A100 80GB&lt;/td&gt;
&lt;td&gt;80GB, NVLink, ECC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Building an ML rig? Drop your budget and use case in the comments — happy to help pick components!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>RTX 4090 vs RTX 3090 for AI/ML: Is the Upgrade Worth It?</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:03:04 +0000</pubDate>
      <link>https://dev.to/maxvyaznikov/rtx-4090-vs-rtx-3090-for-aiml-is-the-upgrade-worth-it-c68</link>
      <guid>https://dev.to/maxvyaznikov/rtx-4090-vs-rtx-3090-for-aiml-is-the-upgrade-worth-it-c68</guid>
      <description>&lt;p&gt;The RTX 3090 and RTX 4090 are the two most popular consumer GPUs for AI/ML work. Both have 24GB VRAM, but the price gap is massive. Let's break down when each one makes sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Specs Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Ampere (CC 8.6)&lt;/td&gt;
&lt;td&gt;Ada Lovelace (CC 8.9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VRAM&lt;/td&gt;
&lt;td&gt;24 GB GDDR6X&lt;/td&gt;
&lt;td&gt;24 GB GDDR6X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Bandwidth&lt;/td&gt;
&lt;td&gt;936 GB/s&lt;/td&gt;
&lt;td&gt;1,008 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA Cores&lt;/td&gt;
&lt;td&gt;10,496&lt;/td&gt;
&lt;td&gt;16,384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensor Cores&lt;/td&gt;
&lt;td&gt;328 (3rd gen)&lt;/td&gt;
&lt;td&gt;512 (4th gen)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TDP&lt;/td&gt;
&lt;td&gt;350W&lt;/td&gt;
&lt;td&gt;450W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16 Tensor&lt;/td&gt;
&lt;td&gt;142 TFLOPS&lt;/td&gt;
&lt;td&gt;330 TFLOPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Price (2026)&lt;/td&gt;
&lt;td&gt;Discontinued&lt;/td&gt;
&lt;td&gt;~$1,800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Used Price (2026)&lt;/td&gt;
&lt;td&gt;~$600-700&lt;/td&gt;
&lt;td&gt;~$1,400-1,500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a detailed side-by-side with all specifications, see the &lt;a href="https://gpuark.com/en/gpu/nvidia-geforce-rtx-4090-vs-nvidia-geforce-rtx-3090/" rel="noopener noreferrer"&gt;RTX 4090 vs RTX 3090 comparison page on gpuark.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training Performance
&lt;/h2&gt;

&lt;p&gt;The 4090 is roughly &lt;strong&gt;1.7-2× faster&lt;/strong&gt; for training due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;56% more CUDA cores&lt;/li&gt;
&lt;li&gt;4th gen Tensor Cores (adds FP8, higher BF16 throughput)&lt;/li&gt;
&lt;li&gt;Higher clock speeds&lt;/li&gt;
&lt;li&gt;Better power efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-world training benchmarks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ResNet-50 (BS=64)&lt;/td&gt;
&lt;td&gt;780 img/s&lt;/td&gt;
&lt;td&gt;1,420 img/s&lt;/td&gt;
&lt;td&gt;1.82×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERT fine-tune (BS=32)&lt;/td&gt;
&lt;td&gt;145 samples/s&lt;/td&gt;
&lt;td&gt;268 samples/s&lt;/td&gt;
&lt;td&gt;1.85×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable Diffusion training&lt;/td&gt;
&lt;td&gt;2.1 it/s&lt;/td&gt;
&lt;td&gt;3.8 it/s&lt;/td&gt;
&lt;td&gt;1.81×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaMA 7B LoRA (r=16)&lt;/td&gt;
&lt;td&gt;1.4 it/s&lt;/td&gt;
&lt;td&gt;2.6 it/s&lt;/td&gt;
&lt;td&gt;1.86×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Inference Performance (LLMs)
&lt;/h2&gt;

&lt;p&gt;For LLM inference, the gap narrows because it's &lt;strong&gt;memory-bandwidth bound&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B Q4 (tok/s)&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;1.24×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B Q4 (tok/s)&lt;/td&gt;
&lt;td&gt;doesn't fit&lt;/td&gt;
&lt;td&gt;doesn't fit&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B Q4 (prompt)&lt;/td&gt;
&lt;td&gt;1,200 tok/s&lt;/td&gt;
&lt;td&gt;1,800 tok/s&lt;/td&gt;
&lt;td&gt;1.50×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Memory bandwidth difference is only 8% (936 vs 1,008 GB/s), so for pure token generation the 4090 advantage is modest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Decision
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Buy a 4090 if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Training throughput is your bottleneck (research, frequent fine-tuning)&lt;/li&gt;
&lt;li&gt;You need FP8 features (CC 8.9 vs 8.6)&lt;/li&gt;
&lt;li&gt;Power efficiency matters (performance per watt is much better)&lt;/li&gt;
&lt;li&gt;You want one powerful card, not multi-GPU hassle&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Buy a used 3090 (or two) if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;VRAM is your bottleneck (most LLM use cases)&lt;/li&gt;
&lt;li&gt;Budget matters — two 3090s = 48GB for ~$1,300 vs one 4090 = 24GB for ~$1,500&lt;/li&gt;
&lt;li&gt;You primarily do inference&lt;/li&gt;
&lt;li&gt;You want to run 34B+ models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The multi-GPU argument
&lt;/h3&gt;

&lt;p&gt;Two used 3090s give you &lt;strong&gt;48GB total VRAM&lt;/strong&gt; for less than one 4090:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can run Llama 3.1 70B at Q4_K_M&lt;/li&gt;
&lt;li&gt;Pipeline parallelism with llama.cpp works out of the box&lt;/li&gt;
&lt;li&gt;Training with FSDP/DeepSpeed ZeRO-3 across both cards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catch: inter-GPU communication over PCIe is slower than a single card's internal bandwidth. For training, expect ~1.5-1.7× scaling (not 2×). For inference with pipeline parallelism, the latency penalty is minimal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Power Consumption
&lt;/h2&gt;

&lt;p&gt;Often overlooked but significant. The figures below assume the card draws its full TDP around the clock at roughly $0.12/kWh:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;TDP&lt;/th&gt;
&lt;th&gt;Annual electricity (24/7)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1× RTX 3090&lt;/td&gt;
&lt;td&gt;350W&lt;/td&gt;
&lt;td&gt;~$370/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1× RTX 4090&lt;/td&gt;
&lt;td&gt;450W&lt;/td&gt;
&lt;td&gt;~$475/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;td&gt;700W&lt;/td&gt;
&lt;td&gt;~$740/year&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If running 24/7 as an inference server, the 4090's better perf/watt matters. For occasional use, it doesn't.&lt;/p&gt;
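
&lt;p&gt;To redo the electricity numbers for your own rate and duty cycle, the calculation is short. In this sketch the $0.12/kWh default is an assumption, picked because it roughly reproduces the table, and the function name is illustrative.&lt;/p&gt;

```python
def annual_cost_usd(watts: float, price_per_kwh: float = 0.12,
                    hours_per_day: float = 24.0) -> float:
    """Yearly electricity cost for a constant power draw."""
    kwh_per_year = watts / 1000.0 * hours_per_day * 365
    return kwh_per_year * price_per_kwh


for label, watts in [("1x RTX 3090", 350), ("1x RTX 4090", 450), ("2x RTX 3090", 700)]:
    print(f"{label}: ~${annual_cost_usd(watts):.0f}/year")
```

&lt;p&gt;Real cards rarely sit at full TDP all day, so treat these as worst-case figures. Pass a lower &lt;code&gt;hours_per_day&lt;/code&gt; to model occasional use.&lt;/p&gt;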

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;The RTX 3090 at $600-700 used is the &lt;strong&gt;best value proposition in ML hardware&lt;/strong&gt; right now. The 4090 is the better card on every performance metric, but the 3090 delivers about 80% of its LLM inference speed (and a bit over half of its training throughput) at roughly 40% of the price.&lt;/p&gt;

&lt;p&gt;If you're VRAM-limited (and you probably are if you're running LLMs), two 3090s beat one 4090 every time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running ML workloads on consumer GPUs? Share your setup in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>CUDA Compute Capability: What It Is and Why It Matters for ML Engineers</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:45:21 +0000</pubDate>
      <link>https://dev.to/maxvyaznikov/cuda-compute-capability-what-it-is-and-why-it-matters-for-ml-engineers-1mhg</link>
      <guid>https://dev.to/maxvyaznikov/cuda-compute-capability-what-it-is-and-why-it-matters-for-ml-engineers-1mhg</guid>
      <description>&lt;p&gt;If you've ever seen an error like "CUDA error: no kernel image is available for execution on the device" or "minimum required Cuda capability is 3.5" — you've run into &lt;strong&gt;Compute Capability&lt;/strong&gt; issues. Here's everything you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Compute Capability?
&lt;/h2&gt;

&lt;p&gt;CUDA Compute Capability (CC) is a &lt;strong&gt;version number&lt;/strong&gt; assigned to every NVIDIA GPU that identifies its &lt;strong&gt;architecture and supported feature set&lt;/strong&gt;. It's NOT a performance score.&lt;/p&gt;

&lt;p&gt;Format: &lt;code&gt;Major.Minor&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Major&lt;/strong&gt; = GPU architecture generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minor&lt;/strong&gt; = incremental improvements within that generation
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GeForce GTX 1080  → CC 6.1 (Pascal)
GeForce RTX 3090  → CC 8.6 (Ampere)
GeForce RTX 4090  → CC 8.9 (Ada Lovelace)
H100              → CC 9.0 (Hopper)
RTX 5090          → CC 12.0 (Blackwell)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Framework compatibility
&lt;/h3&gt;

&lt;p&gt;Modern ML frameworks have &lt;strong&gt;minimum CC requirements&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Minimum CC&lt;/th&gt;
&lt;th&gt;What's excluded&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PyTorch 2.x&lt;/td&gt;
&lt;td&gt;3.7&lt;/td&gt;
&lt;td&gt;Kepler below CC 3.7 (e.g. K40, GTX 780)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorFlow 2.15+&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;All Kepler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JAX latest&lt;/td&gt;
&lt;td&gt;5.2&lt;/td&gt;
&lt;td&gt;Kepler and first-gen Maxwell (CC 5.0, e.g. GTX 750 Ti)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash Attention 2&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;Everything before Ampere&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your GPU's CC is below the minimum, the framework &lt;strong&gt;will not use it&lt;/strong&gt; — you'll silently fall back to CPU or get a hard error.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Feature availability
&lt;/h3&gt;

&lt;p&gt;Each CC level unlocks hardware features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CC&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Key ML Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5.0-5.2&lt;/td&gt;
&lt;td&gt;Maxwell&lt;/td&gt;
&lt;td&gt;Basic CUDA, cuDNN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6.0-6.1&lt;/td&gt;
&lt;td&gt;Pascal&lt;/td&gt;
&lt;td&gt;FP16 compute, unified memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7.0&lt;/td&gt;
&lt;td&gt;Volta&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Tensor Cores&lt;/strong&gt; (1st gen), WMMA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;td&gt;Turing&lt;/td&gt;
&lt;td&gt;INT8/INT4 Tensor Cores, mixed precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;Ampere&lt;/td&gt;
&lt;td&gt;BF16, TF32, sparse Tensor Cores, 3rd gen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;Ampere (consumer)&lt;/td&gt;
&lt;td&gt;Same features, fewer SMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.9&lt;/td&gt;
&lt;td&gt;Ada Lovelace&lt;/td&gt;
&lt;td&gt;FP8, 4th gen Tensor Cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;Hopper&lt;/td&gt;
&lt;td&gt;Transformer Engine, FP8 matmul, DPX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10.0 / 12.0&lt;/td&gt;
&lt;td&gt;Blackwell (data center / GeForce)&lt;/td&gt;
&lt;td&gt;5th gen Tensor Cores, FP4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Compilation targets
&lt;/h3&gt;

&lt;p&gt;When you compile CUDA code (or when PyTorch ships prebuilt binaries), it targets specific CC versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Compile for multiple architectures&lt;/span&gt;
nvcc &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_80,code&lt;span class="o"&gt;=&lt;/span&gt;sm_80 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_86,code&lt;span class="o"&gt;=&lt;/span&gt;sm_86 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_89,code&lt;span class="o"&gt;=&lt;/span&gt;sm_89 &lt;span class="se"&gt;\&lt;/span&gt;
     my_kernel.cu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PyTorch wheels on PyPI typically include CC 5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 8.9, 9.0. If your GPU isn't covered, you may need to build from source.&lt;/p&gt;
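
&lt;p&gt;Whether a given build "covers" your GPU follows two rules: &lt;code&gt;sm_XY&lt;/code&gt; machine code runs on GPUs with the same major version and an equal or higher minor version, while &lt;code&gt;compute_XY&lt;/code&gt; PTX can be JIT-compiled for any newer GPU. The sketch below encodes those rules. The function name and the example arch list are hypothetical. With PyTorch installed you could pass in &lt;code&gt;torch.cuda.get_arch_list()&lt;/code&gt; and &lt;code&gt;torch.cuda.get_device_capability()&lt;/code&gt; instead.&lt;/p&gt;

```python
def build_covers_gpu(arch_list: list[str], capability: tuple[int, int]) -> bool:
    """Check whether a binary's target list can run on a given GPU.

    sm_XY entries are machine code: they run on GPUs with the same
    major version and an equal or higher minor version.
    compute_XY entries are PTX: the driver can JIT-compile them for
    any GPU with an equal or higher compute capability.
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, suffix = arch.partition("_")
        # Keep digits only; suffixes like the "a" in sm_90a are ignored
        # here, which is a simplification.
        digits = "".join(ch for ch in suffix if ch.isdigit())
        if not digits[:-1]:
            continue  # not a recognizable arch string
        # The last digit is the minor version, the rest is the major.
        target = (int(digits[:-1]), int(digits[-1]))
        if kind == "sm" and target[0] == major and minor >= target[1]:
            return True
        if kind == "compute" and capability >= target:
            return True
    return False


# Hypothetical arch list for illustration.
wheel_archs = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
print(build_covers_gpu(wheel_archs, (8, 9)))   # RTX 4090: covered via sm_80/sm_86
print(build_covers_gpu(wheel_archs, (12, 0)))  # CC 12.0: no matching major, no PTX
```

&lt;p&gt;This is why a brand-new architecture often fails with older wheels: without a matching major version or an embedded PTX target, there is nothing the driver can run.&lt;/p&gt;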

&lt;h2&gt;
  
  
  How to Check Your GPU's CC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  nvidia-smi (easiest, no CUDA toolkit needed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_cap &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
&lt;span class="c"&gt;# Output: 8.6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python (PyTorch)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_capability&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compute Capability: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;minor&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python (TensorFlow)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="n"&gt;gpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_physical_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GPU&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compute_capability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  C++ (CUDA Runtime)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;cudaDeviceProp&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;cudaGetDeviceProperties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CC: %d.%d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;minor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lookup table
&lt;/h3&gt;

&lt;p&gt;Don't have the GPU installed yet? The &lt;a href="https://gpuark.com/en/cuda-compute-capability/" rel="noopener noreferrer"&gt;CUDA Compute Capability table on gpuark.com&lt;/a&gt; covers every NVIDIA GPU from Kepler to Blackwell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common CC-Related Errors and Fixes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "no kernel image is available for execution on the device"
&lt;/h3&gt;

&lt;p&gt;Your PyTorch/TensorFlow binary wasn't compiled for your GPU's CC. Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install PyTorch with the right CUDA version&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu124
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or build from source with your CC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
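&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--recursive&lt;/span&gt; https://github.com/pytorch/pytorch &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;pytorch
&lt;span class="nv"&gt;TORCH_CUDA_ARCH_LIST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"8.6"&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-build-isolation&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; .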
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  "minimum required Cuda capability is X.X"
&lt;/h3&gt;

&lt;p&gt;Your GPU is too old for the framework version. Options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use an older framework version&lt;/li&gt;
&lt;li&gt;Upgrade your GPU&lt;/li&gt;
&lt;li&gt;Use CPU mode: &lt;code&gt;CUDA_VISIBLE_DEVICES="" python train.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Flash Attention requires CC ≥ 8.0
&lt;/h3&gt;

&lt;p&gt;Flash Attention 2 only works on Ampere (RTX 30 series, A100) and newer. For older GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Option 1: use xformers instead (supports CC ≥ 6.0); install it from a shell with: pip install xformers
&lt;/span&gt;&lt;span class="c1"&gt;# Option 2: use PyTorch's built-in SDPA
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scaled_dot_product_attention&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Advice for GPU Shopping
&lt;/h2&gt;

&lt;p&gt;When buying a GPU for ML:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Minimum CC 7.5&lt;/strong&gt; (Turing) for mixed precision training — gives you Tensor Cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.0+&lt;/strong&gt; (Ampere) strongly recommended — BF16, Flash Attention, much better ML performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.9&lt;/strong&gt; (Ada) for bleeding-edge features like FP8 quantization-aware training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM matters more than CC&lt;/strong&gt; in most cases — a 3090 (CC 8.6, 24GB) beats a 4070 (CC 8.9, 12GB) for LLMs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CC tells you &lt;em&gt;what features your GPU supports&lt;/em&gt;. VRAM tells you &lt;em&gt;how big a model fits&lt;/em&gt;. Both matter, but for LLM inference, VRAM is usually the bottleneck.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What GPU are you running your ML workloads on? Have you hit CC compatibility issues? Let me know in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nvidia</category>
    </item>
    <item>
      <title>How Much VRAM Do You Actually Need to Run LLMs Locally?</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:44:13 +0000</pubDate>
      <link>https://dev.to/maxvyaznikov/how-much-vram-do-you-actually-need-to-run-llms-locally-2604</link>
      <guid>https://dev.to/maxvyaznikov/how-much-vram-do-you-actually-need-to-run-llms-locally-2604</guid>
      <description>&lt;p&gt;Running large language models locally has become increasingly practical — but figuring out exactly how much VRAM you need can be confusing. Here's a concrete breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Simple Formula
&lt;/h2&gt;

&lt;p&gt;For &lt;strong&gt;inference&lt;/strong&gt; (running a model, not training):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VRAM ≈ Parameters × Bytes per Weight + KV Cache + Overhead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where bytes per weight depends on quantization:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Bytes/Param&lt;/th&gt;
&lt;th&gt;Example: 7B model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FP32&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;28 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16/BF16&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT8 (Q8)&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (Q4_K_M)&lt;/td&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (Q4_0)&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;3.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Add &lt;strong&gt;10-20% overhead&lt;/strong&gt; for KV cache (more for longer contexts) and runtime buffers.&lt;/p&gt;
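
&lt;p&gt;The formula is simple enough to turn into a quick estimator. This sketch uses the bytes-per-parameter values from the table and a flat 15% overhead (the midpoint of the range above, picked as an assumption). The function and dictionary names are illustrative.&lt;/p&gt;

```python
BYTES_PER_PARAM = {  # values from the table above
    "FP32": 4.0,
    "FP16": 2.0,
    "Q8": 1.0,
    "Q4_K_M": 0.56,
    "Q4_0": 0.5,
}


def estimate_vram_gb(params_billion: float, precision: str,
                     overhead: float = 0.15) -> float:
    """Weights plus a flat overhead fraction for KV cache and buffers.

    overhead=0.15 is the midpoint of the 10-20% range; long contexts
    need more.
    """
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + overhead)


for size in (7, 13, 34, 70):
    print(f"{size}B at Q4_K_M: ~{estimate_vram_gb(size, 'Q4_K_M'):.1f} GB")
```

&lt;p&gt;Compare the output against your card's VRAM. If the estimate lands within a gigabyte or so of the limit, expect to run with a short context or offload a few layers.&lt;/p&gt;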

&lt;h2&gt;
  
  
  Practical VRAM Requirements by Model
&lt;/h2&gt;

&lt;p&gt;Here's what you can actually run on common GPUs:&lt;/p&gt;

&lt;h3&gt;
  
  
  8 GB VRAM (RTX 4060, RTX 3070)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 8B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5 7B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Mistral 7B at Q5_K_M ✅&lt;/li&gt;
&lt;li&gt;Phi-3.5 Mini (3.8B) at Q8 ✅&lt;/li&gt;
&lt;li&gt;13B models at Q4 ⚠️ (tight, short context only)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  12 GB VRAM (RTX 4070, RTX 3060 12GB)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;13B models at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Llama 3.1 8B at Q8 ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5-Coder 14B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;20B models at Q4 ⚠️&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  16 GB VRAM (RTX 4080, RTX 5070 Ti)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Mistral Small 24B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5-Coder 14B at Q6_K ✅&lt;/li&gt;
&lt;li&gt;20B models at Q5 ✅ (Q6 is tight)&lt;/li&gt;
&lt;li&gt;34B models at Q4 ⚠️&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  24 GB VRAM (RTX 3090, RTX 4090)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 70B at Q4_K_M ⚠️ (only with partial CPU offload, which is much slower)&lt;/li&gt;
&lt;li&gt;34B models at Q4_K_M ✅ (Q5 is tight)&lt;/li&gt;
&lt;li&gt;Qwen2.5 32B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;DeepSeek-Coder-V2-Lite 16B at Q8 ✅ (FP16 needs about 32 GB)&lt;/li&gt;
&lt;li&gt;Mistral Small 24B at Q6_K ✅ (Q8 needs about 24 GB for the weights alone)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  48 GB VRAM (2× RTX 3090, A6000)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 70B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;DeepSeek V3 671B: not enough, even at Q2&lt;/li&gt;
&lt;li&gt;Mixtral 8x7B at Q6 ✅ (Mixtral 8x22B, at 141B parameters, needs roughly 80 GB at Q4)&lt;/li&gt;
&lt;/ul&gt;
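&lt;p&gt;The per-GPU lists above follow from the same arithmetic. Here is a hedged sketch that picks the highest precision whose estimate fits a given VRAM budget. The helper name and the flat 15% overhead are assumptions for illustration, and real runtimes differ in how much headroom they need.&lt;/p&gt;

```python
# Bytes per parameter from the precision table earlier in the article,
# ordered from highest to lowest precision.
PRECISIONS = [("FP16", 2.0), ("Q8", 1.0), ("Q4_K_M", 0.56), ("Q4_0", 0.5)]


def best_precision_that_fits(params_billion, vram_gb, overhead=0.15):
    """Return the highest-precision option whose estimate fits, or None.

    The flat overhead fraction is an assumption; it does not model
    long-context KV cache growth or CPU offload.
    """
    for name, bytes_per_param in PRECISIONS:
        needed_gb = params_billion * bytes_per_param * (1.0 + overhead)
        if vram_gb >= needed_gb:
            return name
    return None
```

&lt;p&gt;With these assumptions, an 8B model on 8 GB resolves to Q4_K_M, the same model on 12 GB resolves to Q8, and a 70B model returns &lt;code&gt;None&lt;/code&gt; on 24 GB but Q4_K_M on 48 GB.&lt;/p&gt;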

&lt;h2&gt;
  
  
  The Quantization Sweet Spot
&lt;/h2&gt;

&lt;p&gt;Q4_K_M is the most popular quantization for local inference, and for good reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality:&lt;/strong&gt; ~1-2% degradation vs FP16 on most benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size:&lt;/strong&gt; ~28% of the FP16 size (about 56% of Q8)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; Token generation on consumer GPUs is memory-bandwidth bound, so smaller weights generally decode faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Going lower (Q3, Q2) introduces noticeable quality degradation, especially on reasoning tasks. Going higher (Q6, Q8) gives marginal quality improvement but costs significantly more VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About Training?
&lt;/h2&gt;

&lt;p&gt;Training needs &lt;strong&gt;much more&lt;/strong&gt; memory than inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Training VRAM ≈ Model weights + Gradients + Optimizer states + Activations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For full fine-tuning with Adam optimizer at FP32:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: 4 bytes/param&lt;/li&gt;
&lt;li&gt;Gradients: 4 bytes/param&lt;/li&gt;
&lt;li&gt;Adam states: 8 bytes/param&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: ~16 bytes/param&lt;/strong&gt; (before activations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 7B model needs &lt;strong&gt;~112 GB&lt;/strong&gt; for full FP32 training. That's why techniques like &lt;strong&gt;LoRA&lt;/strong&gt; (which only trains ~1-2% of parameters) and &lt;strong&gt;QLoRA&lt;/strong&gt; (quantized base + LoRA) are so popular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 7B: ~6-8 GB VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 13B: ~10-12 GB VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 70B: ~40-48 GB VRAM&lt;/li&gt;
&lt;/ul&gt;
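&lt;p&gt;The 16 bytes/param accounting for full FP32 fine-tuning can be written as a small helper. This is a sketch with a hypothetical function name. It covers only weights, gradients, and Adam states, so activations come on top, and it does not model the QLoRA figures listed above.&lt;/p&gt;

```python
def full_finetune_state_gb(params_billion,
                           weight_bytes=4, grad_bytes=4, adam_bytes=8):
    """GB for weights, gradients, and Adam states, before activations.

    Defaults follow the FP32 breakdown in the text: 4 + 4 + 8 bytes
    per parameter.
    """
    return params_billion * (weight_bytes + grad_bytes + adam_bytes)
```

&lt;p&gt;For a 7B model, &lt;code&gt;full_finetune_state_gb(7)&lt;/code&gt; gives 112, which matches the ~112 GB figure.&lt;/p&gt;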

&lt;h2&gt;
  
  
  KV Cache: The Hidden VRAM Consumer
&lt;/h2&gt;

&lt;p&gt;When generating long texts, the KV cache grows with context length:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
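&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KV cache ≈ 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In models without grouped-query attention, &lt;code&gt;num_kv_heads × head_dim&lt;/code&gt; equals the hidden dimension. Llama 3.1 8B uses grouped-query attention with 32 layers, 8 KV heads, and a head dimension of 128, which gives:&lt;/p&gt;

&lt;p&gt;For Llama 3.1 8B at FP16 with 8K context: ~1 GB&lt;br&gt;
For Llama 3.1 8B at FP16 with 128K context: ~16 GB&lt;/p&gt;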

&lt;p&gt;This is why you might load a model fine but run out of memory during long conversations.&lt;/p&gt;
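&lt;p&gt;To check the two Llama 3.1 8B figures, here is a minimal sketch of the KV cache arithmetic. It takes the per-token width as &lt;code&gt;num_kv_heads × head_dim&lt;/code&gt;, which is what grouped-query attention models actually store. The function name is hypothetical, and the configuration values (32 layers, 8 KV heads, head dimension 128) are taken from the published Llama 3.1 8B config rather than from this article.&lt;/p&gt;

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_length,
                 bytes_per_element=2):
    """KV cache size in GiB for a single sequence.

    One key and one value vector (hence the factor 2) is stored per
    layer and per token; bytes_per_element=2 corresponds to FP16.
    """
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_length * bytes_per_element)
    return total_bytes / 2**30
```

&lt;p&gt;With those values, &lt;code&gt;kv_cache_gib(32, 8, 128, 8192)&lt;/code&gt; gives 1.0 and &lt;code&gt;kv_cache_gib(32, 8, 128, 131072)&lt;/code&gt; gives 16.0.&lt;/p&gt;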

&lt;h2&gt;
  
  
  Tools for Estimating
&lt;/h2&gt;

&lt;p&gt;Rather than doing this math by hand every time, there's a &lt;a href="https://gpuark.com/en/vram-calculator/" rel="noopener noreferrer"&gt;VRAM calculator&lt;/a&gt; that estimates memory requirements — plug in the model size, quantization level, and context length to see if it fits your GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Budget&lt;/th&gt;
&lt;th&gt;Best GPU&lt;/th&gt;
&lt;th&gt;What You Can Run&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;~$300&lt;/td&gt;
&lt;td&gt;RTX 4060 8GB&lt;/td&gt;
&lt;td&gt;7-8B models at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$400&lt;/td&gt;
&lt;td&gt;RTX 4060 Ti 16GB&lt;/td&gt;
&lt;td&gt;Up to 24B at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$600&lt;/td&gt;
&lt;td&gt;Used RTX 3090 24GB&lt;/td&gt;
&lt;td&gt;Up to 34B at Q4, 70B with partial CPU offload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$1200&lt;/td&gt;
&lt;td&gt;2× Used RTX 3090&lt;/td&gt;
&lt;td&gt;70B at Q4, most models comfortably&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$1800&lt;/td&gt;
&lt;td&gt;RTX 4090 24GB&lt;/td&gt;
&lt;td&gt;Same models as the 3090, faster (up to ~2× on compute-bound prompt processing)&lt;/td&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most cost-effective option for serious local LLM use in 2025-2026 is still a &lt;strong&gt;used RTX 3090&lt;/strong&gt; — 24 GB of VRAM at a fraction of the 4090 price.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your local LLM setup? Drop a comment with your GPU and favorite model!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
    </item>
  </channel>
</rss>
