This article was originally published on runaihome.com
Most "best GPU" articles rank cards by gaming benchmarks. That ranking is wrong for local
AI. The card that wins Cyberpunk 2077 at 4K can be the worst pick for running Llama 3 at
home, and a "midrange" card from two generations ago can outperform a brand-new flagship
on a per-dollar basis. This guide ignores frame rates and ranks by what actually matters
for local AI inference: VRAM capacity, memory bandwidth, and price-per-gigabyte.
The structure is six budget tiers, $300 to $3000+. For each, we name the realistic picks
(new and used), state the verified specs, and flag the trap cards to avoid. All
specifications are sourced from manufacturer pages and independent benchmarks; all prices
are accurate as of May 2026 and will fluctuate. See the Sources section
at the end for citations.
For the model side of the equation, see our companion
best models by VRAM tier guide.
The one rule: VRAM beats almost everything else
For local AI inference, VRAM capacity is the primary spec. Not CUDA cores, not Tensor
cores, not boost clock. The order of importance is roughly:
- VRAM size — determines what models fit at all
- Memory bandwidth — determines tokens/sec on models that fit
- Compute (Tensor cores) — matters mostly for fine-tuning and high batch sizes
- Power efficiency — matters for 24/7 home-server setups
This is why a used RTX 3090 (24 GB GDDR6X, 936 GB/s) often outperforms a brand-new RTX
5070 (12 GB GDDR7, 672 GB/s) for AI inference despite being nearly five years older. The 3090
has both more VRAM and more raw memory bandwidth, the two specs that bottleneck inference.
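The order of importance above works like a sort key: compare VRAM first, and use bandwidth only to break ties. Here is a minimal Python sketch using figures from this guide's spec table. The `Card` type and the `ai_rank_key` helper are hypothetical names for illustration, and the key deliberately ignores compute and power, which the list ranks lower.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: int
    bandwidth_gbs: int  # memory bandwidth in GB/s


def ai_rank_key(card: Card) -> tuple[int, int]:
    # VRAM decides first (what fits at all); bandwidth breaks ties (tokens/sec).
    return (card.vram_gb, card.bandwidth_gbs)


cards = [
    Card("RTX 5070", 12, 672),
    Card("RTX 3090", 24, 936),
    Card("RTX 5060 Ti 16GB", 16, 448),
    Card("RTX 4070", 12, 504),
]

# Highest-ranked card first.
ranked = sorted(cards, key=ai_rank_key, reverse=True)
for c in ranked:
    print(f"{c.name}: {c.vram_gb} GB, {c.bandwidth_gbs} GB/s")
```

This key places the 5060 Ti 16 GB above the 5070 even though the 5070 has more bandwidth. That is exactly the trade the rule makes: capacity first.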
For reference, here is the verified memory bandwidth of every card in this guide:
| Card | VRAM | Memory bandwidth | Memory type |
|---|---|---|---|
| RTX 3060 12GB | 12 GB | 360 GB/s | GDDR6 |
| RTX 3090 | 24 GB | 936 GB/s | GDDR6X |
| RTX 4060 Ti 16GB | 16 GB | 288 GB/s (554 GB/s "effective" with L2 cache) | GDDR6 |
| RTX 4070 | 12 GB | 504 GB/s | GDDR6X |
| RTX 4090 | 24 GB | 1,008 GB/s | GDDR6X |
| RTX 5060 Ti 16GB | 16 GB | 448 GB/s | GDDR7 |
| RTX 5070 | 12 GB | 672 GB/s | GDDR7 |
| RTX 5070 Ti | 16 GB | 896 GB/s | GDDR7 |
| RTX 5080 | 16 GB | 960 GB/s | GDDR7 |
| RTX 5090 | 32 GB | 1,792 GB/s | GDDR7 |
| Apple M4 Pro | up to 64 GB unified | 273 GB/s | LPDDR5X |
| Apple M4 Max | up to 128 GB unified | 546 GB/s (410 GB/s on the binned 14-core chip) | LPDDR5X |
| Apple M3 Ultra | up to 512 GB unified | 819 GB/s | LPDDR5 |
If a card is described as "great for AI" but has less than 12 GB VRAM, treat the claim
with suspicion regardless of how new it is.
$300–$450 — entry tier
Best card overall: Used RTX 3060 12 GB.
Best new option in this budget: None worth recommending.
The used 3060 12 GB has been the single best entry-level AI card on the market for three
years running. As of May 2026 it sells for roughly $267 average on eBay, has 12 GB of
GDDR6 with 360 GB/s memory bandwidth, and runs nearly any model up to about 14B parameters at
Q4 quantization.
Performance on this card is real, not theoretical. Independent benchmarks measure 42–53
tokens per second running Llama 3.1 8B at Q4_K_M quantization via llama.cpp — well
above the ~20 tokens/sec threshold where chat feels responsive. That is genuinely usable
performance for a $267 GPU.
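Why do the numbers land there? During single-user generation, each new token must stream essentially all of the model's weights out of VRAM. Memory bandwidth divided by model size therefore gives a rough ceiling on tokens/sec. The back-of-envelope sketch below makes two assumptions: the Llama 3.1 8B Q4_K_M file is about 4.9 GB (a typical GGUF size, not a figure from the cited benchmarks), and KV-cache reads and compute overhead can be ignored.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream tokens/sec if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


RTX_3060_BW = 360.0        # GB/s, from the spec table
LLAMA31_8B_Q4KM_GB = 4.9   # assumed approximate GGUF file size

ceiling = decode_ceiling_tok_s(RTX_3060_BW, LLAMA31_8B_Q4KM_GB)
measured = (42.0, 53.0)    # tok/s range quoted above

print(f"ceiling ~{ceiling:.0f} tok/s")
for m in measured:
    print(f"{m:.0f} tok/s = {m / ceiling:.0%} of ceiling")
```

Measured results should fall well short of this ceiling, because the estimate leaves out KV-cache traffic and kernel overhead. What matters is that the ceiling scales linearly with bandwidth.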
What you can run on a 3060 12GB:
- LLMs up to 13B parameters at Q4 quantization (Llama 3 8B, Mistral 7B, Qwen 2.5 14B)
- SDXL image generation (slow but works at 1024×1024)
- Whisper Large for transcription
- ComfyUI workflows up to medium complexity
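You can sanity-check the list above, including the 14B entry, by estimating weight size as parameters × bits per weight ÷ 8. This rough sketch assumes about 4.8 bits per weight for Q4_K_M (an approximate average, since the format mixes block types). It also reserves a hypothetical 1.5 GB for KV cache and runtime buffers. Real headroom depends on context length.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    # (params * 1e9) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, reserve_gb: float = 1.5) -> bool:
    # reserve_gb is an assumed allowance for KV cache and runtime buffers
    return weights_gb(params_billion, bits_per_weight) + reserve_gb <= vram_gb


Q4_K_M_BPW = 4.8  # assumed average bits per weight

for params in (8, 14, 32):
    size = weights_gb(params, Q4_K_M_BPW)
    print(f"{params}B @ Q4_K_M ~ {size:.1f} GB, fits 12 GB: {fits(params, Q4_K_M_BPW, 12)}")
```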
Avoid at this tier:
- RTX 4060 8 GB ($299 new) — VRAM-starved despite being newer; 8 GB caps you at roughly 7B–8B models at Q4 and chokes on SDXL. Save your money or grab the 3060 12 GB used.
- RTX 3050 6 GB / 3050 8 GB — too little VRAM for anything beyond toy models.
Honest take: If $300 is your ceiling, do not buy new. The used 3060 12 GB market is
your friend. Buy from a reputable seller with returns enabled.
$450–$750 — the practical entry
Best new card: RTX 5060 Ti 16 GB at $429 MSRP.
Best used pick (a stretch past this tier's ceiling): RTX 3090 24 GB at $800–$1,300 used
(high variance; market median around $1,050 as of May 2026).
Honorable mention: RTX 4060 Ti 16 GB if available below $400.
The RTX 5060 Ti 16 GB launched in April 2025 at a $429 MSRP. It pairs GDDR7 with a narrow
128-bit bus for 448 GB/s of memory bandwidth, well above the RTX 4060 Ti 16 GB's 288 GB/s
raw figure. (NVIDIA quotes 554 GB/s "effective" for the 4060 Ti thanks to its larger L2
cache, but a cache measured in tens of megabytes does little when every token streams
gigabytes of weights.) For 16 GB of VRAM at this price, it's the most compelling new card
in this tier.
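Those bandwidth figures follow directly from bus width and per-pin data rate: bandwidth (GB/s) = bus width (bits) ÷ 8 × data rate (Gbps per pin). The sketch below uses the commonly published per-pin rates for these cards: 28 Gbps GDDR7 on the 5060 Ti, 18 Gbps GDDR6 on the 4060 Ti, and 19.5 Gbps GDDR6X on the 3090. These rates are not taken from this article's sources.

```python
def bandwidth_gb_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    # Each pin moves gbps_per_pin gigabits/s; dividing by 8 converts bits to bytes.
    return bus_width_bits / 8 * gbps_per_pin


cards = {
    "RTX 5060 Ti 16GB (GDDR7)": (128, 28.0),
    "RTX 4060 Ti 16GB (GDDR6)": (128, 18.0),
    "RTX 3090 (GDDR6X)": (384, 19.5),
}

for name, (bus, rate) in cards.items():
    print(f"{name}: {bandwidth_gb_s(bus, rate):.0f} GB/s")
```

The 5060 Ti and 4060 Ti use the same 128-bit bus, so the 5060 Ti's bandwidth gain comes entirely from faster memory. The 3090 gets its lead from a bus three times as wide.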
But if you can find a clean used 3090, the math still favors it: 24 GB of VRAM and
936 GB/s of memory bandwidth versus the 5060 Ti's 16 GB / 448 GB/s. The 3090's catches: it
draws 350 W, used units are often ex-mining cards, and prices vary widely (clean cards trend
toward the $1,050+ range). For the complete 3090 value analysis, see our used RTX 3090 buyer's guide.
What you can run at this tier:
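- LLMs up to about 30B parameters at Q4 quantization on the 24 GB 3090; 16 GB cards top out around 20B-class models at Q4
- Llama 3.3 70B with aggressive Q3 quantization on the 3090 (slow, and only with part of the model offloaded to system RAM, since Q3 weights exceed 24 GB)
- SDXL and Flux Schnell at full speed
- Mistral / Qwen 32B-class models at Q4 (24 GB card)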
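A 70B model at Q3 is larger than 24 GB, so llama.cpp runs it by keeping only some transformer layers on the GPU (the `--n-gpu-layers` / `-ngl` option) and running the rest on the CPU. The rough sketch below estimates how many layers fit. It assumes Llama 3.3 70B's 80 layers, a Q3_K_M file of roughly 34 GB (approximate), equal-sized layers, and a hypothetical 2 GB reserve for KV cache and buffers.

```python
import math


def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    # Treats every layer as the same size (a simplification; embeddings/output differ).
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


MODEL_GB = 34.0   # assumed Q3_K_M size for a 70B model
N_LAYERS = 80     # Llama 3.3 70B layer count
VRAM_GB = 24.0

ngl = gpu_layers(MODEL_GB, N_LAYERS, VRAM_GB)
print(f"~{ngl} of {N_LAYERS} layers on GPU; e.g. pass -ngl {ngl} to llama.cpp")
```

Under these assumptions, the roughly 29 layers left on the CPU run at system-RAM bandwidth. That is where the "slow" comes from.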
Trap cards at this tier:
- RTX 4070 12 GB ($600+) — newer than the 3090 but only 12 GB VRAM and 504 GB/s bandwidth. AI buyers should skip it.
- RTX 5060 Ti 8 GB ($379 MSRP) — shares its model name with the 16 GB SKU, so check the spec sheet, not the name.
Honest take: Used 3090 if you can verify it's clean (not ex-mining at 24/7 load) and
have a 750W+ PSU. New 5060 Ti 16 GB if you want zero hassle. Skip the 4070.
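Because this guide ranks cards partly on price-per-gigabyte, here is that arithmetic for the cards discussed so far, at the prices quoted above: the used 3060 at $267, the 3090 at its ~$1,050 median, and the 4070 at $600. Treat it as a minimal sketch and substitute current prices, since they fluctuate.

```python
candidates = {
    "RTX 3060 12GB (used)": (267, 12),
    "RTX 5060 Ti 16GB": (429, 16),
    "RTX 4070 12GB": (600, 12),
    "RTX 3090 24GB (used)": (1050, 24),
}


def price_per_gb(price_usd: float, vram_gb: float) -> float:
    return price_usd / vram_gb


# Cheapest gigabyte first.
for name, (price, vram) in sorted(candidates.items(),
                                  key=lambda kv: price_per_gb(*kv[1])):
    print(f"{name}: ${price_per_gb(price, vram):.2f}/GB")
```

On this metric alone, the 5060 Ti 16 GB beats the used 3090. The 3090's case rests on its larger pool and 936 GB/s of bandwidth, not on cost per gigabyte. The 4070 comes out worst of the four.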
$750–$1200 — the productivity tier
Best new card: RTX 5070 Ti 16 GB at $749 MSRP.
Wildcard: Mac Mini M4 Pro 64 GB (~$2,000 in this configuration, well above this tier's
ceiling; unified memory shifts the math).
The 5070 Ti hits a sweet spot at this tier: 16 GB of GDDR7 at 896 GB/s, roughly three times
the raw bandwidth of a 4060 Ti 16 GB with the same VRAM, which means much faster token
generation. It launched February 2025 at $749 and has stayed near MSRP at most retailers.
The Mac Mini wildcard deserves consideration. The M4 Pro supports up to 64 GB of unified
memory at 273 GB/s. That is less bandwidth than any NVIDIA card in the table above, but the
64 GB pool lets you run Llama 3.3 70B at Q4 entirely in memory, which on a 24 GB discrete
GPU requires offloading part of the model to system RAM. Per dollar, a Mac delivers fewer
tokens per second than NVIDIA hardware, but for hobbyists who want a silent, single-machine
setup, the math still works out.
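Here is what "slower" means in numbers. Single-stream decoding reads roughly all of the weights for every token, so memory bandwidth divided by model size bounds tokens/sec. This rough sketch assumes Llama 3.3 70B at Q4_K_M is about 42.5 GB (an approximate GGUF size) and applies the bound to the unified-memory chips in this guide's table.

```python
MODEL_GB = 42.5  # assumed size of Llama 3.3 70B at Q4_K_M

unified_memory_bw = {
    "M4 Pro": 273.0,    # GB/s, from the spec table
    "M4 Max": 546.0,
    "M3 Ultra": 819.0,
}

for chip, bw in unified_memory_bw.items():
    # Upper bound: one full pass over the weights per generated token.
    print(f"{chip}: <= {bw / MODEL_GB:.1f} tok/s on a 70B Q4 model")
```

Under these assumptions, the M4 Pro's ceiling is roughly a third of the ~20 tokens/sec mark this guide treats as responsive chat. It works, but it is not fast.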
What you can run at this tier:
- 70B Q3 / Q4 with offload (or natively on a 64 GB Mac)
- 32B Q4 at full speed
- Flux Dev image generation comfortably
- Light LoRA training on 7B models
Avoid at this tier:
- RTX 5070 12 GB ($549 MSRP) — same trap as the 4070. The 5070 Ti is worth the extra $200 for both more VRAM and more bandwidth.
- AMD RX 7900 XTX (24 GB VRAM at this price looks tempting) — [ROCm support in 20