Cross-posted from Best GPU for AI — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.
Quick answer: The RTX 4070 Ti Super (16GB) is the best GPU for most Stable Diffusion users. It has enough VRAM for SDXL and Flux, generates images fast, and doesn't carry a flagship price tag.
See the recommended pick on the original guide
What Stable Diffusion actually needs from a GPU
Stable Diffusion is a VRAM-hungry workload. Unlike gaming, where raw compute dominates, image generation performance scales directly with:
- VRAM — determines which models you can run and at what resolution
- Memory bandwidth — affects generation speed (how fast data moves, not just how much fits)
- CUDA cores — more cores = faster diffusion steps
- Architecture — newer architectures have better AI-specific tensor core optimizations
The single most common mistake is buying a GPU based on CUDA core count or price alone without checking VRAM. You can have the fastest GPU on paper and still be unable to run SDXL with ControlNet if you only have 8GB. For exact numbers by workflow, see our Stable Diffusion VRAM requirements guide.
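The fit check above can be sketched as a quick back-of-envelope helper. This is a rough heuristic, not a measurement tool — the 1GB headroom figure is an assumption for OS and driver overhead, and real peak usage depends on resolution, batch size, and attached models (see the scenario table later in this guide):

```python
def fits_in_vram(gpu_vram_gb: float, workflow_peak_gb: float,
                 headroom_gb: float = 1.0) -> bool:
    """Rough check: does a workflow's peak VRAM, plus assumed
    OS/driver headroom, fit on a given card?"""
    return workflow_peak_gb + headroom_gb <= gpu_vram_gb

# An 8GB card vs SDXL + ControlNet (~10GB+ peak): doesn't fit,
# regardless of how fast the GPU's compute is.
print(fits_in_vram(8.0, 10.0))    # False
# A 16GB card vs the same workflow: fits with room to spare.
print(fits_in_vram(16.0, 10.0))   # True
```

The point of the sketch is the asymmetry: compute shortfalls make you slower, but a VRAM shortfall makes the workflow fail outright.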
SD 1.5 vs SDXL vs Flux — VRAM comparison
The three main Stable Diffusion generations have very different VRAM requirements:
| Model | Minimum VRAM | Recommended | ControlNet overhead | LoRA training |
|---|---|---|---|---|
| SD 1.5 (512×512) | 4GB | 6–8GB | +1–2GB | 8GB |
| SD 1.5 (768×768) | 6GB | 8GB | +1–2GB | 8GB |
| SDXL (1024×1024) | 8GB | 12–16GB | +2–3GB per model | 12–16GB |
| Flux Schnell | 10GB | 12GB | +2GB | — |
| Flux Dev | 12GB | 16GB | +2–3GB | 16–24GB |
| Flux Dev (high-res 1.5K+) | 16GB | 24GB | +3–4GB | 24GB |
The jump from SD 1.5 to SDXL roughly doubles the VRAM requirement. Flux jumps it again. If you buy a GPU today and plan to stay current with new models, 16GB is the minimum worth buying new.
Generation speed benchmarks
How fast each GPU generates a single 1024×1024 image at 20 steps using DPM++ 2M sampler in ComfyUI:
| GPU | VRAM | SD 1.5 (512px) | SDXL (1024px) | Flux Dev (1024px) | Price |
|---|---|---|---|---|---|
| RTX 5090 | 32GB | ~2.0 s/img | ~3.5 s/img | ~5.5 s/img | ~$2,000+ |
| RTX 4090 | 24GB | ~3.0 s/img | ~5.5 s/img | ~8.0 s/img | ~$1,600 |
| RTX 5080 | 16GB | ~3.8 s/img | ~6.5 s/img | ~9.5 s/img | ~$1,000 |
| RTX 4070 Ti Super | 16GB | ~5.0 s/img | ~8.5 s/img | ~13 s/img | ~$700 |
| RTX 4060 Ti 16GB | 16GB | ~7.5 s/img | ~12 s/img | ~19 s/img | ~$400 |
| RTX 3060 12GB | 12GB | ~9.0 s/img | ~16 s/img | ~28 s/img | ~$250 used |
Times are approximate single-image benchmarks with xformers enabled. Real-world times vary by sampler, resolution, and system configuration.
The difference between an RTX 4060 Ti 16GB and an RTX 4090 for SDXL is roughly 2x in generation speed — which matters significantly when you're iterating on prompts 50+ times in a session.
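That iteration cost compounds over a session. A quick sketch of the arithmetic, using the approximate SDXL times from the benchmark table above (the per-image times are this guide's rough estimates, not your exact numbers):

```python
def session_minutes(seconds_per_image: float, iterations: int) -> float:
    """Total wall-clock time spent waiting on generation, in minutes."""
    return seconds_per_image * iterations / 60

# 50 prompt iterations on SDXL, approximate times from the table:
waiting_4060ti = session_minutes(12.0, 50)  # 10.0 minutes waiting
waiting_4090 = session_minutes(5.5, 50)     # ~4.6 minutes waiting
print(waiting_4060ti, round(waiting_4090, 1))
```

Over a long creative session, that gap is the difference between staying in flow and context-switching away between every render.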
RTX 4070 Ti Super — best for most users
The RTX 4070 Ti Super hits the sweet spot that no other card currently matches:
- 16GB VRAM runs SDXL, Flux Dev, and most ControlNet workflows without offloading
- Generation speed is fast enough for active creative iteration (8–9 seconds for SDXL)
- ~$700 price sits well below the 4090 and new RTX 5080
- Full support for ComfyUI, Automatic1111, and Forge
- Efficient power draw (~285W) compared to the 4090's 450W
- Handles LoRA training for SDXL with some batch size constraints
For hobbyists and semi-professional image generators who don't need to train custom models from scratch, this card handles every current model and workflow. If SDXL is your primary workflow, our dedicated GPU guide for SDXL covers SDXL-specific optimizations and budget picks in more detail.
See the recommended pick on the original guide
RTX 4090 — for power users and trainers
If you generate hundreds of images daily or run complex multi-ControlNet workflows, the RTX 4090 is worth the premium:
- 24GB VRAM means you never hit OOM errors with ControlNet stacks or IP-Adapters
- Roughly 1.5–1.6x faster than the 4070 Ti Super for SDXL and Flux (per the benchmark table above)
- Handles high-resolution upscaling (2K+) without tiling tricks
- Can run Dreambooth fine-tuning and full LoRA training with comfortable batch sizes
- Handles Flux Dev + ControlNet + IP-Adapter simultaneously in a single workflow
The 4090 makes sense if image generation is your primary GPU workload or you sell generated content commercially. It's overkill for casual use. If your focus is AI-assisted photo retouching and enhancement rather than generation, see our best GPU for AI photo editing guide.
See the recommended pick on the original guide
RTX 4060 Ti 16GB — best budget pick
At ~$400, the RTX 4060 Ti 16GB is the cheapest new card that handles SDXL and Flux Dev without constant memory-swapping:
- 16GB VRAM is enough for SDXL with ControlNet and Flux without extreme offloading
- Generation is slow — roughly 12 seconds for SDXL, 19 seconds for Flux
- Acceptable for users who generate occasionally (a few dozen images per session)
- Not ideal for LoRA training due to slow compute
The narrow memory bus (128-bit) limits bandwidth compared to higher-end cards, which is why generation times lag despite having equal VRAM to the 4070 Ti Super.
See the recommended pick on the original guide
Batch generation and ControlNet VRAM math
Single-image VRAM requirements are the baseline. Batch generation multiplies them:
| Scenario | VRAM needed (SDXL) |
|---|---|
| Single image, no extras | 8–10GB |
| Batch of 2 images | 12–14GB |
| Single + ControlNet (depth) | 10–13GB |
| Single + 2 ControlNets | 13–16GB |
| Single + ControlNet + IP-Adapter | 14–18GB |
| LoRA training (batch=4) | 14–16GB |
Running multiple ControlNets simultaneously — which professional workflows commonly do for pose, depth, and edge control — pushes well into 16GB territory. Flux + ControlNet + IP-Adapter reliably exceeds 16GB, which is where the 4090's 24GB genuinely matters.
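The additive pattern in the table can be turned into a rough estimator. The per-component figures below are illustrative midpoints of the ranges above, not measured values — treat the output as a planning estimate, not a guarantee:

```python
# Rough SDXL VRAM estimator based on the scenario table above.
# All per-component figures are assumed midpoints, not measurements.
BASE_SDXL_GB = 9.0       # single 1024x1024 image, no extras (8-10GB range)
PER_CONTROLNET_GB = 2.5  # +2-3GB per attached ControlNet model
IP_ADAPTER_GB = 2.0      # assumed IP-Adapter overhead

def estimate_sdxl_vram(controlnets: int = 0, ip_adapter: bool = False) -> float:
    """Estimated peak VRAM (GB) for an SDXL generation workflow."""
    total = BASE_SDXL_GB + controlnets * PER_CONTROLNET_GB
    if ip_adapter:
        total += IP_ADAPTER_GB
    return total

# Pose + depth + edge control (three ControlNets) already lands
# well past what a 16GB card can hold:
print(estimate_sdxl_vram(controlnets=3))                   # 16.5
print(estimate_sdxl_vram(controlnets=1, ip_adapter=True))  # 13.5
```

Even with midpoint assumptions, the three-ControlNet professional workflow overflows 16GB, which is exactly where the step up to a 24GB card stops being a luxury.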
What about AMD GPUs?
AMD GPUs can run Stable Diffusion through DirectML or ROCm, but the reality is consistently worse:
- Performance runs 30–50% slower than equivalent NVIDIA cards in most image generation benchmarks
- xformers, Flash Attention, and other critical optimizations are NVIDIA-only or require significant workarounds
- Community support overwhelmingly assumes NVIDIA — tutorials, troubleshooting guides, custom nodes
- ROCm works better on Linux than Windows, adding another variable for most users
Unless you already own an AMD card and want to experiment, buy NVIDIA for any serious Stable Diffusion work. For a detailed comparison, see our NVIDIA vs AMD for AI guide.
Optimization tips that actually matter
Regardless of which GPU you buy, these practices stretch your VRAM further:
- Use FP16/BF16 precision — halves VRAM usage versus FP32 with no visible quality difference in generated images
- Enable xformers or PyTorch SDP attention — reduces peak VRAM and speeds up generation significantly
- Use VAE tiling for high-resolution images on limited VRAM (1.5K+ on 12GB cards)
- Forge over Automatic1111 — significantly better VRAM management, especially for 16GB cards. Also consider InvokeAI if you want a polished creative UI with built-in canvas editing, or Fooocus for a no-knobs SDXL experience
- ComfyUI for complex workflows — gives you explicit control over model loading and unloading; if you are unsure which frontend suits your workflow, our Automatic1111 vs ComfyUI comparison breaks down the VRAM efficiency differences between the two
- FP8 quantization for Flux — cuts VRAM by ~25% with minimal visible quality loss. See our best quantization for Stable Diffusion guide for a full breakdown of precision formats and their quality-VRAM trade-offs
If you're running a 16GB card, applying all of these optimizations can get you close to 24GB behavior in many scenarios. If you are also evaluating Chroma, the newer Flux-based generation model, see our best GPU for Chroma AI guide.
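To see why the precision tips above matter, note that model weight memory scales linearly with bytes per parameter. A sketch of the arithmetic, assuming roughly 2.6B parameters for the SDXL UNet (an approximate figure — and weights are only part of the story, since activations, the VAE, and the text encoders add more on top):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Memory for model weights alone (GB), ignoring activations,
    VAE, and text encoders."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

SDXL_UNET_B = 2.6  # approximate SDXL UNet parameter count (assumption)

fp32 = weight_memory_gb(SDXL_UNET_B, 4)  # ~9.7 GB
fp16 = weight_memory_gb(SDXL_UNET_B, 2)  # ~4.8 GB
fp8  = weight_memory_gb(SDXL_UNET_B, 1)  # ~2.4 GB
print(round(fp32, 1), round(fp16, 1), round(fp8, 1))
```

Each halving of bytes per parameter halves the weight footprint, which is why FP16 is free VRAM versus FP32. Total VRAM savings from FP8 are smaller than the weight math suggests (closer to the ~25% figure above) because activations and other buffers don't shrink with the weights.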
GPU tier list available at the original article
Not ready to buy hardware? Try cloud GPU first
If you want to test workflows before committing to hardware, RunPod and Vast.ai let you rent RTX 4090s by the hour for under $0.50/hr. It's a practical way to figure out how much VRAM you actually need before spending $700+.
Which GPU should YOU buy for Stable Diffusion?
- Just getting started with SD 1.5? A used RTX 3060 12GB under $250 runs SD 1.5 and basic SDXL. Fine for learning, but you'll want to upgrade when you hit Flux. For a detailed answer on what the 3060 can and cannot do, see can the RTX 3060 run Stable Diffusion?
- Want to run SDXL and Flux comfortably without constant waiting? The RTX 4070 Ti Super at 16GB is the right card. Fast enough, enough VRAM, reasonable price.
- Heavily using ControlNet stacks or IP-Adapters? You need 24GB to prevent OOM errors. The RTX 4090 is the answer.
- Training custom models (Dreambooth, full LoRA)? Go RTX 4090. LoRA training runs on 16GB but larger batch sizes and faster iteration require 24GB.
- Budget is tight and you generate occasionally? The RTX 4060 Ti 16GB at $400 handles everything, just slower. Acceptable if you're patient.
Common mistakes to avoid
- Buying a GPU with only 8GB VRAM in 2026. SDXL and Flux are the current standard. 8GB forces heavy offloading that makes generation painfully slow and breaks many workflows entirely.
- Choosing AMD for Stable Diffusion. ROCm support for image generation lags significantly. You'll spend more time debugging than generating.
- Ignoring memory bandwidth. Two GPUs with identical VRAM can differ by nearly 2x in generation speed based purely on memory bandwidth. The gap between the RTX 4060 Ti 16GB and the 4070 Ti Super is almost entirely bandwidth.
- Skipping FP16 precision. Running at FP32 wastes half your VRAM for zero visible quality improvement.
- Assuming multi-GPU will help. Stable Diffusion does not benefit from multiple consumer GPUs. One fast card with lots of VRAM beats two slower cards.
Final verdict
| Budget | GPU | Best for |
|---|---|---|
| Under $300 | RTX 3060 12GB (used) | Learning, SD 1.5, basic SDXL |
| ~$400 | RTX 4060 Ti 16GB | Budget SDXL and Flux, slow but works |
| ~$700 | RTX 4070 Ti Super | Best overall — SDXL + Flux + ControlNet |
| ~$1,600 | RTX 4090 | Professional use, training, 24GB headroom |
| ~$2,000+ | RTX 5090 | Maximum speed, 32GB, future-proofed |
See the recommended pick on the original guide
For most people generating images as a hobby or side project, the RTX 4070 Ti Super handles every current model at useful speeds. Only step up to the 4090 if you need 24GB for ControlNet stacking or model training.
The best GPU for Stable Diffusion is the one with enough VRAM for your target model at a speed you can actually work with — neither too slow to iterate nor too expensive to justify.
Frequently Asked Questions
How much VRAM do you need for Stable Diffusion?
SD 1.5 runs on 6–8GB, SDXL needs 12–16GB, and Flux Dev requires 16GB minimum (24GB recommended with ControlNet). In 2026, 16GB is the minimum worth buying new if you want to stay current with the latest models. Each model generation roughly doubles the VRAM requirement, so buying 8GB today guarantees you will be upgrading soon.
Can you run Stable Diffusion on 8GB VRAM?
You can run SD 1.5 comfortably on 8GB VRAM, but SDXL and Flux — the current standard models — require heavy optimization hacks like tiled VAE and attention slicing on 8GB cards. Many workflows will fail with out-of-memory errors, and generation times increase 3–5x compared to 16GB cards. For 2026 workflows, 12GB is the practical minimum and 16GB is strongly recommended.
How much VRAM does Flux Schnell need?
Flux Schnell requires a minimum of 10GB VRAM and runs best with 12GB or more. With FP8 quantization you can squeeze it onto a 12GB card like the RTX 3060, but 16GB cards like the RTX 4060 Ti 16GB or RTX 4070 Ti Super provide a much more comfortable experience. Adding ControlNet to Flux Schnell pushes requirements to 14–16GB.
Is an AMD GPU good for Stable Diffusion?
AMD GPUs can technically run Stable Diffusion through DirectML or ROCm, but performance is 30–50% slower than equivalent NVIDIA cards. Critical optimizations like xformers and Flash Attention are NVIDIA-only, and community support overwhelmingly assumes CUDA. ROCm works better on Linux than Windows, adding another variable. Stick with NVIDIA for any serious Stable Diffusion work.
Related guides on Best GPU for AI
- Best GPU for AI Animation in 2026 (5 Picks Ranked)
- Best GPU for AI Art in 2026: Every Budget Compared
- Best GPU for DreamBooth Training in 2026 (Ranked)
The full version lives on Best GPU for AI — VRAM calculator, GPU comparison table, and live Amazon pricing.