This article was originally published on Best GPU for AI. The full version with interactive tools, FAQ, and live pricing is on the original site.
I have been running Stable Diffusion 3.5 locally since the Large checkpoint stabilised, and the buying advice from the SDXL era simply does not transfer. SD 3.5 Large is an 8B-parameter MMDiT model, Medium is a 2.6B sibling, and the May 2026 ControlNet release for Large finally made it usable for production work. None of that fits the old "12GB is enough" mental model.
So here is how I would spend my own money in 2026, ranked by which SD 3.5 variant you actually run.
Quick answer
If you only run SD 3.5 Large, buy the RTX 4070 Ti Super 16GB. It fits FP16 at batch 1 and lands around 12 seconds per 1024x1024 image, or about 10 seconds at FP8. FP16 inference already uses 14-16 GB, so plan on dropping to FP8 when you add a ControlNet pass. If you split your time between Large and Medium and want FP16 on both without thinking, get the RTX 5080 16GB. Anything below 16GB and you are quantising Large to FP8, which works, but it is a compromise.
See the recommended pick on the original guide
Who this is for
You are picking a GPU specifically for SD 3.5 — not a do-everything LLM box, not a Flux.2 rig. If you are torn between SD 3.5 and Flux.2 first, read my Flux.2 vs SD 3.5 hardware breakdown before this guide. And if you want the broader image-gen picture covering SD 1.5, SDXL and SD 3.5 together, the best GPU for Stable Diffusion round-up is the better starting point.
This piece assumes you have already decided: SD 3.5 Large or Medium, locally, in 2026.
VRAM tiers by variant and precision
The single most useful table in this whole article. SD 3.5 Large is roughly 2.5x SDXL's footprint at FP16, and SD 3.5 Medium is genuinely lightweight.
| Variant | Precision | Model weights | Inference VRAM (1024x1024, batch 1) |
|---|---|---|---|
| SD 3.5 Large (8B) | FP16 | ~14 GB | 14-16 GB |
| SD 3.5 Large (8B) | FP8 | ~7-8 GB | 9-11 GB |
| SD 3.5 Medium (2.6B) | FP16 | ~6 GB | 7-8 GB |
| SD 3.5 Medium (2.6B) | FP8 | ~3-4 GB | 5-6 GB |
Add 2-3 GB on top of the Large rows if you stack the new ControlNet. The May 2026 SD 3.5 Large ControlNet (Canny, Depth, Blur) is excellent, but it is not free. Full numbers in my how much VRAM for Stable Diffusion deep-dive.
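To make the table easier to apply, here is a minimal Python sketch that checks whether a given card is likely to fit a variant, using only the ranges quoted above. The function names (`vram_needed`, `fit`) and the "fits / tight / does not fit" thresholds are illustrative assumptions, not measured results.

```python
# Inference VRAM ranges in GB, copied from the table above (1024x1024, batch 1).
INFERENCE_VRAM_GB = {
    ("large", "fp16"): (14, 16),
    ("large", "fp8"): (9, 11),
    ("medium", "fp16"): (7, 8),
    ("medium", "fp8"): (5, 6),
}

# The article's 2-3 GB figure for stacking the SD 3.5 Large ControlNet.
CONTROLNET_EXTRA_GB = (2, 3)


def vram_needed(variant, precision, controlnet=False):
    """Return the (low, high) inference VRAM estimate in GB."""
    low, high = INFERENCE_VRAM_GB[(variant, precision)]
    if controlnet:
        if variant != "large":
            raise ValueError("the ControlNet figures apply to SD 3.5 Large only")
        low += CONTROLNET_EXTRA_GB[0]
        high += CONTROLNET_EXTRA_GB[1]
    return low, high


def fit(card_gb, variant, precision, controlnet=False):
    """Classify a card (assumed thresholds): the whole range must sit
    below the card's VRAM to count as a comfortable fit."""
    low, high = vram_needed(variant, precision, controlnet)
    if high < card_gb:
        return "fits"
    if low < card_gb:
        return "tight"
    return "does not fit"


if __name__ == "__main__":
    for controlnet in (False, True):
        print("16 GB, Large FP16, ControlNet =", controlnet, "->",
              fit(16, "large", "fp16", controlnet))
```

On these numbers a 16GB card is "tight" for Large FP16 and stops fitting once ControlNet is attached, which is why the headroom advice matters.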
VRAM chart available at the original article
GPU generation-time ranking
Numbers below are my own observed times for a 1024x1024, 28-step Euler generation in ComfyUI, batch 1, no ControlNet. Your mileage will swing 10-15% with sampler and scheduler choice.
| GPU | VRAM | SD 3.5 Large FP16 | SD 3.5 Large FP8 | SD 3.5 Medium FP16 | Street price |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB | ~5 s | ~4 s | ~2 s | ~$2,000 |
| RTX 4090 | 24 GB | ~8 s | ~6 s | ~3 s | ~$1,600 |
| RTX 5080 | 16 GB | ~9 s | ~7 s | ~3 s | ~$1,000 |
| RTX 5070 Ti | 16 GB | ~11 s | ~8 s | ~4 s | ~$750 |
| RTX 4070 Ti Super | 16 GB | ~12 s | ~10 s | ~4 s | ~$700 |
| RTX 3090 (used) | 24 GB | ~14 s | ~11 s | ~5 s | ~$700 |
| RTX 4060 Ti 16GB | 16 GB | OOM-prone | ~18 s | ~7 s | ~$400 |
| RTX 3060 12GB | 12 GB | OOM | ~24 s | ~9 s | ~$200 |
The 4060 Ti 16GB technically loads SD 3.5 Large FP16 but bandwidth-starvation makes it painful — closer to 25s per image and the moment you add ControlNet you OOM. Treat it as an FP8-only card for Large.
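Seconds per image are easier to compare once converted to throughput and set against price. The sketch below does that for the SD 3.5 Large FP16 column, using only the times and street prices from the table. The "dollars per image-per-hour" metric and the function names are my own illustrative choices, not a standard benchmark.

```python
# (seconds per image at SD 3.5 Large FP16, approximate street price in USD),
# copied from the table above. Cards listed as OOM or OOM-prone are left out.
LARGE_FP16 = {
    "RTX 5090": (5, 2000),
    "RTX 4090": (8, 1600),
    "RTX 5080": (9, 1000),
    "RTX 5070 Ti": (11, 750),
    "RTX 4070 Ti Super": (12, 700),
    "RTX 3090 (used)": (14, 700),
}


def images_per_hour(seconds_per_image):
    """Convert a per-image time into sustained batch-1 throughput."""
    return 3600 / seconds_per_image


def dollars_per_image_per_hour(seconds_per_image, price_usd):
    """Purchase price divided by throughput; lower is better value."""
    return price_usd / images_per_hour(seconds_per_image)


def rank_by_value(cards):
    """Return card names sorted from best to worst value."""
    return sorted(cards, key=lambda name: dollars_per_image_per_hour(*cards[name]))


if __name__ == "__main__":
    for name in rank_by_value(LARGE_FP16):
        secs, price = LARGE_FP16[name]
        print(f"{name:20s} {images_per_hour(secs):6.0f} img/h  "
              f"${dollars_per_image_per_hour(secs, price):.2f} per img/h")
```

This looks at raw speed per dollar only. It ignores VRAM headroom, batching and power draw, so treat it as one input alongside the tables rather than a verdict.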
Which GPU should YOU buy?
I keep getting variations of the same four scenarios. Here is the conditional logic.
- You run SD 3.5 Large daily, you stack ControlNet, you bill clients. Buy the RTX 5090. The 32GB lets you batch 2-4 images at FP16 with ControlNet attached, which is where the real productivity gain lives. With less VRAM, FP16 plus ControlNet keeps you at batch 1.
- You run SD 3.5 Large for fun or freelance, want FP16, do not need batching. Buy the RTX 5080 16GB. It is the cheapest card that still feels like a 4090 for this exact workload. Blackwell FP8 acceleration also future-proofs you for whatever ships next.
- You are budget-bound but want SD 3.5 Large at acceptable speed. Buy the RTX 4070 Ti Super 16GB new or RTX 3090 24GB used. The 4070 Ti Super is faster per generation; the 3090 gives you 24GB for batching at the cost of more power draw and less FP8 efficiency. I lean 4070 Ti Super for new buyers, 3090 only if you find one under $650.
- You mostly run SD 3.5 Medium and only dabble in Large. Buy the RTX 4060 Ti 16GB. Medium FP16 cruises, Large FP8 is tolerable, and you save enough to upgrade in two years.
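The four scenarios above can be written down as a small decision function. This is only a sketch of the logic in the list: the function name `pick_gpu` and its parameter names are hypothetical, chosen for illustration.

```python
def pick_gpu(main_variant, batches_with_controlnet=False,
             budget_bound=False, wants_24gb=False):
    """Map the four buying scenarios to a card.

    main_variant: "large" or "medium" (the variant you mostly run).
    batches_with_controlnet: daily Large work with ControlNet and batching.
    budget_bound: you want Large at acceptable speed on a tight budget.
    wants_24gb: on a budget, you prefer 24GB for batching over speed.
    """
    if main_variant == "medium":
        # Mostly Medium, only dabbling in Large.
        return "RTX 4060 Ti 16GB"
    if main_variant != "large":
        raise ValueError("main_variant must be 'large' or 'medium'")
    if batches_with_controlnet:
        # Client work: batch at FP16 with ControlNet attached.
        return "RTX 5090"
    if budget_bound:
        # New 4070 Ti Super by default; used 3090 only if 24GB matters.
        return "RTX 3090 24GB (used)" if wants_24gb else "RTX 4070 Ti Super 16GB"
    # Large at FP16 for fun or freelance, no batching.
    return "RTX 5080 16GB"


if __name__ == "__main__":
    print(pick_gpu("large", budget_bound=True))
    print(pick_gpu("medium"))
```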
Pair whichever you pick with a workflow you actually like — my best GPU for ComfyUI notes explain why I think ComfyUI is the right SD 3.5 frontend, especially with the May ControlNet drop covered in best GPU for ControlNet.
A contrarian take: the RTX 3090 is overrated for SD 3.5
Everyone in the Reddit threads is still recommending used 3090s. I do not agree, not for SD 3.5 specifically. Here is why:
- No FP8 acceleration. SD 3.5's FP8 quantisation is one of the best things about it. Ampere has no FP8 tensor cores, so the 3090 stores FP8 weights but upcasts them for compute, losing most of the speed-up. A 5070 Ti at FP8 is genuinely faster than a 3090 at FP8 (~8 s versus ~11 s in the table above).
- Power draw. 350W TDP versus 300W for the 5070 Ti (285W for the 4070 Ti Super). Over a year of daily generation that is a real electricity bill difference.
- No warranty. Many used 3090s are ex-mining cards, and the thermal pads are often cooked.
The 3090's only honest advantage for SD 3.5 is the 24GB for batching at FP16. If you do not batch, you are paying a power-and-risk premium for nothing.
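How much the power gap costs depends on how many hours you generate and what you pay per kWh. A rough sketch of the arithmetic follows. The hours per day and electricity price in the example are placeholder assumptions, and rated board power is an upper bound rather than measured draw during generation, so substitute your own figures.

```python
def annual_cost_delta(watts_a, watts_b, hours_per_day, usd_per_kwh, days=365):
    """Yearly electricity cost difference (USD) between two cards,
    assuming both run at the given wattage for the same hours."""
    extra_kwh = (watts_a - watts_b) * hours_per_day * days / 1000
    return extra_kwh * usd_per_kwh


if __name__ == "__main__":
    # 350 W is the RTX 3090's rated board power; 300 W is NVIDIA's rated
    # figure for the RTX 5070 Ti. The 4 h/day and $0.15/kWh values are
    # assumptions for illustration only.
    delta = annual_cost_delta(350, 300, hours_per_day=4, usd_per_kwh=0.15)
    print(f"Extra cost per year: ${delta:.2f}")
```

Heavier daily use or higher local electricity prices scale the result linearly, so it is worth running with your own numbers before weighing it against the purchase price.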
Common SD 3.5 mistakes
- Buying a 12GB card "because SDXL ran fine on 12GB" — SD 3.5 Large will not. You will spend your first weekend quantising to FP8 and wondering why outputs look slightly worse.
- Skipping FP8 because "it loses quality" — at SD 3.5 Large's scale the FP8 quality loss is genuinely small and the speed-up is large. Test it before dismissing it.
- Forgetting the new ControlNet adds VRAM — the May 2026 SD 3.5 Large ControlNet release stacks 2-3 GB on top of base inference. Plan VRAM headroom around ControlNet, not raw inference.
- Treating SD 3.5 Medium as a downgrade — Medium is genuinely good for iteration, especially for LoRA training pipelines where you generate hundreds of test images. A 4060 Ti 16GB running Medium FP16 is faster end-to-end than a 4090 running Large FP16.
Final verdict
| Tier | GPU | Why |
|---|---|---|
| Top pick | RTX 5090 | Only card that batches SD 3.5 Large FP16 + ControlNet |
| Best value | RTX 4070 Ti Super 16GB | SD 3.5 Large FP16 cleared, around 12s per image, ~$700 |
| All-rounder | RTX 5080 16GB | FP8 acceleration, future-proofed, fits both variants |
| Budget Medium | RTX 4060 Ti 16GB | Medium FP16 cruises, Large FP8 tolerable |
| Skip | RTX 3060 12GB | Large FP16 OOMs and Large FP8 is slow (~24 s); step up to a 16GB card |
If you only remember one thing: buy 16GB minimum for SD 3.5 Large, and do not let anyone talk you into a used 3090 unless the price is genuinely under $650.
Related guides on Best GPU for AI
- Best GPU for ControlNet in 2026: 5 Cards (16GB Sweet Spot)
- Best GPU for Flux in 2026: 7 Cards Ranked (From $249)
- Best GPU for Flux.2 in 2026: 5 Cards Ranked (FP8 Ready)