Thurmon Demich

Posted on Jun 19 • Originally published at bestgpuforai.com

Best GPU for SD 3.5 in 2026: 5 Cards (Large + Medium)



I have been running Stable Diffusion 3.5 locally since the Large checkpoint stabilised, and the buying advice from the SDXL era simply does not transfer. SD 3.5 Large is an 8B-parameter MMDiT model, Medium is a 2.6B sibling, and the May 2026 ControlNet release for Large finally made it usable for production work. None of that fits cleanly into the old "12GB is enough" mental model.

So here is how I would spend my own money in 2026, ranked by which SD 3.5 variant you actually run.

Quick answer

If you only run SD 3.5 Large, buy the RTX 4070 Ti Super 16GB. It fits FP16 at batch 1 and lands around 12 seconds per 1024x1024 image, although adding a ControlNet pass at FP16 is tight on any 16GB card. If you split your time between Large and Medium and want FP16 everywhere without thinking, get the RTX 5080 16GB. Below 16GB you are quantising Large to FP8, which works but is a compromise.

Who this is for

You are picking a GPU specifically for SD 3.5 — not a do-everything LLM box, not a Flux.2 rig. If you are torn between SD 3.5 and Flux.2 first, read my Flux.2 vs SD 3.5 hardware breakdown before this guide. And if you want the broader image-gen picture covering SD 1.5, SDXL and SD 3.5 together, the best GPU for Stable Diffusion round-up is the better starting point.

This piece assumes you have already decided: SD 3.5 Large or Medium, locally, in 2026.

VRAM tiers by variant and precision

This is the single most useful table in the article. SD 3.5 Large has roughly 2.5x SDXL's footprint at FP16, while SD 3.5 Medium is genuinely lightweight.

Variant	Precision	Model weights	Inference VRAM (1024x1024, batch 1)
SD 3.5 Large (8B)	FP16	~14 GB	14-16 GB
SD 3.5 Large (8B)	FP8	~7-8 GB	9-11 GB
SD 3.5 Medium (2.6B)	FP16	~6 GB	7-8 GB
SD 3.5 Medium (2.6B)	FP8	~3-4 GB	5-6 GB

Add 2-3 GB on top of every row if you stack the new ControlNet — the May 2026 SD 3.5 Large ControlNet (Canny, Depth, Blur) is excellent but it is not free. Full numbers in my how much VRAM for Stable Diffusion deep-dive.
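To make the table easier to apply to a specific card, here is a rough Python sketch of a fit check. The inference ranges and the 2-3 GB ControlNet overhead are copied from this section; the function names and the "fits / tight / oom" labels are my own illustrative choices, and the check ignores whatever VRAM your desktop and other apps already hold.

```python
# Peak inference VRAM in GB at 1024x1024, batch 1, copied from the table above.
INFERENCE_VRAM_GB = {
    ("large", "fp16"): (14, 16),
    ("large", "fp8"): (9, 11),
    ("medium", "fp16"): (7, 8),
    ("medium", "fp8"): (5, 6),
}
# ControlNet overhead quoted in the article (low, high), in GB.
CONTROLNET_EXTRA_GB = (2, 3)


def vram_range(variant, precision, controlnet=False):
    """Return the (low, high) VRAM estimate in GB for one configuration."""
    low, high = INFERENCE_VRAM_GB[(variant, precision)]
    if controlnet:
        low += CONTROLNET_EXTRA_GB[0]
        high += CONTROLNET_EXTRA_GB[1]
    return low, high


def fit(card_gb, variant, precision, controlnet=False):
    """Classify a card: 'fits' covers the high estimate, 'tight' only the low one."""
    low, high = vram_range(variant, precision, controlnet)
    if card_gb >= high:
        return "fits"
    if card_gb >= low:
        return "tight"
    return "oom"


if __name__ == "__main__":
    for card_gb in (12, 16, 24, 32):
        print(card_gb, "GB:", fit(card_gb, "large", "fp16", controlnet=True))
```

Treat "tight" as a warning rather than a pass: it means the card only clears the optimistic end of the range.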

VRAM chart available at the original article

GPU generation-time ranking

Numbers below are my own observed times for a 1024x1024, 28-step Euler generation in ComfyUI at batch 1 with no ControlNet. Expect a 10-15% swing depending on sampler and scheduler.

GPU	VRAM	SD 3.5 Large FP16	SD 3.5 Large FP8	SD 3.5 Medium FP16	Street price
RTX 5090	32 GB	~5 s	~4 s	~2 s	~$2,000
RTX 4090	24 GB	~8 s	~6 s	~3 s	~$1,600
RTX 5080	16 GB	~9 s	~7 s	~3 s	~$1,000
RTX 5070 Ti	16 GB	~11 s	~8 s	~4 s	~$750
RTX 4070 Ti Super	16 GB	~12 s	~10 s	~4 s	~$700
RTX 3090 (used)	24 GB	~14 s	~11 s	~5 s	~$700
RTX 4060 Ti 16GB	16 GB	OOM-prone	~18 s	~7 s	~$400
RTX 3060 12GB	12 GB	OOM	~24 s	~9 s	~$200

The 4060 Ti 16GB technically loads SD 3.5 Large FP16 but bandwidth-starvation makes it painful — closer to 25s per image and the moment you add ControlNet you OOM. Treat it as an FP8-only card for Large.
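If you want to turn the table into a value ranking, one simple approach is dollars per image-per-hour at Large FP16. The sketch below uses only the times and street prices from the table; the metric and function names are my own assumptions, and the two cards listed as OOM or OOM-prone at FP16 are left out. It ignores VRAM headroom, power and resale value, so read it as one lens rather than a verdict.

```python
# (name, seconds per SD 3.5 Large FP16 image, street price in USD), from the table above.
CARDS = [
    ("RTX 5090", 5, 2000),
    ("RTX 4090", 8, 1600),
    ("RTX 5080", 9, 1000),
    ("RTX 5070 Ti", 11, 750),
    ("RTX 4070 Ti Super", 12, 700),
    ("RTX 3090 (used)", 14, 700),
]


def images_per_hour(seconds_per_image):
    """Throughput at batch 1, assuming back-to-back generations."""
    return 3600 / seconds_per_image


def dollars_per_hourly_image(seconds_per_image, price_usd):
    """Purchase price divided by throughput; lower is better value."""
    return price_usd / images_per_hour(seconds_per_image)


# Sort cheapest-per-throughput first.
ranked = sorted(CARDS, key=lambda card: dollars_per_hourly_image(card[1], card[2]))

for name, secs, price in ranked:
    value = dollars_per_hourly_image(secs, price)
    print(f"{name:20s} {images_per_hour(secs):5.0f} img/h  ${value:.2f} per img/h")
```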

Which GPU should YOU buy?

I keep getting variations of the same four scenarios. Here is the conditional logic.

You run SD 3.5 Large daily, you stack ControlNet, you bill clients. Buy the RTX 5090. The 32GB lets you batch 2-4 images at FP16 with ControlNet attached, which is where the real productivity gain lives. Anything less and you are single-image-batching forever.
You run SD 3.5 Large for fun or freelance, want FP16, do not need batching. Buy the RTX 5080 16GB. It is the cheapest card that still feels like a 4090 for this exact workload. Blackwell FP8 acceleration also future-proofs you for whatever ships next.
You are budget-bound but want SD 3.5 Large at acceptable speed. Buy the RTX 4070 Ti Super 16GB new or RTX 3090 24GB used. The 4070 Ti Super is faster per generation; the 3090 gives you 24GB for batching at the cost of more power draw and less FP8 efficiency. I lean 4070 Ti Super for new buyers, 3090 only if you find one under $650.
You mostly run SD 3.5 Medium and only dabble in Large. Buy the RTX 4060 Ti 16GB. Medium FP16 cruises, Large FP8 is tolerable, and you save enough to upgrade in two years.
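The four scenarios above can be written down as a small decision function. This is only a sketch of my own recommendations: the parameter names are hypothetical, and "needs_batching" stands in for the daily, ControlNet-stacking, client-billing case.

```python
def pick_gpu(main_variant, needs_batching=False, budget_bound=False):
    """Map a usage scenario to the card recommended in this article.

    main_variant: "large" or "medium" (the SD 3.5 variant you mostly run).
    needs_batching: True if you batch Large at FP16 with ControlNet attached.
    budget_bound: True if price matters more than per-image speed.
    """
    if main_variant == "medium":
        # Mostly Medium, only dabbling in Large.
        return "RTX 4060 Ti 16GB"
    if main_variant != "large":
        raise ValueError("main_variant must be 'large' or 'medium'")
    if needs_batching:
        # Daily production work with ControlNet and batches of 2-4.
        return "RTX 5090"
    if budget_bound:
        # New-buyer default, with the used 3090 as the conditional alternative.
        return "RTX 4070 Ti Super 16GB (or a used RTX 3090 24GB under $650)"
    # Large at FP16 for fun or freelance, no batching.
    return "RTX 5080 16GB"


if __name__ == "__main__":
    print(pick_gpu("large", needs_batching=True))
    print(pick_gpu("large", budget_bound=True))
    print(pick_gpu("medium"))
```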

Pair whichever you pick with a workflow you actually like. My best GPU for ComfyUI notes explain why I think ComfyUI is the right SD 3.5 frontend, and best GPU for ControlNet covers the May ControlNet drop.

A contrarian take: the RTX 3090 is overrated for SD 3.5

Everyone in the Reddit threads is still recommending used 3090s. I do not agree, not for SD 3.5 specifically. Here is why:

No FP8 acceleration. SD 3.5's FP8 quantisation is one of the best things about it. The 3090 runs FP8 via emulation, losing most of the speed-up. A 5070 Ti at FP8 is genuinely faster than a 3090 at FP8.
Power draw. 350W TDP versus ~300W for the 5070 Ti. Over a year of daily generation that is a real electricity bill difference.
No warranty. Most used 3090s are mining survivors. The thermal pads are cooked.

The 3090's only honest advantage for SD 3.5 is the 24GB for batching at FP16. If you do not batch, you are paying a power-and-risk premium for nothing.
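To put your own number on the power-draw point, the arithmetic is just extra watts times hours times your electricity rate. The sketch below is a back-of-the-envelope helper with a hypothetical function name; the wattage gap, hours per day and price per kWh in the example call are placeholder assumptions, and rated TDP is an upper bound rather than measured draw during generation.

```python
def annual_cost_of_extra_watts(extra_watts, hours_per_day, usd_per_kwh):
    """Yearly cost in USD of drawing `extra_watts` more while generating."""
    extra_kw = extra_watts / 1000          # watts to kilowatts
    kwh_per_year = extra_kw * hours_per_day * 365
    return kwh_per_year * usd_per_kwh


if __name__ == "__main__":
    # Placeholder inputs: swap in your own gap, daily hours and local rate.
    cost = annual_cost_of_extra_watts(extra_watts=50, hours_per_day=4, usd_per_kwh=0.15)
    print(f"Extra cost per year: ${cost:.2f}")
```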

Common SD 3.5 mistakes

Buying a 12GB card "because SDXL ran fine on 12GB" — SD 3.5 Large will not. You will spend your first weekend quantising to FP8 and wondering why outputs look slightly worse.
Skipping FP8 because "it loses quality" — at SD 3.5 Large's scale the FP8 quality loss is genuinely small and the speed-up is large. Test it before dismissing it.
Forgetting the new ControlNet adds VRAM — the May 2026 SD 3.5 Large ControlNet release stacks 2-3 GB on top of base inference. Plan VRAM headroom around ControlNet, not raw inference.
Treating SD 3.5 Medium as a downgrade — Medium is genuinely good for iteration, especially for LoRA training pipelines where you generate hundreds of test images. A 4060 Ti 16GB running Medium FP16 is faster end-to-end than a 4090 running Large FP16.

Final verdict

Tier	GPU	Why
Top pick	RTX 5090	Only card that batches SD 3.5 Large FP16 + ControlNet
Best value	RTX 4070 Ti Super 16GB	SD 3.5 Large FP16 cleared, around 12s per image, ~$700
All-rounder	RTX 5080 16GB	FP8 acceleration, future-proofed, fits both variants
Budget Medium	RTX 4060 Ti 16GB	Medium FP16 cruises, Large FP8 tolerable
Skip	RTX 3060 12GB	Large FP16 OOMs, Large FP8 crawls at ~24 s; step up to a 16GB card instead

If you only remember one thing: buy 16GB minimum for SD 3.5 Large, and do not let anyone talk you into a used 3090 unless the price is genuinely under $650.

