Flux vs SDXL vs SD 1.5: Real Cost-per-Image Across GPUs (2026)

This article was originally published on runaihome.com

Three generations of image models now live in a typical ComfyUI installation (Windows users: see our ComfyUI Windows setup guide), and the choice between them isn't obvious. SD 1.5 still commands the deepest fine-tune ecosystem ever built around a single model. SDXL is the default backbone for most home-lab artists. Flux.1 produces images that read as professional photography: it handles human hands, readable in-image text, and complex lighting in ways that SD 1.5 and SDXL can't reliably match.

The tradeoff is hardware. Flux requires 12–24 GB VRAM and, on the same GPU, takes roughly 3–4× longer per image than SDXL at FP8 and 7× or more at FP16. Whether that matters depends on how many images you generate per session and what GPU you're running. This article quantifies those costs: generation times across two GPU tiers, converted into dollar-per-image electricity costs at the current US average of $0.182/kWh (EIA 2026 forecast), and a cloud comparison that tells you when a $30/month Midjourney subscription is still the smarter call.

The Three Models at a Glance

Model	Architecture	Parameters	Native Resolution	VRAM (FP16)	VRAM (FP8/GGUF)
SD 1.5	U-Net	860M	512×512	4–6 GB	—
SDXL 1.0	U-Net (dual text encoders)	3.5B	1024×1024	8–12 GB	—
Flux.1 Dev	DiT transformer	12B	1024×1024	~24 GB	12–14 GB
Flux.1 Schnell	DiT transformer	12B	1024×1024	~24 GB	12–14 GB

SD 1.5 and SDXL use U-Net architectures: compact, fast, designed for iterative denoising. Flux uses a Diffusion Transformer (DiT) architecture at 12 billion parameters. The quality jump is observable and consistent: Flux produces legible text in generated images, renders human anatomy with significantly fewer errors, and handles complex multi-element compositions more coherently. SDXL cannot do any of these reliably.
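The VRAM columns in the table above follow largely from parameter count. Here is a rough sketch of that arithmetic, assuming weights dominate memory and counting 1 GB as 10^9 bytes. The function name `weight_vram_gb` is ours for illustration and does not come from any library.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only memory in GB (10^9 bytes).

    Assumption: ignores activations, the VAE, text encoders, and framework
    overhead, so real usage will be higher than this number.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts taken from the table above (in billions).
MODELS = {"SD 1.5": 0.86, "SDXL 1.0": 3.5, "Flux.1 Dev": 12.0}

for name, params in MODELS.items():
    fp16 = weight_vram_gb(params, 16)
    fp8 = weight_vram_gb(params, 8)
    print(f"{name}: FP16 ~{fp16:.1f} GB, FP8 ~{fp8:.1f} GB (weights only)")
```

For Flux this gives about 24 GB at FP16 and 12 GB at FP8, in line with the table. The SD 1.5 and SDXL ranges in the table sit above the weights-only figure, presumably because they also cover activations and the auxiliary models.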

Dev vs Schnell: Flux.1 Schnell uses knowledge distillation to produce usable images in 4 steps instead of the 20+ steps Flux Dev requires. Schnell is Apache 2.0 licensed; Dev carries a non-commercial research restriction. For personal home use, either is legally fine. Schnell is faster, but most users running quality-critical work prefer Dev at 20 steps for the added detail — especially for photorealistic subjects.

Raw Speed: Verified Benchmarks

The SDXL numbers below come from ComfyUI's public benchmark thread (Discussion #2970), which aggregated community-submitted hardware results for SDXL 1.0 at 1024×1024, 20 steps in ComfyUI. The Flux.1 Dev numbers come from ComfyUI Discussion #4571 (RTX 4090 Flux benchmarks, multiple contributors). SD 1.5 timings are derived from Automatic1111 community benchmarks; the 4090 vs 3090 ratio is confirmed by Tom's Hardware testing.

SDXL at 1024×1024, 20 steps

GPU	it/s	Sec/Image
RTX 3070 8 GB	2.26	8.8 s
RTX 3090 24 GB	3.61	5.5 s
RTX 4080 16 GB	3.53	5.7 s
RTX 4090 24 GB	7.61	2.6 s

The RTX 3090 and RTX 4080 16GB land within 3% of each other on SDXL — roughly equal inference speed despite the VRAM difference. The RTX 4090 pulls ~2× ahead.
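The Sec/Image column is just step count divided by iterations per second. A minimal sketch using the it/s values from the table above follows; it covers sampling time only, and the assumption is that model loading and VAE decode are excluded from the community figures.

```python
# it/s figures from the SDXL table above (1024x1024, 20 steps).
STEPS = 20
SDXL_ITS = {
    "RTX 3070 8 GB": 2.26,
    "RTX 3090 24 GB": 3.61,
    "RTX 4080 16 GB": 3.53,
    "RTX 4090 24 GB": 7.61,
}


def seconds_per_image(steps: int, its_per_sec: float) -> float:
    """Sampling time only: steps divided by iterations per second."""
    return steps / its_per_sec


for gpu, its in SDXL_ITS.items():
    print(f"{gpu}: {seconds_per_image(STEPS, its):.1f} s/image")

# Relative comparisons quoted in the text.
speedup_4090_vs_3090 = SDXL_ITS["RTX 4090 24 GB"] / SDXL_ITS["RTX 3090 24 GB"]
gap_3090_vs_4080 = SDXL_ITS["RTX 3090 24 GB"] / SDXL_ITS["RTX 4080 16 GB"] - 1
print(f"4090 vs 3090: {speedup_4090_vs_3090:.2f}x")
print(f"3090 vs 4080 gap: {gap_3090_vs_4080:.1%}")
```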

Flux.1 Dev at 1024×1024, 20 steps

GPU	Precision	Sec/Image
RTX 4090	FP8 + `--fast`	9–10 s
RTX 4090	Q8 GGUF	15–17 s
RTX 4090	FP16 full	18–41 s
RTX 3090	FP8	~14–18 s

The FP16 time for RTX 4090 varies widely (18–41 s) depending on whether `torch.compile` is active and whether VRAM pressure forces any CPU offloading. FP8 with `--fast` is the practical default on 24 GB cards: it fits cleanly, the quality delta from FP16 is undetectable at normal viewing distances, and the 9–10 second generation time is genuinely workflow-usable.

The RTX 3090 FP8 estimate (~14–18 s) is derived from community reports of the 3090 running approximately 40–45% slower than the 4090 per iteration, consistent with multiple benchmark sources.
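The scaling behind that estimate can be sketched as follows, assuming "40–45% slower" means 40–45% lower per-iteration throughput. That is our reading, chosen because it lands closest to the quoted range. The alternative reading (40–45% more time per image) is included for comparison, and both helper names are hypothetical.

```python
def time_if_throughput_lower(t_ref: float, deficit: float) -> float:
    """Time per image if throughput is `deficit` (e.g. 0.40) lower than the reference."""
    return t_ref / (1.0 - deficit)


def time_if_duration_longer(t_ref: float, extra: float) -> float:
    """Time per image if each image simply takes `extra` (e.g. 0.40) longer."""
    return t_ref * (1.0 + extra)


# RTX 4090 FP8 + --fast range from the Flux.1 Dev table (seconds per image).
RTX_4090_FP8 = (9.0, 10.0)

low = time_if_throughput_lower(RTX_4090_FP8[0], 0.40)
high = time_if_throughput_lower(RTX_4090_FP8[1], 0.45)
print(f"Throughput reading: {low:.1f}-{high:.1f} s")

alt_low = time_if_duration_longer(RTX_4090_FP8[0], 0.40)
alt_high = time_if_duration_longer(RTX_4090_FP8[1], 0.45)
print(f"Duration reading:   {alt_low:.1f}-{alt_high:.1f} s")
```

The throughput reading gives roughly 15–18 s, close to the ~14–18 s quoted above. The duration reading gives a shorter 12.6–14.5 s.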

Flux.1 Schnell at 1024×1024, 4 steps

GPU	Precision	Sec/Image
RTX 4090	FP8	~4–5 s
RTX 3090	FP8	~6–8 s

At 4–5 s, Schnell at 4 steps lands within about 2× of SDXL at 20 steps (2.6 s) in pure generation time on the RTX 4090. Quality isn't SDXL-equivalent: it's better in photorealism, weaker in fine-grained compositional control where SDXL's ecosystem of refined samplers and CFG schedules still has an edge. For prompt-exploration workflows where you're running 50+ generations to find the right composition, Schnell makes Flux economically viable on a 3090.

SD 1.5 at 512×512, 50 steps

GPU	it/s	Sec/Image
RTX 4090	~37.6	~1.3 s
RTX 3090	~18.8	~2.7 s

SD 1.5's native resolution is 512×512. At that resolution and 50 steps, the RTX 4090 generates roughly 46 images per minute. The gap over SDXL and Flux in raw throughput is dramatic. For workflows that require hundreds of iterations — LoRA testing, prompt engineering sessions, batch rendering concept grids — SD 1.5's speed advantage is real and meaningful.
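To put the throughput gap in session terms, here is a small sketch that converts the RTX 4090 seconds-per-image figures from the tables above into images per minute and into wall-clock time for a batch. The 500-image batch size is an illustrative assumption, not a figure from the benchmarks.

```python
# Seconds per image on the RTX 4090, taken from the tables above.
SEC_PER_IMAGE_4090 = {
    "SD 1.5 (512x512, 50 steps)": 1.3,
    "SDXL (1024x1024, 20 steps)": 2.6,
    "Flux.1 Dev FP8 (1024x1024, 20 steps)": 10.0,
}


def images_per_minute(sec_per_image: float) -> float:
    """Steady-state throughput, ignoring load and decode overhead."""
    return 60.0 / sec_per_image


def batch_minutes(n_images: int, sec_per_image: float) -> float:
    """Wall-clock minutes to generate n_images back to back."""
    return n_images * sec_per_image / 60.0


BATCH = 500  # hypothetical iteration-heavy session
for name, sec in SEC_PER_IMAGE_4090.items():
    print(f"{name}: {images_per_minute(sec):.1f} img/min, "
          f"{batch_minutes(BATCH, sec):.1f} min for {BATCH} images")
```

Note that the SD 1.5 row is at 512×512, a quarter of the pixel count of the other two rows.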

The Electricity Math

At $0.182/kWh (US residential average, EIA 2026 forecast) and official NVIDIA TDPs (RTX 4090: 450W, RTX 3090: 350W):

Formula: cost = (seconds/image × 1000 images ÷ 3600) × (TDP_kW) × ($/kWh)

Model	GPU	Sec/Image	TDP	Cost / 1,000 Images
SD 1.5	RTX 4090	1.3 s	450 W	$0.030
SD 1.5	RTX 3090	2.7 s	350 W	$0.048
SDXL	RTX 4090	2.6 s	450 W	$0.060
SDXL	RTX 3090	5.5 s	350 W	$0.097
Flux Schnell	RTX 4090	4.5 s	450 W	$0.102
Flux Schnell	RTX 3090	7.0 s	350 W	$0.124
Flux Dev (FP8)	RTX 4090	10 s	450 W	$0.228
Flux Dev (FP8)	RTX 3090	16 s	350 W	$0.283
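The formula above can be checked with a few lines of Python. This is a sketch under the article's own assumptions: the GPU draws its full TDP for the whole run, and the rest of the system is ignored. The function name `cost_per_1000` is ours.

```python
PRICE_PER_KWH = 0.182  # US residential average used in this article


def cost_per_1000(sec_per_image: float, tdp_watts: float,
                  price_per_kwh: float = PRICE_PER_KWH) -> float:
    """Electricity cost in dollars for 1,000 images at full TDP."""
    hours = sec_per_image * 1000 / 3600
    return hours * (tdp_watts / 1000) * price_per_kwh


# (model, GPU, seconds per image, TDP in watts) from the table above.
ROWS = [
    ("SD 1.5", "RTX 4090", 1.3, 450),
    ("SD 1.5", "RTX 3090", 2.7, 350),
    ("SDXL", "RTX 4090", 2.6, 450),
    ("SDXL", "RTX 3090", 5.5, 350),
    ("Flux Schnell", "RTX 4090", 4.5, 450),
    ("Flux Schnell", "RTX 3090", 7.0, 350),
    ("Flux Dev (FP8)", "RTX 4090", 10.0, 450),
    ("Flux Dev (FP8)", "RTX 3090", 16.0, 350),
]

for model, gpu, sec, tdp in ROWS:
    print(f"{model:15s} {gpu}: ${cost_per_1000(sec, tdp):.4f} per 1,000 images")
```

Results should agree with the table to within about a tenth of a cent. Small differences can arise where the table was computed from unrounded timings (for example 20 ÷ 7.61 ≈ 2.63 s rather than 2.6 s for SDXL on the 4090).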

Three things stand out:

1. Electricity is not the cost driver — hardware is. Even running Flux Dev on an RTX 3090 at full throughput 24/7 for a month produces roughly 162,000 images and costs about $46 in electricity. The GPU purchase is always the dominant number.

2. Flux Schnell on an RTX 4090 costs roughly the same electricity per image as SDXL on an RTX 3090. Schnell on the 4090 finishes in about 4.5 s versus 5.5 s for SDXL on the 3090, and the shorter run largely cancels out the 4090's higher TDP (450 W vs 350 W).

3. The gap from SDXL to Flux Dev is real. At 10 seconds per image versus 2.6 seconds, Flux Dev takes 3.8× longer on the same 4090, which translates to 3.8× the electricity cost. For 10,000 images monthly, that's $2.28 vs $0.60 in electricity, a difference of $1.68 a month or roughly $20 a year.
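The figures in the three points above can be reproduced with the sketch below. It assumes a 30-day month, constant draw at TDP, and the article's $0.182/kWh rate; the helper names are hypothetical.

```python
PRICE_PER_KWH = 0.182
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_24_7(sec_per_image: float, tdp_watts: float,
                 price_per_kwh: float = PRICE_PER_KWH) -> tuple[float, float]:
    """Return (images generated, electricity cost in dollars) for a month of nonstop use."""
    images = SECONDS_PER_MONTH / sec_per_image
    kwh = (SECONDS_PER_MONTH / 3600) * (tdp_watts / 1000)
    return images, kwh * price_per_kwh


def joules_per_image(sec_per_image: float, tdp_watts: float) -> float:
    """Energy per image at full TDP (watts times seconds)."""
    return sec_per_image * tdp_watts


# Point 1: Flux Dev FP8 on an RTX 3090, running all month.
images, cost = monthly_24_7(16.0, 350)
print(f"Flux Dev on 3090, 24/7: {images:,.0f} images, ${cost:.2f}")

# Point 2: Schnell on the 4090 vs SDXL on the 3090, energy per image.
print(f"Schnell/4090: {joules_per_image(4.5, 450):.0f} J")
print(f"SDXL/3090:    {joules_per_image(5.5, 350):.0f} J")

# Point 3: time ratio of Flux Dev to SDXL on the same 4090.
ratio = 10.0 / 2.6
print(f"Flux Dev vs SDXL on 4090: {ratio:.1f}x")
```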

VRAM Tiers: What You Can Actually Run

The VRAM question isn't just about whether a model fits — it's about whether it fits at a speed that matches your workflow.

12 GB cards (RTX 3060 12GB, RTX 4070 12GB): SD 1.5 at full speed. SDXL runs but benefits from 16 GB headroom, especially with ControlNet or a refiner loaded simultaneously. Flux requires GGUF Q5 or lower quantization and will use CPU offloading for the text encoders; expect 30–60 seconds per image. Usable for final production renders, impractical for iterative workflows.

ComfyUI's Dynamic VRAM system (released March 2026) improved the 12 GB Flux experience by reducing peak RAM pressure, but it doesn't change the fundamental compute bottleneck. The 3060 12GB is still a solid SDXL card — it's a slow Flux card.

**16 GB cards ([RTX 4060](https://www.amazon.com/s?k=RTX+4060&tag=ru