Thurmon Demich

Posted on Jun 30 • Originally published at bestgpuforai.com

Best Quantization for Stable Diffusion & Flux

#quantization #stablediffusion #flux #vram

Cross-posted from Best GPU for AI — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.

Quantization for image generation works differently from quantization for LLMs. With language models, you compress billions of text-prediction weights. With Stable Diffusion and Flux, you compress a UNet (or DiT transformer) plus the text encoders, and the quality tradeoffs land differently because the output is visual. A slightly off token often goes unnoticed in text; a slightly off texture or edge is noticed immediately.

The practical summary: FP16 is the default and produces the best quality. FP8 roughly halves the memory taken by the UNet or DiT weights with nearly invisible quality loss, and RTX 40/50 series GPUs support it natively. NF4 (4-bit) is the last resort for low-VRAM cards: it works, but image quality degrades noticeably on complex prompts.

VRAM usage by quantization level

Model	FP16 VRAM	FP8 VRAM	NF4 VRAM	Notes
SD 1.5	~4GB	~2.5GB	~1.8GB	Runs on anything modern
SD XL	~7GB	~4.5GB	~3GB	FP16 fits comfortably on 8GB+
Flux.1 Dev	~14GB	~8GB	~5GB	FP8 is the sweet spot for 12-16GB cards
Flux.1 Schnell	~12GB	~7GB	~4.5GB	Faster variant, slightly less VRAM

VRAM figures are approximate and vary by implementation, batch size, and resolution. Measured during generation at 1024x1024, batch size 1.

VRAM chart available at the original article

FP16: the baseline

FP16 (half-precision floating point) is the standard format for Stable Diffusion and Flux models. Most checkpoints you download from CivitAI or Hugging Face ship in 16-bit precision (FP16, or BF16 for the official Flux weights).

Quality: Maximum. This is the precision the checkpoint is distributed in, so nothing is lost to quantization.

When to use FP16: Whenever your GPU has enough VRAM. For SD XL, that means 8GB+. For Flux, 24GB+. If you have an RTX 4090 (24GB) or RTX 5090 (32GB), there is no reason to quantize: run FP16 and get the best possible output.

FP8: the practical sweet spot

FP8 (8-bit floating point) compresses the UNet/DiT weights to half the size of FP16. On paper, this should degrade quality. In practice, the difference is nearly invisible for most prompts, and on RTX 40 and 50 series GPUs the tensor cores support FP8 natively, so there's minimal speed penalty too.

Quality impact: Negligible for the large majority of prompts. Side-by-side comparisons show differences only in very fine details at high magnification, and community comparisons generally report that FP8 and FP16 outputs are hard to tell apart.

When to use FP8:

Running Flux on 12GB GPUs (RTX 3060 12GB, RTX 5070); see "Can the RTX 3060 run Stable Diffusion?" for the full picture on what this card handles
Running Flux on 16GB GPUs and wanting headroom for ControlNet or batch generation
Running SD XL on 8GB GPUs when FP16 leaves too little headroom

Hardware requirement: RTX 40/50 series GPUs have native FP8 tensor cores. RTX 30 series cards can use FP8 models but the computation falls back to FP16 math — you save VRAM but don't get the speed benefit.
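
The hardware split above can be sketched as a check on CUDA compute capability. This is a minimal illustration, not part of any tool mentioned here: the function name `has_native_fp8` is hypothetical, and the threshold assumes that Ada (compute capability 8.9) is the first consumer generation with FP8 tensor core support, with Ampere at 8.6 and Blackwell consumer cards at 12.0.

```python
def has_native_fp8(compute_capability):
    """Return True if a CUDA compute capability (major, minor) is 8.9 (Ada) or newer.

    Assumption: FP8 tensor core math is available from compute capability 8.9 up.
    """
    return tuple(compute_capability) >= (8, 9)


# Example capabilities for one card from each generation discussed above.
examples = {
    "RTX 3060 (Ampere)": (8, 6),
    "RTX 4090 (Ada)": (8, 9),
    "RTX 5090 (Blackwell)": (12, 0),
}

for name, capability in examples.items():
    if has_native_fp8(capability):
        mode = "native FP8 math"
    else:
        mode = "FP8 storage only, math falls back to FP16"
    print(f"{name}: {mode}")
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed to this function directly.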

NF4: when VRAM is desperate

NF4 (4-bit NormalFloat) quantization via ComfyUI nodes compresses models aggressively. This is how people run Flux on 8GB GPUs and SD XL on 6GB GPUs.

Quality impact: Visible. Fine details soften, color accuracy drifts slightly, and complex compositions show more artifacts. For quick drafts and iteration, it's usable. For final output quality, it's a compromise.

Quality aspect	FP16	FP8	NF4
Fine detail	Excellent	Excellent	Reduced
Color accuracy	Reference	Near-reference	Slight drift
Complex scenes	Full fidelity	Full fidelity	Occasional artifacts
Text rendering	Best available	Same as FP16	Degraded
Speed (RTX 40/50)	Baseline	Similar	Slower (dequant overhead)
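
To get a feel for why 4-bit storage softens detail, here is a rough sketch of blockwise absmax quantization in plain Python. It is a simplification, not NF4 itself: real NF4 uses 16 non-uniform levels fitted to a normal distribution, while this sketch uses evenly spaced integer levels. The weight distribution, block size, and function names are all assumptions chosen for illustration.

```python
import random


def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integer codes."""
    levels = 2 ** (bits - 1) - 1                 # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits, block_size=64):
    """Average reconstruction error after a quantize/dequantize round trip."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        codes, scale = quantize_block(block, bits)
        restored = dequantize_block(codes, scale)
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(weights)


random.seed(0)
# Synthetic stand-in for model weights (assumption: roughly Gaussian, small scale).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err4 = mean_abs_error(weights, bits=4)
err8 = mean_abs_error(weights, bits=8)
print(f"4-bit mean abs error: {err4:.6f}")
print(f"8-bit mean abs error: {err8:.6f}")
```

The 4-bit round trip leaves a much larger error per weight than the 8-bit one, and each dequantization step is extra work at inference time, which matches the dequant overhead noted in the table.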

When to use NF4: Only when FP8 doesn't fit, such as Flux on 8GB cards or SD XL on 6GB cards. If your GPU has 12GB+, FP8 is almost always the better tradeoff.

T5 text encoder quantization (Flux-specific)

Flux uses a large T5-XXL text encoder alongside its image model. At FP16, this encoder alone uses ~10GB of VRAM. Quantizing the T5 encoder to INT8 or INT4 reduces this to ~5GB or ~3GB respectively, with minimal impact on prompt understanding.

This is often more impactful than quantizing the image model itself. On a 16GB card, quantizing T5 to INT8 and keeping the DiT at FP8 gives noticeably better image quality than keeping T5 at FP16 and aggressively quantizing the DiT.
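
The encoder figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below assumes the T5-XXL encoder has roughly 4.7 billion parameters (an approximation, not a figure from this article) and counts raw weights only, so the results come out slightly below the rounded numbers quoted above, which also include runtime overhead.

```python
# Assumption: approximate encoder-only parameter count for T5-XXL.
T5_XXL_ENCODER_PARAMS = 4.7e9

BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_gb(params, bits):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring scales and overhead."""
    return params * bits / 8 / 1e9


for fmt, bits in BITS_PER_WEIGHT.items():
    size = weight_gb(T5_XXL_ENCODER_PARAMS, bits)
    print(f"T5-XXL encoder at {fmt}: ~{size:.1f} GB of weights")
```

The same function works for any model once you know its parameter count, which makes it a quick sanity check before downloading a checkpoint.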

For a deeper look at Flux VRAM requirements, see the Flux GPU guide and VRAM requirements for Flux.

Which quantization should you use?

Your GPU VRAM	SD 1.5	SD XL	Flux
6GB	FP16	NF4	Not viable
8GB	FP16	FP16	NF4 (barely)
12GB	FP16	FP16	FP8
16GB	FP16	FP16	FP8 (with headroom)
24GB+	FP16	FP16	FP16
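
The table above can also be written as a small lookup. This is only a sketch of the same recommendations: the function name `recommend` is hypothetical, and VRAM sizes that fall between rows (for example 10GB) are treated as the next tier down, which is an assumption on the conservative side.

```python
# Rows mirror the table above: (minimum VRAM in GB, {model: recommended format}).
# Ordered from largest to smallest so the first match is the highest tier that fits.
RECOMMENDATIONS = [
    (24, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP16"}),
    (16, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP8 (with headroom)"}),
    (12, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP8"}),
    (8, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "NF4 (barely)"}),
    (6, {"SD 1.5": "FP16", "SD XL": "NF4", "Flux": "Not viable"}),
]


def recommend(vram_gb, model):
    """Return the table's recommendation, or None below the smallest tier."""
    for min_vram, formats in RECOMMENDATIONS:
        if vram_gb >= min_vram:
            return formats[model]
    return None  # the table does not cover cards below 6GB


for vram in (6, 8, 12, 16, 24):
    print(f"{vram}GB -> Flux: {recommend(vram, 'Flux')}")
```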

For GPU recommendations matched to these VRAM tiers, see the Stable Diffusion GPU guide and the ComfyUI GPU guide.

