Thurmon Demich

Posted on • Originally published at bestgpuforai.com

Best Quantization for Stable Diffusion & Flux

Cross-posted from Best GPU for AI — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.

Quantization for image generation works differently from quantization for LLMs. With language models, you compress billions of text-prediction weights. With Stable Diffusion and Flux, you compress a UNet (or, for Flux, a diffusion transformer, DiT) and the text encoders, and the quality tradeoffs land differently because the output is visual. A slightly off token often goes unnoticed in text. A slightly off pixel is immediately noticeable.

The practical summary: FP16 is the default and produces the best quality. FP8 roughly halves the image model's VRAM with nearly invisible quality loss, and RTX 40/50 series GPUs support it natively. NF4 (4-bit) is the last resort for low-VRAM cards: it works, but image quality degrades noticeably on complex prompts.

VRAM usage by quantization level

| Model | FP16 VRAM | FP8 VRAM | NF4 VRAM | Notes |
|---|---|---|---|---|
| SD 1.5 | ~4GB | ~2.5GB | ~1.8GB | Runs on anything modern |
| SD XL | ~7GB | ~4.5GB | ~3GB | FP16 fits comfortably on 8GB+ |
| Flux.1 Dev | ~14GB | ~8GB | ~5GB | FP8 is the sweet spot for 12-16GB cards |
| Flux.1 Schnell | ~12GB | ~7GB | ~4.5GB | Faster variant, slightly less VRAM |

VRAM figures are approximate and vary by implementation, batch size, and resolution. Measured during generation at 1024x1024, batch size 1.

VRAM chart available at the original article

FP16: the baseline

FP16 (half-precision floating point) is the standard format for Stable Diffusion and Flux models. Most checkpoints you download from CivitAI or Hugging Face are stored in a 16-bit format by default (FP16, or the closely related BF16 that the original Flux weights use).

Quality: Maximum. This is the precision the checkpoint ships in, so running it as-is loses no information.

When to use FP16: Whenever your GPU has enough VRAM. For SD XL, that means 8GB+. For Flux, that means 24GB+ (roughly 14GB for the image model plus about 10GB for the T5 text encoder at FP16). If you have an RTX 4090 (24GB) or RTX 5090 (32GB), there is no reason to quantize: run FP16 and get the best possible output.

FP8: the practical sweet spot

FP8 (8-bit floating point) compresses the UNet/DiT weights to half the size of FP16. On paper, this should degrade quality. In practice, the difference is nearly invisible for most prompts — and on RTX 40 and 50 series GPUs, FP8 computation is handled by dedicated hardware (the FP8 tensor cores), so there's minimal speed penalty too.

Quality impact: Negligible for 90%+ of prompts. Side-by-side comparisons show differences only in very fine details at high magnification. Community blind tests consistently fail to distinguish FP8 from FP16 outputs.

When to use FP8:

  • Running Flux on 12GB GPUs (RTX 3060 12GB, RTX 5070); see "Can the RTX 3060 run Stable Diffusion?" for the full picture on what this card handles
  • Running Flux on 16GB GPUs and wanting headroom for ControlNet or batch generation
  • Running SD XL on 8GB GPUs

Hardware requirement: RTX 40/50 series GPUs have native FP8 tensor cores. RTX 30 series cards can use FP8 models but the computation falls back to FP16 math — you save VRAM but don't get the speed benefit.
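One way to check which side of that line a card falls on is its CUDA compute capability. The sketch below assumes that native FP8 tensor-core support starts at compute capability 8.9 (the RTX 40 series), while RTX 30 series cards report 8.6. The function name is hypothetical and used only for illustration.

```python
def has_native_fp8(major: int, minor: int) -> bool:
    """Return True if the compute capability is 8.9 or newer.

    Assumption: 8.9 (RTX 40 series) is the first consumer generation
    with FP8 tensor cores; RTX 30 series cards report 8.6.
    """
    return (major, minor) >= (8, 9)


# Example capabilities: RTX 3060 -> (8, 6), RTX 4090 -> (8, 9).
for name, capability in [("RTX 3060", (8, 6)), ("RTX 4090", (8, 9))]:
    supported = has_native_fp8(*capability)
    print(f"{name}: native FP8 = {supported}")
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the active GPU, which can be passed straight into this check.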

NF4: when VRAM is desperate

NF4 (4-bit NormalFloat) quantization via ComfyUI nodes compresses models aggressively. This is how people run Flux on 8GB GPUs and SD XL on 6GB GPUs.
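To make "4-bit" concrete, here is a simplified sketch of the idea behind NF4: each block of weights is scaled by its largest absolute value, and every weight is then snapped to the nearest of 16 fixed levels. The level values are rounded approximations of the NF4 code book, and the block here is tiny for readability. This is an illustration of the technique, not the code any ComfyUI node actually runs.

```python
# Approximate NF4 code book: 16 levels in [-1, 1], rounded to 4 decimals.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Return (scale, codes): one float scale plus a 4-bit index per weight."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / scale))
        for w in weights
    ]
    return scale, codes


def dequantize_block(scale, codes):
    """Rebuild approximate weights; this step runs every time the layer is used."""
    return [NF4_LEVELS[c] * scale for c in codes]


# Hypothetical weight values, chosen only to show the rounding error.
block = [0.021, -0.034, 0.005, 0.012, -0.008, 0.040, -0.019, 0.001]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"largest rounding error: {max_error:.4f}")
```

With only 16 levels to choose from, small weights get rounded coarsely, and the dequantize step adds work at inference time.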

Quality impact: Visible. Fine details soften, color accuracy drifts slightly, and complex compositions show more artifacts. For quick drafts and iteration, it's usable. For final output quality, it's a compromise.

| Quality aspect | FP16 | FP8 | NF4 |
|---|---|---|---|
| Fine detail | Excellent | Excellent | Reduced |
| Color accuracy | Reference | Near-reference | Slight drift |
| Complex scenes | Full fidelity | Full fidelity | Occasional artifacts |
| Text rendering | Best available | Same as FP16 | Degraded |
| Speed (RTX 40/50) | Baseline | Similar | Slower (dequant overhead) |

When to use NF4: Only when FP8 doesn't fit, such as running Flux on 8GB cards or SD XL on 6GB cards. If your GPU has 12GB+, FP8 is almost always the better tradeoff.

T5 text encoder quantization (Flux-specific)

Flux uses a large T5-XXL text encoder alongside its image model. At FP16, this encoder alone uses ~10GB of VRAM. Quantizing the T5 encoder to INT8 or INT4 reduces this to ~5GB or ~3GB respectively, with minimal impact on prompt understanding.
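Those figures track simple bytes-per-parameter arithmetic. The sketch below assumes the T5-XXL encoder has roughly 4.7 billion parameters (an approximation not stated in this article) and counts weight storage only. Real usage runs somewhat higher because of quantization scales and runtime overhead.

```python
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring activations and overhead."""
    return params_billions * bits_per_weight / 8


# Assumption: approximate T5-XXL encoder size, in billions of parameters.
T5_XXL_ENCODER_PARAMS_B = 4.7

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_gb(T5_XXL_ENCODER_PARAMS_B, bits)
    print(f"{name}: ~{size:.2f} GB of weights")
```

The same function works for any model if you substitute its parameter count.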

This is often more impactful than quantizing the image model itself. On a 16GB card, quantizing T5 to INT8 and keeping the DiT at FP8 gives a much better quality result than keeping T5 at FP16 and aggressively quantizing the DiT.
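The tradeoff is easier to see as a budget. This sketch adds up the approximate figures quoted in this article for a 16GB card, assuming the image model and the text encoder both stay resident in VRAM at the same time (some frontends offload one of them, which changes the math).

```python
# Approximate figures quoted in this article, in GB.
DIT_GB = {"FP16": 14, "FP8": 8, "NF4": 5}   # Flux.1 Dev image model
T5_GB = {"FP16": 10, "INT8": 5, "INT4": 3}  # T5-XXL text encoder


def total_gb(dit_format: str, t5_format: str) -> int:
    """Combined weight footprint if both models stay in VRAM together."""
    return DIT_GB[dit_format] + T5_GB[t5_format]


CARD_GB = 16
for dit, t5 in [("FP8", "INT8"), ("NF4", "FP16"), ("FP8", "FP16")]:
    total = total_gb(dit, t5)
    print(f"DiT {dit} + T5 {t5}: {total} GB, headroom {CARD_GB - total} GB")
```

Under these numbers, FP8 DiT plus INT8 T5 leaves more headroom on a 16GB card than NF4 DiT plus FP16 T5, while keeping the image model at the higher-quality format.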

For a deeper look at Flux VRAM requirements, see the Flux GPU guide and VRAM requirements for Flux.

Which quantization should you use?

| Your GPU VRAM | SD 1.5 | SD XL | Flux |
|---|---|---|---|
| 6GB | FP16 | NF4 | Not viable |
| 8GB | FP16 | FP16 | NF4 (barely) |
| 12GB | FP16 | FP16 | FP8 |
| 16GB | FP16 | FP16 | FP8 (with headroom) |
| 24GB+ | FP16 | FP16 | FP16 |
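The table can also be expressed as a small lookup. This is a sketch with a hypothetical `recommend` helper. It assumes that a card between two tiers (for example 10GB) should follow the lower tier, and it returns `None` where the table says a combination is not viable or does not cover the card at all.

```python
# Rows of the table above, highest tier first: (minimum VRAM in GB, picks).
RECOMMENDATIONS = [
    (24, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP16"}),
    (16, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP8"}),
    (12, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP8"}),
    (8, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "NF4"}),
    (6, {"SD 1.5": "FP16", "SD XL": "NF4", "Flux": None}),
]


def recommend(vram_gb: float, model: str):
    """Return the suggested format, or None if the table has no viable option."""
    for min_vram, picks in RECOMMENDATIONS:
        if vram_gb >= min_vram:
            return picks[model]
    return None  # below 6GB: not covered by the table


print(recommend(12, "Flux"))
print(recommend(6, "SD XL"))
print(recommend(6, "Flux"))
```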

For GPU recommendations matched to these VRAM tiers, see the Stable Diffusion GPU guide and the ComfyUI GPU guide.
