Cross-posted from Best GPU for AI — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.
Quantization for image generation works differently than quantization for LLMs. With language models, you compress billions of text-prediction weights. With Stable Diffusion and Flux, you compress a UNet (or DiT transformer) and text encoders — and the quality tradeoffs hit differently because the output is visual. A slightly off token is invisible in text. A slightly off pixel is immediately noticeable.
The practical summary: FP16 is the default and produces the best quality. FP8 halves UNet VRAM with nearly invisible quality loss on RTX 40/50 series GPUs (which have native FP8 support). NF4 (4-bit) is the last resort for low-VRAM cards — it works, but image quality degrades noticeably on complex prompts.
VRAM usage by quantization level
| Model | FP16 VRAM | FP8 VRAM | NF4 VRAM | Notes |
|---|---|---|---|---|
| SD 1.5 | ~4GB | ~2.5GB | ~1.8GB | Runs on anything modern |
| SD XL | ~7GB | ~4.5GB | ~3GB | FP16 fits comfortably on 8GB+ |
| Flux.1 Dev | ~14GB | ~8GB | ~5GB | FP8 is the sweet spot for 12-16GB cards |
| Flux.1 Schnell | ~12GB | ~7GB | ~4.5GB | Faster variant, slightly less VRAM |
VRAM figures are approximate and vary by implementation, batch size, and resolution. Measured during generation at 1024x1024, batch size 1.
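As a rough cross-check on the table, weight memory scales with bits per weight. The sketch below is a back-of-the-envelope estimate, not how the figures above were measured: `weight_gb` is a hypothetical helper, and the 2.5-billion-parameter count is an illustrative number rather than any specific model.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight


# Hypothetical 2.5B-parameter image model, chosen only for illustration.
for name, bits in [("FP16", 16), ("FP8", 8), ("NF4", 4)]:
    print(f"{name}: ~{weight_gb(2.5, bits):.2f} GB of weights")
```

Measured VRAM shrinks less than the weights do, because activations, the VAE, and other overhead stay the same size, and 4-bit formats also store per-block scale factors.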
FP16: the baseline
FP16 (half-precision floating point) is the standard 16-bit format for Stable Diffusion and Flux models. Most checkpoints you download from CivitAI or Hugging Face ship in a 16-bit format (FP16 or BF16) unless the filename says otherwise.
Quality: Maximum. This is the reference precision the other formats are compared against, so there is no quantization loss.
When to use FP16: Whenever your GPU has enough VRAM. For SD XL, that means 8GB+. For Flux, the image model alone needs roughly 16GB, and the recommendation table below reserves FP16 for 24GB+ cards. If you have an RTX 4090 (24GB) or RTX 5090 (32GB), there is no reason to quantize: run FP16 and get the best possible output.
FP8: the practical sweet spot
FP8 (8-bit floating point) compresses the UNet/DiT weights to half the size of FP16. On paper, this should degrade quality. In practice, the difference is nearly invisible for most prompts — and on RTX 40 and 50 series GPUs, FP8 computation is handled by dedicated hardware (the FP8 tensor cores), so there's minimal speed penalty too.
Quality impact: Negligible for 90%+ of prompts. Side-by-side comparisons show differences only in very fine details at high magnification. Community blind tests consistently fail to distinguish FP8 from FP16 outputs.
When to use FP8:
- Running Flux on 12GB GPUs (RTX 3060 12GB, RTX 5070); see "Can the RTX 3060 Run Stable Diffusion?" for the full picture on what this card handles
- Running Flux on 16GB GPUs and wanting headroom for ControlNet or batch generation
- Running SD XL on 8GB GPUs when FP16 leaves too little headroom
Hardware requirement: RTX 40/50 series GPUs have native FP8 tensor cores. RTX 30 series cards can use FP8 models but the computation falls back to FP16 math — you save VRAM but don't get the speed benefit.
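One way to express the hardware rule above in code is to compare the GPU's CUDA compute capability against the first generation with FP8 tensor cores. This is a minimal sketch under stated assumptions: `has_native_fp8` is a hypothetical helper, the 8.9 threshold reflects my understanding that Ada (RTX 40) is the first consumer generation with FP8 tensor cores, and the example capability values are listed by hand. On a real system you could obtain the tuple from `torch.cuda.get_device_capability()`.

```python
def has_native_fp8(compute_capability):
    """True if a (major, minor) compute capability is Ada (8.9) or newer.

    Assumption: FP8 tensor cores start at compute capability 8.9.
    """
    return tuple(compute_capability) >= (8, 9)


# Hand-entered example values, not queried from hardware.
EXAMPLES = {
    "RTX 3060 (Ampere)": (8, 6),
    "RTX 4090 (Ada)": (8, 9),
    "RTX 5090 (Blackwell)": (12, 0),
}

for gpu, cc in EXAMPLES.items():
    mode = "native FP8 math" if has_native_fp8(cc) else "FP8 storage, FP16 math"
    print(f"{gpu}: {mode}")
```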
NF4: when VRAM is desperate
NF4 (4-bit NormalFloat) quantization, available through ComfyUI nodes, stores each weight in 4 bits, about a quarter of its FP16 size. This is how people run Flux on 8GB GPUs and SD XL on 6GB GPUs.
Quality impact: Visible. Fine details soften, color accuracy drifts slightly, and complex compositions show more artifacts. For quick drafts and iteration, it's usable. For final output quality, it's a compromise.
| Quality aspect | FP16 | FP8 | NF4 |
|---|---|---|---|
| Fine detail | Excellent | Excellent | Reduced |
| Color accuracy | Reference | Near-reference | Slight drift |
| Complex scenes | Full fidelity | Full fidelity | Occasional artifacts |
| Text rendering | Best available | Same as FP16 | Degraded |
| Speed (RTX 40/50) | Baseline | Similar | Slower (dequant overhead) |
When to use NF4: Only when FP8 doesn't fit. Running Flux on 8GB cards. Running SD XL on 6GB cards. If your GPU has 12GB+, FP8 is almost always the better tradeoff.
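To see where the lost detail comes from, here is a rough sketch of blockwise 4-bit quantization in the spirit of NF4. It is not the implementation used by ComfyUI nodes or any library: the 16-entry codebook is an approximation built from evenly spaced normal quantiles, and `make_codebook`, `quantize_block`, and `dequantize_block` are hypothetical names.

```python
from statistics import NormalDist


def make_codebook(levels=16):
    """16 values in [-1, 1], spaced like quantiles of a normal distribution."""
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]


def quantize_block(block, codebook):
    """Store one scale per block plus a 4-bit codebook index per weight."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / scale))
        for w in block
    ]
    return indices, scale


def dequantize_block(indices, scale, codebook):
    """Recover approximate weights; this step runs at inference time."""
    return [codebook[i] * scale for i in indices]


codebook = make_codebook()
weights = [0.5, -0.25, 0.1, -1.2]  # made-up values for illustration
indices, scale = quantize_block(weights, codebook)
restored = dequantize_block(indices, scale, codebook)
for w, r in zip(weights, restored):
    print(f"{w:+.3f} -> {r:+.3f} (error {abs(w - r):.3f})")
```

Every weight snaps to one of only 16 levels, so small differences between weights disappear, and the dequantize step is the extra work behind the "dequant overhead" row in the table.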
T5 text encoder quantization (Flux-specific)
Flux uses a large T5-XXL text encoder alongside its image model. At FP16, this encoder alone uses ~10GB of VRAM. Quantizing the T5 encoder to INT8 or INT4 reduces this to ~5GB or ~3GB respectively, with minimal impact on prompt understanding.
This is often more impactful than quantizing the image model itself. On a 16GB card, quantizing T5 to INT8 and keeping the DiT at FP8 gives a much better quality result than keeping T5 at FP16 and aggressively quantizing the DiT.
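The 16GB tradeoff can be checked with simple arithmetic. The sketch below uses this article's own approximate figures for the T5 encoder and for Flux.1 Dev, and assumes (as a simplification) that both models sit in VRAM at the same time and that the image-model figures do not already include the encoder. `combos_that_fit` is a hypothetical helper.

```python
# Approximate sizes in GB, taken from the figures quoted in this article.
T5_GB = {"FP16": 10, "INT8": 5, "INT4": 3}
DIT_GB = {"FP16": 14, "FP8": 8, "NF4": 5}


def combos_that_fit(budget_gb):
    """List (t5_format, dit_format, total_gb) combos within the VRAM budget."""
    fits = []
    for t5_fmt, t5_gb in T5_GB.items():
        for dit_fmt, dit_gb in DIT_GB.items():
            total = t5_gb + dit_gb
            if total <= budget_gb:
                fits.append((t5_fmt, dit_fmt, total))
    # Largest totals first, i.e. the least aggressive quantization that fits.
    return sorted(fits, key=lambda combo: combo[2], reverse=True)


for t5_fmt, dit_fmt, total in combos_that_fit(16):
    print(f"T5 {t5_fmt} + DiT {dit_fmt}: ~{total}GB")
```

Under these assumptions, T5 at FP16 only fits next to an NF4 image model, while T5 at INT8 leaves room for the FP8 image model with a few gigabytes to spare.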
For a deeper look at Flux VRAM requirements, see the Flux GPU guide and VRAM requirements for Flux.
Which quantization should you use?
| Your GPU VRAM | SD 1.5 | SD XL | Flux |
|---|---|---|---|
| 6GB | FP16 | NF4 | Not viable |
| 8GB | FP16 | FP16 | NF4 (barely) |
| 12GB | FP16 | FP16 | FP8 |
| 16GB | FP16 | FP16 | FP8 (with headroom) |
| 24GB+ | FP16 | FP16 | FP16 |
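If you want the table as a lookup, a minimal sketch follows. `recommend` is a hypothetical helper that encodes only the rows above; treating in-between sizes (for example 10GB) as the next tier down is my assumption, and cards under 6GB return `None` because the table does not cover them.

```python
# (minimum VRAM in GB, recommendations), copied from the table above.
TIERS = [
    (24, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP16"}),
    (16, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP8 (with headroom)"}),
    (12, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "FP8"}),
    (8, {"SD 1.5": "FP16", "SD XL": "FP16", "Flux": "NF4 (barely)"}),
    (6, {"SD 1.5": "FP16", "SD XL": "NF4", "Flux": "Not viable"}),
]


def recommend(vram_gb, model):
    """Return the table's recommendation for the largest tier the card meets."""
    for min_gb, picks in TIERS:
        if vram_gb >= min_gb:
            return picks[model]
    return None  # below 6GB: not covered by the table


print(recommend(12, "Flux"))
print(recommend(10, "SD XL"))
```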
For GPU recommendations matched to these VRAM tiers, see the Stable Diffusion GPU guide and the ComfyUI GPU guide.
Related guides on Best GPU for AI
- How Much VRAM Do You Need for Stable Diffusion in 2026?
- Can the RTX 3060 Run Stable Diffusion? (Tested)
- Can the RTX 4060 Ti Run Flux in 2026? (Yes — 16GB Only)