This article was originally published on runaihome.com
TL;DR: NVFP4 is a Blackwell-exclusive quantization format that pushes FLUX 1 Dev to 7.73 it/s — 118% faster than GGUF Q8 and 84% faster than FP8 Scaled — while cutting VRAM from 26 GB (BF16) to 14 GB. The catch: it requires CUDA 13.0 and an RTX 50-series GPU. On RTX 40-series, NVFP4 delivers no speedup and can actually run 2× slower than FP8 if you don't have the right PyTorch build. RTX 40-series owners should use FP8 Scaled instead.
| | RTX 50-Series + NVFP4 | RTX 40/30-Series + FP8 Scaled | RTX 40/30-Series + BF16 |
|---|---|---|---|
| Best for | Maximum throughput on Blackwell | Speed + quality on Ada/Ampere | Full fidelity, no quality loss |
| FLUX 1 Dev speed (RTX 5090 benchmark) | 7.73 it/s | 4.21 it/s | 4.53 it/s |
| VRAM (FLUX SRPO) | 14 GB | ~17 GB | 26 GB |
| The catch | RTX 50-series only, needs CUDA 13 | No hardware FP8 speedup on 30-series | 24+ GB card mandatory |
Honest take: If you own an RTX 50-series card, NVFP4 with PyTorch cu130 is the single highest-impact setting change you can make — 7 minutes to set up, nearly 2× faster generation immediately. If you're on RTX 40-series, skip NVFP4 entirely and use FP8 Scaled checkpoints, which give you 40% VRAM savings with near-identical quality.
What NVFP4 Actually Is
NVFP4 is NVIDIA's own 4-bit floating-point quantization format, introduced with the Blackwell architecture. It is not the same as GGUF Q4, NF4, or bitsandbytes FP4: those are generic community formats that rely on software dequantization on any hardware. NVFP4 uses dedicated FP4 instructions wired into the 5th-generation Tensor Cores on Blackwell's sm120 architecture. The math runs natively in silicon.
The format uses a two-level scaling scheme: a global scale factor per tensor, plus per-block scale factors. This preserves dynamic range better than naive 4-bit truncation, which is why quality degradation is minimal on most FLUX workflows despite the aggressive compression.
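To make the two-level idea concrete, here is a minimal pure-Python sketch that compares a single per-tensor scale against a per-tensor scale combined with per-block factors. Several details are assumptions not stated in this article: the 4-bit value grid (E2M1 magnitudes up to 6), the block size of 16, and the simplification that block factors are kept as ordinary Python floats rather than being rounded to a compact storage type. It illustrates the scaling scheme only and is not NVIDIA's implementation.

```python
# Assumed magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
BLOCK_SIZE = 16  # assumed block size, not taken from the article


def snap(x):
    """Round x to the nearest representable 4-bit magnitude, keeping the sign."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return mag if x >= 0 else -mag


def roundtrip_single_scale(values):
    """Naive 4-bit: quantize and dequantize with one scale for the whole tensor."""
    scale = max(abs(v) for v in values) / FP4_MAX or 1.0
    return [snap(v / scale) * scale for v in values]


def roundtrip_two_level(values):
    """Quantize and dequantize with a per-tensor scale plus per-block factors."""
    tensor_scale = max(abs(v) for v in values) / FP4_MAX or 1.0
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        block_amax = max(abs(v) for v in block)
        # Per-block factor, expressed relative to the per-tensor scale.
        block_factor = (block_amax / FP4_MAX) / tensor_scale or 1.0
        scale = block_factor * tensor_scale
        out.extend(snap(v / scale) * scale for v in block)
    return out


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# One block of small values next to one block of large values.
weights = [0.01 * i for i in range(16)] + [100.0 - i for i in range(16)]

err_single = mean_abs_error(weights, roundtrip_single_scale(weights))
err_two_level = mean_abs_error(weights, roundtrip_two_level(weights))
print(f"single scale error: {err_single:.4f}")
print(f"two-level error:    {err_two_level:.4f}")
```

With a single scale, the large block dictates the step size and the small values all collapse to zero. With per-block factors, the small block gets its own range, which is the dynamic-range benefit described above.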
RTX 40-series (Ada Lovelace, sm89) has FP8 tensor cores but no FP4 datapath. NVFP4 will technically load on an RTX 4090, but without native FP4 acceleration, PyTorch falls back to software emulation — which is why NVIDIA explicitly warns that running NVFP4 without PyTorch cu130 can be up to 2× slower than FP8. That's not a misconfiguration; it's the expected behavior when emulating FP4 math on hardware built for FP8.
The Numbers: FLUX 1 Dev on RTX 5090 with CUDA 13
Benchmarks from Furkan Gözükara's FLUX precision comparison (RTX 5090, CUDA 13, 2048px, Quality 1 preset) on FLUX 1 Dev:
| Format | Speed (it/s) | vs GGUF Q8 |
|---|---|---|
| NVFP4 | 7.73 | +118% |
| BF16 | 4.53 | +28% |
| FP8 Scaled | 4.21 | +19% |
| GGUF Q8 (baseline) | 3.54 | — |
For FLUX SRPO on the same hardware: 5.7 seconds for 40 steps at NVFP4, using 14 GB VRAM vs 26 GB for the BF16 equivalent — a 46% reduction in VRAM footprint.
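The percentages quoted here follow directly from the raw figures. A short sketch of the arithmetic, using only the numbers from the table and paragraph above (the helper names are made up for illustration):

```python
def percent_faster(speed, baseline):
    """Relative speedup of `speed` over `baseline`, in percent."""
    return (speed / baseline - 1.0) * 100.0


def percent_saved(before_gb, after_gb):
    """Relative reduction from `before_gb` to `after_gb`, in percent."""
    return (before_gb - after_gb) / before_gb * 100.0


# it/s figures from the FLUX 1 Dev table above (RTX 5090, CUDA 13).
speeds = {"NVFP4": 7.73, "BF16": 4.53, "FP8 Scaled": 4.21, "GGUF Q8": 3.54}

for name, its in speeds.items():
    gain = percent_faster(its, speeds["GGUF Q8"])
    print(f"{name:10s} {gain:+6.0f}% vs GGUF Q8")

print(f"NVFP4 vs FP8 Scaled: {percent_faster(7.73, 4.21):+.0f}%")
print(f"VRAM saved, 26 GB -> 14 GB: {percent_saved(26, 14):.0f}%")
```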
For reference, raw FLUX Dev FP8 generation times across GPU tiers (from the ComfyUI GitHub benchmark discussion, 20 steps, standard workflow):
| GPU | Time (s) | Speed (it/s) |
|---|---|---|
| RTX 5090 | 5.46 | 3.66 |
| RTX 5080 | 6.67 | 3.23 |
| RTX 5060 Ti | 25.71 | 1.20 |
| RTX 4090 | 11.28 | 1.85 |
| RTX 3090 | ~26 | ~0.77 |
NVFP4 on a properly configured RTX 5090 takes FLUX dev from 5.46 s (FP8) to approximately 2.6–3.0 seconds per generation, comfortably inside NVIDIA's publicized "5 seconds for FLUX dev" figure on RTX 5090 at FP4. Meanwhile an RTX 4090 on FP8 lands at 11.28 seconds: still 2× slower than a mid-range Blackwell card running NVFP4, even though the 4090 outspecs the 5080 on paper in other respects.
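One plausible way to arrive at a 2.6–3.0 second estimate is to combine the two benchmarks above. This is a back-of-the-envelope sketch, not the author's stated method, and it assumes the 7.73 it/s throughput and the 84% speedup carry over to the 20-step workflow even though the two benchmarks used different settings.

```python
STEPS = 20            # step count from the ComfyUI benchmark above
NVFP4_IT_PER_S = 7.73  # RTX 5090, FLUX 1 Dev
FP8_IT_PER_S = 4.21    # RTX 5090, FLUX 1 Dev, FP8 Scaled
FP8_SECONDS = 5.46     # RTX 5090, FP8, 20 steps


def seconds_per_image(steps, it_per_s):
    """Sampling time only; ignores text encoding and VAE decode."""
    return steps / it_per_s


# Lower bound: pure sampling time at the NVFP4 throughput.
from_throughput = seconds_per_image(STEPS, NVFP4_IT_PER_S)

# Upper bound: scale the measured FP8 time by the NVFP4-over-FP8 speedup.
from_speedup = FP8_SECONDS / (NVFP4_IT_PER_S / FP8_IT_PER_S)

print(f"estimate from throughput: {from_throughput:.2f} s")
print(f"estimate from speedup:    {from_speedup:.2f} s")
```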
GPU Tier Guide: Which Format to Use
RTX 50-Series (RTX 5060 Ti, 5070, 5070 Ti, 5080, 5090): Use NVFP4
Every RTX 50-series card — including the RTX 5060 Ti 16GB at the budget end — carries Blackwell's sm120 Tensor Cores with native FP4 hardware. NVFP4 is your native format. The speed advantage over FP8 is real (roughly 84% faster on FLUX 1 Dev), VRAM savings are significant, and quality degradation on FLUX models is acceptable for most production workflows.
The only prerequisite is getting PyTorch on CUDA 13 (see setup steps below). Without that, you're running software-emulated FP4 and will likely see worse performance than FP8.
One nuance for the RTX 5060 Ti: the 16GB card can load NVFP4 FLUX 1 Dev (which needs ~14 GB) with about 2 GB of headroom. That's tight for large batches or multi-ControlNet workflows. FP8 Scaled at ~17 GB is over the limit, so of the three formats compared in the table above, NVFP4 is the only one that fits FLUX 1 Dev entirely in VRAM on that card. The 16 GB RTX 5070 Ti and RTX 5080 face the same tight margin, the 12 GB RTX 5070 cannot hold the ~14 GB checkpoint without offloading, and only the 32 GB RTX 5090 has comfortable headroom.
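The headroom reasoning can be written as a small check. The footprints below are the approximate figures quoted in this article; the function names and the optional `reserve_gb` margin are hypothetical, added only for illustration.

```python
# Approximate model footprints in GB, as quoted in this article.
FOOTPRINT_GB = {"NVFP4": 14, "FP8 Scaled": 17, "BF16": 26}


def headroom_gb(card_vram_gb, fmt):
    """VRAM left over after loading the model; negative means it does not fit."""
    return card_vram_gb - FOOTPRINT_GB[fmt]


def formats_that_fit(card_vram_gb, reserve_gb=0.0):
    """Formats whose footprint leaves at least `reserve_gb` free on the card."""
    return [
        fmt for fmt in FOOTPRINT_GB
        if headroom_gb(card_vram_gb, fmt) >= reserve_gb
    ]


print(formats_that_fit(16))                # 16 GB card, e.g. RTX 5060 Ti 16GB
print(formats_that_fit(24))                # 24 GB card, e.g. RTX 3090
print(formats_that_fit(16, reserve_gb=4))  # same 16 GB card, keeping 4 GB free
```

Raising `reserve_gb` is a rough way to account for batches or ControlNets; real memory use depends on resolution, text encoders, and the rest of the workflow.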
RTX 40-Series (RTX 4070, 4080, 4090): Use FP8 Scaled — Skip NVFP4
The RTX 4090, RTX 4080 Super, and RTX 4070 Ti Super all lack native FP4 Tensor Cores. NVFP4 will load on these cards if you try, but it runs in emulation mode: at best it performs comparably to FP8, and at worst it is 2× slower. The community consensus is to stick with FP8 Scaled (also called NVFP8 in NVIDIA's naming).
FP8 Scaled on RTX 40-series delivers:
- ~40% VRAM reduction vs BF16
- Speed roughly on par with BF16 (in the RTX 5090 benchmark above, FP8 Scaled ran at 4.21 it/s vs 4.53 it/s for BF16, essentially tied, with the VRAM savings being the actual win)
- No meaningful quality difference vs BF16 in practice
If you see people benchmarking NVFP4 on RTX 4090 and getting impressive numbers, check whether they're on PyTorch cu130 with CUDA 13 and whether they've confirmed FP4 hardware acceleration is actually being used. Without a Blackwell card, the claimed speedup won't materialize.
RTX 30-Series (RTX 3090, 3080, 3060): FP8 Scaled or GGUF
The RTX 3090 runs FLUX Dev at ~26 seconds per image in FP8, roughly 5× slower than an RTX 5090 on FP8 (5.46 s) and closer to 9–10× slower than one doing NVFP4. There's no quantization format that closes that gap on Ampere hardware; FP8 on 30-series also runs in software emulation, delivering minimal speedup over FP16/BF16. GGUF Q4-Q8 is often the best choice here because it's memory-efficient without requiring hardware acceleration.
That said, the RTX 3090 with its 24GB VRAM still has an edge over 16GB cards for running larger models without quality-hurting quantization. It's not a speed demon for image generation in 2026, but for tasks where VRAM ceiling matters more than throughput, it holds up.
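The tier guide above boils down to a lookup on the GPU's compute capability. The sketch below is a hypothetical helper, not part of any library. It uses the sm120 and sm89 values mentioned earlier and assumes Ampere cards report major version 8 with a minor version other than 9 (for example 8.6 on RTX 30-series). On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns a tuple in this `(major, minor)` form.

```python
def recommend_format(compute_capability):
    """Map a (major, minor) compute capability to this guide's recommendation."""
    major, minor = compute_capability
    if major == 12:
        # Consumer Blackwell (sm120): native FP4 Tensor Cores.
        return "NVFP4"
    if (major, minor) == (8, 9):
        # Ada Lovelace (sm89): FP8 Tensor Cores, no FP4 datapath.
        return "FP8 Scaled"
    if major == 8:
        # Ampere (assumed 8.0 / 8.6): no hardware FP8 or FP4.
        return "FP8 Scaled or GGUF"
    return "not covered by this guide"


for name, cap in [("RTX 5090", (12, 0)), ("RTX 4090", (8, 9)), ("RTX 3090", (8, 6))]:
    print(f"{name}: {recommend_format(cap)}")
```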
Available NVFP4 Model Checkpoints
Black Forest Labs has released NVFP4-quantized versions of their main models, available on Hugging Face:
- FLUX.1-dev-NVFP4 — the standard text-to-image dev model at ~14 GB
- FLUX.2-dev-NVFP4 — the updated successor, same ~14 GB footprint
- FLUX.1-Kontext-dev-NVFP4 — the image-editing model (covered in the FLUX Kontext guide)
NVIDIA has also released NVFP4 checkpoints for:
- Z-Image (Alibaba) — a high-speed turbo model
- Qwen-Image (Alibaba)
LTX-2.3 (Lightricks' video generation model) has been announced as getting NVFP4 support. If you're running WAN video generation, check the WAN GPU guide for context on video model VRAM requirements.
All NVFP4 checkpoints use the .safetensors format and load through ComfyUI's standard diffusion model loader once you've upgraded PyTorch to cu130.
Setup: Upgrading to PyTorch CUDA 13 for NVFP4
This is the step most users miss. NVFP4 hardware acceleration requires PyTorch built with CUDA 13.0. Without it, you're emulating FP4 in software and the speedup disappears — or reverses.
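A quick way to see which CUDA version your PyTorch build targets is to read `torch.version.cuda`, which is a string such as `"13.0"` on CUDA builds and `None` on CPU-only builds. The helper names below are hypothetical, and the sketch only checks the build string. It does not confirm that FP4 kernels are actually being used.

```python
def cuda_build_ok(cuda_version, required_major=13):
    """True if a PyTorch CUDA build string such as '13.0' meets the requirement."""
    if not cuda_version:  # None on CPU-only builds or when PyTorch is missing
        return False
    return int(cuda_version.split(".")[0]) >= required_major


def installed_torch_cuda():
    """Return torch.version.cuda, or None if PyTorch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    version = installed_torch_cuda()
    print(f"PyTorch CUDA build: {version!r}, cu130-ready: {cuda_build_ok(version)}")
```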
Step 1: Update your NVIDIA driver
You need driver version ≥580. Check your current version with:
nvidia-smi
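To script the same check, a minimal sketch could call `nvidia-smi` with its `--query-gpu=driver_version` option and compare the major version against 580. The function names are hypothetical, and the script returns `None` rather than failing when `nvidia-smi` is not on the PATH.

```python
import subprocess

MIN_DRIVER_MAJOR = 580  # minimum driver version stated in this guide


def driver_major(version_string):
    """Extract the major number from a driver string such as '580.65.06'."""
    return int(version_string.strip().split(".")[0])


def driver_ok(version_string):
    """True if the driver's major version meets the minimum."""
    return driver_major(version_string) >= MIN_DRIVER_MAJOR


def query_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU, or return None."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    version = query_driver_version()
    if version is None:
        print("nvidia-smi not available; could not read the driver version")
    else:
        print(f"driver {version}: {'OK' if driver_ok(version) else 'update needed'}")
```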
If below 580, download the latest driver from NVIDIA's site before continuing. Older drivers don'