Jovan Chan

Posted on Jun 15 • Originally published at runaihome.com

FLUX.1 Kontext Dev for Local AI in 2026: Image Editing on Consumer GPUs Without the API Bills

#flux #comfyui #imageediting #localai

TL;DR: FLUX.1 Kontext dev is a 12B open-weight image-editing model from Black Forest Labs. The FP8 checkpoint runs in 12GB VRAM at roughly 2× the speed of the raw BF16 model; an aggressive NF4 quantization squeezes it to 7GB. The API is $0.04 per image — local breaks even in under 13,000 edits.

	RTX 4090 (FP8)	RTX 4070 / 3060 12GB (FP8)	8GB GPU (NF4/GGUF)
Best for	Full-speed editing, FP4 on RTX 50-series	Sweet spot: quality + hardware you may already own	Budget entry, slower output
VRAM used	12–14 GB (headroom for FP8)	12–14 GB	7–8 GB
Speed	~2.29 iter/s at NF4 / faster at FP8 TensorRT	~1.5–2.0 iter/s at FP8	~0.6–1.0 iter/s
The catch	Hardware cost is steep if you don't own one	T5 encoder adds ~6–9 GB RAM overhead	Visible quality loss vs FP8

Honest take: If you own a 12GB+ GPU, run the FP8 checkpoint locally — the setup takes 20 minutes and you'll break even against API costs in a weekend of editing. Below 12GB, the quality compromise from NF4 is real enough to just use the API unless you're doing hundreds of edits daily.
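The break-even figure in the TL;DR is simple arithmetic, and a minimal sketch makes the implied assumption visible. The function names below are hypothetical, and the calculation ignores electricity and your time; the only number taken from this article is the $0.04 per-image API price. Working backwards, 13,000 edits at that price corresponds to a hardware budget of about $520.

```python
import math

# $0.04 per image, as quoted in the TL;DR. Stored in cents to avoid float rounding.
API_PRICE_CENTS = 4


def implied_budget_usd(edits: int, price_cents: int = API_PRICE_CENTS) -> float:
    """API spend (in USD) that a given number of edits would cost."""
    return edits * price_cents / 100


def break_even_edits(hardware_cost_usd: float, price_cents: int = API_PRICE_CENTS) -> int:
    """Number of edits after which API spend exceeds the hardware cost.

    Assumption: electricity and setup time are ignored.
    """
    return math.ceil(hardware_cost_usd * 100 / price_cents)


if __name__ == "__main__":
    print(implied_budget_usd(13_000))  # hardware budget implied by the TL;DR
    print(break_even_edits(520))       # edits needed to pay off a $520 card
```

If you already own the GPU, the hardware cost is effectively zero and every local edit is a saving from the first one.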

What Flux Kontext Is (and Isn't)

Black Forest Labs released FLUX.1 Kontext Pro on May 29, 2025 as the first model in its Kontext suite. The open-weight [dev] variant followed shortly after. The key distinction: Kontext is not a text-to-image model. It is an image-editing model.

You hand it an existing image and a text instruction — "change the jacket color to red", "replace the background with a forest", "make her hold an umbrella" — and it applies that edit while preserving everything else: face identity, lighting, background elements, stylistic consistency. That consistency-across-edits capability is what sets it apart from running an inpaint workflow in standard FLUX.1 dev.

The architecture accepts both a text prompt and one or more reference images as conditioning inputs. Internally, it's a 12B parameter flow-matching diffusion transformer — same family as FLUX.1 dev, but trained on instruction-following editing tasks rather than pure text-to-image generation. The Pro and Max variants are closed API; the [dev] model is open-weight under the FLUX.1 Non-Commercial License, which restricts the model weights to non-commercial use but permits commercial use of the generated outputs under certain conditions.

If you're already running ComfyUI, including on Linux, the Kontext dev workflow slots in without a framework change.

The VRAM Reality: 24GB Native, 7GB Quantized

The raw BF16 safetensors file weighs in at approximately 24 GB on disk, right at the VRAM ceiling of an RTX 3090 or RTX 4090. In practice, you need a few GB of headroom for activations and the rest of the pipeline, so BF16 is tight on 24GB cards and requires lowering resolution to stay within bounds.

The practical tiers, all of which Black Forest Labs and the community have released as ready-to-use checkpoints:

FP8 Scaled (12 GB VRAM required)
The recommended path for RTX 40/30-series cards. The file flux1-dev-kontext_fp8_scaled.safetensors is ~12 GB. NVIDIA's own benchmarks show 2× faster inference vs BF16 PyTorch when running on RTX 40-series hardware, which has FP8 tensor core acceleration. This is the sweet spot: near-full quality, half the memory, faster output.

NF4 / Q4 Quantization (7–8 GB VRAM required)
Community GGUF and NF4 checkpoints bring the model to ~7 GB on disk. Black Forest Labs benchmarking reported 97% quality retention vs the full BF16 model at NF4 precision. On an RTX 4090 using NF4, real-world edits benchmark at approximately 2.29 iterations/second — roughly 9 seconds per edit at 20 sampling steps.

FP4 via TensorRT (Blackwell RTX 50-series only)
RTX 5060 Ti and other Blackwell GPUs with native FP4 tensor cores can load Kontext at 7 GB through NVIDIA's TensorRT-RTX. The FP4 path hits similar speeds to FP8 on Ada — the model is smaller in memory, the throughput is comparable, and the quality is close to NF4. This requires the TensorRT-RTX library and NVIDIA's NIM microservice or a ComfyUI-TensorRT node, not the standard safetensors path.
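The file sizes in these tiers follow directly from the parameter count. As a rough sketch, assuming all 12B parameters are stored at a single uniform precision (real checkpoints usually are not), weight size is parameters times bits per weight divided by eight. The helper name is hypothetical.

```python
# Parameter count from this article: a 12B diffusion transformer.
PARAMS_BILLION = 12

# Bits per weight for each tier discussed above.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "nf4": 4}


def weight_size_gb(params_billion: float, bits: int) -> float:
    """Approximate weight size in GB (decimal), assuming uniform precision."""
    return params_billion * bits / 8


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_size_gb(PARAMS_BILLION, bits):.0f} GB")
```

The 4-bit estimate comes out at about 6 GB, a little under the ~7 GB quoted for the NF4 and FP4 files. A plausible explanation, stated here as an assumption, is that quantization scales and any layers kept at higher precision account for the difference.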

GPU Tier Guide

24GB Cards (RTX 3090, RTX 4090): Run BF16 or FP8

Both the RTX 3090 and RTX 4090 comfortably handle the FP8 checkpoint. The RTX 4090 gains the additional TensorRT 2× speedup from FP8 tensor core acceleration; the RTX 3090 runs FP8 at full quality but without the same hardware-accelerated path, so expect speeds comparable to FP8 on a 40-series midrange rather than the flagship.

If you want to run BF16 on a 24GB card, keep your output resolution at 1024×1024 or below and stay at 20 steps to keep edit times reasonable. Above that resolution, you will hit OOM errors. FP8 is the better choice here: near-full quality, half the memory, faster.

12–16GB Cards (RTX 4070 12GB, RTX 4060 Ti 16GB, RTX 3060 12GB): FP8 Sweet Spot

The RTX 4070 with 12GB and the RTX 4060 Ti 16GB are arguably the most practical targets for Kontext dev. The FP8 checkpoint fits with 0–2 GB headroom. Speed lands somewhere between a 3090 and 4090 depending on architecture — for Kontext's editing workload, you're looking at around 1.5–2.0 iterations/second at 20 steps, so 10–15 seconds per edit.
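Converting iterations per second into seconds per edit is a single division. The sketch below uses the step count and throughput figures quoted in this article; the function name is hypothetical, and it covers sampling time only, on the assumption that text encoding, VAE decode, and model loading are measured separately.

```python
def seconds_per_edit(steps: int, iters_per_second: float) -> float:
    """Sampling time for one edit: steps divided by throughput."""
    return steps / iters_per_second


if __name__ == "__main__":
    # 20 sampling steps, as used for the timings in this article.
    for label, rate in [("RTX 4090, NF4", 2.29), ("12GB FP8, fast end", 2.0), ("12GB FP8, slow end", 1.5)]:
        print(f"{label}: {seconds_per_edit(20, rate):.1f} s")
```

Sampling alone at 1.5–2.0 iterations/second works out to roughly 10–13 seconds, so the upper end of the 10–15 second range leaves some room for the non-sampling stages.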

The RTX 3060 12GB is the minimum for running FP8 without offloading. It works; the speed is modest (an estimated ~12–18 seconds per edit at FP8), and you will need to keep context length conservative, meaning output resolution and the number of reference images. But it runs.

One practical issue on 12GB cards: the T5-XXL text encoder consumes 4–9 GB of system RAM depending on precision. Loaded at FP16, it adds roughly 9 GB of system RAM usage. Use the FP8-scaled T5 encoder (t5xxl_fp8_e4m3fn_scaled.safetensors) to keep RAM pressure manageable.

8GB Cards (RTX 3060 8GB, RTX 4060 8GB, RTX 5060 8GB): NF4/GGUF Only

An 8GB card requires NF4 or a GGUF quantization. With the 7GB NF4 checkpoint, there's 1 GB of headroom — fine for small resolution (768×768), tight for 1024×1024. Black Forest Labs reported 97% quality retention at NF4; in practice, you'll notice softened fine detail in complex scenes and slightly reduced text rendering compared to FP8, but for most portrait and product edits the output is usable.

GGUF variants in the Q4 range (4–7 GB) are available from the QuantStack repository on Hugging Face. Place these in the models/unet/ directory and load them with the ComfyUI-GGUF custom node rather than the standard diffusion model loader.
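The tier guide above boils down to two VRAM thresholds. Here is a minimal sketch of that decision, with a hypothetical function name and return labels; the 12 GB and 8 GB cut-offs come from this article, and the sketch deliberately ignores the Blackwell-only FP4 path.

```python
def recommend_checkpoint(vram_gb: float) -> str:
    """Map available VRAM to the checkpoint tier suggested in this guide."""
    if vram_gb >= 12:
        # 12GB and up: the FP8 scaled checkpoint fits without offloading.
        return "fp8_scaled"
    if vram_gb >= 8:
        # 8GB cards: NF4 or a Q4-range GGUF only.
        return "nf4_or_gguf_q4"
    # Below 8GB nothing in this guide fits; the hosted API is the fallback.
    return "api"


if __name__ == "__main__":
    for vram in (24, 16, 12, 8, 6):
        print(f"{vram} GB -> {recommend_checkpoint(vram)}")
```

Treat the result as a starting point: on a 24GB card BF16 is also possible, just tighter on memory than FP8.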

ComfyUI Setup: 20 Minutes Start to First Edit

Prerequisites

ComfyUI v0.3.42 or newer — the Kontext workflow nodes were added in this release and are not available in older builds
30–50 GB of free storage (accounting for model files + working cache)
Python 3.11 or 3.12 with PyTorch 2.4+
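Two of these prerequisites, plus the storage requirement, can be checked from a short script before you download anything. This is a sketch with hypothetical helper names; it uses only the standard library, treats PyTorch as optional so it runs even where PyTorch is missing, and does not attempt to detect the ComfyUI version.

```python
import shutil
import sys

MIN_FREE_GB = 30  # lower bound of the 30–50 GB storage range above


def python_ok(version=sys.version_info) -> bool:
    """True for Python 3.11 or 3.12."""
    return (version[0], version[1]) in {(3, 11), (3, 12)}


def torch_ok(version_string: str) -> bool:
    """True for PyTorch 2.4 or newer; tolerates suffixes like '+cu121'."""
    core = version_string.split("+")[0]
    major, minor = (int(part) for part in core.split(".")[:2])
    return (major, minor) >= (2, 4)


def disk_ok(path: str = ".", min_free_gb: int = MIN_FREE_GB) -> bool:
    """True if the drive holding `path` has at least `min_free_gb` free."""
    return shutil.disk_usage(path).free >= min_free_gb * 10**9


if __name__ == "__main__":
    print("Python 3.11/3.12:", python_ok())
    print(f"Free storage >= {MIN_FREE_GB} GB:", disk_ok())
    try:
        import torch
        print("PyTorch >= 2.4:", torch_ok(torch.__version__))
    except ImportError:
        print("PyTorch is not installed in this environment")
```

Run it with the same Python interpreter that ComfyUI uses, otherwise the version checks describe the wrong environment.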

Download the Model Files

You need four files, grouped into three folders:

1. Diffusion model — place in ComfyUI/models/diffusion_models/

For FP8 (recommended for 12GB+ VRAM):

flux1-dev-kontext_fp8_scaled.safetensors  (~12 GB)

Download from the Black Forest Labs Hugging Face repository.

For NF4/GGUF (8–12GB VRAM):
Use any Q4–Q8 GGUF from QuantStack's FLUX.1-Kontext-dev-GGUF repo. Place in ComfyUI/models/unet/ and use the GGUF loader node.

2. VAE — place in ComfyUI/models/vae/

ae.safetensors

This is shared with standard FLUX.1 dev — you likely already have it.

3. Text encoders — place in ComfyUI/models/text_encoders/

clip_l.safetensors
t5xxl_fp8_e4m3fn_scaled.safetensors   (FP8, recommended — saves ~5 GB RAM vs FP16)
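Misplaced files are a common source of trouble, so it can help to confirm the layout before opening the workflow. The sketch below checks the FP8 file set listed above relative to a ComfyUI root folder; the function name is hypothetical, and GGUF users would swap the first entry for their own file under models/unet/.

```python
from pathlib import Path

# Paths relative to the ComfyUI root, as listed in this section (FP8 setup).
EXPECTED_FILES = [
    "models/diffusion_models/flux1-dev-kontext_fp8_scaled.safetensors",
    "models/vae/ae.safetensors",
    "models/text_encoders/clip_l.safetensors",
    "models/text_encoders/t5xxl_fp8_e4m3fn_scaled.safetensors",
]


def missing_files(comfy_root) -> list[str]:
    """Return the expected files that are not present under `comfy_root`."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed folder name; point this at your own install.
    missing = missing_files("ComfyUI")
    if missing:
        print("Missing:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All four model files are in place.")
```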

Load the Workflow

The ComfyUI docs provide an official native workflow JSON. Download it, drag it onto your ComfyUI canvas. If nodes appear red afte
