Forge WebUI Review 2026: Faster SDXL and Flux on Less VRAM

This article was originally published on aifoss.dev

TL;DR: Forge is a backend rewrite of Automatic1111 that runs SDXL 30–75% faster, cuts SDXL VRAM use at 1024px from roughly 8 GB to 4–6 GB, and adds native Flux.1 support, all without changing the UI you already know. Switching from A1111 takes about 10 minutes. The only reasons not to switch: specific extensions that break, AMD hardware, or a need for ComfyUI's automation-first pipeline model.

| | Forge WebUI | Automatic1111 | ComfyUI |
|---|---|---|---|
| Best for | A1111 users wanting speed + Flux | Legacy extensions, max compat | Node pipelines, automation |
| VRAM (Flux.1 Dev NF4) | 6–8 GB | Not supported | 8–12 GB |
| VRAM (SDXL 1024px) | 4–6 GB | 8 GB+ | 4–6 GB |
| Extension compatibility | ~80% of A1111 extensions | 100% (baseline) | Own ecosystem |
| Setup difficulty | One-click installer | One-click installer | Manual clone + deps |
| Speed vs A1111 (SDXL) | 30–75% faster | Baseline | Comparable or faster |
| License | AGPL-3.0 | AGPL-3.0 | GPL-3.0 |

Honest take: If you're on A1111 today, switch to Forge — it's the same interface with a meaningfully faster engine and Flux.1 support that A1111 simply doesn't have. Stick with A1111 only if you depend on extensions that are on the broken list.

What Forge Is

Stable Diffusion WebUI Forge is a fork of Automatic1111, created by lllyasviel — the same developer who built ControlNet. The goal was specific: replace A1111's inference backend with a more efficient memory manager while keeping the existing extension ecosystem and UI intact.

The project is hosted at github.com/lllyasviel/stable-diffusion-webui-forge under an AGPL-3.0 license. The latest one-click installer package was released February 5, 2025; development on the main branch continues beyond that date. The AGPL-3.0 license matters if you plan to run a modified Forge as a public-facing service: its network-use clause requires you to offer the source of your modifications to the users of that service.

Visually, Forge looks like A1111. The txt2img, img2img, extras, and settings panels are nearly identical. The differences are in the engine that runs underneath.

Installation

Two paths: one-click package or manual clone.

One-click package (recommended for most users): Download it from the GitHub releases page. Extract it, run update.bat to sync with the main branch, then launch with run.bat. The package ships Windows batch scripts; on Linux/macOS, launch with webui.sh from a manual clone. The primary build uses CUDA 12.1 + PyTorch 2.3.1. A CUDA 12.4 + PyTorch 2.4 build is listed as "fastest" but has reported MSVC and xformers issues on some Windows configurations.

Manual clone:

```bash
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge
cd stable-diffusion-webui-forge
# Linux/macOS: launch with the bundled script
bash webui.sh
# Windows: run webui-user.bat instead
```

Your existing A1111 model files work without conversion. Copy or symlink your models/Stable-diffusion/, models/Lora/, and embeddings/ directories and Forge picks them up immediately. Extension reinstallation is the main migration cost — you'll need to reinstall from the Extensions tab, and some won't work (more on that below).
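
The symlink route can be scripted. The following is a minimal sketch, assuming both installs use the folder names mentioned above; the function name `link_models` and the root paths you pass in are hypothetical, and on Windows creating symlinks may require Developer Mode or an elevated prompt. It links individual files rather than whole directories, so anything already present in the Forge folders is left alone.

```python
from pathlib import Path

# Folder names taken from the paragraph above; the install roots are
# supplied by the caller and are placeholders, not fixed locations.
SHARED_DIRS = ("models/Stable-diffusion", "models/Lora", "embeddings")


def link_models(a1111_root, forge_root, subdirs=SHARED_DIRS):
    """Symlink each entry in the A1111 model folders into the matching
    Forge folder. Returns the list of links that were created."""
    created = []
    for sub in subdirs:
        src_dir = Path(a1111_root) / sub
        dst_dir = Path(forge_root) / sub
        if not src_dir.is_dir():
            continue  # nothing to migrate for this folder
        dst_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted(src_dir.iterdir()):
            dst = dst_dir / src.name
            if dst.exists() or dst.is_symlink():
                continue  # never overwrite something already in Forge
            dst.symlink_to(src.resolve(), target_is_directory=src.is_dir())
            created.append(dst)
    return created
```

Calling `link_models("/path/to/stable-diffusion-webui", "/path/to/stable-diffusion-webui-forge")` a second time creates nothing new, so it is safe to re-run after downloading more models into the A1111 folders.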

VRAM Savings

The headline feature, and the reason most people switch, is VRAM reduction. Forge achieves this through a dynamic memory management layer that splits model layers across GPU VRAM, CPU RAM, and shared GPU memory based on what fits — rather than requiring the full model to be GPU-resident.
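
To make the idea concrete, here is a toy illustration of budget-based layer placement. It is not Forge's actual memory manager, whose internals this article does not cover; the function name, the per-layer sizes, and the greedy first-fit rule are all assumptions chosen for the example.

```python
def place_layers(layer_sizes_gb, vram_budget_gb):
    """Toy first-fit placement: keep a layer on the GPU while it still fits
    in the VRAM budget, otherwise park it in CPU RAM.

    This only illustrates the concept of splitting a model across devices;
    it is not how Forge decides placement.
    """
    placement = []
    used_gb = 0.0
    for size_gb in layer_sizes_gb:
        if used_gb + size_gb <= vram_budget_gb:
            placement.append("gpu")
            used_gb += size_gb
        else:
            placement.append("cpu")
    return placement
```

With four hypothetical 2 GB layers and a 5 GB budget, the first two layers stay on the GPU and the remaining two are offloaded. Offloaded layers have to be moved or computed elsewhere at run time, which is where the speed cost on small cards comes from.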

SDXL: A1111 requires roughly 8 GB to run SDXL at 1024×1024 without medvram flags. Forge handles it at 4–6 GB. On an RTX 3070 (8 GB), SDXL generation that was unstable or impossible in A1111 runs cleanly. On 4 GB cards, some users report successful generation at reduced batch sizes, though performance degrades significantly.

Flux.1 Dev: A1111 does not support Flux.1 models natively. Forge does, through built-in NF4 (bitsandbytes) and FP8 quantization. VRAM breakdown by format:

| Format | Approx. GPU VRAM | Notes |
|---|---|---|
| FP16 | 24 GB+ | Unquantized, highest quality |
| FP8 | 11–12 GB | Good quality, CUDA 11.7+, RTX 20xx+ |
| NF4 (bitsandbytes) | 6–8 GB | Best for limited VRAM, RTX 3xxx/4xxx |
| GGUF Q4 | ~6 GB | Via GGUF extension in Forge |

For an RTX 3080 (10 GB), FP8 Flux.1 Dev is a practical choice if you want to avoid the quality tradeoff of NF4, but note that its 11–12 GB footprint slightly exceeds the card's VRAM, so Forge has to offload the remainder to system RAM. For 8 GB cards, NF4 is the path: expect generation times of 60–120 seconds per image at 1024×1024.

The memory offloading works best when you have ample system RAM as the overflow target. 32 GB of system RAM is a practical floor for comfortable Flux.1 Dev usage on GPUs below 12 GB.
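
The guidance in this section can be condensed into a small lookup. This is a rough sketch, assuming the low end of each VRAM range in the table as the minimum and the 32 GB / 12 GB figures from the paragraph above; the names `FLUX_MIN_VRAM_GB` and `flux_options` are made up for the example, and the numbers are community estimates rather than hard limits.

```python
# Approximate minimum GPU VRAM (GB) per format: the low end of each range
# in the table above. Rough community figures, not guarantees.
FLUX_MIN_VRAM_GB = {
    "FP16": 24,
    "FP8": 11,
    "NF4": 6,
    "GGUF Q4": 6,
}

RAM_FLOOR_GB = 32         # suggested system RAM floor for small GPUs
SMALL_GPU_CUTOFF_GB = 12  # "GPUs below 12 GB" in the text


def flux_options(vram_gb, ram_gb):
    """Return (formats whose table figure fits in VRAM, RAM-looks-tight flag)."""
    fits = [name for name, need in FLUX_MIN_VRAM_GB.items() if vram_gb >= need]
    ram_tight = vram_gb < SMALL_GPU_CUTOFF_GB and ram_gb < RAM_FLOOR_GB
    return fits, ram_tight
```

For example, `flux_options(8, 16)` returns `(["NF4", "GGUF Q4"], True)`: both 4-bit formats fit, but 16 GB of system RAM is below the suggested floor. A format missing from the list is not necessarily unusable, since Forge can offload layers to system RAM; it only means the table's figure exceeds the card's VRAM.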

Generation Speed

Community benchmarks put Forge's SDXL speed 30–75% faster than A1111, depending on hardware configuration and the number of active LoRAs. One benchmark on an RTX 3090 at 1024×1024 SDXL with five concurrent LoRAs clocked A1111 at 1 minute 45 seconds vs Forge at 1 minute 10 seconds: a 33% reduction in generation time, which is equivalent to 50% higher throughput. With fewer LoRAs and simpler configurations, some users report 50–75% improvements.

Against ComfyUI: a separate benchmark on an A6000 measured ComfyUI at 5.35 it/s vs Forge at 4.9 it/s for SDXL — Forge runs about 8–9% slower than ComfyUI's optimized pipeline. This is the expected tradeoff for retaining an extension-compatible frontend.
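
"Faster" can mean less time per image or more images per unit time, and the two give different percentages for the same measurement. The short sketch below recomputes the figures from the two benchmarks above; the helper names are illustrative, and the inputs are the numbers quoted in the text.

```python
def time_reduction(baseline_s, new_s):
    """Fraction of wall-clock time saved relative to the baseline."""
    return (baseline_s - new_s) / baseline_s


def throughput_gain(baseline_s, new_s):
    """Fractional increase in images per unit time relative to the baseline."""
    return baseline_s / new_s - 1.0


def relative_slowdown(reference_its, measured_its):
    """Fraction by which measured it/s falls short of the reference it/s."""
    return (reference_its - measured_its) / reference_its


a1111_s, forge_s = 105, 70  # 1 min 45 s vs 1 min 10 s
print(f"{time_reduction(a1111_s, forge_s):.1%}")   # 33.3%
print(f"{throughput_gain(a1111_s, forge_s):.1%}")  # 50.0%
print(f"{relative_slowdown(5.35, 4.9):.1%}")       # 8.4%
```

When comparing community numbers, it is worth checking which of these definitions a given benchmark uses before lining it up against the 30–75% range.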

For Flux.1 Dev on an RTX 3090 (24 GB) running FP8, expect 20–40 seconds per 1024×1024 image. On 8 GB NF4 hardware, 60–120 seconds is a realistic expectation. Neither number is impressive against cloud inference, but for local generation with no per-image cost, it's the current state of the technology.

Extension Compatibility

Forge maintains backward compatibility with most A1111 extensions, but "most" is doing real work in that sentence.

What works reliably:

ControlNet: Built into Forge directly — no separate extension needed. The integrated version is faster than A1111's ControlNet extension. Adding ControlNet to an SDXL generation in Forge runs 30–45% faster than A1111 + ControlNet extension, per community benchmarks.
ADetailer: Functions normally
LoRA / LyCORIS: Full support, same model format as A1111
Most prompt and aesthetic extensions: Negative prompt tools, style selectors, regional prompters

What breaks or has limitations:

Batch ControlNet operations: Forge's integrated ControlNet is missing batch processing features from the standalone A1111 extension
Extensions that hook into A1111's sampling pipeline at a low level: These break because Forge replaced that pipeline
Approximately 20% of A1111 extensions: Either don't function or fail silently without error messages

The silent failure mode is the frustrating part. An extension that loads without an error but does nothing is harder to debug than a clear crash. Before switching, check the Forge Extension List and Extension Replacement List on GitHub: they document specific incompatibilities and recommend Forge-compatible replacements for common extensions.

Forge Forks

The upstream Forge project is maintained by lllyasviel but has described itself as "experimental" since launch. Two community forks have emerged and are actively maintained in 2026:

reForge (Panchovix): Prioritizes stability and broader hardware support. Better support for older NVIDIA cards (GTX 10xx and RTX 20xx series) and AMD via DirectML. If you're on hardware that Forge's CUDA-centric optimizations don't target well, reForge is worth testing first.

Forge Neo (Haoming02, developed as the neo branch of the Forge Classic repository): Continues the Gradio 4 UI path with ongoing UI improvements and expanded model support. More actively maintained for UI-layer features than upstream Forge.

The upstream Forge repository remains the most referenced starti