Thurmon Demich

Posted on • Originally published at bestgpuforai.com

Best GPU for Kohya_ss LoRA Training in 2026 (Ranked)

This article was originally published on Best GPU for AI. The full version with interactive tools, FAQ, and live pricing is on the original site.

The right GPU for Kohya_ss depends on what you are training. LoRA fine-tuning for Stable Diffusion XL or Flux.1, DreamBooth for character consistency, and full model fine-tuning all have different hardware demands. This guide breaks down what you actually need for each scenario.

Quick answer: For LoRA training, 16GB VRAM hits the sweet spot. The RTX 4090 is the top pick for training speed; the RTX 5090 is faster still but costs more. The RTX 4060 Ti 16GB is the best budget pick for VRAM headroom. The RTX 3060 12GB works for light LoRA but gets tight fast.

VRAM needs by training task

| Training task | Minimum VRAM | Recommended VRAM | Notes |
|---|---|---|---|
| SD 1.5 LoRA | 6GB | 8GB | Any modern GPU works |
| SD XL LoRA | 10GB | 12GB+ | 8GB requires aggressive gradient checkpointing |
| Flux.1 LoRA | 16GB | 24GB | Flux is memory-hungry |
| DreamBooth SD XL | 16GB | 24GB | Higher batch sizes need more VRAM |
| DreamBooth Flux.1 | 24GB | 32GB | Very demanding |
| Full model fine-tune | 24GB+ | 40GB+ | Rarely done on consumer hardware |
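
The table above can be read as a simple lookup. The sketch below is only an illustration of that reading, not part of Kohya_ss: the numbers are copied from the table, while the function name `vram_fit` and the three labels are assumptions made up for this example.

```python
# (minimum, recommended) VRAM in GB, copied from the table above.
# "12GB+", "24GB+" and "40GB+" are stored as their lower bounds.
VRAM_NEEDS_GB = {
    "SD 1.5 LoRA": (6, 8),
    "SD XL LoRA": (10, 12),
    "Flux.1 LoRA": (16, 24),
    "DreamBooth SD XL": (16, 24),
    "DreamBooth Flux.1": (24, 32),
    "Full model fine-tune": (24, 40),
}


def vram_fit(task: str, gpu_vram_gb: int) -> str:
    """Classify a card against the table: 'comfortable', 'tight' or 'too small'.

    The labels are hypothetical names for this sketch:
    at or above the recommended figure -> comfortable,
    between minimum and recommended    -> tight,
    below the minimum                  -> too small.
    """
    minimum, recommended = VRAM_NEEDS_GB[task]
    if gpu_vram_gb >= recommended:
        return "comfortable"
    if gpu_vram_gb >= minimum:
        return "tight"
    return "too small"
```

For example, `vram_fit("Flux.1 LoRA", 16)` lands in the "tight" band, which matches the guidance that 16GB works for Flux.1 only with memory-saving settings.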

GPU recommendations by scenario

Scenario 1: Flux.1 LoRA training (most demanding popular task)

Flux.1 LoRA training in Kohya_ss is the current standard for high-quality character and style training. It needs 16GB minimum and runs best with 24GB.

Top pick: RTX 4090 — 24GB GDDR6X trains Flux.1 LoRAs comfortably. A typical 1500-step run completes in 20-30 minutes. Fast enough to iterate quickly.

Value pick: RTX 4070 Ti Super — 16GB is tight for Flux.1 but works with gradient checkpointing enabled. Training takes 50-80% longer than the 4090.

Scenario 2: SD XL LoRA training (most common task)

SD XL LoRA is forgiving. 12GB VRAM handles it, and 16GB gives comfortable headroom for higher resolution or larger batch sizes.

Top pick: RTX 4090 — Fastest training times, excellent VRAM headroom.

Value pick: RTX 4060 Ti 16GB — 16GB GDDR6 is exactly right for SD XL LoRA. Significantly cheaper than the 4090. Training is slower but perfectly usable for personal projects.

Scenario 3: Budget SD 1.5 / SD 2.1 training

For older model training, the requirements drop significantly. A 12GB GPU handles everything comfortably.

Budget pick: RTX 3060 12GB — 12GB at the lowest price point. Trains SD 1.5 LoRAs without issues. Struggles with Flux.1 and higher-VRAM tasks, but for basic character LoRA work it gets the job done.

Training time comparison (SD XL LoRA, 1500 steps, batch 1)

| GPU | VRAM | Approx. training time | Relative speed |
|---|---|---|---|
| RTX 4090 | 24GB | ~12 min | 1x (baseline) |
| RTX 4070 Ti Super | 16GB | ~20 min | 0.6x |
| RTX 4060 Ti 16GB | 16GB | ~30 min | 0.4x |
| RTX 3060 12GB | 12GB | ~55 min | 0.22x |

Estimates at 1024x1024 resolution with latent caching enabled. Flux.1 LoRA times are 2-3x longer across all GPUs.
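
The relative-speed column is just the baseline time divided by each card's time, and the Flux.1 note is a 2-3x multiplier on the same figures. A minimal sketch of that arithmetic follows; the minute values come from the table, and the function names are made up for illustration.

```python
# Approximate SD XL LoRA times in minutes (1500 steps, batch 1), from the table above.
SDXL_MINUTES = {
    "RTX 4090": 12,
    "RTX 4070 Ti Super": 20,
    "RTX 4060 Ti 16GB": 30,
    "RTX 3060 12GB": 55,
}


def relative_speed(gpu: str, baseline: str = "RTX 4090") -> float:
    """Speed relative to the baseline card: baseline time / this card's time."""
    return round(SDXL_MINUTES[baseline] / SDXL_MINUTES[gpu], 2)


def flux_minutes_range(gpu: str) -> tuple[int, int]:
    """Rough Flux.1 LoRA range, applying the article's 2-3x multiplier."""
    minutes = SDXL_MINUTES[gpu]
    return (2 * minutes, 3 * minutes)
```

On the RTX 4090 this gives a Flux.1 estimate of 24 to 36 minutes, in the same ballpark as the 20-30 minutes quoted in Scenario 1. Both figures are estimates, so treat the result as an order-of-magnitude guide.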

What about the RTX 5090?

The RTX 5090 trains faster than the 4090, roughly 1.5-2x depending on the task. For pure training throughput, it is the fastest consumer option. But the 4090 is already fast enough that the extra speed rarely justifies paying $400 or more extra for personal training work.

See also: Best GPU for LoRA training, Best GPU for fine-tuning, and Best GPU for DreamBooth.

Which GPU should YOU buy?

  • Training Flux.1 LoRAs regularly? RTX 4090 (24GB) is the minimum comfortable option. The 4060 Ti 16GB works but is slow.
  • Mostly SD XL LoRA training? RTX 4060 Ti 16GB is the best value — 16GB VRAM, much cheaper than the 4090.
  • On a tight budget running SD 1.5? RTX 3060 12GB works. Expect slower training times.
  • Professional training pipeline with iteration speed critical? RTX 4090 or RTX 5090 (if budget allows).
  • Training infrequently? Cloud GPUs are worth considering — RunPod hourly rates often beat buying hardware for light use.
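
For the cloud-versus-buy question in the last bullet, a back-of-the-envelope break-even can help. The sketch below is an assumption-laden illustration: the function names are hypothetical, you supply the prices yourself, and it ignores electricity, resale value, and any speed difference between the cloud GPU and the card you would buy.

```python
import math


def break_even_hours(gpu_price_usd: float, cloud_rate_usd_per_hour: float) -> float:
    """Hours of cloud rental that cost as much as buying the card outright."""
    if cloud_rate_usd_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return gpu_price_usd / cloud_rate_usd_per_hour


def runs_to_break_even(
    gpu_price_usd: float, cloud_rate_usd_per_hour: float, minutes_per_run: float
) -> int:
    """Number of training runs of a given length before buying becomes cheaper."""
    hours = break_even_hours(gpu_price_usd, cloud_rate_usd_per_hour)
    return math.ceil(hours * 60 / minutes_per_run)
```

With placeholder inputs of a 1600 USD card and a 0.50 USD/hour rental (illustrative values, not quoted prices), the break-even is 3200 rental hours. Plug in current prices before drawing any conclusion.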

Common mistakes to avoid

  • Buying an 8GB GPU to save money, then hitting constant out-of-memory errors in Kohya_ss: 16GB is the practical minimum for Flux.1 LoRA and DreamBooth work
  • Running Flux.1 LoRA training on 12GB and expecting a smooth experience — it technically runs but barely
  • Ignoring gradient checkpointing settings — enabling them on a 16GB GPU can mean the difference between a training run working or failing
  • Using a slow HDD for dataset storage — Kohya_ss reads training images repeatedly, and slow storage adds meaningful time across thousands of steps
  • Skipping the latent caching step: caching latents before training dramatically speeds up LoRA runs and is often overlooked by beginners
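
Two of the mistakes above, gradient checkpointing and latent caching, come down to training options. The sketch below assembles a command line for the kohya-ss sd-scripts SD XL LoRA trainer that the Kohya_ss GUI wraps. The script name and flag names reflect my understanding of sd-scripts and should be checked against your installed version. The paths are placeholders, the helper name is hypothetical, and the 16GB threshold is an assumption taken from this article's advice rather than a rule of the tool.

```python
from typing import List


def build_sdxl_lora_args(
    vram_gb: int, model_path: str, data_dir: str, out_dir: str
) -> List[str]:
    """Build an argument list for an SD XL LoRA run (illustrative, not exhaustive)."""
    args = [
        "accelerate", "launch", "sdxl_train_network.py",
        "--pretrained_model_name_or_path", model_path,
        "--train_data_dir", data_dir,
        "--output_dir", out_dir,
        "--network_module", "networks.lora",
        "--resolution", "1024,1024",
        "--train_batch_size", "1",
        "--max_train_steps", "1500",
        # Encode each training image to latents once instead of on every step.
        "--cache_latents",
    ]
    # Assumption from the article: on cards with 16GB or less, trade some
    # speed for lower memory use by enabling gradient checkpointing.
    if vram_gb <= 16:
        args.append("--gradient_checkpointing")
    return args
```

The list can be passed to `subprocess.run(args)` from the sd-scripts directory, or used as a checklist of the equivalent checkboxes in the GUI.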

Final verdict

| Use case | Best pick | Budget pick |
|---|---|---|
| Flux.1 LoRA | RTX 4090 | RTX 4070 Ti Super |
| SD XL LoRA | RTX 4090 | RTX 4060 Ti 16GB |
| SD 1.5 LoRA | RTX 4060 Ti 16GB | RTX 3060 12GB |
| DreamBooth | RTX 4090 | RTX 4060 Ti 16GB |

For most Kohya_ss users, the RTX 4060 Ti 16GB offers the best balance of VRAM capacity and price. If you train Flux.1 seriously, save up for the RTX 4090.

In Kohya_ss, VRAM capacity determines what you can run. Training speed determines how fast you can iterate. Both matter.
