Decoding the Shift and Diffusion Models Training Like Qwen Image, FLUX, SDXL, and More
Post by: SECourses: FLUX, Tutorials, Guides, Resources, Training, Scripts
Starting this week, I hope to focus on a Qwen Image training tutorial and 1-click installers with a GUI and presets, so here is some important background. You don't need to know, learn, or understand any of this; it is for people who want to learn and understand more.
This article pretty much applies to all diffusion models like FLUX, Stable Diffusion 3, Hunyuan-DiT, PixArt-α & Σ, Stable Diffusion 1.5/2.1, SDXL, etc.
Section 1: The Basics of Training - Teaching an AI to Denoise
At its core, training a diffusion model is like teaching an artist to restore a damaged painting. The process works like this:
- Take a clean image from your dataset.
- Add a random amount of noise to it. The amount of noise is determined by a "timestep," t, where t=0 is a clean image and t=1 is pure noise.
- Show the model the noisy image, its corresponding timestep, and a text caption describing the original.
- Ask the model to predict what the clean, original image looked like.
- Measure the error between the model's prediction and the actual clean image.
- Update the model's weights slightly to reduce that error in the future.
By repeating this process millions of times with random images and random noise levels, the model learns to generate entirely new images from scratch.
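The noising and error-measuring steps above can be sketched in a few lines. This is only an illustration: the linear blend between image and noise is the convention used by Flow Matching models (older models mix noise differently), the 4-pixel "image" is made up, and the stand-in "model" simply returns its noisy input.

```python
import random

def add_noise(clean, noise, t):
    """Blend a clean sample with noise: t=0 returns clean, t=1 returns pure noise."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]

def mse(pred, target):
    """Mean squared error between a prediction and the clean target."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
clean = [0.2, 0.8, 0.5, 0.1]                  # a toy 4-pixel "image"
noise = [rng.gauss(0.0, 1.0) for _ in clean]  # random Gaussian noise
t = rng.random()                              # random timestep in [0, 1)
noisy = add_noise(clean, noise, t)

# Stand-in for the model: it just hands back the noisy input as its guess.
prediction = noisy
loss = mse(prediction, clean)                 # the error a trainer would reduce
```

A real trainer would replace the stand-in with the network's output and backpropagate the loss.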
Section 2: The Problem with Uniformity - Introducing the 'Shift'
A simple approach would be to pick a random timestep for each training step with a uniform probability. However, this is inefficient. The model quickly learns to handle low-noise images (e.g., denoising from t=0.1) but struggles with the much harder task of creating structure from high noise (e.g., t=0.9).
To make training more effective, modern techniques like Flow Matching (used by models like FLUX and Qwen-Image) introduce a timestep sampling shift.
The --timestep_sampling shift parameter tells the trainer to bias its random selection towards the noisier end of the spectrum. Instead of treating all noise levels equally, it forces the model to spend more time practicing the most difficult problems. This leads to a more robust model that learns the fundamental structure of images much faster.
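To see what such a bias does, here is a minimal sketch that starts from uniform draws and remaps them. It assumes the shift is applied as t' = s*t / (1 + (s - 1)*t), a common form in open-source trainers; the function names and the shift value 3.0 are only for illustration.

```python
import random

def shift_timestep(t, shift):
    """Remap t in [0, 1]; values move toward 1 (more noise) when shift > 1."""
    return (shift * t) / (1.0 + (shift - 1.0) * t)

def sample_timesteps(n, shift, seed=0):
    """Draw n uniform timesteps and apply the shift to each."""
    rng = random.Random(seed)
    return [shift_timestep(rng.random(), shift) for _ in range(n)]

uniform = sample_timesteps(10_000, shift=1.0)   # shift=1 leaves t unchanged
shifted = sample_timesteps(10_000, shift=3.0)   # same draws, pushed toward noise

mean_uniform = sum(uniform) / len(uniform)
mean_shifted = sum(shifted) / len(shifted)
```

With the same random draws, every shifted timestep is at least as large as its unshifted counterpart, so the average noise level seen in training goes up.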
Section 3: A Tale of Two Shifts - The Bug and the Breakthrough
- FLUX Model (Linear Shift): This model uses a simple fixed shift. The value provided (--discrete_flow_shift), call it s, remaps each sampled timestep t to s*t / (1 + (s - 1)*t), pushing the distribution of noise levels higher when s > 1.
- Qwen-Image Model (Exponential Shift): This model uses a more complex, resolution-dependent method. It calculates a value μ (mu) based on the image's resolution and then applies the same kind of remapping with exp(μ) as the shift factor.
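A quick numeric check helps compare the two styles. The sketch below assumes the fixed shift is written as s*t / (1 + (s - 1)*t) and the exponential one as exp(μ) / (exp(μ) + (1/t - 1)); both function names and the sample value of μ are illustrative, not taken from any trainer.

```python
import math

def fixed_shift(t, s):
    """Fixed shift with factor s (assumed form)."""
    return (s * t) / (1.0 + (s - 1.0) * t)

def exponential_shift(t, mu):
    """Exponential shift driven by mu (assumed form); requires 0 < t <= 1."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0))

mu = 0.5  # arbitrary example value
gaps = [
    abs(fixed_shift(t, math.exp(mu)) - exponential_shift(t, mu))
    for t in (0.1, 0.25, 0.5, 0.75, 0.9)
]
max_gap = max(gaps)  # stays at floating-point noise level
```

Under these assumptions the two forms give the same timestep whenever s = exp(μ), so the practical difference is where the factor comes from: a fixed flag or the image resolution.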
Section 4: Unveiling the Magic Number: Why 2.205?
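Qwen-Image derives μ from the image sequence length (the number of latent patches the transformer sees) by linear interpolation. Assuming the scheduler settings base_shift = 0.5 and max_shift = 0.9 over sequence lengths 256 to 8192, and one patch per 16x16 pixel block, the formula for μ is:
μ = 0.5 + (seq_len - 256) * (0.9 - 0.5) / (8192 - 256)
For Qwen-Image's standard high-resolution outputs (1664x928), the calculation is:
- seq_len = (1664 / 16) * (928 / 16) = 104 * 58 = 6032
- μ = 0.5 + 5776 * 0.4 / 7936
- μ ≈ 0.5 + 0.2911
- μ ≈ 0.7911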
The shift factor is exp(μ):
exp(0.7911) ≈ 2.205
This means that at this resolution, Qwen-Image's exponential shift is mathematically equivalent to a simple fixed shift of 2.205. By setting --discrete_flow_shift 2.205, users could emulate the official training conditions for images of that size.
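The number can be reproduced with a short script. This is a sketch under assumptions: μ is taken to be a linear interpolation over the latent sequence length with base_shift 0.5, max_shift 0.9, and sequence lengths 256 to 8192, with one token per 16x16 pixel block. The helper name qwen_mu is hypothetical.

```python
import math

def qwen_mu(width, height, base_len=256, max_len=8192,
            base_shift=0.5, max_shift=0.9):
    """Hypothetical helper: interpolate mu from the number of latent patches."""
    # Assumption: 8x VAE downsampling followed by 2x2 patching, so 16 px per token side.
    seq_len = (width // 16) * (height // 16)
    slope = (max_shift - base_shift) / (max_len - base_len)
    return base_shift + slope * (seq_len - base_len)

mu = qwen_mu(1664, 928)   # 104 * 58 = 6032 tokens
shift = math.exp(mu)      # the equivalent fixed shift factor
print(f"mu = {mu:.4f}, shift = {shift:.4f}")
```

Other resolutions give other factors, which is why a single fixed value only matches the dynamic shift at one image size.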
Section 5: A Single GPU Training Step in Action
So, how does this all come together in one step for a single image?
This cycle, biased by the crucial 2.205 shift, is the engine of learning.
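As a rough illustration of one such cycle, the sketch below runs a single training step on a toy problem. Everything here is a stand-in: the "model" is a single weight w that scales the noisy input, the noise values are fixed so the run is repeatable, and the shift remapping s*t / (1 + (s - 1)*t) is an assumed form.

```python
import random

SHIFT = 2.205  # fixed shift value discussed in the article

def shift_timestep(t, shift):
    # Assumed remapping; pushes t toward 1 (more noise) when shift > 1.
    return (shift * t) / (1.0 + (shift - 1.0) * t)

def loss_and_grad(w, noisy, clean):
    # Toy model: prediction = w * noisy. Returns the MSE and its derivative w.r.t. w.
    residuals = [w * x - c for x, c in zip(noisy, clean)]
    loss = sum(r * r for r in residuals) / len(clean)
    grad = sum(2.0 * r * x for r, x in zip(residuals, noisy)) / len(clean)
    return loss, grad

rng = random.Random(0)
clean = [0.2, 0.8, 0.5, 0.1]               # 1. clean sample from the dataset
noise = [0.9, -1.4, -0.7, 0.4]             # fixed stand-in for random noise
t = shift_timestep(rng.random(), SHIFT)    # 2. random timestep, biased by the shift
noisy = [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]  # 3. noisy input

w = 1.0                                    # the toy model's only weight
loss_before, grad = loss_and_grad(w, noisy, clean)  # 4-5. predict and measure error
w -= 0.1 * grad                            # 6. small gradient-descent update
loss_after, _ = loss_and_grad(w, noisy, clean)
```

A real step differs mainly in scale: the weight becomes billions of parameters, the prediction is conditioned on the timestep and caption, and an optimizer such as AdamW performs the update.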
Example Similar Models
The combination of a Diffusion Transformer (DiT) architecture, a Flow Matching (or Rectified Flow) training objective, and biased timestep sampling represents the cutting edge of generative AI. This combination has proven to be more scalable and efficient than the U-Net architecture of earlier models like Stable Diffusion 1.5.
The New Wave of AI: Image and Video Models Built on Transformer and Flow Matching Logic
The architecture pioneered by models like Qwen-Image and FLUX is defining the next generation of generative AI. It's a significant departure from the U-Net architecture that dominated for years. Here are the models that share this new DNA.
The core logic consists of three main pillars:
- Architecture: Diffusion Transformer (DiT). Instead of a U-Net, these models use a Transformer to process patches of latent space, allowing them to scale more effectively with more parameters and training data.
- Training Objective: Flow Matching / Rectified Flow. A more modern and efficient training method that defines a straight path from noise to data, making it easier for the model to learn.
- Sampling Strategy: Biased Timestep Sampling (the "Shift"). Commonly paired with Flow Matching, this technique focuses training on the most difficult, high-noise timesteps to accelerate learning.
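The "straight path" in the second pillar can be made concrete. In this sketch (toy numbers, illustrative function names), the point at time t lies on the line between data and noise, and the quantity a Rectified Flow style model is typically trained to predict is the constant direction of that line.

```python
def point_on_path(data, noise, t):
    """Position at time t on the straight line from data (t=0) to noise (t=1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]

def velocity_target(data, noise):
    """Direction of the path; it is the same at every t because the path is straight."""
    return [n - d for d, n in zip(data, noise)]

data = [0.2, 0.8, 0.5]
noise = [1.0, -0.5, 0.3]

v = velocity_target(data, noise)
x_half = point_on_path(data, noise, 0.5)

# Walking back along the line from t=0.5 by 0.5 * v lands on the data point again.
recovered = [x - 0.5 * vi for x, vi in zip(x_half, v)]
```

Because the direction does not change along the path, a sampler can in principle take fewer, larger steps than with a curved trajectory.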
Category 1: The Direct Successors (DiT + Flow Matching)
These models are the most direct relatives of Qwen-Image and follow the same fundamental principles.
Category 2: Architectural Cousins (DiT with other Objectives)
These models adopted the Diffusion Transformer architecture but may use different or less publicized training objectives. They are part of the same architectural family.
Category 3: The Logic Extended to Video
The DiT and Flow Matching concepts are not limited to images. The idea of treating media as a sequence of "patches" or "tokens" is perfectly suited for video, where the patches exist in both space and time.
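A simple token count shows how the same patch idea extends from images to video. The patch sizes and latent dimensions below are made-up illustrative values, not the settings of any particular model.

```python
def token_count(height, width, frames=1, patch=2, temporal_patch=1):
    """Number of patches (tokens) for a latent of the given size.

    An image is treated as a video with a single frame.
    """
    return (frames // temporal_patch) * (height // patch) * (width // patch)

image_tokens = token_count(height=64, width=64)             # 32 * 32 spatial patches
video_tokens = token_count(height=64, width=64, frames=8)   # same grid, 8 time steps
```

The transformer itself does not change; the video case simply feeds it a longer sequence, which is why sequence length becomes the main cost driver.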
Category 4: The Predecessors (For Context and Comparison)
These models do not use the same logic but are essential for understanding the technological shift.
Summary of the Trend
The pattern is clear: the high-performance generative models of the current and next generation are converging on the Diffusion Transformer (DiT) architecture. Its ability to scale effectively and handle multimodal inputs (text, image patches, video frames) makes it superior to the older U-Net for building large, powerful foundation models. The adoption of more efficient training objectives like Flow Matching and strategies like biased timestep sampling is what unlocks the full potential of this new architecture.
Conclusion: The Lesson Learned
The story of PR #408 is a masterclass in the subtleties of AI model training and the power of open-source collaboration. It teaches us three key lessons:
- Hyperparameters Matter Profoundly: A single, seemingly small parameter can be the difference between a state-of-the-art model and a useless one.
- Replicating Training Conditions is Key: When fine-tuning, matching the unique conditions of the original model's pre-training, like its specific timestep sampling strategy, is non-negotiable.
- Community is the Ultimate Debugger: The rapid identification and resolution of this issue were only possible because multiple experts collaborated in the open.
The next time you train a model, remember the story of the 2.205 shift. It’s a powerful reminder that sometimes, the biggest breakthroughs are hidden in the smallest details.
The Amazing & Masterpiece Training Tutorials We Have Historically
The tutorials below are NOT up to date; however, they are a goldmine for learning how training worked and progressed over time.
- Transform Your Selfie into a Stunning AI Avatar with Stable Diffusion - Better than Lensa for Free (16 December 2022 - 59 minutes) : https://youtu.be/mnCY8uM7E50
- How To Do Stable Diffusion LORA Training By Using Web UI On Different Models - Tested SD 1.5, SD 2.1 (31 December 2022 - 58 minutes) : https://youtu.be/mfaqqL5yOO4
- Zero To Hero Stable Diffusion DreamBooth Tutorial By Using Automatic1111 Web UI - Ultra Detailed (10 January 2023 - 100 minutes) : https://youtu.be/Bdl-jWR3Ukc
- How To Do Stable Diffusion Textual Inversion (TI) / Text Embeddings By Automatic1111 Web UI Tutorial (20 January 2023 - 72 minutes) : https://youtu.be/dNOpWt-epdQ
- Automatic1111 Stable Diffusion DreamBooth Guide: Optimal Classification Images Count Comparison Test (26 February 2023 - 30 minutes) : https://youtu.be/Tb4IYIYm4os
- Epic Web UI DreamBooth Update - New Best Settings - 10 Stable Diffusion Training Compared on RunPods (4 March 2023 - 60 minutes) : https://youtu.be/sRdtVanSRl4
- Training Midjourney Level Style And Yourself Into The SD 1.5 Model via DreamBooth Stable Diffusion (20 March 2023 - 18 minutes) : https://youtu.be/m-UVVY_syP0
- Generate Studio Quality Realistic Photos By Kohya LoRA Stable Diffusion Training - Full Tutorial (28 April 2023 - 45 minutes) : https://youtu.be/TpuDOsuKIBo
- How To Install And Use Kohya LoRA GUI / Web UI on RunPod IO With Stable Diffusion & Automatic1111 (16 May 2023 - 14 minutes) : https://youtu.be/3uzCNrQao3o
- How To Install DreamBooth & Automatic1111 On RunPod & Latest Libraries - 2x Speed Up - cudDNN - CUDA (18 June 2023 - 14 minutes) : https://youtu.be/c_S2kFAefTQ
- The END of Photography - Use AI to Make Your Own Studio Photos, FREE Via DreamBooth Training (2 July 2023 - 42 minutes) : https://youtu.be/g0wXIcRhkJk
- First Ever SDXL Training With Kohya LoRA - Stable Diffusion XL Training Will Replace Older Models (18 July 2023 - 40 minutes) : https://youtu.be/AY6DMBCIZ3A
- Become A Master Of SDXL Training With Kohya SS LoRAs - Combine Power Of Automatic1111 & SDXL LoRAs (10 August 2023 - 85 minutes) : https://youtu.be/sBFGitIvD2A
- How To Do SDXL LoRA Training On RunPod With Kohya SS GUI Trainer & Use LoRAs With Automatic1111 UI (13 August 2023 - 33 minutes) : https://youtu.be/-xEwaQ54DI4
- How to Do SDXL Training For FREE with Kohya LoRA - Kaggle - NO GPU Required - Pwns Google Colab (2 September 2023 - 49 minutes) : https://youtu.be/JF2P7BIUpIU
- How To Do Stable Diffusion XL (SDXL) DreamBooth Training For Free - Utilizing Kaggle - Easy Tutorial (24 November 2024 - 52 minutes) : https://youtu.be/16-b1AjvyBE
The Below Tutorials Are Still Up-To-Date and Recommended
Their links, scripts, configs, and workflows are all updated and maintained, and are still fully up to date as of 11 August 2025.
- Full Stable Diffusion SD & XL Fine Tuning Tutorial With OneTrainer On Windows & Cloud - Zero To Hero (9 April 2024 - 133 minutes) : https://youtu.be/0t5l6CP9eBg
- FLUX LoRA Training Simplified: From Zero to Hero with Kohya SS GUI (8GB GPU, Windows) Tutorial Guide (29 August 2024 - 68 minutes) : https://youtu.be/nySGu12Y05k
- Blazing Fast & Ultra Cheap FLUX LoRA Training on Massed Compute & RunPod Tutorial - No GPU Required! (4 September 2024 - 84 minutes) : https://youtu.be/-uhL2nW7Ddw
- FLUX Full Fine-Tuning / DreamBooth Training Master Tutorial for Windows, RunPod & Massed Compute (22 October 2024 - 169 minutes) : https://youtu.be/FvpWy1x5etM
- This is the very best training tutorial at the moment, at least until Qwen Image model training arrives
 
The Amazing & Masterpiece Requirements and Environment Setup Tutorials We Have Historically
I highly recommend watching all of them and following the last one exactly, using the up-to-date source article and the links inside it. All 3 videos and their sources are 100% public.
- How To Install Python, Setup Virtual Environment VENV, Set Default Python System Path & Install Git (15 April 2023 - 21 minutes) : https://youtu.be/B5U7LJOvH6g
- Essential AI Tools and Libraries: A Guide to Python, Git, C++ Compile Tools, FFmpeg, CUDA, PyTorch (15 December 2023 - 34 minutes) : https://youtu.be/-NjNy7afOQ0
- How to Install Python, CUDA, cuDNN, C++ Build Tools, FFMPEG & Git Tutorial for AI Applications (2 October 2024 - 36 minutes) : https://youtu.be/DrhUHnYfwC0
- This tutorial and its links are still up to date, and I use exactly the same configuration as its source article