shinji shimizu

One of the First Public HiDream-O1-Image LoRAs — and How to Train Your Own

TL;DR

HiDream-O1-Image is one of the strongest open-weight text-to-image models out right now (it debuted around #8 in the Artificial Analysis T2I Arena). But it shipped inference-only, and because its architecture is radically different from SDXL/Flux — no VAE, no separate text encoder, everything is one unified transformer — the usual LoRA trainers can't touch it.

This post documents one of the first public LoRA training runs for HiDream-O1-Image, and one of the first general-purpose visual-enhancement LoRAs for it. I'll show why the standard trainers (kohya, ai-toolkit, SimpleTuner) don't fit, how I reverse-engineered a working training loop from the inference code alone, and the ~150-line trainer that produces a clean aesthetic LoRA, plus the gotchas that cost me a night.

What this LoRA is: a general-purpose anime / semi-real visual enhancement LoRA — it improves rendering quality, lighting, and stylization across diverse subjects with a trigger phrase. It's not a character LoRA, not a single-style LoRA, and not a model-distillation artifact.

The short version of the recipe:

  • The model's output head predicts the clean image x0 (in patch space, [-1,1]).
  • Build the noised input as z_t = (1 - σ)·x0 + σ·(8.0·ε) and feed the model timestep 1 - σ.
  • Loss is just MSE(x_pred, x0) on the image-token positions.
  • LoRA attaches via plain PEFT to the language-model decoder linears, because the backbone is a stock HF Qwen3-VL.

Prior art (what existed before this)

To set expectations honestly: I'm not claiming "world's first LoRA file for O1."

  • Kijai published a ComfyUI workflow for HiDream-O1 that includes a distill LoRA — it extracts the Dev-2604 model's behavior as a LoRA applied to the Base model. That's a model-compression technique, not a visual-style LoRA trained on external images.
  • Ostris (author of AI Toolkit) has run initial LoRA training tests on HiDream-O1 and ai-toolkit lists O1 as a supported model. No resulting LoRA has been publicly released as of this writing.
  • TechnoEdge (Japanese tech media) reported using a face LoRA with HiDream-O1 Dev, though it's unclear whether that LoRA was purpose-trained for O1 or adapted from elsewhere.

What I didn't find: a publicly released, general-purpose anime / semi-real visual-enhancement LoRA trained specifically for HiDream-O1-Image. If you know of one, I'd genuinely love to see it — the more the merrier. But as of publication, this appears to be one of the first, and the first with before/after documentation and a full open training recipe.

Why no trainer exists: the architecture

Most LoRA trainers assume the SDXL/Flux shape: a UNet/DiT denoiser + a VAE + one or two text encoders, all separate modules wired together by diffusers. You patch LoRA into the UNet/DiT attention, freeze the rest, and the trainer knows how to encode images to latents and text to embeddings.

HiDream-O1-Image is a Pixel-level Unified Transformer (UiT). From its own description:

a natively unified image generative foundation model built on a Pixel-level Unified Transformer without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space.

Concretely (reading models/qwen3_vl_transformers.py):

  • The backbone is a Qwen3VLForConditionalGeneration — a stock Hugging Face Qwen3-VL multimodal transformer.
  • There is no VAE. Images are patchified directly: PATCH_SIZE = 32, so an H×W image becomes (H/32)·(W/32) tokens, each a 3·32·32 = 3072-dim vector of raw pixels.
  • A small x_embedder projects the noised patch tokens into the hidden space; a final_layer2 head projects hidden states back to patch space; a t_embedder injects the timestep at a dedicated <|tms_token|> position.
  • It's trained with flow matching (fm_solvers_unipc.py), and image tokens get full (bidirectional) attention while text tokens stay causal (this is what token_types controls).
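To make the "no VAE, just patches" point concrete, here is a minimal NumPy sketch of patchifying an image into 3072-dim tokens and back. PATCH_SIZE = 32 comes from the repo; the pixel ordering inside each token is an assumption on my part (the pipeline's actual einops pattern may order channels and pixels differently), and the function names are illustrative.

```python
import numpy as np

PATCH_SIZE = 32  # from the post; the in-patch pixel ordering below is an assumption


def patchify(image: np.ndarray, patch: int = PATCH_SIZE) -> np.ndarray:
    """Turn a (C, H, W) array into (H/patch * W/patch, C*patch*patch) tokens."""
    c, h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    # (C, gh, p, gw, p) -> (gh, gw, C, p, p) -> (gh*gw, C*p*p)
    x = image.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)
    return x.reshape(gh * gw, c * patch * patch)


def unpatchify(tokens: np.ndarray, h: int, w: int, patch: int = PATCH_SIZE) -> np.ndarray:
    """Inverse of patchify: (N, C*patch*patch) tokens back to (C, H, W)."""
    gh, gw = h // patch, w // patch
    c = tokens.shape[1] // (patch * patch)
    x = tokens.reshape(gh, gw, c, patch, patch)
    x = x.transpose(2, 0, 3, 1, 4)  # back to (C, gh, p, gw, p)
    return x.reshape(c, h, w)


image = np.random.uniform(-1.0, 1.0, size=(3, 64, 96)).astype(np.float32)
tokens = patchify(image)
print(tokens.shape)  # (6, 3072): a 2x3 grid of 32x32 RGB patches
```

The round trip through unpatchify is lossless, which is the whole point: the "latent" here is just the raw pixels, rearranged.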

So the SDXL/Flux-shaped pipelines in kohya, ai-toolkit, and SimpleTuner have nothing to grab: there is no UNet, no VAE, and no separate text encoder for them to hook. That is also why training write-ups are so scarce: it's a new architecture, released inference-only.

The good news: because the backbone is a plain transformers model, the LoRA adapter mechanics are trivial — PEFT injects into the nn.Linears natively. The hard part is the training loop, which the repo doesn't ship. So let's derive it.

Reverse-engineering the training forward from inference

The inference loop (models/pipeline.py:generate_image) tells you everything. Per denoising step it does roughly:

sigma = step_t / 1000.0                       # noise level, in (0, 1]
t_pixeldit = 1.0 - sigma                       # what the model receives as "timestep"
x_pred = model(..., vinputs=z, timestep=t_pixeldit).x_pred
v = (x_pred - z) / sigma                        # ... and -v is fed to the FM scheduler

Two facts fall out of this:

  1. x_pred is the model's prediction of the clean image x0. Work the algebra backwards: if z_t = (1-σ)·x0 + σ·ε then (x_pred - z_t)/σ = x0 - ε = -(ε - x0), and ε - x0 is exactly the rectified-flow velocity the FlowMatch scheduler expects. Consistent ⇒ the head is x0-parameterized.
  2. The noise scale isn't 1. Inference initializes z = NOISE_SCALE · randn with NOISE_SCALE = 8.0, while x0 lives in [-1, 1]. So the interpolation the model was trained on is z_t = (1-σ)·x0 + σ·(8.0·ε).
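A quick numerical check of the two facts above, in plain Python on a handful of scalar "pixels". It assumes a perfect head (x_pred equal to x0) and shows that the inference-time conversion then yields x0 minus the scaled noise, which is the negated rectified-flow velocity. The helper names are mine, not the repo's.

```python
import random

NOISE_SCALE = 8.0  # value quoted from the inference code in the post


def noised_input(x0, eps, sigma):
    """z_t = (1 - sigma) * x0 + sigma * (NOISE_SCALE * eps), elementwise."""
    return [(1.0 - sigma) * a + sigma * NOISE_SCALE * e for a, e in zip(x0, eps)]


def velocity_from_x_pred(x_pred, z_t, sigma):
    """The v = (x_pred - z) / sigma conversion from the inference loop."""
    return [(p - z) / sigma for p, z in zip(x_pred, z_t)]


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # clean pixels in [-1, 1]
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # unit Gaussian noise
sigma = 0.37

z_t = noised_input(x0, eps, sigma)
v = velocity_from_x_pred(x0, z_t, sigma)         # pretend the head is perfect
expected = [a - NOISE_SCALE * e for a, e in zip(x0, eps)]
max_err = max(abs(a - b) for a, b in zip(v, expected))
print(f"max |v - (x0 - 8*eps)| = {max_err:.2e}")
```

If the head were velocity-parameterized instead, dividing by sigma again at inference would make no sense, so this consistency is decent evidence for the x0 reading.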

That gives the entire training step:

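import random
import torch
import torch.nn.functional as F

# T_EPS: lower bound on sigma, so the noise level never reaches exactly 0
sigma = random.uniform(T_EPS, 1.0)
eps   = torch.randn_like(x0)
z_t   = (1.0 - sigma) * x0 + sigma * (NOISE_SCALE * eps)   # NOISE_SCALE = 8.0
t     = torch.tensor([1.0 - sigma])                        # the model receives 1 - sigma

out    = gen(input_ids=ids, position_ids=pos, vinputs=z_t,
             timestep=t, token_types=tt)
x_pred = out.x_pred[0, vinput_mask[0]]      # image-token positions only
loss   = F.mse_loss(x_pred.float(), x0[0].float())   # plain x0-MSE, no weighting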

x0 is just the image, normalized to [-1,1] and patchified with the same einops rearrange the pipeline uses for reference images. The token layout (prompt → <|boi_token|><|tms_token|> → image tokens) is built by reusing the pipeline's own build_t2i_text_sample, so positions and token_types line up with what the forward expects.
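For intuition about what token_types controls, here is a small pure-Python sketch of the mixed attention pattern: text positions attend causally, while image positions also see every other image position. The 0/1 encoding and the function name are hypothetical; the real forward may build its mask differently.

```python
TEXT, IMAGE = 0, 1  # hypothetical encoding of token_types


def build_attention_mask(token_types):
    """mask[i][j] is True when query position i may attend to key position j."""
    n = len(token_types)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i  # everyone may look backwards
            both_image = token_types[i] == IMAGE and token_types[j] == IMAGE
            row.append(causal or both_image)  # image tokens also look forwards
        mask.append(row)
    return mask


# prompt tokens, then image tokens, mirroring the prompt -> image layout
token_types = [TEXT, TEXT, TEXT, IMAGE, IMAGE, IMAGE]
mask = build_attention_mask(token_types)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed grid is lower-triangular over the prompt and a full block over the image tokens, which is why the layout and token_types have to match what the forward expects.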

Uniform σ sampling and unweighted x0-MSE are enough to learn cleanly — no fancy loss weighting needed for a first cut.

Attaching the LoRA

Because the denoiser is model.model.language_model (a stock Qwen3-VL decoder), PEFT targets its attention/MLP linears and freezes everything else:

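import torch
from peft import LoraConfig, get_peft_model

# Full module names of every attention/MLP linear in the language-model decoder;
# the "visual" filter keeps the vision encoder out of the LoRA.
targets = [n for n, m in model.named_modules()
           if isinstance(m, torch.nn.Linear)
           and n.endswith(("q_proj","k_proj","v_proj","o_proj",
                           "gate_proj","up_proj","down_proj"))
           and "language_model" in n and "visual" not in n]

model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=16, target_modules=targets, lora_dropout=0.0, bias="none"))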

That's 252 linears, ~44M trainable params at rank 16. The vision encoder, x_embedder, t_embedder, and final_layer2 stay frozen. One subtlety: PEFT swaps the Linears in place, so a handle grabbed before get_peft_model (gen = model.model) still sees the LoRA layers — convenient for calling the generation forward directly and for model.disable_adapter() A/B renders.
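The 252 / ~44M figures can be sanity-checked with a few lines of arithmetic. The decoder dimensions below are assumptions (the usual Qwen3 8B text config: 36 layers, hidden size 4096, MLP width 12288, 32 query heads and 8 KV heads of dim 128); they are not stated in this post, but they reproduce both numbers.

```python
# Assumed decoder dimensions (usual Qwen3 8B text config); not stated in the post.
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
RANK = 16

Q_OUT = NUM_HEADS * HEAD_DIM       # 4096
KV_OUT = NUM_KV_HEADS * HEAD_DIM   # 1024 (grouped-query attention)

# (in_features, out_features) of each targeted linear in one decoder layer
LAYER_LINEARS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank): rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in LAYER_LINEARS.values())
total_linears = len(LAYER_LINEARS) * NUM_LAYERS
total_params = per_layer * NUM_LAYERS
print(total_linears, total_params)  # 252 43646976
```

Since the count is linear in the rank, halving or doubling r scales the trainable parameters by the same factor.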

Data, captions, and resolution

Resolution is not fixed at 2048. The find_closest_resolution() snapping you see in the pipeline is a quality default (the model is tuned for high res), not an architectural limit — height/width are free as long as they're multiples of 32. Since image tokens scale as (H/32)·(W/32):

resolution    image tokens    relative attention cost
2048²         4096            1 (baseline)
1024²         1024            ~1/16

So I train at 1024: ~4× shorter sequences, and far less VRAM and time per step. The workflow becomes "iterate cheaply at 1024, upscale the keepers." Aspect ratios are left native (each side is snapped to the nearest multiple of 32, batch size 1), so no bucketing is needed, and mixed portrait/landscape actually helps a style LoRA generalize.
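The numbers behind that trade-off are easy to reproduce. The sketch below counts image tokens, estimates the quadratic attention cost relative to 2048² (ignoring the prompt tokens and the non-attention layers), and shows one plausible way to snap a native aspect ratio to multiples of 32 at roughly a 1024² pixel budget. snap32 and fit_resolution are hypothetical helpers, not functions from the repo or necessarily what the trainer does.

```python
import math

PATCH_SIZE = 32


def image_tokens(height: int, width: int) -> int:
    """Number of image tokens for an H x W image: (H/32) * (W/32)."""
    return (height // PATCH_SIZE) * (width // PATCH_SIZE)


def relative_attention_cost(height: int, width: int, base: int = 2048) -> float:
    """Quadratic attention cost relative to a base x base image (image tokens only)."""
    return (image_tokens(height, width) / image_tokens(base, base)) ** 2


def snap32(value: float) -> int:
    """Round to the nearest multiple of 32, never below one patch."""
    return max(PATCH_SIZE, int(value / PATCH_SIZE + 0.5) * PATCH_SIZE)


def fit_resolution(width: int, height: int, target: int = 1024) -> tuple:
    """Keep the aspect ratio, aim for about target*target pixels, snap to x32."""
    scale = target / math.sqrt(width * height)
    return snap32(width * scale), snap32(height * scale)


print(image_tokens(2048, 2048))             # 4096
print(image_tokens(1024, 1024))             # 1024
print(relative_attention_cost(1024, 1024))  # 0.0625
print(fit_resolution(1920, 1080))           # (1376, 768)
```

A 16:9 source therefore lands at 43 × 24 = 1032 tokens, close to the square 1024² budget.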

For captions, HiDream wants natural-language prose, not danbooru tags (prompts are read by the Qwen3-VL backbone itself, not by a separate text encoder). I captioned ~190 images with a local VLM into one-to-three-sentence descriptions, each prefixed with a trigger phrase so the aesthetic stays prompt-controllable (invoke it when you want it, leave it off otherwise).
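The trigger prefixing itself is a one-liner, but it is worth making idempotent so re-running the captioning step doesn't stack triggers. A minimal sketch, assuming the "<trigger>, caption" format used in the sample prompt; with_trigger is an illustrative name, not a function from the repo.

```python
TRIGGER = "kotonia style"  # trigger phrase used for the samples in this post


def with_trigger(caption: str, trigger: str = TRIGGER) -> str:
    """Prefix a caption with the trigger phrase, without doubling it up."""
    caption = " ".join(caption.split())  # collapse stray whitespace and newlines
    if caption.lower().startswith(trigger.lower()):
        return caption
    return f"{trigger}, {caption}"


captions = [
    "A schoolgirl stands under cherry blossoms.\nSoft afternoon light.",
    "kotonia style, a knight faces an approaching storm.",
]
for caption in captions:
    print(with_trigger(caption))
```

Leaving the trigger out of a prompt at inference time is then what keeps the base model's default look available.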

Results

Same prompt, same seed, adapter off vs on. All samples use the trigger phrase kotonia style:

Base vs LoRA — schoolgirl, cherry blossoms
Base vs LoRA — kimono, autumn
Base vs LoRA — semi-realistic portrait
Base vs LoRA — fantasy knight, storm

The base model is competent but soft and a bit generic; the LoRA pushes rendering toward a polished modern-anime look — directional lighting, glossier hair and skin, more confident stylization — and it holds across very different subjects (schoolgirl slice-of-life → epic fantasy), so it learned an aesthetic rather than memorizing images.

Training progression (500 → 2500 steps)

Same prompt, same seed, rank 16, ~190 images:

Progression: 500 → 1500 → 2500 steps

It keeps refining without melting or obvious overfitting even at 2500 steps — the sweet spot is further out than I expected for a set this small. (Loss drifts ~0.07 → 0.052.)

NSFW controllability

I also tested NSFW controllability (prompt gating) with this LoRA: the model produces NSFW output only when explicitly prompted, and the LoRA's contribution is primarily visual quality rather than "uncensoring." For the full story, including training data composition, motivation, and NSFW samples, see the companion article on kotonia.ai.

Reproduce it

The whole trainer is ~150 lines. Run:

uv pip install peft
CUDA_VISIBLE_DEVICES=0 python train_lora.py \
  --data_dir /path/to/images \
  --out_dir outputs/lora_run \
  --resolution 1024 --steps 2500 --rank 16 \
  --sample_every 500 --sample_prompt "<trigger>, ..."

--sample_every renders an adapter on/off pair so you can watch the LoRA bite. Inference loads the base model, applies the adapter with PeftModel.from_pretrained, and generates — disable_adapter() gives you the baseline for free.

Gotchas that cost me time

  • Host-RAM OOM on load. from_pretrained(...).to(device) materializes the full 8B model in CPU RAM before moving it to GPU; on a 60 GB host alongside other services this got OOM-killed mid-load. low_cpu_mem_usage=True streams the shards and fixes it.
  • The 2048 "limit" is a default. Pass your own height/width (multiples of 32) and bypass the bucket snapping entirely.
  • Detach long runs. Launch training under setsid/tmux/systemd — if it's a child of your editor's terminal, an editor crash takes the run (and any GPU services in sibling terminals) down with it.
  • x0-param, not v-param. Train against x0 directly; if you assume velocity prediction the loss won't match the head and the LoRA won't converge to the right manifold.

Companion article (the story behind this LoRA): Why I Trained a HiDream-O1 LoRA — on kotonia.ai.


The LoRA is available on kotonia.ai/studio (my own creative platform where I serve the model alongside the LoRA, free to use). The full trainer code, captioning pipeline, and inference scripts are in the GitHub repo under HiDream-O1-Image/.

If you train something cool with this recipe — a character LoRA, a style LoRA, an NSFW-enhancing LoRA — I'd love to see it. The more community LoRAs exist for O1, the better for everyone.
