shinji shimizu

Posted on May 22 • Originally published at kotonia.ai

HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution

#ai #python #machinelearning #gpu

TL;DR

I'm running HiDream-O1-Image Full as a persistent local server integrated into a Studio UI. The official recipe — 2048x2048 / 50 steps / guidance 5.0 — produces beautiful results, but each image takes around 33 seconds. That's too slow for iterative exploration.

So I held the prompt and seed constant and swept steps, guidance, and resolution. The sweet spots were clear.

Config	Time	vs. Official
`2048 / 50 steps / g5`	33.37s	1.00x
`2048 / 28 steps / g5`	18.41s	1.81x
`1536 / 20 steps / g5`	7.14s	4.67x
`1024 / 20 steps / g5`	3.83s	8.71x

The takeaway: explore direction at low resolution and low steps, then do the final render at full quality. In particular, 1536x1536 / 28–36 steps hits a very good speed-quality balance.

Motivation

Once image generation is embedded in a UI, iteration speed matters more than peak quality.

The real workflow isn't "generate one perfect image." It looks like this:

Check composition, mood, outfit, background direction
Tweak the prompt slightly
Try different seeds
Re-render only the promising candidates at full quality

Waiting 30+ seconds per generation makes that loop painful. Being able to see rough candidates in 5–10 seconds is a completely different experience.

The goal here isn't "the best single image" — it's understanding how far you can cut exploration cost without breaking quality in a meaningful way.

Environment

GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)
Model: HiDream-O1-Image Full (8B, bf16)
Inference server: Custom Python HTTP server with model kept resident
Measured: One /generate/t2i request after model load
Seed: 42
Prompt:

A cinematic portrait photo of a woman in a rainy neon street,
detailed skin, 85mm lens, realistic lighting, high detail

All comparison images use the same prompt and seed. Only steps, guidance_scale, resolution, and resolution snapping are varied.

Parameter	Value
prompt	`A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail`
seed	`42`
mode	`t2i`
dtype	`bf16`
negative prompt	none
sampler / scheduler	HiDream pipeline default

I used a portrait because hair, skin, background light, and fine detail are easy to compare. That said, a young woman's face has relatively little texture and wrinkle detail to begin with, so it's actually a forgiving subject for low-step generation — I'll come back to that.

Images in this article are contact sheets with results side by side. Pixel-peeping is easier at full resolution, but for UI-driven exploration the first question is "does this look worth keeping?" — so I've prioritized at-a-glance comparison here.

Start by Reducing Steps

I fixed guidance=5.0 and 2048x2048, and varied only steps.

Resolution	Steps	Guidance	Elapsed	Speedup vs 50 steps
2048x2048	20	5.0	13.070s	2.55x
2048x2048	28	5.0	18.412s	1.81x
2048x2048	36	5.0	23.854s	1.40x
2048x2048	50	5.0	33.370s	1.00x

Latency scales almost linearly with step count: every row works out to roughly 0.65–0.67s per step. In this HiDream path, when guidance > 1.0, each step runs both a conditional and an unconditional forward pass, so cutting steps translates directly into lower latency.
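The per-step cost can be checked directly from the table above. This is a small sketch that only reuses the measured numbers; nothing in it talks to the server.

```python
# Measured (steps, elapsed seconds) at 2048x2048, guidance=5.0, from the table above.
measurements = [(20, 13.070), (28, 18.412), (36, 23.854), (50, 33.370)]

# The official 50-step run is the baseline for the speedup column.
baseline_steps, baseline_elapsed = measurements[-1]

for steps, elapsed in measurements:
    per_step = elapsed / steps
    speedup = baseline_elapsed / elapsed
    print(f"{steps:>2} steps: {per_step:.3f} s/step, {speedup:.2f}x vs {baseline_steps} steps")

# If scaling is linear, these values should all be close to each other.
per_step_costs = [elapsed / steps for steps, elapsed in measurements]
```

The per-step cost barely moves across the four runs, which is what "linear in steps" means in practice: there is no large fixed overhead hiding in the request.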

Visually: 20 steps shows some roughness. 28 steps looks fine at first glance, though fine detail thins out under comparison. 36 steps holds up well for most use cases.

guidance=1.0 Is Significantly Faster

Next I varied guidance as well, comparing practical preset candidates.

Preset	Resolution	Steps	Guidance	CFG	Elapsed
Draft	2048x2048	24	1.0	off	8.164s
Balanced	2048x2048	36	3.0	on	23.664s
Official	2048x2048	50	5.0	on	32.609s

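guidance=1.0 effectively disables CFG, so only the conditional forward pass runs at each step. That makes it faster than step count alone would suggest: 24 steps finishes in 8.164s, about 0.34s per step versus roughly 0.66s per step with CFG on.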
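To make the cost difference concrete, here is a minimal sketch of the standard classifier-free guidance combination. It is not HiDream's actual code: `denoise` is a hypothetical stand-in for one model forward pass, and the empty prompt as the unconditional input is an assumption.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

def guided_prediction(
    denoise: Callable[[str], Vector],  # hypothetical: one forward pass for a prompt
    prompt: str,
    guidance_scale: float,
) -> list[float]:
    """Return the prediction for one step, running the second pass only if needed."""
    cond = denoise(prompt)
    if guidance_scale <= 1.0:
        # At scale 1.0 the CFG formula reduces to `cond`, so the
        # unconditional pass can be skipped entirely.
        return list(cond)
    uncond = denoise("")  # second forward pass: this roughly doubles per-step cost
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
```

With guidance above 1.0 each step pays for two forward passes, and at 1.0 it pays for one, which is consistent with the per-step time roughly halving in the Draft row.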

The trade-off is that lower guidance changes prompt adherence and overall aesthetics. Fine for idea validation, but for prompts involving text, specific clothing details, or precise multi-element placement, staying at guidance=3–5 is safer.

The Resolution Trap: Requesting 1024 Doesn't Make It Faster

My first instinct was to just pass width=1024, height=1024 and get a faster result. But the official pipeline doesn't use the requested resolution directly — it snaps to the nearest fixed aspect-ratio bucket.

Measured results:

Requested	Actual
512x512	2048x2048
1024x1024	2048x2048
2048x2048	2048x2048
1280x720	2560x1440
720x1280	1440x2560
1024x768	2304x1728

Sending 1024x1024 from the UI therefore buys no speed: every square request resolves to 2048x2048. The snapping logic lives in models/utils.py under PREDEFINED_RESOLUTIONS, and it seems intentionally designed to favor output stability.
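The observed behavior is consistent with "pick the bucket whose aspect ratio is closest to the request." The sketch below reproduces the table under that assumption. It is not the upstream implementation: `snap_to_bucket` and `OBSERVED_BUCKETS` are hypothetical names, and the bucket list contains only the four resolutions seen in the measurements, not the real contents of PREDEFINED_RESOLUTIONS.

```python
import math

# Only the buckets observed in the table above; the real PREDEFINED_RESOLUTIONS
# list in models/utils.py may contain more entries.
OBSERVED_BUCKETS = [(2048, 2048), (2560, 1440), (1440, 2560), (2304, 1728)]

def snap_to_bucket(width: int, height: int) -> tuple[int, int]:
    """Pick the bucket with the closest aspect ratio (assumed matching rule)."""
    target = math.log(width / height)
    return min(
        OBSERVED_BUCKETS,
        key=lambda wh: abs(math.log(wh[0] / wh[1]) - target),
    )

for request in [(512, 512), (1024, 1024), (1280, 720), (720, 1280), (1024, 768)]:
    print(request, "->", snap_to_bucket(*request))
```

Comparing log aspect ratios treats landscape and portrait symmetrically. The key point is that the requested pixel count never enters the decision, which is why a smaller square request is no faster.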

Bypassing Buckets for True Low-Resolution Generation

For experimentation I added a snap_resolution=false flag that bypasses the pipeline's resolution snapping. For safety, arbitrary resolutions are constrained to:

width and height aligned to 32px
256px minimum
max 4.3MP total
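A minimal sketch of those three checks, assuming "4.3MP" means 4,300,000 pixels. The function and constant names are hypothetical, not the server's actual code.

```python
ALIGNMENT = 32
MIN_SIDE = 256
MAX_PIXELS = 4_300_000  # "4.3MP" read as 4.3 million pixels (assumption)

def validate_resolution(width: int, height: int) -> None:
    """Raise ValueError if an unsnapped resolution breaks any of the three limits."""
    for name, value in (("width", width), ("height", height)):
        if value % ALIGNMENT != 0:
            raise ValueError(f"{name}={value} is not a multiple of {ALIGNMENT}")
        if value < MIN_SIDE:
            raise ValueError(f"{name}={value} is below the {MIN_SIDE}px minimum")
    if width * height > MAX_PIXELS:
        raise ValueError(f"{width}x{height} exceeds {MAX_PIXELS} pixels")
```

Under this reading, 2048x2048 (about 4.19MP) still passes, so the cap sits just above the largest square bucket.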

Comparing 1024 / 1536 / 2048 at 20 steps / guidance=5.0:

Resolution	Elapsed	Speedup vs 2048
1024x1024	3.831s	3.47x
1536x1536	7.139s	1.86x
2048x2048	13.278s	1.00x

This is where the real gains are. The official 2048 recipe sits at 30+ seconds; scaling the 7.139s measurement linearly from 20 to 28 steps puts 1536 at 28 steps around 10 seconds, which feels completely different in use.
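The 10-second figure is an extrapolation, not a measurement. The arithmetic can be sketched as follows, assuming latency stays proportional to step count at each resolution, as it did in the 2048 step sweep.

```python
# Measured at 20 steps, guidance=5.0, snapping bypassed (from the table above).
elapsed_20_steps = {1024: 3.831, 1536: 7.139, 2048: 13.278}

def estimate_seconds(side: int, steps: int) -> float:
    """Extrapolate latency by assuming time is proportional to step count."""
    return elapsed_20_steps[side] / 20 * steps

for steps in (28, 36):
    print(f"1536x1536 / {steps} steps ~ {estimate_seconds(1536, steps):.1f}s (estimate)")
```

As a sanity check, the same extrapolation gives about 18.6s for 2048 at 28 steps, close to the 18.412s measured in the step sweep.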

1024 is fast but noticeably lower in information density. Good for directional checks, but probably too rough for regular output use.

Presets in the Studio UI

Based on these results, here's what I settled on in the Studio UI:

Use case	Resolution	Steps	Guidance	When to use
Quick preview	1024x1024	20–24	1.0–3.0	Composition / mood check
Standard	1536x1536	28–36	3.0–5.0	Day-to-day
High quality	2048x2048	36–50	5.0	Re-render of selected candidates
Official bucket	bucket	50	5.0	Match upstream recipe exactly

Steps and resolution are independently selectable in the UI. The workflow is: explore with 1024 / 24 steps, then re-render promising results at 1536 or 2048 with the same prompt and seed.
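The explore-then-re-render flow can be sketched as a preset table plus a request builder. This is illustrative only: the concrete values are single points picked from the ranges in the table, and the request field names are assumptions rather than the server's confirmed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Preset:
    side: int
    steps: int
    guidance_scale: float

# One illustrative point from each range in the preset table.
PRESETS = {
    "quick_preview": Preset(side=1024, steps=24, guidance_scale=3.0),
    "standard": Preset(side=1536, steps=28, guidance_scale=5.0),
    "high_quality": Preset(side=2048, steps=50, guidance_scale=5.0),
}

def build_request(prompt: str, seed: int, preset_name: str) -> dict:
    """Build a JSON-style body; field names are assumed, not the server's schema."""
    p = PRESETS[preset_name]
    return {
        "prompt": prompt,
        "seed": seed,
        "width": p.side,
        "height": p.side,
        "steps": p.steps,
        "guidance_scale": p.guidance_scale,
        "snap_resolution": False,  # bypass bucket snapping so 1024/1536 are honored
    }

# Same prompt and seed for both passes; only the cost settings change.
draft = build_request("a rainy neon street", seed=42, preset_name="quick_preview")
final = build_request("a rainy neon street", seed=42, preset_name="high_quality")
```

Keeping prompt and seed outside the preset makes the re-render a one-argument change, which is the whole point of the workflow.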

Cases Where Quality Degradation Shows Up

With this portrait, the difference between 28 steps and 50 steps was "visible under comparison" — not obvious at a glance. But part of that is the subject matter.

Low steps and low resolution tend to hurt most with:

Older faces, wrinkles, skin texture
Hands, fingers, jewelry
Fabric with fine patterns
Text in signs or books
Multiple people
Busy indoor scenes with lots of background objects

Conversely, young faces, simple backgrounds, and soft lighting are forgiving — low-cost settings hold up well.

That's why a single fixed preset isn't the right design. Giving users control over exploration cost depending on what they're generating is the better approach.

Reproduction Commands

The benchmark script lives at image_server/bench_quality_speed.py. It calls the HTTP API after the model is already resident, so model load time is excluded from all measurements.

./image_server/start_image_server.sh

Steps comparison:

python3 image_server/bench_quality_speed.py \
  --prompt "A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail" \
  --seed 42 \
  --variant s20_g5,20,5 \
  --variant s28_g5,28,5 \
  --variant s36_g5,36,5 \
  --variant s50_g5,50,5

Resolution comparison:

python3 image_server/bench_quality_speed.py \
  --prompt "A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail" \
  --seed 42 \
  --variant s20_g5,20,5 \
  --size 1024x1024 \
  --size 1536x1536 \
  --size 2048x2048 \
  --no-snap-resolution

Summary

HiDream-O1-Image Full is excellent at its official settings but too slow for iterative use. When you break down steps, CFG, and resolution separately, the speedups are clean and predictable.

Steps scale almost linearly with time
guidance=1.0 drops CFG and gives a large speed boost
The official pipeline snaps resolutions to fixed buckets
True low-resolution generation at 1024/1536 is dramatically faster
1536 / 28–36 steps is the practical sweet spot

For image generation UIs, low-cost exploration → high-quality final render is a much better flow than starting at maximum quality every time. This experiment gave me a solid basis for building exactly that.
