shinji shimizu

Originally published at kotonia.ai

HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution

TL;DR

I'm running HiDream-O1-Image Full as a persistent local server integrated into a Studio UI. The official recipe — 2048x2048 / 50 steps / guidance 5.0 — produces beautiful results, but each image takes around 33 seconds. That's too slow for iterative exploration.

So I held the prompt and seed constant and swept steps, guidance, and resolution. The sweet spots were clear.

| Config | Time | Speedup vs. official |
|---|---|---|
| 2048 / 50 steps / g5 | 33.37s | 1.00x |
| 2048 / 28 steps / g5 | 18.41s | 1.81x |
| 1536 / 20 steps / g5 | 7.14s | 4.67x |
| 1024 / 20 steps / g5 | 3.83s | 8.71x |

The takeaway: explore direction at low resolution and low steps, then do the final render at full quality. In particular, 1536x1536 / 28–36 steps hits a very good speed-quality balance.


Motivation

Once image generation is embedded in a UI, iteration speed matters more than peak quality.

The real workflow isn't "generate one perfect image." It looks like this:

  1. Check composition, mood, outfit, background direction
  2. Tweak the prompt slightly
  3. Try different seeds
  4. Re-render only the promising candidates at full quality

Waiting 30+ seconds per generation makes that loop painful. Being able to see rough candidates in 5–10 seconds is a completely different experience.

The goal here isn't "the best single image" — it's understanding how far you can cut exploration cost without breaking quality in a meaningful way.


Environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)
  • Model: HiDream-O1-Image Full (8B, bf16)
  • Inference server: Custom Python HTTP server with model kept resident
  • Measured: One /generate/t2i request after model load
  • Seed: 42
  • Prompt:
A cinematic portrait photo of a woman in a rainy neon street,
detailed skin, 85mm lens, realistic lighting, high detail

All comparison images use the same prompt and seed. Only steps, guidance_scale, resolution, and resolution snapping are varied.

| Parameter | Value |
|---|---|
| prompt | A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail |
| seed | 42 |
| mode | t2i |
| dtype | bf16 |
| negative prompt | none |
| sampler / scheduler | HiDream pipeline default |

I used a portrait because hair, skin, background light, and fine detail are easy to compare. That said, a young woman's face has relatively little texture and wrinkle detail to begin with, so it's actually a forgiving subject for low-step generation — I'll come back to that.

Images in this article are contact sheets with results side by side. Pixel-peeping is easier at full resolution, but for UI-driven exploration the first question is "does this look worth keeping?" — so I've prioritized at-a-glance comparison here.


Start by Reducing Steps

I fixed guidance=5.0 and the resolution at 2048x2048, and varied only steps.

[Image: steps comparison]

| Resolution | Steps | Guidance | Elapsed | Speedup vs. 50 steps |
|---|---|---|---|---|
| 2048x2048 | 20 | 5.0 | 13.070s | 2.55x |
| 2048x2048 | 28 | 5.0 | 18.412s | 1.81x |
| 2048x2048 | 36 | 5.0 | 23.854s | 1.40x |
| 2048x2048 | 50 | 5.0 | 33.370s | 1.00x |

Latency scales almost linearly with step count. In this HiDream path, when guidance > 1.0, every step runs both a conditional and an unconditional forward pass, so the cost per step is roughly fixed and cutting steps cuts latency in proportion.
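
As a quick check on that scaling, the per-step cost can be computed directly from the four measurements in the table. This is plain arithmetic on the reported numbers, not output from the benchmark script:

```python
# Measured wall-clock times at 2048x2048, guidance=5.0 (from the table above).
measurements = {20: 13.070, 28: 18.412, 36: 23.854, 50: 33.370}

def seconds_per_step(results):
    """Return {steps: elapsed / steps} for each measured run."""
    return {steps: elapsed / steps for steps, elapsed in results.items()}

per_step = seconds_per_step(measurements)
for steps, cost in per_step.items():
    print(f"{steps:>2} steps: {cost:.3f} s/step")

# Ratio between the most and least expensive per-step cost.
spread = max(per_step.values()) / min(per_step.values())
print(f"spread: {spread:.3f}x")
```

All four runs land between 0.65 and 0.67 seconds per step, so fixed per-request overhead is small compared with the denoising loop itself.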

Visually: 20 steps shows some roughness. 28 steps looks fine at first glance, though fine detail thins out under comparison. 36 steps holds up well for most use cases.


guidance=1.0 Is Significantly Faster

Next I varied guidance as well, comparing practical preset candidates.

[Image: preset comparison]

| Preset | Resolution | Steps | Guidance | CFG | Elapsed |
|---|---|---|---|---|---|
| Draft | 2048x2048 | 24 | 1.0 | off | 8.164s |
| Balanced | 2048x2048 | 36 | 3.0 | on | 23.664s |
| Official | 2048x2048 | 50 | 5.0 | on | 32.609s |

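guidance=1.0 effectively disables CFG, so the unconditional pass is skipped and each step costs roughly half as much. That makes it faster than the step count alone would suggest: 24 steps finish in about 8.2 seconds (about 0.34 s/step, against roughly 0.65 s/step with CFG on).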
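
To illustrate why guidance=1.0 halves the per-step work, here is a minimal sketch of the standard classifier-free guidance combination. It is not HiDream's actual code: `denoise` and `toy_denoise` are hypothetical stand-ins for the model's forward call, and the skip condition simply mirrors the "guidance > 1.0" behavior described above.

```python
def cfg_step(denoise, latent, cond, uncond, guidance):
    """One guided denoising step; returns (prediction, forward_passes).

    `denoise` is a hypothetical stand-in for the model's forward call.
    """
    pred_cond = denoise(latent, cond)
    if guidance <= 1.0:
        # CFG off: the unconditional branch is skipped entirely.
        return pred_cond, 1
    pred_uncond = denoise(latent, uncond)
    # Standard classifier-free guidance combination.
    return pred_uncond + guidance * (pred_cond - pred_uncond), 2


# Toy scalar "model" so the sketch runs without a GPU.
def toy_denoise(latent, conditioning):
    return latent * 0.5 + conditioning


_, passes_draft = cfg_step(toy_denoise, 1.0, 2.0, 0.0, guidance=1.0)
_, passes_official = cfg_step(toy_denoise, 1.0, 2.0, 0.0, guidance=5.0)
print(passes_draft * 24, "forward passes for Draft (24 steps)")
print(passes_official * 50, "forward passes for Official (50 steps)")
```

Counted this way, Official needs 100 forward passes and Draft needs 24, a ratio of about 4.2. That is in the same range as the measured 32.609s versus 8.164s (about 4.0x).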

The trade-off is that lower guidance changes prompt adherence and overall aesthetics. Fine for idea validation, but for prompts involving text, specific clothing details, or precise multi-element placement, staying at guidance=3–5 is safer.


The Resolution Trap: Requesting 1024 Doesn't Make It Faster

My first instinct was to just pass width=1024, height=1024 and get a faster result. But the official pipeline doesn't use the requested resolution directly — it snaps to the nearest fixed aspect-ratio bucket.

[Image: resolution buckets]

Measured results:

| Requested | Actual |
|---|---|
| 512x512 | 2048x2048 |
| 1024x1024 | 2048x2048 |
| 2048x2048 | 2048x2048 |
| 1280x720 | 2560x1440 |
| 720x1280 | 1440x2560 |
| 1024x768 | 2304x1728 |

Sending 1024x1024 from the UI has no effect on speed: every square request resolves to 2048x2048. The snapping logic lives in models/utils.py under PREDEFINED_RESOLUTIONS, and it seems intentionally designed to favor output stability.
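
The measured mapping is consistent with a nearest-aspect-ratio lookup. The sketch below is a hypothetical reimplementation, not the code in models/utils.py: `OBSERVED_BUCKETS` holds only the four output sizes seen in the table above (the real list may contain more), and the log-ratio distance is an assumption about how "nearest" is defined.

```python
import math

# Only the four buckets observed in the measurements above; the real
# PREDEFINED_RESOLUTIONS list in models/utils.py may contain more entries.
OBSERVED_BUCKETS = [(2048, 2048), (2560, 1440), (1440, 2560), (2304, 1728)]

def snap_to_bucket(width, height, buckets=OBSERVED_BUCKETS):
    """Pick the bucket whose aspect ratio is closest to the request (log scale)."""
    target = math.log(width / height)
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

for request in [(512, 512), (1024, 1024), (1280, 720), (720, 1280), (1024, 768)]:
    print(request, "->", snap_to_bucket(*request))
```

Because only the ratio enters the lookup, the requested pixel count never influences the result, which is why 512x512 and 1024x1024 cost exactly as much as 2048x2048.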


Bypassing Buckets for True Low-Resolution Generation

For experimentation I added a snap_resolution=false flag that bypasses the pipeline's resolution snapping. For safety, arbitrary resolutions are constrained to:

  • width and height aligned to 32px
  • 256px minimum
  • max 4.3MP total
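
A minimal sketch of how those three constraints could be checked before a request reaches the pipeline. The function and constant names are hypothetical, and "4.3MP" is read here as 4.3 million pixels, which is an assumption:

```python
ALIGNMENT = 32            # width and height must be multiples of this
MIN_SIDE = 256            # minimum edge length in pixels
MAX_PIXELS = 4_300_000    # "4.3MP", read here as 4.3 million pixels (assumption)

def validate_resolution(width, height):
    """Return a list of constraint violations; an empty list means accepted."""
    errors = []
    for name, value in (("width", width), ("height", height)):
        if value % ALIGNMENT != 0:
            errors.append(f"{name}={value} is not a multiple of {ALIGNMENT}")
        if value < MIN_SIDE:
            errors.append(f"{name}={value} is below the {MIN_SIDE}px minimum")
    if width * height > MAX_PIXELS:
        errors.append(f"{width}x{height} exceeds {MAX_PIXELS} pixels")
    return errors

print(validate_resolution(1536, 1536))  # accepted
print(validate_resolution(1000, 2560))  # misaligned width
```

Under this reading, 2048x2048 (about 4.19 million pixels) sits just inside the cap.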

Comparing 1024 / 1536 / 2048 at 20 steps / guidance=5.0:

[Image: resolution comparison]

| Resolution | Elapsed | Speedup vs. 2048 |
|---|---|---|
| 1024x1024 | 3.831s | 3.47x |
| 1536x1536 | 7.139s | 1.86x |
| 2048x2048 | 13.278s | 1.00x |

This is where the real gains are. Given that the official 2048 recipe sits at 30+ seconds, 1536 + 28 steps should land around 10 seconds — a completely different feel.
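
The "around 10 seconds" figure is an extrapolation rather than a measurement. A short sketch of the arithmetic, assuming time stays linear in steps at a given resolution (as it did at 2048x2048):

```python
# Measured at 20 steps / guidance=5.0 with snapping bypassed (table above).
elapsed_20_steps = {1024: 3.831, 1536: 7.139, 2048: 13.278}

def estimate_seconds(side, steps, baseline=elapsed_20_steps, baseline_steps=20):
    """Linear-in-steps extrapolation from the 20-step measurement."""
    return baseline[side] / baseline_steps * steps

for steps in (28, 36):
    print(f"1536 / {steps} steps: ~{estimate_seconds(1536, steps):.1f}s (estimate)")

# How time grows relative to pixel count between neighbouring sizes.
for small, large in ((1024, 1536), (1536, 2048)):
    pixel_ratio = (large / small) ** 2
    time_ratio = elapsed_20_steps[large] / elapsed_20_steps[small]
    print(f"{small}->{large}: {pixel_ratio:.2f}x pixels, {time_ratio:.2f}x time")
```

As a sanity check on the linear assumption, the same extrapolation for 2048 / 28 steps gives about 18.6 seconds, close to the 18.412 seconds measured earlier.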

1024 is fast but noticeably lower in information density. Good for directional checks, but probably too rough for regular output use.


Presets in the Studio UI

Based on these results, here's what I settled on in the Studio UI:

| Use case | Resolution | Steps | Guidance | When to use |
|---|---|---|---|---|
| Quick preview | 1024x1024 | 20–24 | 1.0–3.0 | Composition / mood check |
| Standard | 1536x1536 | 28–36 | 3.0–5.0 | Day-to-day |
| High quality | 2048x2048 | 36–50 | 5.0 | Re-render of selected candidates |
| Official bucket | bucket (snapped) | 50 | 5.0 | Match upstream recipe exactly |

Steps and resolution are independently selectable in the UI. The workflow is: explore with 1024 / 24 steps, then re-render promising results at 1536 or 2048 with the same prompt and seed.
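
One way to express the preset table as data is sketched below. Everything here is illustrative: the `Preset` class, the key names, and the `snap_resolution` values per preset are assumptions, not the Studio UI's actual configuration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Preset:
    side: Optional[int]            # None means "let the pipeline pick the bucket"
    steps: Tuple[int, int]         # inclusive (min, max) range
    guidance: Tuple[float, float]  # inclusive (min, max) range
    snap_resolution: bool          # assumed per-preset setting

PRESETS = {
    "quick_preview": Preset(1024, (20, 24), (1.0, 3.0), snap_resolution=False),
    "standard": Preset(1536, (28, 36), (3.0, 5.0), snap_resolution=False),
    "high_quality": Preset(2048, (36, 50), (5.0, 5.0), snap_resolution=False),
    "official_bucket": Preset(None, (50, 50), (5.0, 5.0), snap_resolution=True),
}

def clamp_steps(preset_name, requested_steps):
    """Keep a user-selected step count inside the preset's range."""
    low, high = PRESETS[preset_name].steps
    return max(low, min(high, requested_steps))

print(clamp_steps("standard", 50))
```

Storing ranges rather than single values keeps steps adjustable within a preset while still giving each preset a sensible floor and ceiling.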


Cases Where Quality Degradation Shows Up

With this portrait, the difference between 28 steps and 50 steps was "visible under comparison" — not obvious at a glance. But part of that is the subject matter.

Low steps and low resolution tend to hurt most with:

  • Older faces, wrinkles, skin texture
  • Hands, fingers, jewelry
  • Fabric with fine patterns
  • Text in signs or books
  • Multiple people
  • Busy indoor scenes with lots of background objects

Conversely, young faces, simple backgrounds, and soft lighting are forgiving — low-cost settings hold up well.

That's why a single fixed preset isn't the right design. Giving users control over exploration cost depending on what they're generating is the better approach.


Reproduction Commands

The benchmark script lives at image_server/bench_quality_speed.py. It calls the HTTP API after the model is already resident, so model load time is excluded from all measurements.

./image_server/start_image_server.sh

Steps comparison:

python3 image_server/bench_quality_speed.py \
  --prompt "A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail" \
  --seed 42 \
  --variant s20_g5,20,5 \
  --variant s28_g5,28,5 \
  --variant s36_g5,36,5 \
  --variant s50_g5,50,5

Resolution comparison:

python3 image_server/bench_quality_speed.py \
  --prompt "A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail" \
  --seed 42 \
  --variant s20_g5,20,5 \
  --size 1024x1024 \
  --size 1536x1536 \
  --size 2048x2048 \
  --no-snap-resolution
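
For readers without the benchmark script, timing a single request against a resident server might look roughly like the sketch below. This is not bench_quality_speed.py: the host and port are hypothetical, and the JSON field names are guesses based on the parameters mentioned in this article (guidance_scale, width, height, snap_resolution), so the server's real request schema may differ.

```python
import json
import time
import urllib.request

# Hypothetical address; the article does not state the server's host or port.
SERVER_URL = "http://127.0.0.1:8000/generate/t2i"

def build_payload(prompt, seed, steps, guidance, width, height, snap=True):
    """Assemble a request body. Field names are assumptions based on the article."""
    return {
        "prompt": prompt, "seed": seed, "steps": steps,
        "guidance_scale": guidance, "width": width, "height": height,
        "snap_resolution": snap,
    }

def timed_request(payload, url=SERVER_URL):
    """POST one request and return wall-clock seconds for the round trip."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()  # wait for the full response before stopping the clock
    return time.perf_counter() - start

payload = build_payload("a test prompt", 42, 20, 5.0, 1536, 1536, snap=False)
print(json.dumps(payload, indent=2))
```

With the server running, `timed_request(payload)` returns the elapsed seconds for that one generation. Model load time stays out of the number only because the model is already resident when the request arrives.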

Summary

HiDream-O1-Image Full is excellent at its official settings but too slow for iterative use. When you break down steps, CFG, and resolution separately, the speedups are clean and predictable.

  • Time scales almost linearly with step count
  • guidance=1.0 drops CFG and gives a large speed boost
  • The official pipeline snaps resolutions to fixed buckets
  • True low-resolution generation at 1024/1536 is dramatically faster
  • 1536 / 28–36 steps is the practical sweet spot

For image generation UIs, low-cost exploration → high-quality final render is a much better flow than starting at maximum quality every time. This experiment gave me a solid basis for building exactly that.
