TL;DR
- HiDream-O1-Image 8B Full raw outputs collapse on plain Japanese prompts — both instruction-following and aesthetics fail at once
- Tried to swap to Dev-2604 (preference-tuned, 3.5× faster). It's better aesthetically but the gap is small in our use case, and worse — the 96GB GPU can't host both models alongside the rest of the stack
- Pivoted away from model swap entirely. Stuck with Full + a Gemini Flash Lite prompt enhancer that bolts aesthetic polish on top
- Along the way, found four non-obvious HiDream pitfalls (brand names get rendered as literal text, "cute" triggers childlike body bias, "Wong Kar-wai" hallucinates Korean captions, "idol-class" auto-generates caption text) — all baked into the enhancer's system prompt
- Same plain Japanese prompt now produces a usable photoreal or anime variant from a single click. No model swap, no extra VRAM, no extra latency.
Act 1: "Raw output is busted"
Kotonia Studio runs HiDream-O1-Image 8B Full on a local GPU (RTX PRO 6000 Blackwell Max-Q, 96GB) and offers free T2I. Normally outputs are clean. But one day, a plain Japanese prompt — "a cute woman in a cheongsam, holding a fan, smiling" — returned this:
What went wrong:
- Asked for a cheongsam, got a kimono. Chinese attire drifted to Japanese.
- Face isn't pretty. We wanted idol-class beauty.
- Composition is generic full-body in a Kyoto-style garden. We wanted a closer crop showing the fan texture.
HiDream-O1 is a top-tier OpenWeight model — careful English prompts produce magazine-grade 2048×2048 outputs. So this isn't "the model is bad." It's a gap between user input and OpenWeight model expectations. Frontier models (Gemini Imagen / DALL-E / Midjourney) absorb natural-language nuance internally. OpenWeight models expect you to throw the prompt straight at them.
Either give up on the raw-output UX, or do something about it.
Act 2: Maybe Dev-2604 will save us?
Then I noticed HiDream-O1-Image-Dev-2604, a new variant released in May 2026. It debuted at #8 on the Artificial Analysis T2I Arena and runs 3.5× faster at 28 steps with no CFG.
Arena ranks models on human aesthetic preference. So Dev should be preference-tuned for "what looks good."
Hypothesis:
- Dev returns magazine-grade output even on vague Japanese prompts
- 3.5× speed improvement makes /studio snappier
- Best case: deprecate Full, run Dev only
Phase 1 bench: 5 generic cinematic prompts (Tokyo izakaya, Bangkok night market, anime character, text-in-image, portrait), Full vs Dev-2604:
| mode | Full (s) | Dev-2604 (s) | speedup |
|---|---|---|---|
| T2I (avg) | 33.1 | 9.5 | 3.5× |
| Edit (avg) | 79.0 | 22.2 | 3.6× |
| IP | 84.3 | 23.8 | 3.5× |
On generic prompts, Dev is faster and impressionistically nicer. "OK, Dev is the answer" — that's where I almost stopped at the end of Phase 1.
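The speedup column is simply the ratio of the two average wall-clock times. A minimal Rust sketch that recomputes it from the table; the `speedup` helper is a name chosen here for illustration and is not part of the actual bench harness:

```rust
/// Ratio of Full's time to Dev's time; values above 1.0 mean Dev is faster.
fn speedup(full_secs: f64, dev_secs: f64) -> f64 {
    full_secs / dev_secs
}

fn main() {
    // (mode, Full avg seconds, Dev-2604 avg seconds), copied from the table above.
    let rows = [("T2I", 33.1, 9.5), ("Edit", 79.0, 22.2), ("IP", 84.3, 23.8)];
    for (mode, full, dev) in rows {
        println!("{mode}: {:.1}x", speedup(full, dev));
    }
}
```

Keeping the raw seconds next to the ratio matters later in the story: a ratio alone hides how a fixed load-time penalty compares against absolute inference time.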
Act 3: But on the use case, the gap is thin — and Edit performance drops hard
I almost locked in a wrong conclusion. Kotonia's actual strategy is "comedy-style short videos with idol-class beauty hooks." The fact that Dev wins on generic cinematic doesn't mean it wins on character-driven comedy with expression specificity.
Built 8 new prompts inspired by Grok-generated reference images (cinematic editorial Asian beauty / anime qipao / cinematic hanfu / cosplay maid / etc), in vertical 1440×2560 (9:16) framing, and re-benched.
Some of the Grok reference images (the level of polish we wanted to match):
| Editorial portrait | Cinematic hanfu |
|---|---|
| (image: Grok reference, editorial portrait) | (image: Grok reference, cinematic hanfu) |
The re-bench result: Full wins on instruction-following. Per category:
- editorial portrait: tied; Dev maybe a touch nicer aesthetically
- anime qipao: Full's cel-shading wins decisively. Dev drifts to semi-realistic and ignores the "anime" instruction
- hanfu brocade: Dev hallucinated the literal word "SAVE" onto the parasol (text artifact)
- comedy surprised face: Full produces a more cartoonish exaggerated expression + readable caption text
- comedy deadpan: Full nails the "really?" deadpan expression with crisp eyeliner
Dev-2604 traded instruction-following for aesthetic polish. It appears to have been preference-tuned toward magazine-style fashion photos, so on non-magazine use cases it pulls outputs back toward "magazine-looking" against the prompt's intent.
"Both fine, marginal gap" example: editorial portrait
The category I marked "tied" — same portrait prompt, Full vs Dev outputs side by side:
| Full (tight crop, dramatic) | Dev-2604 (wider, magazine-polished) |
|---|---|
| (image: Full output, editorial portrait) | (image: Dev-2604 output, editorial portrait) |
Full leans high-contrast and moody (window-side Rembrandt light, dark library background). Dev leans soft and editorial (seated half-body, natural light, smoother skin retouch). Both are usable; Dev is slightly gentler. That's it.
Not enough of a gap to justify the cost of model swapping (VRAM, load time, architectural complexity). That's the conclusion Phase 2 drove me to.
The decisive blow: Edit and IP performance crater
Generic T2I alone might have left Dev viable. But the gap on Edit and IP (character consistency) was stark, and that's what finally killed the model-swap idea.
We took a T2I output with three people in a dark alley with lanterns, and ran this Edit instruction: "Same scene, same characters, same composition. Change the weather to a heavy rainy evening; the characters now wearing translucent rain ponchos."
| Full (scene preserved, weather changed) | Dev-2604 (abandoned the source scene entirely) |
|---|---|
| (image: Full edit output) | (image: Dev-2604 edit output) |
Full followed the instruction: three people, rain ponchos, rainy alley. Dev replaced the reference entirely with a single woman in a kimono at a snowy temple gate — neither following the text instruction nor preserving any structural detail from the reference. This is past "weak edit fidelity"; it's "not functioning as an edit."
IP (character consistency) showed the same pattern. We handed the model two face photos and asked for "the same two people standing together on an autumn path in Kyoto."
| Full (identities mostly preserved) | Dev-2604 (different people generated) |
|---|---|
| (image: Full IP output) | (image: Dev-2604 IP output) |
Full keeps the two faces recognizable. Dev generated two different people. The preference-tuning likely prioritizes "produce pretty faces" over "preserve the reference's identity."
The official README spells this out: "For editing tasks we recommend using the full model." In Phase 1, Edit averaged 79s on Full versus 22s on Dev. Dev is fast, but its outputs are unusable for Edit/IP.
So Dev isn't a clear win. But it's not a clean loss either — it's faster (3.5×), and on cinematic atmosphere shots it does look better. Maybe I need to use both, switched per use case?
Act 4: VRAM math kills "use both"
"Just keep both models resident on GPU" sounds clean. Then I actually pulled up the GPU memory budget for the single 96GB GPU we run everything on:
| Co-resident process | resident VRAM | peak VRAM |
|---|---|---|
| E4B (reviewer LLM) | 19.6 GB | 19.6 GB |
| 31B Gemma 4 NVFP4 (orchestrator) | 38.0 GB | 38.0 GB |
| TTS server (Irodori + Whisper) | 9.6 GB | 9.6 GB |
| Ditto-TalkingHead | 3.0 GB | 3.0 GB |
| LTX-2 A2V (cold-start, fp8-cast) | 0.9 GB | 24.0 GB (during inference) |
| HiDream Full (resident) | 16.4 GB | 17.3 GB |
| Total | 87.5 GB | 111.5 GB ← when LTX-2 fires |
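The moment LTX-2 video generation fires, the summed peaks (111.5 GB) already exceed the 96GB card on paper, so we are at the OOM line before adding anything. Adding Dev-2604 as a second resident model, assuming the same 16.4 GB footprint as Full, pushes the peak total to about 127.9 GB. Impossible.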
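The budget check is plain addition, but writing it down makes it easy to rerun whenever the stack changes. A minimal Rust sketch using the table's numbers; it assumes Dev-2604 has the same 16.4 GB resident footprint as Full (the article's working assumption) and treats all peaks as coinciding, which is the worst case:

```rust
/// One GPU tenant per row: (name, resident GB, peak GB), values from the table above.
const STACK: [(&str, f64, f64); 6] = [
    ("E4B reviewer", 19.6, 19.6),
    ("31B Gemma 4 NVFP4", 38.0, 38.0),
    ("TTS server", 9.6, 9.6),
    ("Ditto-TalkingHead", 3.0, 3.0),
    ("LTX-2 A2V", 0.9, 24.0),
    ("HiDream Full", 16.4, 17.3),
];

const GPU_CAPACITY_GB: f64 = 96.0;

/// Sums (resident, peak) over the stack plus an optional extra resident model.
fn totals(extra_resident_gb: f64) -> (f64, f64) {
    let resident = STACK.iter().map(|t| t.1).sum::<f64>() + extra_resident_gb;
    let peak = STACK.iter().map(|t| t.2).sum::<f64>() + extra_resident_gb;
    (resident, peak)
}

fn main() {
    // Second entry assumes Dev-2604 occupies the same 16.4 GB as Full.
    for (label, extra) in [("current stack", 0.0), ("plus Dev-2604 resident", 16.4)] {
        let (resident, peak) = totals(extra);
        println!(
            "{label}: resident {resident:.1} GB, peak {peak:.1} GB, fits at peak: {}",
            peak <= GPU_CAPACITY_GB
        );
    }
}
```

Summing peaks is deliberately pessimistic. If the real peaks never overlap, actual headroom is larger, but a free service is a poor place to find out the hard way.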
Options enumerated:
- Both resident: impossible (OOM, see above)
- Both cold-start: +22s load per request, a big hit against 33s of inference. Idle 0GB is nice, but first-touch UX collapses
- Dev resident + Full cold-start: Dev as primary + Full for edit/IP. But Phase 2 invalidated that premise
- Full resident + Dev cold-start: Occasionally switch to Dev, eat 22s load each time
- Drop Dev, keep Full only: status quo, no speedup gained
From a service-viability standpoint, the first four options all force one of two sacrifices: make free users wait an extra 22s, or shrink VRAM headroom until LTX-2 / 31B can't run. With one solo operator running a single GPU, the budget is tight, and Dev's marginal aesthetic gain doesn't justify breaking the rest of the stack.
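To put the cold-start penalty in numbers, here is a rough Rust sketch that adds the 22s load to the Phase 1 T2I averages. It assumes load and inference run back to back and ignores any time spent evicting the other model; `request_latency` is an illustrative helper, not production code:

```rust
/// Seconds to load a cold model onto the GPU (figure from the options list above).
const COLD_LOAD_SECS: f64 = 22.0;

/// End-to-end T2I latency: optional cold load followed by average inference time.
fn request_latency(inference_secs: f64, cold_start: bool) -> f64 {
    if cold_start {
        COLD_LOAD_SECS + inference_secs
    } else {
        inference_secs
    }
}

fn main() {
    let full = 33.1; // Phase 1 T2I average for Full
    let dev = 9.5; // Phase 1 T2I average for Dev-2604
    println!("Full resident: {:.1}s", request_latency(full, false));
    println!("Full cold:     {:.1}s", request_latency(full, true));
    println!("Dev resident:  {:.1}s", request_latency(dev, false));
    println!("Dev cold:      {:.1}s", request_latency(dev, true));
}
```

Under these assumptions a cold-started Dev request costs 22 + 9.5 = 31.5s, roughly what a resident Full request already takes, so on-demand switching gives back most of Dev's speed advantage.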
I decided to abandon the model-swap path entirely.
Act 5: Can we just beat this with prompts?
Step back. What was Dev actually winning on?
Just aesthetic polish. Instruction-following is better on Full.
So if I can keep Full's instruction-following while bolting aesthetic polish onto the output, model swap isn't needed.
Concrete approach: append an aesthetic anchor (a "magic suffix") to the prompt to steer Full's output toward magazine-quality.
Trade-offs:
- ✅ Zero VRAM cost (Full only)
- ✅ Inference time unchanged (33s/image)
- ✅ Edit/IP/skeleton/layout still work on Full (avoiding the Dev performance cliff from Act 3)
- ✅ No 22s Dev cold-start penalty
- ⚠️ Risk: do anchors actually work?
Phase 3 — tried 4 anchor variants on Full:
- v1 Lindbergh: "Vogue cover composition, Peter Lindbergh editorial photography..."
- v2 cinematic: "Roger Deakins anamorphic, blockbuster color grade..."
- v3 K-beauty: "Vogue Korea / ELLE Korea aesthetic, glass-skin glow..."
- v4 combined: kitchen-sink
3 base prompts × (baseline + 4 anchors) = 15 generations. And three deeply non-obvious HiDream behaviors surfaced.
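The bench matrix is a plain cross product with the unmodified prompt included as a baseline. A minimal Rust sketch of how such a job list could be built; the strings are placeholders because the real base prompts and full anchor texts are not reproduced in this post, and `build_matrix` is a hypothetical name:

```rust
/// Builds every (base prompt, anchor) combination, plus the no-anchor baseline per base.
fn build_matrix(bases: &[&str], anchors: &[&str]) -> Vec<String> {
    let mut jobs = Vec::new();
    for base in bases {
        jobs.push(base.to_string()); // baseline: prompt sent unchanged
        for anchor in anchors {
            jobs.push(format!("{base} {anchor}")); // anchor appended as a suffix
        }
    }
    jobs
}

fn main() {
    // Placeholder strings only; substitute real prompts and anchors.
    let bases = ["<base prompt 1>", "<base prompt 2>", "<base prompt 3>"];
    let anchors = ["<v1 Lindbergh>", "<v2 cinematic>", "<v3 K-beauty>", "<v4 combined>"];
    let jobs = build_matrix(&bases, &anchors);
    println!("{} generations", jobs.len());
}
```

Running the baseline inside the same loop keeps every anchor comparison against an output from the same session and settings.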
Pitfall 1: Brand names get rendered as literal text on the image
Any anchor containing "Vogue" or "ELLE" produced outputs with "VOGUE" appearing in printed magazine-cover text on the image itself — top-right corner, in front of the subject. Worse on anime: the cel-shaded character had a magazine layout overlaid on top.
HiDream-O1 is SOTA on CVTG-2K (complex visual text generation). The strong text-rendering training means any brand name in the prompt gets a near-guaranteed shot at being literally generated as text on the canvas.
→ Strip brand names from anchors completely. The photographer, cinematographer, and studio names we A/B-tested (Lindbergh, Deakins, Mihoyo) came out safe; magazine trademarks are landmines.
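A cheap guard against this pitfall is to lint every anchor before it reaches the model. The Rust sketch below is one possible check, not the production code: the deny-list holds only the two names observed leaking in this bench, and `find_banned_terms` is a hypothetical helper. It matches whole words so that "excellent" does not trip on "elle":

```rust
/// Illustrative deny-list; only "Vogue" and "ELLE" were observed leaking in the bench.
const BANNED_TERMS: [&str; 2] = ["vogue", "elle"];

/// Returns the banned terms that appear as whole words in `anchor` (case-insensitive).
fn find_banned_terms(anchor: &str) -> Vec<&'static str> {
    let lowered = anchor.to_lowercase();
    // Split on anything that is not a letter or digit, dropping empty fragments.
    let words: Vec<&str> = lowered
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .collect();
    BANNED_TERMS
        .iter()
        .copied()
        .filter(|term| words.contains(term))
        .collect()
}

fn main() {
    let anchor = "Vogue Korea / ELLE Korea aesthetic, glass-skin glow";
    let hits = find_banned_terms(anchor);
    if !hits.is_empty() {
        println!("reject anchor, banned terms: {hits:?}");
    }
}
```

Multi-word names would need phrase matching instead of word matching; this sketch covers single-word brands only.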
Pitfall 2: Photoreal anchors contaminate anime outputs with magazine paper
When anime base prompts were paired with photoreal anchors (v1-v4), the output looked like a cel-shaded anime character with a literal VOGUE magazine cover layout overlaid on top.
When style hints conflict, HiDream physically overlays both elements rather than blending them, at least in these runs.
→ Anime needs its own anchor family (Mihoyo / Kyoto Animation / theatrical anime style) — never reuse photoreal anchors.
Pitfall 3: "Wong Kar-wai" → Korean text hallucination on photoreal scenes
A later anchor variant, v5 (grok-direction), included "Wong Kar-wai-style color grade", and the output rendered Korean text such as "신부의 아안" on the photoreal scene.
Wong Kar-wai is a Hong Kong director with no Korean connection. But the model's internal "Asian arthouse cinema" association routed toward Korean and surfaced as printed text. Director names carry similar risk to brand names — A/B before adopting.
Act 6: Defuse the "cute → child" bias, ship it
Phase 4 rewrote the anchor library:
- All brand names stripped
- Only A/B-verified safe names retained (Lindbergh, Deakins, Mihoyo)
- Separate anime anchor family added (Mihoyo / Kyoto Animation)
- Anime anchors include "mature young-adult character proportions" to defuse the "cute" → childlike-body bias (a behavior the user had spotted before I even ran the bench)
Re-benched result:
- ✅ photoreal portrait: v3 K-beauty clean — no VOGUE leakage, glass-skin + cinematic light
- ✅ anime: v7 Mihoyo anchor — no magazine contamination, adult proportions preserved
- ⚠️ comedy caption text handled separately (embrace auto-caption when wanted, post-overlay otherwise)
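The rewritten library boils down to two rules: pick the anchor family by style, and replace age-ambiguous "cute girl" wording before appending it. A deterministic Rust sketch of that logic follows. The anchor strings are abbreviated stand-ins, the `Style` enum and `enhance` function are illustrative names, and the shipped version delegates this work to an LLM instead of string replacement:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Style {
    Photoreal,
    Anime,
}

/// Abbreviated anchors; the production strings are longer.
fn anchor_for(style: Style) -> &'static str {
    match style {
        Style::Photoreal => {
            "High-end Korean fashion magazine photoshoot aesthetic, glass-skin glow"
        }
        Style::Anime => {
            "In the visual style of Mihoyo key art, mature young-adult character proportions"
        }
    }
}

/// Normalizes age-ambiguous phrasing (case-sensitive match), then appends the anchor.
fn enhance(prompt: &str, style: Style) -> String {
    let adult = "young woman in her early twenties with adult proportions";
    let normalized = prompt
        .replace("cute girl", adult)
        .replace("kawaii girl", adult);
    format!("{normalized} {}", anchor_for(style))
}

fn main() {
    println!("{}", enhance("a cute girl holding a fan", Style::Anime));
    println!("{}", enhance("a woman in a cheongsam", Style::Photoreal));
}
```

Because each style maps to exactly one family, a photoreal anchor can never be appended to an anime prompt, which is the contamination case from Pitfall 2.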
"Full + cleaned anchors" locked in. Time to wire it into the product.
Implementation: /api/studio/enhance (Gemini Flash Lite)
Added an enhance endpoint in backend/src/handlers/studio.rs. Backed by gemini-3.1-flash-lite (cheap API), not the local 31B Gemma. Why:
- The 31B local model is 38GB resident — the VRAM budget above already ruled out adding more local LLM weight
- Flash Lite is $0.075/M input + $0.30/M output. One enhance is roughly 800 in + 400 out tokens = ~$0.0002/call. Effectively free
- Zero VRAM impact: adding this feature doesn't compete with the rest of the GPU stack
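The per-call figure follows directly from the quoted prices. A small Rust sketch of the arithmetic, with the token counts taken from the rough estimate above and `cost_per_call` as an illustrative helper:

```rust
/// Prices in USD per million tokens, as quoted above.
const INPUT_USD_PER_M: f64 = 0.075;
const OUTPUT_USD_PER_M: f64 = 0.30;

/// Cost of one enhance call in USD for the given token counts.
fn cost_per_call(input_tokens: u32, output_tokens: u32) -> f64 {
    let input = input_tokens as f64 * INPUT_USD_PER_M;
    let output = output_tokens as f64 * OUTPUT_USD_PER_M;
    (input + output) / 1_000_000.0
}

fn main() {
    // 800 input + 400 output tokens is the rough per-call estimate from the text.
    let per_call = cost_per_call(800, 400);
    println!("per call: ${per_call:.5}");
    println!("per 10,000 calls: ${:.2}", per_call * 10_000.0);
}
```

That works out to $0.00018 per call, or about $1.80 per 10,000 enhance clicks, which is why the post rounds it to roughly $0.0002.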
System prompt encodes everything from Phase 1-4:
```rust
// System prompt sent to the enhancer LLM; "..." marks text elided in this post.
const ENHANCE_SYSTEM_PROMPT: &str = r#"You are a prompt enhancer for HiDream-O1-Image.

Rules (learned from A/B benchmarking):
1. NEVER include brand names ("Vogue", "ELLE", "Nike"): HiDream renders them
   as literal text overlays.
2. NEVER use "Wong Kar-wai" — triggers Korean text hallucination.
3. For photoreal portraits, append:
   " High-end Korean fashion magazine photoshoot aesthetic, professional
   beauty retouch, glass-skin glow, ..."
4. For anime / cel-shaded / illustration, append:
   " In the visual style of Mihoyo / HoYoverse key art, semi-painterly cel
   shading, ..., mature young-adult character proportions ..."
   ALSO: if the prompt has "cute girl" / "kawaii girl" without age qualifier,
   normalize to "young woman in her early twenties with adult proportions".
5. For cinematic scenes, append cinematic CG realism anchor (no Wong Kar-wai).
6. For text-design prompts, append no suffix.

Output JSON: { "detected_style": "...", "anchor_applied": "...",
               "enhanced_prompt": "..." }
"#;
```
UI side: a small "✨ Enhance" button above the prompt textarea on /studio. Click → POST /api/studio/enhance → swap textarea contents for enhanced_prompt + green banner showing detected style + undo link.
Act 7: Won
Same plain Japanese prompt that produced the kimono failure earlier, now run via the Enhance button:
Photoreal anchor applied
Cheongsam intact, close-up framing, idol-class face, glass-skin retouch, magazine lighting.
Anime anchor applied
Cel-shaded anime style, Chinese architectural courtyard background, adult proportions preserved, fan texture kept.
Same plain Japanese prompt → photoreal and anime variants, one click each. Single model, zero extra VRAM, identical inference time.
Takeaways
Engineering judgment lessons from this exercise:
- "Model swap" and "prompt engineering" should be compared on the same budget. Without a frontier model, VRAM and service viability constraints dominate model selection. In this case, preserving Full's resident slot was a higher-priority constraint than Dev's aesthetic edge.
- A/B bench in two stages. Generic prompts → tentative conclusion → use-case prompts → reversal. That's exactly what Acts 2-3 of this story were. Stopping at one stage means you ship the wrong conclusion.
- Proper nouns are landmines. Models with strong text-rendering training will literally bake trademarks and director names into the canvas. A/B every name before adopting.
- Cheap LLM prompt enhancers are the strongest move under VRAM pressure. $0.0002/call for a noticeable UX bump. Adding more local LLM weight starves the rest of the stack.
- Anime and photoreal need separate anchor families. Style hints that conflict get physically overlaid, not blended.
What's next
- LoRA training: prompt engineering hits a ceiling on anime. Train a custom anime LoRA on HiDream-O1 and let users swap LoRAs per use case ("comedy character expressions," "vertical 9:16 idol portrait," etc).
- Composition diversity: current anchors over-bias toward "indoor magazine shoot." Need explicit outdoor / urban / cinematic-location variants.
- A/B testing in prod: instrument /admin/analytics/ to measure enhance-on vs enhance-off retry rate and conversion.
Even for OpenWeight diffusion models, one layer of prompt engineering above the model is enough to lift "raw output failure" into "production quality." If you're putting HiDream-O1-Image into production, dodge these four pitfalls and you're 80% of the way there.
The implementation runs live at kotonia.ai/studio — the "✨ Enhance" button sits above the prompt textarea. Free to try.










