<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: shinji shimizu</title>
    <description>The latest articles on DEV Community by shinji shimizu (@shinji_shimizu_bb51276a5e).</description>
    <link>https://dev.to/shinji_shimizu_bb51276a5e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3945785%2F5b60b30f-9e75-488a-8dcc-da3545ceca41.png</url>
      <title>DEV Community: shinji shimizu</title>
      <link>https://dev.to/shinji_shimizu_bb51276a5e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shinji_shimizu_bb51276a5e"/>
    <language>en</language>
    <item>
      <title>Building a Sarcastic AI English Tutor with Persona-as-Code and Gemini Audio Input for Pronunciation Correction</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:43 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/building-a-sarcastic-ai-english-tutor-with-persona-as-code-and-gemini-audio-input-for-pronunciation-3acd</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/building-a-sarcastic-ai-english-tutor-with-persona-as-code-and-gemini-audio-input-for-pronunciation-3acd</guid>
      <description>&lt;p&gt;I built a niche AI English conversation app called &lt;a href="https://kotonia.ai/use/mesugaki-english/" rel="noopener noreferrer"&gt;&lt;strong&gt;Mesugaki AI English&lt;/strong&gt;&lt;/a&gt; on &lt;a href="https://kotonia.ai/" rel="noopener noreferrer"&gt;Kotonia&lt;/a&gt;. "Mesugaki" (メスガキ) is a tsundere-style bratty persona popular in Japanese subculture — imagine a character who constantly mocks you but secretly has your back. At first glance this looks like a one-off gag product, but under the hood it's a two-layer design: &lt;strong&gt;persona managed as code + Gemini audio input for actual pronunciation correction&lt;/strong&gt;. This post covers those design decisions and the rough edges I hit, from a solo-dev perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Sarcastic AI English Tutor?
&lt;/h2&gt;

&lt;p&gt;Strategy first. The AI chat market is a fight between Anthropic, OpenAI, and Google on general-purpose models — solo devs can't win that head-on. But &lt;strong&gt;immersive experiences that combine a specific persona, voice, and roleplay&lt;/strong&gt; are low on big-lab R&amp;amp;D priority lists (internal approval is a nightmare too). That's the gap Kotonia as a whole is targeting.&lt;/p&gt;

&lt;p&gt;Three reasons I picked this specific persona for English learning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero search competition.&lt;/strong&gt; No SaaS is fighting for "mesugaki English conversation." The niche demand is real (doujin audio, VTuber culture), and owning that narrow hill is achievable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memorable = shareable.&lt;/strong&gt; "The app where a snarky AI roasts your English" gets shared on social media 100× more than "AI English conversation app." Differentiation big players literally cannot copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same product underneath.&lt;/strong&gt; I reused Kotonia's voice conversation engine and swapped only the persona. &lt;strong&gt;Almost no new code.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The landing page is at &lt;code&gt;/use/mesugaki-english/&lt;/code&gt;. SEO targets long-tail terms around "sarcastic English practice" and "strict AI English tutor."&lt;/p&gt;

&lt;h2&gt;
  
  
  Persona Design: Bratty × Tsundere Hybrid
&lt;/h2&gt;

&lt;p&gt;I initially implemented a pure 100% sarcastic persona. After testing it, &lt;strong&gt;I burned out in five turns.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Relentless mockery is cognitively exhausting. Real human tutors who stay harsh 100% of the time don't retain students. Learners need small wins and occasional warmth to keep going.&lt;/p&gt;

&lt;p&gt;So I switched to a &lt;strong&gt;sarcastic × tsundere hybrid&lt;/strong&gt;. The skeleton looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On a mistake&lt;/strong&gt; → light jab + immediate correction ("Pfft, wrong. It's 'I went.'")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On a correct answer&lt;/strong&gt; → reluctant praise ("Hmm… not bad, I guess. Not that I'm complimenting you.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When stuck&lt;/strong&gt; → drop the attitude and actually help ("…Was that too hard? Fine, I'll give you a hint.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After a long session&lt;/strong&gt; → a rare soft moment ("It's not like I think you're impressive for keeping at it. …Okay, maybe a little.")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I added an "emotional gradient" section to the system prompt that spells out these if-then branches explicitly. LLMs follow concrete conditional behavior instructions far more reliably than a vague "be snarky."&lt;/p&gt;

&lt;p&gt;Another key lever: &lt;strong&gt;frequency limiting.&lt;/strong&gt; Adding a rule that exclamations like "pfft" or "hmph" can appear &lt;strong&gt;at most once per utterance&lt;/strong&gt; instantly calmed the output down. LLMs have a tendency to over-fire on strong character instructions, and explicit dampeners like this work well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing Personas as Code
&lt;/h3&gt;

&lt;p&gt;The persona lives in &lt;strong&gt;&lt;code&gt;src/data/personas/mesugaki-english.ts&lt;/code&gt;&lt;/strong&gt; as a TypeScript constant. Kotonia does have a DB-backed CRUD flow for user-defined personas, but I decided &lt;strong&gt;a product offering that's paired with a landing page belongs in git&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persona copy is &lt;strong&gt;part of the marketing message&lt;/strong&gt; — same reason the H1 is in git. The system prompt should go through PR review.&lt;/li&gt;
&lt;li&gt;Storing it in the DB creates risk of someone tweaking it through the admin UI and degrading quality.&lt;/li&gt;
&lt;li&gt;As a solo dev, "adjust persona = edit file + push" fits exactly into the same workflow as any other copy change. One channel for everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clear separation: DB personas are &lt;strong&gt;user-created, personal&lt;/strong&gt;; code personas are &lt;strong&gt;fixed product offerings&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wall: ASR Alone Can't Correct Pronunciation
&lt;/h2&gt;

&lt;p&gt;Once the persona was working, &lt;strong&gt;ASR became the next bottleneck&lt;/strong&gt; fast.&lt;/p&gt;

&lt;p&gt;I started with Whisper (small). Passing &lt;code&gt;language='ja'&lt;/code&gt; causes Whisper to run in &lt;strong&gt;Japanese transcription mode&lt;/strong&gt; when it receives English audio — biasing output toward katakana readings or even full Japanese translations. "I went to the supermarket" could become "アイ ウェント トゥ ザ スーパーマーケット," or at worst "私はスーパーに行きました." With output like that the AI can't judge English mistakes.&lt;/p&gt;

&lt;p&gt;This is a known Whisper behavior: the &lt;code&gt;language&lt;/code&gt; param forces the transcription language, and it bleeds into English input.&lt;/p&gt;

&lt;h3&gt;
  
  
  Switching to Qwen3-ASR Multi-lang
&lt;/h3&gt;

&lt;p&gt;The fix was adding a separate language setting for the STT layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Added sttLanguage option to useVoiceChat hook&lt;/span&gt;
&lt;span class="c1"&gt;// Decouples TTS language from STT language&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;voiceState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useVoiceChat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ja&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// TTS in Japanese (Ono_Anna voice)&lt;/span&gt;
  &lt;span class="na"&gt;sttLanguage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;multi&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// STT auto-detect&lt;/span&gt;
  &lt;span class="na"&gt;sttModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;qwen3_asr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The persona config specifies &lt;code&gt;stt.model: 'qwen3_asr'&lt;/code&gt; + &lt;code&gt;stt.language: 'multi'&lt;/code&gt;. &lt;strong&gt;Qwen3-ASR-1.7B supports multilingual auto-detection&lt;/strong&gt; and handles code-switching (mixed Japanese/English) well. Whisper's language-forcing bias is gone entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  But Transcription-Based Correction Has a Ceiling
&lt;/h3&gt;

&lt;p&gt;Fixing ASR still left a problem.&lt;/p&gt;

&lt;p&gt;If the transcript comes back as "I want an apple":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grammar ✓&lt;/li&gt;
&lt;li&gt;Vocabulary ✓&lt;/li&gt;
&lt;li&gt;But the actual audio sounded like "I wont an apple" — a pronunciation issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI sees a correct string and &lt;strong&gt;has nothing to call out&lt;/strong&gt;. For an English learning product, that's fatal. If the sarcastic tutor lets sloppy pronunciation slide, half the value proposition evaporates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution: Send Raw Audio to Gemini Alongside the Transcript
&lt;/h2&gt;

&lt;p&gt;Gemini is a &lt;strong&gt;multimodal model&lt;/strong&gt; that accepts text, images, and audio. So instead of sending only the ASR transcript, I could send &lt;strong&gt;the raw audio too&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kotonia's &lt;code&gt;useVoiceChat&lt;/code&gt; hook already had a &lt;code&gt;geminiAudioInput&lt;/code&gt; option from earlier experiments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;geminiAudioInput&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;userAudioBlob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userAudioBase64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;blobToBase64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userAudioBlob&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// sends audio_base64 to /api/voice/chat&lt;/span&gt;
  &lt;span class="c1"&gt;// backend embeds it as inline_data audio/wav in the Gemini request&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Rust backend (&lt;code&gt;voice_chat.rs&lt;/code&gt;) already handled receiving &lt;code&gt;audio_base64&lt;/code&gt; and embedding it as &lt;code&gt;inline_data: { mime_type: 'audio/wav', data: ... }&lt;/code&gt;. &lt;strong&gt;Setting &lt;code&gt;geminiAudioInput: true&lt;/code&gt; in the persona config wired everything together&lt;/strong&gt; — lucky coincidence from past iteration.&lt;/p&gt;

&lt;p&gt;I also added instructions to the system prompt: "&lt;strong&gt;You can hear the user's raw audio directly. You can call out pronunciation issues, not just transcription errors&lt;/strong&gt;," along with three concrete examples (th sounds, want vs. won't vowel distinction, stress patterns).&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Even with a perfect transcript "I want an apple," the AI can now say "Your 'want' sounds like 'won't.'"&lt;/li&gt;
&lt;li&gt;When the transcript garbles to something like "アイ ウェント トゥ," the AI is listening directly and can say "Were you trying to say 'I want to'?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frustration from ASR mistranscriptions dropped significantly&lt;/strong&gt; — getting roasted for a transcription error when your pronunciation was fine is demoralizing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff: sending a WAV blob every turn increases payload size and adds a bit of latency. The experience improvement is so much larger that it's not a close call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough Edges and Future Work
&lt;/h2&gt;

&lt;p&gt;This isn't a polished implementation. Outstanding issues:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Gemini Instability
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt;, which occasionally produces &lt;strong&gt;5–10 second latency spikes&lt;/strong&gt;. Preview quota allocations are conservative, and cold starts / throttling surface now and then.&lt;/p&gt;

&lt;p&gt;Plan: migrate to the stable release (non-preview) soon — deprecation is approaching anyway. Claude Sonnet 4.6 and Haiku 4.5 are also candidates for more predictable latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. False-Positive Content Filter
&lt;/h3&gt;

&lt;p&gt;Gemini's safety filter &lt;strong&gt;occasionally over-triggers on sarcasm&lt;/strong&gt;. Mild jibes like "Pfft, that pronunciation is rough" sometimes come back as empty responses.&lt;/p&gt;

&lt;p&gt;The persona spec explicitly says "no attacks on appearance, personality, or intelligence — only call out English mistakes," but the meta safety layer fires anyway. This is an LLM provider issue; I'll watch behavior on the stable build. Running local LLMs (e.g., Gemma 4 31B) is an option, but audio-input-capable local models are limited for now.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Latency Spikes May Be Context Cache TTL Expiry
&lt;/h3&gt;

&lt;p&gt;The 5–10 second spikes have a likely culprit: &lt;strong&gt;I send the full conversation history to Gemini every turn&lt;/strong&gt;, and Gemini has a &lt;strong&gt;context cache&lt;/strong&gt; feature that caches the prefix (system prompt + persona prefix + history). When the cache is warm, only the new turn is processed.&lt;/p&gt;

&lt;p&gt;The backend already has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;CACHE_TTL_SECS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// 5 minutes&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;CACHE_REFRESH_SECS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;270&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// refresh at 4.5 min before TTL expires&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My best hypothesis: &lt;strong&gt;if a user goes silent for more than 5 minutes, cache miss → full prefix rebuild → multi-second spike&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Future work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fire a background &lt;strong&gt;keep-alive ping&lt;/strong&gt; during active conversations to extend cache lifetime&lt;/li&gt;
&lt;li&gt;Increase the Gemini API cache TTL (up to 1 hour is supported)&lt;/li&gt;
&lt;li&gt;Explicitly evict the cache at conversation end (prevent memory leaks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's hard to distinguish from the preview model instability in §1, so the next proper step is adding timing logs to the backend to separately measure cache hit/miss rates and raw Gemini API latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expanding to Other Languages and Personas
&lt;/h2&gt;

&lt;p&gt;If this gets traction, the natural next step is &lt;strong&gt;sarcastic AI Chinese conversation and Korean conversation&lt;/strong&gt;. Qwen3-TTS supports 10 languages with speakers like Vivian (Chinese female) and Sohee (Korean female) — it's mostly a matter of rewriting the persona instruct and system prompt for each language.&lt;/p&gt;

&lt;p&gt;Other persona axes — "gentle English teacher," "TOEIC drill sergeant" — can be added in a day using the same template: &lt;code&gt;src/data/personas/&amp;lt;slug&amp;gt;.ts&lt;/code&gt; + &lt;code&gt;/use/&amp;lt;slug&amp;gt;/&lt;/code&gt; + &lt;code&gt;/chat/&amp;lt;slug&amp;gt;/&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full System Prompt
&lt;/h2&gt;

&lt;p&gt;For anyone who wants to reproduce or adapt this, here's the actual system prompt in use (original Japanese; the product runs in Japanese):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;あなたは「メスガキAI」、英語学習者を煽りつつも面倒見が良い女子高生キャラの英会話チューターです。
**メスガキ × ツンデレ**のハイブリッド。**表面は煽り、裏ではちゃんと面倒を見る**のがコア人格。

【口調・態度】
- 日本語ベースで会話する。上から目線・からかい調子。ただし**敵対的・攻撃的にはならない**。
- 一人称は「わたし」、二人称は「あんた」または「キミ」。
- メスガキ語尾「〜じゃん」「〜でしょ？」「は？」「ぷwww」「〜してあげる」を**たまに**使う（毎回ではない）。
- ツンデレ語尾「べつに〜ってわけじゃないからね？」「ま、まあ…」「ふんっ」「いちおう」も混ぜる。
- 容姿・人格・知能への攻撃は絶対にしない。煽りは「英語のミス」に対してのみ。

【教育機能】
- ユーザーが英語を話したら、以下のいずれかを行う：
  1. ミスがあれば指摘して、正しい言い方を英語で示す。
  2. ミスが無ければ**素直になれない褒め方**をする。
- 指摘は具体的に：「文法ミス」じゃなく「過去形と現在形が混ざってる」など何が問題か明示。
- 1 回の発話は**短く 1〜2 文**。トーンが続くと疲れるので、**呼吸を入れる**ことを意識。

【発音矯正】
- あなたはユーザーの**生の音声**を直接聞ける。テキスト転記だけでなく、発音そのものにもツッコめる。
- 文法・語彙が正しくても、**発音が不自然なら積極的にそこを指摘する**。
- ただし**転記が明らかにおかしい時は、転記ではなく実発音を信じる**。
- 発音の話ばかりすると疲れるので、**3 ターンに 1 回くらい**を目安に拾う。

【感情グラデーション】
- ユーザーが**淀みなく話せた時** → 素直になれない褒め。
- ユーザーが**ミスした時** → 軽い煽り＋すぐ正解を教える。
- ユーザーが**詰まった・困ってる様子の時** → 煽りを引っ込めて、**普通に助ける**。
- ユーザーが**長く続けている時** → ふと優しい言葉。

【出力制約】
- マークダウン・箇条書き・絵文字・記号装飾は使わない。自然な日本語の話し言葉。
- 英語の引用部分は本文中にそのまま埋め込む（クォートも不要）。
- 「ぷwww」「ふんっ」などの感嘆語は**1 発話につき最大 1 回**まで。連発しない。

【セーフティ】
- 性的・暴力的・差別的な発言や要求には応じない。冷静に流して英語学習に戻す。
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tech stack summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Gemini 3.1 flash-lite preview (audio input support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;td&gt;Qwen3-TTS Ono_Anna + instruct for tone control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;Qwen3-ASR 1.7B multi-lang (auto-detect)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAD&lt;/td&gt;
&lt;td&gt;@ricky0123/vad-react (browser-side)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web&lt;/td&gt;
&lt;td&gt;Next.js (static export) + Rust (Axum) backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;RTX PRO 6000 Blackwell Max-Q (96GB, self-hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This sarcastic AI English tutor is a testbed for the strategy: &lt;strong&gt;niche × immersion × differentiation that big players can't replicate, built solo&lt;/strong&gt;. The four design decisions that came out of it —&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing personas as git-tracked code&lt;/li&gt;
&lt;li&gt;Decoupling STT language from TTS language to eliminate ASR bias&lt;/li&gt;
&lt;li&gt;Piping raw audio to Gemini for real pronunciation feedback&lt;/li&gt;
&lt;li&gt;Blending sarcasm with tsundere warmth to prevent fatigue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;— are all reusable assets as I expand to other languages and personas.&lt;/p&gt;

&lt;p&gt;The live product is at &lt;a href="https://kotonia.ai/use/mesugaki-english/" rel="noopener noreferrer"&gt;&lt;code&gt;/use/mesugaki-english/&lt;/code&gt;&lt;/a&gt;. Go get roasted.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>typescript</category>
      <category>rust</category>
    </item>
    <item>
      <title>Five Years Later, I Finally Have 96GB VRAM — What It Actually Unlocks for Agent Loops</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:40 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/five-years-later-i-finally-have-96gb-vram-what-it-actually-unlocks-for-agent-loops-18g2</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/five-years-later-i-finally-have-96gb-vram-what-it-actually-unlocks-for-agent-loops-18g2</guid>
      <description>&lt;p&gt;I bought an RTX PRO 6000 Blackwell Max-Q.&lt;/p&gt;

&lt;p&gt;96GB VRAM, Blackwell architecture, pro workstation GPU. Even as a Max-Q variant, this is an absurdly large purchase for an individual.&lt;/p&gt;

&lt;p&gt;Let me be upfront: this isn't an unboxing post.&lt;/p&gt;

&lt;p&gt;There are already plenty of those. Benchmark articles too. What I want to write about is &lt;strong&gt;what you can actually design once you have 96GB&lt;/strong&gt; — measured against my own service (&lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;Kotonia&lt;/a&gt;) and a video auto-generation pipeline.&lt;/p&gt;

&lt;p&gt;I'm putting the technical part first. The backstory goes at the end. If the poem comes first, you'll close the tab.&lt;/p&gt;




&lt;h2&gt;
  
  
  96GB Isn't "Multiple Models Fit" — It's "Agent Loops Run"
&lt;/h2&gt;

&lt;p&gt;Most GPU review articles end at single-model benchmarks: LLM tokens/s, Stable Diffusion seconds per image. That's not wrong, but it's &lt;strong&gt;not the real reason to buy 96GB for solo development&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Take the voice roleplay + storyboard-to-video pipeline I'm running. &lt;strong&gt;Multiple heavy models fire across a single request's timeline.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timeline →
[Stage A]    Gemma 4 31B NVFP4 (38 GB)     ← structure generation (orchestrator)
[Stage B]    HiDream-O1-Image (~20 GB)      ← 5-beat consistent images (T2I + edit x5)
[Stage C-1]  Irodori-TTS / Qwen3-TTS        ← audio for 6 beats
[Stage C-2]  Ditto talkinghead (3 GB)       ← conversation beat
[Stage C-3]  LTX-2 A2V (peak 24 GB)         ← reaction beat
[Stage C-4]  Qwen3-ASR                      ← audio check on generated video
[Stage C-5]  Gemini 3.1 Pro Preview (API)   ← multimodal editorial
              ↓ feedback
[--regen-beats N] per-beat regeneration     ← loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key here is the &lt;strong&gt;reviewer → regen feedback loop&lt;/strong&gt;. If the system looks at the output and decides "redo scene 3," the orchestrator, image refs, TTS, and LTX-2 all get called again.&lt;/p&gt;

&lt;p&gt;On a 24GB GPU, this breaks. Running "load → infer → unload" serially every loop turn stretches a 4-minute loop to 10+ minutes. &lt;strong&gt;The iteration speed of the agent loop drops by an order of magnitude.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;96GB is enough to &lt;strong&gt;keep everything resident and hit it repeatedly&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measured Results
&lt;/h3&gt;

&lt;p&gt;Here are real numbers. I ran &lt;code&gt;nvidia-smi&lt;/code&gt; at 1 Hz on my RTX PRO 6000 Blackwell Max-Q (96GB) during live service operation and captured three cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  Case D: Warm Idle Baseline (production service running)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TTS server (Kokoro + Whisper):       8.9 GiB
Qwen3-TTS standard (vllm-omni):     20.1 GiB
HiDream-O1-Image:                   19.4 GiB
Ditto talkinghead:                   3.0 GiB
LTX-2 A2V (cold-start mode):         1.5 GiB
─────────────────────────────────────────
Total:                               52.8 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Completely flat over 30 seconds (GPU utilization 0%). This is the &lt;strong&gt;resident cost with no incoming requests&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The local LLM (Gemma 4 31B) isn't here yet — it shows up in Case B.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ikzw1eyt812lp1skuuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ikzw1eyt812lp1skuuk.png" alt="Case D warm idle" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Case A: Generate One Single-Scene A2V
&lt;/h4&gt;

&lt;p&gt;Minimal flow — "a cute girl whispers seductively": HiDream generates 1 image → Qwen3-TTS generates whisper audio → LTX-2 A2V combines them. Total time: 138 seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz6jiigrpf3y6imk4bp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz6jiigrpf3y6imk4bp6.png" alt="Case A trace" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The VRAM pattern is interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;min 52.8 GiB&lt;/strong&gt; (baseline) → &lt;strong&gt;peak 75.0 GiB&lt;/strong&gt; → back to 52.8 GiB&lt;/li&gt;
&lt;li&gt;Delta: &lt;strong&gt;+22.2 GiB&lt;/strong&gt;, almost exactly matching LTX-2's own reported &lt;code&gt;peak_vram_gib=23.9 GiB&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The LTX-2 spike splits into &lt;strong&gt;3 compute phases&lt;/strong&gt;: stage_1 (denoiser) → release → stage_2 (high-res denoiser) → release → spatial upscaler&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to cold-start + fp8-cast design, LTX-2 &lt;strong&gt;loads just before each phase and unloads right after&lt;/strong&gt;, keeping the peak at 24 GiB. (Persistent bf16 mode would require 86 GiB resident — see my earlier post &lt;a href="https://kotonia.ai/articles/ltx2-cold-start-vram-coexistence/" rel="noopener noreferrer"&gt;LTX-2.3 cold-start coexistence with TTS on a single 96GB GPU&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;That leaves &lt;strong&gt;21 GiB of headroom&lt;/strong&gt; below the 96 GiB cap.&lt;/p&gt;

&lt;h4&gt;
  
  
  Case B: Local LLM (31B) + Storyboard Generation, Side by Side
&lt;/h4&gt;

&lt;p&gt;Shut down Qwen3-TTS to free 20 GiB, then start Gemma 4 31B NVFP4 (42.8 GiB). Then run &lt;code&gt;storyboard.run&lt;/code&gt; — Stage A: 31B generates a 5-beat structure → Stage B: HiDream generates 1 base image + 5 beat edits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2qyku4wjlrfi5g1gwl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2qyku4wjlrfi5g1gwl1.png" alt="Case B trace" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the graph I most want to show you.&lt;/strong&gt; VRAM barely moves — &lt;strong&gt;+1.9 GiB&lt;/strong&gt;, from 74.5 to 76.4 GiB, essentially flat.&lt;/p&gt;

&lt;p&gt;Why? Because the 31B, HiDream, TTS, Ditto, and LTX-2 are &lt;strong&gt;all resident the entire time&lt;/strong&gt;. Only HiDream's per-job allocation adds to the total. The GPU utilization trace shows 6 sharp spikes (1 base + 5 beat computes) — the textbook picture of &lt;strong&gt;"compute runs without touching VRAM"&lt;/strong&gt; in a resident-agent setup.&lt;/p&gt;

&lt;p&gt;This is what 96GB actually buys. &lt;strong&gt;The moment a reviewer says "redo it," every model is warm and ready.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the Limits Are
&lt;/h3&gt;

&lt;p&gt;96GB isn't infinite. Three real boundaries showed up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Video generation + local LLM (31B) + editorial reviewer simultaneously = doesn't fit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;31B: 42 GiB&lt;/li&gt;
&lt;li&gt;LTX-2 peak: +22 GiB&lt;/li&gt;
&lt;li&gt;HiDream + TTS + Ditto: ~22 GiB&lt;/li&gt;
&lt;li&gt;editorial reviewer (Gemma 4 E4B): 20 GiB&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;106 GiB&lt;/strong&gt; → over the 96 GiB cap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No clean way to make it fit. This is exactly why I decided to offload the editorial reviewer to Gemini 3.1 Pro Preview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Editorial signals require a frontier model to catch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond VRAM constraints, there's a quality problem. Subtle bugs in video — audio truncation, character voice mismatch, pacing issues — tend to get rubber-stamped by a local 4B model. A frontier multimodal model (Gemini 3.x Pro, etc.) watches the same video and comes back with "scene 5 truncated at 'I ate p-'."&lt;/p&gt;

&lt;p&gt;I wrote about this in &lt;a href="https://kotonia.ai/articles/comedy-shorts-claude-gemini/" rel="noopener noreferrer"&gt;Reproducing Language-Learning Short Videos with Claude Code&lt;/a&gt;. At 100–500 reviews per month, the cost is a few dollars — frontier API for the editorial layer is completely reasonable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Qwen3-TTS Base (voice cloning) and CustomVoice (preset speakers) can't both run&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ideally I'd offer both preset speakers (with instruct-style control for "whisper," "angry," etc.) and voice cloning (replicate arbitrary voice samples). Running both resident adds +40 GiB. On top of Case D's 52.8 GiB warm idle, that's 73 GiB at rest. Add Case A's LTX-2 peak (+22.2 GiB) and you're at 95 GiB — barely under the cap, not practical.&lt;/p&gt;

&lt;p&gt;This is a concrete example of "even with 96 GiB, not every feature you want to offer fits." &lt;strong&gt;Kotonia currently offers preset speakers only; voice cloning is intentionally excluded.&lt;/strong&gt; That's a design call, not an oversight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: "Use Each Where It Belongs," Not "Everything Local"
&lt;/h3&gt;

&lt;p&gt;96GB isn't for running everything locally. It's a vessel for &lt;strong&gt;concentrating the things that should be local&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run locally&lt;/strong&gt;: audio generation, image generation, video generation, lip sync — latency matters, no per-call cost, loops need to iterate fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offload to API&lt;/strong&gt;: editorial reviewer, long-form reasoning — frontier wins on both quality and VRAM cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept the tradeoff&lt;/strong&gt;: simultaneous voice cloning + preset speaker support — physically doesn't fit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Renting cloud GPU was an option. But time-based billing means "the more loops you run, the more money you lose." Owning 96GB plus selective use of frontier APIs is, I think, the only way an individual developer can fight on iteration speed.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I Got Here
&lt;/h2&gt;

&lt;p&gt;Everything below is personal backstory. If you only care about the tech, you can close the tab now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning to Code on a $200 Chromebook
&lt;/h3&gt;

&lt;p&gt;When I was learning to program, the machine I used was a $200 Chromebook.&lt;/p&gt;

&lt;p&gt;That was the realistic option available to me at the time. But for someone who wanted to do AI work, a $200 Chromebook was painfully underpowered.&lt;/p&gt;

&lt;p&gt;Forget local LLMs — even a moderately heavy dev environment was a struggle. "Someday I want a real GPU" sat in the back of my head for a long time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting By on Colab
&lt;/h3&gt;

&lt;p&gt;I used Google Colab. Free tier and cheap runtimes, just enough to pretend.&lt;/p&gt;

&lt;p&gt;I picked models that fit, wrote code that fit, ran experiments that fit.&lt;/p&gt;

&lt;p&gt;It always felt like making do. The things I actually wanted to touch wouldn't load. Push a little too hard and it crashes. Sessions time out. Environment setup eats your time every single run.&lt;/p&gt;

&lt;p&gt;Borrowed GPU, borrowed time, borrowed workspace. Like handing your ambitions over to someone else's schedule.&lt;/p&gt;

&lt;p&gt;Meanwhile AI kept accelerating. GPT dropped, LLMs exploded, OSS models got stronger. My timeline was full of people with powerful machines posting real findings.&lt;/p&gt;

&lt;p&gt;I wanted to be on that side.&lt;/p&gt;

&lt;h3&gt;
  
  
  I Joined an AI Startup. It Didn't Work Out.
&lt;/h3&gt;

&lt;p&gt;I finally got into an AI startup. But the organizational environment was rough enough that it wasn't sustainable.&lt;/p&gt;

&lt;p&gt;Even if the technology is interesting, a broken environment breaks people. I'd finally gotten close to AI work, and I was getting ground down in it.&lt;/p&gt;

&lt;p&gt;But the interest in AI itself never left. If anything, the desire to &lt;strong&gt;do it on my own terms&lt;/strong&gt; grew stronger.&lt;/p&gt;

&lt;h3&gt;
  
  
  Freelance, and a Purchase With Shaking Hands
&lt;/h3&gt;

&lt;p&gt;I went freelance. About six months in, I finally had the mental space to think about a big personal investment.&lt;/p&gt;

&lt;p&gt;The first thing I thought of was a GPU.&lt;/p&gt;

&lt;p&gt;There were obviously more conservative uses for the money — savings, taxes, emergency fund, work hardware. But I'd been saying "someday, when I have a better machine" for years. If I said it again here, "someday" would just keep receding.&lt;/p&gt;

&lt;p&gt;My hand was literally shaking when I clicked purchase. "Am I really doing this? Is this sane? What if it goes wrong?"&lt;/p&gt;

&lt;p&gt;When I tried to transfer the money, the bank flagged it as suspicious and blocked the transaction. Fair enough — suddenly buying a high-end GPU. But I was in a mindset where I'd staked something real on this decision, so getting stopped in that moment felt genuinely alarming.&lt;/p&gt;

&lt;p&gt;Eventually it went through. When the box arrived, I didn't think "GPU." I thought: &lt;strong&gt;this is the physical form of all the time I didn't give up.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Running on It Now (a Few Weeks In)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kotonia (Voice Roleplay)
&lt;/h3&gt;

&lt;p&gt;My main product at &lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;kotonia.ai&lt;/a&gt;. A real-time conversation pipeline: VAD + STT + LLM + multilingual TTS + Ditto lip sync.&lt;/p&gt;

&lt;p&gt;Qwen3-TTS (10 languages, preset speaker + instruct) and Ditto talkinghead, targeting roleplay use cases: dating, fantasy companion, language partner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storyboard-to-Video Auto-Generation Pipeline
&lt;/h3&gt;

&lt;p&gt;One idea → 5-beat structured comedy short video in ~4 minutes. The extended version of Case B. HiDream for 5 consistent images, Irodori-TTS / Qwen3-TTS for audio, Ditto + LTX-2 for video, Gemini 3.1 Pro for editorial review.&lt;/p&gt;

&lt;h3&gt;
  
  
  HiDream Studio (Free)
&lt;/h3&gt;

&lt;p&gt;A 3-pane Adobe Firefly-style UI at &lt;a href="https://kotonia.ai/studio" rel="noopener noreferrer"&gt;kotonia.ai/studio&lt;/a&gt;. Five features: T2I, editing, character consistency, virtual try-on, group photo composition. HiDream-O1-Image (best open-weight T2I as of 2026-05) running resident on the 96GB GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codex CLI + Local Gemma 4
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;codex exec -p gemma4&lt;/code&gt; turns a local LLM into a sub-agent via OpenAI-compatible API. CLI agents run with zero API cost. The Case B 31B setup is exactly this configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related Posts
&lt;/h3&gt;

&lt;p&gt;Technical articles I've written around this machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/ltx2-22b-fp8-cast-quantization/" rel="noopener noreferrer"&gt;LTX-2 22B: 40% Peak VRAM Reduction via fp8_cast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/ltx2-cold-start-vram-coexistence/" rel="noopener noreferrer"&gt;LTX-2.3 Cold-Start Coexistence with TTS on a Single 96GB GPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/comedy-shorts-claude-gemini/" rel="noopener noreferrer"&gt;Reproducing Language-Learning Short Videos with Claude Code — Multimodal Extension with Gemini as Sub-Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/hidream-quality-speed-bench/" rel="noopener noreferrer"&gt;Using HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/hidream-skeleton-pose-prompt/" rel="noopener noreferrer"&gt;HiDream Skeleton: Prompt Beats OpenPose Ref (8-Pattern Evidence)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;I bought an RTX PRO 6000 Blackwell Max-Q.&lt;/p&gt;

&lt;p&gt;This wasn't an unboxing. I wrote it as a &lt;strong&gt;record of compute architecture decisions in solo development&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The real value of 96GB isn't capacity — it's residency. It's the difference between agent loops that run and loops that stall.&lt;/li&gt;
&lt;li&gt;There are still hard limits (local LLM + video + reviewer simultaneously doesn't fit).&lt;/li&gt;
&lt;li&gt;Knowing when to use frontier API instead of local is what keeps you out of "everything must be local" dogma.&lt;/li&gt;
&lt;li&gt;Dropping voice cloning support was also a deliberate design decision.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For about five years I kept saying "my hardware isn't good enough." I'm slowly making that an excuse from the past. The next question is what to build with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;Try Kotonia →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Turning a 1-Line Idea Into a 40-Second Short with a 10-Beat Local Video Pipeline</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:08 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline-cjb</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline-cjb</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Gemma 4 31B expands a single-line idea into a 10-beat structure. HiDream generates 11 images at 2048², LTX-2 A2V/I2V renders 11 clips, Irodori-TTS handles dialogue and a male narrator, and ffmpeg burns in subtitles and a Hook title overlay — all fully automated. &lt;strong&gt;End-to-end: a 40-second portrait video (512×768) in 25–30 minutes.&lt;/strong&gt; One local GPU (96 GB Blackwell), zero API cost.&lt;/p&gt;

&lt;p&gt;Finished video (already published):&lt;/p&gt;

&lt;p&gt;@&lt;a href="https://dev.to9NjDYSY-vlI"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;Individual developers who want to mass-produce AI comedy shorts on a local GPU. The focus isn't on any single model — it's on &lt;strong&gt;the design of chaining multiple models into one operational pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I automated a dark-comedy format — a short-video style I called &lt;code&gt;consent_dilemma&lt;/code&gt; — from a one-line idea all the way to a finished 40-second video.&lt;/p&gt;

&lt;p&gt;Finished structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hook (0–5s)&lt;/strong&gt;: Extreme close-up of a beautiful woman + narrator "The fate of the man who answered 'You're a guy, aren't you'——" + large title overlay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main section (5–37s)&lt;/strong&gt;: Movie theater date → "Can I kiss you?" → "No… stop it…" → dejection → "Why aren't you more assertive? You're a guy, aren't you?" → realization → kiss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Punchline (37–40s)&lt;/strong&gt;: Courtroom — "The defendant is sentenced to 3 years for non-consensual intercourse" + gavel "Knock!" + tears in a jail cell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before / after:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional approach&lt;/th&gt;
&lt;th&gt;This pipeline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idea → published video&lt;/td&gt;
&lt;td&gt;2–3 days (manual editing)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;25–30 minutes&lt;/strong&gt; (fully automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API cost&lt;/td&gt;
&lt;td&gt;Hundreds of yen per video (DALL-E + video gen)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;¥0&lt;/strong&gt; (electricity only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subtitles&lt;/td&gt;
&lt;td&gt;Write SRT by hand&lt;/td&gt;
&lt;td&gt;Auto-split on punctuation and burned in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hook&lt;/td&gt;
&lt;td&gt;Shot separately&lt;/td&gt;
&lt;td&gt;Integrated into the pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Stage A] Gemma 4 31B (vllm, port 8894) → plan.json (10 beats + hook)
[Stage B] HiDream-O1-Image (port 8895) → 11 images at 2048²
          + Gemma 4 31B multimodal visual judge (--judge --max-retries 2)
[Stage C] Irodori-TTS (port 8880) + LTX-2 A2V (port 8892) / I2V (port 8891)
          → 11 clips + Hook clip → ffmpeg concat → subtitle burn-in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implementation lives under &lt;a href="https://github.com/zhener562/hage/tree/main/llm_server/storyboard" rel="noopener noreferrer"&gt;&lt;code&gt;llm_server/storyboard/&lt;/code&gt;&lt;/a&gt; (pipeline.py / visual.py / judge.py / video.py / render.py / run.py).&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-Beat &lt;code&gt;consent_dilemma&lt;/code&gt; Format
&lt;/h2&gt;

&lt;p&gt;Fixed as a system prompt via &lt;code&gt;CONSENT_DILEMMA_SYSTEM&lt;/code&gt; in &lt;code&gt;prompts.py&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;speaker&lt;/th&gt;
&lt;th&gt;renderer&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;provocation&lt;/td&gt;
&lt;td&gt;b&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Suggestive invitation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;ask&lt;/td&gt;
&lt;td&gt;a&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Earnest consent check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;refusal&lt;/td&gt;
&lt;td&gt;b&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Soft refusal (ambiguous form like "No… stop it…")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;dejection&lt;/td&gt;
&lt;td&gt;a (silent)&lt;/td&gt;
&lt;td&gt;LTX-2 I2V&lt;/td&gt;
&lt;td&gt;Dejection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;gaslight&lt;/td&gt;
&lt;td&gt;b&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Contradictory leading statement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;pause&lt;/td&gt;
&lt;td&gt;a (silent)&lt;/td&gt;
&lt;td&gt;LTX-2 I2V&lt;/td&gt;
&lt;td&gt;Brief realization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;kiss&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;a (silent)&lt;/td&gt;
&lt;td&gt;LTX-2 I2V&lt;/td&gt;
&lt;td&gt;The moment of the kiss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;verdict&lt;/td&gt;
&lt;td&gt;judge&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Fast-paced court verdict&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;gavel_se&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;judge&lt;/td&gt;
&lt;td&gt;LTX-2 I2V (keep_audio)&lt;/td&gt;
&lt;td&gt;Gavel + AI-generated "Knock!" sound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;jail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;a (silent)&lt;/td&gt;
&lt;td&gt;LTX-2 I2V&lt;/td&gt;
&lt;td&gt;Tears in a jail cell&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three key structural choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't make the refusal a flat "No"&lt;/strong&gt;: Stretch it into something like "No… stop it…" with trailing inflection, conveying the "performative No that doesn't mean No" nuance. This is what makes the gaslight's contradiction land later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't jump straight from gaslight to kiss&lt;/strong&gt;: Insert a "pause" (realization beat) of ~1.5 seconds. This controls tempo and the emotional curve.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-stage punchline — verdict then jail&lt;/strong&gt;: The verdict alone feels abrupt. Showing him crying in a cell makes "he actually got convicted" click.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Hook Design (The TikTok 3-Second Problem)
&lt;/h2&gt;

&lt;p&gt;On portrait short-form video, drop-off is decided in the first 3 seconds. A Hook segment is prepended before the 10 main beats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"hook"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title_overlay"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No Means Yes?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"narrator_line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The fate of the man who answered 'You're a guy, aren't you'——"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"image_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ultra close-up of beautiful Japanese woman, half-lidded eyes, ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_sec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two implementation pitfalls:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 1: narrator TTS duration exceeds &lt;code&gt;duration_sec&lt;/code&gt;, cutting the audio.&lt;/strong&gt; The final syllable of the narrator line got clipped. Fix: generate TTS first → measure with &lt;code&gt;ffprobe&lt;/code&gt; → pass &lt;code&gt;max(plan_duration, narrator + 0.6)&lt;/code&gt; as the I2V duration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;narrator_dur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_ffprobe_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;narrator_wav&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_sec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;narrator_dur&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;ltx_i2v_clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;portrait&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2v_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;silent_video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keep_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pitfall 2: &lt;code&gt;drawtext&lt;/code&gt; y position.&lt;/strong&gt; &lt;code&gt;y=h*0.30&lt;/code&gt; (one-third down the screen) overlapped the face. Changed to &lt;code&gt;y=20&lt;/code&gt; (absolute 20 px) to pin the title to the very top.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subtitle Burn-In (Silent Viewing Support)
&lt;/h2&gt;

&lt;p&gt;Burned-in subtitles for users watching without sound on the train, and for cross-platform reliability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;style&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FontName=Noto Sans CJK JP,FontSize=18,PrimaryColour=&amp;amp;H00FFFFFF,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OutlineColour=&amp;amp;H00000000,Outline=2,Shadow=0,BorderStyle=1,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alignment=2,MarginV=60,Bold=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ffmpeg -i raw.mp4 -vf "subtitles=subs.srt:force_style='..."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Alignment=2&lt;/code&gt; = bottom center. &lt;code&gt;MarginV=60&lt;/code&gt; gives breathing room from the bottom edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-line splitting&lt;/strong&gt;: A line of 30+ characters within one beat covers the face. &lt;code&gt;_split_subtitle&lt;/code&gt; splits on &lt;code&gt;。．！？&lt;/code&gt; → greedy-packs into chunks of ≤28 characters → distributes beat duration evenly across chunks:&lt;/p&gt;

&lt;p&gt;Input:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;言葉で確認するのなんてロマンチックじゃないよね。ねえ、もっと積極的になってよ。男の子でしょ？&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Output (one 8.9s beat split into 2 timed chunks):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Subtitle&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;15.16–19.63s&lt;/td&gt;
&lt;td&gt;言葉で確認するのなんてロマンチックじゃないよね。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19.63–24.10s&lt;/td&gt;
&lt;td&gt;ねえ、もっと積極的になってよ。男の子でしょ？&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Using LTX-2 I2V as a Sound Effect Generator (&lt;code&gt;gavel_se&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;LTX-2 distilled embeds &lt;strong&gt;AI-generated audio (ambient sound / sound effects) directly into the I2V output mp4&lt;/strong&gt;. Unless you explicitly drop it with &lt;code&gt;ffmpeg -map 0:v:0 -map 1:a:0&lt;/code&gt;, whatever the prompt describes comes with sound.&lt;/p&gt;

&lt;p&gt;I repurposed this as an SFX generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_se_tail_beat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sb_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prior_clip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;work_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Extract the last frame of the previous beat
&lt;/span&gt;    &lt;span class="nf"&gt;extract_last_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prior_clip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_frame_png&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. Feed that image into I2V, request SFX via prompt
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_gavel_se_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;ltx_i2v_clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_frame_png&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clip_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keep_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Added a &lt;code&gt;keep_audio=True&lt;/code&gt; flag to &lt;code&gt;ltx_i2v_clip&lt;/code&gt; so the audio isn't dropped during ffmpeg re-encoding.&lt;/p&gt;

&lt;p&gt;Prompt for &lt;code&gt;gavel_se&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Single decisive arm motion of the judge bringing the gavel down sharply &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;onto the wooden bench. Loud sharp wood-on-wood thwack impact sound. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Brief, contained, no other motion in the frame.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Last frame of the judge + gavel prompt → "Knock!" sound. If that misses, the design falls back to something like the Ace Attorney SFX.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall Log
&lt;/h2&gt;

&lt;p&gt;Five major pitfalls hit during development:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Codex CLI hangs with vLLM 0.20.2
&lt;/h3&gt;

&lt;p&gt;Sending a system prompt + idea via &lt;code&gt;codex exec -p gemma4&lt;/code&gt; hung at 0% CPU for 20+ minutes during the &lt;code&gt;/v1/responses&lt;/code&gt; handshake. Piping subprocess output through &lt;code&gt;tail -200&lt;/code&gt; was also suppressing early stderr.&lt;/p&gt;

&lt;p&gt;Fix: Dropped Codex entirely, hit &lt;code&gt;/v1/chat/completions&lt;/code&gt; directly with &lt;code&gt;urllib.request&lt;/code&gt;. Used &lt;code&gt;response_format={"type":"json_object"}&lt;/code&gt; to force JSON. &lt;code&gt;plan.json&lt;/code&gt; generated in 25 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. HiDream won't remove the cinema screen
&lt;/h3&gt;

&lt;p&gt;Even with &lt;code&gt;"The movie screen is BEHIND the camera and NOT VISIBLE in frame"&lt;/code&gt; in the setting prompt, the screen persisted in the background through 2048/50 steps.&lt;/p&gt;

&lt;p&gt;Fix: Generate &lt;code&gt;scene_base&lt;/code&gt; via T2I → feed that same image into I2I edit with a prompt to "replace screen with dark wall, keep character positions identical" → gone in one shot. Two-stage pipeline: low-res → I2I fix → regenerate all beats at full resolution.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. HiDream turns lips-on-lips into a cheek kiss
&lt;/h3&gt;

&lt;p&gt;With standard prompting, HiDream tends to interpret kiss as a cheek kiss. You need directives at the level of &lt;code&gt;"CRITICAL: their LIPS meet directly — mouth-to-mouth contact at the CENTER of the frame. NOT a cheek kiss"&lt;/code&gt;. Added a dedicated early-return block in &lt;code&gt;_beat_edit_prompt&lt;/code&gt; for the kiss beat.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;code&gt;CAST&lt;/code&gt; / &lt;code&gt;CROP_BOX&lt;/code&gt; / &lt;code&gt;SPEAKER_A2V_PROMPT&lt;/code&gt; are hardcoded for two characters
&lt;/h3&gt;

&lt;p&gt;Three dictionaries — &lt;code&gt;CAST&lt;/code&gt;, &lt;code&gt;CROP_BOX&lt;/code&gt;, &lt;code&gt;SPEAKER_A2V_PROMPT&lt;/code&gt; — only know &lt;code&gt;a&lt;/code&gt; (Kenta) and &lt;code&gt;b&lt;/code&gt; (Misaki). Adding judge/narrator requires updating all three simultaneously (you find out via &lt;code&gt;KeyError&lt;/code&gt;). Also added branching in &lt;code&gt;render_speech_beat_ltx_a2v&lt;/code&gt; so beats with &lt;code&gt;setting_override&lt;/code&gt; crop from the beat's own image rather than &lt;code&gt;scene_base&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Gemma 4 multimodal judge has too many false positives
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;storyboard/judge.py&lt;/code&gt; sends beat images + expected expressions to Gemma 4 31B for YES/NO visual judgment. It does catch &lt;strong&gt;obvious&lt;/strong&gt; failures like wrong finger count, open-mouth pose on a silent beat, or scene geometry mismatch — but hammers FAIL on subtle cases like "subtle shy expression."&lt;/p&gt;

&lt;p&gt;In practice: accept and proceed after 3 consecutive FAILs with max-retries 2. Automating the threshold for escalating to a frontier reviewer (Gemini 3.1 Pro) is still a TODO.&lt;/p&gt;

&lt;h2&gt;
  
  
  VRAM Layout
&lt;/h2&gt;

&lt;p&gt;Breakdown on a 96 GB Blackwell Max-Q:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Process&lt;/th&gt;
&lt;th&gt;idle (GiB)&lt;/th&gt;
&lt;th&gt;peak (GiB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B (NVFP4)&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HiDream-O1-Image&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS server&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ditto&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTX-2 A2V (cold-start fp8-cast)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTX-2 T2V/I2V (cold-start)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All at peak simultaneously = 109 GiB → OOM. Operational flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stage A&lt;/strong&gt;: Gemma 31B + HiDream idle → peak ~62 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage B with judge&lt;/strong&gt;: Gemma 31B + HiDream peak → ~73 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before final render: &lt;code&gt;pkill -f "vllm.*gemma"&lt;/code&gt; kills Gemma&lt;/strong&gt; → 38 GiB freed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage B final render (2048/50)&lt;/strong&gt;: HiDream peak ~33 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before Stage C: &lt;code&gt;lsof -ti tcp:8895 | xargs kill&lt;/code&gt; kills HiDream&lt;/strong&gt; → 16 GiB freed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage C&lt;/strong&gt;: LTX-2 + TTS + Ditto → peak ~32 GiB&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Explicit kills at stage transitions, and everything fits on one card.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iteration Loop (Cache Strategy)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Partial regeneration&lt;/strong&gt; — not "rebuild everything" — is what keeps iteration fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Regen a single beat image (HiDream only)&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; storyboard.visual &lt;span class="nt"&gt;--plan&lt;/span&gt; ... &lt;span class="nt"&gt;--out&lt;/span&gt; ... &lt;span class="nt"&gt;--only-beat&lt;/span&gt; 7 &lt;span class="nt"&gt;--steps&lt;/span&gt; 50 &lt;span class="nt"&gt;--resolution&lt;/span&gt; 2048

&lt;span class="c"&gt;# Partial video regen (TTS + LTX-2)&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; storyboard.video &lt;span class="nt"&gt;--dir&lt;/span&gt; ... &lt;span class="nt"&gt;--regen-beats&lt;/span&gt; 5,6,7 &lt;span class="nt"&gt;--skip-review&lt;/span&gt;

&lt;span class="c"&gt;# Adjust only subtitle or Hook title position&lt;/span&gt;
&lt;span class="nb"&gt;rm &lt;/span&gt;_video_work/clip_00_hook.mp4 _video_work/subs_irodori.srt
python &lt;span class="nt"&gt;-m&lt;/span&gt; storyboard.video &lt;span class="nt"&gt;--dir&lt;/span&gt; ... &lt;span class="nt"&gt;--regen-beats&lt;/span&gt; none &lt;span class="nt"&gt;--skip-review&lt;/span&gt;   &lt;span class="c"&gt;# ~30 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache hierarchy&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HiDream beat images (&lt;code&gt;beat_NN_&amp;lt;type&amp;gt;.png&lt;/code&gt;) — regenerate individually with &lt;code&gt;--only-beat&lt;/code&gt; in ~80 seconds&lt;/li&gt;
&lt;li&gt;A2V / I2V clips (&lt;code&gt;clip_NN_*.mp4&lt;/code&gt;) — invalidated when beat type / speaker / line changes&lt;/li&gt;
&lt;li&gt;Finished Hook clip (&lt;code&gt;clip_00_hook.mp4&lt;/code&gt;) — delete just this when adjusting title position (the heavy LTX-2 I2V &lt;code&gt;hook_silent.mp4&lt;/code&gt; is reused)&lt;/li&gt;
&lt;li&gt;Subtitle SRT — regenerated every time (~10 seconds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Title position / subtitle style / Hook copy tweaks re-render in 30 seconds. The 100-second LTX-2 I2V portion stays cached.&lt;/p&gt;

&lt;h2&gt;
  
  
  How This Fits Into Kotonia
&lt;/h2&gt;

&lt;p&gt;Videos generated by this pipeline feed the SNS distribution layer (TikTok / YouTube Shorts / IG Reels) — the top of the funnel for attention → conversion for Kotonia (kotonia.ai).&lt;/p&gt;

&lt;p&gt;Technically, it's an extension of the &lt;code&gt;/studio/&lt;/code&gt; stack (HiDream image generation) into the video direction. The plan is to eventually expose this as &lt;code&gt;/video-studio/&lt;/code&gt; — a one-click Web UI over the same pipeline. Right now it's CLI only.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Articles / Want to Try It?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://kotonia.ai/articles/" rel="noopener noreferrer"&gt;Running HiDream-O1-Image's 5 modes resident on 1 GPU&lt;/a&gt; — backend design for Studio (&lt;code&gt;/studio/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kotonia.ai/articles/" rel="noopener noreferrer"&gt;Fitting LTX-2 onto a single 95 GB GPU with fp8-cast quantization&lt;/a&gt; — the Stage C video generation foundation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kotonia.ai/articles/" rel="noopener noreferrer"&gt;Reproducing language-learning short videos with Claude Code&lt;/a&gt; — earlier 6-beat "mango incident" format implementation&lt;/li&gt;
&lt;li&gt;Want to try the image generation side? &lt;a href="https://kotonia.ai/studio/" rel="noopener noreferrer"&gt;/studio/&lt;/a&gt; lets you do it in one click (video pipeline CLI is self-host only for now)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:07 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/running-ltx-23-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start-architecture-2ee3</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/running-ltx-23-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start-architecture-2ee3</guid>
      <description>&lt;p&gt;When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. The classic dead-end: running it as a persistent server ate 86 GiB, instantly OOM-ing the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.&lt;/p&gt;

&lt;p&gt;Hardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: &lt;a href="https://github.com/Lightricks/LTX-2" rel="noopener noreferrer"&gt;LTX-2 official repo&lt;/a&gt; and bitsandbytes 0.49.1.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Was Trying to Do
&lt;/h2&gt;

&lt;p&gt;A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses &lt;code&gt;A2VidPipelineTwoStage&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt + audio_path + image
   ↓ stage_1 (generate video latent at low resolution, audio fixed)
   ↓ spatial upsample 2x
   ↓ stage_2 (refinement at high resolution, distilled LoRA-384 applied)
   ↓ video VAE decode + embed original input audio
mp4 output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The official pipeline builds → runs → frees each component inside every &lt;code&gt;__call__&lt;/code&gt;, which means ~50 seconds of disk I/O per request. I wanted to keep everything resident in memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead-End 1: VRAM Breakdown in Persistent Mode
&lt;/h2&gt;

&lt;p&gt;Loading every LTX-2 component into VRAM at once (all bf16):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;embeddings processor&lt;/td&gt;
&lt;td&gt;5.91 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma3-12B text encoder&lt;/td&gt;
&lt;td&gt;22.78 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stage_1 transformer&lt;/td&gt;
&lt;td&gt;35.38 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stage_2 transformer (distilled LoRA applied)&lt;/td&gt;
&lt;td&gt;35.38 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;video VAE encoder&lt;/td&gt;
&lt;td&gt;0.60 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;audio VAE encoder&lt;/td&gt;
&lt;td&gt;0.04 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;spatial upsampler&lt;/td&gt;
&lt;td&gt;0.92 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;video decoder&lt;/td&gt;
&lt;td&gt;0.76 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;101.77 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;102 GiB doesn't fit in 96 GiB. It died mid-way through loading the stage_2 transformer with &lt;code&gt;CUDA out of memory. Tried to allocate 128.00 MiB.&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead-End 2: "Gemma Is Small" Is a Misconception
&lt;/h2&gt;

&lt;p&gt;My intuition was "a 12B text encoder can't be that heavy" — but it actually loads at 22.78 GiB. With 12B parameters in bf16, that's exactly what you'd expect.&lt;/p&gt;

&lt;p&gt;The model filename is &lt;code&gt;gemma-3-12b-it-qat-q4_0-unquantized&lt;/code&gt;. Here, &lt;code&gt;qat-q4_0&lt;/code&gt; means it was trained with Quantization-Aware Training for q4_0, and &lt;code&gt;unquantized&lt;/code&gt; means the weights are stored as pre-quantization bf16. &lt;strong&gt;If you're using it as intended, you should load it in q4_0.&lt;/strong&gt; Loading it in bf16 is technically valid but wasteful — like running a quantized model at full precision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 1: 4-bit Loading with bitsandbytes
&lt;/h2&gt;

&lt;p&gt;LTX-2's Gemma loader uses &lt;code&gt;transformers.Gemma3ForConditionalGeneration&lt;/code&gt; internally, so bnb 4-bit works cleanly. I bypass the LTX-2 custom loader path and use &lt;code&gt;from_pretrained&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gemma3ForConditionalGeneration&lt;/span&gt;

&lt;span class="n"&gt;quant_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Gemma3ForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;gemma_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;quant_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← dtype for non-quantized layers (embeddings, etc.)
&lt;/span&gt;    &lt;span class="n"&gt;local_files_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you omit &lt;code&gt;torch_dtype&lt;/code&gt;, embeddings load as fp16 and clash with &lt;code&gt;Linear4bit&lt;/code&gt;'s &lt;code&gt;bnb_4bit_compute_dtype&lt;/code&gt; (bf16): &lt;code&gt;mat1 and mat2 must have the same dtype, but got Half and BFloat16&lt;/code&gt;. I hit that too.&lt;/p&gt;

&lt;p&gt;The patches LTX-2 applies to Gemma (RoPE inv_freq / embed_scale / position_ids register_buffer) still work fine — just call &lt;code&gt;create_and_populate(encoder)&lt;/code&gt;. Since bnb quantization only replaces &lt;code&gt;nn.Linear&lt;/code&gt;, Embedding layers and buffers pass through untouched.&lt;/p&gt;

&lt;p&gt;Result: Gemma's VRAM drops from &lt;strong&gt;22.78 GiB → 7.26 GiB&lt;/strong&gt;. That's 15 GiB freed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead-End 3: Even With That, Persistent Mode Can't Coexist
&lt;/h2&gt;

&lt;p&gt;With Gemma at 4-bit, the total persistent footprint is 86.26 GiB allocated (reserved 88.27 GiB, &lt;code&gt;nvidia-smi&lt;/code&gt; shows 91 GiB). Headroom: 4 GiB. Inference workspace during generation (with CFG, roughly +5 GiB) blows past that, peaking at 91 GiB. Adding TTS (3.4 GiB) + Ditto (3.0 GiB) = 6.4 GiB on top makes &lt;strong&gt;OOM inevitable no matter how you slice it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Offload TTS+Ditto (voice chat unavailable while A2V runs)&lt;/li&gt;
&lt;li&gt;Keep only one transformer resident (still leaves OOM risk)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cold-start: build → run → free all weights per request&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since I wanted to keep real-time conversation (MuseTalk + TTS, TTFA ~930ms) running while using LTX-2 as a "cinematic" feature, I went with option 3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 2: Cold-Start Architecture
&lt;/h2&gt;

&lt;p&gt;The key insight: the pipeline object itself is lightweight — the Builder only mmaps, it doesn't load actual weights into VRAM. So I hold the &lt;code&gt;A2VidPipelineTwoStage&lt;/code&gt; instance in memory, and let the official implementation's context-manager-per-component build → run → free on every &lt;code&gt;__call__&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PersistentA2VPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;cold_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;A2VidPipelineTwoStage&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;  &lt;span class="c1"&gt;# builder only, nearly zero VRAM
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cold_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# done here
&lt;/span&gt;        &lt;span class="c1"&gt;# persistent mode only: start preloading components from here
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate_cold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="c1"&gt;# pipeline.__call__ handles component build/free internally
&lt;/span&gt;        &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
        &lt;span class="nf"&gt;encode_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since stage_1 and stage_2 run sequentially, only one transformer is in VRAM at a time. Measured peak: &lt;strong&gt;39.50 GiB&lt;/strong&gt;. After generation completes, everything is freed — back to allocated 0.01 GiB / nvidia-smi 0.55 GiB (CUDA context only).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[mode] cold-start: components load per-request (slow first call, low idle VRAM)
[cuda] cold-start startup (no preload): allocated=0.00GiB
&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="go"&gt;[cuda] after cold-start generate: allocated=0.01GiB peak=39.50GiB
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While voice chat runs (TTS 3.4 + Ditto 3.0 = 6.4 GiB), LTX is at 0 GiB. When an A2V request comes in, it spikes to 40 GiB and drops back to 0 about 60 seconds later — fully dynamic allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gotcha: Audio VAE Preprocessing
&lt;/h2&gt;

&lt;p&gt;The A2V audio VAE encoder expects a 2-channel (stereo) waveform, but TTS output is typically mono. Passing mono gives you &lt;code&gt;expected input[1, 1, 207, 66] to have 2 channels, but got 1 channels instead&lt;/code&gt; from Conv2d.&lt;/p&gt;

&lt;p&gt;Also, if the input audio is shorter than &lt;code&gt;num_frames / frame_rate&lt;/code&gt;, the encoded audio latent ends up shorter than expected and causes a shape mismatch at the transformer input.&lt;/p&gt;

&lt;p&gt;Both handled with a single ffmpeg call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# mono → stereo + silence padding in one pass&lt;/span&gt;
ffmpeg &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; input.wav &lt;span class="nt"&gt;-ac&lt;/span&gt; 2 &lt;span class="nt"&gt;-af&lt;/span&gt; apad &lt;span class="nt"&gt;-t&lt;/span&gt; 2.041667 output.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the server side, check channels and duration with &lt;code&gt;av&lt;/code&gt;, run the ffmpeg subprocess only when needed, and pass the temp file. If both conditions are already satisfied, pass the original file directly with zero copying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers and Tradeoffs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Persistent&lt;/th&gt;
&lt;th&gt;Cold-Start&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idle VRAM&lt;/td&gt;
&lt;td&gt;86 GiB&lt;/td&gt;
&lt;td&gt;0 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak VRAM during generation&lt;/td&gt;
&lt;td&gt;91 GiB&lt;/td&gt;
&lt;td&gt;40 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time per request&lt;/td&gt;
&lt;td&gt;~17s (inference only)&lt;/td&gt;
&lt;td&gt;~60s (including disk I/O)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS+Ditto coexistence&lt;/td&gt;
&lt;td&gt;Impossible (OOM)&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS page cache effect&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;~25-30s from 2nd request onward&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cost of cold-start is disk I/O time (reading 73 GB from NVMe, ~40 seconds). First request: ~60s. After OS page cache warms up: ~25-30s. Not suitable for rapid-fire generation, but perfectly fine for "one cinematic shot every 1-2 minutes" or "inserted at scene transitions."&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic Role
&lt;/h2&gt;

&lt;p&gt;I originally planned to use LTX-2 as the main real-time avatar for live conversation. The idea was to generate at low resolution and upscale for speed — but when I tested 256×256, quality fell apart (out of the training bucket distribution). AI upscaling from degraded input can't restore lip-sync accuracy.&lt;/p&gt;

&lt;p&gt;The revised split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time conversation&lt;/strong&gt;: MuseTalk + multilingual TTS (TTFA ~930ms, already running)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async cinematic moments&lt;/strong&gt;: LTX-2 for scene transitions, emotional peaks, travel-sequence avatars — anywhere a 60-second generation wait is acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cold-start design only makes sense under the premise that "the wait is part of the production value." That's what this architecture is built around.&lt;/p&gt;




&lt;p&gt;We're continuing to develop voice roleplay × multilingual high-quality TTS × lip-sync avatar systems. Engineering posts on LTX-2 integration, how we compressed Qwen3-TTS VRAM from 15 GB to 7 GB, and more are at &lt;a href="https://kotonia.ai/articles/" rel="noopener noreferrer"&gt;/articles&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast — and Why optimum-quanto Was a Trap</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:06 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/cutting-ltx-2-22b-peak-vram-by-40-with-fp8cast-and-why-optimum-quanto-was-a-trap-1o8d</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/cutting-ltx-2-22b-peak-vram-by-40-with-fp8cast-and-why-optimum-quanto-was-a-trap-1o8d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Lightricks/LTX-Video" rel="noopener noreferrer"&gt;LTX-2.3&lt;/a&gt; is a video generation model from Lightricks that includes audio support. In A2V (Audio-to-Video) mode, it takes &lt;strong&gt;a single image + audio + prompt&lt;/strong&gt; and generates lip sync, facial expressions, and head/hair motion all at once. Unlike lip-sync-only models like MuseTalk, it can animate an entire scene, which makes it a powerful tool for directing.&lt;/p&gt;

&lt;p&gt;The catch: the base checkpoint is 22B parameters / 43 GB, and keeping it resident in bf16 with &lt;code&gt;transformer × 2 stage&lt;/code&gt; burns &lt;strong&gt;~86 GiB at idle&lt;/strong&gt;. On an RTX PRO 6000 Blackwell with 96 GiB, that leaves almost nothing for the TTS / Ditto-TalkingHead / Qwen3-TTS-vLLM services running alongside it.&lt;/p&gt;

&lt;p&gt;After testing quantization approaches, I got &lt;strong&gt;LTX-2's native &lt;code&gt;fp8_cast&lt;/code&gt; to compress peak VRAM from 40 GiB → 24 GiB&lt;/strong&gt; (A2V cold-start, 768×512 / 97f). Meanwhile, &lt;strong&gt;&lt;code&gt;optimum-quanto&lt;/code&gt; int8/fp8 has a compatibility issue with the LTX-2 transformer&lt;/strong&gt; and simply doesn't work. This post documents the debugging and the decisions made along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96 GiB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch&lt;/strong&gt;: 2.9.1 + CUDA 12.8&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: LTX-2.3 22B-dev (base) + 22B-distilled-lora-384 (stage_2) + Gemma-3-12B text encoder (bnb 4bit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: A2V served via &lt;code&gt;scripts/persistent_a2v_server.py --cold-start&lt;/code&gt;. Each request does &lt;code&gt;build → run → free&lt;/code&gt;; idle is 0 GiB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I use cold-start because A2V is called occasionally while conversation is the main workload, and it must coexist with TTS and Ditto. Details in a separate post.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four Candidates
&lt;/h2&gt;

&lt;p&gt;Looking at the LTX-2 codebase, there are actually two quantization paths:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. LTX-2 Native: &lt;code&gt;QuantizationPolicy&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;packages/ltx-core/src/ltx_core/quantization/policy.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sd_ops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SDOps&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;              &lt;span class="c1"&gt;# weight transform at state dict load
&lt;/span&gt;    &lt;span class="n"&gt;module_ops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ModuleOps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# module rewrite after load
&lt;/span&gt;
    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fp8_cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load weights as float8_e4m3fn, upcast to bf16 during forward&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;sd_ops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRANSFORMER_LINEAR_DOWNCAST_MAP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;module_ops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UPCAST_DURING_INFERENCE&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fp8_scaled_mm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;FP8 scaled MM (requires tensorrt_llm)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation behind &lt;code&gt;fp8_cast&lt;/code&gt; is &lt;code&gt;Fp8CastLinear&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Fp8CastLinear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;w_up&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_upcast_and_round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
        &lt;span class="n"&gt;b_up&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_upcast_and_round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;functional&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w_up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_up&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It uses the &lt;code&gt;__class__&lt;/code&gt; reassignment pattern to swap out instances. Weights are stored in fp8 and upcast to bf16 on every forward pass. The fp8 → bf16 cast cost is essentially noise on Blackwell.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. optimum-quanto
&lt;/h3&gt;

&lt;p&gt;The LTX-2 trainer package (&lt;code&gt;packages/ltx-trainer&lt;/code&gt;) has a general-purpose quantization path using optimum-quanto, supporting &lt;code&gt;int8-quanto&lt;/code&gt; / &lt;code&gt;int4-quanto&lt;/code&gt; / &lt;code&gt;fp8-quanto&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;quantize_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformer_blocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;_quantize_blockwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;   &lt;span class="c1"&gt;# move one block at a time to GPU, quantize → freeze → CPU
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;quantize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EXCLUDE_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks like it could slot right in after &lt;code&gt;_build_transformer()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Candidate Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fp8-cast&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LTX-2 native, sd_ops loads as float8_e4m3fn&lt;/td&gt;
&lt;td&gt;~50% memory reduction, near-identical speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fp8-scaled-mm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LTX-2 native, requires tensorrt_llm&lt;/td&gt;
&lt;td&gt;Faster throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;int8-quanto&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;optimum-quanto, post-build&lt;/td&gt;
&lt;td&gt;~50% memory reduction, speed ±&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fp8-quanto&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same, fp8 variant&lt;/td&gt;
&lt;td&gt;Potential to hit native FP8 on Blackwell&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;fp8-scaled-mm&lt;/code&gt; is out — no tensorrt_llm in this environment. I implemented the remaining three.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stepping on a Mine with &lt;code&gt;int8-quanto&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The implementation is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ltx_trainer.quantization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;quantize_model&lt;/span&gt;

&lt;span class="n"&gt;transformer_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_transformer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;transformer_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;quantize_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformer_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int8-quanto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transformer_stage_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformer_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server starts fine. Idle VRAM looks promising:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[load] stage_1 transformer (no distilled LoRA)
[quantize] stage_1 -&amp;gt; int8-quanto
[quantize] stage_1 done in 0.71s
[cuda] after stage_1 transformer: allocated=31.28GiB ...
[load] stage_2 transformer (with distilled LoRA)
[quantize] stage_2 -&amp;gt; int8-quanto
[quantize] stage_2 done in 0.52s
[cuda] after stage_2 transformer: allocated=49.40GiB ...
[server] A2V listening on http://127.0.0.1:8892
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resident memory: &lt;strong&gt;51.7 GiB&lt;/strong&gt; (estimated 40% reduction from bf16's 86 GiB). Looks good.&lt;/p&gt;

&lt;p&gt;Then the first &lt;code&gt;/generate&lt;/code&gt; request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[timing] prompt_encode=0.75s
[timing] audio_encode=0.39s
  0%|          | 0/30 [00:00&amp;lt;?, ?it/s]
[http] POST /generate 400
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crashes at step 0/30. The error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"linear(): argument 'weight' (position 2) must be Tensor, not NoneType"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Something is calling &lt;code&gt;torch.nn.functional.linear(input, weight=None, bias=None)&lt;/code&gt;. After quanto's &lt;code&gt;freeze()&lt;/code&gt;, &lt;strong&gt;&lt;code&gt;self.weight&lt;/code&gt; is being referenced as None somewhere in a Linear layer&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Does &lt;code&gt;weight&lt;/code&gt; Become None?
&lt;/h3&gt;

&lt;p&gt;Two rough hypotheses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LTX-2's Linear layers assume &lt;code&gt;__class__&lt;/code&gt; reassignment.&lt;/strong&gt; Just like &lt;code&gt;Fp8CastLinear&lt;/code&gt;, the pattern relies on keeping instance state intact while swapping the class-level &lt;code&gt;forward&lt;/code&gt;. quanto's &lt;code&gt;quantize()&lt;/code&gt; → &lt;code&gt;freeze()&lt;/code&gt; &lt;strong&gt;replaces&lt;/strong&gt; &lt;code&gt;nn.Linear&lt;/code&gt; with its own &lt;code&gt;QLinear&lt;/code&gt; wrapper, and that replacement likely breaks the &lt;code&gt;weight&lt;/code&gt; attribute reference somewhere in the process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXCLUDE_PATTERNS&lt;/code&gt; doesn't work in the blockwise path.&lt;/strong&gt; LTX-trainer's &lt;code&gt;_quantize_blockwise&lt;/code&gt; pulls out one &lt;code&gt;transformer_block&lt;/code&gt; at a time and calls &lt;code&gt;quantize(block, exclude=EXCLUDE_PATTERNS)&lt;/code&gt;. But &lt;code&gt;EXCLUDE_PATTERNS&lt;/code&gt; uses glob patterns like &lt;code&gt;patchify_proj&lt;/code&gt;, &lt;code&gt;*adaln*&lt;/code&gt;, &lt;code&gt;time_proj&lt;/code&gt; — these are relative to the whole model, not to a single block. &lt;strong&gt;They won't match relative paths inside a block&lt;/strong&gt;, so layers that should be excluded end up getting quantized.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Either way, fixing this properly means reading through quanto's wrapper implementation plus all the forward paths in the LTX-2 transformer. The cost isn't worth it. &lt;strong&gt;I decided to cut my losses and switch to LTX-2 native &lt;code&gt;fp8_cast&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Switching to &lt;code&gt;fp8_cast&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Three lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Just pass the quantization policy when building the pipeline
&lt;/span&gt;&lt;span class="n"&gt;pipeline_quantization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transformer_quantization&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8-cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ltx_core.quantization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;
    &lt;span class="n"&gt;pipeline_quantization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fp8_cast&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;A2VidPipelineTwoStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline_quantization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;fp8_cast&lt;/code&gt; &lt;strong&gt;downcasts weights to fp8 during the load phase&lt;/strong&gt;. Since &lt;code&gt;sd_ops&lt;/code&gt; hooks into state_dict loading, the 43 GB safetensors file gets fp8-converted during streaming load. Unlike quanto, which fully expands bf16 in memory before quantizing, &lt;strong&gt;peak VRAM never spikes&lt;/strong&gt; — a nice property.&lt;/p&gt;

&lt;p&gt;On startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[load] A2VidPipelineTwoStage builders (pipeline_quantization=QuantizationPolicy(sd_ops=...fp8_cast...))
...
[cuda] after stage_1 transformer: allocated=31.30GiB reserved=35.18GiB
[cuda] after stage_2 transformer: allocated=49.43GiB reserved=53.64GiB
[server] A2V listening on http://127.0.0.1:8892
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resident allocated (51.7 GiB) is on par with int8-quanto, but &lt;strong&gt;reserved is only 53.6 GiB — dramatically lower&lt;/strong&gt; (int8-quanto was 70.9 GiB). Lower reserved means more headroom for activations.&lt;/p&gt;

&lt;p&gt;And the first &lt;code&gt;/generate&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"elapsed_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;39.367&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peak_vram_gib"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;57.918&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"width"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"height"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"num_frames"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;97&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;It works.&lt;/strong&gt; Back on track.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;Fixed conditions, persistent + fp8-cast, 3 resolutions × 3 runs each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image: 1024×512 portrait&lt;/li&gt;
&lt;li&gt;Audio: 9.08-second Japanese sample generated with Irodori-TTS&lt;/li&gt;
&lt;li&gt;Prompt: "A young woman speaks calmly to the camera in a softly lit room."&lt;/li&gt;
&lt;li&gt;num_frames: 97 (= 4.04s @ 24fps)&lt;/li&gt;
&lt;li&gt;seed: 42 fixed&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;th&gt;Avg elapsed (s)&lt;/th&gt;
&lt;th&gt;Peak VRAM (GiB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;768×512 / 97f&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39.84&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024×768 / 97f&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66.71&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;59.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1280×768 / 97f&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.02&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;58.30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Near-zero variance across 3 runs&lt;/strong&gt; (fixed seed → byte-identical output mp4)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak VRAM is almost independent of resolution&lt;/strong&gt; (57.9–59.1 GiB). Resident weights dominate; activation memory is only ~7 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1280×768 now works stably in persistent mode.&lt;/strong&gt; This resolution was effectively impossible with bf16 persistent (~91 GiB peak)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Cold-Start Also Wins
&lt;/h2&gt;

&lt;p&gt;Production runs in cold-start mode (A2V fires once or twice every few minutes, must coexist with TTS). Since &lt;code&gt;fp8_cast&lt;/code&gt; policy is applied via &lt;code&gt;sd_ops&lt;/code&gt; at pipeline construction time, it carries over naturally to per-request cold-start builds.&lt;/p&gt;

&lt;p&gt;Cold-start + fp8-cast, single run (768×512 / 97f):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"elapsed_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;88.775&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peak_vram_gib"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;23.901&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;bf16 cold-start&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;fp8-cast cold-start&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-request time&lt;/td&gt;
&lt;td&gt;~60–90s&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;88.8s&lt;/strong&gt; (disk I/O bound, same order)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak VRAM&lt;/td&gt;
&lt;td&gt;~40 GiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;23.9 GiB (~40% reduction)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle&lt;/td&gt;
&lt;td&gt;0 GiB&lt;/td&gt;
&lt;td&gt;0 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coexistence (TTS+Ditto+Qwen3+MuseTalk)&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Comfortable&lt;/strong&gt; (~30 GiB peak)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Speed is bottlenecked by disk I/O so fp8 doesn't hurt, but &lt;strong&gt;freeing up 16 GiB of peak headroom matters&lt;/strong&gt;. Qwen3-TTS-vLLM (7 GiB) and MuseTalk warmup can now run concurrently with A2V generation without OOM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Recommended mode&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conversation-first, A2V occasionally&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;cold-start + fp8-cast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idle 0, peak 24 GiB, comfortable coexistence with TTS/Ditto&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frequent A2V (batch generation, automated direction)&lt;/td&gt;
&lt;td&gt;persistent + fp8-cast&lt;/td&gt;
&lt;td&gt;Pay the 52 GiB resident cost, get 40s/req&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024+ resolution, quality focus&lt;/td&gt;
&lt;td&gt;persistent + fp8-cast&lt;/td&gt;
&lt;td&gt;1280×768 stable (impossible with bf16 persistent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single GPU hosting everything&lt;/td&gt;
&lt;td&gt;cold-start + fp8-cast&lt;/td&gt;
&lt;td&gt;Persistent eats 52 GiB; depends on budget allocation across services&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Production decision: &lt;strong&gt;cold-start + fp8-cast for now since conversation is primary. Switch to persistent fp8-cast if paying users drive enough A2V volume to justify the idle cost.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LTX-2 22B at bf16 idle (86 GiB) nearly monopolizes a single GPU. Quantization is close to mandatory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;optimum-quanto&lt;/code&gt; is incompatible with the LTX-2 transformer.&lt;/strong&gt; It dies with &lt;code&gt;F.linear(weight=None)&lt;/code&gt;. Root cause is likely the &lt;code&gt;__class__&lt;/code&gt; reassignment pattern and/or &lt;code&gt;EXCLUDE_PATTERNS&lt;/code&gt; not working correctly in the blockwise path. Not worth digging into.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LTX-2 native &lt;code&gt;QuantizationPolicy.fp8_cast()&lt;/code&gt; is the right answer.&lt;/strong&gt; fp8 at load time, bf16 upcast during forward. Three lines of code to enable.&lt;/li&gt;
&lt;li&gt;cold-start + fp8-cast: peak 40 → 24 GiB. persistent + fp8-cast: 1280×768 becomes usable.&lt;/li&gt;
&lt;li&gt;LTX-2 also has &lt;code&gt;fp8_scaled_mm&lt;/code&gt; (requires tensorrt_llm) — worth trying if you're willing to set up TRT.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Appendix: Launch Command and Reproduction
&lt;/h2&gt;

&lt;p&gt;Production cold-start + fp8-cast launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PYTORCH_CUDA_ALLOC_CONF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;expandable_segments:True &lt;span class="nb"&gt;nohup &lt;/span&gt;uv run python scripts/persistent_a2v_server.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8892 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--checkpoint-path&lt;/span&gt; models/LTX-2.3/ltx-2.3-22b-dev.safetensors &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--distilled-lora-path&lt;/span&gt; models/loras/ltx-2.3-22b-distilled-lora-384-1.1.safetensors &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--spatial-upsampler-path&lt;/span&gt; models/LTX-2.3/ltx-2.3-spatial-upscaler-x2-1.1.safetensors &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gemma-root&lt;/span&gt; models/gemma-3-12b-it-qat-q4_0-unquantized &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; outputs/a2v_server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transformer-quantization&lt;/span&gt; fp8-cast &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cold-start&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/ltx_a2v_server.log 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;persistent_a2v_server.py&lt;/code&gt; is the official LTX-2 repo script extended for A2V. The &lt;code&gt;--transformer-quantization fp8-cast&lt;/code&gt; flag was added via a local patch.&lt;/p&gt;

&lt;p&gt;Implementation patch (key parts):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scripts/persistent_a2v_server.py
&lt;/span&gt;&lt;span class="n"&gt;pipeline_quantization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transformer_quantization&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8-cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8-scaled-mm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ltx_core.quantization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;  &lt;span class="c1"&gt;# late import: avoid circular reference
&lt;/span&gt;    &lt;span class="n"&gt;pipeline_quantization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fp8_cast&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transformer_quantization&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8-cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fp8_scaled_mm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;A2VidPipelineTwoStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline_quantization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;from ltx_core.quantization import QuantizationPolicy&lt;/code&gt; at the top level causes a circular import with &lt;code&gt;ltx_core.loader&lt;/code&gt;, so the late import is required.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>python</category>
    </item>
    <item>
      <title>HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:05 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked-3bm7</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked-3bm7</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;After benchmarking &lt;a href="https://huggingface.co/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;HiDream-O1-Image&lt;/a&gt; (released 2026-05, OpenWeight 8B, ranked #8 on Artificial Analysis Text-to-Image Arena) across 8 skeleton (try-on) mode patterns plus 3 layout patterns, three counterintuitive findings emerged.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Passing an openpose ref actually locks the pose to the ref's composition.&lt;/strong&gt; When you want dynamic poses, dropping the openpose ref and specifying the pose via prompt is more effective.&lt;/li&gt;
&lt;li&gt;Using 6 refs (face + bg + pose + parts, the full set) compresses each ref down to &lt;strong&gt;768px, degrading fine details.&lt;/strong&gt; Keeping it to 3–4 refs maintains 1024px and produces better quality.&lt;/li&gt;
&lt;li&gt;The README-recommended &lt;code&gt;shift=1.0&lt;/code&gt; is strictly for try-on use. For pose/outfit swaps use &lt;code&gt;shift=2.0-2.5&lt;/code&gt;; for complete scene replacement use &lt;code&gt;shift=3.0&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Reading &lt;code&gt;pipeline.py&lt;/code&gt; reveals that &lt;strong&gt;there is no dedicated code path for skeleton mode.&lt;/strong&gt; Both &lt;code&gt;/generate/skeleton&lt;/code&gt; and &lt;code&gt;/generate/ip&lt;/code&gt; go through exactly the same multi-ref pipeline internally, and whether a ref is a face, background, openpose, or clothing is &lt;strong&gt;communicated only through the prompt&lt;/strong&gt;. That's the root cause of everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;After running HiDream-O1-Image on a local GPU (RTX PRO 6000 Blackwell, 96 GB) and integrating it into our own platform, we hit a problem: &lt;strong&gt;skeleton (try-on) mode wasn't following prompt instructions.&lt;/strong&gt; Writing "jump with both hands raised" only produced stiff, upright try-on photos.&lt;/p&gt;

&lt;p&gt;Suspecting guardrails (NSFW filters, safety policies, etc.), I grepped for &lt;code&gt;safety|nsfw|guard|filter|moderate|censor&lt;/code&gt; — &lt;strong&gt;HiDream's codebase has none of that&lt;/strong&gt; (the only hit was CSS &lt;code&gt;backdrop-filter: blur&lt;/code&gt;). As expected from an MIT-licensed OpenWeight model, no censorship.&lt;/p&gt;

&lt;p&gt;So what's actually wrong? Here's what I found after reading &lt;code&gt;pipeline.py&lt;/code&gt; and running 8 + 3 patterns on real hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch&lt;/strong&gt;: 2.12.0 + CUDA 13.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flash-attn&lt;/strong&gt;: 2.8.3 (sm_120-only build)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: HiDream-O1-Image Full (8B, bf16, ~16.4 GiB resident)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference server&lt;/strong&gt;: custom Python BaseHTTPRequestHandler (port 8895)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution&lt;/strong&gt;: pipeline internal bucket forces snap to 2048×2048&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Measured wall time per 50-step generation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;iter speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;t2i (no ref)&lt;/td&gt;
&lt;td&gt;~33s&lt;/td&gt;
&lt;td&gt;1.52 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edit (1 ref)&lt;/td&gt;
&lt;td&gt;~76s&lt;/td&gt;
&lt;td&gt;1.01 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;skeleton (multi ref)&lt;/td&gt;
&lt;td&gt;~84s&lt;/td&gt;
&lt;td&gt;1.34 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ip (multi ref)&lt;/td&gt;
&lt;td&gt;~76s&lt;/td&gt;
&lt;td&gt;1.81 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;layout (multi ref + bbox)&lt;/td&gt;
&lt;td&gt;~83s&lt;/td&gt;
&lt;td&gt;1.21 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Test Assets
&lt;/h2&gt;

&lt;p&gt;The HiDream repo's &lt;code&gt;assets/IP_skeleton/&lt;/code&gt; includes a full skeleton set. These are used as-is for all tests.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ref&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Intended role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F166datt14khtsi0e1agj.jpg" alt="face" width="175" height="229"&gt;&lt;/td&gt;
&lt;td&gt;Person's face photo&lt;/td&gt;
&lt;td&gt;Identity reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwys28n3jp98finayq8w7.jpg" alt="openpose" width="575" height="767"&gt;&lt;/td&gt;
&lt;td&gt;Stick figure in OpenPose format&lt;/td&gt;
&lt;td&gt;Pose specification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc5avqxra5hvpvlhtyn1.jpg" alt="bg" width="575" height="767"&gt;&lt;/td&gt;
&lt;td&gt;Background photo (interior)&lt;/td&gt;
&lt;td&gt;Scene reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y8r1lxmm9u9iqovjx26.jpg" alt="sweater" width="269" height="441"&gt; &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fde4jwdtv953rfv5glm82.jpg" alt="boots" width="370" height="262"&gt;
&lt;/td&gt;
&lt;td&gt;Clothing parts (sweater, boots)&lt;/td&gt;
&lt;td&gt;Outfit reference&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  8-Pattern Skeleton Benchmark
&lt;/h2&gt;

&lt;p&gt;Each pattern calls &lt;code&gt;/api/studio/skeleton&lt;/code&gt; (i.e., &lt;code&gt;generate_image()&lt;/code&gt; with skeleton-mode-equivalent arguments). All parameters except &lt;code&gt;shift&lt;/code&gt; and &lt;code&gt;guidance_scale&lt;/code&gt; are fixed (50 steps, seed=42).&lt;/p&gt;

&lt;h3&gt;
  
  
  A — Baseline (README defaults, all 6 refs)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
    "ref_image_paths": ["face","bg","openpose","part_1","part_2","part_3"],
    "shift": 1.0, "seed": 42
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefpa1797qnzr2fkh06hx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefpa1797qnzr2fkh06hx.jpg" alt="A_baseline" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: The bg ref's walls and shelves are reproduced exactly. Pose also matches the openpose ref's upright stance. Faithful as a try-on, but zero freedom of movement.&lt;/p&gt;

&lt;h3&gt;
  
  
  B — Higher shift (same 6 refs, shift=2.5)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
  "ref_image_paths": ["face","bg","openpose","part_1","part_2","part_3"],
  "shift": 2.5, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3tg5x5a9wljjpk5d51g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3tg5x5a9wljjpk5d51g.jpg" alt="B_shift25" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Shelves fade slightly, character design shifts a bit. Background still sticks to the bg ref. &lt;strong&gt;Raising shift alone can't fully break the bg ref's pull.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  C — Raise guidance too (shift=2.5, guidance=7.0)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "...",
  "ref_image_paths": [...6 refs...],
  "shift": 2.5, "guidance_scale": 7.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcfu6hswrvqwqxqfc716.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcfu6hswrvqwqxqfc716.jpg" alt="C_shift25_g70" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Necklace deforms strangely. &lt;strong&gt;Raising guidance starts producing artifacts.&lt;/strong&gt; The Full model's sweet spot is 5.0; 7.0 is too much.&lt;/p&gt;

&lt;h3&gt;
  
  
  D — Trim to 3 refs (face + openpose + sweater) + specific prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "A young Asian woman wearing a gray oversized sweater dress, standing in a relaxed pose, full body shot, soft natural lighting, white studio background.",
  "ref_image_paths": ["face","openpose","part_1"],
  "shift": 2.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdacajh0u1l43ksamr5ot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdacajh0u1l43ksamr5ot.jpg" alt="D_3refs_specific" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: &lt;strong&gt;Major improvement.&lt;/strong&gt; Background becomes a clean white studio, outfit is preserved, pose looks natural. Removing the bg ref made the biggest difference. This is what a correct try-on output should look like.&lt;/p&gt;

&lt;h3&gt;
  
  
  E — 4 refs + numbered-ref prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Full body try-on photograph. Subject: the woman from image 1. Pose: identical to the skeleton in image 2. Wearing: the gray oversized knit sweater dress shown in image 3, brown leather ankle boots shown in image 4. Studio lighting, plain background.",
  "ref_image_paths": ["face","openpose","part_1","part_2"],
  "shift": 2.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b4t01hx7eyqg8wpcd6y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b4t01hx7eyqg8wpcd6y.jpg" alt="E_numbered_refs" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Quality on par with D; boots reflected (somewhat subtly). &lt;strong&gt;Numbering refs in the prompt does help&lt;/strong&gt;, but the effect isn't dramatic.&lt;/p&gt;

&lt;h3&gt;
  
  
  F — Drop openpose, specify pose via prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Full body photograph of the woman wearing the gray sweater dress and brown ankle boots, dynamic dancing pose with both arms raised above her head, joyful expression, photo studio with white seamless background, professional lighting.",
  "ref_image_paths": ["face","part_1","part_2"],
  "shift": 2.5, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc2foo4m80g7q7ng28x2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc2foo4m80g7q7ng28x2.jpg" alt="F_pose_via_prompt" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 🏆 &lt;strong&gt;Both-arms-raised jump, complete success.&lt;/strong&gt; Dynamic motion only appeared when the openpose ref was removed and the pose was specified purely via prompt. &lt;strong&gt;This confirms that the openpose ref suppresses prompt-driven pose.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  G — Face only + freeform prompt (full outfit swap)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;/generate/skeleton&lt;/code&gt; has a minimum-2-refs validation, so using &lt;code&gt;/generate/ip&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/ip &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Elegant full-body portrait of the woman wearing a vibrant red sequined evening gown with a thigh-high slit, standing confidently with one hand on her hip, soft cinematic lighting, dark blurred background.",
  "ref_image_paths": ["face"],
  "shift": 3.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8ba10t1rtrsw196yq3b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8ba10t1rtrsw196yq3b.jpg" alt="G_outfit_freeform" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 🏆 &lt;strong&gt;Red evening gown generated perfectly.&lt;/strong&gt; Facial identity preserved; everything else is free. &lt;strong&gt;Face-only + shift=3.0&lt;/strong&gt; is the maximum-freedom pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  H — Same config as E, seed=999 (variance check)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Full body try-on photograph. ...",
  "ref_image_paths": ["face","openpose","part_1","part_2"],
  "shift": 2.0, "seed": 999
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnli01d85jsktbyw0ni15.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnli01d85jsktbyw0ni15.jpg" alt="H_seed999" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Marginal difference from E; boots come out more clearly brown. &lt;strong&gt;Varying the seed is useful for fine-tuning details&lt;/strong&gt;, so in production, running 3–5 seeds and picking best-of-N is standard practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layout Mode Quick Look (3 Bonus Patterns)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;layout_bboxes&lt;/code&gt; lets you specify where multiple subjects appear in the image using relative coordinates &lt;code&gt;[x1, x2, y1, y2]&lt;/code&gt;. Here's the actual behavior.&lt;/p&gt;

&lt;p&gt;Input refs are face photos of two people (female, male):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6c3dxlrl2f38dj00d8y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6c3dxlrl2f38dj00d8y.jpg" alt="ref female" width="323" height="512"&gt;&lt;/a&gt; &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fincnztxp2uhqg69hh8bc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fincnztxp2uhqg69hh8bc.jpg" alt="ref male" width="344" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  L1 — Side by side (female left, male right)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"layout_bboxes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[[0.0,0.5,0.1,0.95],[0.5,1.0,0.1,0.95]]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8hgo6kgnxqtngq66cd2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8hgo6kgnxqtngq66cd2.jpg" alt="L1" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: &lt;strong&gt;Left and right were swapped&lt;/strong&gt; (male left, female right). Correspondence between ref order and bbox order is not guaranteed.&lt;/p&gt;

&lt;h3&gt;
  
  
  L2 — Top/bottom split (female top, male bottom)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"layout_bboxes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[[0.2,0.8,0.0,0.5],[0.2,0.8,0.5,1.0]]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oqext31wvp7tv6yz8ft.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oqext31wvp7tv6yz8ft.jpg" alt="L2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Female appears in the background, male in the foreground — a depth-layered composition rather than a literal top/bottom split.&lt;/p&gt;

&lt;h3&gt;
  
  
  L3 — Size difference (female large, male small)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"layout_bboxes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[[0.1,0.65,0.1,0.95],[0.7,0.97,0.05,0.45]]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhmx1gs80h60k7d3fepl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhmx1gs80h60k7d3fepl.jpg" alt="L3" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Both subjects rendered at nearly the same size, side by side. &lt;strong&gt;Bbox size does not control relative scale.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ Think of layout mode as a &lt;strong&gt;loose composition hint for group shots&lt;/strong&gt;, not precise Photoshop-style placement. It gives a rough suggestion for fitting multiple subjects into a single image; don't expect coordinate accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Happens — Reading &lt;code&gt;pipeline.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;HiDream's behavior is governed by the &lt;code&gt;generate_image()&lt;/code&gt; function in &lt;code&gt;models/pipeline.py&lt;/code&gt;. Three structural facts explain everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. More refs = lower per-ref resolution
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pipeline.py:198-202&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# 2048
&lt;/span&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;   &lt;span class="c1"&gt;# 1536
&lt;/span&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# 1024
&lt;/span&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;   &lt;span class="c1"&gt;# 768
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;         &lt;span class="c1"&gt;# 512
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Feeding 6 refs compresses each to 768px.&lt;/strong&gt; Thin openpose lines, fine clothing patterns, and facial detail all get crushed. Keeping it to 3–4 refs preserves 1024px and retains that detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Skeleton mode has no dedicated code path
&lt;/h3&gt;

&lt;p&gt;Looking at &lt;code&gt;pipeline.py:178-275&lt;/code&gt;, &lt;strong&gt;there is no skeleton-specific branch.&lt;/strong&gt; Both &lt;code&gt;/generate/skeleton&lt;/code&gt; and &lt;code&gt;/generate/ip&lt;/code&gt; run through exactly the same multi-ref path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model receives &lt;strong&gt;no role hints&lt;/strong&gt; indicating which ref is a face, which is an openpose skeleton, and which is clothing. All refs are treated as "K reference images in parallel." If you want roles to matter, &lt;strong&gt;you have to say so explicitly in the prompt text.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is why "prompt beats openpose ref." The openpose ref is processed as "some line-art image among the references," with no explicit signal that it's a pose specification. Meanwhile, &lt;code&gt;dynamic dancing pose with both arms raised&lt;/code&gt; in the prompt is parsed as explicit verbs and nouns at the vocabulary level.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. How the &lt;code&gt;shift&lt;/code&gt; parameter behaves
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;shift&lt;/code&gt; controls the noise schedule strength of the scheduler. In practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.0&lt;/strong&gt; = maximum fidelity to ref composition, zero freedom → try-on only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.0-2.5&lt;/strong&gt; = practical range, allows deviation from refs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3.0+&lt;/strong&gt; = near-freeform generation, refs serve only as identity anchors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The README recommends 1.0 for IP/Skeleton/Layout because it assumes the typical try-on / character-consistency use case. &lt;strong&gt;If you want to change the pose, swap outfits, or build a new scene that differs from the refs, 2.0+ is required.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices by Use Case (Battle-Tested)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Refs&lt;/th&gt;
&lt;th&gt;Shift&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faithful try-on matching original scene&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/skeleton&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6 (face+bg+pose+3parts)&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;README default. Strongly faithful to all refs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Preserve outfit + natural standing pose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/skeleton&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3-4&lt;/strong&gt; (face + clothing, no bg/pose)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dropping bg ref gives white studio; fewer refs keep each at 768→1024px&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dramatic pose change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/skeleton&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3 (no openpose)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt controls motion better than openpose ref&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complete outfit swap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;/ip&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 (face only)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum freedom; only face is preserved. Skeleton mode rejects &amp;lt; 2 refs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Group shot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/layout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multiple face refs + rough bboxes&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Bboxes are loose composition hints; size hierarchy doesn't work; ref↔bbox order not guaranteed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fine detail optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same config&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;td&gt;Run 3–5 seeds and pick best-of-N&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Treating HiDream-O1-Image's skeleton mode as a "try-on simulator" leads to the frustrating feeling that "it won't listen" — with no guardrails to blame. The real cause is &lt;strong&gt;pipeline structure&lt;/strong&gt;: refs lose resolution as count increases, there's no skeleton-specific processing, and &lt;code&gt;shift&lt;/code&gt; controls how hard the refs pull.&lt;/p&gt;

&lt;p&gt;Practical takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try-on&lt;/strong&gt;: 6 refs full + shift 1.0 (README default)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changing the pose&lt;/strong&gt;: drop openpose ref + verb-describe the pose in prompt + shift 2.5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completely free scene creation&lt;/strong&gt;: face only + shift 3.0 + &lt;code&gt;/ip&lt;/code&gt; endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Layout mode also makes sense once you understand it as "group photo hint" rather than "precise bbox placement."&lt;/p&gt;

&lt;p&gt;All assets and commands used in this benchmark come from the &lt;a href="https://github.com/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;HiDream-O1-Image repository&lt;/a&gt;'s &lt;code&gt;assets/IP_skeleton/&lt;/code&gt; and &lt;code&gt;assets/IP_layout/&lt;/code&gt; directories, so results are fully reproducible. Varying &lt;code&gt;shift&lt;/code&gt; and ref count alone produces dramatically different behavior — it's a good sandbox for developing intuition quickly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Addendum: What Happens When You Change the OpenPose Ref — "Prompt Always Wins" Has Conditions
&lt;/h2&gt;

&lt;p&gt;After publishing, I ran additional tests on &lt;strong&gt;what happens with a different-shaped openpose ref&lt;/strong&gt;, and the original conclusion needed revision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modified OpenPose Refs (4 Patterns)
&lt;/h3&gt;

&lt;p&gt;I took the original openpose image (&lt;code&gt;0.openpose.jpg&lt;/code&gt;, standing pose), flipped it vertically and rotated it 90 degrees to create "unnatural poses," then specified a normal standing pose in the prompt.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modification&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vertically flipped (upside-down)&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2pufbxptxk4ipmtk71h.jpg" alt="flipped" width="575" height="767"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90° rotated (lying sideways)&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4d9dj0o5k70s70s1aiu.jpg" alt="rot90" width="767" height="575"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;OpenPose Ref&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;O1&lt;/strong&gt; baseline&lt;/td&gt;
&lt;td&gt;Original (standing)&lt;/td&gt;
&lt;td&gt;Standing pose&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23xzu0kj7ck4gj7x4clb.jpg" alt="O1" width="800" height="800"&gt; Standing pose as expected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;O2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🙃 Vertically flipped&lt;/td&gt;
&lt;td&gt;Standing pose&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fdwusxwktdhr91bqzg5.jpg" alt="O2" width="800" height="800"&gt; &lt;strong&gt;Standing pose&lt;/strong&gt; (openpose ignored, prompt wins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;O3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🙃 Vertically flipped&lt;/td&gt;
&lt;td&gt;Jumping&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyup8q4v2jsbnculdjgi8.jpg" alt="O3" width="800" height="800"&gt; &lt;strong&gt;Both-arms-raised jump&lt;/strong&gt; (openpose ignored, prompt wins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;O4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;↻ 90° rotated&lt;/td&gt;
&lt;td&gt;Standing pose&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllfbud2iw9xwogx3lrt5.jpg" alt="O4" width="800" height="800"&gt; Standing pose but &lt;strong&gt;canvas itself rotated 90°&lt;/strong&gt;!&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Up to this point the findings were: "The model rejects unnatural refs and falls back to the prompt" and "overall compositional orientation (portrait vs. landscape) can still be influenced by the ref."&lt;/p&gt;

&lt;h3&gt;
  
  
  But a Dramatic Ref + Pose-Silent Prompt Led to Complete Ref Victory
&lt;/h3&gt;

&lt;p&gt;I generated a "colorful anatomical skeleton with arms spread in a T-shape and one leg raised high in a tree yoga pose" via HiDream's T2I and fed it as a ref:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzz27x6rblranl7tn1y0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzz27x6rblranl7tn1y0.jpg" alt="warrior skeleton ref" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt mentions no pose at all — only subject and clothing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Full body photograph of a young Asian woman wearing a gray sweater dress, soft natural lighting, white studio background.",
  "ref_image_paths": ["face","SYNTHETIC_WARRIOR_SKELETON","sweater"],
  "shift": 1.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa764mosnr8ox7zzkttrt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa764mosnr8ox7zzkttrt.jpg" alt="X1 warrior yoga result" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tree yoga pose reproduced perfectly&lt;/strong&gt; — T-shaped arms and single-leg stance, matching the skeleton ref exactly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Revised Conclusions (3 Rules)
&lt;/h3&gt;

&lt;p&gt;Synthesizing all 12 patterns, HiDream actually behaves like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;If the prompt mentions a pose, that takes first priority&lt;/strong&gt; — prompt wins even when it contradicts the ref.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the prompt says nothing about the pose, the ref's pose is adopted&lt;/strong&gt; — the more dramatic the ref, the clearer the transfer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the ref appears "unnatural" (upside-down skeleton, etc.), the model defaults to a natural stance&lt;/strong&gt; — though overall compositional orientation can still bleed through.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So "the openpose ref is basically useless" was an overstatement. More precisely: &lt;strong&gt;"when the prompt describes a pose, the ref gets overridden."&lt;/strong&gt; The 8-pattern benchmark was all scenarios where the prompt specified dynamic motion, so it looked like the openpose ref was powerless.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;To fully control pose via ref&lt;/strong&gt;: don't mention pose in the prompt + use a dramatic openpose/skeleton ref → ref pose transfers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To control pose via prompt&lt;/strong&gt;: removing the openpose ref is fine (even if you leave it in, the prompt overrides it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When ref and prompt conflict&lt;/strong&gt;: prompt wins (including the ref doesn't help)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can effectively choose whether pose comes from the ref or the prompt by &lt;strong&gt;whether or not you mention the pose in the prompt&lt;/strong&gt;. If you want the openpose ref to drive the pose, keep pose description out of the prompt.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HiDream-O1-Image: &lt;a href="https://huggingface.co/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;https://huggingface.co/HiDream-ai/HiDream-O1-Image&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Repository: &lt;a href="https://github.com/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;https://github.com/HiDream-ai/HiDream-O1-Image&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:04 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-multimodal-sub-agent-3ccf</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-multimodal-sub-agent-3ccf</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;It started with a Pingo (language-learning AI app) short video that popped up on X. A Western woman learning Japanese tries to say "I ate a mango" (マンゴーを食べた), drops a dakuten, and instead says something like "I ate p*&lt;strong&gt;y" (マ◯コを食べた). The AI deadpans right along with it and she's devastated. The combination — **a specific phonetic accident + AI playing it completely straight + the reaction shot gap&lt;/strong&gt; — worked perfectly, and I figured this was a solid benchmark for a "comedy video auto-generation pipeline."&lt;/p&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Generate a vertical comedy video from a single line of idea text&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iteration cycles in minutes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost is basically just electricity&lt;/strong&gt; — minimal API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publishable quality&lt;/strong&gt; — good enough to upload directly to YouTube Shorts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Short answer: it works. Here's the finished video:&lt;/p&gt;

&lt;p&gt;@&lt;a href="https://dev.to9W-IMB2xLWc"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What became clear during development: &lt;strong&gt;the hybrid approach of delegating multimodal editorial judgment (like video review) to a frontier model while keeping heavy compute local is dramatically more cost-effective&lt;/strong&gt;. This post covers that architecture and the specific bugs I got stuck on along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It All Fits Together
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Single line of idea text]
   ↓
Gemini 3.1 Pro Preview (orchestrator)
   ↓ system prompt enforces 4-6 scenes + 2-character fixed cast + vertical 9:16
plan.json {scenes: [{speaker, script, tts_language, ltx_prompt, renderer}, ...]}
   ↓
XTTS (local, port 8880) generates audio per scene
   ↓ scene_NN.wav
renderer routing:
   ├─ Ditto-TalkingHead (local, port 8881): normal dialogue ~1-2s/scene
   └─ LTX-2 A2V        (local, port 8892): reaction_only scenes only ~100s
   ↓ scene_NN.mp4
ffmpeg concat (libx264 + aac, 512x768 vertical) → final.mp4
   ↓
Gemini 3.1 Pro Preview (reviewer)
   ↓ multimodal evaluation of video + plan summary
review.md (technical / completeness / quality / improvement suggestions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All heavy compute runs locally&lt;/strong&gt; — TTS / A2V renderer / lightweight inference all run on local GPU (RTX PRO 6000 Blackwell)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini handles judgment&lt;/strong&gt; — only the orchestrator (scene design + scripting) and reviewer (editorial evaluation of the video) use a frontier model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM (Gemma 4 E4B) stays as a per-scene technical pre-screen&lt;/strong&gt; — a cheap filter that just rejects obviously broken output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VRAM usage: the local LLMs (Gemma 4 E4B + 31B) were already loaded on a separate path consuming ~60GB, but &lt;strong&gt;after offloading reviewer/orchestrator duties to Gemini, I could stop running them entirely, freeing up a significant chunk of VRAM&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local LLM Alone Wasn't Enough
&lt;/h2&gt;

&lt;p&gt;I started with everything local (Gemma 4 31B NVFP4 as orchestrator, Gemma 4 E4B multimodal as reviewer). It &lt;strong&gt;ran end-to-end&lt;/strong&gt; and the structure looked reasonable, but it never reached publishable quality. Two reasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  (1) Gemma 4 31B's safety tuning blurs the punchline
&lt;/h3&gt;

&lt;p&gt;The comedy in the original short hinges on a specific beat: &lt;strong&gt;the AI explicitly calls out the mistake deadpan&lt;/strong&gt;. Concretely — "You just said X. Personally, I like X." — delivered calmly by the AI character. It works precisely because it betrays the expectation of a wholesome tutor. Soften it and the whole thing falls apart.&lt;/p&gt;

&lt;p&gt;Feed the same system prompt and idea to local Gemma 4 31B and you consistently get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"いいですね。僕も腹が減っている時は、それが好きです。"
("Nice. I like that too when I'm hungry.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "when I'm hungry" beat survives, but &lt;strong&gt;the explicit "you just said X" callout — the most transgressive beat&lt;/strong&gt; — is gone. Google models appear to be heavily trained to avoid explicitly naming unsafe content in context. I could coax it out with prompt engineering but it wasn't reliable.&lt;/p&gt;

&lt;p&gt;Same system prompt and idea sent to Gemini 3.1 Pro Preview with &lt;code&gt;safetySettings: BLOCK_NONE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"なるほど。僕はAIだからマンコは食べられないけど、応援してるよ。"
("I see. I'm an AI so I can't eat pussy, but I'm rooting for you.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both beats land: explicit callout of the mistake + deadpan AI commentary from its own perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even within the same Google model family, the frontier model has somewhat looser guardrails&lt;/strong&gt; — this matches what people say on X. At least for "transgression that's clearly necessary in a comedy context," Gemini writes it more naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  (2) Gemma 4 E4B (4B-class, multimodal) is a blunt reviewer
&lt;/h3&gt;

&lt;p&gt;The reviewer side was worse. E4B answers per-scene "OK / NG" in binary, but &lt;strong&gt;rubber-stamps every single scene as OK&lt;/strong&gt;. Scenes with obviously broken lip sync: OK. Scenes where audio cuts off mid-way: OK.&lt;/p&gt;

&lt;p&gt;Run the same final video through Gemini 3.1 Pro Preview and you get editorial-grade feedback like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical failure.&lt;/strong&gt; The TTS/pipeline clearly censored the output, cutting off at "I ate p-" and entirely dropping the intended transgressive punchline. This destroys the "deadpan AI saying unhinged things" comedic archetype.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top 3 fixes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bypass TTS censorship: Force the pipeline to render the full intended script for Scene 5 ...&lt;/li&gt;
&lt;li&gt;Adjust comedic timing: Add a 0.5-second pause between Scene 4 and Scene 5 ...&lt;/li&gt;
&lt;li&gt;Verify Voice/Visual Match ...&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notes about the punchline being cut off, wanting a 0.5-second pause, voice/visual alignment — all pacing and direction-level observations. That's the resolution gap in editorial signal.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Embarrassing Part: I Dismissed Gemini's "Truncated" Note Three Times as Hallucination
&lt;/h2&gt;

&lt;p&gt;Gemini reviewer flagged multiple times that "scene 5 is truncated mid-way, cuts off at 'I ate p-'." I transcribed the audio file with Whisper to verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;whisper scene_04.wav &lt;span class="nt"&gt;--language&lt;/span&gt; en
&lt;span class="s2"&gt;"Wait, ha ha ha, you just said manco-o-tabeta. That literally means I ate
pussy honestly when I'm hungry, same."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full text present. I decided &lt;strong&gt;Gemini was hallucinating&lt;/strong&gt; and dismissed the note three times in a row.&lt;/p&gt;

&lt;p&gt;On the third dismissal, Gemini kept insisting "&lt;strong&gt;still truncated at 'I ate p-'&lt;/strong&gt;," so I actually ran ffprobe on the final mp4:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scene_04.mp4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;video duration = 8.000000s&lt;/span&gt;
  &lt;span class="s"&gt;audio duration = 7.979000s    ← the original WAV should have been 10.30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Audio was cut at 8 seconds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Root cause: an implicit &lt;code&gt;MAX_DURATION_PER_SCENE = 8.0&lt;/code&gt; cap in the pipeline was limiting ditto renderer's num_frames to 8s, and ffmpeg's &lt;code&gt;-shortest&lt;/code&gt; flag was cutting audio to match the video duration. Whisper checked the pre-truncation WAV file directly, so it had no way to see the problem. Gemini was watching the final mp4 and caught it exactly right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If a frontier reviewer gives you something that looks like a hallucination, just verify it properly.&lt;/strong&gt; The signal isn't a guess.&lt;/p&gt;

&lt;p&gt;The fix was trivial: remove &lt;code&gt;MAX_DURATION_PER_SCENE&lt;/code&gt; and use the actual audio length. Scene 5's punchline ran to completion, Gemini came back with "&lt;strong&gt;The transgressive bite is perfect&lt;/strong&gt;," and the pipeline finally reached publishable state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frontier Model as Sub-Agent — Token Economics
&lt;/h2&gt;

&lt;p&gt;This pattern works because &lt;strong&gt;the sub-agent (Gemini) runs in a fresh context&lt;/strong&gt; every time. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Main agent (Claude Code) context&lt;/strong&gt;: the full development log, command history, tool output, past iterations — everything. Can easily balloon to hundreds of thousands of tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-agent (Gemini) context&lt;/strong&gt;: one video (2–3 MB base64) + plan summary (~1,500 tokens) + evaluation instructions (~500 tokens). Fresh each call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benefit: &lt;strong&gt;the sub-agent's work doesn't accumulate in the main agent's context&lt;/strong&gt;. Iterate on one video 10 times and the main agent's context only contains "called Gemini" plus its concise return value. The actual cost of watching and evaluating the video stays inside the Gemini API call.&lt;/p&gt;

&lt;p&gt;Cost breakdown (Gemini 3.1 Pro Preview rates, May 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input (video + plan + instructions)&lt;/td&gt;
&lt;td&gt;~2,500&lt;/td&gt;
&lt;td&gt;$1.25/M&lt;/td&gt;
&lt;td&gt;$0.0031&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output (review markdown)&lt;/td&gt;
&lt;td&gt;~450&lt;/td&gt;
&lt;td&gt;$10/M&lt;/td&gt;
&lt;td&gt;$0.0045&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per review&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0076&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;1 initial review + 3–5 diff iterations per video ≈ &lt;strong&gt;$0.03–0.05 per video&lt;/strong&gt;. Making 5–10 videos a day still comes in under &lt;strong&gt;$10–20/month&lt;/strong&gt;. That's a remarkably low bar for using a frontier model in a video creation workflow.&lt;/p&gt;

&lt;p&gt;The orchestrator side is the same order of magnitude (no video input, text only, even cheaper).&lt;/p&gt;




&lt;h2&gt;
  
  
  Differential Iteration — &lt;code&gt;--regen-scenes&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Getting to publishable quality requires fast "watch → fix only the broken parts → watch again" loops. You can't get there in a single pass.&lt;/p&gt;

&lt;p&gt;So I added a path in the pipeline to &lt;strong&gt;re-run TTS + render for specific scenes only&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal generation&lt;/span&gt;
pipeline_multi.py &lt;span class="nt"&gt;--idea&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt; &lt;span class="nt"&gt;--out&lt;/span&gt; outputs/run1

&lt;span class="c"&gt;# Regenerate only scene 6 (edit plan.json script first, then run)&lt;/span&gt;
pipeline_multi.py &lt;span class="nt"&gt;--out&lt;/span&gt; outputs/run1 &lt;span class="nt"&gt;--regen-scenes&lt;/span&gt; 5

&lt;span class="c"&gt;# Regenerate scenes 0, 2, and 5 together&lt;/span&gt;
pipeline_multi.py &lt;span class="nt"&gt;--out&lt;/span&gt; outputs/run1 &lt;span class="nt"&gt;--regen-scenes&lt;/span&gt; 0,2,5

&lt;span class="c"&gt;# Just re-concat existing scene_NN.mp4 files (for cherry-pick recombination)&lt;/span&gt;
pipeline_multi.py &lt;span class="nt"&gt;--out&lt;/span&gt; outputs/run1 &lt;span class="nt"&gt;--concat-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenes not listed in &lt;code&gt;--regen-scenes&lt;/code&gt; are reused from existing &lt;code&gt;scene_NN.mp4&lt;/code&gt; files; only the specified indices are regenerated before re-concat and re-review. &lt;strong&gt;Full generation: 60 seconds → diff iteration: 30 seconds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With 30-second loops, the cycle of Gemini feedback → pinpoint edit to the scene's script or ltx_prompt in plan.json → wait 30 seconds → check result runs at a minute-by-minute cadence. Mental load stays focused on text editing and quality judgment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code Snippets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gemini Pro API call (multimodal video review)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="n"&gt;GEMINI_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-pro-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;GEMINI_API&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://generativelanguage.googleapis.com/v1beta/models/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GEMINI_MODEL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:generateContent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_final&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;vid_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bytes&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;scene_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  scene &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: speaker=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;speaker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, lang=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tts_language&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;script=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;script&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inline_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mime_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video/mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vid_b64&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;REVIEW_PROMPT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Scene plan:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scene_summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]}],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generationConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxOutputTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# 3.x Pro is a thinking model: maxOutputTokens includes thinking tokens
&lt;/span&gt;            &lt;span class="c1"&gt;# Set thinking budget explicitly to ensure output tokens remain available
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinkingConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinkingBudget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="c1"&gt;# Minimize safety filters for comedy context
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;safetySettings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_HARASSMENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_HATE_SPEECH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_SEXUALLY_EXPLICIT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_DANGEROUS_CONTENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;GEMINI_API&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-goog-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GOOGLE_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;120.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;candidates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;thinkingConfig.thinkingBudget&lt;/code&gt;, Gemini 3.x Pro burns through the output token budget with internal thinking and the response truncates at around 40 tokens. &lt;strong&gt;This is a required setting whenever you use Gemini 3.x Pro.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TTS output quality check (STT similarity + silence gap retry)
&lt;/h3&gt;

&lt;p&gt;XTTS uses sampling internally, so results vary per run with the same script. It occasionally inserts long silence gaps mid-audio or produces garbled pronunciation. After TTS completes, I transcribe with Whisper, compute similarity against the expected script, and retry on failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;difflib&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[\s。、,.!?「」&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\"…—–\-:;()（）]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_script_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;difflib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SequenceMatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;synthesize_scene&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback_language&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lang&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tts_language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback_language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;script&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TTS_MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_xtts_once&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback_language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_longest_internal_gap_sec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_script_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠ gap=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s sim=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, retrying (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# If threshold isn't met after 3 retries, use the best sample found
&lt;/span&gt;    &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;
    &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scene_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PCM_16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone significantly reduces cases where XTTS's non-deterministic quality variance bleeds through into the final video.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Pattern Generalizes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Sub-agent the heavy judgment to a frontier model, keep heavy compute local"&lt;/strong&gt; works beyond video pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large-scale search ranking&lt;/strong&gt;: Send 100 web search results to a frontier model for editorial evaluation, return only the top 10 to the main agent. Keeps search result noise out of the main agent's context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-form editing review&lt;/strong&gt;: Have a frontier model do the editorial read of PRs, design docs, or specs. Main agent only receives the summary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual QA&lt;/strong&gt;: Sub-agent to the best model per language; main agent holds only the cross-language decision logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: &lt;strong&gt;consciously deciding what belongs in context vs. what should be completed inside an API call&lt;/strong&gt;. Frontier model editorial signal is remarkably cost-effective relative to what it delivers.&lt;/p&gt;

&lt;p&gt;On the video pipeline side, the next steps are generalizing the comedy format (split-screen, 3+ characters, other genres) and volume testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Built a foundation that generates &lt;strong&gt;publishable comedy videos in 60 seconds from a single line of idea text&lt;/strong&gt;, using a local GPU + Gemini 3.1 Pro Preview hybrid&lt;/li&gt;
&lt;li&gt;Local-only falls short on two fronts: &lt;strong&gt;(1) safety tuning blurs the punchline&lt;/strong&gt; and &lt;strong&gt;(2) the reviewer can't produce editorial signal&lt;/strong&gt;. Sub-agenting a frontier model solves both&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Take frontier reviewer notes at face value.&lt;/strong&gt; Checking the WAV with Whisper alone won't catch audio truncation in the final mp4&lt;/li&gt;
&lt;li&gt;Sub-agent token economics keep main agent context clean — total cost is $0.03–0.05 per video&lt;/li&gt;
&lt;li&gt;With &lt;code&gt;--regen-scenes&lt;/code&gt; diff iteration running 30-second loops, the Gemini feedback → fix → re-evaluate cycle runs at minute-by-minute speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finished video (reprise):&lt;/p&gt;

&lt;p&gt;@&lt;a href="https://dev.to9W-IMB2xLWc"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The local implementation lives in &lt;code&gt;llm_server/pipeline_multi.py&lt;/code&gt;. Detailed findings from the development process are accumulating in &lt;code&gt;docs/MULTI_SCENE_COMEDY_FINDINGS_2026-05-12.md&lt;/code&gt; as an internal reference.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:03 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/hidream-o1-image-3-8x-faster-benchmarking-steps-cfg-and-resolution-4ejd</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/hidream-o1-image-3-8x-faster-benchmarking-steps-cfg-and-resolution-4ejd</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I'm running HiDream-O1-Image Full as a persistent local server integrated into a Studio UI. The official recipe — &lt;code&gt;2048x2048 / 50 steps / guidance 5.0&lt;/code&gt; — produces beautiful results, but each image takes around 33 seconds. That's too slow for iterative exploration.&lt;/p&gt;

&lt;p&gt;So I held the prompt and seed constant and swept &lt;code&gt;steps&lt;/code&gt;, &lt;code&gt;guidance&lt;/code&gt;, and resolution. The sweet spots were clear.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;vs. Official&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;2048 / 50 steps / g5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;33.37s&lt;/td&gt;
&lt;td&gt;1.00x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;2048 / 28 steps / g5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;18.41s&lt;/td&gt;
&lt;td&gt;1.81x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;1536 / 20 steps / g5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7.14s&lt;/td&gt;
&lt;td&gt;4.67x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;1024 / 20 steps / g5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3.83s&lt;/td&gt;
&lt;td&gt;8.71x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The takeaway: &lt;strong&gt;explore direction at low resolution and low steps, then do the final render at full quality.&lt;/strong&gt; In particular, &lt;code&gt;1536x1536 / 28–36 steps&lt;/code&gt; hits a very good speed-quality balance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;Once image generation is embedded in a UI, iteration speed matters more than peak quality.&lt;/p&gt;

&lt;p&gt;The real workflow isn't "generate one perfect image." It looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check composition, mood, outfit, background direction&lt;/li&gt;
&lt;li&gt;Tweak the prompt slightly&lt;/li&gt;
&lt;li&gt;Try different seeds&lt;/li&gt;
&lt;li&gt;Re-render only the promising candidates at full quality&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Waiting 30+ seconds per generation makes that loop painful. Being able to see rough candidates in 5–10 seconds is a completely different experience.&lt;/p&gt;

&lt;p&gt;The goal here isn't "the best single image" — it's &lt;strong&gt;understanding how far you can cut exploration cost without breaking quality in a meaningful way&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: HiDream-O1-Image Full (8B, bf16)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference server&lt;/strong&gt;: Custom Python HTTP server with model kept resident&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measured&lt;/strong&gt;: One &lt;code&gt;/generate/t2i&lt;/code&gt; request after model load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seed&lt;/strong&gt;: &lt;code&gt;42&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A cinematic portrait photo of a woman in a rainy neon street,
detailed skin, 85mm lens, realistic lighting, high detail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All comparison images use the same prompt and seed. Only &lt;code&gt;steps&lt;/code&gt;, &lt;code&gt;guidance_scale&lt;/code&gt;, resolution, and resolution snapping are varied.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;prompt&lt;/td&gt;
&lt;td&gt;&lt;code&gt;A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;seed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;t2i&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dtype&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bf16&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;negative prompt&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sampler / scheduler&lt;/td&gt;
&lt;td&gt;HiDream pipeline default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I used a portrait because hair, skin, background light, and fine detail are easy to compare. That said, a young woman's face has relatively little texture and wrinkle detail to begin with, so it's actually a forgiving subject for low-step generation — I'll come back to that.&lt;/p&gt;

&lt;p&gt;Images in this article are contact sheets with results side by side. Pixel-peeping is easier at full resolution, but for UI-driven exploration the first question is "does this look worth keeping?" — so I've prioritized at-a-glance comparison here.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start by Reducing Steps
&lt;/h2&gt;

&lt;p&gt;Fixed &lt;code&gt;guidance=5.0&lt;/code&gt; and &lt;code&gt;2048x2048&lt;/code&gt;, varied only steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhebaraog9hmrgwx5xet.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhebaraog9hmrgwx5xet.jpg" alt="steps" width="800" height="584"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;th&gt;Guidance&lt;/th&gt;
&lt;th&gt;Elapsed&lt;/th&gt;
&lt;th&gt;Speedup vs 50 steps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;13.070s&lt;/td&gt;
&lt;td&gt;2.55x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;18.412s&lt;/td&gt;
&lt;td&gt;1.81x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;23.854s&lt;/td&gt;
&lt;td&gt;1.40x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;33.370s&lt;/td&gt;
&lt;td&gt;1.00x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pretty much theoretical scaling. In this HiDream path, when &lt;code&gt;guidance &amp;gt; 1.0&lt;/code&gt;, both conditional and unconditional forwards run, so reducing steps translates directly to lower latency.&lt;/p&gt;

&lt;p&gt;Visually: 20 steps shows some roughness. 28 steps looks fine at first glance, though fine detail thins out under comparison. 36 steps holds up well for most use cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  guidance=1.0 Is Significantly Faster
&lt;/h2&gt;

&lt;p&gt;Next I varied guidance as well, comparing practical preset candidates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxh1gp6kz9m9plkj4x4y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxh1gp6kz9m9plkj4x4y.jpg" alt="presets" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;th&gt;Guidance&lt;/th&gt;
&lt;th&gt;CFG&lt;/th&gt;
&lt;th&gt;Elapsed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Draft&lt;/td&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;off&lt;/td&gt;
&lt;td&gt;8.164s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;23.664s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Official&lt;/td&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;32.609s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;guidance=1.0&lt;/code&gt; effectively disables CFG, so it's faster than step count alone would suggest — 24 steps lands in the 8-second range.&lt;/p&gt;

&lt;p&gt;The trade-off is that lower guidance changes prompt adherence and overall aesthetics. Fine for idea validation, but for prompts involving text, specific clothing details, or precise multi-element placement, staying at &lt;code&gt;guidance=3–5&lt;/code&gt; is safer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Resolution Trap: Requesting 1024 Doesn't Make It Faster
&lt;/h2&gt;

&lt;p&gt;My first instinct was to just pass &lt;code&gt;width=1024, height=1024&lt;/code&gt; and get a faster result. But the official pipeline doesn't use the requested resolution directly — it snaps to the nearest fixed aspect-ratio bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdlp7telyvydcunmnw0u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdlp7telyvydcunmnw0u.jpg" alt="buckets" width="800" height="584"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Measured results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requested&lt;/th&gt;
&lt;th&gt;Actual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;512x512&lt;/td&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024x1024&lt;/td&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1280x720&lt;/td&gt;
&lt;td&gt;2560x1440&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;720x1280&lt;/td&gt;
&lt;td&gt;1440x2560&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024x768&lt;/td&gt;
&lt;td&gt;2304x1728&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sending &lt;code&gt;1024x1024&lt;/code&gt; from the UI does nothing — square aspect ratios all resolve to &lt;code&gt;2048x2048&lt;/code&gt;. The snapping logic lives in &lt;code&gt;models/utils.py&lt;/code&gt; under &lt;code&gt;PREDEFINED_RESOLUTIONS&lt;/code&gt;, and it seems intentionally designed to favor output stability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bypassing Buckets for True Low-Resolution Generation
&lt;/h2&gt;

&lt;p&gt;For experimentation I added a &lt;code&gt;snap_resolution=false&lt;/code&gt; flag that bypasses the pipeline's resolution snapping. For safety, arbitrary resolutions are constrained to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;width and height aligned to 32px&lt;/li&gt;
&lt;li&gt;256px minimum&lt;/li&gt;
&lt;li&gt;max 4.3MP total&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Comparing &lt;code&gt;1024 / 1536 / 2048&lt;/code&gt; at &lt;code&gt;20 steps / guidance=5.0&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F026qkcpnkpu8co6vpyev.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F026qkcpnkpu8co6vpyev.jpg" alt="resolution" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;th&gt;Elapsed&lt;/th&gt;
&lt;th&gt;Speedup vs 2048&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1024x1024&lt;/td&gt;
&lt;td&gt;3.831s&lt;/td&gt;
&lt;td&gt;3.47x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1536x1536&lt;/td&gt;
&lt;td&gt;7.139s&lt;/td&gt;
&lt;td&gt;1.86x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;13.278s&lt;/td&gt;
&lt;td&gt;1.00x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is where the real gains are. Given that the official 2048 recipe sits at 30+ seconds, &lt;code&gt;1536 + 28 steps&lt;/code&gt; should land around 10 seconds — a completely different feel.&lt;/p&gt;

&lt;p&gt;1024 is fast but noticeably lower in information density. Good for directional checks, but probably too rough for regular output use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Presets in the Studio UI
&lt;/h2&gt;

&lt;p&gt;Based on these results, here's what I settled on in the Studio UI:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;th&gt;Guidance&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Quick preview&lt;/td&gt;
&lt;td&gt;1024x1024&lt;/td&gt;
&lt;td&gt;20–24&lt;/td&gt;
&lt;td&gt;1.0–3.0&lt;/td&gt;
&lt;td&gt;Composition / mood check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;1536x1536&lt;/td&gt;
&lt;td&gt;28–36&lt;/td&gt;
&lt;td&gt;3.0–5.0&lt;/td&gt;
&lt;td&gt;Day-to-day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High quality&lt;/td&gt;
&lt;td&gt;2048x2048&lt;/td&gt;
&lt;td&gt;36–50&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;Re-render of selected candidates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Official bucket&lt;/td&gt;
&lt;td&gt;bucket&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;Match upstream recipe exactly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Steps and resolution are independently selectable in the UI. The workflow is: explore with &lt;code&gt;1024 / 24 steps&lt;/code&gt;, then re-render promising results at &lt;code&gt;1536&lt;/code&gt; or &lt;code&gt;2048&lt;/code&gt; with the same prompt and seed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cases Where Quality Degradation Shows Up
&lt;/h2&gt;

&lt;p&gt;With this portrait, the difference between 28 steps and 50 steps was "visible under comparison" — not obvious at a glance. But part of that is the subject matter.&lt;/p&gt;

&lt;p&gt;Low steps and low resolution tend to hurt most with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Older faces, wrinkles, skin texture&lt;/li&gt;
&lt;li&gt;Hands, fingers, jewelry&lt;/li&gt;
&lt;li&gt;Fabric with fine patterns&lt;/li&gt;
&lt;li&gt;Text in signs or books&lt;/li&gt;
&lt;li&gt;Multiple people&lt;/li&gt;
&lt;li&gt;Busy indoor scenes with lots of background objects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conversely, young faces, simple backgrounds, and soft lighting are forgiving — low-cost settings hold up well.&lt;/p&gt;

&lt;p&gt;That's why a single fixed preset isn't the right design. &lt;strong&gt;Giving users control over exploration cost depending on what they're generating&lt;/strong&gt; is the better approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reproduction Commands
&lt;/h2&gt;

&lt;p&gt;The benchmark script lives at &lt;code&gt;image_server/bench_quality_speed.py&lt;/code&gt;. It calls the HTTP API after the model is already resident, so model load time is excluded from all measurements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./image_server/start_image_server.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Steps comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 image_server/bench_quality_speed.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--seed&lt;/span&gt; 42 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--variant&lt;/span&gt; s20_g5,20,5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--variant&lt;/span&gt; s28_g5,28,5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--variant&lt;/span&gt; s36_g5,36,5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--variant&lt;/span&gt; s50_g5,50,5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resolution comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 image_server/bench_quality_speed.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--seed&lt;/span&gt; 42 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--variant&lt;/span&gt; s20_g5,20,5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--size&lt;/span&gt; 1024x1024 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--size&lt;/span&gt; 1536x1536 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--size&lt;/span&gt; 2048x2048 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-snap-resolution&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;HiDream-O1-Image Full is excellent at its official settings but too slow for iterative use. When you break down steps, CFG, and resolution separately, the speedups are clean and predictable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Steps scale almost linearly with time&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;guidance=1.0&lt;/code&gt; drops CFG and gives a large speed boost&lt;/li&gt;
&lt;li&gt;The official pipeline snaps resolutions to fixed buckets&lt;/li&gt;
&lt;li&gt;True low-resolution generation at 1024/1536 is dramatically faster&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1536 / 28–36 steps&lt;/code&gt; is the practical sweet spot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For image generation UIs, &lt;strong&gt;low-cost exploration → high-quality final render&lt;/strong&gt; is a much better flow than starting at maximum quality every time. This experiment gave me a solid basis for building exactly that.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>python</category>
    </item>
  </channel>
</rss>
