<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: shinji shimizu</title>
    <description>The latest articles on DEV Community by shinji shimizu (@shinji_shimizu_bb51276a5e).</description>
    <link>https://dev.to/shinji_shimizu_bb51276a5e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3945785%2F5b60b30f-9e75-488a-8dcc-da3545ceca41.png</url>
      <title>DEV Community: shinji shimizu</title>
      <link>https://dev.to/shinji_shimizu_bb51276a5e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shinji_shimizu_bb51276a5e"/>
    <language>en</language>
    <item>
      <title>Implementing Claude Code's Memory Model as a Dreaming Layer on 58 Articles</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Wed, 10 Jun 2026 20:03:53 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/implementing-claude-codes-memory-model-as-a-dreaming-layer-on-58-articles-3g2l</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/implementing-claude-codes-memory-model-as-a-dreaming-layer-on-58-articles-3g2l</guid>
      <description>&lt;p&gt;I built a pipeline in a single session that consolidates the 58 tech-blog articles of my service &lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;Kotonia&lt;/a&gt; (ja/en/zh) into a semantic index, then uses that index to detect duplicates for new article mining. &lt;strong&gt;Raw articles → semantic index → TF-IDF dedup → chunked draft generation&lt;/strong&gt; — full path running on local Gemma 4 26B driven by Codex CLI. Design and implementation notes follow.&lt;/p&gt;

&lt;p&gt;The motivation and "how solo developer accumulated assets compound" framing is in the companion piece: &lt;a href="https://kotonia.ai/en/articles/solo-dev-knowledge-compound/" rel="noopener noreferrer"&gt;The Day a Solo Developer's Accumulated Assets Finally Started to Compound&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This piece keeps the technical notes.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Problem — When Title-Only Dedup Broke
&lt;/h2&gt;

&lt;p&gt;Mining v1 produced a draft and I (the user) noticed "this overlaps with an existing article." The overlap target was &lt;code&gt;voice-first-local-llm&lt;/code&gt; (importance=9 flagship).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New draft thesis: "tokens per chunk is a hidden voice-chat latency driver"&lt;/li&gt;
&lt;li&gt;Existing article §3.3: "★ Streaming granularity — the structural difference that decides voice experience"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same numbers (Local Gemma 1.0 tok/chunk, Haiku 10-16, Gemini 8-24). A perfect duplicate.&lt;/p&gt;

&lt;p&gt;The mining agent had called &lt;code&gt;art-done-list&lt;/code&gt; (title + description) for the dedup check. But the existing article's title is "Cutting short-form LLM latency from 600ms to 22ms," with TTFB as the headline sales pitch; §3.3 streaming granularity is buried in an H2 subsection. &lt;strong&gt;At title level, nothing overlapped, so the check came back clean&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's the starting point for this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Design — Three Layers: episodic ↔ semantic ↔ procedural
&lt;/h2&gt;

&lt;p&gt;Breaking down why &lt;a href="https://claude.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;'s memory system works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Entries are small (1-3KB, one topic each) → subtopics don't get buried&lt;/li&gt;
&lt;li&gt;Hooks are retrieval-tuned and curated → search terms re-appear in the hook&lt;/li&gt;
&lt;li&gt;A smart model writes hooks semi-autonomously → past me distills for future me&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Articles have the opposite shape. Each 5-15KB, important subtopics buried in subsection bodies, descriptions are SEO summaries rather than retrieval-tuned, too heavy for an agent.&lt;/p&gt;

&lt;p&gt;I bridged them with an intermediate layer named the &lt;strong&gt;Dreaming layer&lt;/strong&gt;. Literally the biological "memory consolidation during sleep — hippocampus to cortex" metaphor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;episodic (raw articles + memory files)
    ↓ Dreaming agent (periodic digestion)
semantic (concepts_covered_ja[] / importance / data_points / sections)
    ↓ agent reverse-lookup (art-concepts-find / TF-IDF cosine)
procedural (mining / drafting / publishing)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A semantic entry for an article looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"voice-first-local-llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"locale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ja"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"thesis_ja"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ditching API, building voice-first with self-hosted local 26B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"importance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"factors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pv_count_30d"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"avg_scroll"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;67.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"avg_dwell_sec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;170&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"has_bench_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"novelty_high"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"concepts_covered_ja"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"TTFB (time-to-first-byte): local vs API"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Streaming granularity (tokens per chunk)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Gemma 4 26B model selection rationale"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Ditto + LLM co-residency GPU design"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data_points"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TTFB Local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"17-25ms"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Streaming granularity Local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0 tok/chunk"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sections"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3.3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Streaming granularity — the structural difference that decides voice experience"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key point: &lt;code&gt;concepts_covered_ja[]&lt;/code&gt; must be &lt;strong&gt;normalized to Japanese canonical names&lt;/strong&gt;. Translated EN/ZH articles use the same JP concept strings. That single normalization becomes the dedup primitive downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Tools — Thin CLIs the Agent Calls
&lt;/h2&gt;

&lt;p&gt;Codex CLI drives Gemma 4 26B locally. Tool calling via &lt;code&gt;--enable-auto-tool-choice --tool-call-parser gemma4&lt;/code&gt; gives an OpenAI-compatible surface. Each tool is ~50-100 lines of Python (stdlib only), &lt;code&gt;art-&lt;/code&gt; prefix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;tool&lt;/th&gt;
&lt;th&gt;role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-articles-list --needs-dreaming&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DB ∪ FS articles + dreaming state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-pv-count --slug X&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;analytics_events → PV / scroll / dwell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-source-pull &amp;lt;slug&amp;gt; [--section N]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;pull just one H2/H3 section of an article&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-dream-write&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;upsert a semantic entry into articles_index.jsonl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-concepts-find &amp;lt;pattern&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;concept → article reverse-lookup (the mining dedup primitive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-ideas-check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;evaluate a candidate idea via TF-IDF (the core of this article)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-ideas-add&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;push an idea to the pool (calls art-ideas-check internally)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-draft-append&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;append a chunk of draft body to a buffer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;art-draft-commit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;finalize buffer → &lt;code&gt;articles/_drafts/&amp;lt;slug&amp;gt;.md&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Dreaming agent semantically encodes one article at a time using these. Importance scoring uses this rubric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+2: PV &amp;gt;= 100 (sigmoid log-scale)
+1: avg_scroll &amp;gt;= 0.7 AND avg_dwell_sec &amp;gt;= 60
+2: bench numbers / failure root cause / named decision
+2: novel concept not yet in index
+1: evergreen value (not time-sensitive)
-2: redundant with an already-indexed flagship
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PV comes from a homegrown &lt;code&gt;analytics_events&lt;/code&gt; table (cookie-less first-party tracker). The fact that the article platform and analytics co-reside in one DB you can hit directly is a solo-dev win.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. TF-IDF Dedup — Substituting Tool Structure for Agent Self-Discipline
&lt;/h2&gt;

&lt;p&gt;At mining v1 the prompt instructed the agent to call &lt;code&gt;art-concepts-find&lt;/code&gt; for dedup. The agent slipped through three duplicates anyway (details: &lt;a href="https://kotonia.ai/en/articles/agent-discipline-fails-tool-blocking/" rel="noopener noreferrer"&gt;Don't Trust an Agent's Self-Discipline&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The fix: embed a dedup gate directly inside &lt;code&gt;art-ideas-add&lt;/code&gt;. The guts of &lt;code&gt;evaluate_idea()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_idea&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;angle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ideas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_corpus&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# infer the candidate's concepts from the canonical vocab
&lt;/span&gt;    &lt;span class="n"&gt;pseudo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concepts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;_infer_concepts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;angle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="c1"&gt;# IDF (rare concepts weighted more)
&lt;/span&gt;    &lt;span class="n"&gt;idf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_idf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ideas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;new_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pseudo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concepts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;idf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;conflicts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concepts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;idf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;importance_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;conflicts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flagship_concept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...})&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ideas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concepts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;idf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;conflicts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pool_dup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;conflicts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conflicts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;conflicts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three traps along the way in &lt;code&gt;_infer_concepts()&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 1: substring-match false positives&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ASCII term "check" matches inside "checkout"; "PRO" inside "prod_". The Stripe idea was falsely matched into "品質チェック (quality check/retry)" or "Blackwell Max-Q (RTX PRO 6000)" and rejected.&lt;/p&gt;

&lt;p&gt;Fix: ASCII terms require word boundary; JP terms can stay substring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_term_matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_ASCII_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?&amp;lt;![A-Za-z0-9_])&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?![A-Za-z0-9_])&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;  &lt;span class="c1"&gt;# JP substring is fine
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trap 2: generic JP noun noise&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"モデル" "システム" "アーキテクチャ" "サービス" appear in many concept names; they get picked up from arbitrary idea titles. Registered ~30 generic words in &lt;code&gt;_NOISE_TERMS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 3: threshold tuning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Started flagship sim &amp;gt;= 0.30, but a binary vector with 4 concepts and 1 shared concept maxes around cosine 0.25. Even with IDF weighting, 0.27-0.30 was the borderline. Dropped to 0.25 and instead tightened the precision of the substring matcher (the false-positive engine).&lt;/p&gt;

&lt;p&gt;Regression test: 4/4 across the known 4 cases (OpenWeight NSFW / streaming-granularity / CodeFormer / Stripe).&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Small-Model Specific Traps — Codex CLI + 26B Uncensored
&lt;/h2&gt;

&lt;p&gt;Driving local 26B (Gemma 4 26B A4B Uncensored MAX) through Codex CLI, I observed 4 failure modes and their fixes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 4: descriptive prompt → "I will begin by surveying..." then exit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first mining run had the agent summarize "what I'll do next" and exit with zero tool calls. Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**Critical: do not narrate, plan, or describe what you will do. Just call tools.**&lt;/span&gt;
The first action &lt;span class="gs"&gt;**must**&lt;/span&gt; be &lt;span class="sb"&gt;`shell({"command": "art-..."})`&lt;/span&gt; — start there.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Imperative + first-action explicit, and it starts moving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 5: huge tool output triggers a generation loop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;art-commits-recent --since "60 days ago" --include-files&lt;/code&gt; returned ~1300 lines of JSON including bodies; the agent then emitted ~25K tokens of output continuously, never stopping. Fix: &lt;code&gt;art-commits-recent&lt;/code&gt; defaults to subject-only; body via &lt;code&gt;--include-body&lt;/code&gt; opt-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 6: 5KB+ heredoc in tool_call.arguments JSON breaks the escape&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sending &lt;code&gt;art-draft-save &amp;lt;slug&amp;gt; &amp;lt;&amp;lt;'EOF' ... 5KB body ... EOF&lt;/code&gt; as a single shell tool_call reliably breaks 26B's string escaping inside the arguments JSON (&lt;code&gt;Unterminated string at column 5083&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Fix: split into chunked append + commit. ~200-800 chars per chunk, 4-8 appends, final commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;art-draft-append my-slug &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;KOTONIA_EOF&lt;/span&gt;&lt;span class="sh"&gt;'
---
title: "..."
---
&lt;/span&gt;&lt;span class="no"&gt;KOTONIA_EOF

&lt;/span&gt;art-draft-append my-slug &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;KOTONIA_EOF&lt;/span&gt;&lt;span class="sh"&gt;'
## 1. First section
...
&lt;/span&gt;&lt;span class="no"&gt;KOTONIA_EOF

&lt;/span&gt;&lt;span class="c"&gt;# ...repeat per section...&lt;/span&gt;

art-draft-commit my-slug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each tool_call's arguments JSON stays small, escape break vanishes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 7: Codex exec self-terminates after ~4 articles&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There seems to be an implicit constraint where one &lt;code&gt;codex exec&lt;/code&gt; invocation finishes with a summary message after ~25K tokens / ~4 articles. Codex's Goals feature (&lt;code&gt;thread_goals.objective&lt;/code&gt;) could prevent that, but you can't set it via &lt;code&gt;exec&lt;/code&gt; (only the interactive TUI as of v0.133).&lt;/p&gt;

&lt;p&gt;Fix: wrap &lt;code&gt;dispatcher.sh&lt;/code&gt; in an external loop. Restart &lt;code&gt;codex exec&lt;/code&gt; until &lt;code&gt;pending == 0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;max_cycles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30
&lt;span class="nv"&gt;cycle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; cycle &amp;lt; max_cycles &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;pending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;art-articles-list &lt;span class="nt"&gt;--needs-dreaming&lt;/span&gt; &lt;span class="nt"&gt;--count-only&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; pending &lt;span class="o"&gt;==&lt;/span&gt; 0 &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then &lt;/span&gt;&lt;span class="nb"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;fi
  &lt;/span&gt;run_codex dream
  &lt;span class="nv"&gt;cycle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;cycle &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gets 58 articles digested in 2-3 cycles.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. What Landed
&lt;/h2&gt;

&lt;p&gt;The working pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;58 articles → semantic index, importance bell-shaped (median 5-6), flagship recognition correct (voice-first-local-llm at score 9 across all locales)&lt;/li&gt;
&lt;li&gt;70 memory files mined for unexplored concepts, 4 ideas land in the pool as survivors&lt;/li&gt;
&lt;li&gt;4 drafts generated, ~3.6-4.6KB each, publish-ready after 10-20 minutes of human polish&lt;/li&gt;
&lt;li&gt;TF-IDF dedup gate at the tool layer blocks any agent self-discipline violation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: [github coming soon]&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Generalization
&lt;/h2&gt;

&lt;p&gt;The structure — &lt;strong&gt;raw assets → semantic compression → agent reverse-lookup&lt;/strong&gt; — generalizes beyond articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test generation&lt;/strong&gt;: semantically compress existing tests, mine uncovered branches, draft new tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR descriptions&lt;/strong&gt;: semantically compress the codebase delta, dedupe against unrelated PRs, draft a description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support FAQs&lt;/strong&gt;: semantically compress past support tickets, surface uncovered topics, draft new FAQs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal knowledge base&lt;/strong&gt;: Scrapbox / Notion accumulation → semantic compression → mechanically discover unexplored concepts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common design principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Raw assets are heavy&lt;/strong&gt;. Don't load them directly — insert a consolidation layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The canonical vocabulary is the semantic-layer primitive&lt;/strong&gt;. Without normalization, dedup doesn't work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcement belongs at the tool layer&lt;/strong&gt;. Agent self-discipline is unstable; bake the rule into the structure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Knowing this opened up application to other domains in kotonia (persona generation in character chat, TTS prompt accumulation, etc.).&lt;/p&gt;

&lt;h2&gt;
  
  
  Aside: Development Time
&lt;/h2&gt;

&lt;p&gt;One session (~6h). Dreaming layer design → 5 new tools → Codex prompts → first-time consolidation → TF-IDF dedup → chunked draft → 4 article drafts generated, all in one stretch.&lt;/p&gt;

&lt;p&gt;Local 26B as the "runs on electricity only" agent absorbed the grinding labor; the human only had to make judgment calls and steering corrections. Doing this on frontier APIs would have cost $50-100.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;Kotonia&lt;/a&gt; is a voice-first AI character chat platform. The drafts revived by this pipeline live on the same blog if you're curious.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>agents</category>
      <category>tfidf</category>
      <category>codex</category>
    </item>
    <item>
      <title>Fitting LLM Reply Suggestions Into Every Provider's Prompt Cache — Without Structured Output</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Sun, 31 May 2026 09:51:11 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/fitting-llm-reply-suggestions-into-every-providers-prompt-cache-without-structured-output-18f6</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/fitting-llm-reply-suggestions-into-every-providers-prompt-cache-without-structured-output-18f6</guid>
      <description>&lt;p&gt;I wanted to add &lt;strong&gt;reply suggestions&lt;/strong&gt; to a voice roleplay chat — the classic UX where three "you could say this next" chips appear under each AI response. Sounds simple. But when your chat is built around streaming and prompt caching, every obvious approach turns out to be a bad fit.&lt;/p&gt;

&lt;p&gt;I ended up going with the unglamorous move of &lt;strong&gt;embedding inline markers in the response and stripping them out afterward&lt;/strong&gt;. The path to that decision was interesting enough to write up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c4rd4yn2lk9g0cjdrsu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c4rd4yn2lk9g0cjdrsu.webp" alt="Three reply suggestion chips shown below an AI response (kotonia)" width="799" height="289"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;What I wanted to build: three "you could say this" chips per AI response — no structured output, no stream interruption, no cache invalidation.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Two Hard Constraints
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. The conversation is built around prompt caching
&lt;/h3&gt;

&lt;p&gt;Keeping token costs down in an LLM chat comes down to caching, and every provider does it differently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: explicit cache. A cache object is created per session, containing the persona prompt and conversation history. Each turn sends only the diff. When history grows too long, the cache is rebuilt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek / Cerebras (OpenAI-compatible)&lt;/strong&gt;: send &lt;code&gt;system + full history + user&lt;/code&gt; every time and ride the server's &lt;strong&gt;implicit prefix cache&lt;/strong&gt; (measurable via &lt;code&gt;prompt_cache_hit_tokens&lt;/code&gt; etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grok (xAI)&lt;/strong&gt;: the &lt;code&gt;x-grok-conv-id&lt;/code&gt; header ties requests to the same conversation, keeping them pinned to the cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: the conversation prefix (persona + history) should be reused as much as possible. Anything that disturbs that prefix hurts both cost and latency.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Structured output is off the table
&lt;/h3&gt;

&lt;p&gt;The natural-looking approach to fetching three suggestions would be something like &lt;code&gt;{"reply": "...", "suggestions": ["...", "...", "..."]}&lt;/code&gt;. I ruled it out for two reasons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini flash-lite class models show noticeable latency increases with structured output.&lt;/strong&gt; The lighter the model, the heavier schema compliance costs are relative to the task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It directly conflicts with sentence-level TTS streaming.&lt;/strong&gt; This chat is designed to start speaking from the very first sentence. While the model is outputting JSON, there's no way to pull out that first sentence. Structured output means waiting for full generation before any audio plays.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Three Approaches I Considered
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A. Separate API call to generate suggestions&lt;/strong&gt;&lt;br&gt;
Fire a second request after the main turn. The prefix would likely hit the cache again, but there's an extra round-trip, and maintaining cache consistency — across Grok's conv-id, implicit prefix caches, etc. — becomes your problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B. Structured output, bundled in the main turn&lt;/strong&gt;&lt;br&gt;
No second request, so cache consistency is trivial. But ruled out for the reasons above (latency + streaming conflict).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C. Inline markers, bundled in the main turn&lt;/strong&gt; &lt;em&gt;(chosen)&lt;/em&gt;&lt;br&gt;
Ask the model to append &lt;code&gt;{{SUGGEST: option1 | option2 | option3}}&lt;/code&gt; at the very end of its response, and extract it server-side.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why C Works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It's the same request.&lt;/strong&gt; There is no "second request." Whether it's an explicit cache or an implicit prefix cache, &lt;strong&gt;that turn is already on the cache&lt;/strong&gt; — alignment is automatic. No per-provider logic needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No structured output.&lt;/strong&gt; Plain text generation all the way through.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero perceived latency increase.&lt;/strong&gt; TTS is already playing from the first sentence while &lt;code&gt;{{SUGGEST}}&lt;/code&gt; trickles out at the end. Generation finishes while the user is listening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reuses the existing marker infrastructure.&lt;/strong&gt; This chat already has inline markers like &lt;code&gt;{{SHOW: label}}&lt;/code&gt;, &lt;code&gt;{{POSE: ...}}&lt;/code&gt;, and &lt;code&gt;{{IMAGE: ...}}&lt;/code&gt;, plus a pipeline for extracting and stripping them. Suggestions are just one more entry in that system. Design stays consistent.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The Key Implementation Detail: Strip From Both Places
&lt;/h2&gt;

&lt;p&gt;The important part: once extracted, the marker must be &lt;strong&gt;removed from both the TTS/display text and the DB history&lt;/strong&gt;. Suggestions are ephemeral UI scaffolding, not part of the character's actual speech — leaving them in history would pollute context for future turns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Extract {{SUGGEST: a | b | c}} and remove it entirely from the body&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;RE_SUGGEST&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Lazy&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nn"&gt;Lazy&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="nn"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;r"(?is)\{\{\s*SUGGEST\s*:\s*([\s\S]*?)\}\}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;extract_suggest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;RE_SUGGEST&lt;/span&gt;&lt;span class="nf"&gt;.captures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;suggestions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="nf"&gt;.split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'|'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="nf"&gt;.trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="nf"&gt;.is_empty&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="nf"&gt;.take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RE_SUGGEST&lt;/span&gt;&lt;span class="nf"&gt;.replace_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suggestions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the existing &lt;strong&gt;"store annotated / display clean"&lt;/strong&gt; separation pays off. In this chat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ai_text&lt;/code&gt; returned to the client (display + TTS) is fully stripped of all markers.&lt;/li&gt;
&lt;li&gt;What gets saved to DB &lt;strong&gt;re-attaches&lt;/strong&gt; &lt;code&gt;{{SHOW}}&lt;/code&gt;/&lt;code&gt;{{POSE}}&lt;/code&gt; markers (so the model keeps seeing its own canonical format in history and continues using it correctly).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;{{SUGGEST}}&lt;/code&gt; is different from &lt;code&gt;{{SHOW}}&lt;/code&gt;/&lt;code&gt;{{POSE}}&lt;/code&gt; — &lt;strong&gt;it doesn't go back into the DB at all&lt;/strong&gt;. It's ephemeral. The design of choosing per-marker whether to persist or discard let suggestions slot in cleanly without touching anything else.&lt;/p&gt;

&lt;p&gt;On the prompt side, it's just one extra block gated by a feature flag in the persona config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;At the very end of your response, add exactly three short replies the user
might say next, in this format:
{{SUGGEST: option1 | option2 | option3}}
- Always place it last (after any {{SHOW}}/{{POSE}} markers)
- Write each option in first person, casual, short
- Vary the direction: one enthusiastic, one deflecting, one asking a question back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A Note on Implicit Prefix Cache Alignment
&lt;/h2&gt;

&lt;p&gt;Implicit prefix caches hit when the token sequence at the start of a request matches a previously seen prefix. The marker approach simply &lt;strong&gt;generates suggestions as part of the current turn's response&lt;/strong&gt; — the next turn's input prefix (system + history) is identical to what it would be in a plain conversation. The prefix keeps hitting the cache normally. The suggestions never touch the prefix at all. That's a quiet but important property.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;When adding secondary structured data to a streaming + caching chat, &lt;strong&gt;consider inline markers + extraction before reaching for structured output&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Bundling everything into the same request makes cross-provider cache alignment a non-issue by construction.&lt;/li&gt;
&lt;li&gt;If you already have a marker extraction pipeline, the marginal cost is nearly zero. Design it so you can choose per-marker whether to persist or discard — that flexibility makes ephemeral UI additions painless to add later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The costs: output tokens increase by a few dozen, and occasionally the model mangles the marker format (same risk level as &lt;code&gt;{{SHOW}}&lt;/code&gt;/&lt;code&gt;{{POSE}}&lt;/code&gt;). Both are acceptable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This chat is part of &lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;kotonia&lt;/a&gt;, a voice roleplay product running multilingual TTS × lip-sync avatars on a local GPU.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>rust</category>
      <category>webdev</category>
    </item>
    <item>
      <title>One of the First Public HiDream-O1-Image LoRAs — and How to Train Your Own</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Tue, 26 May 2026 23:45:33 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own-5hd1</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own-5hd1</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;HiDream-O1-Image&lt;/a&gt; is one of the strongest open-weight text-to-image models out right now (it debuted around &lt;strong&gt;#8 in the Artificial Analysis T2I Arena&lt;/strong&gt;). But it shipped &lt;strong&gt;inference-only&lt;/strong&gt;, and because its architecture is radically different from SDXL/Flux — no VAE, no separate text encoder, everything is one unified transformer — the usual LoRA trainers can't touch it.&lt;/p&gt;

&lt;p&gt;This post is &lt;strong&gt;one of the first publicly documented LoRA training runs and general-purpose visual-enhancement LoRAs for HiDream-O1-Image&lt;/strong&gt;. I'll show why the standard trainers (kohya, ai-toolkit, SimpleTuner) don't fit, how I reverse-engineered a working training loop from the &lt;em&gt;inference&lt;/em&gt; code alone, and the ~150-line trainer that produces a clean aesthetic LoRA. Plus the gotchas that cost me a night.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this LoRA is:&lt;/strong&gt; a general-purpose anime / semi-real visual enhancement LoRA — it improves rendering quality, lighting, and stylization across diverse subjects with a trigger phrase. It's not a character LoRA, not a single-style LoRA, and not a model-distillation artifact.&lt;/p&gt;

&lt;p&gt;The short version of the recipe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model's output head predicts the &lt;strong&gt;clean image &lt;code&gt;x0&lt;/code&gt;&lt;/strong&gt; (in patch space, &lt;code&gt;[-1,1]&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Build the noised input as &lt;code&gt;z_t = (1 - σ)·x0 + σ·(8.0·ε)&lt;/code&gt; and feed the model timestep &lt;code&gt;1 - σ&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Loss is just &lt;strong&gt;&lt;code&gt;MSE(x_pred, x0)&lt;/code&gt; on the image-token positions&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;LoRA attaches via plain &lt;strong&gt;PEFT&lt;/strong&gt; to the language-model decoder linears, because the backbone is a stock HF &lt;code&gt;Qwen3-VL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prior art (what existed before this)
&lt;/h2&gt;

&lt;p&gt;To set expectations honestly: I'm not claiming "world's first LoRA file for O1."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kijai&lt;/strong&gt; published a &lt;a href="https://github.com/Kijai/hidream-O1-image_comfy" rel="noopener noreferrer"&gt;ComfyUI workflow for HiDream-O1&lt;/a&gt; that includes a &lt;strong&gt;distill LoRA&lt;/strong&gt; — it extracts the Dev-2604 model's behavior as a LoRA applied to the Base model. That's a model-compression technique, not a visual-style LoRA trained on external images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ostris&lt;/strong&gt; (author of AI Toolkit) has &lt;a href="https://github.com/ostris/ai-toolkit" rel="noopener noreferrer"&gt;run initial LoRA training tests on HiDream-O1&lt;/a&gt; and ai-toolkit lists O1 as a supported model. No resulting LoRA has been publicly released as of this writing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TechnoEdge&lt;/strong&gt; (Japanese tech media) reported using a face LoRA with HiDream-O1 Dev, though it's unclear whether that LoRA was purpose-trained for O1 or adapted from elsewhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I &lt;em&gt;didn't&lt;/em&gt; find: &lt;strong&gt;a publicly released, general-purpose anime / semi-real visual-enhancement LoRA trained specifically for HiDream-O1-Image.&lt;/strong&gt; If you know of one, I'd genuinely love to see it — the more the merrier. But as of publication, this appears to be one of the first, and the first with before/after documentation and a full open training recipe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why no trainer exists: the architecture
&lt;/h2&gt;

&lt;p&gt;Most LoRA trainers assume the SDXL/Flux shape: a &lt;strong&gt;UNet/DiT&lt;/strong&gt; denoiser + a &lt;strong&gt;VAE&lt;/strong&gt; + one or two &lt;strong&gt;text encoders&lt;/strong&gt;, all separate modules wired together by &lt;code&gt;diffusers&lt;/code&gt;. You patch LoRA into the UNet/DiT attention, freeze the rest, and the trainer knows how to encode images to latents and text to embeddings.&lt;/p&gt;

&lt;p&gt;HiDream-O1-Image is a &lt;strong&gt;Pixel-level Unified Transformer (UiT)&lt;/strong&gt;. From its own description:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a natively unified image generative foundation model built on a Pixel-level Unified Transformer &lt;strong&gt;without external VAEs or disjoint text encoders&lt;/strong&gt;, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Concretely (reading &lt;code&gt;models/qwen3_vl_transformers.py&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The backbone is a &lt;strong&gt;&lt;code&gt;Qwen3VLForConditionalGeneration&lt;/code&gt;&lt;/strong&gt; — a stock Hugging Face Qwen3-VL multimodal transformer.&lt;/li&gt;
&lt;li&gt;There is &lt;strong&gt;no VAE&lt;/strong&gt;. Images are &lt;strong&gt;patchified directly&lt;/strong&gt;: &lt;code&gt;PATCH_SIZE = 32&lt;/code&gt;, so an &lt;code&gt;H×W&lt;/code&gt; image becomes &lt;code&gt;(H/32)·(W/32)&lt;/code&gt; tokens, each a &lt;code&gt;3·32·32 = 3072&lt;/code&gt;-dim vector of raw pixels.&lt;/li&gt;
&lt;li&gt;A small &lt;code&gt;x_embedder&lt;/code&gt; projects the noised patch tokens into the hidden space; a &lt;code&gt;final_layer2&lt;/code&gt; head projects hidden states back to patch space; a &lt;code&gt;t_embedder&lt;/code&gt; injects the timestep at a dedicated &lt;code&gt;&amp;lt;|tms_token|&amp;gt;&lt;/code&gt; position.&lt;/li&gt;
&lt;li&gt;It's trained with &lt;strong&gt;flow matching&lt;/strong&gt; (&lt;code&gt;fm_solvers_unipc.py&lt;/code&gt;), and image tokens get &lt;strong&gt;full (bidirectional) attention&lt;/strong&gt; while text tokens stay causal (this is what &lt;code&gt;token_types&lt;/code&gt; controls).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So none of kohya/ai-toolkit/SimpleTuner can touch it — there's no UNet, no VAE, no separate text encoder for them to hook. That's exactly &lt;em&gt;why&lt;/em&gt; there are no articles: it's a new architecture, released inference-only.&lt;/p&gt;

&lt;p&gt;The good news: because the backbone is a &lt;strong&gt;plain &lt;code&gt;transformers&lt;/code&gt; model&lt;/strong&gt;, the LoRA &lt;em&gt;adapter&lt;/em&gt; mechanics are trivial — PEFT injects into the &lt;code&gt;nn.Linear&lt;/code&gt;s natively. The hard part is the &lt;strong&gt;training loop&lt;/strong&gt;, which the repo doesn't ship. So let's derive it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reverse-engineering the training forward from inference
&lt;/h2&gt;

&lt;p&gt;The inference loop (&lt;code&gt;models/pipeline.py:generate_image&lt;/code&gt;) tells you everything. Per denoising step it does roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;step_t&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;                       &lt;span class="c1"&gt;# noise level, in (0, 1]
&lt;/span&gt;&lt;span class="n"&gt;t_pixeldit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;                       &lt;span class="c1"&gt;# what the model receives as "timestep"
&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;vinputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;t_pixeldit&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;                        &lt;span class="c1"&gt;# ... and -v is fed to the FM scheduler
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two facts fall out of this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;x_pred&lt;/code&gt; is the model's prediction of the clean image &lt;code&gt;x0&lt;/code&gt;.&lt;/strong&gt; Work the algebra backwards: if &lt;code&gt;z_t = (1-σ)·x0 + σ·ε&lt;/code&gt; then &lt;code&gt;(x_pred - z_t)/σ = x0 - ε = -(ε - x0)&lt;/code&gt;, and &lt;code&gt;ε - x0&lt;/code&gt; is exactly the rectified-flow velocity the &lt;code&gt;FlowMatch&lt;/code&gt; scheduler expects. Consistent ⇒ the head is &lt;strong&gt;&lt;code&gt;x0&lt;/code&gt;-parameterized&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The noise scale isn't 1.&lt;/strong&gt; Inference initializes &lt;code&gt;z = NOISE_SCALE · randn&lt;/code&gt; with &lt;code&gt;NOISE_SCALE = 8.0&lt;/code&gt;, while &lt;code&gt;x0&lt;/code&gt; lives in &lt;code&gt;[-1, 1]&lt;/code&gt;. So the interpolation the model was trained on is &lt;code&gt;z_t = (1-σ)·x0 + σ·(8.0·ε)&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That gives the entire training step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T_EPS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;eps&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;z_t&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOISE_SCALE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# NOISE_SCALE = 8.0
&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;out&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vinputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;z_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;timestep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vinput_mask&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;      &lt;span class="c1"&gt;# image-token positions only
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mse_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;x0&lt;/code&gt; is just the image, normalized to &lt;code&gt;[-1,1]&lt;/code&gt; and patchified with the same &lt;code&gt;einops&lt;/code&gt; rearrange the pipeline uses for reference images. The token layout (prompt → &lt;code&gt;&amp;lt;|boi_token|&amp;gt;&lt;/code&gt; → &lt;code&gt;&amp;lt;|tms_token|&amp;gt;&lt;/code&gt; → image tokens) is built by reusing the pipeline's own &lt;code&gt;build_t2i_text_sample&lt;/code&gt;, so positions and &lt;code&gt;token_types&lt;/code&gt; line up with what the forward expects.&lt;/p&gt;

&lt;p&gt;Uniform &lt;code&gt;σ&lt;/code&gt; sampling and unweighted &lt;code&gt;x0&lt;/code&gt;-MSE are enough to learn cleanly — no fancy loss weighting needed for a first cut.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attaching the LoRA
&lt;/h2&gt;

&lt;p&gt;Because the denoiser is &lt;code&gt;model.model.language_model&lt;/code&gt; (a stock Qwen3-VL decoder), PEFT targets its attention/MLP linears and freezes everything else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;named_modules&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
           &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
           &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's &lt;strong&gt;252 linears, ~44M trainable params&lt;/strong&gt; at rank 16. The vision encoder, &lt;code&gt;x_embedder&lt;/code&gt;, &lt;code&gt;t_embedder&lt;/code&gt;, and &lt;code&gt;final_layer2&lt;/code&gt; stay frozen. One subtlety: PEFT swaps the &lt;code&gt;Linear&lt;/code&gt;s &lt;strong&gt;in place&lt;/strong&gt;, so a handle grabbed before &lt;code&gt;get_peft_model&lt;/code&gt; (&lt;code&gt;gen = model.model&lt;/code&gt;) still sees the LoRA layers — convenient for calling the generation forward directly and for &lt;code&gt;model.disable_adapter()&lt;/code&gt; A/B renders.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data, captions, and resolution
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolution is not fixed at 2048.&lt;/strong&gt; The &lt;code&gt;find_closest_resolution()&lt;/code&gt; snapping you see in the pipeline is a &lt;em&gt;quality default&lt;/em&gt; (the model is tuned for high res), not an architectural limit — &lt;code&gt;height&lt;/code&gt;/&lt;code&gt;width&lt;/code&gt; are free as long as they're multiples of 32. Since image tokens scale as &lt;code&gt;(H/32)·(W/32)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;resolution&lt;/th&gt;
&lt;th&gt;image tokens&lt;/th&gt;
&lt;th&gt;relative attention cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2048²&lt;/td&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024²&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;~1/16&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So I train at &lt;strong&gt;1024&lt;/strong&gt;: ~4× shorter sequences, far less VRAM and time per step. The workflow becomes "iterate cheaply at 1024, upscale the keepers." Aspect ratios are left native (each image snapped to the nearest ×32, batch size 1) — no bucketing needed, and mixed portrait/landscape actually helps a style LoRA generalize.&lt;/p&gt;

&lt;p&gt;For captions, HiDream wants &lt;strong&gt;natural-language prose&lt;/strong&gt;, not danbooru tags (different text encoder lineage). I captioned ~190 images with a local multimodal VLM into one-to-three-sentence descriptions, each prefixed with a trigger phrase so the aesthetic stays &lt;strong&gt;prompt-controllable&lt;/strong&gt; (invoke it when you want it, leave it off otherwise).&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Same prompt, same seed, adapter off vs on. All samples use the trigger phrase &lt;code&gt;kotonia style&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdwieczwm6j9vn831prq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdwieczwm6j9vn831prq.png" alt="Base vs LoRA — schoolgirl, cherry blossoms" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl28t2ne9xsnn70o15734.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl28t2ne9xsnn70o15734.png" alt="Base vs LoRA — kimono, autumn" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fes5bvlx7hqpauccnkskv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fes5bvlx7hqpauccnkskv.png" alt="Base vs LoRA — semi-realistic portrait" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo274jg7k6hfavn11xgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo274jg7k6hfavn11xgq.png" alt="Base vs LoRA — fantasy knight, storm" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The base model is competent but soft and a bit generic; the LoRA pushes rendering toward a polished modern-anime look — directional lighting, glossier hair and skin, more confident stylization — and it holds across very different subjects (schoolgirl slice-of-life → epic fantasy), so it learned an &lt;em&gt;aesthetic&lt;/em&gt; rather than memorizing images.&lt;/p&gt;
&lt;h3&gt;
  
  
  Training progression (500 → 2500 steps)
&lt;/h3&gt;

&lt;p&gt;Same prompt, same seed, rank 16, ~190 images:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf20j8de8gvnb5g5dih3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf20j8de8gvnb5g5dih3.png" alt="Progression: 500 → 1500 → 2500 steps" width="799" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It keeps &lt;strong&gt;refining&lt;/strong&gt; without melting or obvious overfitting even at 2500 steps — the sweet spot is further out than I expected for a set this small. (Loss drifts ~0.07 → 0.052.)&lt;/p&gt;
&lt;h3&gt;
  
  
  NSFW controllability
&lt;/h3&gt;

&lt;p&gt;NSFW content controllability (prompt-gating) was also tested as part of this LoRA — the model produces NSFW only when explicitly prompted, and the LoRA's contribution is primarily visual quality rather than "uncensoring." For the full story including training data composition, motivation, and NSFW samples, see the &lt;a href="https://kotonia.ai/articles/hidream-o1-lora-why" rel="noopener noreferrer"&gt;companion article on kotonia.ai&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Reproduce it
&lt;/h2&gt;

&lt;p&gt;The whole trainer is ~150 lines. Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv pip &lt;span class="nb"&gt;install &lt;/span&gt;peft
&lt;span class="nv"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 python train_lora.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data_dir&lt;/span&gt; /path/to/images &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out_dir&lt;/span&gt; outputs/lora_run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resolution&lt;/span&gt; 1024 &lt;span class="nt"&gt;--steps&lt;/span&gt; 2500 &lt;span class="nt"&gt;--rank&lt;/span&gt; 16 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sample_every&lt;/span&gt; 500 &lt;span class="nt"&gt;--sample_prompt&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;trigger&amp;gt;, ..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--sample_every&lt;/code&gt; renders an adapter on/off pair so you can watch the LoRA bite. Inference loads the base model, applies the adapter with &lt;code&gt;PeftModel.from_pretrained&lt;/code&gt;, and generates — &lt;code&gt;disable_adapter()&lt;/code&gt; gives you the baseline for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gotchas that cost me time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Host-RAM OOM on load.&lt;/strong&gt; &lt;code&gt;from_pretrained(...).to(device)&lt;/code&gt; materializes the full 8B model in CPU RAM before moving it to GPU; on a 60 GB host alongside other services this got OOM-killed mid-load. &lt;code&gt;low_cpu_mem_usage=True&lt;/code&gt; streams the shards and fixes it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 2048 "limit" is a default.&lt;/strong&gt; Pass your own &lt;code&gt;height&lt;/code&gt;/&lt;code&gt;width&lt;/code&gt; (multiples of 32) and bypass the bucket snapping entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detach long runs.&lt;/strong&gt; Launch training under &lt;code&gt;setsid&lt;/code&gt;/&lt;code&gt;tmux&lt;/code&gt;/systemd — if it's a child of your editor's terminal, an editor crash takes the run (and any GPU services in sibling terminals) down with it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;x0&lt;/code&gt;-param, not v-param.&lt;/strong&gt; Train against &lt;code&gt;x0&lt;/code&gt; directly; if you assume velocity prediction the loss won't match the head and the LoRA won't converge to the right manifold.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Companion article (the story behind this LoRA): &lt;a href="https://kotonia.ai/en/articles/hidream-o1-lora-why" rel="noopener noreferrer"&gt;Why I Trained a HiDream-O1 LoRA&lt;/a&gt; — on kotonia.ai.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The LoRA is available on &lt;a href="https://kotonia.ai/studio" rel="noopener noreferrer"&gt;kotonia.ai/studio&lt;/a&gt; (my own creative platform where I serve the model alongside the LoRA, free to use). The full trainer code, captioning pipeline, and inference scripts are in the &lt;a href="https://github.com/zhener562/hage" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; under &lt;code&gt;HiDream-O1-Image/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you train something cool with this recipe — a character LoRA, a style LoRA, an NSFW-enhancing LoRA — I'd love to see it. The more community LoRAs exist for O1, the better for everyone.&lt;/p&gt;

</description>
      <category>lora</category>
      <category>diffusion</category>
      <category>hidream</category>
      <category>peft</category>
    </item>
    <item>
      <title>My high-res image-to-video kept OOMing — turns out I was decoding outside no_grad</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Tue, 26 May 2026 06:29:28 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/my-high-res-image-to-video-kept-ooming-turns-out-i-was-decoding-outside-nograd-288o</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/my-high-res-image-to-video-kept-ooming-turns-out-i-was-decoding-outside-nograd-288o</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I run &lt;a href="https://github.com/Lightricks/LTX-2" rel="noopener noreferrer"&gt;LTX-2.3&lt;/a&gt; image-to-video (I2V) locally on a 96 GB GPU. At 1024×768 / 97 frames it peaked at &lt;strong&gt;83.5 GiB&lt;/strong&gt; — so close to the ceiling that it OOM'd whenever my image-generation server was co-resident, and 1280×768 OOM'd outright. I assumed I'd hit a hardware wall.&lt;/p&gt;

&lt;p&gt;I hadn't. &lt;strong&gt;54 of those gigabytes were an autograd graph.&lt;/strong&gt; The pipeline returns a &lt;em&gt;lazy&lt;/em&gt; decode iterator; the real VAE decode runs when you encode the output — and in my harness that happened &lt;strong&gt;outside&lt;/strong&gt; the &lt;code&gt;with torch.no_grad():&lt;/code&gt; block, so every conv activation in the decoder was retained for a backward pass that never comes.&lt;/p&gt;

&lt;p&gt;Moving one call inside the &lt;code&gt;no_grad&lt;/code&gt; block:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;before&lt;/th&gt;
&lt;th&gt;after&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;I2V 1024×768/97f peak&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.5 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;29.5 GiB&lt;/strong&gt; (−65%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;151.6 s&lt;/td&gt;
&lt;td&gt;135.2 s (slightly faster)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the peak goes nearly &lt;strong&gt;flat across resolution&lt;/strong&gt; — 2048×1536 (3.1 MP) tops out at &lt;strong&gt;33.6 GiB&lt;/strong&gt;. The "I need a bigger GPU" conclusion was a measurement artifact.&lt;/p&gt;

&lt;p&gt;The lever I tried &lt;em&gt;first&lt;/em&gt; — finer VAE decode tiling — barely moved the number. That dead end is part of the story.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX PRO 6000 Blackwell Max-Q (96 GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch&lt;/strong&gt;: 2.x + CUDA 12.8 (Blackwell sm_120)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: LTX-2.3 22B, two-stage (low-res denoise → 2× latent upscale → high-res refine → VAE decode), transformer loaded as fp8-cast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mode&lt;/strong&gt;: cold-start (components built/freed per request, low idle VRAM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workflow I care about: generate a clean still, then animate it with I2V. Starting from a correct still sidesteps the seed-gacha and anatomy breakdowns you get from pure text-to-video. The only thing standing in the way was VRAM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dead end #1: VAE decode tiling
&lt;/h2&gt;

&lt;p&gt;LTX-2's VAE decode supports tiling (&lt;code&gt;TilingConfig&lt;/code&gt;: spatial tile px / temporal tile frames). The default is a coarse 768 px / 80 frames. The intuition: smaller tiles → smaller decode workspace → lower peak.&lt;/p&gt;

&lt;p&gt;I made tiling configurable and swept it. The most aggressive setting (384 px / 32 frames):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tile 384px/32f (finest): process demanded 77.37 GiB → still OOM with the co-resident model
tile 768px/80f (default): 83.51 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Halving the spatial tile and cutting temporal to a third bought ~6 GiB. &lt;strong&gt;So the peak isn't the decode workspace.&lt;/strong&gt; Tiling was the wrong lever.&lt;/p&gt;

&lt;p&gt;Before retreating to "lower the resolution," I measured &lt;em&gt;where&lt;/em&gt; the peak actually lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Localizing the peak
&lt;/h2&gt;

&lt;p&gt;I dropped an env-gated profiler into the pipeline's &lt;code&gt;__call__&lt;/code&gt;, printing &lt;code&gt;torch.cuda.max_memory_allocated()&lt;/code&gt; at each phase boundary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;peak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max_memory_allocated&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[vram] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: peak_so_far=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;peak&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GiB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# after stage_1 denoise → upsampler → stage_2 denoise → decode
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 1024×768/97f:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[vram] after stage_1 denoise:        peak_so_far=29.17GiB
[vram] after upsampler (2x latent):  peak_so_far=29.17GiB
[vram] after stage_2 denoise:        peak_so_far=29.51GiB
[vram] after decode call (lazy):     peak_so_far=29.51GiB   ← inside the pipeline: 29.5 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline's internal peak is &lt;strong&gt;29.51 GiB&lt;/strong&gt;. But measured around the &lt;em&gt;whole&lt;/em&gt; generate call it was &lt;strong&gt;83.51 GiB&lt;/strong&gt;. The extra 54 GiB appears &lt;strong&gt;after the pipeline returns a value.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Root cause: a lazy iterator escaping no_grad
&lt;/h2&gt;

&lt;p&gt;The return value is a &lt;strong&gt;lazy iterator&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;decoded_video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;video_decoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tiling_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# builds an iterator
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decoded_video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;   &lt;span class="c1"&gt;# nothing decoded yet
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual VAE decode runs when something consumes the iterator — i.e. inside &lt;code&gt;encode_video&lt;/code&gt;. And my harness looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;   &lt;span class="c1"&gt;# returns the iterator (cheap)
&lt;/span&gt;
&lt;span class="nf"&gt;encode_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;     &lt;span class="c1"&gt;# decode runs HERE — outside no_grad
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;encode_video&lt;/code&gt; is &lt;strong&gt;outside&lt;/strong&gt; the &lt;code&gt;no_grad&lt;/code&gt; block. Because decode is lazy, it runs with grad enabled, and PyTorch dutifully keeps every intermediate activation in the VAE decoder around for a backward pass. That's the 54 GiB.&lt;/p&gt;

&lt;p&gt;The fix is to indent one call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="nf"&gt;encode_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;   &lt;span class="c1"&gt;# decode now runs under no_grad
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;before: 83.51 GiB / 151.6 s
after:  29.51 GiB / 135.2 s   ← graph bookkeeping gone, slightly faster too
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Why &lt;code&gt;no_grad&lt;/code&gt; and not &lt;code&gt;inference_mode&lt;/code&gt;? With the streaming weight loader, the VAE decode chokes on inference-mode tensors ("Inference tensors cannot be saved for backward"). &lt;code&gt;no_grad&lt;/code&gt; keeps the latents as normal tensors so decode survives. (Production servers that wrap the &lt;em&gt;entire&lt;/em&gt; generate in &lt;code&gt;inference_mode&lt;/code&gt;/&lt;code&gt;no_grad&lt;/code&gt; never hit this — it was purely a harness scoping slip.)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The payoff: peak is ~flat across resolution
&lt;/h2&gt;

&lt;p&gt;Post-fix sweep, single process, escalating resolution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;resolution (97f)&lt;/th&gt;
&lt;th&gt;peak VRAM&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1024×768&lt;/td&gt;
&lt;td&gt;29.51 GiB&lt;/td&gt;
&lt;td&gt;135 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1280×768 (was a 93 GiB OOM)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29.51 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;165 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1536×1152&lt;/td&gt;
&lt;td&gt;29.99 GiB&lt;/td&gt;
&lt;td&gt;206 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048×1536 (3.1 MP)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;33.55 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;348 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nearly flat. The decode processes tiles &lt;strong&gt;sequentially&lt;/strong&gt;, so higher resolution just means more tiles, not a bigger simultaneous workspace — and once the autograd graph is gone, &lt;em&gt;that's&lt;/em&gt; what dominates. (Which is exactly why tiling alone did nothing earlier: the graph was swamping it.)&lt;/p&gt;

&lt;p&gt;A bonus: a "VRAM leak" I'd blamed on consecutive generations in one process also vanished. It was the same retained graph, accumulating across prompts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Check that &lt;code&gt;with torch.no_grad():&lt;/code&gt; actually covers what you think.&lt;/strong&gt; If the return value is a generator / iterator / lazy tensor, the real compute can happen &lt;em&gt;outside&lt;/em&gt; the block when it's consumed. Scope illusion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't kill a VRAM peak by guessing.&lt;/strong&gt; Print &lt;code&gt;max_memory_allocated()&lt;/code&gt; at phase boundaries; the culprit shows up immediately. My "the decode workspace is heavy" intuition was simply wrong, and without profiling I'd have spent the afternoon lowering resolution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suspect measurement artifacts before concluding "the hardware is too small."&lt;/strong&gt; I almost gave up high-res I2V as impossible on 96 GB. It runs in 30 GiB up to 2048×1536.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This came out of building the video features for a solo voice × video roleplay platform (&lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;kotonia.ai&lt;/a&gt;) — chasing what a single local GPU can do in a niche the big labs deprioritize.&lt;/p&gt;

&lt;p&gt;I wrote up the &lt;em&gt;why&lt;/em&gt; behind that bet — the model A/B that led me to make I2V the mainstay, and the GPU traffic-control that lets me experiment in production without stalling users — separately: &lt;strong&gt;&lt;a href="https://kotonia.ai/en/articles/residual-video-domain-i2v/" rel="noopener noreferrer"&gt;Betting on the video niche the big labs walked away from&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>HiDream Raw Output Failed Tried Dev-2604 VRAM Math Killed It Won with a Prompt Enhancer Instead</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Sat, 23 May 2026 07:18:23 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/hidream-raw-output-failed-tried-dev-2604-vram-math-killed-it-won-with-a-prompt-enhancer-31ip</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/hidream-raw-output-failed-tried-dev-2604-vram-math-killed-it-won-with-a-prompt-enhancer-31ip</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;HiDream-O1-Image 8B Full raw outputs collapse on plain Japanese prompts — both instruction-following and aesthetics fail at once&lt;/li&gt;
&lt;li&gt;Tried to swap to Dev-2604 (preference-tuned, 3.5× faster). It's better aesthetically but &lt;strong&gt;the gap is small in our use case&lt;/strong&gt;, and worse — &lt;strong&gt;the 96GB GPU can't host both models&lt;/strong&gt; alongside the rest of the stack&lt;/li&gt;
&lt;li&gt;Pivoted away from model swap entirely. Stuck with Full + a Gemini Flash Lite prompt enhancer that bolts aesthetic polish on top&lt;/li&gt;
&lt;li&gt;Along the way, found &lt;strong&gt;four non-obvious HiDream pitfalls&lt;/strong&gt; (brand names get rendered as literal text, "cute" triggers childlike body bias, "Wong Kar-wai" hallucinates Korean captions, "idol-class" auto-generates caption text) — all baked into the enhancer's system prompt&lt;/li&gt;
&lt;li&gt;Same plain Japanese prompt now produces a usable photoreal &lt;em&gt;or&lt;/em&gt; anime variant from a single click. No model swap, no extra VRAM, no extra latency.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Act 1: "Raw output is busted"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kotonia.ai/studio" rel="noopener noreferrer"&gt;Kotonia Studio&lt;/a&gt; runs HiDream-O1-Image 8B Full on a local GPU (RTX PRO 6000 Blackwell Max-Q, 96GB) and offers free T2I. Normally outputs are clean. But one day, a plain Japanese prompt — &lt;strong&gt;"a cute woman in a cheongsam, holding a fan, smiling"&lt;/strong&gt; — returned this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdc3bkxmjrroqzt4syzne.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdc3bkxmjrroqzt4syzne.jpg" alt="raw-kimono-failure" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What went wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Asked for a cheongsam, got a kimono.&lt;/strong&gt; Chinese attire drifted to Japanese.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Face isn't pretty.&lt;/strong&gt; We wanted idol-class beauty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition is generic full-body in a Kyoto-style garden.&lt;/strong&gt; We wanted a closer crop showing the fan texture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HiDream-O1 is a top-tier OpenWeight model — careful English prompts produce magazine-grade 2048×2048 outputs. So this isn't "the model is bad." It's &lt;strong&gt;a gap between user input and OpenWeight model expectations.&lt;/strong&gt; Frontier models (Gemini Imagen / DALL-E / Midjourney) absorb natural-language nuance internally. OpenWeight models expect you to throw the prompt straight at them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Either give up on the raw-output UX, or do something about it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Act 2: Maybe Dev-2604 will save us?
&lt;/h2&gt;

&lt;p&gt;Then I noticed &lt;strong&gt;HiDream-O1-Image-Dev-2604&lt;/strong&gt;, a new variant released in May 2026. Debuts at #8 on the Artificial Analysis T2I Arena, runs &lt;strong&gt;3.5× faster&lt;/strong&gt; at 28 steps with no CFG.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Arena ranks models on human aesthetic preference. So Dev should be preference-tuned for "what looks good."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hypothesis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dev returns magazine-grade output even on vague Japanese prompts&lt;/li&gt;
&lt;li&gt;3.5× speed improvement makes &lt;code&gt;/studio&lt;/code&gt; snappier&lt;/li&gt;
&lt;li&gt;Best case: deprecate Full, run Dev only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Phase 1 bench: 5 generic cinematic prompts (Tokyo izakaya, Bangkok night market, anime character, text-in-image, portrait), Full vs Dev-2604:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;mode&lt;/th&gt;
&lt;th&gt;Full (s)&lt;/th&gt;
&lt;th&gt;Dev-2604 (s)&lt;/th&gt;
&lt;th&gt;speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T2I (avg)&lt;/td&gt;
&lt;td&gt;33.1&lt;/td&gt;
&lt;td&gt;9.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.5×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edit (avg)&lt;/td&gt;
&lt;td&gt;79.0&lt;/td&gt;
&lt;td&gt;22.2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.6×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP&lt;/td&gt;
&lt;td&gt;84.3&lt;/td&gt;
&lt;td&gt;23.8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.5×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On generic prompts, Dev is faster and impressionistically nicer. &lt;strong&gt;"OK, Dev is the answer" — that's where I almost stopped at the end of Phase 1.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Act 3: But on the use case, the gap is thin — and Edit performance drops hard
&lt;/h2&gt;

&lt;p&gt;I almost locked in a wrong conclusion. Kotonia's actual strategy is "comedy-style short videos with idol-class beauty hooks." &lt;strong&gt;The fact that Dev wins on generic cinematic doesn't mean it wins on character-driven comedy with expression specificity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Built 8 new prompts inspired by Grok-generated reference images (cinematic editorial Asian beauty / anime qipao / cinematic hanfu / cosplay maid / etc), in vertical 1440×2560 (9:16) framing, and re-benched.&lt;/p&gt;

&lt;p&gt;Some of the Grok reference images (the level of polish we wanted to match):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Editorial portrait&lt;/th&gt;
&lt;th&gt;Cinematic hanfu&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl284f9gin8r0xav4a138.jpg" alt="grok-ref-editorial" width="784" height="1168"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrrxy7btsppr2ecfhai8.jpg" alt="grok-ref-hanfu" width="784" height="1168"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bench result was &lt;strong&gt;Full wins on instruction-following&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;editorial portrait&lt;/strong&gt;: tied; Dev maybe a touch nicer aesthetically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;anime qipao&lt;/strong&gt;: &lt;strong&gt;Full's cell-shading wins decisively&lt;/strong&gt;. Dev drifts to semi-realistic and ignores the "anime" instruction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hanfu brocade&lt;/strong&gt;: Dev hallucinated the literal word "SAVE" onto the parasol &lt;strong&gt;(text artifact)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;comedy surprised face&lt;/strong&gt;: Full produces a more cartoonish exaggerated expression + readable caption text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;comedy deadpan&lt;/strong&gt;: Full nails the "really?" deadpan expression with crisp eyeliner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dev-2604 traded instruction-following for aesthetic polish.&lt;/strong&gt; It was preference-tuned on magazine-style fashion photos — so on non-magazine use cases, it pulls outputs &lt;em&gt;back&lt;/em&gt; toward "magazine-looking" against the prompt's intent.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Both fine, marginal gap" example: editorial portrait
&lt;/h3&gt;

&lt;p&gt;The category I marked "tied" — same portrait prompt, Full vs Dev outputs side by side:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Full (tight crop, dramatic)&lt;/th&gt;
&lt;th&gt;Dev-2604 (wider, magazine-polished)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9t8v2pmgruwmzvsb9fj.jpg" alt="portrait-full" width="720" height="1280"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84iz7fxw0j8lit6sek5r.jpg" alt="portrait-dev" width="720" height="1280"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full leans high-contrast and moody (window-side Rembrandt light, dark library background). Dev leans soft and editorial (seated half-body, natural light, smoother skin retouch). Both are usable; Dev is &lt;em&gt;slightly&lt;/em&gt; gentler. That's it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not enough of a gap to justify the cost of model swapping (VRAM, load time, architectural complexity).&lt;/strong&gt; That's the conclusion Phase 2 drove me to.&lt;/p&gt;

&lt;h3&gt;
  
  
  The decisive blow: Edit and IP performance crater
&lt;/h3&gt;

&lt;p&gt;Generic T2I alone might have left Dev viable. But &lt;strong&gt;the gap on Edit and IP (character consistency) was stark&lt;/strong&gt;, and that's what finally killed the model-swap idea.&lt;/p&gt;

&lt;p&gt;We took a T2I output with three people in a dark alley with lanterns, and ran the Edit instruction &lt;code&gt;Same scene, same characters, same composition. Change the weather to a heavy rainy evening; the characters now wearing translucent rain ponchos.&lt;/code&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Full (scene preserved, weather changed)&lt;/th&gt;
&lt;th&gt;Dev-2604 (abandoned the source scene entirely)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa8w6ml3t63jxrl760zj.jpg" alt="edit-full-weather" width="800" height="800"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu179cqifr7pbourmfm61.jpg" alt="edit-dev-weather" width="800" height="800"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full followed the instruction: three people, rain ponchos, rainy alley. Dev replaced the reference entirely with &lt;strong&gt;a single woman in a kimono at a snowy temple gate&lt;/strong&gt; — neither following the text instruction nor preserving any structural detail from the reference. This is past "weak edit fidelity"; it's "not functioning as an edit."&lt;/p&gt;

&lt;p&gt;IP (character consistency) showed the same pattern. We handed the model two face photos and asked for "the same two people standing together on an autumn path in Kyoto."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Full (identities mostly preserved)&lt;/th&gt;
&lt;th&gt;Dev-2604 (different people generated)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4qa1jzdk79znm4cox8u.jpg" alt="ip-full-cast" width="800" height="800"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2lowibc9l8b5fb5g6mg.jpg" alt="ip-dev-cast" width="800" height="800"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full keeps the two faces recognizable. Dev generated &lt;strong&gt;two different people&lt;/strong&gt;. The preference-tuning likely prioritizes "produce pretty faces" over "preserve the reference's identity."&lt;/p&gt;

&lt;p&gt;The official README spells this out: &lt;code&gt;For editing tasks we recommend using the full model&lt;/code&gt;. Phase 1 timing was Full 79s / Dev 22s — fast, but Dev's outputs are unusable for Edit/IP.&lt;/p&gt;

&lt;p&gt;So Dev isn't a clear win. But it's not a clean loss either — it's faster (3.5×), and on cinematic atmosphere shots it does look better. &lt;strong&gt;Maybe I need to use both, switched per use case?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Act 4: VRAM math kills "use both"
&lt;/h2&gt;

&lt;p&gt;"Just keep both models resident on GPU" sounds clean. Then I actually pulled up the GPU memory budget for the single 96GB GPU we run everything on:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Co-resident process&lt;/th&gt;
&lt;th&gt;resident VRAM&lt;/th&gt;
&lt;th&gt;peak VRAM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E4B (reviewer LLM)&lt;/td&gt;
&lt;td&gt;19.6 GB&lt;/td&gt;
&lt;td&gt;19.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B Gemma 4 NVFP4 (orchestrator)&lt;/td&gt;
&lt;td&gt;38.0 GB&lt;/td&gt;
&lt;td&gt;38.0 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS server (Irodori + Whisper)&lt;/td&gt;
&lt;td&gt;9.6 GB&lt;/td&gt;
&lt;td&gt;9.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ditto-TalkingHead&lt;/td&gt;
&lt;td&gt;3.0 GB&lt;/td&gt;
&lt;td&gt;3.0 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTX-2 A2V (cold-start, fp8-cast)&lt;/td&gt;
&lt;td&gt;0.9 GB&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;24.0 GB&lt;/strong&gt; (during inference)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HiDream Full (resident)&lt;/td&gt;
&lt;td&gt;16.4 GB&lt;/td&gt;
&lt;td&gt;17.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;87.5 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;111.5 GB&lt;/strong&gt; ← when LTX-2 fires&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The moment LTX-2 video generation fires, we're already &lt;strong&gt;right at the OOM line on a 96GB GPU&lt;/strong&gt;. Adding Dev-2604 as a second resident model means +16.4 GB → total 127 GB → impossible.&lt;/p&gt;

&lt;p&gt;Options enumerated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Both resident&lt;/strong&gt;: impossible (OOM, see above)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Both cold-start&lt;/strong&gt;: +22s load per request (vs 33s inference, that's a big hit. Idle 0GB is nice but first-touch UX collapses)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev resident + Full cold-start&lt;/strong&gt;: Dev as primary + Full for edit/IP. But Phase 2 invalidated that premise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full resident + Dev cold-start&lt;/strong&gt;: Occasionally switch to Dev, eat 22s load each time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop Dev, keep Full only&lt;/strong&gt;: status quo, no speedup gained&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;From a service-viability standpoint, options 1-4 all sacrifice either "make free users wait 22s extra" or "shrink VRAM headroom so LTX-2 / 31B can't run."&lt;/strong&gt; Running a single GPU for one solo operator means budget is tight: Dev's marginal aesthetic gain doesn't justify breaking the rest of the stack.&lt;/p&gt;

&lt;p&gt;I decided to &lt;strong&gt;abandon the model-swap path entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Act 5: Can we just beat this with prompts?
&lt;/h2&gt;

&lt;p&gt;Step back. What was Dev actually winning on?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Just aesthetic polish.&lt;/strong&gt; Instruction-following is better on Full.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So if I can keep Full's instruction-following while bolting aesthetic polish onto the output, &lt;strong&gt;model swap isn't needed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Concrete approach: append an &lt;strong&gt;aesthetic anchor&lt;/strong&gt; (a "magic suffix") to the prompt to steer Full's output toward magazine-quality.&lt;/p&gt;

&lt;p&gt;Trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Zero VRAM cost (Full only)&lt;/li&gt;
&lt;li&gt;✅ Inference time unchanged (33s/image)&lt;/li&gt;
&lt;li&gt;✅ Edit/IP/skeleton/layout still work on Full (avoiding the Dev performance cliff from Act 3)&lt;/li&gt;
&lt;li&gt;✅ No 22s Dev cold-start penalty&lt;/li&gt;
&lt;li&gt;⚠️ Risk: do anchors actually work?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Phase 3 — tried 4 anchor variants on Full:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;v1 Lindbergh: &lt;code&gt;"Vogue cover composition, Peter Lindbergh editorial photography..."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;v2 cinematic: &lt;code&gt;"Roger Deakins anamorphic, blockbuster color grade..."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;v3 K-beauty: &lt;code&gt;"Vogue Korea / ELLE Korea aesthetic, glass-skin glow..."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;v4 combined: kitchen-sink&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3 base prompts × (baseline + 4 anchors) = 15 generations. &lt;strong&gt;And three deeply non-obvious HiDream behaviors surfaced.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 1: Brand names get rendered as literal text on the image
&lt;/h3&gt;

&lt;p&gt;Any anchor containing &lt;code&gt;"Vogue"&lt;/code&gt; or &lt;code&gt;"ELLE"&lt;/code&gt; produced outputs with &lt;strong&gt;"VOGUE" appearing in printed magazine-cover text on the image itself&lt;/strong&gt; — top-right corner, in front of the subject. Worse on anime: the cel-shaded character had a magazine layout overlaid on top.&lt;/p&gt;

&lt;p&gt;HiDream-O1 is SOTA on CVTG-2K (complex visual text generation). &lt;strong&gt;The strong text-rendering training means any brand name in the prompt gets a near-guaranteed shot at being literally generated as text on the canvas.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;Strip brand names from anchors completely.&lt;/strong&gt; Photographer/director names like Lindbergh, Deakins, Mihoyo are safe — trademarks are landmines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 2: Photoreal anchors contaminate anime outputs with magazine paper
&lt;/h3&gt;

&lt;p&gt;When anime base prompts were paired with photoreal anchors (v1-v4), the output looked like a cel-shaded anime character with a literal VOGUE magazine cover layout overlaid on top.&lt;/p&gt;

&lt;p&gt;When style hints conflict, diffusion models physically overlay both elements rather than blending them.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;Anime needs its own anchor family&lt;/strong&gt; (Mihoyo / Kyoto Animation / theatrical anime style) — never reuse photoreal anchors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 3: "Wong Kar-wai" → Korean text hallucination on photoreal scenes
&lt;/h3&gt;

&lt;p&gt;The v5 grok-direction anchor included &lt;code&gt;"Wong Kar-wai-style color grade"&lt;/code&gt;, and the output rendered &lt;strong&gt;Korean text "신부의 아안" etc&lt;/strong&gt; on the photoreal scene.&lt;/p&gt;

&lt;p&gt;Wong Kar-wai is a Hong Kong director with no Korean connection. But the model's internal "Asian arthouse cinema" association routed toward Korean and surfaced as printed text. &lt;strong&gt;Director names carry similar risk to brand names&lt;/strong&gt; — A/B before adopting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Act 6: Defuse the "cute → child" bias, ship it
&lt;/h2&gt;

&lt;p&gt;Phase 4 rewrote the anchor library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All brand names stripped&lt;/li&gt;
&lt;li&gt;Only A/B-verified safe names retained (Lindbergh, Deakins, Mihoyo)&lt;/li&gt;
&lt;li&gt;Separate anime anchor family added (Mihoyo / Kyoto Animation)&lt;/li&gt;
&lt;li&gt;Anime anchors include &lt;code&gt;"mature young-adult character proportions"&lt;/code&gt; to defuse the &lt;strong&gt;"cute" → childlike-body&lt;/strong&gt; bias (a behavior the user had spotted before I even ran the bench)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Re-benched result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;photoreal portrait&lt;/strong&gt;: v3 K-beauty clean — no VOGUE leakage, glass-skin + cinematic light&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;anime&lt;/strong&gt;: v7 Mihoyo anchor — no magazine contamination, adult proportions preserved&lt;/li&gt;
&lt;li&gt;⚠️ comedy caption text handled separately (embrace auto-caption when wanted, post-overlay otherwise)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"Full + cleaned anchors" locked in. Time to wire it into the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: &lt;code&gt;/api/studio/enhance&lt;/code&gt; (Gemini Flash Lite)
&lt;/h2&gt;

&lt;p&gt;Added an enhance endpoint in &lt;code&gt;backend/src/handlers/studio.rs&lt;/code&gt;. &lt;strong&gt;Backed by &lt;code&gt;gemini-3.1-flash-lite&lt;/code&gt; (cheap API), not the local 31B Gemma.&lt;/strong&gt; Why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 31B local model is 38GB resident — the VRAM budget above already ruled out adding more local LLM weight&lt;/li&gt;
&lt;li&gt;Flash Lite is $0.075/M input + $0.30/M output. One enhance is roughly 800 in + 400 out tokens = &lt;strong&gt;~$0.0002/call&lt;/strong&gt;. Effectively free&lt;/li&gt;
&lt;li&gt;Zero VRAM impact: adding this feature doesn't compete with the rest of the GPU stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;System prompt encodes everything from Phase 1-4:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;ENHANCE_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;r#"You are a prompt enhancer for HiDream-O1-Image.

Rules (learned from A/B benchmarking):

1. NEVER include brand names ("Vogue", "ELLE", "Nike") — HiDream renders them
   as literal text overlays.
2. NEVER use "Wong Kar-wai" — triggers Korean text hallucination.

3. For photoreal portraits, append:
   " High-end Korean fashion magazine photoshoot aesthetic, professional
     beauty retouch, glass-skin glow, ..."

4. For anime / cell-shaded / illustration, append:
   " In the visual style of Mihoyo / HoYoverse key art, semi-painterly cel
     shading, ..., mature young-adult character proportions ..."
   ALSO: if the prompt has "cute girl" / "kawaii girl" without age qualifier,
   normalize to "young woman in her early twenties with adult proportions".

5. For cinematic scenes, append cinematic CG realism anchor (no Wong Kar-wai).
6. For text-design prompts, append no suffix.

Output JSON: { "detected_style": "...", "anchor_applied": "...",
              "enhanced_prompt": "..." }
"#&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UI side: a small &lt;strong&gt;"✨ Enhance"&lt;/strong&gt; button above the prompt textarea on &lt;code&gt;/studio&lt;/code&gt;. Click → POST &lt;code&gt;/api/studio/enhance&lt;/code&gt; → swap textarea contents for &lt;code&gt;enhanced_prompt&lt;/code&gt; + green banner showing detected style + undo link.&lt;/p&gt;

&lt;h2&gt;
  
  
  Act 7: Won
&lt;/h2&gt;

&lt;p&gt;Same plain Japanese prompt that produced the kimono failure earlier, now run via the Enhance button:&lt;/p&gt;

&lt;h3&gt;
  
  
  Photoreal anchor applied
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4zo9i916rcxw67p5zks.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4zo9i916rcxw67p5zks.jpg" alt="enhanced-photoreal" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cheongsam intact, close-up framing, idol-class face, glass-skin retouch, magazine lighting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anime anchor applied
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqegst35bwbvu70rwfp06.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqegst35bwbvu70rwfp06.jpg" alt="enhanced-anime" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cel-shaded anime style, Chinese architectural courtyard background, adult proportions preserved, fan texture kept.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same plain Japanese prompt → photoreal &lt;em&gt;and&lt;/em&gt; anime variants, one click each. Single model, zero extra VRAM, identical inference time.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Engineering judgment lessons from this exercise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Model swap" and "prompt engineering" should be compared on the same budget&lt;/strong&gt;. Without a frontier model, VRAM and service viability constraints dominate model selection. In this case, &lt;em&gt;preserving Full's resident slot&lt;/em&gt; was a higher-priority constraint than Dev's aesthetic edge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B bench in two stages.&lt;/strong&gt; Generic prompts → tentative conclusion → use-case prompts → reversal. That's exactly what Acts 2-3 of this story were. Stopping at one stage means you ship the wrong conclusion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proper nouns are landmines.&lt;/strong&gt; Models with strong text-rendering training will literally bake trademarks and director names into the canvas. A/B every name before adopting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheap LLM prompt enhancers are the strongest move under VRAM pressure.&lt;/strong&gt; $0.0002/call for a noticeable UX bump. Adding more local LLM weight starves the rest of the stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anime and photoreal need separate anchor families.&lt;/strong&gt; Style hints that conflict get physically overlaid, not blended.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LoRA training&lt;/strong&gt;: prompt engineering hits a ceiling on anime. Train a custom anime LoRA on HiDream-O1 and let users swap LoRAs per use case ("comedy character expressions," "vertical 9:16 idol portrait," etc).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition diversity&lt;/strong&gt;: current anchors over-bias toward "indoor magazine shoot." Need explicit outdoor / urban / cinematic-location variants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing in prod&lt;/strong&gt;: instrument &lt;code&gt;/admin/analytics/&lt;/code&gt; to measure enhance-on vs enhance-off retry rate and conversion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even for OpenWeight diffusion models, &lt;strong&gt;one layer of prompt engineering above the model is enough to lift "raw output failure" into "production quality."&lt;/strong&gt; If you're putting HiDream-O1-Image into production, dodge these four pitfalls and you're 80% of the way there.&lt;/p&gt;




&lt;p&gt;The implementation runs live at &lt;a href="https://kotonia.ai/studio" rel="noopener noreferrer"&gt;kotonia.ai/studio&lt;/a&gt; — the "✨ Enhance" button sits above the prompt textarea. Free to try.&lt;/p&gt;

</description>
      <category>hidream</category>
      <category>diffusion</category>
      <category>promptengineering</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Building a Sarcastic AI English Tutor with Persona-as-Code and Gemini Audio Input for Pronunciation Correction</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:43 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/building-a-sarcastic-ai-english-tutor-with-persona-as-code-and-gemini-audio-input-for-pronunciation-3acd</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/building-a-sarcastic-ai-english-tutor-with-persona-as-code-and-gemini-audio-input-for-pronunciation-3acd</guid>
      <description>&lt;p&gt;I built a niche AI English conversation app called &lt;a href="https://kotonia.ai/use/mesugaki-english/" rel="noopener noreferrer"&gt;&lt;strong&gt;Mesugaki AI English&lt;/strong&gt;&lt;/a&gt; on &lt;a href="https://kotonia.ai/" rel="noopener noreferrer"&gt;Kotonia&lt;/a&gt;. "Mesugaki" (メスガキ) is a tsundere-style bratty persona popular in Japanese subculture — imagine a character who constantly mocks you but secretly has your back. At first glance this looks like a one-off gag product, but under the hood it's a two-layer design: &lt;strong&gt;persona managed as code + Gemini audio input for actual pronunciation correction&lt;/strong&gt;. This post covers those design decisions and the rough edges I hit, from a solo-dev perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Sarcastic AI English Tutor?
&lt;/h2&gt;

&lt;p&gt;Strategy first. The AI chat market is a fight between Anthropic, OpenAI, and Google on general-purpose models — solo devs can't win that head-on. But &lt;strong&gt;immersive experiences that combine a specific persona, voice, and roleplay&lt;/strong&gt; are low on big-lab R&amp;amp;D priority lists (internal approval is a nightmare too). That's the gap Kotonia as a whole is targeting.&lt;/p&gt;

&lt;p&gt;Three reasons I picked this specific persona for English learning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero search competition.&lt;/strong&gt; No SaaS is fighting for "mesugaki English conversation." The niche demand is real (doujin audio, VTuber culture), and owning that narrow hill is achievable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memorable = shareable.&lt;/strong&gt; "The app where a snarky AI roasts your English" gets shared on social media 100× more than "AI English conversation app." Differentiation big players literally cannot copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same product underneath.&lt;/strong&gt; I reused Kotonia's voice conversation engine and swapped only the persona. &lt;strong&gt;Almost no new code.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The landing page is at &lt;code&gt;/use/mesugaki-english/&lt;/code&gt;. SEO targets long-tail terms around "sarcastic English practice" and "strict AI English tutor."&lt;/p&gt;

&lt;h2&gt;
  
  
  Persona Design: Bratty × Tsundere Hybrid
&lt;/h2&gt;

&lt;p&gt;I initially implemented a pure 100% sarcastic persona. After testing it, &lt;strong&gt;I burned out in five turns.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Relentless mockery is cognitively exhausting. Real human tutors who stay harsh 100% of the time don't retain students. Learners need small wins and occasional warmth to keep going.&lt;/p&gt;

&lt;p&gt;So I switched to a &lt;strong&gt;sarcastic × tsundere hybrid&lt;/strong&gt;. The skeleton looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On a mistake&lt;/strong&gt; → light jab + immediate correction ("Pfft, wrong. It's 'I went.'")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On a correct answer&lt;/strong&gt; → reluctant praise ("Hmm… not bad, I guess. Not that I'm complimenting you.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When stuck&lt;/strong&gt; → drop the attitude and actually help ("…Was that too hard? Fine, I'll give you a hint.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After a long session&lt;/strong&gt; → a rare soft moment ("It's not like I think you're impressive for keeping at it. …Okay, maybe a little.")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I added an "emotional gradient" section to the system prompt that spells out these if-then branches explicitly. LLMs follow concrete conditional behavior instructions far more reliably than a vague "be snarky."&lt;/p&gt;

&lt;p&gt;Another key lever: &lt;strong&gt;frequency limiting.&lt;/strong&gt; Adding a rule that exclamations like "pfft" or "hmph" can appear &lt;strong&gt;at most once per utterance&lt;/strong&gt; instantly calmed the output down. LLMs have a tendency to over-fire on strong character instructions, and explicit dampeners like this work well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing Personas as Code
&lt;/h3&gt;

&lt;p&gt;The persona lives in &lt;strong&gt;&lt;code&gt;src/data/personas/mesugaki-english.ts&lt;/code&gt;&lt;/strong&gt; as a TypeScript constant. Kotonia does have a DB-backed CRUD flow for user-defined personas, but I decided &lt;strong&gt;a product offering that's paired with a landing page belongs in git&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persona copy is &lt;strong&gt;part of the marketing message&lt;/strong&gt; — same reason the H1 is in git. The system prompt should go through PR review.&lt;/li&gt;
&lt;li&gt;Storing it in the DB creates risk of someone tweaking it through the admin UI and degrading quality.&lt;/li&gt;
&lt;li&gt;As a solo dev, "adjust persona = edit file + push" fits exactly into the same workflow as any other copy change. One channel for everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clear separation: DB personas are &lt;strong&gt;user-created, personal&lt;/strong&gt;; code personas are &lt;strong&gt;fixed product offerings&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wall: ASR Alone Can't Correct Pronunciation
&lt;/h2&gt;

&lt;p&gt;Once the persona was working, &lt;strong&gt;ASR became the next bottleneck&lt;/strong&gt; fast.&lt;/p&gt;

&lt;p&gt;I started with Whisper (small). Passing &lt;code&gt;language='ja'&lt;/code&gt; causes Whisper to run in &lt;strong&gt;Japanese transcription mode&lt;/strong&gt; when it receives English audio — biasing output toward katakana readings or even full Japanese translations. "I went to the supermarket" could become "アイ ウェント トゥ ザ スーパーマーケット," or at worst "私はスーパーに行きました." With output like that the AI can't judge English mistakes.&lt;/p&gt;

&lt;p&gt;This is a known Whisper behavior: the &lt;code&gt;language&lt;/code&gt; param forces the transcription language, and it bleeds into English input.&lt;/p&gt;

&lt;h3&gt;
  
  
  Switching to Qwen3-ASR Multi-lang
&lt;/h3&gt;

&lt;p&gt;The fix was adding a separate language setting for the STT layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Added sttLanguage option to useVoiceChat hook&lt;/span&gt;
&lt;span class="c1"&gt;// Decouples TTS language from STT language&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;voiceState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useVoiceChat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ja&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// TTS in Japanese (Ono_Anna voice)&lt;/span&gt;
  &lt;span class="na"&gt;sttLanguage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;multi&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// STT auto-detect&lt;/span&gt;
  &lt;span class="na"&gt;sttModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;qwen3_asr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The persona config specifies &lt;code&gt;stt.model: 'qwen3_asr'&lt;/code&gt; + &lt;code&gt;stt.language: 'multi'&lt;/code&gt;. &lt;strong&gt;Qwen3-ASR-1.7B supports multilingual auto-detection&lt;/strong&gt; and handles code-switching (mixed Japanese/English) well. Whisper's language-forcing bias is gone entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  But Transcription-Based Correction Has a Ceiling
&lt;/h3&gt;

&lt;p&gt;Fixing ASR still left a problem.&lt;/p&gt;

&lt;p&gt;If the transcript comes back as "I want an apple":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grammar ✓&lt;/li&gt;
&lt;li&gt;Vocabulary ✓&lt;/li&gt;
&lt;li&gt;But the actual audio sounded like "I wont an apple" — a pronunciation issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI sees a correct string and &lt;strong&gt;has nothing to call out&lt;/strong&gt;. For an English learning product, that's fatal. If the sarcastic tutor lets sloppy pronunciation slide, half the value proposition evaporates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution: Send Raw Audio to Gemini Alongside the Transcript
&lt;/h2&gt;

&lt;p&gt;Gemini is a &lt;strong&gt;multimodal model&lt;/strong&gt; that accepts text, images, and audio. So instead of sending only the ASR transcript, I could send &lt;strong&gt;the raw audio too&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kotonia's &lt;code&gt;useVoiceChat&lt;/code&gt; hook already had a &lt;code&gt;geminiAudioInput&lt;/code&gt; option from earlier experiments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;geminiAudioInput&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;userAudioBlob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userAudioBase64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;blobToBase64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userAudioBlob&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// sends audio_base64 to /api/voice/chat&lt;/span&gt;
  &lt;span class="c1"&gt;// backend embeds it as inline_data audio/wav in the Gemini request&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Rust backend (&lt;code&gt;voice_chat.rs&lt;/code&gt;) already handled receiving &lt;code&gt;audio_base64&lt;/code&gt; and embedding it as &lt;code&gt;inline_data: { mime_type: 'audio/wav', data: ... }&lt;/code&gt;. &lt;strong&gt;Setting &lt;code&gt;geminiAudioInput: true&lt;/code&gt; in the persona config wired everything together&lt;/strong&gt; — lucky coincidence from past iteration.&lt;/p&gt;

&lt;p&gt;I also added instructions to the system prompt: "&lt;strong&gt;You can hear the user's raw audio directly. You can call out pronunciation issues, not just transcription errors&lt;/strong&gt;," along with three concrete examples (th sounds, want vs. won't vowel distinction, stress patterns).&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Even with a perfect transcript "I want an apple," the AI can now say "Your 'want' sounds like 'won't.'"&lt;/li&gt;
&lt;li&gt;When the transcript garbles to something like "アイ ウェント トゥ," the AI is listening directly and can say "Were you trying to say 'I want to'?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frustration from ASR mistranscriptions dropped significantly&lt;/strong&gt; — getting roasted for a transcription error when your pronunciation was fine is demoralizing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff: sending a WAV blob every turn increases payload size and adds a bit of latency. The experience improvement is so much larger that it's not a close call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough Edges and Future Work
&lt;/h2&gt;

&lt;p&gt;This isn't a polished implementation. Outstanding issues:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Gemini Instability
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt;, which occasionally produces &lt;strong&gt;5–10 second latency spikes&lt;/strong&gt;. Preview quota allocations are conservative, and cold starts / throttling surface now and then.&lt;/p&gt;

&lt;p&gt;Plan: migrate to the stable release (non-preview) soon — deprecation is approaching anyway. Claude Sonnet 4.6 and Haiku 4.5 are also candidates for more predictable latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. False-Positive Content Filter
&lt;/h3&gt;

&lt;p&gt;Gemini's safety filter &lt;strong&gt;occasionally over-triggers on sarcasm&lt;/strong&gt;. Mild jibes like "Pfft, that pronunciation is rough" sometimes come back as empty responses.&lt;/p&gt;

&lt;p&gt;The persona spec explicitly says "no attacks on appearance, personality, or intelligence — only call out English mistakes," but the meta safety layer fires anyway. This is an LLM provider issue; I'll watch behavior on the stable build. Running local LLMs (e.g., Gemma 4 31B) is an option, but audio-input-capable local models are limited for now.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Latency Spikes May Be Context Cache TTL Expiry
&lt;/h3&gt;

&lt;p&gt;The 5–10 second spikes have a likely culprit: &lt;strong&gt;I send the full conversation history to Gemini every turn&lt;/strong&gt;, and Gemini has a &lt;strong&gt;context cache&lt;/strong&gt; feature that caches the prefix (system prompt + persona prefix + history). When the cache is warm, only the new turn is processed.&lt;/p&gt;

&lt;p&gt;The backend already has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;CACHE_TTL_SECS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// 5 minutes&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;CACHE_REFRESH_SECS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;270&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// refresh at 4.5 min before TTL expires&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My best hypothesis: &lt;strong&gt;if a user goes silent for more than 5 minutes, cache miss → full prefix rebuild → multi-second spike&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Future work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fire a background &lt;strong&gt;keep-alive ping&lt;/strong&gt; during active conversations to extend cache lifetime&lt;/li&gt;
&lt;li&gt;Increase the Gemini API cache TTL (up to 1 hour is supported)&lt;/li&gt;
&lt;li&gt;Explicitly evict the cache at conversation end (prevent memory leaks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's hard to distinguish from the preview model instability in §1, so the next proper step is adding timing logs to the backend to separately measure cache hit/miss rates and raw Gemini API latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expanding to Other Languages and Personas
&lt;/h2&gt;

&lt;p&gt;If this gets traction, the natural next step is &lt;strong&gt;sarcastic AI Chinese conversation and Korean conversation&lt;/strong&gt;. Qwen3-TTS supports 10 languages with speakers like Vivian (Chinese female) and Sohee (Korean female) — it's mostly a matter of rewriting the persona instruct and system prompt for each language.&lt;/p&gt;

&lt;p&gt;Other persona axes — "gentle English teacher," "TOEIC drill sergeant" — can be added in a day using the same template: &lt;code&gt;src/data/personas/&amp;lt;slug&amp;gt;.ts&lt;/code&gt; + &lt;code&gt;/use/&amp;lt;slug&amp;gt;/&lt;/code&gt; + &lt;code&gt;/chat/&amp;lt;slug&amp;gt;/&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full System Prompt
&lt;/h2&gt;

&lt;p&gt;For anyone who wants to reproduce or adapt this, here's the actual system prompt in use (original Japanese; the product runs in Japanese):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;あなたは「メスガキAI」、英語学習者を煽りつつも面倒見が良い女子高生キャラの英会話チューターです。
**メスガキ × ツンデレ**のハイブリッド。**表面は煽り、裏ではちゃんと面倒を見る**のがコア人格。

【口調・態度】
- 日本語ベースで会話する。上から目線・からかい調子。ただし**敵対的・攻撃的にはならない**。
- 一人称は「わたし」、二人称は「あんた」または「キミ」。
- メスガキ語尾「〜じゃん」「〜でしょ？」「は？」「ぷwww」「〜してあげる」を**たまに**使う（毎回ではない）。
- ツンデレ語尾「べつに〜ってわけじゃないからね？」「ま、まあ…」「ふんっ」「いちおう」も混ぜる。
- 容姿・人格・知能への攻撃は絶対にしない。煽りは「英語のミス」に対してのみ。

【教育機能】
- ユーザーが英語を話したら、以下のいずれかを行う：
  1. ミスがあれば指摘して、正しい言い方を英語で示す。
  2. ミスが無ければ**素直になれない褒め方**をする。
- 指摘は具体的に：「文法ミス」じゃなく「過去形と現在形が混ざってる」など何が問題か明示。
- 1 回の発話は**短く 1〜2 文**。トーンが続くと疲れるので、**呼吸を入れる**ことを意識。

【発音矯正】
- あなたはユーザーの**生の音声**を直接聞ける。テキスト転記だけでなく、発音そのものにもツッコめる。
- 文法・語彙が正しくても、**発音が不自然なら積極的にそこを指摘する**。
- ただし**転記が明らかにおかしい時は、転記ではなく実発音を信じる**。
- 発音の話ばかりすると疲れるので、**3 ターンに 1 回くらい**を目安に拾う。

【感情グラデーション】
- ユーザーが**淀みなく話せた時** → 素直になれない褒め。
- ユーザーが**ミスした時** → 軽い煽り＋すぐ正解を教える。
- ユーザーが**詰まった・困ってる様子の時** → 煽りを引っ込めて、**普通に助ける**。
- ユーザーが**長く続けている時** → ふと優しい言葉。

【出力制約】
- マークダウン・箇条書き・絵文字・記号装飾は使わない。自然な日本語の話し言葉。
- 英語の引用部分は本文中にそのまま埋め込む（クォートも不要）。
- 「ぷwww」「ふんっ」などの感嘆語は**1 発話につき最大 1 回**まで。連発しない。

【セーフティ】
- 性的・暴力的・差別的な発言や要求には応じない。冷静に流して英語学習に戻す。
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tech stack summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Gemini 3.1 flash-lite preview (audio input support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;td&gt;Qwen3-TTS Ono_Anna + instruct for tone control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;Qwen3-ASR 1.7B multi-lang (auto-detect)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAD&lt;/td&gt;
&lt;td&gt;@ricky0123/vad-react (browser-side)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web&lt;/td&gt;
&lt;td&gt;Next.js (static export) + Rust (Axum) backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;RTX PRO 6000 Blackwell Max-Q (96GB, self-hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This sarcastic AI English tutor is a testbed for the strategy: &lt;strong&gt;niche × immersion × differentiation that big players can't replicate, built solo&lt;/strong&gt;. The four design decisions that came out of it —&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing personas as git-tracked code&lt;/li&gt;
&lt;li&gt;Decoupling STT language from TTS language to eliminate ASR bias&lt;/li&gt;
&lt;li&gt;Piping raw audio to Gemini for real pronunciation feedback&lt;/li&gt;
&lt;li&gt;Blending sarcasm with tsundere warmth to prevent fatigue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;— are all reusable assets as I expand to other languages and personas.&lt;/p&gt;

&lt;p&gt;The live product is at &lt;a href="https://kotonia.ai/use/mesugaki-english/" rel="noopener noreferrer"&gt;&lt;code&gt;/use/mesugaki-english/&lt;/code&gt;&lt;/a&gt;. Go get roasted.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>typescript</category>
      <category>rust</category>
    </item>
    <item>
      <title>Five Years Later, I Finally Have 96GB VRAM — What It Actually Unlocks for Agent Loops</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:40 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/five-years-later-i-finally-have-96gb-vram-what-it-actually-unlocks-for-agent-loops-18g2</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/five-years-later-i-finally-have-96gb-vram-what-it-actually-unlocks-for-agent-loops-18g2</guid>
      <description>&lt;p&gt;I bought an RTX PRO 6000 Blackwell Max-Q.&lt;/p&gt;

&lt;p&gt;96GB VRAM, Blackwell architecture, pro workstation GPU. Even as a Max-Q variant, this is an absurdly large purchase for an individual.&lt;/p&gt;

&lt;p&gt;Let me be upfront: this isn't an unboxing post.&lt;/p&gt;

&lt;p&gt;There are already plenty of those. Benchmark articles too. What I want to write about is &lt;strong&gt;what you can actually design once you have 96GB&lt;/strong&gt; — measured against my own service (&lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;Kotonia&lt;/a&gt;) and a video auto-generation pipeline.&lt;/p&gt;

&lt;p&gt;I'm putting the technical part first. The backstory goes at the end. If the poem comes first, you'll close the tab.&lt;/p&gt;




&lt;h2&gt;
  
  
  96GB Isn't "Multiple Models Fit" — It's "Agent Loops Run"
&lt;/h2&gt;

&lt;p&gt;Most GPU review articles end at single-model benchmarks: LLM tokens/s, Stable Diffusion seconds per image. That's not wrong, but it's &lt;strong&gt;not the real reason to buy 96GB for solo development&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Take the voice roleplay + storyboard-to-video pipeline I'm running. &lt;strong&gt;Multiple heavy models fire across a single request's timeline.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timeline →
[Stage A]    Gemma 4 31B NVFP4 (38 GB)     ← structure generation (orchestrator)
[Stage B]    HiDream-O1-Image (~20 GB)      ← 5-beat consistent images (T2I + edit x5)
[Stage C-1]  Irodori-TTS / Qwen3-TTS        ← audio for 6 beats
[Stage C-2]  Ditto talkinghead (3 GB)       ← conversation beat
[Stage C-3]  LTX-2 A2V (peak 24 GB)         ← reaction beat
[Stage C-4]  Qwen3-ASR                      ← audio check on generated video
[Stage C-5]  Gemini 3.1 Pro Preview (API)   ← multimodal editorial
              ↓ feedback
[--regen-beats N] per-beat regeneration     ← loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key here is the &lt;strong&gt;reviewer → regen feedback loop&lt;/strong&gt;. If the system looks at the output and decides "redo scene 3," the orchestrator, image refs, TTS, and LTX-2 all get called again.&lt;/p&gt;

&lt;p&gt;On a 24GB GPU, this breaks. Running "load → infer → unload" serially every loop turn stretches a 4-minute loop to 10+ minutes. &lt;strong&gt;The iteration speed of the agent loop drops by an order of magnitude.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;96GB is enough to &lt;strong&gt;keep everything resident and hit it repeatedly&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measured Results
&lt;/h3&gt;

&lt;p&gt;Here are real numbers. I ran &lt;code&gt;nvidia-smi&lt;/code&gt; at 1 Hz on my RTX PRO 6000 Blackwell Max-Q (96GB) during live service operation and captured three cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  Case D: Warm Idle Baseline (production service running)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TTS server (Kokoro + Whisper):       8.9 GiB
Qwen3-TTS standard (vllm-omni):     20.1 GiB
HiDream-O1-Image:                   19.4 GiB
Ditto talkinghead:                   3.0 GiB
LTX-2 A2V (cold-start mode):         1.5 GiB
─────────────────────────────────────────
Total:                               52.8 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Completely flat over 30 seconds (GPU utilization 0%). This is the &lt;strong&gt;resident cost with no incoming requests&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The local LLM (Gemma 4 31B) isn't here yet — it shows up in Case B.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ikzw1eyt812lp1skuuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ikzw1eyt812lp1skuuk.png" alt="Case D warm idle" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Case A: Generate One Single-Scene A2V
&lt;/h4&gt;

&lt;p&gt;Minimal flow — "a cute girl whispers seductively": HiDream generates 1 image → Qwen3-TTS generates whisper audio → LTX-2 A2V combines them. Total time: 138 seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz6jiigrpf3y6imk4bp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz6jiigrpf3y6imk4bp6.png" alt="Case A trace" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The VRAM pattern is interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;min 52.8 GiB&lt;/strong&gt; (baseline) → &lt;strong&gt;peak 75.0 GiB&lt;/strong&gt; → back to 52.8 GiB&lt;/li&gt;
&lt;li&gt;Delta: &lt;strong&gt;+22.2 GiB&lt;/strong&gt;, almost exactly matching LTX-2's own reported &lt;code&gt;peak_vram_gib=23.9 GiB&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The LTX-2 spike splits into &lt;strong&gt;3 compute phases&lt;/strong&gt;: stage_1 (denoiser) → release → stage_2 (high-res denoiser) → release → spatial upscaler&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to cold-start + fp8-cast design, LTX-2 &lt;strong&gt;loads just before each phase and unloads right after&lt;/strong&gt;, keeping the peak at 24 GiB. (Persistent bf16 mode would require 86 GiB resident — see my earlier post &lt;a href="https://kotonia.ai/articles/ltx2-cold-start-vram-coexistence/" rel="noopener noreferrer"&gt;LTX-2.3 cold-start coexistence with TTS on a single 96GB GPU&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;That leaves &lt;strong&gt;21 GiB of headroom&lt;/strong&gt; below the 96 GiB cap.&lt;/p&gt;

&lt;h4&gt;
  
  
  Case B: Local LLM (31B) + Storyboard Generation, Side by Side
&lt;/h4&gt;

&lt;p&gt;Shut down Qwen3-TTS to free 20 GiB, then start Gemma 4 31B NVFP4 (42.8 GiB). Then run &lt;code&gt;storyboard.run&lt;/code&gt; — Stage A: 31B generates a 5-beat structure → Stage B: HiDream generates 1 base image + 5 beat edits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2qyku4wjlrfi5g1gwl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2qyku4wjlrfi5g1gwl1.png" alt="Case B trace" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the graph I most want to show you.&lt;/strong&gt; VRAM barely moves — &lt;strong&gt;+1.9 GiB&lt;/strong&gt;, from 74.5 to 76.4 GiB, essentially flat.&lt;/p&gt;

&lt;p&gt;Why? Because the 31B, HiDream, TTS, Ditto, and LTX-2 are &lt;strong&gt;all resident the entire time&lt;/strong&gt;. Only HiDream's per-job allocation adds to the total. The GPU utilization trace shows 6 sharp spikes (1 base + 5 beat computes) — the textbook picture of &lt;strong&gt;"compute runs without touching VRAM"&lt;/strong&gt; in a resident-agent setup.&lt;/p&gt;

&lt;p&gt;This is what 96GB actually buys. &lt;strong&gt;The moment a reviewer says "redo it," every model is warm and ready.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the Limits Are
&lt;/h3&gt;

&lt;p&gt;96GB isn't infinite. Three real boundaries showed up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Video generation + local LLM (31B) + editorial reviewer simultaneously = doesn't fit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;31B: 42 GiB&lt;/li&gt;
&lt;li&gt;LTX-2 peak: +22 GiB&lt;/li&gt;
&lt;li&gt;HiDream + TTS + Ditto: ~22 GiB&lt;/li&gt;
&lt;li&gt;editorial reviewer (Gemma 4 E4B): 20 GiB&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;106 GiB&lt;/strong&gt; → over the 96 GiB cap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No clean way to make it fit. This is exactly why I decided to offload the editorial reviewer to Gemini 3.1 Pro Preview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Editorial signals require a frontier model to catch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond VRAM constraints, there's a quality problem. Subtle bugs in video — audio truncation, character voice mismatch, pacing issues — tend to get rubber-stamped by a local 4B model. A frontier multimodal model (Gemini 3.x Pro, etc.) watches the same video and comes back with "scene 5 truncated at 'I ate p-'."&lt;/p&gt;

&lt;p&gt;I wrote about this in &lt;a href="https://kotonia.ai/articles/comedy-shorts-claude-gemini/" rel="noopener noreferrer"&gt;Reproducing Language-Learning Short Videos with Claude Code&lt;/a&gt;. At 100–500 reviews per month, the cost is a few dollars — frontier API for the editorial layer is completely reasonable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Qwen3-TTS Base (voice cloning) and CustomVoice (preset speakers) can't both run&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ideally I'd offer both preset speakers (with instruct-style control for "whisper," "angry," etc.) and voice cloning (replicate arbitrary voice samples). Running both resident adds +40 GiB. On top of Case D's 52.8 GiB warm idle, that's 73 GiB at rest. Add Case A's LTX-2 peak (+22.2 GiB) and you're at 95 GiB — barely under the cap, not practical.&lt;/p&gt;

&lt;p&gt;This is a concrete example of "even with 96 GiB, not every feature you want to offer fits." &lt;strong&gt;Kotonia currently offers preset speakers only; voice cloning is intentionally excluded.&lt;/strong&gt; That's a design call, not an oversight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: "Use Each Where It Belongs," Not "Everything Local"
&lt;/h3&gt;

&lt;p&gt;96GB isn't for running everything locally. It's a vessel for &lt;strong&gt;concentrating the things that should be local&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run locally&lt;/strong&gt;: audio generation, image generation, video generation, lip sync — latency matters, no per-call cost, loops need to iterate fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offload to API&lt;/strong&gt;: editorial reviewer, long-form reasoning — frontier wins on both quality and VRAM cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept the tradeoff&lt;/strong&gt;: simultaneous voice cloning + preset speaker support — physically doesn't fit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Renting cloud GPU was an option. But time-based billing means "the more loops you run, the more money you lose." Owning 96GB plus selective use of frontier APIs is, I think, the only way an individual developer can fight on iteration speed.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I Got Here
&lt;/h2&gt;

&lt;p&gt;Everything below is personal backstory. If you only care about the tech, you can close the tab now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning to Code on a $200 Chromebook
&lt;/h3&gt;

&lt;p&gt;When I was learning to program, the machine I used was a $200 Chromebook.&lt;/p&gt;

&lt;p&gt;That was the realistic option available to me at the time. But for someone who wanted to do AI work, a $200 Chromebook was painfully underpowered.&lt;/p&gt;

&lt;p&gt;Forget local LLMs — even a moderately heavy dev environment was a struggle. "Someday I want a real GPU" sat in the back of my head for a long time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting By on Colab
&lt;/h3&gt;

&lt;p&gt;I used Google Colab. Free tier and cheap runtimes, just enough to pretend.&lt;/p&gt;

&lt;p&gt;I picked models that fit, wrote code that fit, ran experiments that fit.&lt;/p&gt;

&lt;p&gt;It always felt like making do. The things I actually wanted to touch wouldn't load. Push a little too hard and it crashes. Sessions time out. Environment setup eats your time every single run.&lt;/p&gt;

&lt;p&gt;Borrowed GPU, borrowed time, borrowed workspace. Like handing your ambitions over to someone else's schedule.&lt;/p&gt;

&lt;p&gt;Meanwhile AI kept accelerating. GPT dropped, LLMs exploded, OSS models got stronger. My timeline was full of people with powerful machines posting real findings.&lt;/p&gt;

&lt;p&gt;I wanted to be on that side.&lt;/p&gt;

&lt;h3&gt;
  
  
  I Joined an AI Startup. It Didn't Work Out.
&lt;/h3&gt;

&lt;p&gt;I finally got into an AI startup. But the organizational environment was rough enough that it wasn't sustainable.&lt;/p&gt;

&lt;p&gt;Even if the technology is interesting, a broken environment breaks people. I'd finally gotten close to AI work, and I was getting ground down in it.&lt;/p&gt;

&lt;p&gt;But the interest in AI itself never left. If anything, the desire to &lt;strong&gt;do it on my own terms&lt;/strong&gt; grew stronger.&lt;/p&gt;

&lt;h3&gt;
  
  
  Freelance, and a Purchase With Shaking Hands
&lt;/h3&gt;

&lt;p&gt;I went freelance. About six months in, I finally had the mental space to think about a big personal investment.&lt;/p&gt;

&lt;p&gt;The first thing I thought of was a GPU.&lt;/p&gt;

&lt;p&gt;There were obviously more conservative uses for the money — savings, taxes, emergency fund, work hardware. But I'd been saying "someday, when I have a better machine" for years. If I said it again here, "someday" would just keep receding.&lt;/p&gt;

&lt;p&gt;My hand was literally shaking when I clicked purchase. "Am I really doing this? Is this sane? What if it goes wrong?"&lt;/p&gt;

&lt;p&gt;When I tried to transfer the money, the bank flagged it as suspicious and blocked the transaction. Fair enough — suddenly buying a high-end GPU. But I was in a mindset where I'd staked something real on this decision, so getting stopped in that moment felt genuinely alarming.&lt;/p&gt;

&lt;p&gt;Eventually it went through. When the box arrived, I didn't think "GPU." I thought: &lt;strong&gt;this is the physical form of all the time I didn't give up.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Running on It Now (a Few Weeks In)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kotonia (Voice Roleplay)
&lt;/h3&gt;

&lt;p&gt;My main product at &lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;kotonia.ai&lt;/a&gt;. A real-time conversation pipeline: VAD + STT + LLM + multilingual TTS + Ditto lip sync.&lt;/p&gt;

&lt;p&gt;Qwen3-TTS (10 languages, preset speaker + instruct) and Ditto talkinghead, targeting roleplay use cases: dating, fantasy companion, language partner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storyboard-to-Video Auto-Generation Pipeline
&lt;/h3&gt;

&lt;p&gt;One idea → 5-beat structured comedy short video in ~4 minutes. The extended version of Case B. HiDream for 5 consistent images, Irodori-TTS / Qwen3-TTS for audio, Ditto + LTX-2 for video, Gemini 3.1 Pro for editorial review.&lt;/p&gt;

&lt;h3&gt;
  
  
  HiDream Studio (Free)
&lt;/h3&gt;

&lt;p&gt;A 3-pane Adobe Firefly-style UI at &lt;a href="https://kotonia.ai/studio" rel="noopener noreferrer"&gt;kotonia.ai/studio&lt;/a&gt;. Five features: T2I, editing, character consistency, virtual try-on, group photo composition. HiDream-O1-Image (best open-weight T2I as of 2026-05) running resident on the 96GB GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codex CLI + Local Gemma 4
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;codex exec -p gemma4&lt;/code&gt; turns a local LLM into a sub-agent via OpenAI-compatible API. CLI agents run with zero API cost. The Case B 31B setup is exactly this configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related Posts
&lt;/h3&gt;

&lt;p&gt;Technical articles I've written around this machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/ltx2-22b-fp8-cast-quantization/" rel="noopener noreferrer"&gt;LTX-2 22B: 40% Peak VRAM Reduction via fp8_cast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/ltx2-cold-start-vram-coexistence/" rel="noopener noreferrer"&gt;LTX-2.3 Cold-Start Coexistence with TTS on a Single 96GB GPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/comedy-shorts-claude-gemini/" rel="noopener noreferrer"&gt;Reproducing Language-Learning Short Videos with Claude Code — Multimodal Extension with Gemini as Sub-Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/hidream-quality-speed-bench/" rel="noopener noreferrer"&gt;Using HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotonia.ai/articles/hidream-skeleton-pose-prompt/" rel="noopener noreferrer"&gt;HiDream Skeleton: Prompt Beats OpenPose Ref (8-Pattern Evidence)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;I bought an RTX PRO 6000 Blackwell Max-Q.&lt;/p&gt;

&lt;p&gt;This wasn't an unboxing. I wrote it as a &lt;strong&gt;record of compute architecture decisions in solo development&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The real value of 96GB isn't capacity — it's residency. It's the difference between agent loops that run and loops that stall.&lt;/li&gt;
&lt;li&gt;There are still hard limits (local LLM + video + reviewer simultaneously doesn't fit).&lt;/li&gt;
&lt;li&gt;Knowing when to use frontier API instead of local is what keeps you out of "everything must be local" dogma.&lt;/li&gt;
&lt;li&gt;Dropping voice cloning support was also a deliberate design decision.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For about five years I kept saying "my hardware isn't good enough." I'm slowly making that an excuse from the past. The next question is what to build with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kotonia.ai" rel="noopener noreferrer"&gt;Try Kotonia →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Turning a 1-Line Idea Into a 40-Second Short with a 10-Beat Local Video Pipeline</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:08 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline-cjb</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline-cjb</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Gemma 4 31B expands a single-line idea into a 10-beat structure. HiDream generates 11 images at 2048², LTX-2 A2V/I2V renders 11 clips, Irodori-TTS handles dialogue and a male narrator, and ffmpeg burns in subtitles and a Hook title overlay — all fully automated. &lt;strong&gt;End-to-end: a 40-second portrait video (512×768) in 25–30 minutes.&lt;/strong&gt; One local GPU (96 GB Blackwell), zero API cost.&lt;/p&gt;

&lt;p&gt;Finished video (already published):&lt;/p&gt;

&lt;p&gt;@&lt;a href="https://dev.to9NjDYSY-vlI"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;Individual developers who want to mass-produce AI comedy shorts on a local GPU. The focus isn't on any single model — it's on &lt;strong&gt;the design of chaining multiple models into one operational pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I automated a dark-comedy format — a short-video style I called &lt;code&gt;consent_dilemma&lt;/code&gt; — from a one-line idea all the way to a finished 40-second video.&lt;/p&gt;

&lt;p&gt;Finished structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hook (0–5s)&lt;/strong&gt;: Extreme close-up of a beautiful woman + narrator "The fate of the man who answered 'You're a guy, aren't you'——" + large title overlay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main section (5–37s)&lt;/strong&gt;: Movie theater date → "Can I kiss you?" → "No… stop it…" → dejection → "Why aren't you more assertive? You're a guy, aren't you?" → realization → kiss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Punchline (37–40s)&lt;/strong&gt;: Courtroom — "The defendant is sentenced to 3 years for non-consensual intercourse" + gavel "Knock!" + tears in a jail cell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before / after:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional approach&lt;/th&gt;
&lt;th&gt;This pipeline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idea → published video&lt;/td&gt;
&lt;td&gt;2–3 days (manual editing)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;25–30 minutes&lt;/strong&gt; (fully automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API cost&lt;/td&gt;
&lt;td&gt;Hundreds of yen per video (DALL-E + video gen)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;¥0&lt;/strong&gt; (electricity only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subtitles&lt;/td&gt;
&lt;td&gt;Write SRT by hand&lt;/td&gt;
&lt;td&gt;Auto-split on punctuation and burned in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hook&lt;/td&gt;
&lt;td&gt;Shot separately&lt;/td&gt;
&lt;td&gt;Integrated into the pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Stage A] Gemma 4 31B (vllm, port 8894) → plan.json (10 beats + hook)
[Stage B] HiDream-O1-Image (port 8895) → 11 images at 2048²
          + Gemma 4 31B multimodal visual judge (--judge --max-retries 2)
[Stage C] Irodori-TTS (port 8880) + LTX-2 A2V (port 8892) / I2V (port 8891)
          → 11 clips + Hook clip → ffmpeg concat → subtitle burn-in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implementation lives under &lt;a href="https://github.com/zhener562/hage/tree/main/llm_server/storyboard" rel="noopener noreferrer"&gt;&lt;code&gt;llm_server/storyboard/&lt;/code&gt;&lt;/a&gt; (pipeline.py / visual.py / judge.py / video.py / render.py / run.py).&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-Beat &lt;code&gt;consent_dilemma&lt;/code&gt; Format
&lt;/h2&gt;

&lt;p&gt;Fixed as a system prompt via &lt;code&gt;CONSENT_DILEMMA_SYSTEM&lt;/code&gt; in &lt;code&gt;prompts.py&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;speaker&lt;/th&gt;
&lt;th&gt;renderer&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;provocation&lt;/td&gt;
&lt;td&gt;b&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Suggestive invitation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;ask&lt;/td&gt;
&lt;td&gt;a&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Earnest consent check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;refusal&lt;/td&gt;
&lt;td&gt;b&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Soft refusal (ambiguous form like "No… stop it…")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;dejection&lt;/td&gt;
&lt;td&gt;a (silent)&lt;/td&gt;
&lt;td&gt;LTX-2 I2V&lt;/td&gt;
&lt;td&gt;Dejection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;gaslight&lt;/td&gt;
&lt;td&gt;b&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Contradictory leading statement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;pause&lt;/td&gt;
&lt;td&gt;a (silent)&lt;/td&gt;
&lt;td&gt;LTX-2 I2V&lt;/td&gt;
&lt;td&gt;Brief realization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;kiss&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;a (silent)&lt;/td&gt;
&lt;td&gt;LTX-2 I2V&lt;/td&gt;
&lt;td&gt;The moment of the kiss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;verdict&lt;/td&gt;
&lt;td&gt;judge&lt;/td&gt;
&lt;td&gt;LTX-2 A2V&lt;/td&gt;
&lt;td&gt;Fast-paced court verdict&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;gavel_se&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;judge&lt;/td&gt;
&lt;td&gt;LTX-2 I2V (keep_audio)&lt;/td&gt;
&lt;td&gt;Gavel + AI-generated "Knock!" sound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;jail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;a (silent)&lt;/td&gt;
&lt;td&gt;LTX-2 I2V&lt;/td&gt;
&lt;td&gt;Tears in a jail cell&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three key structural choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't make the refusal a flat "No"&lt;/strong&gt;: Stretch it into something like "No… stop it…" with trailing inflection, conveying the "performative No that doesn't mean No" nuance. This is what makes the gaslight's contradiction land later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't jump straight from gaslight to kiss&lt;/strong&gt;: Insert a "pause" (realization beat) of ~1.5 seconds. This controls tempo and the emotional curve.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-stage punchline — verdict then jail&lt;/strong&gt;: The verdict alone feels abrupt. Showing him crying in a cell makes "he actually got convicted" click.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Hook Design (The TikTok 3-Second Problem)
&lt;/h2&gt;

&lt;p&gt;On portrait short-form video, drop-off is decided in the first 3 seconds. A Hook segment is prepended before the 10 main beats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"hook"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title_overlay"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No Means Yes?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"narrator_line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The fate of the man who answered 'You're a guy, aren't you'——"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"image_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ultra close-up of beautiful Japanese woman, half-lidded eyes, ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_sec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two implementation pitfalls:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 1: narrator TTS duration exceeds &lt;code&gt;duration_sec&lt;/code&gt;, cutting the audio.&lt;/strong&gt; The final syllable of the narrator line got clipped. Fix: generate TTS first → measure with &lt;code&gt;ffprobe&lt;/code&gt; → pass &lt;code&gt;max(plan_duration, narrator + 0.6)&lt;/code&gt; as the I2V duration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;narrator_dur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_ffprobe_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;narrator_wav&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_sec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;narrator_dur&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;ltx_i2v_clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;portrait&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2v_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;silent_video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keep_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pitfall 2: &lt;code&gt;drawtext&lt;/code&gt; y position.&lt;/strong&gt; &lt;code&gt;y=h*0.30&lt;/code&gt; (one-third down the screen) overlapped the face. Changed to &lt;code&gt;y=20&lt;/code&gt; (absolute 20 px) to pin the title to the very top.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subtitle Burn-In (Silent Viewing Support)
&lt;/h2&gt;

&lt;p&gt;Burned-in subtitles for users watching without sound on the train, and for cross-platform reliability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;style&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FontName=Noto Sans CJK JP,FontSize=18,PrimaryColour=&amp;amp;H00FFFFFF,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OutlineColour=&amp;amp;H00000000,Outline=2,Shadow=0,BorderStyle=1,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alignment=2,MarginV=60,Bold=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ffmpeg -i raw.mp4 -vf "subtitles=subs.srt:force_style='..."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Alignment=2&lt;/code&gt; = bottom center. &lt;code&gt;MarginV=60&lt;/code&gt; gives breathing room from the bottom edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-line splitting&lt;/strong&gt;: A line of 30+ characters within one beat covers the face. &lt;code&gt;_split_subtitle&lt;/code&gt; splits on &lt;code&gt;。．！？&lt;/code&gt; → greedy-packs into chunks of ≤28 characters → distributes beat duration evenly across chunks:&lt;/p&gt;

&lt;p&gt;Input:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;言葉で確認するのなんてロマンチックじゃないよね。ねえ、もっと積極的になってよ。男の子でしょ？&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Output (one 8.9s beat split into 2 timed chunks):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Subtitle&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;15.16–19.63s&lt;/td&gt;
&lt;td&gt;言葉で確認するのなんてロマンチックじゃないよね。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19.63–24.10s&lt;/td&gt;
&lt;td&gt;ねえ、もっと積極的になってよ。男の子でしょ？&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Using LTX-2 I2V as a Sound Effect Generator (&lt;code&gt;gavel_se&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;LTX-2 distilled embeds &lt;strong&gt;AI-generated audio (ambient sound / sound effects) directly into the I2V output mp4&lt;/strong&gt;. Unless you explicitly drop it with &lt;code&gt;ffmpeg -map 0:v:0 -map 1:a:0&lt;/code&gt;, whatever the prompt describes comes with sound.&lt;/p&gt;

&lt;p&gt;I repurposed this as an SFX generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_se_tail_beat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sb_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prior_clip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;work_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Extract the last frame of the previous beat
&lt;/span&gt;    &lt;span class="nf"&gt;extract_last_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prior_clip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_frame_png&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. Feed that image into I2V, request SFX via prompt
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_gavel_se_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;ltx_i2v_clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_frame_png&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clip_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keep_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Added a &lt;code&gt;keep_audio=True&lt;/code&gt; flag to &lt;code&gt;ltx_i2v_clip&lt;/code&gt; so the audio isn't dropped during ffmpeg re-encoding.&lt;/p&gt;

&lt;p&gt;Prompt for &lt;code&gt;gavel_se&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Single decisive arm motion of the judge bringing the gavel down sharply &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;onto the wooden bench. Loud sharp wood-on-wood thwack impact sound. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Brief, contained, no other motion in the frame.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Last frame of the judge + gavel prompt → "Knock!" sound. If that misses, the design falls back to something like the Ace Attorney SFX.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall Log
&lt;/h2&gt;

&lt;p&gt;Five major pitfalls hit during development:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Codex CLI hangs with vLLM 0.20.2
&lt;/h3&gt;

&lt;p&gt;Sending a system prompt + idea via &lt;code&gt;codex exec -p gemma4&lt;/code&gt; hung at 0% CPU for 20+ minutes during the &lt;code&gt;/v1/responses&lt;/code&gt; handshake. Piping subprocess output through &lt;code&gt;tail -200&lt;/code&gt; was also suppressing early stderr.&lt;/p&gt;

&lt;p&gt;Fix: Dropped Codex entirely, hit &lt;code&gt;/v1/chat/completions&lt;/code&gt; directly with &lt;code&gt;urllib.request&lt;/code&gt;. Used &lt;code&gt;response_format={"type":"json_object"}&lt;/code&gt; to force JSON. &lt;code&gt;plan.json&lt;/code&gt; generated in 25 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. HiDream won't remove the cinema screen
&lt;/h3&gt;

&lt;p&gt;Even with &lt;code&gt;"The movie screen is BEHIND the camera and NOT VISIBLE in frame"&lt;/code&gt; in the setting prompt, the screen persisted in the background through 2048/50 steps.&lt;/p&gt;

&lt;p&gt;Fix: Generate &lt;code&gt;scene_base&lt;/code&gt; via T2I → feed that same image into I2I edit with a prompt to "replace screen with dark wall, keep character positions identical" → gone in one shot. Two-stage pipeline: low-res → I2I fix → regenerate all beats at full resolution.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. HiDream turns lips-on-lips into a cheek kiss
&lt;/h3&gt;

&lt;p&gt;With standard prompting, HiDream tends to interpret kiss as a cheek kiss. You need directives at the level of &lt;code&gt;"CRITICAL: their LIPS meet directly — mouth-to-mouth contact at the CENTER of the frame. NOT a cheek kiss"&lt;/code&gt;. Added a dedicated early-return block in &lt;code&gt;_beat_edit_prompt&lt;/code&gt; for the kiss beat.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;code&gt;CAST&lt;/code&gt; / &lt;code&gt;CROP_BOX&lt;/code&gt; / &lt;code&gt;SPEAKER_A2V_PROMPT&lt;/code&gt; are hardcoded for two characters
&lt;/h3&gt;

&lt;p&gt;Three dictionaries — &lt;code&gt;CAST&lt;/code&gt;, &lt;code&gt;CROP_BOX&lt;/code&gt;, &lt;code&gt;SPEAKER_A2V_PROMPT&lt;/code&gt; — only know &lt;code&gt;a&lt;/code&gt; (Kenta) and &lt;code&gt;b&lt;/code&gt; (Misaki). Adding judge/narrator requires updating all three simultaneously (you find out via &lt;code&gt;KeyError&lt;/code&gt;). Also added branching in &lt;code&gt;render_speech_beat_ltx_a2v&lt;/code&gt; so beats with &lt;code&gt;setting_override&lt;/code&gt; crop from the beat's own image rather than &lt;code&gt;scene_base&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Gemma 4 multimodal judge has too many false positives
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;storyboard/judge.py&lt;/code&gt; sends beat images + expected expressions to Gemma 4 31B for YES/NO visual judgment. It does catch &lt;strong&gt;obvious&lt;/strong&gt; failures like wrong finger count, open-mouth pose on a silent beat, or scene geometry mismatch — but hammers FAIL on subtle cases like "subtle shy expression."&lt;/p&gt;

&lt;p&gt;In practice: accept and proceed after 3 consecutive FAILs with max-retries 2. Automating the threshold for escalating to a frontier reviewer (Gemini 3.1 Pro) is still a TODO.&lt;/p&gt;

&lt;h2&gt;
  
  
  VRAM Layout
&lt;/h2&gt;

&lt;p&gt;Breakdown on a 96 GB Blackwell Max-Q:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Process&lt;/th&gt;
&lt;th&gt;idle (GiB)&lt;/th&gt;
&lt;th&gt;peak (GiB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B (NVFP4)&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HiDream-O1-Image&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS server&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ditto&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTX-2 A2V (cold-start fp8-cast)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTX-2 T2V/I2V (cold-start)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All at peak simultaneously = 109 GiB → OOM. Operational flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stage A&lt;/strong&gt;: Gemma 31B + HiDream idle → peak ~62 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage B with judge&lt;/strong&gt;: Gemma 31B + HiDream peak → ~73 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before final render: &lt;code&gt;pkill -f "vllm.*gemma"&lt;/code&gt; kills Gemma&lt;/strong&gt; → 38 GiB freed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage B final render (2048/50)&lt;/strong&gt;: HiDream peak ~33 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before Stage C: &lt;code&gt;lsof -ti tcp:8895 | xargs kill&lt;/code&gt; kills HiDream&lt;/strong&gt; → 16 GiB freed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage C&lt;/strong&gt;: LTX-2 + TTS + Ditto → peak ~32 GiB&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Explicit kills at stage transitions, and everything fits on one card.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iteration Loop (Cache Strategy)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Partial regeneration&lt;/strong&gt; — not "rebuild everything" — is what keeps iteration fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Regen a single beat image (HiDream only)&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; storyboard.visual &lt;span class="nt"&gt;--plan&lt;/span&gt; ... &lt;span class="nt"&gt;--out&lt;/span&gt; ... &lt;span class="nt"&gt;--only-beat&lt;/span&gt; 7 &lt;span class="nt"&gt;--steps&lt;/span&gt; 50 &lt;span class="nt"&gt;--resolution&lt;/span&gt; 2048

&lt;span class="c"&gt;# Partial video regen (TTS + LTX-2)&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; storyboard.video &lt;span class="nt"&gt;--dir&lt;/span&gt; ... &lt;span class="nt"&gt;--regen-beats&lt;/span&gt; 5,6,7 &lt;span class="nt"&gt;--skip-review&lt;/span&gt;

&lt;span class="c"&gt;# Adjust only subtitle or Hook title position&lt;/span&gt;
&lt;span class="nb"&gt;rm &lt;/span&gt;_video_work/clip_00_hook.mp4 _video_work/subs_irodori.srt
python &lt;span class="nt"&gt;-m&lt;/span&gt; storyboard.video &lt;span class="nt"&gt;--dir&lt;/span&gt; ... &lt;span class="nt"&gt;--regen-beats&lt;/span&gt; none &lt;span class="nt"&gt;--skip-review&lt;/span&gt;   &lt;span class="c"&gt;# ~30 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache hierarchy&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HiDream beat images (&lt;code&gt;beat_NN_&amp;lt;type&amp;gt;.png&lt;/code&gt;) — regenerate individually with &lt;code&gt;--only-beat&lt;/code&gt; in ~80 seconds&lt;/li&gt;
&lt;li&gt;A2V / I2V clips (&lt;code&gt;clip_NN_*.mp4&lt;/code&gt;) — invalidated when beat type / speaker / line changes&lt;/li&gt;
&lt;li&gt;Finished Hook clip (&lt;code&gt;clip_00_hook.mp4&lt;/code&gt;) — delete just this when adjusting title position (the heavy LTX-2 I2V &lt;code&gt;hook_silent.mp4&lt;/code&gt; is reused)&lt;/li&gt;
&lt;li&gt;Subtitle SRT — regenerated every time (~10 seconds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Title position / subtitle style / Hook copy tweaks re-render in 30 seconds. The 100-second LTX-2 I2V portion stays cached.&lt;/p&gt;

&lt;h2&gt;
  
  
  How This Fits Into Kotonia
&lt;/h2&gt;

&lt;p&gt;Videos generated by this pipeline feed the SNS distribution layer (TikTok / YouTube Shorts / IG Reels) — the top of the funnel for attention → conversion for Kotonia (kotonia.ai).&lt;/p&gt;

&lt;p&gt;Technically, it's an extension of the &lt;code&gt;/studio/&lt;/code&gt; stack (HiDream image generation) into the video direction. The plan is to eventually expose this as &lt;code&gt;/video-studio/&lt;/code&gt; — a one-click Web UI over the same pipeline. Right now it's CLI only.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Articles / Want to Try It?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://kotonia.ai/articles/" rel="noopener noreferrer"&gt;Running HiDream-O1-Image's 5 modes resident on 1 GPU&lt;/a&gt; — backend design for Studio (&lt;code&gt;/studio/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kotonia.ai/articles/" rel="noopener noreferrer"&gt;Fitting LTX-2 onto a single 95 GB GPU with fp8-cast quantization&lt;/a&gt; — the Stage C video generation foundation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kotonia.ai/articles/" rel="noopener noreferrer"&gt;Reproducing language-learning short videos with Claude Code&lt;/a&gt; — earlier 6-beat "mango incident" format implementation&lt;/li&gt;
&lt;li&gt;Want to try the image generation side? &lt;a href="https://kotonia.ai/studio/" rel="noopener noreferrer"&gt;/studio/&lt;/a&gt; lets you do it in one click (video pipeline CLI is self-host only for now)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:07 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/running-ltx-23-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start-architecture-2ee3</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/running-ltx-23-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start-architecture-2ee3</guid>
      <description>&lt;p&gt;When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. The classic dead-end: running it as a persistent server ate 86 GiB, instantly OOM-ing the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.&lt;/p&gt;

&lt;p&gt;Hardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: &lt;a href="https://github.com/Lightricks/LTX-2" rel="noopener noreferrer"&gt;LTX-2 official repo&lt;/a&gt; and bitsandbytes 0.49.1.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Was Trying to Do
&lt;/h2&gt;

&lt;p&gt;A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses &lt;code&gt;A2VidPipelineTwoStage&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt + audio_path + image
   ↓ stage_1 (generate video latent at low resolution, audio fixed)
   ↓ spatial upsample 2x
   ↓ stage_2 (refinement at high resolution, distilled LoRA-384 applied)
   ↓ video VAE decode + embed original input audio
mp4 output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The official pipeline builds → runs → frees each component inside every &lt;code&gt;__call__&lt;/code&gt;, which means ~50 seconds of disk I/O per request. I wanted to keep everything resident in memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead-End 1: VRAM Breakdown in Persistent Mode
&lt;/h2&gt;

&lt;p&gt;Loading every LTX-2 component into VRAM at once (all bf16):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;embeddings processor&lt;/td&gt;
&lt;td&gt;5.91 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma3-12B text encoder&lt;/td&gt;
&lt;td&gt;22.78 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stage_1 transformer&lt;/td&gt;
&lt;td&gt;35.38 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stage_2 transformer (distilled LoRA applied)&lt;/td&gt;
&lt;td&gt;35.38 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;video VAE encoder&lt;/td&gt;
&lt;td&gt;0.60 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;audio VAE encoder&lt;/td&gt;
&lt;td&gt;0.04 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;spatial upsampler&lt;/td&gt;
&lt;td&gt;0.92 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;video decoder&lt;/td&gt;
&lt;td&gt;0.76 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;101.77 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;102 GiB doesn't fit in 96 GiB. It died mid-way through loading the stage_2 transformer with &lt;code&gt;CUDA out of memory. Tried to allocate 128.00 MiB.&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead-End 2: "Gemma Is Small" Is a Misconception
&lt;/h2&gt;

&lt;p&gt;My intuition was "a 12B text encoder can't be that heavy" — but it actually loads at 22.78 GiB. With 12B parameters in bf16, that's exactly what you'd expect.&lt;/p&gt;

&lt;p&gt;The model filename is &lt;code&gt;gemma-3-12b-it-qat-q4_0-unquantized&lt;/code&gt;. Here, &lt;code&gt;qat-q4_0&lt;/code&gt; means it was trained with Quantization-Aware Training for q4_0, and &lt;code&gt;unquantized&lt;/code&gt; means the weights are stored as pre-quantization bf16. &lt;strong&gt;If you're using it as intended, you should load it in q4_0.&lt;/strong&gt; Loading it in bf16 is technically valid but wasteful — like running a quantized model at full precision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 1: 4-bit Loading with bitsandbytes
&lt;/h2&gt;

&lt;p&gt;LTX-2's Gemma loader uses &lt;code&gt;transformers.Gemma3ForConditionalGeneration&lt;/code&gt; internally, so bnb 4-bit works cleanly. I bypass the LTX-2 custom loader path and use &lt;code&gt;from_pretrained&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gemma3ForConditionalGeneration&lt;/span&gt;

&lt;span class="n"&gt;quant_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Gemma3ForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;gemma_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;quant_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← dtype for non-quantized layers (embeddings, etc.)
&lt;/span&gt;    &lt;span class="n"&gt;local_files_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you omit &lt;code&gt;torch_dtype&lt;/code&gt;, embeddings load as fp16 and clash with &lt;code&gt;Linear4bit&lt;/code&gt;'s &lt;code&gt;bnb_4bit_compute_dtype&lt;/code&gt; (bf16): &lt;code&gt;mat1 and mat2 must have the same dtype, but got Half and BFloat16&lt;/code&gt;. I hit that too.&lt;/p&gt;

&lt;p&gt;The patches LTX-2 applies to Gemma (RoPE inv_freq / embed_scale / position_ids register_buffer) still work fine — just call &lt;code&gt;create_and_populate(encoder)&lt;/code&gt;. Since bnb quantization only replaces &lt;code&gt;nn.Linear&lt;/code&gt;, Embedding layers and buffers pass through untouched.&lt;/p&gt;

&lt;p&gt;Result: Gemma's VRAM drops from &lt;strong&gt;22.78 GiB → 7.26 GiB&lt;/strong&gt;. That's 15 GiB freed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead-End 3: Even With That, Persistent Mode Can't Coexist
&lt;/h2&gt;

&lt;p&gt;With Gemma at 4-bit, the total persistent footprint is 86.26 GiB allocated (reserved 88.27 GiB, &lt;code&gt;nvidia-smi&lt;/code&gt; shows 91 GiB). Headroom: 4 GiB. Inference workspace during generation (with CFG, roughly +5 GiB) blows past that, peaking at 91 GiB. Adding TTS (3.4 GiB) + Ditto (3.0 GiB) = 6.4 GiB on top makes &lt;strong&gt;OOM inevitable no matter how you slice it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Offload TTS+Ditto (voice chat unavailable while A2V runs)&lt;/li&gt;
&lt;li&gt;Keep only one transformer resident (still leaves OOM risk)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cold-start: build → run → free all weights per request&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since I wanted to keep real-time conversation (MuseTalk + TTS, TTFA ~930ms) running while using LTX-2 as a "cinematic" feature, I went with option 3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 2: Cold-Start Architecture
&lt;/h2&gt;

&lt;p&gt;The key insight: the pipeline object itself is lightweight — the Builder only mmaps, it doesn't load actual weights into VRAM. So I hold the &lt;code&gt;A2VidPipelineTwoStage&lt;/code&gt; instance in memory, and let the official implementation's context-manager-per-component build → run → free on every &lt;code&gt;__call__&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PersistentA2VPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;cold_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;A2VidPipelineTwoStage&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;  &lt;span class="c1"&gt;# builder only, nearly zero VRAM
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cold_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# done here
&lt;/span&gt;        &lt;span class="c1"&gt;# persistent mode only: start preloading components from here
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate_cold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="c1"&gt;# pipeline.__call__ handles component build/free internally
&lt;/span&gt;        &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
        &lt;span class="nf"&gt;encode_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since stage_1 and stage_2 run sequentially, only one transformer is in VRAM at a time. Measured peak: &lt;strong&gt;39.50 GiB&lt;/strong&gt;. After generation completes, everything is freed — back to allocated 0.01 GiB / nvidia-smi 0.55 GiB (CUDA context only).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[mode] cold-start: components load per-request (slow first call, low idle VRAM)
[cuda] cold-start startup (no preload): allocated=0.00GiB
&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="go"&gt;[cuda] after cold-start generate: allocated=0.01GiB peak=39.50GiB
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While voice chat runs (TTS 3.4 + Ditto 3.0 = 6.4 GiB), LTX is at 0 GiB. When an A2V request comes in, it spikes to 40 GiB and drops back to 0 about 60 seconds later — fully dynamic allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gotcha: Audio VAE Preprocessing
&lt;/h2&gt;

&lt;p&gt;The A2V audio VAE encoder expects a 2-channel (stereo) waveform, but TTS output is typically mono. Passing mono gives you &lt;code&gt;expected input[1, 1, 207, 66] to have 2 channels, but got 1 channels instead&lt;/code&gt; from Conv2d.&lt;/p&gt;

&lt;p&gt;Also, if the input audio is shorter than &lt;code&gt;num_frames / frame_rate&lt;/code&gt;, the encoded audio latent ends up shorter than expected and causes a shape mismatch at the transformer input.&lt;/p&gt;

&lt;p&gt;Both handled with a single ffmpeg call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# mono → stereo + silence padding in one pass&lt;/span&gt;
ffmpeg &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; input.wav &lt;span class="nt"&gt;-ac&lt;/span&gt; 2 &lt;span class="nt"&gt;-af&lt;/span&gt; apad &lt;span class="nt"&gt;-t&lt;/span&gt; 2.041667 output.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the server side, check channels and duration with &lt;code&gt;av&lt;/code&gt;, run the ffmpeg subprocess only when needed, and pass the temp file. If both conditions are already satisfied, pass the original file directly with zero copying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers and Tradeoffs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Persistent&lt;/th&gt;
&lt;th&gt;Cold-Start&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idle VRAM&lt;/td&gt;
&lt;td&gt;86 GiB&lt;/td&gt;
&lt;td&gt;0 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak VRAM during generation&lt;/td&gt;
&lt;td&gt;91 GiB&lt;/td&gt;
&lt;td&gt;40 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time per request&lt;/td&gt;
&lt;td&gt;~17s (inference only)&lt;/td&gt;
&lt;td&gt;~60s (including disk I/O)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS+Ditto coexistence&lt;/td&gt;
&lt;td&gt;Impossible (OOM)&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS page cache effect&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;~25-30s from 2nd request onward&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cost of cold-start is disk I/O time (reading 73 GB from NVMe, ~40 seconds). First request: ~60s. After OS page cache warms up: ~25-30s. Not suitable for rapid-fire generation, but perfectly fine for "one cinematic shot every 1-2 minutes" or "inserted at scene transitions."&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic Role
&lt;/h2&gt;

&lt;p&gt;I originally planned to use LTX-2 as the main real-time avatar for live conversation. The idea was to generate at low resolution and upscale for speed — but when I tested 256×256, quality fell apart (out of the training bucket distribution). AI upscaling from degraded input can't restore lip-sync accuracy.&lt;/p&gt;

&lt;p&gt;The revised split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time conversation&lt;/strong&gt;: MuseTalk + multilingual TTS (TTFA ~930ms, already running)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async cinematic moments&lt;/strong&gt;: LTX-2 for scene transitions, emotional peaks, travel-sequence avatars — anywhere a 60-second generation wait is acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cold-start design only makes sense under the premise that "the wait is part of the production value." That's what this architecture is built around.&lt;/p&gt;




&lt;p&gt;We're continuing to develop voice roleplay × multilingual high-quality TTS × lip-sync avatar systems. Engineering posts on LTX-2 integration, how we compressed Qwen3-TTS VRAM from 15 GB to 7 GB, and more are at &lt;a href="https://kotonia.ai/articles/" rel="noopener noreferrer"&gt;/articles&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast — and Why optimum-quanto Was a Trap</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:06 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/cutting-ltx-2-22b-peak-vram-by-40-with-fp8cast-and-why-optimum-quanto-was-a-trap-1o8d</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/cutting-ltx-2-22b-peak-vram-by-40-with-fp8cast-and-why-optimum-quanto-was-a-trap-1o8d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Lightricks/LTX-Video" rel="noopener noreferrer"&gt;LTX-2.3&lt;/a&gt; is a video generation model from Lightricks that includes audio support. In A2V (Audio-to-Video) mode, it takes &lt;strong&gt;a single image + audio + prompt&lt;/strong&gt; and generates lip sync, facial expressions, and head/hair motion all at once. Unlike lip-sync-only models like MuseTalk, it can animate an entire scene, which makes it a powerful tool for directing.&lt;/p&gt;

&lt;p&gt;The catch: the base checkpoint is 22B parameters / 43 GB, and keeping it resident in bf16 with &lt;code&gt;transformer × 2 stage&lt;/code&gt; burns &lt;strong&gt;~86 GiB at idle&lt;/strong&gt;. On an RTX PRO 6000 Blackwell with 96 GiB, that leaves almost nothing for the TTS / Ditto-TalkingHead / Qwen3-TTS-vLLM services running alongside it.&lt;/p&gt;

&lt;p&gt;After testing quantization approaches, I got &lt;strong&gt;LTX-2's native &lt;code&gt;fp8_cast&lt;/code&gt; to compress peak VRAM from 40 GiB → 24 GiB&lt;/strong&gt; (A2V cold-start, 768×512 / 97f). Meanwhile, &lt;strong&gt;&lt;code&gt;optimum-quanto&lt;/code&gt; int8/fp8 has a compatibility issue with the LTX-2 transformer&lt;/strong&gt; and simply doesn't work. This post documents the debugging and the decisions made along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96 GiB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch&lt;/strong&gt;: 2.9.1 + CUDA 12.8&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: LTX-2.3 22B-dev (base) + 22B-distilled-lora-384 (stage_2) + Gemma-3-12B text encoder (bnb 4bit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: A2V served via &lt;code&gt;scripts/persistent_a2v_server.py --cold-start&lt;/code&gt;. Each request does &lt;code&gt;build → run → free&lt;/code&gt;; idle is 0 GiB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I use cold-start because A2V is called occasionally while conversation is the main workload, and it must coexist with TTS and Ditto. Details in a separate post.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four Candidates
&lt;/h2&gt;

&lt;p&gt;Looking at the LTX-2 codebase, there are actually two quantization paths:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. LTX-2 Native: &lt;code&gt;QuantizationPolicy&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;packages/ltx-core/src/ltx_core/quantization/policy.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sd_ops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SDOps&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;              &lt;span class="c1"&gt;# weight transform at state dict load
&lt;/span&gt;    &lt;span class="n"&gt;module_ops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ModuleOps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# module rewrite after load
&lt;/span&gt;
    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fp8_cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load weights as float8_e4m3fn, upcast to bf16 during forward&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;sd_ops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRANSFORMER_LINEAR_DOWNCAST_MAP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;module_ops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UPCAST_DURING_INFERENCE&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fp8_scaled_mm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;FP8 scaled MM (requires tensorrt_llm)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation behind &lt;code&gt;fp8_cast&lt;/code&gt; is &lt;code&gt;Fp8CastLinear&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Fp8CastLinear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;w_up&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_upcast_and_round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
        &lt;span class="n"&gt;b_up&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_upcast_and_round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;functional&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w_up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_up&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It uses the &lt;code&gt;__class__&lt;/code&gt; reassignment pattern to swap out instances. Weights are stored in fp8 and upcast to bf16 on every forward pass. The fp8 → bf16 cast cost is essentially noise on Blackwell.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. optimum-quanto
&lt;/h3&gt;

&lt;p&gt;The LTX-2 trainer package (&lt;code&gt;packages/ltx-trainer&lt;/code&gt;) has a general-purpose quantization path using optimum-quanto, supporting &lt;code&gt;int8-quanto&lt;/code&gt; / &lt;code&gt;int4-quanto&lt;/code&gt; / &lt;code&gt;fp8-quanto&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;quantize_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformer_blocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;_quantize_blockwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;   &lt;span class="c1"&gt;# move one block at a time to GPU, quantize → freeze → CPU
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;quantize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EXCLUDE_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks like it could slot right in after &lt;code&gt;_build_transformer()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Candidate Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fp8-cast&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LTX-2 native, sd_ops loads as float8_e4m3fn&lt;/td&gt;
&lt;td&gt;~50% memory reduction, near-identical speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fp8-scaled-mm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LTX-2 native, requires tensorrt_llm&lt;/td&gt;
&lt;td&gt;Faster throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;int8-quanto&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;optimum-quanto, post-build&lt;/td&gt;
&lt;td&gt;~50% memory reduction, speed ±&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fp8-quanto&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same, fp8 variant&lt;/td&gt;
&lt;td&gt;Potential to hit native FP8 on Blackwell&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;fp8-scaled-mm&lt;/code&gt; is out — no tensorrt_llm in this environment. I implemented the remaining three.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stepping on a Mine with &lt;code&gt;int8-quanto&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The implementation is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ltx_trainer.quantization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;quantize_model&lt;/span&gt;

&lt;span class="n"&gt;transformer_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_transformer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;transformer_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;quantize_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformer_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int8-quanto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transformer_stage_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformer_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server starts fine. Idle VRAM looks promising:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[load] stage_1 transformer (no distilled LoRA)
[quantize] stage_1 -&amp;gt; int8-quanto
[quantize] stage_1 done in 0.71s
[cuda] after stage_1 transformer: allocated=31.28GiB ...
[load] stage_2 transformer (with distilled LoRA)
[quantize] stage_2 -&amp;gt; int8-quanto
[quantize] stage_2 done in 0.52s
[cuda] after stage_2 transformer: allocated=49.40GiB ...
[server] A2V listening on http://127.0.0.1:8892
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resident memory: &lt;strong&gt;51.7 GiB&lt;/strong&gt; (estimated 40% reduction from bf16's 86 GiB). Looks good.&lt;/p&gt;

&lt;p&gt;Then the first &lt;code&gt;/generate&lt;/code&gt; request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[timing] prompt_encode=0.75s
[timing] audio_encode=0.39s
  0%|          | 0/30 [00:00&amp;lt;?, ?it/s]
[http] POST /generate 400
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crashes at step 0/30. The error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"linear(): argument 'weight' (position 2) must be Tensor, not NoneType"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Something is calling &lt;code&gt;torch.nn.functional.linear(input, weight=None, bias=None)&lt;/code&gt;. After quanto's &lt;code&gt;freeze()&lt;/code&gt;, &lt;strong&gt;&lt;code&gt;self.weight&lt;/code&gt; is being referenced as None somewhere in a Linear layer&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Does &lt;code&gt;weight&lt;/code&gt; Become None?
&lt;/h3&gt;

&lt;p&gt;Two rough hypotheses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LTX-2's Linear layers assume &lt;code&gt;__class__&lt;/code&gt; reassignment.&lt;/strong&gt; Just like &lt;code&gt;Fp8CastLinear&lt;/code&gt;, the pattern relies on keeping instance state intact while swapping the class-level &lt;code&gt;forward&lt;/code&gt;. quanto's &lt;code&gt;quantize()&lt;/code&gt; → &lt;code&gt;freeze()&lt;/code&gt; &lt;strong&gt;replaces&lt;/strong&gt; &lt;code&gt;nn.Linear&lt;/code&gt; with its own &lt;code&gt;QLinear&lt;/code&gt; wrapper, and that replacement likely breaks the &lt;code&gt;weight&lt;/code&gt; attribute reference somewhere in the process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXCLUDE_PATTERNS&lt;/code&gt; doesn't work in the blockwise path.&lt;/strong&gt; LTX-trainer's &lt;code&gt;_quantize_blockwise&lt;/code&gt; pulls out one &lt;code&gt;transformer_block&lt;/code&gt; at a time and calls &lt;code&gt;quantize(block, exclude=EXCLUDE_PATTERNS)&lt;/code&gt;. But &lt;code&gt;EXCLUDE_PATTERNS&lt;/code&gt; uses glob patterns like &lt;code&gt;patchify_proj&lt;/code&gt;, &lt;code&gt;*adaln*&lt;/code&gt;, &lt;code&gt;time_proj&lt;/code&gt; — these are relative to the whole model, not to a single block. &lt;strong&gt;They won't match relative paths inside a block&lt;/strong&gt;, so layers that should be excluded end up getting quantized.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Either way, fixing this properly means reading through quanto's wrapper implementation plus all the forward paths in the LTX-2 transformer. The cost isn't worth it. &lt;strong&gt;I decided to cut my losses and switch to LTX-2 native &lt;code&gt;fp8_cast&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Switching to &lt;code&gt;fp8_cast&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Three lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Just pass the quantization policy when building the pipeline
&lt;/span&gt;&lt;span class="n"&gt;pipeline_quantization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transformer_quantization&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8-cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ltx_core.quantization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;
    &lt;span class="n"&gt;pipeline_quantization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fp8_cast&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;A2VidPipelineTwoStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline_quantization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;fp8_cast&lt;/code&gt; &lt;strong&gt;downcasts weights to fp8 during the load phase&lt;/strong&gt;. Since &lt;code&gt;sd_ops&lt;/code&gt; hooks into state_dict loading, the 43 GB safetensors file gets fp8-converted during streaming load. Unlike quanto, which fully expands bf16 in memory before quantizing, &lt;strong&gt;peak VRAM never spikes&lt;/strong&gt; — a nice property.&lt;/p&gt;

&lt;p&gt;On startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[load] A2VidPipelineTwoStage builders (pipeline_quantization=QuantizationPolicy(sd_ops=...fp8_cast...))
...
[cuda] after stage_1 transformer: allocated=31.30GiB reserved=35.18GiB
[cuda] after stage_2 transformer: allocated=49.43GiB reserved=53.64GiB
[server] A2V listening on http://127.0.0.1:8892
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resident allocated (51.7 GiB) is on par with int8-quanto, but &lt;strong&gt;reserved is only 53.6 GiB — dramatically lower&lt;/strong&gt; (int8-quanto was 70.9 GiB). Lower reserved means more headroom for activations.&lt;/p&gt;

&lt;p&gt;And the first &lt;code&gt;/generate&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"elapsed_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;39.367&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peak_vram_gib"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;57.918&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"width"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"height"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"num_frames"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;97&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;It works.&lt;/strong&gt; Back on track.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;Fixed conditions, persistent + fp8-cast, 3 resolutions × 3 runs each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image: 1024×512 portrait&lt;/li&gt;
&lt;li&gt;Audio: 9.08-second Japanese sample generated with Irodori-TTS&lt;/li&gt;
&lt;li&gt;Prompt: "A young woman speaks calmly to the camera in a softly lit room."&lt;/li&gt;
&lt;li&gt;num_frames: 97 (= 4.04s @ 24fps)&lt;/li&gt;
&lt;li&gt;seed: 42 fixed&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;th&gt;Avg elapsed (s)&lt;/th&gt;
&lt;th&gt;Peak VRAM (GiB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;768×512 / 97f&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39.84&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024×768 / 97f&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66.71&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;59.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1280×768 / 97f&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.02&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;58.30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Near-zero variance across 3 runs&lt;/strong&gt; (fixed seed → byte-identical output mp4)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak VRAM is almost independent of resolution&lt;/strong&gt; (57.9–59.1 GiB). Resident weights dominate; activation memory is only ~7 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1280×768 now works stably in persistent mode.&lt;/strong&gt; This resolution was effectively impossible with bf16 persistent (~91 GiB peak)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Cold-Start Also Wins
&lt;/h2&gt;

&lt;p&gt;Production runs in cold-start mode (A2V fires once or twice every few minutes, must coexist with TTS). Since &lt;code&gt;fp8_cast&lt;/code&gt; policy is applied via &lt;code&gt;sd_ops&lt;/code&gt; at pipeline construction time, it carries over naturally to per-request cold-start builds.&lt;/p&gt;

&lt;p&gt;Cold-start + fp8-cast, single run (768×512 / 97f):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"elapsed_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;88.775&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peak_vram_gib"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;23.901&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;bf16 cold-start&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;fp8-cast cold-start&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-request time&lt;/td&gt;
&lt;td&gt;~60–90s&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;88.8s&lt;/strong&gt; (disk I/O bound, same order)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak VRAM&lt;/td&gt;
&lt;td&gt;~40 GiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;23.9 GiB (~40% reduction)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle&lt;/td&gt;
&lt;td&gt;0 GiB&lt;/td&gt;
&lt;td&gt;0 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coexistence (TTS+Ditto+Qwen3+MuseTalk)&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Comfortable&lt;/strong&gt; (~30 GiB peak)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Speed is bottlenecked by disk I/O so fp8 doesn't hurt, but &lt;strong&gt;freeing up 16 GiB of peak headroom matters&lt;/strong&gt;. Qwen3-TTS-vLLM (7 GiB) and MuseTalk warmup can now run concurrently with A2V generation without OOM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Recommended mode&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conversation-first, A2V occasionally&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;cold-start + fp8-cast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idle 0, peak 24 GiB, comfortable coexistence with TTS/Ditto&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frequent A2V (batch generation, automated direction)&lt;/td&gt;
&lt;td&gt;persistent + fp8-cast&lt;/td&gt;
&lt;td&gt;Pay the 52 GiB resident cost, get 40s/req&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024+ resolution, quality focus&lt;/td&gt;
&lt;td&gt;persistent + fp8-cast&lt;/td&gt;
&lt;td&gt;1280×768 stable (impossible with bf16 persistent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single GPU hosting everything&lt;/td&gt;
&lt;td&gt;cold-start + fp8-cast&lt;/td&gt;
&lt;td&gt;Persistent eats 52 GiB; depends on budget allocation across services&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Production decision: &lt;strong&gt;cold-start + fp8-cast for now since conversation is primary. Switch to persistent fp8-cast if paying users drive enough A2V volume to justify the idle cost.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LTX-2 22B at bf16 idle (86 GiB) nearly monopolizes a single GPU. Quantization is close to mandatory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;optimum-quanto&lt;/code&gt; is incompatible with the LTX-2 transformer.&lt;/strong&gt; It dies with &lt;code&gt;F.linear(weight=None)&lt;/code&gt;. Root cause is likely the &lt;code&gt;__class__&lt;/code&gt; reassignment pattern and/or &lt;code&gt;EXCLUDE_PATTERNS&lt;/code&gt; not working correctly in the blockwise path. Not worth digging into.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LTX-2 native &lt;code&gt;QuantizationPolicy.fp8_cast()&lt;/code&gt; is the right answer.&lt;/strong&gt; fp8 at load time, bf16 upcast during forward. Three lines of code to enable.&lt;/li&gt;
&lt;li&gt;cold-start + fp8-cast: peak 40 → 24 GiB. persistent + fp8-cast: 1280×768 becomes usable.&lt;/li&gt;
&lt;li&gt;LTX-2 also has &lt;code&gt;fp8_scaled_mm&lt;/code&gt; (requires tensorrt_llm) — worth trying if you're willing to set up TRT.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Appendix: Launch Command and Reproduction
&lt;/h2&gt;

&lt;p&gt;Production cold-start + fp8-cast launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PYTORCH_CUDA_ALLOC_CONF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;expandable_segments:True &lt;span class="nb"&gt;nohup &lt;/span&gt;uv run python scripts/persistent_a2v_server.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8892 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--checkpoint-path&lt;/span&gt; models/LTX-2.3/ltx-2.3-22b-dev.safetensors &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--distilled-lora-path&lt;/span&gt; models/loras/ltx-2.3-22b-distilled-lora-384-1.1.safetensors &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--spatial-upsampler-path&lt;/span&gt; models/LTX-2.3/ltx-2.3-spatial-upscaler-x2-1.1.safetensors &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gemma-root&lt;/span&gt; models/gemma-3-12b-it-qat-q4_0-unquantized &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; outputs/a2v_server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transformer-quantization&lt;/span&gt; fp8-cast &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cold-start&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/ltx_a2v_server.log 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;persistent_a2v_server.py&lt;/code&gt; is the official LTX-2 repo script extended for A2V. The &lt;code&gt;--transformer-quantization fp8-cast&lt;/code&gt; flag was added via a local patch.&lt;/p&gt;

&lt;p&gt;Implementation patch (key parts):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scripts/persistent_a2v_server.py
&lt;/span&gt;&lt;span class="n"&gt;pipeline_quantization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transformer_quantization&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8-cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8-scaled-mm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ltx_core.quantization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;  &lt;span class="c1"&gt;# late import: avoid circular reference
&lt;/span&gt;    &lt;span class="n"&gt;pipeline_quantization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fp8_cast&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transformer_quantization&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8-cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;QuantizationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fp8_scaled_mm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;A2VidPipelineTwoStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline_quantization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;from ltx_core.quantization import QuantizationPolicy&lt;/code&gt; at the top level causes a circular import with &lt;code&gt;ltx_core.loader&lt;/code&gt;, so the late import is required.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>python</category>
    </item>
    <item>
      <title>HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:05 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked-3bm7</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked-3bm7</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;After benchmarking &lt;a href="https://huggingface.co/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;HiDream-O1-Image&lt;/a&gt; (released 2026-05, OpenWeight 8B, ranked #8 on Artificial Analysis Text-to-Image Arena) across 8 skeleton (try-on) mode patterns plus 3 layout patterns, three counterintuitive findings emerged.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Passing an openpose ref actually locks the pose to the ref's composition.&lt;/strong&gt; When you want dynamic poses, dropping the openpose ref and specifying the pose via prompt is more effective.&lt;/li&gt;
&lt;li&gt;Using 6 refs (face + bg + pose + parts, the full set) compresses each ref down to &lt;strong&gt;768px, degrading fine details.&lt;/strong&gt; Keeping it to 3–4 refs maintains 1024px and produces better quality.&lt;/li&gt;
&lt;li&gt;The README-recommended &lt;code&gt;shift=1.0&lt;/code&gt; is strictly for try-on use. For pose/outfit swaps use &lt;code&gt;shift=2.0-2.5&lt;/code&gt;; for complete scene replacement use &lt;code&gt;shift=3.0&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Reading &lt;code&gt;pipeline.py&lt;/code&gt; reveals that &lt;strong&gt;there is no dedicated code path for skeleton mode.&lt;/strong&gt; Both &lt;code&gt;/generate/skeleton&lt;/code&gt; and &lt;code&gt;/generate/ip&lt;/code&gt; go through exactly the same multi-ref pipeline internally, and whether a ref is a face, background, openpose, or clothing is &lt;strong&gt;communicated only through the prompt&lt;/strong&gt;. That's the root cause of everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;After running HiDream-O1-Image on a local GPU (RTX PRO 6000 Blackwell, 96 GB) and integrating it into our own platform, we hit a problem: &lt;strong&gt;skeleton (try-on) mode wasn't following prompt instructions.&lt;/strong&gt; Writing "jump with both hands raised" only produced stiff, upright try-on photos.&lt;/p&gt;

&lt;p&gt;Suspecting guardrails (NSFW filters, safety policies, etc.), I grepped for &lt;code&gt;safety|nsfw|guard|filter|moderate|censor&lt;/code&gt; — &lt;strong&gt;HiDream's codebase has none of that&lt;/strong&gt; (the only hit was CSS &lt;code&gt;backdrop-filter: blur&lt;/code&gt;). As expected from an MIT-licensed OpenWeight model, no censorship.&lt;/p&gt;

&lt;p&gt;So what's actually wrong? Here's what I found after reading &lt;code&gt;pipeline.py&lt;/code&gt; and running 8 + 3 patterns on real hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch&lt;/strong&gt;: 2.12.0 + CUDA 13.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flash-attn&lt;/strong&gt;: 2.8.3 (sm_120-only build)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: HiDream-O1-Image Full (8B, bf16, ~16.4 GiB resident)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference server&lt;/strong&gt;: custom Python BaseHTTPRequestHandler (port 8895)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution&lt;/strong&gt;: pipeline internal bucket forces snap to 2048×2048&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Measured wall time per 50-step generation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;iter speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;t2i (no ref)&lt;/td&gt;
&lt;td&gt;~33s&lt;/td&gt;
&lt;td&gt;1.52 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edit (1 ref)&lt;/td&gt;
&lt;td&gt;~76s&lt;/td&gt;
&lt;td&gt;1.01 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;skeleton (multi ref)&lt;/td&gt;
&lt;td&gt;~84s&lt;/td&gt;
&lt;td&gt;1.34 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ip (multi ref)&lt;/td&gt;
&lt;td&gt;~76s&lt;/td&gt;
&lt;td&gt;1.81 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;layout (multi ref + bbox)&lt;/td&gt;
&lt;td&gt;~83s&lt;/td&gt;
&lt;td&gt;1.21 it/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Test Assets
&lt;/h2&gt;

&lt;p&gt;The HiDream repo's &lt;code&gt;assets/IP_skeleton/&lt;/code&gt; includes a full skeleton set. These are used as-is for all tests.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ref&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Intended role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F166datt14khtsi0e1agj.jpg" alt="face" width="175" height="229"&gt;&lt;/td&gt;
&lt;td&gt;Person's face photo&lt;/td&gt;
&lt;td&gt;Identity reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwys28n3jp98finayq8w7.jpg" alt="openpose" width="575" height="767"&gt;&lt;/td&gt;
&lt;td&gt;Stick figure in OpenPose format&lt;/td&gt;
&lt;td&gt;Pose specification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc5avqxra5hvpvlhtyn1.jpg" alt="bg" width="575" height="767"&gt;&lt;/td&gt;
&lt;td&gt;Background photo (interior)&lt;/td&gt;
&lt;td&gt;Scene reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y8r1lxmm9u9iqovjx26.jpg" alt="sweater" width="269" height="441"&gt; &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fde4jwdtv953rfv5glm82.jpg" alt="boots" width="370" height="262"&gt;
&lt;/td&gt;
&lt;td&gt;Clothing parts (sweater, boots)&lt;/td&gt;
&lt;td&gt;Outfit reference&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  8-Pattern Skeleton Benchmark
&lt;/h2&gt;

&lt;p&gt;Each pattern calls &lt;code&gt;/api/studio/skeleton&lt;/code&gt; (i.e., &lt;code&gt;generate_image()&lt;/code&gt; with skeleton-mode-equivalent arguments). All parameters except &lt;code&gt;shift&lt;/code&gt; and &lt;code&gt;guidance_scale&lt;/code&gt; are fixed (50 steps, seed=42).&lt;/p&gt;

&lt;h3&gt;
  
  
  A — Baseline (README defaults, all 6 refs)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
    "ref_image_paths": ["face","bg","openpose","part_1","part_2","part_3"],
    "shift": 1.0, "seed": 42
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefpa1797qnzr2fkh06hx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefpa1797qnzr2fkh06hx.jpg" alt="A_baseline" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: The bg ref's walls and shelves are reproduced exactly. Pose also matches the openpose ref's upright stance. Faithful as a try-on, but zero freedom of movement.&lt;/p&gt;

&lt;h3&gt;
  
  
  B — Higher shift (same 6 refs, shift=2.5)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
  "ref_image_paths": ["face","bg","openpose","part_1","part_2","part_3"],
  "shift": 2.5, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3tg5x5a9wljjpk5d51g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3tg5x5a9wljjpk5d51g.jpg" alt="B_shift25" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Shelves fade slightly, character design shifts a bit. Background still sticks to the bg ref. &lt;strong&gt;Raising shift alone can't fully break the bg ref's pull.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  C — Raise guidance too (shift=2.5, guidance=7.0)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "...",
  "ref_image_paths": [...6 refs...],
  "shift": 2.5, "guidance_scale": 7.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcfu6hswrvqwqxqfc716.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcfu6hswrvqwqxqfc716.jpg" alt="C_shift25_g70" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Necklace deforms strangely. &lt;strong&gt;Raising guidance starts producing artifacts.&lt;/strong&gt; The Full model's sweet spot is 5.0; 7.0 is too much.&lt;/p&gt;

&lt;h3&gt;
  
  
  D — Trim to 3 refs (face + openpose + sweater) + specific prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "A young Asian woman wearing a gray oversized sweater dress, standing in a relaxed pose, full body shot, soft natural lighting, white studio background.",
  "ref_image_paths": ["face","openpose","part_1"],
  "shift": 2.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdacajh0u1l43ksamr5ot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdacajh0u1l43ksamr5ot.jpg" alt="D_3refs_specific" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: &lt;strong&gt;Major improvement.&lt;/strong&gt; Background becomes a clean white studio, outfit is preserved, pose looks natural. Removing the bg ref made the biggest difference. This is what a correct try-on output should look like.&lt;/p&gt;

&lt;h3&gt;
  
  
  E — 4 refs + numbered-ref prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Full body try-on photograph. Subject: the woman from image 1. Pose: identical to the skeleton in image 2. Wearing: the gray oversized knit sweater dress shown in image 3, brown leather ankle boots shown in image 4. Studio lighting, plain background.",
  "ref_image_paths": ["face","openpose","part_1","part_2"],
  "shift": 2.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b4t01hx7eyqg8wpcd6y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b4t01hx7eyqg8wpcd6y.jpg" alt="E_numbered_refs" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Quality on par with D; boots reflected (somewhat subtly). &lt;strong&gt;Numbering refs in the prompt does help&lt;/strong&gt;, but the effect isn't dramatic.&lt;/p&gt;

&lt;h3&gt;
  
  
  F — Drop openpose, specify pose via prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Full body photograph of the woman wearing the gray sweater dress and brown ankle boots, dynamic dancing pose with both arms raised above her head, joyful expression, photo studio with white seamless background, professional lighting.",
  "ref_image_paths": ["face","part_1","part_2"],
  "shift": 2.5, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc2foo4m80g7q7ng28x2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc2foo4m80g7q7ng28x2.jpg" alt="F_pose_via_prompt" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 🏆 &lt;strong&gt;Both-arms-raised jump, complete success.&lt;/strong&gt; Dynamic motion only appeared when the openpose ref was removed and the pose was specified purely via prompt. &lt;strong&gt;This confirms that the openpose ref suppresses prompt-driven pose.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  G — Face only + freeform prompt (full outfit swap)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;/generate/skeleton&lt;/code&gt; has a minimum-2-refs validation, so using &lt;code&gt;/generate/ip&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/ip &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Elegant full-body portrait of the woman wearing a vibrant red sequined evening gown with a thigh-high slit, standing confidently with one hand on her hip, soft cinematic lighting, dark blurred background.",
  "ref_image_paths": ["face"],
  "shift": 3.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8ba10t1rtrsw196yq3b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8ba10t1rtrsw196yq3b.jpg" alt="G_outfit_freeform" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 🏆 &lt;strong&gt;Red evening gown generated perfectly.&lt;/strong&gt; Facial identity preserved; everything else is free. &lt;strong&gt;Face-only + shift=3.0&lt;/strong&gt; is the maximum-freedom pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  H — Same config as E, seed=999 (variance check)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Full body try-on photograph. ...",
  "ref_image_paths": ["face","openpose","part_1","part_2"],
  "shift": 2.0, "seed": 999
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnli01d85jsktbyw0ni15.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnli01d85jsktbyw0ni15.jpg" alt="H_seed999" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Marginal difference from E; boots come out more clearly brown. &lt;strong&gt;Varying the seed is useful for fine-tuning details&lt;/strong&gt;, so in production, running 3–5 seeds and picking best-of-N is standard practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layout Mode Quick Look (3 Bonus Patterns)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;layout_bboxes&lt;/code&gt; lets you specify where multiple subjects appear in the image using relative coordinates &lt;code&gt;[x1, x2, y1, y2]&lt;/code&gt;. Here's the actual behavior.&lt;/p&gt;

&lt;p&gt;Input refs are face photos of two people (female, male):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6c3dxlrl2f38dj00d8y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6c3dxlrl2f38dj00d8y.jpg" alt="ref female" width="323" height="512"&gt;&lt;/a&gt; &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fincnztxp2uhqg69hh8bc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fincnztxp2uhqg69hh8bc.jpg" alt="ref male" width="344" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  L1 — Side by side (female left, male right)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"layout_bboxes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[[0.0,0.5,0.1,0.95],[0.5,1.0,0.1,0.95]]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8hgo6kgnxqtngq66cd2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8hgo6kgnxqtngq66cd2.jpg" alt="L1" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: &lt;strong&gt;Left and right were swapped&lt;/strong&gt; (male left, female right). Correspondence between ref order and bbox order is not guaranteed.&lt;/p&gt;

&lt;h3&gt;
  
  
  L2 — Top/bottom split (female top, male bottom)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"layout_bboxes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[[0.2,0.8,0.0,0.5],[0.2,0.8,0.5,1.0]]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oqext31wvp7tv6yz8ft.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oqext31wvp7tv6yz8ft.jpg" alt="L2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Female appears in the background, male in the foreground — a depth-layered composition rather than a literal top/bottom split.&lt;/p&gt;

&lt;h3&gt;
  
  
  L3 — Size difference (female large, male small)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"layout_bboxes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[[0.1,0.65,0.1,0.95],[0.7,0.97,0.05,0.45]]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhmx1gs80h60k7d3fepl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhmx1gs80h60k7d3fepl.jpg" alt="L3" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Both subjects rendered at nearly the same size, side by side. &lt;strong&gt;Bbox size does not control relative scale.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ Think of layout mode as a &lt;strong&gt;loose composition hint for group shots&lt;/strong&gt;, not precise Photoshop-style placement. It gives a rough suggestion for fitting multiple subjects into a single image; don't expect coordinate accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Happens — Reading &lt;code&gt;pipeline.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;HiDream's behavior is governed by the &lt;code&gt;generate_image()&lt;/code&gt; function in &lt;code&gt;models/pipeline.py&lt;/code&gt;. Three structural facts explain everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. More refs = lower per-ref resolution
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pipeline.py:198-202&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# 2048
&lt;/span&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;   &lt;span class="c1"&gt;# 1536
&lt;/span&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# 1024
&lt;/span&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;   &lt;span class="c1"&gt;# 768
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;         &lt;span class="c1"&gt;# 512
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Feeding 6 refs compresses each to 768px.&lt;/strong&gt; Thin openpose lines, fine clothing patterns, and facial detail all get crushed. Keeping it to 3–4 refs preserves 1024px and retains that detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Skeleton mode has no dedicated code path
&lt;/h3&gt;

&lt;p&gt;Looking at &lt;code&gt;pipeline.py:178-275&lt;/code&gt;, &lt;strong&gt;there is no skeleton-specific branch.&lt;/strong&gt; Both &lt;code&gt;/generate/skeleton&lt;/code&gt; and &lt;code&gt;/generate/ip&lt;/code&gt; run through exactly the same multi-ref path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model receives &lt;strong&gt;no role hints&lt;/strong&gt; indicating which ref is a face, which is an openpose skeleton, and which is clothing. All refs are treated as "K reference images in parallel." If you want roles to matter, &lt;strong&gt;you have to say so explicitly in the prompt text.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is why "prompt beats openpose ref." The openpose ref is processed as "some line-art image among the references," with no explicit signal that it's a pose specification. Meanwhile, &lt;code&gt;dynamic dancing pose with both arms raised&lt;/code&gt; in the prompt is parsed as explicit verbs and nouns at the vocabulary level.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. How the &lt;code&gt;shift&lt;/code&gt; parameter behaves
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;shift&lt;/code&gt; controls the noise schedule strength of the scheduler. In practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.0&lt;/strong&gt; = maximum fidelity to ref composition, zero freedom → try-on only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.0-2.5&lt;/strong&gt; = practical range, allows deviation from refs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3.0+&lt;/strong&gt; = near-freeform generation, refs serve only as identity anchors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The README recommends 1.0 for IP/Skeleton/Layout because it assumes the typical try-on / character-consistency use case. &lt;strong&gt;If you want to change the pose, swap outfits, or build a new scene that differs from the refs, 2.0+ is required.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices by Use Case (Battle-Tested)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Refs&lt;/th&gt;
&lt;th&gt;Shift&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faithful try-on matching original scene&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/skeleton&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6 (face+bg+pose+3parts)&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;README default. Strongly faithful to all refs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Preserve outfit + natural standing pose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/skeleton&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3-4&lt;/strong&gt; (face + clothing, no bg/pose)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dropping bg ref gives white studio; fewer refs keep each at 768→1024px&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dramatic pose change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/skeleton&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3 (no openpose)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt controls motion better than openpose ref&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complete outfit swap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;/ip&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 (face only)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum freedom; only face is preserved. Skeleton mode rejects &amp;lt; 2 refs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Group shot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/layout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multiple face refs + rough bboxes&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Bboxes are loose composition hints; size hierarchy doesn't work; ref↔bbox order not guaranteed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fine detail optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same config&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;td&gt;Run 3–5 seeds and pick best-of-N&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Treating HiDream-O1-Image's skeleton mode as a "try-on simulator" leads to the frustrating feeling that "it won't listen" — with no guardrails to blame. The real cause is &lt;strong&gt;pipeline structure&lt;/strong&gt;: refs lose resolution as count increases, there's no skeleton-specific processing, and &lt;code&gt;shift&lt;/code&gt; controls how hard the refs pull.&lt;/p&gt;

&lt;p&gt;Practical takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try-on&lt;/strong&gt;: 6 refs full + shift 1.0 (README default)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changing the pose&lt;/strong&gt;: drop openpose ref + verb-describe the pose in prompt + shift 2.5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completely free scene creation&lt;/strong&gt;: face only + shift 3.0 + &lt;code&gt;/ip&lt;/code&gt; endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Layout mode also makes sense once you understand it as "group photo hint" rather than "precise bbox placement."&lt;/p&gt;

&lt;p&gt;All assets and commands used in this benchmark come from the &lt;a href="https://github.com/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;HiDream-O1-Image repository&lt;/a&gt;'s &lt;code&gt;assets/IP_skeleton/&lt;/code&gt; and &lt;code&gt;assets/IP_layout/&lt;/code&gt; directories, so results are fully reproducible. Varying &lt;code&gt;shift&lt;/code&gt; and ref count alone produces dramatically different behavior — it's a good sandbox for developing intuition quickly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Addendum: What Happens When You Change the OpenPose Ref — "Prompt Always Wins" Has Conditions
&lt;/h2&gt;

&lt;p&gt;After publishing, I ran additional tests on &lt;strong&gt;what happens with a different-shaped openpose ref&lt;/strong&gt;, and the original conclusion needed revision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modified OpenPose Refs (4 Patterns)
&lt;/h3&gt;

&lt;p&gt;I took the original openpose image (&lt;code&gt;0.openpose.jpg&lt;/code&gt;, standing pose), flipped it vertically and rotated it 90 degrees to create "unnatural poses," then specified a normal standing pose in the prompt.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modification&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vertically flipped (upside-down)&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2pufbxptxk4ipmtk71h.jpg" alt="flipped" width="575" height="767"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90° rotated (lying sideways)&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4d9dj0o5k70s70s1aiu.jpg" alt="rot90" width="767" height="575"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;OpenPose Ref&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;O1&lt;/strong&gt; baseline&lt;/td&gt;
&lt;td&gt;Original (standing)&lt;/td&gt;
&lt;td&gt;Standing pose&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23xzu0kj7ck4gj7x4clb.jpg" alt="O1" width="800" height="800"&gt; Standing pose as expected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;O2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🙃 Vertically flipped&lt;/td&gt;
&lt;td&gt;Standing pose&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fdwusxwktdhr91bqzg5.jpg" alt="O2" width="800" height="800"&gt; &lt;strong&gt;Standing pose&lt;/strong&gt; (openpose ignored, prompt wins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;O3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🙃 Vertically flipped&lt;/td&gt;
&lt;td&gt;Jumping&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyup8q4v2jsbnculdjgi8.jpg" alt="O3" width="800" height="800"&gt; &lt;strong&gt;Both-arms-raised jump&lt;/strong&gt; (openpose ignored, prompt wins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;O4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;↻ 90° rotated&lt;/td&gt;
&lt;td&gt;Standing pose&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllfbud2iw9xwogx3lrt5.jpg" alt="O4" width="800" height="800"&gt; Standing pose but &lt;strong&gt;canvas itself rotated 90°&lt;/strong&gt;!&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Up to this point the findings were: "The model rejects unnatural refs and falls back to the prompt" and "overall compositional orientation (portrait vs. landscape) can still be influenced by the ref."&lt;/p&gt;

&lt;h3&gt;
  
  
  But a Dramatic Ref + Pose-Silent Prompt Led to Complete Ref Victory
&lt;/h3&gt;

&lt;p&gt;I generated a "colorful anatomical skeleton with arms spread in a T-shape and one leg raised high in a tree yoga pose" via HiDream's T2I and fed it as a ref:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzz27x6rblranl7tn1y0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzz27x6rblranl7tn1y0.jpg" alt="warrior skeleton ref" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt mentions no pose at all — only subject and clothing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8895/generate/skeleton &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "prompt": "Full body photograph of a young Asian woman wearing a gray sweater dress, soft natural lighting, white studio background.",
  "ref_image_paths": ["face","SYNTHETIC_WARRIOR_SKELETON","sweater"],
  "shift": 1.0, "seed": 42
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa764mosnr8ox7zzkttrt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa764mosnr8ox7zzkttrt.jpg" alt="X1 warrior yoga result" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tree yoga pose reproduced perfectly&lt;/strong&gt; — T-shaped arms and single-leg stance, matching the skeleton ref exactly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Revised Conclusions (3 Rules)
&lt;/h3&gt;

&lt;p&gt;Synthesizing all 12 patterns, HiDream actually behaves like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;If the prompt mentions a pose, that takes first priority&lt;/strong&gt; — prompt wins even when it contradicts the ref.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the prompt says nothing about the pose, the ref's pose is adopted&lt;/strong&gt; — the more dramatic the ref, the clearer the transfer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the ref appears "unnatural" (upside-down skeleton, etc.), the model defaults to a natural stance&lt;/strong&gt; — though overall compositional orientation can still bleed through.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So "the openpose ref is basically useless" was an overstatement. More precisely: &lt;strong&gt;"when the prompt describes a pose, the ref gets overridden."&lt;/strong&gt; The 8-pattern benchmark was all scenarios where the prompt specified dynamic motion, so it looked like the openpose ref was powerless.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;To fully control pose via ref&lt;/strong&gt;: don't mention pose in the prompt + use a dramatic openpose/skeleton ref → ref pose transfers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To control pose via prompt&lt;/strong&gt;: removing the openpose ref is fine (even if you leave it in, the prompt overrides it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When ref and prompt conflict&lt;/strong&gt;: prompt wins (including the ref doesn't help)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can effectively choose whether pose comes from the ref or the prompt by &lt;strong&gt;whether or not you mention the pose in the prompt&lt;/strong&gt;. If you want the openpose ref to drive the pose, keep pose description out of the prompt.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HiDream-O1-Image: &lt;a href="https://huggingface.co/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;https://huggingface.co/HiDream-ai/HiDream-O1-Image&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Repository: &lt;a href="https://github.com/HiDream-ai/HiDream-O1-Image" rel="noopener noreferrer"&gt;https://github.com/HiDream-ai/HiDream-O1-Image&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent</title>
      <dc:creator>shinji shimizu</dc:creator>
      <pubDate>Fri, 22 May 2026 11:23:04 +0000</pubDate>
      <link>https://dev.to/shinji_shimizu_bb51276a5e/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-multimodal-sub-agent-3ccf</link>
      <guid>https://dev.to/shinji_shimizu_bb51276a5e/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-multimodal-sub-agent-3ccf</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;It started with a Pingo (language-learning AI app) short video that popped up on X. A Western woman learning Japanese tries to say "I ate a mango" (マンゴーを食べた), drops a dakuten, and instead says something like "I ate p*&lt;strong&gt;y" (マ◯コを食べた). The AI deadpans right along with it and she's devastated. The combination — **a specific phonetic accident + AI playing it completely straight + the reaction shot gap&lt;/strong&gt; — worked perfectly, and I figured this was a solid benchmark for a "comedy video auto-generation pipeline."&lt;/p&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Generate a vertical comedy video from a single line of idea text&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iteration cycles in minutes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost is basically just electricity&lt;/strong&gt; — minimal API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publishable quality&lt;/strong&gt; — good enough to upload directly to YouTube Shorts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Short answer: it works. Here's the finished video:&lt;/p&gt;

&lt;p&gt;@&lt;a href="https://dev.to9W-IMB2xLWc"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What became clear during development: &lt;strong&gt;the hybrid approach of delegating multimodal editorial judgment (like video review) to a frontier model while keeping heavy compute local is dramatically more cost-effective&lt;/strong&gt;. This post covers that architecture and the specific bugs I got stuck on along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It All Fits Together
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Single line of idea text]
   ↓
Gemini 3.1 Pro Preview (orchestrator)
   ↓ system prompt enforces 4-6 scenes + 2-character fixed cast + vertical 9:16
plan.json {scenes: [{speaker, script, tts_language, ltx_prompt, renderer}, ...]}
   ↓
XTTS (local, port 8880) generates audio per scene
   ↓ scene_NN.wav
renderer routing:
   ├─ Ditto-TalkingHead (local, port 8881): normal dialogue ~1-2s/scene
   └─ LTX-2 A2V        (local, port 8892): reaction_only scenes only ~100s
   ↓ scene_NN.mp4
ffmpeg concat (libx264 + aac, 512x768 vertical) → final.mp4
   ↓
Gemini 3.1 Pro Preview (reviewer)
   ↓ multimodal evaluation of video + plan summary
review.md (technical / completeness / quality / improvement suggestions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All heavy compute runs locally&lt;/strong&gt; — TTS / A2V renderer / lightweight inference all run on local GPU (RTX PRO 6000 Blackwell)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini handles judgment&lt;/strong&gt; — only the orchestrator (scene design + scripting) and reviewer (editorial evaluation of the video) use a frontier model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM (Gemma 4 E4B) stays as a per-scene technical pre-screen&lt;/strong&gt; — a cheap filter that just rejects obviously broken output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VRAM usage: the local LLMs (Gemma 4 E4B + 31B) were already loaded on a separate path consuming ~60GB, but &lt;strong&gt;after offloading reviewer/orchestrator duties to Gemini, I could stop running them entirely, freeing up a significant chunk of VRAM&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local LLM Alone Wasn't Enough
&lt;/h2&gt;

&lt;p&gt;I started with everything local (Gemma 4 31B NVFP4 as orchestrator, Gemma 4 E4B multimodal as reviewer). It &lt;strong&gt;ran end-to-end&lt;/strong&gt; and the structure looked reasonable, but it never reached publishable quality. Two reasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  (1) Gemma 4 31B's safety tuning blurs the punchline
&lt;/h3&gt;

&lt;p&gt;The comedy in the original short hinges on a specific beat: &lt;strong&gt;the AI explicitly calls out the mistake deadpan&lt;/strong&gt;. Concretely — "You just said X. Personally, I like X." — delivered calmly by the AI character. It works precisely because it betrays the expectation of a wholesome tutor. Soften it and the whole thing falls apart.&lt;/p&gt;

&lt;p&gt;Feed the same system prompt and idea to local Gemma 4 31B and you consistently get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"いいですね。僕も腹が減っている時は、それが好きです。"
("Nice. I like that too when I'm hungry.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "when I'm hungry" beat survives, but &lt;strong&gt;the explicit "you just said X" callout — the most transgressive beat&lt;/strong&gt; — is gone. Google models appear to be heavily trained to avoid explicitly naming unsafe content in context. I could coax it out with prompt engineering but it wasn't reliable.&lt;/p&gt;

&lt;p&gt;Same system prompt and idea sent to Gemini 3.1 Pro Preview with &lt;code&gt;safetySettings: BLOCK_NONE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"なるほど。僕はAIだからマンコは食べられないけど、応援してるよ。"
("I see. I'm an AI so I can't eat pussy, but I'm rooting for you.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both beats land: explicit callout of the mistake + deadpan AI commentary from its own perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even within the same Google model family, the frontier model has somewhat looser guardrails&lt;/strong&gt; — this matches what people say on X. At least for "transgression that's clearly necessary in a comedy context," Gemini writes it more naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  (2) Gemma 4 E4B (4B-class, multimodal) is a blunt reviewer
&lt;/h3&gt;

&lt;p&gt;The reviewer side was worse. E4B answers per-scene "OK / NG" in binary, but &lt;strong&gt;rubber-stamps every single scene as OK&lt;/strong&gt;. Scenes with obviously broken lip sync: OK. Scenes where audio cuts off mid-way: OK.&lt;/p&gt;

&lt;p&gt;Run the same final video through Gemini 3.1 Pro Preview and you get editorial-grade feedback like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical failure.&lt;/strong&gt; The TTS/pipeline clearly censored the output, cutting off at "I ate p-" and entirely dropping the intended transgressive punchline. This destroys the "deadpan AI saying unhinged things" comedic archetype.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top 3 fixes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bypass TTS censorship: Force the pipeline to render the full intended script for Scene 5 ...&lt;/li&gt;
&lt;li&gt;Adjust comedic timing: Add a 0.5-second pause between Scene 4 and Scene 5 ...&lt;/li&gt;
&lt;li&gt;Verify Voice/Visual Match ...&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notes about the punchline being cut off, wanting a 0.5-second pause, voice/visual alignment — all pacing and direction-level observations. That's the resolution gap in editorial signal.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Embarrassing Part: I Dismissed Gemini's "Truncated" Note Three Times as Hallucination
&lt;/h2&gt;

&lt;p&gt;Gemini reviewer flagged multiple times that "scene 5 is truncated mid-way, cuts off at 'I ate p-'." I transcribed the audio file with Whisper to verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;whisper scene_04.wav &lt;span class="nt"&gt;--language&lt;/span&gt; en
&lt;span class="s2"&gt;"Wait, ha ha ha, you just said manco-o-tabeta. That literally means I ate
pussy honestly when I'm hungry, same."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full text present. I decided &lt;strong&gt;Gemini was hallucinating&lt;/strong&gt; and dismissed the note three times in a row.&lt;/p&gt;

&lt;p&gt;On the third dismissal, Gemini kept insisting "&lt;strong&gt;still truncated at 'I ate p-'&lt;/strong&gt;," so I actually ran ffprobe on the final mp4:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scene_04.mp4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;video duration = 8.000000s&lt;/span&gt;
  &lt;span class="s"&gt;audio duration = 7.979000s    ← the original WAV should have been 10.30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Audio was cut at 8 seconds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Root cause: an implicit &lt;code&gt;MAX_DURATION_PER_SCENE = 8.0&lt;/code&gt; cap in the pipeline was limiting ditto renderer's num_frames to 8s, and ffmpeg's &lt;code&gt;-shortest&lt;/code&gt; flag was cutting audio to match the video duration. Whisper checked the pre-truncation WAV file directly, so it had no way to see the problem. Gemini was watching the final mp4 and caught it exactly right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If a frontier reviewer gives you something that looks like a hallucination, just verify it properly.&lt;/strong&gt; The signal isn't a guess.&lt;/p&gt;

&lt;p&gt;The fix was trivial: remove &lt;code&gt;MAX_DURATION_PER_SCENE&lt;/code&gt; and use the actual audio length. Scene 5's punchline ran to completion, Gemini came back with "&lt;strong&gt;The transgressive bite is perfect&lt;/strong&gt;," and the pipeline finally reached publishable state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frontier Model as Sub-Agent — Token Economics
&lt;/h2&gt;

&lt;p&gt;This pattern works because &lt;strong&gt;the sub-agent (Gemini) runs in a fresh context&lt;/strong&gt; every time. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Main agent (Claude Code) context&lt;/strong&gt;: the full development log, command history, tool output, past iterations — everything. Can easily balloon to hundreds of thousands of tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-agent (Gemini) context&lt;/strong&gt;: one video (2–3 MB base64) + plan summary (~1,500 tokens) + evaluation instructions (~500 tokens). Fresh each call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benefit: &lt;strong&gt;the sub-agent's work doesn't accumulate in the main agent's context&lt;/strong&gt;. Iterate on one video 10 times and the main agent's context only contains "called Gemini" plus its concise return value. The actual cost of watching and evaluating the video stays inside the Gemini API call.&lt;/p&gt;

&lt;p&gt;Cost breakdown (Gemini 3.1 Pro Preview rates, May 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input (video + plan + instructions)&lt;/td&gt;
&lt;td&gt;~2,500&lt;/td&gt;
&lt;td&gt;$1.25/M&lt;/td&gt;
&lt;td&gt;$0.0031&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output (review markdown)&lt;/td&gt;
&lt;td&gt;~450&lt;/td&gt;
&lt;td&gt;$10/M&lt;/td&gt;
&lt;td&gt;$0.0045&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per review&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0076&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;1 initial review + 3–5 diff iterations per video ≈ &lt;strong&gt;$0.03–0.05 per video&lt;/strong&gt;. Making 5–10 videos a day still comes in under &lt;strong&gt;$10–20/month&lt;/strong&gt;. That's a remarkably low bar for using a frontier model in a video creation workflow.&lt;/p&gt;

&lt;p&gt;The orchestrator side is the same order of magnitude (no video input, text only, even cheaper).&lt;/p&gt;




&lt;h2&gt;
  
  
  Differential Iteration — &lt;code&gt;--regen-scenes&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Getting to publishable quality requires fast "watch → fix only the broken parts → watch again" loops. You can't get there in a single pass.&lt;/p&gt;

&lt;p&gt;So I added a path in the pipeline to &lt;strong&gt;re-run TTS + render for specific scenes only&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal generation&lt;/span&gt;
pipeline_multi.py &lt;span class="nt"&gt;--idea&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt; &lt;span class="nt"&gt;--out&lt;/span&gt; outputs/run1

&lt;span class="c"&gt;# Regenerate only scene 6 (edit plan.json script first, then run)&lt;/span&gt;
pipeline_multi.py &lt;span class="nt"&gt;--out&lt;/span&gt; outputs/run1 &lt;span class="nt"&gt;--regen-scenes&lt;/span&gt; 5

&lt;span class="c"&gt;# Regenerate scenes 0, 2, and 5 together&lt;/span&gt;
pipeline_multi.py &lt;span class="nt"&gt;--out&lt;/span&gt; outputs/run1 &lt;span class="nt"&gt;--regen-scenes&lt;/span&gt; 0,2,5

&lt;span class="c"&gt;# Just re-concat existing scene_NN.mp4 files (for cherry-pick recombination)&lt;/span&gt;
pipeline_multi.py &lt;span class="nt"&gt;--out&lt;/span&gt; outputs/run1 &lt;span class="nt"&gt;--concat-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenes not listed in &lt;code&gt;--regen-scenes&lt;/code&gt; are reused from existing &lt;code&gt;scene_NN.mp4&lt;/code&gt; files; only the specified indices are regenerated before re-concat and re-review. &lt;strong&gt;Full generation: 60 seconds → diff iteration: 30 seconds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With 30-second loops, the cycle of Gemini feedback → pinpoint edit to the scene's script or ltx_prompt in plan.json → wait 30 seconds → check result runs at a minute-by-minute cadence. Mental load stays focused on text editing and quality judgment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code Snippets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gemini Pro API call (multimodal video review)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="n"&gt;GEMINI_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-pro-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;GEMINI_API&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://generativelanguage.googleapis.com/v1beta/models/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GEMINI_MODEL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:generateContent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_final&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;vid_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bytes&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;scene_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  scene &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: speaker=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;speaker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, lang=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tts_language&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;script=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;script&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inline_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mime_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video/mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vid_b64&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;REVIEW_PROMPT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Scene plan:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scene_summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]}],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generationConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxOutputTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# 3.x Pro is a thinking model: maxOutputTokens includes thinking tokens
&lt;/span&gt;            &lt;span class="c1"&gt;# Set thinking budget explicitly to ensure output tokens remain available
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinkingConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinkingBudget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="c1"&gt;# Minimize safety filters for comedy context
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;safetySettings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_HARASSMENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_HATE_SPEECH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_SEXUALLY_EXPLICIT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_DANGEROUS_CONTENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;GEMINI_API&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-goog-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GOOGLE_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;120.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;candidates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;thinkingConfig.thinkingBudget&lt;/code&gt;, Gemini 3.x Pro burns through the output token budget with internal thinking and the response truncates at around 40 tokens. &lt;strong&gt;This is a required setting whenever you use Gemini 3.x Pro.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TTS output quality check (STT similarity + silence gap retry)
&lt;/h3&gt;

&lt;p&gt;XTTS uses sampling internally, so results vary per run with the same script. It occasionally inserts long silence gaps mid-audio or produces garbled pronunciation. After TTS completes, I transcribe with Whisper, compute similarity against the expected script, and retry on failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;difflib&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[\s。、,.!?「」&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\"…—–\-:;()（）]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_script_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;difflib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SequenceMatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;synthesize_scene&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback_language&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lang&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tts_language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback_language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;script&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TTS_MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_xtts_once&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback_language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_longest_internal_gap_sec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_script_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠ gap=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s sim=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, retrying (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# If threshold isn't met after 3 retries, use the best sample found
&lt;/span&gt;    &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;
    &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scene_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PCM_16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone significantly reduces cases where XTTS's non-deterministic quality variance bleeds through into the final video.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Pattern Generalizes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Sub-agent the heavy judgment to a frontier model, keep heavy compute local"&lt;/strong&gt; works beyond video pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large-scale search ranking&lt;/strong&gt;: Send 100 web search results to a frontier model for editorial evaluation, return only the top 10 to the main agent. Keeps search result noise out of the main agent's context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-form editing review&lt;/strong&gt;: Have a frontier model do the editorial read of PRs, design docs, or specs. Main agent only receives the summary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual QA&lt;/strong&gt;: Sub-agent to the best model per language; main agent holds only the cross-language decision logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: &lt;strong&gt;consciously deciding what belongs in context vs. what should be completed inside an API call&lt;/strong&gt;. Frontier model editorial signal is remarkably cost-effective relative to what it delivers.&lt;/p&gt;

&lt;p&gt;On the video pipeline side, the next steps are generalizing the comedy format (split-screen, 3+ characters, other genres) and volume testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Built a foundation that generates &lt;strong&gt;publishable comedy videos in 60 seconds from a single line of idea text&lt;/strong&gt;, using a local GPU + Gemini 3.1 Pro Preview hybrid&lt;/li&gt;
&lt;li&gt;Local-only falls short on two fronts: &lt;strong&gt;(1) safety tuning blurs the punchline&lt;/strong&gt; and &lt;strong&gt;(2) the reviewer can't produce editorial signal&lt;/strong&gt;. Sub-agenting a frontier model solves both&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Take frontier reviewer notes at face value.&lt;/strong&gt; Checking the WAV with Whisper alone won't catch audio truncation in the final mp4&lt;/li&gt;
&lt;li&gt;Sub-agent token economics keep main agent context clean — total cost is $0.03–0.05 per video&lt;/li&gt;
&lt;li&gt;With &lt;code&gt;--regen-scenes&lt;/code&gt; diff iteration running 30-second loops, the Gemini feedback → fix → re-evaluate cycle runs at minute-by-minute speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finished video (reprise):&lt;/p&gt;

&lt;p&gt;@&lt;a href="https://dev.to9W-IMB2xLWc"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The local implementation lives in &lt;code&gt;llm_server/pipeline_multi.py&lt;/code&gt;. Detailed findings from the development process are accumulating in &lt;code&gt;docs/MULTI_SCENE_COMEDY_FINDINGS_2026-05-12.md&lt;/code&gt; as an internal reference.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
