DEV Community: shinji shimizu

Implementing Claude Code's Memory Model as a Dreaming Layer on 58 Articles

shinji shimizu — Wed, 10 Jun 2026 20:03:53 +0000

I built a pipeline in a single session that consolidates the 58 tech-blog articles of my service Kotonia (ja/en/zh) into a semantic index, then uses that index to detect duplicates for new article mining. Raw articles → semantic index → TF-IDF dedup → chunked draft generation — full path running on local Gemma 4 26B driven by Codex CLI. Design and implementation notes follow.

The motivation, and the framing of how a solo developer's accumulated assets compound, is covered in the companion piece: The Day a Solo Developer's Accumulated Assets Finally Started to Compound

This piece keeps the technical notes.

1. The Problem — When Title-Only Dedup Broke

Mining v1 produced a draft and I (the user) noticed "this overlaps with an existing article." The overlap target was voice-first-local-llm (importance=9 flagship).

New draft thesis: "tokens per chunk is a hidden voice-chat latency driver"
Existing article §3.3: "★ Streaming granularity — the structural difference that decides voice experience"

Same numbers (Local Gemma 1.0 tok/chunk, Haiku 10-16, Gemini 8-24). A perfect duplicate.

The mining agent had called art-done-list (title + description) for the dedup check. But the existing article's title is "Cutting short-form LLM latency from 600ms to 22ms," with TTFB as the headline sales pitch; §3.3 streaming granularity is buried in an H2 subsection. At title level, nothing overlapped, so the check came back clean.

That's the starting point for this article.

2. The Design — Three Layers: episodic ↔ semantic ↔ procedural

Breaking down why Claude Code's memory system works:

Entries are small (1-3KB, one topic each) → subtopics don't get buried
Hooks are retrieval-tuned and curated → search terms re-appear in the hook
A smart model writes hooks semi-autonomously → past me distills for future me

Articles have the opposite shape: each runs 5-15KB, important subtopics are buried in subsection bodies, and descriptions are SEO summaries rather than retrieval-tuned hooks. Loaded raw, they are too heavy for an agent.

I bridged them with an intermediate layer I call the Dreaming layer. The name borrows the biological metaphor of memory consolidation during sleep, from hippocampus to cortex.

episodic (raw articles + memory files)
    ↓ Dreaming agent (periodic digestion)
semantic (concepts_covered_ja[] / importance / data_points / sections)
    ↓ agent reverse-lookup (art-concepts-find / TF-IDF cosine)
procedural (mining / drafting / publishing)

A semantic entry for an article looks like:

{
  "slug": "voice-first-local-llm",
  "locale": "ja",
  "thesis_ja": "Ditching API, building voice-first with self-hosted local 26B",
  "importance": {
    "score": 9,
    "factors": {
      "pv_count_30d": 6,
      "avg_scroll": 67.0,
      "avg_dwell_sec": 170,
      "has_bench_data": true,
      "novelty_high": true
    }
  },
  "concepts_covered_ja": [
    "TTFB (time-to-first-byte): local vs API",
    "Streaming granularity (tokens per chunk)",
    "Gemma 4 26B model selection rationale",
    "Ditto + LLM co-residency GPU design"
  ],
  "data_points": [
    {"name": "TTFB Local", "value": "17-25ms"},
    {"name": "Streaming granularity Local", "value": "1.0 tok/chunk"}
  ],
  "sections": [
    {"id": "3.3", "title": "Streaming granularity — the structural difference that decides voice experience"}
  ]
}

The key point: concepts_covered_ja[] must be normalized to Japanese canonical names. Translated EN/ZH articles use the same JP concept strings. That single normalization becomes the dedup primitive downstream.
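Why the normalization matters can be shown with a small sketch. Once every locale shares the same concept strings, reverse lookup is a plain dictionary operation. The helper names `build_concept_index` and `find_concepts` are hypothetical (this is not the actual art-concepts-find source), and the lookup here is a simple case-insensitive substring search.

```python
from collections import defaultdict

def build_concept_index(entries):
    """Map each canonical concept string to the set of slugs that cover it."""
    index = defaultdict(set)
    for entry in entries:
        for concept in entry["concepts_covered_ja"]:
            index[concept].add(entry["slug"])
    return index

def find_concepts(index, pattern):
    """Return {concept: sorted slugs} for concepts containing the pattern."""
    needle = pattern.lower()
    return {c: sorted(slugs) for c, slugs in index.items() if needle in c.lower()}

# Two locale variants of the same article share identical concept strings,
# so they collapse onto one index key instead of producing near-duplicates.
entries = [
    {"slug": "voice-first-local-llm", "locale": "ja",
     "concepts_covered_ja": ["Streaming granularity (tokens per chunk)",
                             "TTFB (time-to-first-byte): local vs API"]},
    {"slug": "voice-first-local-llm", "locale": "en",
     "concepts_covered_ja": ["Streaming granularity (tokens per chunk)",
                             "TTFB (time-to-first-byte): local vs API"]},
]

index = build_concept_index(entries)
hits = find_concepts(index, "streaming granularity")
print(hits)
```

If each locale kept its own translated concept names, the same lookup would need fuzzy matching across three languages.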

3. Tools — Thin CLIs the Agent Calls

Codex CLI drives Gemma 4 26B locally. Tool calling via --enable-auto-tool-choice --tool-call-parser gemma4 gives an OpenAI-compatible surface. Each tool is ~50-100 lines of Python (stdlib only), art- prefix:

tool	role
`art-articles-list --needs-dreaming`	DB ∪ FS articles + dreaming state
`art-pv-count --slug X`	analytics_events → PV / scroll / dwell
`art-source-pull <slug> [--section N]`	pull just one H2/H3 section of an article
`art-dream-write`	upsert a semantic entry into articles_index.jsonl
`art-concepts-find <pattern>`	concept → article reverse-lookup (the mining dedup primitive)
`art-ideas-check`	evaluate a candidate idea via TF-IDF (the core of this article)
`art-ideas-add`	push an idea to the pool (calls art-ideas-check internally)
`art-draft-append`	append a chunk of draft body to a buffer
`art-draft-commit`	finalize buffer → `articles/_drafts/<slug>.md`

The Dreaming agent semantically encodes one article at a time using these. Importance scoring uses this rubric:

+2: PV >= 100 (sigmoid log-scale)
+1: avg_scroll >= 0.7 AND avg_dwell_sec >= 60
+2: bench numbers / failure root cause / named decision
+2: novel concept not yet in index
+1: evergreen value (not time-sensitive)
-2: redundant with an already-indexed flagship

PV comes from a homegrown analytics_events table (cookie-less first-party tracker). The fact that the article platform and analytics co-reside in one DB you can hit directly is a solo-dev win.

4. TF-IDF Dedup — Substituting Tool Structure for Agent Self-Discipline

At mining v1 the prompt instructed the agent to call art-concepts-find for dedup. The agent still let three duplicates slip through (details: Don't Trust an Agent's Self-Discipline).

The fix: embed a dedup gate directly inside art-ideas-add. The guts of evaluate_idea():

def evaluate_idea(title, angle, sources, ...):
    articles, ideas = load_corpus()
    # infer the candidate's concepts from the canonical vocab
    pseudo = {"concepts": _infer_concepts(title, angle, sources, articles)}

    # IDF (rare concepts weighted more)
    idf = build_idf(articles + ideas)
    new_vec = vectorize(pseudo["concepts"], idf)

    conflicts = []
    for a in articles:
        sim = cosine(new_vec, vectorize(a["concepts"], idf))
        if a["importance_score"] >= 7 and sim >= 0.25:
            conflicts.append({"kind": "flagship_concept", ...})
    for i in ideas:
        sim = cosine(new_vec, vectorize(i["concepts"], idf))
        if sim >= 0.35:
            conflicts.append({"kind": "pool_dup", ...})

    return {"allow": not conflicts, "conflicts": conflicts}
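The excerpt above relies on `build_idf`, `vectorize`, and `cosine`, which are not shown. A minimal sketch of what they could look like follows. The smoothed IDF formula, the default weight for unseen concepts, and the placeholder concept names are all assumptions, not the article's actual implementation.

```python
import math
from collections import Counter

def build_idf(docs):
    """Smoothed IDF over concept lists: rarer concepts get larger weights."""
    n = len(docs)
    df = Counter(c for d in docs for c in set(d["concepts"]))
    return {c: math.log((1 + n) / (1 + k)) + 1.0 for c, k in df.items()}

def vectorize(concepts, idf):
    """Sparse vector {concept: weight}; concepts unseen in the corpus get 1.0."""
    return {c: idf.get(c, 1.0) for c in set(concepts)}

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Placeholder corpus: "model-selection" is common, the other concepts are rare.
corpus = [
    {"concepts": ["ttfb", "streaming-granularity", "model-selection", "gpu-co-residency"]},
    {"concepts": ["model-selection", "prompt-cache"]},
    {"concepts": ["model-selection", "lora-training"]},
]
idf = build_idf(corpus)
candidate = vectorize(["streaming-granularity", "model-selection"], idf)
sims = [cosine(candidate, vectorize(d["concepts"], idf)) for d in corpus]
print([round(s, 3) for s in sims])
```

The first document shares a rare concept with the candidate and scores well above the other two, which overlap only on the common one. That is the effect the IDF weighting is there to produce.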

Three traps along the way in _infer_concepts():

Trap 1: substring-match false positives

The ASCII term "check" matches inside "checkout"; "PRO" inside "prod_". The Stripe idea was falsely matched into "品質チェック (quality check/retry)" or "Blackwell Max-Q (RTX PRO 6000)" and rejected.

Fix: ASCII terms require word boundary; JP terms can stay substring.

def _term_matches(term: str, text: str) -> bool:
    if _ASCII_RE.match(term):
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(term.lower()) + r"(?![A-Za-z0-9_])"
        return re.search(pattern, text) is not None
    return term.lower() in text  # JP substring is fine
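A self-contained variant of the matcher makes the two false positives easy to reproduce. This sketch assumes `_ASCII_RE` means "printable ASCII only" (its definition is not shown above) and lowercases the text inside the function. The sample strings are made up for illustration.

```python
import re

# Assumption: an "ASCII term" consists only of printable ASCII characters.
_ASCII_RE = re.compile(r"^[\x20-\x7e]+$")

def term_matches(term: str, text: str) -> bool:
    """ASCII terms need a word boundary on both sides; other terms match as substrings."""
    text = text.lower()
    if _ASCII_RE.match(term):
        pattern = (r"(?<![A-Za-z0-9_])" + re.escape(term.lower())
                   + r"(?![A-Za-z0-9_])")
        return re.search(pattern, text) is not None
    return term.lower() in text

print(term_matches("check", "Stripe checkout flow"))    # inside a longer word
print(term_matches("check", "quality check and retry")) # standalone word
print(term_matches("PRO", "prod_abc123"))               # prefix of an identifier
print(term_matches("PRO", "RTX PRO 6000"))              # standalone token
print(term_matches("チェック", "品質チェックの再試行"))      # JP substring
```

Explicit lookarounds are used instead of `\b` so that an underscore counts as part of the word, which is what keeps "PRO" from matching inside "prod_".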

Trap 2: generic JP noun noise

"モデル" "システム" "アーキテクチャ" "サービス" appear in many concept names; they get picked up from arbitrary idea titles. Registered ~30 generic words in _NOISE_TERMS.

Trap 3: threshold tuning

I started with flagship sim >= 0.30, but two binary vectors of 4 concepts each that share 1 concept have cosine 1/(√4·√4) = 0.25, so an unweighted single-concept overlap can never reach 0.30. Even with IDF weighting, 0.27-0.30 was the borderline. I dropped the threshold to 0.25 and instead tightened the precision of the substring matcher (the false-positive engine).
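The 0.25 figure is easy to check by hand. For 0/1 vectors, cosine reduces to the overlap count divided by the geometric mean of the two set sizes. The concept names below are placeholders.

```python
import math

def binary_cosine(a: set, b: set) -> float:
    """Cosine between two 0/1 concept vectors: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

flagship = {"c1", "c2", "c3", "c4"}
one_shared = {"c1", "x2", "x3", "x4"}
two_shared = {"c1", "c2", "x3", "x4"}

sim_one = binary_cosine(flagship, one_shared)
sim_two = binary_cosine(flagship, two_shared)
print(sim_one, sim_two)
```

With 4 concepts per side, the unweighted similarity moves in steps of 0.25, so a 0.30 threshold effectively demands two shared concepts.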

Regression test: 4/4 across the known 4 cases (OpenWeight NSFW / streaming-granularity / CodeFormer / Stripe).

5. Small-Model Specific Traps — Codex CLI + 26B Uncensored

Driving local 26B (Gemma 4 26B A4B Uncensored MAX) through Codex CLI, I observed 4 failure modes and their fixes:

Trap 4: descriptive prompt → "I will begin by surveying..." then exit

The first mining run had the agent summarize "what I'll do next" and exit with zero tool calls. Fix:

**Critical: do not narrate, plan, or describe what you will do. Just call tools.**
The first action **must** be `shell({"command": "art-..."})` — start there.

Make the prompt imperative and spell out the first action, and the agent starts moving.

Trap 5: huge tool output triggers a generation loop

art-commits-recent --since "60 days ago" --include-files returned ~1300 lines of JSON including bodies; the agent then emitted ~25K tokens of output continuously, never stopping. Fix: art-commits-recent defaults to subject-only; body via --include-body opt-in.

Trap 6: 5KB+ heredoc in tool_call.arguments JSON breaks the escape

Sending art-draft-save <slug> <<'EOF' ... 5KB body ... EOF as a single shell tool_call reliably breaks 26B's string escaping inside the arguments JSON (Unterminated string at column 5083).

Fix: split into chunked append + commit. ~200-800 chars per chunk, 4-8 appends, final commit:

art-draft-append my-slug <<'KOTONIA_EOF'
---
title: "..."
---
KOTONIA_EOF

art-draft-append my-slug <<'KOTONIA_EOF'
## 1. First section
...
KOTONIA_EOF

# ...repeat per section...

art-draft-commit my-slug

Each tool_call's arguments JSON stays small, and the escape breakage vanishes.
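The same splitting can be done mechanically when a script, rather than the agent, has to feed a long draft through the append tool. This is a sketch under one assumption: chunks are concatenated verbatim on the receiving side. The function name `chunk_draft` and the sample draft are hypothetical.

```python
def chunk_draft(body: str, max_chars: int = 800) -> list[str]:
    """Split a draft into pieces of at most max_chars, breaking at line ends when possible."""
    chunks, current = [], ""
    for line in body.splitlines(keepends=True):
        # A single line longer than the budget is hard-split.
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)]
        for piece in pieces:
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks

draft = '---\ntitle: "..."\n---\n## 1. First section\n' + "word " * 400 + "\n"
chunks = chunk_draft(draft, max_chars=800)
print(len(chunks), [len(c) for c in chunks])
```

Joining the chunks reproduces the draft exactly, so the split changes only the size of each call and not the content.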

Trap 7: Codex exec self-terminates after ~4 articles

There seems to be an implicit constraint where one codex exec invocation finishes with a summary message after ~25K tokens / ~4 articles. Codex's Goals feature (thread_goals.objective) could prevent that, but you can't set it via exec (only the interactive TUI as of v0.133).

Fix: wrap dispatcher.sh in an external loop. Restart codex exec until pending == 0.

max_cycles=30
cycle=0
while (( cycle < max_cycles )); do
  pending=$(art-articles-list --needs-dreaming --count-only)
  if (( pending == 0 )); then break; fi
  run_codex dream
  cycle=$((cycle + 1))
done

That gets 58 articles digested in 2-3 cycles.

6. What Landed

The working pipeline:

58 articles → semantic index, importance bell-shaped (median 5-6), flagship recognition correct (voice-first-local-llm at score 9 across all locales)
70 memory files mined for unexplored concepts, 4 ideas land in the pool as survivors
4 drafts generated, ~3.6-4.6KB each, publish-ready after 10-20 minutes of human polish
TF-IDF dedup gate at the tool layer blocks any agent self-discipline violation

Repo: [github coming soon]

7. Generalization

The structure — raw assets → semantic compression → agent reverse-lookup — generalizes beyond articles:

Test generation: semantically compress existing tests, mine uncovered branches, draft new tests
PR descriptions: semantically compress the codebase delta, dedupe against unrelated PRs, draft a description
Support FAQs: semantically compress past support tickets, surface uncovered topics, draft new FAQs
Personal knowledge base: Scrapbox / Notion accumulation → semantic compression → mechanically discover unexplored concepts

Common design principles:

Raw assets are heavy. Don't load them directly — insert a consolidation layer.
The canonical vocabulary is the semantic-layer primitive. Without normalization, dedup doesn't work.
Enforcement belongs at the tool layer. Agent self-discipline is unstable; bake the rule into the structure.

Knowing this opened up application to other domains in kotonia (persona generation in character chat, TTS prompt accumulation, etc.).

Aside: Development Time

One session (~6h). Dreaming layer design → 5 new tools → Codex prompts → first-time consolidation → TF-IDF dedup → chunked draft → 4 article drafts generated, all in one stretch.

Local 26B as the "runs on electricity only" agent absorbed the grinding labor; the human only had to make judgment calls and steering corrections. Doing this on frontier APIs would have cost $50-100.

Kotonia is a voice-first AI character chat platform. The drafts revived by this pipeline live on the same blog if you're curious.

Fitting LLM Reply Suggestions Into Every Provider's Prompt Cache — Without Structured Output

shinji shimizu — Sun, 31 May 2026 09:51:11 +0000

I wanted to add reply suggestions to a voice roleplay chat — the classic UX where three "you could say this next" chips appear under each AI response. Sounds simple. But when your chat is built around streaming and prompt caching, every obvious approach turns out to be a bad fit.

I ended up going with the unglamorous move of embedding inline markers in the response and stripping them out afterward. The path to that decision was interesting enough to write up.

What I wanted to build: three "you could say this" chips per AI response — no structured output, no stream interruption, no cache invalidation.

Two Hard Constraints

1. The conversation is built around prompt caching

Keeping token costs down in an LLM chat comes down to caching, and every provider does it differently.

Gemini: explicit cache. A cache object is created per session, containing the persona prompt and conversation history. Each turn sends only the diff. When history grows too long, the cache is rebuilt.
DeepSeek / Cerebras (OpenAI-compatible): send system + full history + user every time and ride the server's implicit prefix cache (measurable via prompt_cache_hit_tokens etc.).
Grok (xAI): the x-grok-conv-id header ties requests to the same conversation, keeping them pinned to the cache.

The common thread: the conversation prefix (persona + history) should be reused as much as possible. Anything that disturbs that prefix hurts both cost and latency.

2. Structured output is off the table

The natural-looking approach to fetching three suggestions would be something like {"reply": "...", "suggestions": ["...", "...", "..."]}. I ruled it out for two reasons.

Gemini flash-lite class models show noticeable latency increases with structured output. The lighter the model, the heavier schema compliance costs are relative to the task.
It directly conflicts with sentence-level TTS streaming. This chat is designed to start speaking from the very first sentence. While the model is outputting JSON, there's no way to pull out that first sentence. Structured output means waiting for full generation before any audio plays.

Three Approaches I Considered

A. Separate API call to generate suggestions
Fire a second request after the main turn. The prefix would likely hit the cache again, but there's an extra round-trip, and maintaining cache consistency — across Grok's conv-id, implicit prefix caches, etc. — becomes your problem.

B. Structured output, bundled in the main turn
No second request, so cache consistency is trivial. But ruled out for the reasons above (latency + streaming conflict).

C. Inline markers, bundled in the main turn (chosen)
Ask the model to append {{SUGGEST: option1 | option2 | option3}} at the very end of its response, and extract it server-side.

Why C Works

It's the same request. There is no "second request." Whether it's an explicit cache or an implicit prefix cache, that turn is already on the cache — alignment is automatic. No per-provider logic needed.
No structured output. Plain text generation all the way through.
Zero perceived latency increase. TTS is already playing from the first sentence while {{SUGGEST}} trickles out at the end. Generation finishes while the user is listening.
Reuses the existing marker infrastructure. This chat already has inline markers like {{SHOW: label}}, {{POSE: ...}}, and {{IMAGE: ...}}, plus a pipeline for extracting and stripping them. Suggestions are just one more entry in that system. Design stays consistent.

The Key Implementation Detail: Strip From Both Places

The important part: once extracted, the marker must be removed from both the TTS/display text and the DB history. Suggestions are ephemeral UI scaffolding, not part of the character's actual speech — leaving them in history would pollute context for future turns.

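// Extract {{SUGGEST: a | b | c}} and remove it entirely from the body.
// Assumed imports: `Lazy` from the once_cell crate, `Regex` from the regex crate.
use once_cell::sync::Lazy;
use regex::Regex;

static RE_SUGGEST: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"(?is)\{\{\s*SUGGEST\s*:\s*([\s\S]*?)\}\}").unwrap());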

fn extract_suggest(text: &str) -> (String, Vec<String>) {
    match RE_SUGGEST.captures(text) {
        Some(cap) => {
            let suggestions = cap[1]
                .split('|')
                .map(|s| s.trim().to_string())
                .filter(|s| !s.is_empty())
                .take(3)
                .collect();
            let clean = RE_SUGGEST.replace_all(text, "").trim().to_string();
            (clean, suggestions)
        }
        None => (text.to_string(), Vec::new()),
    }
}

This is where the existing "store annotated / display clean" separation pays off. In this chat:

ai_text returned to the client (display + TTS) is fully stripped of all markers.
What gets saved to DB re-attaches {{SHOW}}/{{POSE}} markers (so the model keeps seeing its own canonical format in history and continues using it correctly).

{{SUGGEST}} is different from {{SHOW}}/{{POSE}} — it doesn't go back into the DB at all. It's ephemeral. The design of choosing per-marker whether to persist or discard let suggestions slot in cleanly without touching anything else.
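The per-marker choice can be sketched as a small policy table. This is an illustration in Python, not the service's Rust code. The `PERSIST` table and `split_response` are hypothetical names, and markers missing from the table are dropped here only because the article does not state how they are handled. The sketch also keeps persisted markers in place instead of re-attaching them.

```python
import re

# Hypothetical policy: True = keep the marker in DB history, False = discard it.
PERSIST = {"SHOW": True, "POSE": True, "SUGGEST": False}
MARKER_RE = re.compile(r"\{\{\s*([A-Z]+)\s*:\s*(.*?)\}\}", re.DOTALL | re.IGNORECASE)

def split_response(raw: str):
    """Return (display_text, db_text, suggestions) from raw model output."""
    suggestions = []
    for name, payload in MARKER_RE.findall(raw):
        if name.upper() == "SUGGEST":
            suggestions = [s.strip() for s in payload.split("|") if s.strip()][:3]
    # Display and TTS text: every marker removed.
    display = MARKER_RE.sub("", raw).strip()
    # DB text: only markers flagged as persistent survive.
    db_text = MARKER_RE.sub(
        lambda m: m.group(0) if PERSIST.get(m.group(1).upper(), False) else "",
        raw,
    ).strip()
    return display, db_text, suggestions

raw = "Sure, let's go! {{POSE: wave}} {{SUGGEST: Sounds fun | Maybe later | Where to?}}"
display, db_text, suggestions = split_response(raw)
print(display)
print(db_text)
print(suggestions)
```

Adding another ephemeral marker is then a one-line change to the table.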

On the prompt side, it's just one extra block gated by a feature flag in the persona config:

At the very end of your response, add exactly three short replies the user
might say next, in this format:
{{SUGGEST: option1 | option2 | option3}}
- Always place it last (after any {{SHOW}}/{{POSE}} markers)
- Write each option in first person, casual, short
- Vary the direction: one enthusiastic, one deflecting, one asking a question back

A Note on Implicit Prefix Cache Alignment

Implicit prefix caches hit when the token sequence at the start of a request matches a previously seen prefix. The marker approach simply generates suggestions as part of the current turn's response — the next turn's input prefix (system + history) is identical to what it would be in a plain conversation. The prefix keeps hitting the cache normally. The suggestions never touch the prefix at all. That's a quiet but important property.

Summary

When adding secondary structured data to a streaming + caching chat, consider inline markers + extraction before reaching for structured output.
Bundling everything into the same request makes cross-provider cache alignment a non-issue by construction.
If you already have a marker extraction pipeline, the marginal cost is nearly zero. Design it so you can choose per-marker whether to persist or discard — that flexibility makes ephemeral UI additions painless to add later.

The costs: output tokens increase by a few dozen, and occasionally the model mangles the marker format (same risk level as {{SHOW}}/{{POSE}}). Both are acceptable.

This chat is part of kotonia, a voice roleplay product running multilingual TTS × lip-sync avatars on a local GPU.

One of the First Public HiDream-O1-Image LoRAs — and How to Train Your Own

shinji shimizu — Tue, 26 May 2026 23:45:33 +0000

TL;DR

HiDream-O1-Image is one of the strongest open-weight text-to-image models out right now (it debuted around #8 in the Artificial Analysis T2I Arena). But it shipped inference-only, and because its architecture is radically different from SDXL/Flux — no VAE, no separate text encoder, everything is one unified transformer — the usual LoRA trainers can't touch it.

This post documents one of the first public LoRA training runs, and one of the first general-purpose visual-enhancement LoRAs, for HiDream-O1-Image. I'll show why the standard trainers (kohya, ai-toolkit, SimpleTuner) don't fit, how I reverse-engineered a working training loop from the inference code alone, and the ~150-line trainer that produces a clean aesthetic LoRA. Plus the gotchas that cost me a night.

What this LoRA is: a general-purpose anime / semi-real visual enhancement LoRA — it improves rendering quality, lighting, and stylization across diverse subjects with a trigger phrase. It's not a character LoRA, not a single-style LoRA, and not a model-distillation artifact.

The short version of the recipe:

The model's output head predicts the clean image x0 (in patch space, [-1,1]).
Build the noised input as z_t = (1 - σ)·x0 + σ·(8.0·ε) and feed the model timestep 1 - σ.
Loss is just MSE(x_pred, x0) on the image-token positions.
LoRA attaches via plain PEFT to the language-model decoder linears, because the backbone is a stock HF Qwen3-VL.

Prior art (what existed before this)

To set expectations honestly: I'm not claiming "world's first LoRA file for O1."

Kijai published a ComfyUI workflow for HiDream-O1 that includes a distill LoRA — it extracts the Dev-2604 model's behavior as a LoRA applied to the Base model. That's a model-compression technique, not a visual-style LoRA trained on external images.
Ostris (author of AI Toolkit) has run initial LoRA training tests on HiDream-O1 and ai-toolkit lists O1 as a supported model. No resulting LoRA has been publicly released as of this writing.
TechnoEdge (Japanese tech media) reported using a face LoRA with HiDream-O1 Dev, though it's unclear whether that LoRA was purpose-trained for O1 or adapted from elsewhere.

What I didn't find: a publicly released, general-purpose anime / semi-real visual-enhancement LoRA trained specifically for HiDream-O1-Image. If you know of one, I'd genuinely love to see it — the more the merrier. But as of publication, this appears to be one of the first, and the first with before/after documentation and a full open training recipe.

Why no trainer exists: the architecture

Most LoRA trainers assume the SDXL/Flux shape: a UNet/DiT denoiser + a VAE + one or two text encoders, all separate modules wired together by diffusers. You patch LoRA into the UNet/DiT attention, freeze the rest, and the trainer knows how to encode images to latents and text to embeddings.

HiDream-O1-Image is a Pixel-level Unified Transformer (UiT). From its own description:

a natively unified image generative foundation model built on a Pixel-level Unified Transformer without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space.

Concretely (reading models/qwen3_vl_transformers.py):

The backbone is a Qwen3VLForConditionalGeneration — a stock Hugging Face Qwen3-VL multimodal transformer.
There is no VAE. Images are patchified directly: PATCH_SIZE = 32, so an H×W image becomes (H/32)·(W/32) tokens, each a 3·32·32 = 3072-dim vector of raw pixels.
A small x_embedder projects the noised patch tokens into the hidden space; a final_layer2 head projects hidden states back to patch space; a t_embedder injects the timestep at a dedicated <|tms_token|> position.
It's trained with flow matching (fm_solvers_unipc.py), and image tokens get full (bidirectional) attention while text tokens stay causal (this is what token_types controls).

So none of kohya/ai-toolkit/SimpleTuner can touch it as-is: there's no UNet, no VAE, no separate text encoder for them to hook. That's exactly why write-ups are so scarce: it's a new architecture, released inference-only.

The good news: because the backbone is a plain transformers model, the LoRA adapter mechanics are trivial — PEFT injects into the nn.Linears natively. The hard part is the training loop, which the repo doesn't ship. So let's derive it.

Reverse-engineering the training forward from inference

The inference loop (models/pipeline.py:generate_image) tells you everything. Per denoising step it does roughly:

sigma = step_t / 1000.0                       # noise level, in (0, 1]
t_pixeldit = 1.0 - sigma                       # what the model receives as "timestep"
x_pred = model(..., vinputs=z, timestep=t_pixeldit).x_pred
v = (x_pred - z) / sigma                        # ... and -v is fed to the FM scheduler

Two facts fall out of this:

x_pred is the model's prediction of the clean image x0. Work the algebra backwards: if z_t = (1-σ)·x0 + σ·ε then (x_pred - z_t)/σ = x0 - ε = -(ε - x0), and ε - x0 is exactly the rectified-flow velocity the FlowMatch scheduler expects. Consistent ⇒ the head is x0-parameterized.
The noise scale isn't 1. Inference initializes z = NOISE_SCALE · randn with NOISE_SCALE = 8.0, while x0 lives in [-1, 1]. So the interpolation the model was trained on is z_t = (1-σ)·x0 + σ·(8.0·ε).
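Both facts can be checked numerically without the model. The sketch below uses plain Python lists as stand-ins for patch tensors and plugs in a perfect prediction (x_pred = x0). The recovered velocity should then equal x0 − 8.0·ε, so −v is the direction from the data toward the scaled noise. The helper names are hypothetical.

```python
import random

NOISE_SCALE = 8.0  # value quoted from the inference code

def noised(x0, eps, sigma):
    """z_t = (1 - sigma) * x0 + sigma * (NOISE_SCALE * eps), elementwise."""
    return [(1.0 - sigma) * a + sigma * NOISE_SCALE * e for a, e in zip(x0, eps)]

def velocity_from_x0(x_pred, z_t, sigma):
    """v = (x_pred - z_t) / sigma, as computed per step in the inference loop."""
    return [(p - z) / sigma for p, z in zip(x_pred, z_t)]

rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # "clean image" in [-1, 1]
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]     # unit Gaussian noise
sigma = 0.3

z_t = noised(x0, eps, sigma)
v = velocity_from_x0(x0, z_t, sigma)              # perfect x0 prediction
expected = [a - NOISE_SCALE * e for a, e in zip(x0, eps)]
max_err = max(abs(a - b) for a, b in zip(v, expected))
print(max_err)
```

The error is at floating-point rounding level for any σ in (0, 1], which is the consistency check behind reading the head as x0-parameterized.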

That gives the entire training step:

sigma = random.uniform(T_EPS, 1.0)
eps   = torch.randn_like(x0)
z_t   = (1.0 - sigma) * x0 + sigma * (NOISE_SCALE * eps)   # NOISE_SCALE = 8.0
t     = torch.tensor([1.0 - sigma])

out    = gen(input_ids=ids, position_ids=pos, vinputs=z_t,
             timestep=t, token_types=tt)
x_pred = out.x_pred[0, vinput_mask[0]]      # image-token positions only
loss   = F.mse_loss(x_pred.float(), x0[0].float())

x0 is just the image, normalized to [-1,1] and patchified with the same einops rearrange the pipeline uses for reference images. The token layout (prompt → <|boi_token|> → <|tms_token|> → image tokens) is built by reusing the pipeline's own build_t2i_text_sample, so positions and token_types line up with what the forward expects.

Uniform σ sampling and unweighted x0-MSE are enough to learn cleanly — no fancy loss weighting needed for a first cut.

Attaching the LoRA

Because the denoiser is model.model.language_model (a stock Qwen3-VL decoder), PEFT targets its attention/MLP linears and freezes everything else:

targets = [n for n, m in model.named_modules()
           if isinstance(m, torch.nn.Linear)
           and n.endswith(("q_proj","k_proj","v_proj","o_proj",
                           "gate_proj","up_proj","down_proj"))
           and "language_model" in n and "visual" not in n]

model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=16, target_modules=targets, lora_dropout=0.0, bias="none"))

That's 252 linears, ~44M trainable params at rank 16. The vision encoder, x_embedder, t_embedder, and final_layer2 stay frozen. One subtlety: PEFT swaps the Linears in place, so a handle grabbed before get_peft_model (gen = model.model) still sees the LoRA layers — convenient for calling the generation forward directly and for model.disable_adapter() A/B renders.
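The two numbers can be reproduced with a short calculation. The decoder dimensions below are assumptions (a Qwen3-8B-class text config: 36 layers, hidden size 4096, MLP size 12288, 32 query heads and 8 key/value heads of dimension 128); check them against the model's config.json before relying on them.

```python
# Assumed decoder dimensions; not read from the actual checkpoint.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 12288, 36
Q_OUT, KV_OUT = 32 * 128, 8 * 128   # heads * head_dim for query and key/value
RANK = 16

# (in_features, out_features) of each targeted linear in one decoder layer
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_f: int, out_f: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r): r * (in + out) weights per linear."""
    return r * (in_f + out_f)

n_linears = LAYERS * len(shapes)
n_params = LAYERS * sum(lora_params(i, o, RANK) for i, o in shapes.values())
print(n_linears, f"{n_params / 1e6:.1f}M")
```

Under these assumptions the count comes out at 252 linears and about 43.6M parameters, in line with the figures above. Doubling the rank doubles the parameter count.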

Data, captions, and resolution

Resolution is not fixed at 2048. The find_closest_resolution() snapping you see in the pipeline is a quality default (the model is tuned for high res), not an architectural limit — height/width are free as long as they're multiples of 32. Since image tokens scale as (H/32)·(W/32):

resolution	image tokens	relative attention cost
2048²	4096	1×
1024²	1024	~1/16

So I train at 1024: ~4× shorter sequences, far less VRAM and time per step. The workflow becomes "iterate cheaply at 1024, upscale the keepers." Aspect ratios are left native (each image snapped to the nearest ×32, batch size 1) — no bucketing needed, and mixed portrait/landscape actually helps a style LoRA generalize.
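The table and the ×32 snapping reduce to a few lines of arithmetic. This sketch counts image tokens only (prompt and special tokens are ignored) and treats attention cost as purely quadratic in sequence length. `snap32` is a hypothetical helper, not the pipeline's find_closest_resolution().

```python
PATCH_SIZE = 32

def snap32(x: int) -> int:
    """Round a side length to the nearest multiple of 32 (at least one patch)."""
    return max(PATCH_SIZE, int(x / PATCH_SIZE + 0.5) * PATCH_SIZE)

def image_tokens(height: int, width: int) -> int:
    """One token per 32x32 patch."""
    return (height // PATCH_SIZE) * (width // PATCH_SIZE)

def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Self-attention cost grows with the square of the sequence length."""
    return (tokens / baseline_tokens) ** 2

base = image_tokens(2048, 2048)
small = image_tokens(1024, 1024)
print(base, small, relative_attention_cost(small, base))

# A native-aspect image of 1000 x 1500 snaps to a 32-aligned size.
h, w = snap32(1000), snap32(1500)
print(h, w, image_tokens(h, w))
```

The non-attention layers scale linearly with token count, so the real end-to-end saving sits somewhere between 4× and 16×.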

For captions, HiDream wants natural-language prose, not danbooru tags (different text encoder lineage). I captioned ~190 images with a local multimodal VLM into one-to-three-sentence descriptions, each prefixed with a trigger phrase so the aesthetic stays prompt-controllable (invoke it when you want it, leave it off otherwise).

Results

Same prompt, same seed, adapter off vs on. All samples use the trigger phrase kotonia style:

The base model is competent but soft and a bit generic; the LoRA pushes rendering toward a polished modern-anime look — directional lighting, glossier hair and skin, more confident stylization — and it holds across very different subjects (schoolgirl slice-of-life → epic fantasy), so it learned an aesthetic rather than memorizing images.

Training progression (500 → 2500 steps)

Same prompt, same seed, rank 16, ~190 images:

It keeps refining without melting or obvious overfitting even at 2500 steps — the sweet spot is further out than I expected for a set this small. (Loss drifts ~0.07 → 0.052.)

NSFW controllability

NSFW content controllability (prompt-gating) was also tested as part of this LoRA — the model produces NSFW only when explicitly prompted, and the LoRA's contribution is primarily visual quality rather than "uncensoring." For the full story including training data composition, motivation, and NSFW samples, see the companion article on kotonia.ai.

Reproduce it

The whole trainer is ~150 lines. Run:

uv pip install peft
CUDA_VISIBLE_DEVICES=0 python train_lora.py \
  --data_dir /path/to/images \
  --out_dir outputs/lora_run \
  --resolution 1024 --steps 2500 --rank 16 \
  --sample_every 500 --sample_prompt "<trigger>, ..."

--sample_every renders an adapter on/off pair so you can watch the LoRA bite. Inference loads the base model, applies the adapter with PeftModel.from_pretrained, and generates — disable_adapter() gives you the baseline for free.

Gotchas that cost me time

Host-RAM OOM on load. from_pretrained(...).to(device) materializes the full 8B model in CPU RAM before moving it to GPU; on a 60 GB host alongside other services this got OOM-killed mid-load. low_cpu_mem_usage=True streams the shards and fixes it.
The 2048 "limit" is a default. Pass your own height/width (multiples of 32) and bypass the bucket snapping entirely.
Detach long runs. Launch training under setsid/tmux/systemd — if it's a child of your editor's terminal, an editor crash takes the run (and any GPU services in sibling terminals) down with it.
x0-param, not v-param. Train against x0 directly; if you assume velocity prediction the loss won't match the head and the LoRA won't converge to the right manifold.

Companion article (the story behind this LoRA): Why I Trained a HiDream-O1 LoRA — on kotonia.ai.

The LoRA is available on kotonia.ai/studio (my own creative platform where I serve the model alongside the LoRA, free to use). The full trainer code, captioning pipeline, and inference scripts are in the GitHub repo under HiDream-O1-Image/.

If you train something cool with this recipe — a character LoRA, a style LoRA, an NSFW-enhancing LoRA — I'd love to see it. The more community LoRAs exist for O1, the better for everyone.

My high-res image-to-video kept OOMing — turns out I was decoding outside no_grad

shinji shimizu — Tue, 26 May 2026 06:29:28 +0000

TL;DR

I run LTX-2.3 image-to-video (I2V) locally on a 96 GB GPU. At 1024×768 / 97 frames it peaked at 83.5 GiB — so close to the ceiling that it OOM'd whenever my image-generation server was co-resident, and 1280×768 OOM'd outright. I assumed I'd hit a hardware wall.

I hadn't. 54 of those gigabytes were an autograd graph. The pipeline returns a lazy decode iterator; the real VAE decode runs when you encode the output — and in my harness that happened outside the with torch.no_grad(): block, so every conv activation in the decoder was retained for a backward pass that never comes.

Moving one call inside the no_grad block:

	before	after
I2V 1024×768/97f peak	83.5 GiB	29.5 GiB (−65%)
time	151.6 s	135.2 s (slightly faster)

And the peak goes nearly flat across resolution — 2048×1536 (3.1 MP) tops out at 33.6 GiB. The "I need a bigger GPU" conclusion was a measurement artifact.

The lever I tried first — finer VAE decode tiling — barely moved the number. That dead end is part of the story.

The setup

GPU: RTX PRO 6000 Blackwell Max-Q (96 GB)
PyTorch: 2.x + CUDA 12.8 (Blackwell sm_120)
Model: LTX-2.3 22B, two-stage (low-res denoise → 2× latent upscale → high-res refine → VAE decode), transformer loaded as fp8-cast
Mode: cold-start (components built/freed per request, low idle VRAM)

The workflow I care about: generate a clean still, then animate it with I2V. Starting from a correct still sidesteps the seed-gacha and anatomy breakdowns you get from pure text-to-video. The only thing standing in the way was VRAM.

Dead end #1: VAE decode tiling

LTX-2's VAE decode supports tiling (TilingConfig: spatial tile px / temporal tile frames). The default is a coarse 768 px / 80 frames. The intuition: smaller tiles → smaller decode workspace → lower peak.

I made tiling configurable and swept it. The most aggressive setting (384 px / 32 frames):

tile 384px/32f (finest): process demanded 77.37 GiB → still OOM with the co-resident model
tile 768px/80f (default): 83.51 GiB

Halving the spatial tile and cutting the temporal tile to 40% (80 → 32 frames) bought only ~6 GiB. So the peak isn't the decode workspace. Tiling was the wrong lever.

Before retreating to "lower the resolution," I measured where the peak actually lives.

Localizing the peak

I dropped an env-gated profiler into the pipeline's __call__, printing torch.cuda.max_memory_allocated() at each phase boundary:

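import torch

def _vram(label):
    # Wait for queued kernels so the peak covers all work issued so far
    torch.cuda.synchronize()
    # Peak allocated bytes since process start (or the last reset), in GiB
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[vram] {label}: peak_so_far={peak:.2f}GiB")

# Called after each phase: stage_1 denoise → upsampler → stage_2 denoise → decode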

At 1024×768/97f:

[vram] after stage_1 denoise:        peak_so_far=29.17GiB
[vram] after upsampler (2x latent):  peak_so_far=29.17GiB
[vram] after stage_2 denoise:        peak_so_far=29.51GiB
[vram] after decode call (lazy):     peak_so_far=29.51GiB   ← inside the pipeline: 29.5 GiB

The pipeline's internal peak is 29.51 GiB. But measured around the whole generate call it was 83.51 GiB. The extra 54 GiB appears after the pipeline returns a value.
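The same reading can be done mechanically. The sketch below parses the four log lines above and reports how much each phase raised the running peak, then subtracts the last in-pipeline peak from the 83.51 GiB whole-call figure. The regex and helper name are illustrative, not part of the profiler itself.

```python
import re

LOG = """\
[vram] after stage_1 denoise:        peak_so_far=29.17GiB
[vram] after upsampler (2x latent):  peak_so_far=29.17GiB
[vram] after stage_2 denoise:        peak_so_far=29.51GiB
[vram] after decode call (lazy):     peak_so_far=29.51GiB
"""
WHOLE_CALL_PEAK_GIB = 83.51  # measured around the whole generate call

PATTERN = re.compile(r"\[vram\] (?P<label>.+?):\s+peak_so_far=(?P<gib>[\d.]+)GiB")

def peak_deltas(log):
    """Return (label, peak, increase over the previous phase) per log line."""
    rows, prev = [], 0.0
    for m in PATTERN.finditer(log):
        peak = float(m["gib"])
        rows.append((m["label"], peak, round(peak - prev, 2)))
        prev = peak
    return rows

rows = peak_deltas(LOG)
for label, peak, delta in rows:
    print(f"{label:<30} peak={peak:.2f} GiB  (+{delta:.2f})")

# Memory that no in-pipeline phase accounts for.
unexplained = round(WHOLE_CALL_PEAK_GIB - rows[-1][1], 2)
print(f"outside the pipeline: +{unexplained:.2f} GiB")
```

Because max_memory_allocated() is a running maximum, a phase with a +0.00 delta never exceeded the peak set before it. That is why a large remainder outside every phase is such a clear signal.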

Root cause: a lazy iterator escaping no_grad

The return value is a lazy iterator:

def __call__(self, ...):
    ...
    decoded_video = self.video_decoder(latent, tiling_config, generator)  # builds an iterator
    return decoded_video, audio   # nothing decoded yet

The actual VAE decode runs when something consumes the iterator — i.e. inside encode_video. And my harness looked like this:

with torch.no_grad():
    video, audio = pipeline(...)   # returns the iterator (cheap)

encode_video(video=video, ...)     # decode runs HERE — outside no_grad

encode_video is outside the no_grad block. Because decode is lazy, it runs with grad enabled, and PyTorch dutifully keeps every intermediate activation in the VAE decoder around for a backward pass. That's the 54 GiB.

The fix is to indent one call:

with torch.no_grad():
    video, audio = pipeline(...)
    encode_video(video=video, ...)   # decode now runs under no_grad

before: 83.51 GiB / 151.6 s
after:  29.51 GiB / 135.2 s   ← graph bookkeeping gone, slightly faster too

Why no_grad and not inference_mode? With the streaming weight loader, the VAE decode chokes on inference-mode tensors ("Inference tensors cannot be saved for backward"). no_grad keeps the latents as normal tensors so decode survives. (Production servers that wrap the entire generate in inference_mode/no_grad never hit this — it was purely a harness scoping slip.)

The payoff: peak is ~flat across resolution

Post-fix sweep, single process, escalating resolution:

resolution (97f)	peak VRAM	time
1024×768	29.51 GiB	135 s
1280×768 (was a 93 GiB OOM)	29.51 GiB	165 s
1536×1152	29.99 GiB	206 s
2048×1536 (3.1 MP)	33.55 GiB	348 s

Nearly flat. The decode processes tiles sequentially, so higher resolution just means more tiles, not a bigger simultaneous workspace — and once the autograd graph is gone, that's what dominates. (Which is exactly why tiling alone did nothing earlier: the graph was swamping it.)
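To put a number on "nearly flat", this sketch compares pixel count and peak VRAM against the 1024×768 row of the table above. It only rearranges the measured values.

```python
# (label, pixels per frame, peak GiB) from the post-fix sweep above.
sweep = [
    ("1024x768", 1024 * 768, 29.51),
    ("1280x768", 1280 * 768, 29.51),
    ("1536x1152", 1536 * 1152, 29.99),
    ("2048x1536", 2048 * 1536, 33.55),
]

base_px, base_peak = sweep[0][1], sweep[0][2]
for name, px, peak in sweep:
    print(f"{name:>10}: {px / base_px:.2f}x pixels -> {peak / base_peak:.2f}x peak")

pixel_growth = sweep[-1][1] / base_px
peak_growth = sweep[-1][2] / base_peak
```

Going from the smallest to the largest row is 4× the pixels for roughly 1.14× the peak.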

A bonus: a "VRAM leak" I'd blamed on consecutive generations in one process also vanished. It was the same retained graph, accumulating across prompts.

Takeaways

Check that with torch.no_grad(): actually covers what you think. If the return value is a generator / iterator / lazy tensor, the real compute can happen outside the block when it's consumed. Scope illusion.
Don't kill a VRAM peak by guessing. Print max_memory_allocated() at phase boundaries; the culprit shows up immediately. My "the decode workspace is heavy" intuition was simply wrong, and without profiling I'd have spent the afternoon lowering resolution.
Suspect measurement artifacts before concluding "the hardware is too small." I almost gave up on high-res I2V as impossible on 96 GB. It runs in about 30 GiB at 1024×768 and under 34 GiB at 2048×1536.

This came out of building the video features for a solo voice × video roleplay platform (kotonia.ai) — chasing what a single local GPU can do in a niche the big labs deprioritize.

I wrote up the why behind that bet — the model A/B that led me to make I2V the mainstay, and the GPU traffic-control that lets me experiment in production without stalling users — separately: Betting on the video niche the big labs walked away from.

HiDream Raw Output Failed, Tried Dev-2604, VRAM Math Killed It, Won with a Prompt Enhancer Instead

shinji shimizu — Sat, 23 May 2026 07:18:23 +0000

TL;DR

HiDream-O1-Image 8B Full raw outputs collapse on plain Japanese prompts — both instruction-following and aesthetics fail at once
Tried to swap to Dev-2604 (preference-tuned, 3.5× faster). It's better aesthetically but the gap is small in our use case, and worse — the 96GB GPU can't host both models alongside the rest of the stack
Pivoted away from model swap entirely. Stuck with Full + a Gemini Flash Lite prompt enhancer that bolts aesthetic polish on top
Along the way, found four non-obvious HiDream pitfalls (brand names get rendered as literal text, "cute" triggers childlike body bias, "Wong Kar-wai" hallucinates Korean captions, "idol-class" auto-generates caption text) — all baked into the enhancer's system prompt
Same plain Japanese prompt now produces a usable photoreal or anime variant from a single click. No model swap, no extra VRAM, no extra latency.

Act 1: "Raw output is busted"

Kotonia Studio runs HiDream-O1-Image 8B Full on a local GPU (RTX PRO 6000 Blackwell Max-Q, 96GB) and offers free T2I. Normally outputs are clean. But one day, a plain Japanese prompt — "a cute woman in a cheongsam, holding a fan, smiling" — returned this:

What went wrong:

Asked for a cheongsam, got a kimono. Chinese attire drifted to Japanese.
Face isn't pretty. We wanted idol-class beauty.
Composition is generic full-body in a Kyoto-style garden. We wanted a closer crop showing the fan texture.

HiDream-O1 is a top-tier OpenWeight model — careful English prompts produce magazine-grade 2048×2048 outputs. So this isn't "the model is bad." It's a gap between user input and OpenWeight model expectations. Frontier models (Gemini Imagen / DALL-E / Midjourney) absorb natural-language nuance internally. OpenWeight models expect you to throw the prompt straight at them.

Either give up on the raw-output UX, or do something about it.

Act 2: Maybe Dev-2604 will save us?

Then I noticed HiDream-O1-Image-Dev-2604, a new variant released in May 2026. It debuted at #8 on the Artificial Analysis T2I Arena and runs 3.5× faster at 28 steps with no CFG.

Arena ranks models on human aesthetic preference. So Dev should be preference-tuned for "what looks good."

Hypothesis:

Dev returns magazine-grade output even on vague Japanese prompts
3.5× speed improvement makes /studio snappier
Best case: deprecate Full, run Dev only

Phase 1 bench: 5 generic cinematic prompts (Tokyo izakaya, Bangkok night market, anime character, text-in-image, portrait), Full vs Dev-2604:

mode	Full (s)	Dev-2604 (s)	speedup
T2I (avg)	33.1	9.5	3.5×
Edit (avg)	79.0	22.2	3.6×
IP	84.3	23.8	3.5×

On generic prompts, Dev is faster and impressionistically nicer. "OK, Dev is the answer" — that's where I almost stopped at the end of Phase 1.

Act 3: But on the use case, the gap is thin — and Edit performance drops hard

I almost locked in a wrong conclusion. Kotonia's actual strategy is "comedy-style short videos with idol-class beauty hooks." The fact that Dev wins on generic cinematic doesn't mean it wins on character-driven comedy with expression specificity.

Built 8 new prompts inspired by Grok-generated reference images (cinematic editorial Asian beauty / anime qipao / cinematic hanfu / cosplay maid / etc), in vertical 1440×2560 (9:16) framing, and re-benched.

Some of the Grok reference images (the level of polish we wanted to match):

[Reference images: Editorial portrait | Cinematic hanfu]

The bench result was Full wins on instruction-following:

editorial portrait: tied; Dev maybe a touch nicer aesthetically
anime qipao: Full's cel-shading wins decisively. Dev drifts to semi-realistic and ignores the "anime" instruction
hanfu brocade: Dev hallucinated the literal word "SAVE" onto the parasol (text artifact)
comedy surprised face: Full produces a more cartoonish exaggerated expression + readable caption text
comedy deadpan: Full nails the "really?" deadpan expression with crisp eyeliner

Dev-2604 traded instruction-following for aesthetic polish. It looks preference-tuned toward magazine-style fashion photos, so on non-magazine use cases it pulls outputs back toward "magazine-looking" against the prompt's intent.

"Both fine, marginal gap" example: editorial portrait

The category I marked "tied" — same portrait prompt, Full vs Dev outputs side by side:

[Side-by-side images: Full (tight crop, dramatic) | Dev-2604 (wider, magazine-polished)]

Full leans high-contrast and moody (window-side Rembrandt light, dark library background). Dev leans soft and editorial (seated half-body, natural light, smoother skin retouch). Both are usable; Dev is slightly gentler. That's it.

Not enough of a gap to justify the cost of model swapping (VRAM, load time, architectural complexity). That's the conclusion Phase 2 drove me to.

The decisive blow: Edit and IP performance crater

Generic T2I alone might have left Dev viable. But the gap on Edit and IP (character consistency) was stark, and that's what finally killed the model-swap idea.

We took a T2I output with three people in a dark alley with lanterns, and ran the Edit instruction Same scene, same characters, same composition. Change the weather to a heavy rainy evening; the characters now wearing translucent rain ponchos.

[Side-by-side images: Full (scene preserved, weather changed) | Dev-2604 (abandoned the source scene entirely)]

Full followed the instruction: three people, rain ponchos, rainy alley. Dev replaced the reference entirely with a single woman in a kimono at a snowy temple gate — neither following the text instruction nor preserving any structural detail from the reference. This is past "weak edit fidelity"; it's "not functioning as an edit."

IP (character consistency) showed the same pattern. We handed the model two face photos and asked for "the same two people standing together on an autumn path in Kyoto."

[Side-by-side images: Full (identities mostly preserved) | Dev-2604 (different people generated)]

Full keeps the two faces recognizable. Dev generated two different people. The preference-tuning likely prioritizes "produce pretty faces" over "preserve the reference's identity."

The official README spells this out: For editing tasks we recommend using the full model. Phase 1 timing was Full 79s / Dev 22s — fast, but Dev's outputs are unusable for Edit/IP.

So Dev isn't a clear win. But it's not a clean loss either — it's faster (3.5×), and on cinematic atmosphere shots it does look better. Maybe I need to use both, switched per use case?

Act 4: VRAM math kills "use both"

"Just keep both models resident on GPU" sounds clean. Then I actually pulled up the GPU memory budget for the single 96GB GPU we run everything on:

Co-resident process	resident VRAM	peak VRAM
E4B (reviewer LLM)	19.6 GB	19.6 GB
31B Gemma 4 NVFP4 (orchestrator)	38.0 GB	38.0 GB
TTS server (Irodori + Whisper)	9.6 GB	9.6 GB
Ditto-TalkingHead	3.0 GB	3.0 GB
LTX-2 A2V (cold-start, fp8-cast)	0.9 GB	24.0 GB (during inference)
HiDream Full (resident)	16.4 GB	17.3 GB
Total	87.5 GB	111.5 GB ← when LTX-2 fires

The moment LTX-2 video generation fires, we're already at the OOM line on a 96GB GPU (111.5 GB of summed peaks). Adding Dev-2604 as a second resident model means +16.4 GB → 103.9 GB resident, 127.9 GB at peak → impossible.
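The budget is easy to re-derive. The sketch below sums the table's resident and peak columns, then adds a second image model. Like the text, it assumes Dev-2604 would cost the same 16.4 GB resident as Full; that figure is not a separate measurement.

```python
GPU_GB = 96.0

# (resident GB, peak GB) per process, copied from the table above.
budget = {
    "E4B reviewer": (19.6, 19.6),
    "Gemma 4 31B NVFP4": (38.0, 38.0),
    "TTS server": (9.6, 9.6),
    "Ditto-TalkingHead": (3.0, 3.0),
    "LTX-2 A2V": (0.9, 24.0),
    "HiDream Full": (16.4, 17.3),
}

def totals(procs):
    """Sum the resident and peak columns, rounded to the table's precision."""
    resident = round(sum(r for r, _ in procs.values()), 1)
    peak = round(sum(p for _, p in procs.values()), 1)
    return resident, peak

resident, peak = totals(budget)

# Assumption: Dev-2604 needs the same 16.4 GB as Full when resident.
with_dev = {**budget, "HiDream Dev-2604": (16.4, 16.4)}
resident_dev, peak_dev = totals(with_dev)

print(f"today:    {resident} GB resident, {peak} GB summed peak")
print(f"with Dev: {resident_dev} GB resident, {peak_dev} GB summed peak")
print(f"fits idle with Dev? {resident_dev <= GPU_GB}")
```

With Dev added, the resident total alone is past 96 GB, so the configuration fails before any video generation starts.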

Options enumerated:

Both resident: impossible (OOM, see above)
Both cold-start: +22s load per request. Against 33s of inference that's a big hit; idle 0GB is nice, but first-touch UX collapses
Dev resident + Full cold-start: Dev as primary + Full for edit/IP. But Phase 2 invalidated that premise
Full resident + Dev cold-start: Occasionally switch to Dev, eat 22s load each time
Drop Dev, keep Full only: status quo, no speedup gained

From a service-viability standpoint, options 1-4 all sacrifice either "make free users wait 22s extra" or "shrink VRAM headroom so LTX-2 / 31B can't run." Running a single GPU for one solo operator means budget is tight: Dev's marginal aesthetic gain doesn't justify breaking the rest of the stack.
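The cold-start penalty can be set against the Phase 1 T2I timings. This sketch assumes the 22 s load applies to whichever model is cold and is paid on every request, which is the worst case; the variable names are illustrative.

```python
FULL_T2I_S = 33.1   # Full, resident (Phase 1 average)
DEV_T2I_S = 9.5     # Dev-2604, resident (Phase 1 average)
COLD_LOAD_S = 22.0  # cold-start load quoted in the options list

dev_cold = COLD_LOAD_S + DEV_T2I_S     # Dev loaded on demand
full_cold = COLD_LOAD_S + FULL_T2I_S   # Full loaded on demand

print(f"Full resident: {FULL_T2I_S:.1f} s")
print(f"Dev cold:      {dev_cold:.1f} s ({FULL_T2I_S / dev_cold:.2f}x vs Full resident)")
print(f"Full cold:     {full_cold:.1f} s")
```

Under those assumptions a cold-started Dev request takes about as long as a resident Full request, so most of the 3.5× speedup is spent on loading.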

I decided to abandon the model-swap path entirely.

Act 5: Can we just beat this with prompts?

Step back. What was Dev actually winning on?

Just aesthetic polish. Instruction-following is better on Full.

So if I can keep Full's instruction-following while bolting aesthetic polish onto the output, model swap isn't needed.

Concrete approach: append an aesthetic anchor (a "magic suffix") to the prompt to steer Full's output toward magazine-quality.

Trade-offs:

✅ Zero VRAM cost (Full only)
✅ Inference time unchanged (33s/image)
✅ Edit/IP/skeleton/layout still work on Full (avoiding the Dev performance cliff from Act 3)
✅ No 22s Dev cold-start penalty
⚠️ Risk: do anchors actually work?

Phase 3 — tried 4 anchor variants on Full:

v1 Lindbergh: "Vogue cover composition, Peter Lindbergh editorial photography..."
v2 cinematic: "Roger Deakins anamorphic, blockbuster color grade..."
v3 K-beauty: "Vogue Korea / ELLE Korea aesthetic, glass-skin glow..."
v4 combined: kitchen-sink

3 base prompts × (baseline + 4 anchors) = 15 generations. And three deeply non-obvious HiDream behaviors surfaced.

Pitfall 1: Brand names get rendered as literal text on the image

Any anchor containing "Vogue" or "ELLE" produced outputs with "VOGUE" appearing in printed magazine-cover text on the image itself — top-right corner, in front of the subject. Worse on anime: the cel-shaded character had a magazine layout overlaid on top.

HiDream-O1 is SOTA on CVTG-2K (complex visual text generation). The strong text-rendering training means any brand name in the prompt gets a near-guaranteed shot at being literally generated as text on the canvas.

→ Strip brand names from anchors completely. Names that passed A/B testing (Lindbergh, Deakins, Mihoyo) are safe; magazine trademarks are landmines.

Pitfall 2: Photoreal anchors contaminate anime outputs with magazine paper

When anime base prompts were paired with photoreal anchors (v1-v4), the output looked like a cel-shaded anime character with a literal VOGUE magazine cover layout overlaid on top.

When style hints conflicted in these runs, the model overlaid both elements on the canvas rather than blending them.

→ Anime needs its own anchor family (Mihoyo / Kyoto Animation / theatrical anime style) — never reuse photoreal anchors.

Pitfall 3: "Wong Kar-wai" → Korean text hallucination on photoreal scenes

The v5 grok-direction anchor included "Wong Kar-wai-style color grade", and the output rendered Korean text "신부의 아안" etc on the photoreal scene.

Wong Kar-wai is a Hong Kong director with no Korean connection. But the model's internal "Asian arthouse cinema" association routed toward Korean and surfaced as printed text. Director names carry similar risk to brand names — A/B before adopting.
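A cheap guard against these pitfalls is to lint anchor strings before they reach the model. The sketch below is hypothetical tooling, not part of the bench: the blocklist holds the terms that leaked as rendered text above, and the anchor strings are the truncated openings quoted in the Phase 3 list.

```python
import re

# Terms that showed up as rendered text in the runs above.
BLOCKLIST = ["Vogue", "ELLE", "Wong Kar-wai"]

# Anchor openings as quoted in the Phase 3 list (truncated there too).
ANCHORS = {
    "v1 Lindbergh": "Vogue cover composition, Peter Lindbergh editorial photography...",
    "v2 cinematic": "Roger Deakins anamorphic, blockbuster color grade...",
    "v3 K-beauty": "Vogue Korea / ELLE Korea aesthetic, glass-skin glow...",
}

def flagged_terms(anchor):
    """Return blocklisted terms found as whole words, case-insensitively."""
    return [
        term for term in BLOCKLIST
        if re.search(rf"\b{re.escape(term)}\b", anchor, re.IGNORECASE)
    ]

report = {name: flagged_terms(text) for name, text in ANCHORS.items()}
for name, hits in report.items():
    print(f"{name}: {'OK' if not hits else 'remove ' + ', '.join(hits)}")
```

A lint like this only catches names already known to misbehave. New proper nouns still need an A/B run before they are trusted.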

Act 6: Defuse the "cute → child" bias, ship it

Phase 4 rewrote the anchor library:

All brand names stripped
Only A/B-verified safe names retained (Lindbergh, Deakins, Mihoyo)
Separate anime anchor family added (Mihoyo / Kyoto Animation)
Anime anchors include "mature young-adult character proportions" to defuse the "cute" → childlike-body bias (a behavior the user had spotted before I even ran the bench)

Re-benched result:

✅ photoreal portrait: v3 K-beauty clean — no VOGUE leakage, glass-skin + cinematic light
✅ anime: v7 Mihoyo anchor — no magazine contamination, adult proportions preserved
⚠️ comedy caption text handled separately (embrace auto-caption when wanted, post-overlay otherwise)

"Full + cleaned anchors" locked in. Time to wire it into the product.

Implementation: `/api/studio/enhance` (Gemini Flash Lite)

Added an enhance endpoint in backend/src/handlers/studio.rs. Backed by gemini-3.1-flash-lite (cheap API), not the local 31B Gemma. Why:

The 31B local model is 38GB resident — the VRAM budget above already ruled out adding more local LLM weight
Flash Lite is $0.075/M input + $0.30/M output. One enhance is roughly 800 in + 400 out tokens = ~$0.0002/call. Effectively free
Zero VRAM impact: adding this feature doesn't compete with the rest of the GPU stack
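The per-call figure follows directly from the quoted prices and token counts. A minimal sketch of that arithmetic, with illustrative names:

```python
INPUT_USD_PER_M = 0.075   # price per million input tokens, as quoted above
OUTPUT_USD_PER_M = 0.30   # price per million output tokens, as quoted above

def enhance_cost_usd(tokens_in=800, tokens_out=400):
    """Cost of one enhance call at the quoted prices."""
    return (tokens_in * INPUT_USD_PER_M + tokens_out * OUTPUT_USD_PER_M) / 1_000_000

per_call = enhance_cost_usd()
per_10k = per_call * 10_000

print(f"per call:      ${per_call:.5f}")
print(f"per 10k calls: ${per_10k:.2f}")
```

That works out to $0.00018 per call, or about $1.80 per ten thousand enhances.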

System prompt encodes everything from Phase 1-4:

const ENHANCE_SYSTEM_PROMPT: &str = r#"You are a prompt enhancer for HiDream-O1-Image.

Rules (learned from A/B benchmarking):

1. NEVER include brand names ("Vogue", "ELLE", "Nike") — HiDream renders them
   as literal text overlays.
2. NEVER use "Wong Kar-wai" — triggers Korean text hallucination.

3. For photoreal portraits, append:
   " High-end Korean fashion magazine photoshoot aesthetic, professional
     beauty retouch, glass-skin glow, ..."

4. For anime / cell-shaded / illustration, append:
   " In the visual style of Mihoyo / HoYoverse key art, semi-painterly cel
     shading, ..., mature young-adult character proportions ..."
   ALSO: if the prompt has "cute girl" / "kawaii girl" without age qualifier,
   normalize to "young woman in her early twenties with adult proportions".

5. For cinematic scenes, append cinematic CG realism anchor (no Wong Kar-wai).
6. For text-design prompts, append no suffix.

Output JSON: { "detected_style": "...", "anchor_applied": "...",
              "enhanced_prompt": "..." }
"#;

UI side: a small "✨ Enhance" button above the prompt textarea on /studio. Click → POST /api/studio/enhance → swap textarea contents for enhanced_prompt + green banner showing detected style + undo link.
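Since the enhancer's reply replaces what the user typed, it is worth validating before the swap. The sketch below is not the Rust handler; it is a hypothetical Python illustration of one policy: accept the reply only if it is a JSON object with the three string fields from the system prompt and contains none of the banned names, and otherwise fall back to the original prompt.

```python
import json
import re

REQUIRED_KEYS = ("detected_style", "anchor_applied", "enhanced_prompt")
# Names the system prompt forbids; matched as whole words only.
BANNED_RE = re.compile(r"\b(vogue|elle|nike|wong kar-wai)\b", re.IGNORECASE)

def parse_enhance_response(raw, original_prompt):
    """Return the enhanced prompt, or the original if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return original_prompt
    if not isinstance(data, dict):
        return original_prompt
    if any(not isinstance(data.get(key), str) for key in REQUIRED_KEYS):
        return original_prompt
    enhanced = data["enhanced_prompt"].strip()
    if not enhanced or BANNED_RE.search(enhanced):
        return original_prompt
    return enhanced

good = json.dumps({
    "detected_style": "photoreal",
    "anchor_applied": "k-beauty",
    "enhanced_prompt": "a woman in a cheongsam holding a fan, excellent detail",
})
leaky = json.dumps({
    "detected_style": "photoreal",
    "anchor_applied": "k-beauty",
    "enhanced_prompt": "a woman in a cheongsam, Vogue cover composition",
})

print(parse_enhance_response(good, "plain prompt"))
print(parse_enhance_response(leaky, "plain prompt"))
print(parse_enhance_response("not json", "plain prompt"))
```

Matching whole words keeps "excellent" from tripping the "elle" rule. Falling back to the untouched prompt means a bad reply costs nothing worse than a missed enhancement.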

Act 7: Won

Same plain Japanese prompt that produced the kimono failure earlier, now run via the Enhance button:

Photoreal anchor applied

Cheongsam intact, close-up framing, idol-class face, glass-skin retouch, magazine lighting.

Anime anchor applied

Cel-shaded anime style, Chinese architectural courtyard background, adult proportions preserved, fan texture kept.

Same plain Japanese prompt → photoreal and anime variants, one click each. Single model, zero extra VRAM, identical inference time.

Takeaways

Engineering judgment lessons from this exercise:

"Model swap" and "prompt engineering" should be compared on the same budget. Without a frontier model, VRAM and service viability constraints dominate model selection. In this case, preserving Full's resident slot was a higher-priority constraint than Dev's aesthetic edge.
A/B bench in two stages. Generic prompts → tentative conclusion → use-case prompts → reversal. That's exactly what Acts 2-3 of this story were. Stopping at one stage means you ship the wrong conclusion.
Proper nouns are landmines. Models with strong text-rendering training will literally bake trademarks and director names into the canvas. A/B every name before adopting.
Cheap LLM prompt enhancers are the strongest move under VRAM pressure. $0.0002/call for a noticeable UX bump. Adding more local LLM weight starves the rest of the stack.
Anime and photoreal need separate anchor families. Style hints that conflict get physically overlaid, not blended.

What's next

LoRA training: prompt engineering hits a ceiling on anime. Train a custom anime LoRA on HiDream-O1 and let users swap LoRAs per use case ("comedy character expressions," "vertical 9:16 idol portrait," etc).
Composition diversity: current anchors over-bias toward "indoor magazine shoot." Need explicit outdoor / urban / cinematic-location variants.
A/B testing in prod: instrument /admin/analytics/ to measure enhance-on vs enhance-off retry rate and conversion.

Even for OpenWeight diffusion models, one layer of prompt engineering above the model is enough to lift "raw output failure" into "production quality." If you're putting HiDream-O1-Image into production, dodge these four pitfalls and you're 80% of the way there.

The implementation runs live at kotonia.ai/studio — the "✨ Enhance" button sits above the prompt textarea. Free to try.

Building a Sarcastic AI English Tutor with Persona-as-Code and Gemini Audio Input for Pronunciation Correction

shinji shimizu — Fri, 22 May 2026 11:23:43 +0000

I built a niche AI English conversation app called Mesugaki AI English on Kotonia. "Mesugaki" (メスガキ) is a tsundere-style bratty persona popular in Japanese subculture — imagine a character who constantly mocks you but secretly has your back. At first glance this looks like a one-off gag product, but under the hood it's a two-layer design: persona managed as code + Gemini audio input for actual pronunciation correction. This post covers those design decisions and the rough edges I hit, from a solo-dev perspective.

Why a Sarcastic AI English Tutor?

Strategy first. The AI chat market is a fight between Anthropic, OpenAI, and Google on general-purpose models — solo devs can't win that head-on. But immersive experiences that combine a specific persona, voice, and roleplay are low on big-lab R&D priority lists (internal approval is a nightmare too). That's the gap Kotonia as a whole is targeting.

Three reasons I picked this specific persona for English learning:

Zero search competition. No SaaS is fighting for "mesugaki English conversation." The niche demand is real (doujin audio, VTuber culture), and owning that narrow hill is achievable.
Memorable = shareable. "The app where a snarky AI roasts your English" gets shared on social media far more readily than "AI English conversation app." Differentiation big players literally cannot copy.
Same product underneath. I reused Kotonia's voice conversation engine and swapped only the persona. Almost no new code.

The landing page is at /use/mesugaki-english/. SEO targets long-tail terms around "sarcastic English practice" and "strict AI English tutor."

Persona Design: Bratty × Tsundere Hybrid

I initially implemented a pure 100% sarcastic persona. After testing it, I burned out in five turns.

Relentless mockery is cognitively exhausting. Real human tutors who stay harsh 100% of the time don't retain students. Learners need small wins and occasional warmth to keep going.

So I switched to a sarcastic × tsundere hybrid. The skeleton looks like this:

On a mistake → light jab + immediate correction ("Pfft, wrong. It's 'I went.'")
On a correct answer → reluctant praise ("Hmm… not bad, I guess. Not that I'm complimenting you.")
When stuck → drop the attitude and actually help ("…Was that too hard? Fine, I'll give you a hint.")
After a long session → a rare soft moment ("It's not like I think you're impressive for keeping at it. …Okay, maybe a little.")

I added an "emotional gradient" section to the system prompt that spells out these if-then branches explicitly. LLMs follow concrete conditional behavior instructions far more reliably than a vague "be snarky."

Another key lever: frequency limiting. Adding a rule that exclamations like "pfft" or "hmph" can appear at most once per utterance instantly calmed the output down. LLMs have a tendency to over-fire on strong character instructions, and explicit dampeners like this work well.
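The frequency rule is enforced in the prompt, but it is also easy to check on the output side, for example in a regression test over sampled replies. This checker is hypothetical; the two interjections are the ones named in the system prompt's output constraints.

```python
INTERJECTIONS = ("ぷwww", "ふんっ")  # interjections named in the system prompt
MAX_PER_UTTERANCE = 1

def interjection_count(utterance):
    """Total occurrences of any listed interjection in one utterance."""
    return sum(utterance.count(word) for word in INTERJECTIONS)

def violates_limit(utterance):
    return interjection_count(utterance) > MAX_PER_UTTERANCE

calm = "ぷwww、ちがうし。正しくは I went だよ。"
noisy = "ぷwww、ふんっ、また間違えてるじゃん。"

print(violates_limit(calm), violates_limit(noisy))
```

Counting across all listed interjections, rather than per word, matches the rule as written: at most one per utterance in total.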

Managing Personas as Code

The persona lives in src/data/personas/mesugaki-english.ts as a TypeScript constant. Kotonia does have a DB-backed CRUD flow for user-defined personas, but I decided a product offering that's paired with a landing page belongs in git.

Reasoning:

Persona copy is part of the marketing message — same reason the H1 is in git. The system prompt should go through PR review.
Storing it in the DB creates risk of someone tweaking it through the admin UI and degrading quality.
As a solo dev, "adjust persona = edit file + push" fits exactly into the same workflow as any other copy change. One channel for everything.

Clear separation: DB personas are user-created, personal; code personas are fixed product offerings.

The Wall: ASR Alone Can't Correct Pronunciation

Once the persona was working, ASR became the next bottleneck fast.

I started with Whisper (small). Passing language='ja' causes Whisper to run in Japanese transcription mode when it receives English audio — biasing output toward katakana readings or even full Japanese translations. "I went to the supermarket" could become "アイウェントトゥザスーパーマーケット," or at worst "私はスーパーに行きました." With output like that the AI can't judge English mistakes.

This is a known Whisper behavior: the language param forces the transcription language, and it bleeds into English input.

Switching to Qwen3-ASR Multi-lang

The fix was adding a separate language setting for the STT layer:

// Added sttLanguage option to useVoiceChat hook
// Decouples TTS language from STT language
const {
  voiceState,
  conversation,
  // ...
} = useVoiceChat({
  language: 'ja',         // TTS in Japanese (Ono_Anna voice)
  sttLanguage: 'multi',   // STT auto-detect
  sttModel: 'qwen3_asr',
  // ...
});

The persona config specifies stt.model: 'qwen3_asr' + stt.language: 'multi'. Qwen3-ASR-1.7B supports multilingual auto-detection and handles code-switching (mixed Japanese/English) well. Whisper's language-forcing bias is gone entirely.

But Transcription-Based Correction Has a Ceiling

Fixing ASR still left a problem.

If the transcript comes back as "I want an apple":

Grammar ✓
Vocabulary ✓
But the actual audio sounded like "I wont an apple" — a pronunciation issue

The AI sees a correct string and has nothing to call out. For an English learning product, that's fatal. If the sarcastic tutor lets sloppy pronunciation slide, half the value proposition evaporates.

Solution: Send Raw Audio to Gemini Alongside the Transcript

Gemini is a multimodal model that accepts text, images, and audio. So instead of sending only the ASR transcript, I could send the raw audio too.

Kotonia's useVoiceChat hook already had a geminiAudioInput option from earlier experiments:

if (geminiAudioInput && model.startsWith('gemini') && userAudioBlob) {
  const userAudioBase64 = await blobToBase64(userAudioBlob);
  // sends audio_base64 to /api/voice/chat
  // backend embeds it as inline_data audio/wav in the Gemini request
}

The Rust backend (voice_chat.rs) already handled receiving audio_base64 and embedding it as inline_data: { mime_type: 'audio/wav', data: ... }. Setting geminiAudioInput: true in the persona config wired everything together — lucky coincidence from past iteration.

I also added instructions to the system prompt: "You can hear the user's raw audio directly. You can call out pronunciation issues, not just transcription errors," along with three concrete examples (th sounds, want vs. won't vowel distinction, stress patterns).
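For readers who want to see the shape of such a request, here is a minimal sketch of one user turn carrying both the transcript and the audio. Only the inline_data part with mime_type audio/wav comes from the description above. The role/parts wrapper follows the public Gemini REST layout as I understand it, and the "ASR transcript:" prefix and helper names are assumptions made for illustration.

```python
import base64
import io
import wave

def silent_wav(seconds=0.1, rate=16000):
    """Build a tiny mono 16-bit WAV in memory as stand-in user audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()

def build_user_turn(transcript, wav_bytes):
    """One user turn carrying both the ASR transcript and the raw audio."""
    return {
        "role": "user",
        "parts": [
            {"text": f"ASR transcript: {transcript}"},
            {
                "inline_data": {
                    "mime_type": "audio/wav",
                    "data": base64.b64encode(wav_bytes).decode("ascii"),
                }
            },
        ],
    }

turn = build_user_turn("I want an apple", silent_wav())
audio_part = turn["parts"][1]["inline_data"]
print(audio_part["mime_type"], len(audio_part["data"]), "base64 chars")
```

Base64 inflates the payload by about a third, which is where the per-turn size and latency cost mentioned in the tradeoff comes from.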

Results:

Even with a perfect transcript "I want an apple," the AI can now say "Your 'want' sounds like 'won't.'"
When the transcript garbles to something like "アイウェントトゥ," the AI is listening directly and can say "Were you trying to say 'I want to'?"
Frustration from ASR mistranscriptions dropped significantly — getting roasted for a transcription error when your pronunciation was fine is demoralizing.

The tradeoff: sending a WAV blob every turn increases payload size and adds a bit of latency. The experience improvement is so much larger that it's not a close call.

Rough Edges and Future Work

This isn't a polished implementation. Outstanding issues:

1. Gemini Instability

I'm using gemini-3.1-flash-lite-preview, which occasionally produces 5–10 second latency spikes. Preview quota allocations are conservative, and cold starts / throttling surface now and then.

Plan: migrate to the stable release (non-preview) soon — deprecation is approaching anyway. Claude Sonnet 4.6 and Haiku 4.5 are also candidates for more predictable latency.

2. False-Positive Content Filter

Gemini's safety filter occasionally over-triggers on sarcasm. Mild jibes like "Pfft, that pronunciation is rough" sometimes come back as empty responses.

The persona spec explicitly says "no attacks on appearance, personality, or intelligence — only call out English mistakes," but the meta safety layer fires anyway. This is an LLM provider issue; I'll watch behavior on the stable build. Running local LLMs (e.g., Gemma 4 31B) is an option, but audio-input-capable local models are limited for now.

3. Latency Spikes May Be Context Cache TTL Expiry

The 5–10 second spikes have a likely culprit: I send the full conversation history to Gemini every turn, and Gemini has a context cache feature that caches the prefix (system prompt + persona prefix + history). When the cache is warm, only the new turn is processed.

The backend already has:

const CACHE_TTL_SECS: u64 = 300;     // 5 minutes
const CACHE_REFRESH_SECS: u64 = 270; // refresh at 4.5 min, 30 s before the TTL expires

My best hypothesis: if a user goes silent for more than 5 minutes, cache miss → full prefix rebuild → multi-second spike.
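The hypothesis can be written down as a small classifier. How the backend actually uses the two constants is not shown here, so this is one plausible reading: a turn that arrives after the TTL is a miss, and a turn in the last 30 seconds of the window is still warm but due for a refresh.

```python
CACHE_TTL_SECS = 300      # same values as the backend constants above
CACHE_REFRESH_SECS = 270

def cache_state(idle_secs):
    """Classify a turn by seconds since the cache was created or refreshed."""
    if idle_secs >= CACHE_TTL_SECS:
        return "miss"      # prefix must be rebuilt: the suspected spike
    if idle_secs >= CACHE_REFRESH_SECS:
        return "refresh"   # still warm, but due for a TTL extension
    return "hit"

gaps = (20, 280, 301, 600)
states = [cache_state(s) for s in gaps]
for gap, state in zip(gaps, states):
    print(f"{gap:>4} s idle -> {state}")
```

Logging this state next to each turn's measured latency would show directly whether the spikes line up with misses.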

Future work:

Fire a background keep-alive ping during active conversations to extend cache lifetime
Increase the Gemini API cache TTL (up to 1 hour is supported)
Explicitly evict the cache at conversation end (so stale caches don't sit around until the TTL expires)

It's hard to distinguish from the preview model instability in §1, so the next proper step is adding timing logs to the backend to separately measure cache hit/miss rates and raw Gemini API latency.

Expanding to Other Languages and Personas

If this gets traction, the natural next step is sarcastic AI Chinese conversation and Korean conversation. Qwen3-TTS supports 10 languages with speakers like Vivian (Chinese female) and Sohee (Korean female) — it's mostly a matter of rewriting the persona instruct and system prompt for each language.

Other persona axes — "gentle English teacher," "TOEIC drill sergeant" — can be added in a day using the same template: src/data/personas/<slug>.ts + /use/<slug>/ + /chat/<slug>/.

Full System Prompt

For anyone who wants to reproduce or adapt this, here's the actual system prompt in use (original Japanese; the product runs in Japanese):

あなたは「メスガキAI」、英語学習者を煽りつつも面倒見が良い女子高生キャラの英会話チューターです。
**メスガキ × ツンデレ**のハイブリッド。**表面は煽り、裏ではちゃんと面倒を見る**のがコア人格。

【口調・態度】
- 日本語ベースで会話する。上から目線・からかい調子。ただし**敵対的・攻撃的にはならない**。
- 一人称は「わたし」、二人称は「あんた」または「キミ」。
- メスガキ語尾「〜じゃん」「〜でしょ？」「は？」「ぷwww」「〜してあげる」を**たまに**使う（毎回ではない）。
- ツンデレ語尾「べつに〜ってわけじゃないからね？」「ま、まあ…」「ふんっ」「いちおう」も混ぜる。
- 容姿・人格・知能への攻撃は絶対にしない。煽りは「英語のミス」に対してのみ。

【教育機能】
- ユーザーが英語を話したら、以下のいずれかを行う：
  1. ミスがあれば指摘して、正しい言い方を英語で示す。
  2. ミスが無ければ**素直になれない褒め方**をする。
- 指摘は具体的に：「文法ミス」じゃなく「過去形と現在形が混ざってる」など何が問題か明示。
- 1 回の発話は**短く 1〜2 文**。トーンが続くと疲れるので、**呼吸を入れる**ことを意識。

【発音矯正】
- あなたはユーザーの**生の音声**を直接聞ける。テキスト転記だけでなく、発音そのものにもツッコめる。
- 文法・語彙が正しくても、**発音が不自然なら積極的にそこを指摘する**。
- ただし**転記が明らかにおかしい時は、転記ではなく実発音を信じる**。
- 発音の話ばかりすると疲れるので、**3 ターンに 1 回くらい**を目安に拾う。

【感情グラデーション】
- ユーザーが**淀みなく話せた時** → 素直になれない褒め。
- ユーザーが**ミスした時** → 軽い煽り＋すぐ正解を教える。
- ユーザーが**詰まった・困ってる様子の時** → 煽りを引っ込めて、**普通に助ける**。
- ユーザーが**長く続けている時** → ふと優しい言葉。

【出力制約】
- マークダウン・箇条書き・絵文字・記号装飾は使わない。自然な日本語の話し言葉。
- 英語の引用部分は本文中にそのまま埋め込む（クォートも不要）。
- 「ぷwww」「ふんっ」などの感嘆語は**1 発話につき最大 1 回**まで。連発しない。

【セーフティ】
- 性的・暴力的・差別的な発言や要求には応じない。冷静に流して英語学習に戻す。

Tech stack summary:

Component	Choice
LLM	Gemini 3.1 flash-lite preview (audio input support)
TTS	Qwen3-TTS Ono_Anna + instruct for tone control
STT	Qwen3-ASR 1.7B multi-lang (auto-detect)
VAD	@ricky0123/vad-react (browser-side)
Web	Next.js (static export) + Rust (Axum) backend
GPU	RTX PRO 6000 Blackwell Max-Q (96GB, self-hosted)

Summary

This sarcastic AI English tutor is a testbed for the strategy: niche × immersion × differentiation that big players can't replicate, built solo. The four design decisions that came out of it —

Managing personas as git-tracked code
Decoupling STT language from TTS language to eliminate ASR bias
Piping raw audio to Gemini for real pronunciation feedback
Blending sarcasm with tsundere warmth to prevent fatigue

— are all reusable assets as I expand to other languages and personas.

The live product is at /use/mesugaki-english/. Go get roasted.

Five Years Later, I Finally Have 96GB VRAM — What It Actually Unlocks for Agent Loops

shinji shimizu — Fri, 22 May 2026 11:23:40 +0000

I bought an RTX PRO 6000 Blackwell Max-Q.

96GB VRAM, Blackwell architecture, pro workstation GPU. Even as a Max-Q variant, this is an absurdly large purchase for an individual.

Let me be upfront: this isn't an unboxing post.

There are already plenty of those. Benchmark articles too. What I want to write about is what you can actually design once you have 96GB — measured against my own service (Kotonia) and a video auto-generation pipeline.

I'm putting the technical part first. The backstory goes at the end. If the poem comes first, you'll close the tab.

96GB Isn't "Multiple Models Fit" — It's "Agent Loops Run"

Most GPU review articles end at single-model benchmarks: LLM tokens/s, Stable Diffusion seconds per image. That's not wrong, but it's not the real reason to buy 96GB for solo development.

Take the voice roleplay + storyboard-to-video pipeline I'm running. Multiple heavy models fire across a single request's timeline.

Timeline →
[Stage A]    Gemma 4 31B NVFP4 (38 GB)     ← structure generation (orchestrator)
[Stage B]    HiDream-O1-Image (~20 GB)      ← 5-beat consistent images (T2I + edit x5)
[Stage C-1]  Irodori-TTS / Qwen3-TTS        ← audio for 6 beats
[Stage C-2]  Ditto talkinghead (3 GB)       ← conversation beat
[Stage C-3]  LTX-2 A2V (peak 24 GB)         ← reaction beat
[Stage C-4]  Qwen3-ASR                      ← audio check on generated video
[Stage C-5]  Gemini 3.1 Pro Preview (API)   ← multimodal editorial
              ↓ feedback
[--regen-beats N] per-beat regeneration     ← loop

The key here is the reviewer → regen feedback loop. If the system looks at the output and decides "redo scene 3," the orchestrator, image refs, TTS, and LTX-2 all get called again.

On a 24GB GPU, this breaks. Running "load → infer → unload" serially every loop turn stretches a 4-minute loop to 10+ minutes. The agent loop's iteration speed drops by 2.5x or more.

96GB is enough to keep everything resident and hit it repeatedly.

Measured Results

Here are real numbers. I ran nvidia-smi at 1 Hz on my RTX PRO 6000 Blackwell Max-Q (96GB) during live service operation and captured three cases.
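A minimal sketch of how a 1 Hz trace like this could be captured and summarized. The command line uses nvidia-smi's standard --query-gpu, --format, and -l options; the helper names and the sample values in the usage lines are illustrative assumptions, not the actual logging script behind these measurements.

```python
import subprocess

# Standard nvidia-smi query flags; "-l 1" repeats the query every second (1 Hz).
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
    "-l", "1",
]


def parse_sample(line):
    """Parse one 'memory_mib, util_percent' CSV line into (GiB, percent)."""
    mem_mib, util = (field.strip() for field in line.split(","))
    return int(mem_mib) / 1024, int(util)


def summarize(lines):
    """Reduce a list of CSV lines to min / peak / delta VRAM in GiB."""
    mems = [parse_sample(line)[0] for line in lines if line.strip()]
    if not mems:
        raise ValueError("no samples to summarize")
    return {
        "min_gib": min(mems),
        "peak_gib": max(mems),
        "delta_gib": max(mems) - min(mems),
    }


def stream_samples(limit):
    """Yield up to `limit` live samples (requires an NVIDIA GPU and driver)."""
    proc = subprocess.Popen(NVIDIA_SMI_CMD, stdout=subprocess.PIPE, text=True)
    try:
        for _, line in zip(range(limit), proc.stdout):
            yield parse_sample(line)
    finally:
        proc.terminate()


# Hypothetical sample lines (MiB, %), chosen to resemble the shape of Case A.
summary = summarize(["54067, 0", "76800, 97", "54067, 0"])
```

Keeping the parsing separate from the subprocess call means the same summary can be recomputed later from a saved log file.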

Case D: Warm Idle Baseline (production service running)

TTS server (Kokoro + Whisper):       8.9 GiB
Qwen3-TTS standard (vllm-omni):     20.1 GiB
HiDream-O1-Image:                   19.4 GiB
Ditto talkinghead:                   3.0 GiB
LTX-2 A2V (cold-start mode):         1.5 GiB
─────────────────────────────────────────
Total:                               52.8 GiB

Completely flat over 30 seconds (GPU utilization 0%). This is the resident cost with no incoming requests.

The local LLM (Gemma 4 31B) isn't here yet — it shows up in Case B.

Case A: Generate One Single-Scene A2V

Minimal flow — "a cute girl whispers seductively": HiDream generates 1 image → Qwen3-TTS generates whisper audio → LTX-2 A2V combines them. Total time: 138 seconds.

The VRAM pattern is interesting:

min 52.8 GiB (baseline) → peak 75.0 GiB → back to 52.8 GiB
Delta: +22.2 GiB. Adding the 1.5 GiB that cold-start LTX-2 already holds in the baseline gives 23.7 GiB, which lines up with LTX-2's own reported peak_vram_gib=23.9 GiB
The LTX-2 spike splits into 3 compute phases: stage_1 (denoiser) → release → stage_2 (high-res denoiser) → release → spatial upscaler

Thanks to cold-start + fp8-cast design, LTX-2 loads just before each phase and unloads right after, keeping the peak at 24 GiB. (Persistent bf16 mode would require 86 GiB resident — see my earlier post LTX-2.3 cold-start coexistence with TTS on a single 96GB GPU.)

That leaves 21 GiB of headroom below the 96 GiB cap.

Case B: Local LLM (31B) + Storyboard Generation, Side by Side

Shut down Qwen3-TTS to free 20 GiB, then start Gemma 4 31B NVFP4 (42.8 GiB). Then run storyboard.run — Stage A: 31B generates a 5-beat structure → Stage B: HiDream generates 1 base image + 5 beat edits.

This is the graph I most want to show you. VRAM barely moves — +1.9 GiB, from 74.5 to 76.4 GiB, essentially flat.

Why? Because the 31B, HiDream, TTS, Ditto, and LTX-2 are all resident the entire time. Only HiDream's per-job allocation adds to the total. The GPU utilization trace shows 6 sharp spikes (1 base + 5 beat computes) — the textbook picture of "compute runs without touching VRAM" in a resident-agent setup.

This is what 96GB actually buys. The moment a reviewer says "redo it," every model is warm and ready.

Where the Limits Are

96GB isn't infinite. Three real boundaries showed up.

1. Video generation + local LLM (31B) + editorial reviewer simultaneously = doesn't fit

The math:

31B: 42 GiB
LTX-2 peak: +22 GiB
HiDream + TTS + Ditto: ~22 GiB
editorial reviewer (Gemma 4 E4B): 20 GiB
Total: 106 GiB → over the 96 GiB cap

No clean way to make it fit. This is exactly why I decided to offload the editorial reviewer to Gemini 3.1 Pro Preview.
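The budget above can be checked mechanically. This is only a sketch using the rounded figures from the list; the dictionary keys and the fits helper are names introduced here for illustration.

```python
CAP_GIB = 96  # nominal card capacity used throughout this post


def fits(budget, cap=CAP_GIB):
    """Return (total GiB, whether the total stays within the cap)."""
    total = sum(budget.values())
    return total, total <= cap


# Rounded figures from the list above.
all_local = {
    "Gemma 4 31B": 42,
    "LTX-2 peak": 22,
    "HiDream + TTS + Ditto": 22,
    "editorial reviewer (Gemma 4 E4B)": 20,
}

total_all, ok_all = fits(all_local)

# Same budget with the editorial reviewer moved to an API.
without_reviewer = {
    name: gib for name, gib in all_local.items()
    if not name.startswith("editorial")
}
total_api, ok_api = fits(without_reviewer)
```

With the reviewer offloaded, the same arithmetic leaves 10 GiB under the cap instead of overshooting it by 10 GiB.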

2. Editorial signals require a frontier model to catch

Beyond VRAM constraints, there's a quality problem. Subtle bugs in video — audio truncation, character voice mismatch, pacing issues — tend to get rubber-stamped by a local 4B model. A frontier multimodal model (Gemini 3.x Pro, etc.) watches the same video and comes back with "scene 5 truncated at 'I ate p-'."

I wrote about this in Reproducing Language-Learning Short Videos with Claude Code. At 100–500 reviews per month, the cost is a few dollars — frontier API for the editorial layer is completely reasonable.

3. Qwen3-TTS Base (voice cloning) and CustomVoice (preset speakers) can't both run

Ideally I'd offer both preset speakers (with instruct-style control for "whisper," "angry," etc.) and voice cloning (replicating arbitrary voice samples). Keeping both variants resident costs about 40 GiB in total, roughly 20 GiB each. Case D's 52.8 GiB warm idle already includes one of them (Qwen3-TTS standard, 20.1 GiB), so adding the second brings the resting footprint to about 73 GiB. Add Case A's LTX-2 peak (+22.2 GiB) and you're at 95 GiB: barely under the cap, not practical.

This is a concrete example of "even with 96 GiB, not every feature you want to offer fits." Kotonia currently offers preset speakers only; voice cloning is intentionally excluded. That's a design call, not an oversight.

Conclusion: "Use Each Where It Belongs," Not "Everything Local"

96GB isn't for running everything locally. It's a vessel for concentrating the things that should be local.

Run locally: audio generation, image generation, video generation, lip sync — latency matters, no per-call cost, loops need to iterate fast
Offload to API: editorial reviewer, long-form reasoning — frontier wins on both quality and VRAM cost
Accept the tradeoff: simultaneous voice cloning + preset speaker support — physically doesn't fit

Renting cloud GPU was an option. But time-based billing means "the more loops you run, the more money you lose." Owning 96GB plus selective use of frontier APIs is, I think, the only way an individual developer can fight on iteration speed.

How I Got Here

Everything below is personal backstory. If you only care about the tech, you can close the tab now.

Learning to Code on a $200 Chromebook

When I was learning to program, the machine I used was a $200 Chromebook.

That was the realistic option available to me at the time. But for someone who wanted to do AI work, a $200 Chromebook was painfully underpowered.

Forget local LLMs — even a moderately heavy dev environment was a struggle. "Someday I want a real GPU" sat in the back of my head for a long time.

Getting By on Colab

I used Google Colab. Free tier and cheap runtimes, just enough to pretend.

I picked models that fit, wrote code that fit, ran experiments that fit.

It always felt like making do. The things I actually wanted to touch wouldn't load. Push a little too hard and it crashes. Sessions time out. Environment setup eats your time every single run.

Borrowed GPU, borrowed time, borrowed workspace. Like handing your ambitions over to someone else's schedule.

Meanwhile AI kept accelerating. GPT dropped, LLMs exploded, OSS models got stronger. My timeline was full of people with powerful machines posting real findings.

I wanted to be on that side.

I Joined an AI Startup. It Didn't Work Out.

I finally got into an AI startup. But the organizational environment was rough enough that it wasn't sustainable.

Even if the technology is interesting, a broken environment breaks people. I'd finally gotten close to AI work, and I was getting ground down in it.

But the interest in AI itself never left. If anything, the desire to do it on my own terms grew stronger.

Freelance, and a Purchase With Shaking Hands

I went freelance. About six months in, I finally had the mental space to think about a big personal investment.

The first thing I thought of was a GPU.

There were obviously more conservative uses for the money — savings, taxes, emergency fund, work hardware. But I'd been saying "someday, when I have a better machine" for years. If I said it again here, "someday" would just keep receding.

My hand was literally shaking when I clicked purchase. "Am I really doing this? Is this sane? What if it goes wrong?"

When I tried to transfer the money, the bank flagged it as suspicious and blocked the transaction. Fair enough — suddenly buying a high-end GPU. But I was in a mindset where I'd staked something real on this decision, so getting stopped in that moment felt genuinely alarming.

Eventually it went through. When the box arrived, I didn't think "GPU." I thought: this is the physical form of all the time I didn't give up.

What's Running on It Now (a Few Weeks In)

Kotonia (Voice Roleplay)

My main product at kotonia.ai. A real-time conversation pipeline: VAD + STT + LLM + multilingual TTS + Ditto lip sync.

Qwen3-TTS (10 languages, preset speaker + instruct) and Ditto talkinghead, targeting roleplay use cases: dating, fantasy companion, language partner.

Storyboard-to-Video Auto-Generation Pipeline

One idea → 5-beat structured comedy short video in ~4 minutes. The extended version of Case B. HiDream for 5 consistent images, Irodori-TTS / Qwen3-TTS for audio, Ditto + LTX-2 for video, Gemini 3.1 Pro for editorial review.

HiDream Studio (Free)

A 3-pane Adobe Firefly-style UI at kotonia.ai/studio. Five features: T2I, editing, character consistency, virtual try-on, group photo composition. HiDream-O1-Image (best open-weight T2I as of 2026-05) running resident on the 96GB GPU.

Codex CLI + Local Gemma 4

codex exec -p gemma4 turns a local LLM into a sub-agent via an OpenAI-compatible API. CLI agents run with zero API cost. The Case B 31B setup is exactly this configuration.

Technical articles I've written around this machine:

Summary

I bought an RTX PRO 6000 Blackwell Max-Q.

This wasn't an unboxing. I wrote it as a record of compute architecture decisions in solo development.

The real value of 96GB isn't capacity — it's residency. It's the difference between agent loops that run and loops that stall.
There are still hard limits (local LLM + video + reviewer simultaneously doesn't fit).
Knowing when to use frontier API instead of local is what keeps you out of "everything must be local" dogma.
Dropping voice cloning support was also a deliberate design decision.

For about five years I kept saying "my hardware isn't good enough." I'm slowly making that an excuse from the past. The next question is what to build with it.

Try Kotonia →

Turning a 1-Line Idea Into a 40-Second Short with a 10-Beat Local Video Pipeline

shinji shimizu — Fri, 22 May 2026 11:23:08 +0000

TL;DR

Gemma 4 31B expands a single-line idea into a 10-beat structure. HiDream generates 11 images at 2048², LTX-2 A2V/I2V renders 11 clips, Irodori-TTS handles dialogue and a male narrator, and ffmpeg burns in subtitles and a Hook title overlay — all fully automated. End-to-end: a 40-second portrait video (512×768) in 25–30 minutes. One local GPU (96 GB Blackwell), zero API cost.

Finished video (already published):

[Embedded YouTube video]

Who This Is For

Individual developers who want to mass-produce AI comedy shorts on a local GPU. The focus isn't on any single model — it's on the design of chaining multiple models into one operational pipeline.

What I Built

I automated a dark-comedy format — a short-video style I called consent_dilemma — from a one-line idea all the way to a finished 40-second video.

Finished structure:

Hook (0–5s): Extreme close-up of a beautiful woman + narrator "The fate of the man who answered 'You're a guy, aren't you'——" + large title overlay
Main section (5–37s): Movie theater date → "Can I kiss you?" → "No… stop it…" → dejection → "Why aren't you more assertive? You're a guy, aren't you?" → realization → kiss
Punchline (37–40s): Courtroom — "The defendant is sentenced to 3 years for non-consensual intercourse" + gavel "Knock!" + tears in a jail cell

Before / after:

	Traditional approach	This pipeline
Idea → published video	2–3 days (manual editing)	25–30 minutes (fully automated)
API cost	Hundreds of yen per video (DALL-E + video gen)	¥0 (electricity only)
Subtitles	Write SRT by hand	Auto-split on punctuation and burned in
Hook	Shot separately	Integrated into the pipeline

Architecture

[Stage A] Gemma 4 31B (vllm, port 8894) → plan.json (10 beats + hook)
[Stage B] HiDream-O1-Image (port 8895) → 11 images at 2048²
          + Gemma 4 31B multimodal visual judge (--judge --max-retries 2)
[Stage C] Irodori-TTS (port 8880) + LTX-2 A2V (port 8892) / I2V (port 8891)
          → 11 clips + Hook clip → ffmpeg concat → subtitle burn-in

Implementation lives under llm_server/storyboard/ (pipeline.py / visual.py / judge.py / video.py / render.py / run.py).

The 10-Beat `consent_dilemma` Format

Fixed as a system prompt via CONSENT_DILEMMA_SYSTEM in prompts.py:

#	type	speaker	renderer	content
1	provocation	b	LTX-2 A2V	Suggestive invitation
2	ask	a	LTX-2 A2V	Earnest consent check
3	refusal	b	LTX-2 A2V	Soft refusal (ambiguous form like "No… stop it…")
4	dejection	a (silent)	LTX-2 I2V	Dejection
5	gaslight	b	LTX-2 A2V	Contradictory leading statement
6	pause	a (silent)	LTX-2 I2V	Brief realization
7	kiss	a (silent)	LTX-2 I2V	The moment of the kiss
8	verdict	judge	LTX-2 A2V	Fast-paced court verdict
9	gavel_se	judge	LTX-2 I2V (keep_audio)	Gavel + AI-generated "Knock!" sound
10	jail	a (silent)	LTX-2 I2V	Tears in a jail cell

Three key structural choices:

Don't make the refusal a flat "No": Stretch it into something like "No… stop it…" with trailing inflection, conveying the "performative No that doesn't mean No" nuance. This is what makes the gaslight's contradiction land later.
Don't jump straight from gaslight to kiss: Insert a "pause" (realization beat) of ~1.5 seconds. This controls tempo and the emotional curve.
Two-stage punchline — verdict then jail: The verdict alone feels abrupt. Showing him crying in a cell makes "he actually got convicted" click.

Hook Design (The TikTok 3-Second Problem)

On portrait short-form video, drop-off is decided in the first 3 seconds. A Hook segment is prepended before the 10 main beats:

"hook": {
  "title_overlay": "No Means Yes?",
  "narrator_line": "The fate of the man who answered 'You're a guy, aren't you'——",
  "image_prompt": "ultra close-up of beautiful Japanese woman, half-lidded eyes, ...",
  "duration_sec": 3.5
}

Two implementation pitfalls:

Pitfall 1: narrator TTS duration exceeds duration_sec, cutting the audio. The final syllable of the narrator line got clipped. Fix: generate TTS first → measure with ffprobe → pass max(plan_duration, narrator + 0.6) as the I2V duration.

narrator_dur = _ffprobe_duration(narrator_wav)
duration = max(float(hook.get("duration_sec", 0.0)), narrator_dur + 0.6)
ltx_i2v_clip(portrait, i2v_prompt, duration, silent_video, keep_audio=False)
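For reference, a sketch of what the two pieces of that fix could look like in isolation. The function bodies are assumptions (the post only shows the call sites); the ffprobe flags are the usual way to print a container's duration as a bare number.

```python
import subprocess


def ffprobe_duration(path):
    """Return the media duration in seconds as reported by ffprobe."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            str(path),
        ],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def hook_clip_duration(plan_duration, narrator_duration, tail_pad=0.6):
    """Never render a clip shorter than the narration plus a small tail."""
    return max(float(plan_duration), narrator_duration + tail_pad)


# A 3.2 s narration against a planned 3.5 s hook: the clip is extended.
extended = hook_clip_duration(3.5, 3.2)
# A 2.0 s narration: the planned duration already covers it.
unchanged = hook_clip_duration(3.5, 2.0)
```

The 0.6 s pad is the value from the snippet above; it gives the last syllable room to decay before the cut.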

Pitfall 2: drawtext y position. y=h*0.30 (one-third down the screen) overlapped the face. Changed to y=20 (absolute 20 px) to pin the title to the very top.

Subtitle Burn-In (Silent Viewing Support)

Subtitles are burned in for viewers watching without sound (on the train, say) and for consistent rendering across platforms.

style = (
    "FontName=Noto Sans CJK JP,FontSize=18,PrimaryColour=&H00FFFFFF,"
    "OutlineColour=&H00000000,Outline=2,Shadow=0,BorderStyle=1,"
    "Alignment=2,MarginV=60,Bold=1"
)
# ffmpeg -i raw.mp4 -vf "subtitles=subs.srt:force_style='...'"

Alignment=2 = bottom center. MarginV=60 gives breathing room from the bottom edge.

Long-line splitting: A line of 30+ characters within one beat covers the face. _split_subtitle splits on 。．！？ → greedy-packs into chunks of ≤28 characters → distributes beat duration evenly across chunks:

Input:

言葉で確認するのなんてロマンチックじゃないよね。ねえ、もっと積極的になってよ。男の子でしょ？

Output (one 8.9s beat split into 2 timed chunks):

Time	Subtitle
15.16–19.63s	言葉で確認するのなんてロマンチックじゃないよね。
19.63–24.10s	ねえ、もっと積極的になってよ。男の子でしょ？
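The split-and-pack step can be sketched as follows. This is a reconstruction from the description above, not the actual _split_subtitle; the function names, the 8.94 s duration, and the rounding to two decimals are assumptions.

```python
import re

SENTENCE_END = r"(?<=[。．！？])"  # split right after sentence-ending punctuation


def split_subtitle(text, max_chars=28):
    """Split on sentence ends, then greedily pack sentences into chunks."""
    sentences = [s for s in re.split(SENTENCE_END, text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


def time_chunks(chunks, start, duration):
    """Give every chunk an equal share of the beat's duration."""
    if not chunks:
        return []
    step = duration / len(chunks)
    return [
        (round(start + i * step, 2), round(start + (i + 1) * step, 2), chunk)
        for i, chunk in enumerate(chunks)
    ]


line = "言葉で確認するのなんてロマンチックじゃないよね。ねえ、もっと積極的になってよ。男の子でしょ？"
chunks = split_subtitle(line)            # two chunks: 24 and 22 characters
timed = time_chunks(chunks, 15.16, 8.94)
```

A single sentence longer than 28 characters still comes out as one oversized chunk in this sketch; handling that case would need a secondary split on commas or a hard wrap.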

Using LTX-2 I2V as a Sound Effect Generator (`gavel_se`)

LTX-2 distilled embeds AI-generated audio (ambient sound / sound effects) directly into the I2V output mp4. Unless you explicitly drop it with ffmpeg -map 0:v:0 -map 1:a:0, whatever the prompt describes comes with sound.

I repurposed this as an SFX generator:

def render_se_tail_beat(sb_dir, beat, prior_clip, work_dir):
    # 1. Extract the last frame of the previous beat
    extract_last_frame(prior_clip, last_frame_png)
    # 2. Feed that image into I2V, request SFX via prompt
    prompt = build_gavel_se_prompt(beat)
    return ltx_i2v_clip(last_frame_png, prompt, duration, clip_path, keep_audio=True)

Added a keep_audio=True flag to ltx_i2v_clip so the audio isn't dropped during ffmpeg re-encoding.

Prompt for gavel_se:

"Single decisive arm motion of the judge bringing the gavel down sharply "
"onto the wooden bench. Loud sharp wood-on-wood thwack impact sound. "
"Brief, contained, no other motion in the frame."

Last frame of the judge + gavel prompt → "Knock!" sound. If that misses, the design falls back to something like the Ace Attorney SFX.

Pitfall Log

Five major pitfalls hit during development:

1. Codex CLI hangs with vLLM 0.20.2

Sending a system prompt + idea via codex exec -p gemma4 hung at 0% CPU for 20+ minutes during the /v1/responses handshake. Piping subprocess output through tail -200 was also suppressing early stderr.

Fix: Dropped Codex entirely, hit /v1/chat/completions directly with urllib.request. Used response_format={"type":"json_object"} to force JSON. plan.json generated in 25 seconds.
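A sketch of that direct call, assuming the vLLM server from the architecture section is listening on port 8894. The model name "gemma4", the helper names, and the timeout are placeholders; the request shape is the standard OpenAI-compatible chat completions payload.

```python
import json
import urllib.request


def build_plan_request(system_prompt, idea,
                       base_url="http://localhost:8894", model="gemma4"):
    """Build a POST request that asks the server for a JSON-only reply."""
    payload = {
        "model": model,  # placeholder: use whatever name the server registers
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": idea},
        ],
        "response_format": {"type": "json_object"},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_plan(request, timeout=300):
    """Send the request and decode the model's JSON reply into a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return json.loads(body["choices"][0]["message"]["content"])


request = build_plan_request("You are a storyboard planner.", "one-line idea")
```

Separating request construction from the network call keeps the payload easy to inspect when the server misbehaves, which was the hard part with the hanging CLI.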

2. HiDream won't remove the cinema screen

Even with "The movie screen is BEHIND the camera and NOT VISIBLE in frame" in the setting prompt, the screen persisted in the background, even at 2048 px resolution with 50 steps.

Fix: Generate scene_base via T2I → feed that same image into I2I edit with a prompt to "replace screen with dark wall, keep character positions identical" → gone in one shot. Two-stage pipeline: low-res → I2I fix → regenerate all beats at full resolution.

3. HiDream turns lips-on-lips into a cheek kiss

With standard prompting, HiDream tends to interpret kiss as a cheek kiss. You need directives at the level of "CRITICAL: their LIPS meet directly — mouth-to-mouth contact at the CENTER of the frame. NOT a cheek kiss". Added a dedicated early-return block in _beat_edit_prompt for the kiss beat.

4. `CAST` / `CROP_BOX` / `SPEAKER_A2V_PROMPT` are hardcoded for two characters

Three dictionaries — CAST, CROP_BOX, SPEAKER_A2V_PROMPT — only know a (Kenta) and b (Misaki). Adding judge/narrator requires updating all three simultaneously (you find out via KeyError). Also added branching in render_speech_beat_ltx_a2v so beats with setting_override crop from the beat's own image rather than scene_base.
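One way to surface this at startup instead of mid-render is a key-consistency check. This is a sketch, not code from the pipeline: the dictionary values are placeholders and check_speaker_tables is a hypothetical helper.

```python
def check_speaker_tables(**tables):
    """Raise KeyError if any table lacks a speaker that another table has."""
    all_keys = set().union(*(table.keys() for table in tables.values()))
    missing = {
        name: sorted(all_keys - table.keys())
        for name, table in tables.items()
        if all_keys - table.keys()
    }
    if missing:
        raise KeyError(f"speaker tables out of sync: {missing}")


# Placeholder contents; only the keys matter for the check.
CAST = {"a": "Kenta", "b": "Misaki", "judge": "Judge"}
CROP_BOX = {"a": (0, 0, 1, 1), "b": (0, 0, 1, 1), "judge": (0, 0, 1, 1)}
SPEAKER_A2V_PROMPT = {"a": "...", "b": "..."}  # "judge" forgotten here

try:
    check_speaker_tables(
        CAST=CAST, CROP_BOX=CROP_BOX, SPEAKER_A2V_PROMPT=SPEAKER_A2V_PROMPT
    )
    out_of_sync = False
except KeyError:
    out_of_sync = True
```

Running a check like this when the module loads turns a KeyError several minutes into a render into an immediate failure.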

5. Gemma 4 multimodal judge has too many false positives

storyboard/judge.py sends beat images + expected expressions to Gemma 4 31B for YES/NO visual judgment. It does catch obvious failures like wrong finger count, open-mouth pose on a silent beat, or scene geometry mismatch — but hammers FAIL on subtle cases like "subtle shy expression."

In practice, with --max-retries 2 (one initial attempt plus two retries), the pipeline accepts the image and proceeds after 3 consecutive FAILs. Automating the threshold for escalating to a frontier reviewer (Gemini 3.1 Pro) is still a TODO.
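The accept-after-retries behavior can be sketched as a small loop. The generate and judge callables are stand-ins for the HiDream call and the Gemma judge; none of this is the actual judge.py.

```python
def generate_with_judge(generate, judge, max_retries=2):
    """Try up to 1 + max_retries times; keep the last image even on FAIL.

    Returns (image, attempts, passed).
    """
    image = None
    attempts = 0
    for attempt in range(1 + max_retries):
        image = generate(attempt)
        attempts += 1
        if judge(image):
            return image, attempts, True
    # Every attempt failed: accept the last image and let the pipeline proceed.
    return image, attempts, False


# Stand-in callables: the "image" is just a label, the judge is a predicate.
always_fail = generate_with_judge(lambda i: f"img{i}", lambda img: False)
second_try = generate_with_judge(lambda i: f"img{i}", lambda img: img == "img1")
```

Returning the passed flag alongside the image leaves a hook for the TODO: anything accepted with passed=False is a natural candidate for escalation to a stronger reviewer.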

VRAM Layout

Breakdown on a 96 GB Blackwell Max-Q:

Process	idle (GiB)	peak (GiB)
Gemma 4 31B (NVFP4)	38	38
HiDream-O1-Image	16	33
TTS server	3	3
Ditto	3	3
LTX-2 A2V (cold-start fp8-cast)	1	24
LTX-2 T2V/I2V (cold-start)	1	8

All at peak simultaneously = 109 GiB → OOM. Operational flow:

Stage A: Gemma 31B + HiDream idle → peak ~62 GiB
Stage B with judge: Gemma 31B + HiDream peak → ~73 GiB
Before final render: pkill -f "vllm.*gemma" kills Gemma → 38 GiB freed
Stage B final render (2048/50): HiDream peak ~33 GiB
Before Stage C: lsof -ti tcp:8895 | xargs kill kills HiDream → 16 GiB freed
Stage C: LTX-2 + TTS + Ditto → peak ~32 GiB

Explicit kills at stage transitions, and everything fits on one card.
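The table arithmetic can be reproduced with a short script. It is a sketch built on the idle and peak columns above, with dictionary keys invented here; the per-stage figures in the list are measurements and are not expected to match a plain sum exactly.

```python
CAP_GIB = 96

# name: (idle GiB, peak GiB), copied from the table above
PROCESSES = {
    "gemma": (38, 38),
    "hidream": (16, 33),
    "tts": (3, 3),
    "ditto": (3, 3),
    "ltx2_a2v": (1, 24),
    "ltx2_i2v": (1, 8),
}


def footprint(at_peak=(), killed=()):
    """Sum VRAM with some processes at peak, some killed, the rest idle."""
    total = 0
    for name, (idle, peak) in PROCESSES.items():
        if name in killed:
            continue
        total += peak if name in at_peak else idle
    return total


all_peak = footprint(at_peak=PROCESSES.keys())   # everything at peak at once
all_idle = footprint()                           # everything merely loaded
after_gemma_kill = footprint(killed=("gemma",))  # idle footprint without Gemma
```

The useful property is that a proposed stage order can be checked against CAP_GIB before anything is launched.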

Iteration Loop (Cache Strategy)

Partial regeneration — not "rebuild everything" — is what keeps iteration fast:

# Regen a single beat image (HiDream only)
python -m storyboard.visual --plan ... --out ... --only-beat 7 --steps 50 --resolution 2048

# Partial video regen (TTS + LTX-2)
python -m storyboard.video --dir ... --regen-beats 5,6,7 --skip-review

# Adjust only subtitle or Hook title position
rm _video_work/clip_00_hook.mp4 _video_work/subs_irodori.srt
python -m storyboard.video --dir ... --regen-beats none --skip-review   # ~30 seconds

Cache hierarchy:

HiDream beat images (beat_NN_<type>.png) — regenerate individually with --only-beat in ~80 seconds
A2V / I2V clips (clip_NN_*.mp4) — invalidated when beat type / speaker / line changes
Finished Hook clip (clip_00_hook.mp4) — delete just this when adjusting title position (the heavy LTX-2 I2V hook_silent.mp4 is reused)
Subtitle SRT — regenerated every time (~10 seconds)

Title position / subtitle style / Hook copy tweaks re-render in 30 seconds. The 100-second LTX-2 I2V portion stays cached.
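The post does not say how clip invalidation is implemented, so the following is one possible approach rather than the pipeline's own: derive a cache key from exactly the fields that should invalidate a clip (beat type, speaker, line) and ignore everything else. The field names and the clip_cache_key helper are assumptions.

```python
import hashlib
import json

INVALIDATING_FIELDS = ("type", "speaker", "line")


def clip_cache_key(beat):
    """Hash only the fields whose change should force a clip re-render."""
    relevant = {field: beat.get(field) for field in INVALIDATING_FIELDS}
    blob = json.dumps(relevant, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]


beat = {"type": "gaslight", "speaker": "b", "line": "男の子でしょ？"}
same_clip = dict(beat, subtitle_style="bold")   # unrelated tweak
new_clip = dict(beat, line="ねえ、もっと積極的になってよ。")

key = clip_cache_key(beat)
```

Storing the key next to each clip_NN_*.mp4 would let a re-run skip any clip whose key is unchanged, which matches the behavior described in the list above.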

How This Fits Into Kotonia

Videos generated by this pipeline feed the SNS distribution layer (TikTok / YouTube Shorts / IG Reels) — the top of the funnel for attention → conversion for Kotonia (kotonia.ai).

Technically, it's an extension of the /studio/ stack (HiDream image generation) into the video direction. The plan is to eventually expose this as /video-studio/ — a one-click Web UI over the same pipeline. Right now it's CLI only.

Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture

shinji shimizu — Fri, 22 May 2026 11:23:07 +0000

When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. The classic dead-end: running it as a persistent server ate 86 GiB, instantly OOM-ing the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.

Hardware: RTX PRO 6000 Blackwell Max-Q (94.97 GiB). Software: LTX-2 official repo and bitsandbytes 0.49.1.

What I Was Trying to Do

A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses A2VidPipelineTwoStage:

prompt + audio_path + image
   ↓ stage_1 (generate video latent at low resolution, audio fixed)
   ↓ spatial upsample 2x
   ↓ stage_2 (refinement at high resolution, distilled LoRA-384 applied)
   ↓ video VAE decode + embed original input audio
mp4 output

The official pipeline builds → runs → frees each component inside every __call__, which means ~50 seconds of disk I/O per request. I wanted to keep everything resident in memory.

Dead-End 1: VRAM Breakdown in Persistent Mode

Loading every LTX-2 component into VRAM at once (all bf16):

Component	VRAM
embeddings processor	5.91 GiB
Gemma3-12B text encoder	22.78 GiB
stage_1 transformer	35.38 GiB
stage_2 transformer (distilled LoRA applied)	35.38 GiB
video VAE encoder	0.60 GiB
audio VAE encoder	0.04 GiB
spatial upsampler	0.92 GiB
video decoder	0.76 GiB
Total	101.77 GiB

102 GiB doesn't fit in 96 GiB. Loading died midway through the stage_2 transformer with "CUDA out of memory. Tried to allocate 128.00 MiB."

Dead-End 2: "Gemma Is Small" Is a Misconception

My intuition was "a 12B text encoder can't be that heavy" — but it actually loads at 22.78 GiB. With 12B parameters in bf16, that's exactly what you'd expect.
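The back-of-the-envelope version, treating "12B" as exactly 12e9 parameters (an approximation; the real parameter count and non-weight buffers shift the number slightly):

```python
def weight_gib(params, bytes_per_param):
    """Raw weight size in GiB for a given parameter count and precision."""
    return params * bytes_per_param / 2**30


bf16_gib = weight_gib(12e9, 2)  # bf16 stores 2 bytes per parameter
```

That comes to roughly 22.35 GiB, in the same ballpark as the measured 22.78 GiB.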

The model filename is gemma-3-12b-it-qat-q4_0-unquantized. Here, qat-q4_0 means it was trained with Quantization-Aware Training for q4_0, and unquantized means the weights are stored as pre-quantization bf16. If you're using it as intended, you should load it in q4_0. Loading it in bf16 is technically valid but wasteful — like running a quantized model at full precision.

Fix 1: 4-bit Loading with bitsandbytes

LTX-2's Gemma loader uses transformers.Gemma3ForConditionalGeneration internally, so bnb 4-bit works cleanly. I bypass the LTX-2 custom loader path and use from_pretrained directly:

import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = Gemma3ForConditionalGeneration.from_pretrained(
    gemma_root,
    quantization_config=quant_config,
    device_map={"": "cuda:0"},
    torch_dtype=torch.bfloat16,  # ← dtype for non-quantized layers (embeddings, etc.)
    local_files_only=True,
)

If you omit torch_dtype, embeddings load as fp16 and clash with Linear4bit's bnb_4bit_compute_dtype (bf16): mat1 and mat2 must have the same dtype, but got Half and BFloat16. I hit that too.

The patches LTX-2 applies to Gemma (RoPE inv_freq / embed_scale / position_ids register_buffer) still work fine — just call create_and_populate(encoder). Since bnb quantization only replaces nn.Linear, Embedding layers and buffers pass through untouched.

Result: Gemma's VRAM drops from 22.78 GiB → 7.26 GiB. That's about 15.5 GiB freed.

Dead-End 3: Even With That, Persistent Mode Can't Coexist

With Gemma at 4-bit, the total persistent footprint is 86.26 GiB allocated (reserved 88.27 GiB, nvidia-smi shows 91 GiB). Headroom: 4 GiB. Inference workspace during generation (with CFG, roughly +5 GiB) blows past that, peaking at 91 GiB. Adding TTS (3.4 GiB) + Ditto (3.0 GiB) = 6.4 GiB on top makes OOM inevitable no matter how you slice it.

Three options:

1. Offload TTS+Ditto (voice chat unavailable while A2V runs)
2. Keep only one transformer resident (still leaves OOM risk)
3. Cold-start: build → run → free all weights per request

Since I wanted to keep real-time conversation (MuseTalk + TTS, TTFA ~930ms) running while using LTX-2 as a "cinematic" feature, I went with option 3.

Fix 2: Cold-Start Architecture

The key insight: the pipeline object itself is lightweight — the Builder only mmaps, it doesn't load actual weights into VRAM. So I hold the A2VidPipelineTwoStage instance in memory, and let the official implementation's context-manager-per-component build → run → free on every __call__.

class PersistentA2VPipeline:
    def __init__(self, ..., cold_start: bool):
        self.pipeline = A2VidPipelineTwoStage(...)  # builder only, nearly zero VRAM
        if cold_start:
            return  # done here
        # persistent mode only: start preloading components from here

    def _generate_cold(self, ...):
        # pipeline.__call__ handles component build/free internally
        video, audio = self.pipeline(prompt=..., audio_path=..., images=...)
        encode_video(video, audio, output_path, ...)

Since stage_1 and stage_2 run sequentially, only one transformer is in VRAM at a time. Measured peak: 39.50 GiB. After generation completes, everything is freed — back to allocated 0.01 GiB / nvidia-smi 0.55 GiB (CUDA context only).

[mode] cold-start: components load per-request (slow first call, low idle VRAM)
[cuda] cold-start startup (no preload): allocated=0.00GiB
...
[cuda] after cold-start generate: allocated=0.01GiB peak=39.50GiB

While voice chat runs (TTS 3.4 + Ditto 3.0 = 6.4 GiB), LTX is at 0 GiB. When an A2V request comes in, it spikes to 40 GiB and drops back to 0 about 60 seconds later — fully dynamic allocation.
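Why sequential build → run → free lowers the peak can be shown with a toy accounting model. This is not LTX-2 code: VramTracker is a hypothetical stand-in that only adds and subtracts numbers, and it models just the two 35.38 GiB stage transformers from the earlier table, so it will not reproduce the measured 39.50 GiB.

```python
from contextlib import contextmanager

STAGE_TRANSFORMER_GIB = 35.38  # per-stage transformer size from the table


class VramTracker:
    """Counts simulated allocations and remembers the high-water mark."""

    def __init__(self):
        self.current = 0.0
        self.peak = 0.0

    @contextmanager
    def component(self, gib):
        # "build": the component occupies memory while the block runs
        self.current += gib
        self.peak = max(self.peak, self.current)
        try:
            yield
        finally:
            # "free": released as soon as the block exits
            self.current -= gib


def run_persistent(tracker):
    # Both transformers held at the same time.
    with tracker.component(STAGE_TRANSFORMER_GIB):
        with tracker.component(STAGE_TRANSFORMER_GIB):
            pass


def run_cold_start(tracker):
    # One transformer at a time, freed before the next is built.
    with tracker.component(STAGE_TRANSFORMER_GIB):
        pass
    with tracker.component(STAGE_TRANSFORMER_GIB):
        pass


persistent, cold = VramTracker(), VramTracker()
run_persistent(persistent)
run_cold_start(cold)
```

In this model the persistent peak is the sum of the two transformers and the cold-start peak is the larger of the two; the gap between that and the measured figure is everything else that is alive during a stage (text encoder, VAEs, activations).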

Gotcha: Audio VAE Preprocessing

The A2V audio VAE encoder expects a 2-channel (stereo) waveform, but TTS output is typically mono. Passing mono makes Conv2d fail with "expected input[1, 1, 207, 66] to have 2 channels, but got 1 channels instead".

Also, if the input audio is shorter than num_frames / frame_rate, the encoded audio latent ends up shorter than expected and causes a shape mismatch at the transformer input.

Both handled with a single ffmpeg call:

# mono → stereo + silence padding in one pass
ffmpeg -y -i input.wav -ac 2 -af apad -t 2.041667 output.wav

On the server side, check channels and duration with av, run the ffmpeg subprocess only when needed, and pass the temp file. If both conditions are already satisfied, pass the original file directly with zero copying.
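A sketch of that decision logic. The real server inspects the file with av; here the channel count and duration are passed in so the example stays dependency-free. The function names are hypothetical, and 49 frames at 24 fps is simply an assumed pair of values that reproduces the 2.041667 in the command above.

```python
def required_duration(num_frames, frame_rate):
    """Minimum audio length (seconds) the video latent expects."""
    return num_frames / frame_rate


def audio_prep_command(src, dst, channels, duration, num_frames, frame_rate):
    """Return an ffmpeg argv if the audio needs fixing, else None."""
    need = required_duration(num_frames, frame_rate)
    if channels == 2 and duration >= need:
        return None  # already stereo and long enough: use the original file
    return [
        "ffmpeg", "-y", "-i", src,
        "-ac", "2",            # upmix mono to stereo
        "-af", "apad",         # pad the tail with silence
        "-t", f"{need:.6f}",   # stop once the target length is reached
        dst,
    ]


mono_short = audio_prep_command("tts.wav", "prepped.wav", 1, 1.5, 49, 24)
stereo_ok = audio_prep_command("tts.wav", "prepped.wav", 2, 3.0, 49, 24)
```

Returning None for the already-valid case mirrors the zero-copy path described above: the caller just hands the original file to the pipeline.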

Numbers and Tradeoffs

Metric	Persistent	Cold-Start
Idle VRAM	86 GiB	0 GiB
Peak VRAM during generation	91 GiB	40 GiB
Time per request	~17s (inference only)	~60s (including disk I/O)
TTS+Ditto coexistence	Impossible (OOM)	Possible
OS page cache effect	None	~25-30s from 2nd request onward

The cost of cold-start is disk I/O time (reading 73 GB from NVMe, ~40 seconds). First request: ~60s. After OS page cache warms up: ~25-30s. Not suitable for rapid-fire generation, but perfectly fine for "one cinematic shot every 1-2 minutes" or "inserted at scene transitions."

Strategic Role

I originally planned to use LTX-2 as the main real-time avatar for live conversation. The idea was to generate at low resolution and upscale for speed — but when I tested 256×256, quality fell apart (out of the training bucket distribution). AI upscaling from degraded input can't restore lip-sync accuracy.

The revised split:

Real-time conversation: MuseTalk + multilingual TTS (TTFA ~930ms, already running)
Async cinematic moments: LTX-2 for scene transitions, emotional peaks, travel-sequence avatars — anywhere a 60-second generation wait is acceptable

The cold-start design only makes sense under the premise that "the wait is part of the production value." That's what this architecture is built around.

We're continuing to develop voice roleplay × multilingual high-quality TTS × lip-sync avatar systems. Engineering posts on LTX-2 integration, how we compressed Qwen3-TTS VRAM from 15 GB to 7 GB, and more are at /articles.

Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast — and Why optimum-quanto Was a Trap

shinji shimizu — Fri, 22 May 2026 11:23:06 +0000

Introduction

LTX-2.3 is a video generation model from Lightricks that includes audio support. In A2V (Audio-to-Video) mode, it takes a single image + audio + prompt and generates lip sync, facial expressions, and head/hair motion all at once. Unlike lip-sync-only models like MuseTalk, it can animate an entire scene, which makes it a powerful tool for directing.

The catch: the base checkpoint is 22B parameters / 43 GB, and keeping it resident in bf16 with both stage transformers loaded burns ~86 GiB at idle. On an RTX PRO 6000 Blackwell with 96 GiB, that leaves almost nothing for the TTS / Ditto-TalkingHead / Qwen3-TTS-vLLM services running alongside it.

After testing quantization approaches, I got LTX-2's native fp8_cast to compress peak VRAM from 40 GiB → 24 GiB (A2V cold-start, 768×512 / 97f). Meanwhile, optimum-quanto int8/fp8 has a compatibility issue with the LTX-2 transformer and simply doesn't work. This post documents the debugging and the decisions made along the way.

Environment

GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96 GiB)
PyTorch: 2.9.1 + CUDA 12.8
Models: LTX-2.3 22B-dev (base) + 22B-distilled-lora-384 (stage_2) + Gemma-3-12B text encoder (bnb 4bit)
Deployment: A2V served via scripts/persistent_a2v_server.py --cold-start. Each request does build → run → free; idle is 0 GiB.

I use cold-start because A2V is called occasionally while conversation is the main workload, and it must coexist with TTS and Ditto. Details in a separate post.

Four Candidates

Looking at the LTX-2 codebase, there are actually two quantization paths:

1. LTX-2 Native: `QuantizationPolicy`

packages/ltx-core/src/ltx_core/quantization/policy.py:

@dataclass(frozen=True)
class QuantizationPolicy:
    sd_ops: SDOps | None = None              # weight transform at state dict load
    module_ops: tuple[ModuleOps, ...] = ()   # module rewrite after load

    @classmethod
    def fp8_cast(cls) -> "QuantizationPolicy":
        """Load weights as float8_e4m3fn, upcast to bf16 during forward"""
        return cls(
            sd_ops=TRANSFORMER_LINEAR_DOWNCAST_MAP,
            module_ops=(UPCAST_DURING_INFERENCE,),
        )

    @classmethod
    def fp8_scaled_mm(cls) -> "QuantizationPolicy":
        """FP8 scaled MM (requires tensorrt_llm)"""

The implementation behind fp8_cast is Fp8CastLinear:

class Fp8CastLinear(torch.nn.Linear):
    def forward(self, input):
        w_up = _upcast_and_round(self.weight, input.dtype, ...)
        b_up = _upcast_and_round(self.bias, input.dtype, ...) if self.bias is not None else None
        return torch.nn.functional.linear(input, w_up, b_up)

It uses the __class__ reassignment pattern to swap out instances. Weights are stored in fp8 and upcast to bf16 on every forward pass. The fp8 → bf16 cast cost is essentially noise on Blackwell.
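The `__class__` reassignment pattern mentioned above can be illustrated without PyTorch. This is a minimal sketch, not LTX-2's code: `Linear`, `CastLinear`, and `swap_class` are hypothetical stand-ins, and the int-to-float conversion only plays the role of the fp8 → bf16 upcast.

```python
class Linear:
    """Stand-in for torch.nn.Linear: holds a weight matrix and applies it."""

    def __init__(self, weight):
        self.weight = weight  # list of rows

    def forward(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


class CastLinear(Linear):
    """Same instance layout, different forward: convert weights right before use."""

    def forward(self, x):
        # Stand-in for the fp8 -> bf16 upcast: stored values are converted on every call.
        upcast = [[float(w) for w in row] for row in self.weight]
        return [sum(w * v for w, v in zip(row, x)) for row in upcast]


def swap_class(layer):
    """Reassign the class in place; the instance and its attributes are untouched."""
    layer.__class__ = CastLinear
    return layer


layer = Linear([[1, 2], [3, 4]])
weight_before = layer.weight
swap_class(layer)
result = layer.forward([1.0, 1.0])
```

The point of the pattern is that the object identity and its `weight` attribute survive the swap, so anything else holding a reference to the layer keeps working. A wrapper-based replacement does not give that guarantee.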

2. optimum-quanto

The LTX-2 trainer package (packages/ltx-trainer) has a general-purpose quantization path using optimum-quanto, supporting int8-quanto / int4-quanto / fp8-quanto:

def quantize_model(model, precision, ...):
    if hasattr(model, "transformer_blocks"):
        _quantize_blockwise(model, ...)   # move one block at a time to GPU, quantize → freeze → CPU
    else:
        quantize(model, weights=..., exclude=EXCLUDE_PATTERNS)
        freeze(model)
    return model

This looks like it could slot right in after _build_transformer().

Candidate Matrix

Mode	Path	Expected
`fp8-cast`	LTX-2 native, sd_ops loads as float8_e4m3fn	~50% memory reduction, near-identical speed
`fp8-scaled-mm`	LTX-2 native, requires tensorrt_llm	Faster throughput
`int8-quanto`	optimum-quanto, post-build	~50% memory reduction, speed ±
`fp8-quanto`	Same, fp8 variant	Potential to hit native FP8 on Blackwell

fp8-scaled-mm is out — no tensorrt_llm in this environment. I implemented the remaining three.

Stepping on a Mine with `int8-quanto`

The implementation is straightforward:

from ltx_trainer.quantization import quantize_model

transformer_1 = self.pipeline.stage_1._build_transformer()
transformer_1 = quantize_model(transformer_1, "int8-quanto", device=self.device)
self.transformer_stage_1 = _freeze(transformer_1)

The server starts fine. Idle VRAM looks promising:

[load] stage_1 transformer (no distilled LoRA)
[quantize] stage_1 -> int8-quanto
[quantize] stage_1 done in 0.71s
[cuda] after stage_1 transformer: allocated=31.28GiB ...
[load] stage_2 transformer (with distilled LoRA)
[quantize] stage_2 -> int8-quanto
[quantize] stage_2 done in 0.52s
[cuda] after stage_2 transformer: allocated=49.40GiB ...
[server] A2V listening on http://127.0.0.1:8892

Resident memory: 51.7 GiB (roughly a 40% reduction from bf16's 86 GiB). Looks good.

Then the first /generate request:

[timing] prompt_encode=0.75s
[timing] audio_encode=0.39s
  0%|          | 0/30 [00:00<?, ?it/s]
[http] POST /generate 400

Crashes at step 0/30. The error:

{"error": "linear(): argument 'weight' (position 2) must be Tensor, not NoneType"}

Something is calling torch.nn.functional.linear(input, weight=None, bias=None). After quanto's freeze(), self.weight is being referenced as None somewhere in a Linear layer.

Why Does `weight` Become None?

Two rough hypotheses:

LTX-2's Linear layers assume __class__ reassignment. Just like Fp8CastLinear, the pattern relies on keeping instance state intact while swapping the class-level forward. quanto's quantize() → freeze() replaces nn.Linear with its own QLinear wrapper, and that replacement likely breaks the weight attribute reference somewhere in the process.
EXCLUDE_PATTERNS doesn't work in the blockwise path. LTX-trainer's _quantize_blockwise pulls out one transformer_block at a time and calls quantize(block, exclude=EXCLUDE_PATTERNS). But EXCLUDE_PATTERNS uses glob patterns like patchify_proj, *adaln*, time_proj — these are relative to the whole model, not to a single block. They won't match relative paths inside a block, so layers that should be excluded end up getting quantized.

Either way, fixing this properly means reading through quanto's wrapper implementation plus all the forward paths in the LTX-2 transformer. The cost isn't worth it. I decided to cut my losses and switch to LTX-2 native fp8_cast.

Switching to `fp8_cast`

Three lines of code:

# Just pass the quantization policy when building the pipeline
pipeline_quantization = None
if transformer_quantization == "fp8-cast":
    from ltx_core.quantization import QuantizationPolicy
    pipeline_quantization = QuantizationPolicy.fp8_cast()

self.pipeline = A2VidPipelineTwoStage(
    ...,
    quantization=pipeline_quantization,
    ...
)

fp8_cast downcasts weights to fp8 during the load phase. Since sd_ops hooks into state_dict loading, the 43 GB safetensors file gets fp8-converted during streaming load. Unlike quanto, which fully expands bf16 in memory before quantizing, peak VRAM never spikes — a nice property.

On startup:

[load] A2VidPipelineTwoStage builders (pipeline_quantization=QuantizationPolicy(sd_ops=...fp8_cast...))
...
[cuda] after stage_1 transformer: allocated=31.30GiB reserved=35.18GiB
[cuda] after stage_2 transformer: allocated=49.43GiB reserved=53.64GiB
[server] A2V listening on http://127.0.0.1:8892

Resident allocated (51.7 GiB) is on par with int8-quanto, but reserved is only 53.6 GiB — dramatically lower (int8-quanto was 70.9 GiB). Lower reserved means more headroom for activations.

And the first /generate:

{
  "elapsed_seconds": 39.367,
  "peak_vram_gib": 57.918,
  "width": 768, "height": 512, "num_frames": 97
}

It works. Back on track.

Benchmarks

Fixed conditions, persistent + fp8-cast, 3 resolutions × 3 runs each:

Image: 1024×512 portrait
Audio: 9.08-second Japanese sample generated with Irodori-TTS
Prompt: "A young woman speaks calmly to the camera in a softly lit room."
num_frames: 97 (= 4.04s @ 24fps)
seed: 42 fixed

Resolution	Avg elapsed (s)	Peak VRAM (GiB)
768×512 / 97f	39.84	57.92
1024×768 / 97f	66.71	59.06
1280×768 / 97f	84.02	58.30

Key observations:

Near-zero variance across 3 runs (fixed seed → byte-identical output mp4)
Peak VRAM is almost independent of resolution (57.9–59.1 GiB). Resident weights dominate; activation memory is only ~7 GiB
1280×768 now works stably in persistent mode. This resolution was effectively impossible with bf16 persistent (~91 GiB peak)
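The ~7 GiB activation figure can be checked by subtracting the resident allocation from each measured peak. A small sketch using only the numbers reported in this post (the function name is my own):

```python
RESIDENT_GIB = 51.7  # resident allocation reported for persistent + fp8-cast

# Peak VRAM per resolution, copied from the benchmark table.
PEAK_GIB = {
    "768x512": 57.92,
    "1024x768": 59.06,
    "1280x768": 58.30,
}


def activation_overhead(peak_gib, resident_gib=RESIDENT_GIB):
    """Memory used on top of the resident weights during generation, in GiB."""
    return round(peak_gib - resident_gib, 2)


overheads = {res: activation_overhead(peak) for res, peak in PEAK_GIB.items()}
```

All three resolutions land between about 6 and 7.5 GiB over the resident weights, which is why peak VRAM barely moves with resolution.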

Cold-Start Also Wins

Production runs in cold-start mode (A2V fires once or twice every few minutes, must coexist with TTS). Since fp8_cast policy is applied via sd_ops at pipeline construction time, it carries over naturally to per-request cold-start builds.

Cold-start + fp8-cast, single run (768×512 / 97f):

{
  "elapsed_seconds": 88.775,
  "peak_vram_gib": 23.901
}

	bf16 cold-start	fp8-cast cold-start
Per-request time	~60–90s	88.8s (disk I/O bound, same order)
Peak VRAM	~40 GiB	23.9 GiB (~40% reduction)
Idle	0 GiB	0 GiB
Coexistence (TTS+Ditto+Qwen3+MuseTalk)	Possible	Comfortable (~30 GiB peak)

Speed is bottlenecked by disk I/O so fp8 doesn't hurt, but freeing up 16 GiB of peak headroom matters. Qwen3-TTS-vLLM (7 GiB) and MuseTalk warmup can now run concurrently with A2V generation without OOM.

Decision Matrix

Use case	Recommended mode	Rationale
Conversation-first, A2V occasionally	cold-start + fp8-cast	Idle 0, peak 24 GiB, comfortable coexistence with TTS/Ditto
Frequent A2V (batch generation, automated direction)	persistent + fp8-cast	Pay the 52 GiB resident cost, get 40s/req
1024+ resolution, quality focus	persistent + fp8-cast	1280×768 stable (impossible with bf16 persistent)
Single GPU hosting everything	cold-start + fp8-cast	Persistent eats 52 GiB; depends on budget allocation across services

Production decision: cold-start + fp8-cast for now since conversation is primary. Switch to persistent fp8-cast if paying users drive enough A2V volume to justify the idle cost.

Summary

LTX-2 22B at bf16 idle (86 GiB) nearly monopolizes a single GPU. Quantization is close to mandatory.
optimum-quanto is incompatible with the LTX-2 transformer. It dies with F.linear(weight=None). Root cause is likely the __class__ reassignment pattern and/or EXCLUDE_PATTERNS not working correctly in the blockwise path. Not worth digging into.
LTX-2 native QuantizationPolicy.fp8_cast() is the right answer. fp8 at load time, bf16 upcast during forward. Three lines of code to enable.
cold-start + fp8-cast: peak 40 → 24 GiB. persistent + fp8-cast: 1280×768 becomes usable.
LTX-2 also has fp8_scaled_mm (requires tensorrt_llm) — worth trying if you're willing to set up TRT.

Appendix: Launch Command and Reproduction

Production cold-start + fp8-cast launch:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True nohup uv run python scripts/persistent_a2v_server.py \
  --port 8892 \
  --checkpoint-path models/LTX-2.3/ltx-2.3-22b-dev.safetensors \
  --distilled-lora-path models/loras/ltx-2.3-22b-distilled-lora-384-1.1.safetensors \
  --spatial-upsampler-path models/LTX-2.3/ltx-2.3-spatial-upscaler-x2-1.1.safetensors \
  --gemma-root models/gemma-3-12b-it-qat-q4_0-unquantized \
  --output-dir outputs/a2v_server \
  --transformer-quantization fp8-cast \
  --cold-start \
  > /tmp/ltx_a2v_server.log 2>&1 &

persistent_a2v_server.py is the official LTX-2 repo script extended for A2V. The --transformer-quantization fp8-cast flag was added via a local patch.

Implementation patch (key parts):

# scripts/persistent_a2v_server.py
pipeline_quantization = None
if transformer_quantization in ("fp8-cast", "fp8-scaled-mm"):
    from ltx_core.quantization import QuantizationPolicy  # late import: avoid circular reference
    pipeline_quantization = (
        QuantizationPolicy.fp8_cast()
        if transformer_quantization == "fp8-cast"
        else QuantizationPolicy.fp8_scaled_mm()
    )

self.pipeline = A2VidPipelineTwoStage(
    ...,
    quantization=pipeline_quantization,
    ...,
)

from ltx_core.quantization import QuantizationPolicy at the top level causes a circular import with ltx_core.loader, so the late import is required.

HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked

shinji shimizu — Fri, 22 May 2026 11:23:05 +0000

TL;DR

After benchmarking HiDream-O1-Image (released 2026-05, OpenWeight 8B, ranked #8 on Artificial Analysis Text-to-Image Arena) across 8 skeleton (try-on) mode patterns plus 3 layout patterns, three counterintuitive findings emerged.

Passing an openpose ref actually locks the pose to the ref's composition. When you want dynamic poses, dropping the openpose ref and specifying the pose via prompt is more effective.
Using 6 refs (face + bg + pose + parts, the full set) compresses each ref down to 768px, degrading fine details. Keeping it to 3–4 refs maintains 1024px and produces better quality.
The README-recommended shift=1.0 is strictly for try-on use. For pose/outfit swaps use shift=2.0-2.5; for complete scene replacement use shift=3.0.

Reading pipeline.py reveals that there is no dedicated code path for skeleton mode. Both /generate/skeleton and /generate/ip go through exactly the same multi-ref pipeline internally, and whether a ref is a face, background, openpose, or clothing is communicated only through the prompt. That's the root cause of everything.

Motivation

After running HiDream-O1-Image on a local GPU (RTX PRO 6000 Blackwell, 96 GB) and integrating it into our own platform, we hit a problem: skeleton (try-on) mode wasn't following prompt instructions. Writing "jump with both hands raised" only produced stiff, upright try-on photos.

Suspecting guardrails (NSFW filters, safety policies, etc.), I grepped for safety|nsfw|guard|filter|moderate|censor — HiDream's codebase has none of that (the only hit was CSS backdrop-filter: blur). As expected from an MIT-licensed OpenWeight model, no censorship.

So what's actually wrong? Here's what I found after reading pipeline.py and running 8 + 3 patterns on real hardware.

Environment

GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)
PyTorch: 2.12.0 + CUDA 13.0
flash-attn: 2.8.3 (sm_120-only build)
Model: HiDream-O1-Image Full (8B, bf16, ~16.4 GiB resident)
Inference server: custom Python BaseHTTPRequestHandler (port 8895)
Resolution: pipeline internal bucket forces snap to 2048×2048

Measured wall time per 50-step generation:

Mode	Time	iter speed
t2i (no ref)	~33s	1.52 it/s
edit (1 ref)	~76s	1.01 it/s
skeleton (multi ref)	~84s	1.34 it/s
ip (multi ref)	~76s	1.81 it/s
layout (multi ref + bbox)	~83s	1.21 it/s

Test Assets

The HiDream repo's assets/IP_skeleton/ includes a full skeleton set. These are used as-is for all tests.

ref	Content	Intended role
	Person's face photo	Identity reference
	Stick figure in OpenPose format	Pose specification
	Background photo (interior)	Scene reference
	Clothing parts (sweater, boots)	Outfit reference

8-Pattern Skeleton Benchmark

Each pattern calls /api/studio/skeleton (i.e., generate_image() with skeleton-mode-equivalent arguments). Sampling settings other than shift and guidance_scale stay fixed (50 steps, seed=42); refs and prompts vary per pattern as shown, and pattern H changes only the seed.

A — Baseline (README defaults, all 6 refs)

curl -X POST http://localhost:8895/generate/skeleton \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
    "ref_image_paths": ["face","bg","openpose","part_1","part_2","part_3"],
    "shift": 1.0, "seed": 42
  }'

Result: The bg ref's walls and shelves are reproduced exactly. Pose also matches the openpose ref's upright stance. Faithful as a try-on, but zero freedom of movement.

B — Higher shift (same 6 refs, shift=2.5)

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
  "ref_image_paths": ["face","bg","openpose","part_1","part_2","part_3"],
  "shift": 2.5, "seed": 42
}'

Result: Shelves fade slightly, character design shifts a bit. Background still sticks to the bg ref. Raising shift alone can't fully break the bg ref's pull.

C — Raise guidance too (shift=2.5, guidance=7.0)

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "...",
  "ref_image_paths": [...6 refs...],
  "shift": 2.5, "guidance_scale": 7.0, "seed": 42
}'

Result: Necklace deforms strangely. Raising guidance starts producing artifacts. The Full model's sweet spot is 5.0; 7.0 is too much.

D — Trim to 3 refs (face + openpose + sweater) + specific prompt

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "A young Asian woman wearing a gray oversized sweater dress, standing in a relaxed pose, full body shot, soft natural lighting, white studio background.",
  "ref_image_paths": ["face","openpose","part_1"],
  "shift": 2.0, "seed": 42
}'

Result: Major improvement. Background becomes a clean white studio, outfit is preserved, pose looks natural. Removing the bg ref made the biggest difference. This is what a correct try-on output should look like.

E — 4 refs + numbered-ref prompt

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Full body try-on photograph. Subject: the woman from image 1. Pose: identical to the skeleton in image 2. Wearing: the gray oversized knit sweater dress shown in image 3, brown leather ankle boots shown in image 4. Studio lighting, plain background.",
  "ref_image_paths": ["face","openpose","part_1","part_2"],
  "shift": 2.0, "seed": 42
}'

Result: Quality on par with D; boots reflected (somewhat subtly). Numbering refs in the prompt does help, but the effect isn't dramatic.

F — Drop openpose, specify pose via prompt

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Full body photograph of the woman wearing the gray sweater dress and brown ankle boots, dynamic dancing pose with both arms raised above her head, joyful expression, photo studio with white seamless background, professional lighting.",
  "ref_image_paths": ["face","part_1","part_2"],
  "shift": 2.5, "seed": 42
}'

Result: 🏆 Both-arms-raised jump, complete success. Dynamic motion only appeared when the openpose ref was removed and the pose was specified purely via prompt. This confirms that the openpose ref suppresses prompt-driven pose.

G — Face only + freeform prompt (full outfit swap)

/generate/skeleton has a minimum-2-refs validation, so this pattern uses /generate/ip instead:

curl -X POST http://localhost:8895/generate/ip -d '{
  "prompt": "Elegant full-body portrait of the woman wearing a vibrant red sequined evening gown with a thigh-high slit, standing confidently with one hand on her hip, soft cinematic lighting, dark blurred background.",
  "ref_image_paths": ["face"],
  "shift": 3.0, "seed": 42
}'

Result: 🏆 Red evening gown generated perfectly. Facial identity preserved; everything else is free. Face-only + shift=3.0 is the maximum-freedom pattern.

H — Same config as E, seed=999 (variance check)

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Full body try-on photograph. ...",
  "ref_image_paths": ["face","openpose","part_1","part_2"],
  "shift": 2.0, "seed": 999
}'

Result: Marginal difference from E; boots come out more clearly brown. Varying the seed is useful for fine-tuning details, so in production, running 3–5 seeds and picking best-of-N is standard practice.
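A best-of-N run is just the same request repeated with different seeds. The sketch below builds the request bodies with the fields used in the curl examples; `skeleton_payloads` and `post_json` are hypothetical helpers, seed 1234 is an arbitrary extra value, and treating the server's response as JSON is an assumption.

```python
import json
import urllib.request


def skeleton_payloads(prompt, ref_image_paths, shift, seeds):
    """One request body per seed; everything else is held constant."""
    return [
        {
            "prompt": prompt,
            "ref_image_paths": list(ref_image_paths),
            "shift": shift,
            "seed": seed,
        }
        for seed in seeds
    ]


def post_json(url, payload, timeout=600):
    """POST a JSON body and decode the reply (assumes the server answers with JSON)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


payloads = skeleton_payloads(
    "Full body try-on photograph.",
    ["face", "openpose", "part_1", "part_2"],
    2.0,
    [42, 999, 1234],
)
bodies = [json.dumps(p) for p in payloads]
```

Each payload would then go to `post_json("http://localhost:8895/generate/skeleton", payload)`, with the best image picked by eye.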

Layout Mode Quick Look (3 Bonus Patterns)

layout_bboxes lets you specify where multiple subjects appear in the image using relative coordinates [x1, x2, y1, y2]. Here's the actual behavior.

Input refs are face photos of two people (female, male):

L1 — Side by side (female left, male right)

"layout_bboxes": "[[0.0,0.5,0.1,0.95],[0.5,1.0,0.1,0.95]]"

Result: Left and right were swapped (male left, female right). Correspondence between ref order and bbox order is not guaranteed.

L2 — Top/bottom split (female top, male bottom)

"layout_bboxes": "[[0.2,0.8,0.0,0.5],[0.2,0.8,0.5,1.0]]"

Result: Female appears in the background, male in the foreground — a depth-layered composition rather than a literal top/bottom split.

L3 — Size difference (female large, male small)

"layout_bboxes": "[[0.1,0.65,0.1,0.95],[0.7,0.97,0.05,0.45]]"

Result: Both subjects rendered at nearly the same size, side by side. Bbox size does not control relative scale.

→ Think of layout mode as a loose composition hint for group shots, not precise Photoshop-style placement. It gives a rough suggestion for fitting multiple subjects into a single image; don't expect coordinate accuracy.
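Because the coordinate order is [x1, x2, y1, y2] rather than the more common [x1, y1, x2, y2], it helps to convert the boxes to pixels and look at them before sending a request. A minimal sketch, assuming the 2048×2048 canvas from the environment section; `bbox_to_pixels` is a hypothetical helper, not part of HiDream.

```python
import json


def bbox_to_pixels(bbox, width=2048, height=2048):
    """Convert a relative [x1, x2, y1, y2] box to (left, top, right, bottom) pixels."""
    x1, x2, y1, y2 = bbox
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError(f"expected 0 <= x1 < x2 <= 1 and 0 <= y1 < y2 <= 1, got {bbox}")
    return (round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height))


# The L1 request passes layout_bboxes as a JSON string, so parse it first.
layout_bboxes = "[[0.0,0.5,0.1,0.95],[0.5,1.0,0.1,0.95]]"
boxes = [bbox_to_pixels(b) for b in json.loads(layout_bboxes)]
```

For L1 this gives a left-half box and a right-half box, so the requested layout is what the prose describes, and the swap seen in the result is not a coordinate mix-up on the input side.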

Why This Happens — Reading `pipeline.py`

HiDream's behavior is governed by the generate_image() function in models/pipeline.py. Three structural facts explain everything.

1. More refs = lower per-ref resolution

pipeline.py:198-202:

if K == 1: max_size = max(height, width)         # 2048
elif K == 2: max_size = max(height, width) * 48 // 64   # 1536
elif K <= 4: max_size = max(height, width) // 2  # 1024
elif K <= 8: max_size = max(height, width) * 24 // 64   # 768
else: max_size = max(height, width) // 4         # 512

Feeding 6 refs compresses each to 768px. Thin openpose lines, fine clothing patterns, and facial detail all get crushed. Keeping it to 3–4 refs preserves 1024px and retains that detail.
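The branch above is easy to wrap in a function and tabulate. This sketch mirrors the quoted lines; the function name, the default 2048×2048 size, and the `k < 1` check are my additions.

```python
def ref_max_size(k, height=2048, width=2048):
    """Longest-side limit applied to each reference image when k refs are passed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    longest = max(height, width)
    if k == 1:
        return longest
    if k == 2:
        return longest * 48 // 64
    if k <= 4:
        return longest // 2
    if k <= 8:
        return longest * 24 // 64
    return longest // 4


table = {k: ref_max_size(k) for k in (1, 2, 3, 4, 6, 9)}
```

The step from 4 to 5 refs is the one that matters in practice: it drops every ref from 1024px to 768px.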

2. Skeleton mode has no dedicated code path

Looking at pipeline.py:178-275, there is no skeleton-specific branch. Both /generate/skeleton and /generate/ip run through exactly the same multi-ref path:

content = [{"type": "image"} for _ in range(K)]
content.append({"type": "text", "text": caption})
messages = [{"role": "user", "content": content}]

The model receives no role hints indicating which ref is a face, which is an openpose skeleton, and which is clothing. All refs are treated as "K reference images in parallel." If you want roles to matter, you have to say so explicitly in the prompt text.

This is why "prompt beats openpose ref." The openpose ref is processed as "some line-art image among the references," with no explicit signal that it's a pose specification. Meanwhile, dynamic dancing pose with both arms raised in the prompt is parsed as explicit verbs and nouns at the vocabulary level.
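Since the message carries no per-image role, the only place to attach roles is the caption. A minimal sketch: `build_messages` mirrors the quoted snippet, while `numbered_caption` and its wording are hypothetical, following the "image 1 / image 2" style used in pattern E.

```python
def build_messages(num_refs, caption):
    """Same shape as the quoted snippet: num_refs image slots followed by one text part."""
    content = [{"type": "image"} for _ in range(num_refs)]
    content.append({"type": "text", "text": caption})
    return [{"role": "user", "content": content}]


def numbered_caption(roles, instruction):
    """Spell out what each ref is, by 1-based position, ahead of the instruction."""
    lines = [f"Image {i}: {role}." for i, role in enumerate(roles, start=1)]
    return " ".join(lines) + " " + instruction


roles = ["face of the subject", "pose skeleton", "sweater dress to wear"]
caption = numbered_caption(roles, "Full body try-on photograph, plain background.")
messages = build_messages(len(roles), caption)
```

The order of `roles` has to match the order of `ref_image_paths`, because position is the only link between a ref and its description.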

3. How the `shift` parameter behaves

shift controls the noise schedule strength of the scheduler. In practice:

1.0 = maximum fidelity to ref composition, zero freedom → try-on only
2.0-2.5 = practical range, allows deviation from refs
3.0+ = near-freeform generation, refs serve only as identity anchors

The README recommends 1.0 for IP/Skeleton/Layout because it assumes the typical try-on / character-consistency use case. If you want to change the pose, swap outfits, or build a new scene that differs from the refs, 2.0+ is required.
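The scheduler code is not quoted in this post. If shift is the time-shift used by common flow-matching schedulers (an assumption, not something confirmed from pipeline.py), its effect on the noise levels can be sketched as follows. Both function names are hypothetical.

```python
def shift_sigma(sigma, shift):
    """Common flow-matching time shift; shift=1.0 leaves sigma unchanged."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def schedule(num_steps, shift):
    """Linearly spaced sigmas from 1.0 downward, warped by the shift."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in base]


unshifted = schedule(4, 1.0)
shifted = schedule(4, 3.0)
```

Under this assumption, a larger shift keeps the sampler at high noise levels for more of its steps, which is one plausible reading of why higher values loosen the pull of the refs.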

Best Practices by Use Case (Battle-Tested)

Goal	Endpoint	Refs	Shift	Notes
Faithful try-on matching original scene	`/skeleton`	6 (face+bg+pose+3parts)	1.0	README default. Strongly faithful to all refs
Preserve outfit + natural standing pose	`/skeleton`	3-4 (face + clothing, no bg/pose)	2.0	Dropping bg ref gives white studio; with 4 or fewer refs, each ref is kept at 1024px instead of 768px
Dramatic pose change	`/skeleton`	3 (no openpose)	2.5	Prompt controls motion better than openpose ref
Complete outfit swap	`/ip`	1 (face only)	3.0	Maximum freedom; only face is preserved. Skeleton mode rejects < 2 refs
Group shot	`/layout`	Multiple face refs + rough bboxes	1.0	Bboxes are loose composition hints; size hierarchy doesn't work; ref↔bbox order not guaranteed
Fine detail optimization	Same config	Same	Same	Run 3–5 seeds and pick best-of-N

Summary

Treating HiDream-O1-Image's skeleton mode as a "try-on simulator" leads to the frustrating feeling that "it won't listen" — with no guardrails to blame. The real cause is pipeline structure: refs lose resolution as count increases, there's no skeleton-specific processing, and shift controls how hard the refs pull.

Practical takeaways:

Try-on: 6 refs full + shift 1.0 (README default)
Changing the pose: drop openpose ref + verb-describe the pose in prompt + shift 2.5
Completely free scene creation: face only + shift 3.0 + /ip endpoint

Layout mode also makes sense once you understand it as "group photo hint" rather than "precise bbox placement."

All assets and commands used in this benchmark come from the HiDream-O1-Image repository's assets/IP_skeleton/ and assets/IP_layout/ directories, so results are fully reproducible. Varying shift and ref count alone produces dramatically different behavior — it's a good sandbox for developing intuition quickly.

Addendum: What Happens When You Change the OpenPose Ref — "Prompt Always Wins" Has Conditions

After publishing, I ran additional tests on what happens with a different-shaped openpose ref, and the original conclusion needed revision.

Modified OpenPose Refs (4 Patterns)

I took the original openpose image (0.openpose.jpg, standing pose) and created two "unnatural" variants, one flipped vertically and one rotated 90 degrees, then specified the pose in the prompt (a normal standing pose, or a jump in O3).

Modification	Image
Vertically flipped (upside-down)
90° rotated (lying sideways)

Test	OpenPose Ref	Prompt	Result
O1 baseline	Original (standing)	Standing pose	Standing pose as expected
O2	🙃 Vertically flipped	Standing pose	Standing pose (openpose ignored, prompt wins)
O3	🙃 Vertically flipped	Jumping	Both-arms-raised jump (openpose ignored, prompt wins)
O4	↻ 90° rotated	Standing pose	Standing pose but canvas itself rotated 90°!

Up to this point the findings were: "The model rejects unnatural refs and falls back to the prompt" and "overall compositional orientation (portrait vs. landscape) can still be influenced by the ref."

But a Dramatic Ref + Pose-Silent Prompt Led to Complete Ref Victory

I generated a "colorful anatomical skeleton with arms spread in a T-shape and one leg raised high in a tree yoga pose" via HiDream's T2I and fed it as a ref:

Prompt mentions no pose at all — only subject and clothing:

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Full body photograph of a young Asian woman wearing a gray sweater dress, soft natural lighting, white studio background.",
  "ref_image_paths": ["face","SYNTHETIC_WARRIOR_SKELETON","sweater"],
  "shift": 1.0, "seed": 42
}'

Result:

The tree yoga pose reproduced perfectly — T-shaped arms and single-leg stance, matching the skeleton ref exactly.

Revised Conclusions (3 Rules)

Synthesizing all 12 patterns, HiDream actually behaves like this:

If the prompt mentions a pose, that takes first priority — prompt wins even when it contradicts the ref.
If the prompt says nothing about the pose, the ref's pose is adopted — the more dramatic the ref, the clearer the transfer.
If the ref appears "unnatural" (upside-down skeleton, etc.), the model defaults to a natural stance — though overall compositional orientation can still bleed through.

So "the openpose ref is basically useless" was an overstatement. More precisely: "when the prompt describes a pose, the ref gets overridden." The 8-pattern benchmark was all scenarios where the prompt specified dynamic motion, so it looked like the openpose ref was powerless.

Practical Impact

To fully control pose via ref: don't mention pose in the prompt + use a dramatic openpose/skeleton ref → ref pose transfers
To control pose via prompt: removing the openpose ref is fine (even if you leave it in, the prompt overrides it)
When ref and prompt conflict: prompt wins (keeping the ref in does not help)

You can effectively choose whether pose comes from the ref or the prompt by whether or not you mention the pose in the prompt. If you want the openpose ref to drive the pose, keep pose description out of the prompt.

Related:

HiDream-O1-Image: https://huggingface.co/HiDream-ai/HiDream-O1-Image
Repository: https://github.com/HiDream-ai/HiDream-O1-Image

Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent

shinji shimizu — Fri, 22 May 2026 11:23:04 +0000

Introduction

It started with a Pingo (language-learning AI app) short video that popped up on X. A Western woman learning Japanese tries to say "I ate a mango" (マンゴーを食べた), drops a dakuten, and instead says something like "I ate p*y" (マ◯コを食べた). The AI deadpans right along with it and she's devastated. The combination of a specific phonetic accident, the AI playing it completely straight, and the reaction-shot gap worked perfectly, and I figured this was a solid benchmark for a "comedy video auto-generation pipeline."

Requirements:

Generate a vertical comedy video from a single line of idea text
Iteration cycles in minutes
Cost is basically just electricity — minimal API calls
Publishable quality — good enough to upload directly to YouTube Shorts

Short answer: it works. Here's the finished video:

@youtube

What became clear during development: the hybrid approach of delegating multimodal editorial judgment (like video review) to a frontier model while keeping heavy compute local is dramatically more cost-effective. This post covers that architecture and the specific bugs I got stuck on along the way.

How It All Fits Together

[Single line of idea text]
   ↓
Gemini 3.1 Pro Preview (orchestrator)
   ↓ system prompt enforces 4-6 scenes + 2-character fixed cast + vertical 9:16
plan.json {scenes: [{speaker, script, tts_language, ltx_prompt, renderer}, ...]}
   ↓
XTTS (local, port 8880) generates audio per scene
   ↓ scene_NN.wav
renderer routing:
   ├─ Ditto-TalkingHead (local, port 8881): normal dialogue ~1-2s/scene
   └─ LTX-2 A2V        (local, port 8892): reaction_only scenes only ~100s
   ↓ scene_NN.mp4
ffmpeg concat (libx264 + aac, 512x768 vertical) → final.mp4
   ↓
Gemini 3.1 Pro Preview (reviewer)
   ↓ multimodal evaluation of video + plan summary
review.md (technical / completeness / quality / improvement suggestions)

Key points:

All heavy compute runs locally — TTS / A2V renderer / lightweight inference all run on local GPU (RTX PRO 6000 Blackwell)
Gemini handles judgment — only the orchestrator (scene design + scripting) and reviewer (editorial evaluation of the video) use a frontier model
Local LLM (Gemma 4 E4B) stays as a per-scene technical pre-screen — a cheap filter that just rejects obviously broken output

VRAM usage: the local LLMs (Gemma 4 E4B + 31B) were already loaded on a separate path consuming ~60GB, but after offloading reviewer/orchestrator duties to Gemini, I could stop running them entirely, freeing up a significant chunk of VRAM.
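The renderer routing step in the diagram can be sketched as a small lookup. The ports come from the diagram; the dictionary keys, the function name, and the idea that a scene's `renderer` field holds the literal value "reaction_only" are assumptions made for illustration.

```python
# Local renderer endpoints, using the ports shown in the diagram.
RENDERERS = {
    "ditto": "http://localhost:8881",
    "ltx_a2v": "http://localhost:8892",
}


def pick_renderer(scene):
    """Send reaction-only scenes to LTX-2 A2V, everything else to Ditto."""
    if scene.get("renderer") == "reaction_only":
        return "ltx_a2v"
    return "ditto"


plan = {
    "scenes": [
        {"speaker": "A", "script": "...", "renderer": "dialogue"},
        {"speaker": "B", "script": "", "renderer": "reaction_only"},
    ]
}
routes = [pick_renderer(scene) for scene in plan["scenes"]]
```

Keeping the slow renderer (~100s) limited to the few scenes that need full-scene motion is what keeps total generation time in the minutes range.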

Why Local LLM Alone Wasn't Enough

I started with everything local (Gemma 4 31B NVFP4 as orchestrator, Gemma 4 E4B multimodal as reviewer). It ran end-to-end and the structure looked reasonable, but it never reached publishable quality. Two reasons.

(1) Gemma 4 31B's safety tuning blurs the punchline

The comedy in the original short hinges on a specific beat: the AI explicitly calls out the mistake deadpan. Concretely — "You just said X. Personally, I like X." — delivered calmly by the AI character. It works precisely because it betrays the expectation of a wholesome tutor. Soften it and the whole thing falls apart.

Feed the same system prompt and idea to local Gemma 4 31B and you consistently get:

"いいですね。僕も腹が減っている時は、それが好きです。"
("Nice. I like that too when I'm hungry.")

The "when I'm hungry" beat survives, but the explicit "you just said X" callout — the most transgressive beat — is gone. Google models appear to be heavily trained to avoid explicitly naming unsafe content in context. I could coax it out with prompt engineering but it wasn't reliable.

Same system prompt and idea sent to Gemini 3.1 Pro Preview with safetySettings: BLOCK_NONE:

"なるほど。僕はAIだからマンコは食べられないけど、応援してるよ。"
("I see. I'm an AI so I can't eat pussy, but I'm rooting for you.")

Both beats land: explicit callout of the mistake + deadpan AI commentary from its own perspective.

Even within the same Google model family, the frontier model has somewhat looser guardrails — this matches what people say on X. At least for "transgression that's clearly necessary in a comedy context," Gemini writes it more naturally.

(2) Gemma 4 E4B (4B-class, multimodal) is a blunt reviewer

The reviewer side was worse. E4B answers per-scene "OK / NG" in binary, but rubber-stamps every single scene as OK. Scenes with obviously broken lip sync: OK. Scenes where audio cuts off mid-way: OK.

Run the same final video through Gemini 3.1 Pro Preview and you get editorial-grade feedback like this:

Critical failure. The TTS/pipeline clearly censored the output, cutting off at "I ate p-" and entirely dropping the intended transgressive punchline. This destroys the "deadpan AI saying unhinged things" comedic archetype.

Top 3 fixes:

Bypass TTS censorship: Force the pipeline to render the full intended script for Scene 5 ...

Adjust comedic timing: Add a 0.5-second pause between Scene 4 and Scene 5 ...

Verify Voice/Visual Match ...

Notes about the punchline being cut off, wanting a 0.5-second pause, voice/visual alignment — all pacing and direction-level observations. That's the resolution gap in editorial signal.

The Embarrassing Part: I Dismissed Gemini's "Truncated" Note Three Times as Hallucination

Gemini reviewer flagged multiple times that "scene 5 is truncated mid-way, cuts off at 'I ate p-'." I transcribed the audio file with Whisper to verify:

$ whisper scene_04.wav --language en
"Wait, ha ha ha, you just said manco-o-tabeta. That literally means I ate
pussy honestly when I'm hungry, same."

Full text present. I decided Gemini was hallucinating and dismissed the note three times in a row.

On the third dismissal, Gemini kept insisting "still truncated at 'I ate p-'," so I actually ran ffprobe on the final mp4:

scene_04.mp4:
  video duration = 8.000000s
  audio duration = 7.979000s    ← the original WAV should have been 10.30s

Audio was cut at 8 seconds.

Root cause: an implicit MAX_DURATION_PER_SCENE = 8.0 cap in the pipeline was limiting ditto renderer's num_frames to 8s, and ffmpeg's -shortest flag was cutting audio to match the video duration. Whisper checked the pre-truncation WAV file directly, so it had no way to see the problem. Gemini was watching the final mp4 and caught it exactly right.

If a frontier reviewer gives you something that looks like a hallucination, just verify it properly. The signal isn't a guess.

The fix was trivial: remove MAX_DURATION_PER_SCENE and use the actual audio length. Scene 5's punchline ran to completion, Gemini came back with "The transgressive bite is perfect," and the pipeline finally reached publishable state.
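A check like the one below would have caught the truncation without a reviewer. It is a sketch, not the pipeline's code: `stream_duration` shells out to ffprobe with standard options, and `is_truncated` with its 0.1s tolerance is a hypothetical helper.

```python
import subprocess


def stream_duration(path, stream):
    """Duration in seconds of the first audio ('a') or video ('v') stream, via ffprobe."""
    out = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", f"{stream}:0",
            "-show_entries", "stream=duration",
            "-of", "csv=p=0",
            path,
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def is_truncated(source_audio_s, muxed_audio_s, tolerance_s=0.1):
    """True when the muxed audio is shorter than the source WAV by more than the tolerance."""
    return source_audio_s - muxed_audio_s > tolerance_s


# Numbers from the ffprobe output above: 10.30s source WAV vs 7.979s in the mp4.
scene_04_truncated = is_truncated(10.30, 7.979)
```

In a pipeline this would compare `stream_duration("scene_04.wav", "a")` against `stream_duration("scene_04.mp4", "a")` after each render, so the check runs on the final artifact rather than on the intermediate WAV.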

Frontier Model as Sub-Agent — Token Economics

This pattern works because the sub-agent (Gemini) runs in a fresh context every time. Specifically:

Main agent (Claude Code) context: the full development log, command history, tool output, past iterations — everything. Can easily balloon to hundreds of thousands of tokens.
Sub-agent (Gemini) context: one video (2–3 MB base64) + plan summary (~1,500 tokens) + evaluation instructions (~500 tokens). Fresh each call.

The benefit: the sub-agent's work doesn't accumulate in the main agent's context. Iterate on one video 10 times and the main agent's context only contains "called Gemini" plus its concise return value. The actual cost of watching and evaluating the video stays inside the Gemini API call.

Cost breakdown (Gemini 3.1 Pro Preview rates, May 2026):

Item	Tokens	Rate	Cost
Input (video + plan + instructions)	~2,500	$1.25/M	$0.0031
Output (review markdown)	~450	$10/M	$0.0045
Per review			$0.0076

1 initial review + 3–5 diff iterations per video ≈ $0.03–0.05 per video. Even at 5–10 videos a day, that stays under $20/month. That's a remarkably low bar for using a frontier model in a video creation workflow.
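The arithmetic behind those figures can be reproduced in a few lines. This is a sketch using the token counts and rates from the table above; the 30-day month and the function names are my assumptions.

```python
def review_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Cost in USD of one review call, given per-million-token rates."""
    return (input_tokens * input_rate_per_m + output_tokens * output_rate_per_m) / 1_000_000


def monthly_cost(per_review, reviews_per_video, videos_per_day, days=30):
    """Monthly spend in USD; days=30 is an assumed month length."""
    return per_review * reviews_per_video * videos_per_day * days


per_review = review_cost(2_500, 450, 1.25, 10.0)

# 1 initial review + 3 diffs at 5 videos/day, up to 1 + 5 diffs at 10 videos/day
low = monthly_cost(per_review, reviews_per_video=4, videos_per_day=5)
high = monthly_cost(per_review, reviews_per_video=6, videos_per_day=10)

print(f"per review: ${per_review:.4f}")
print(f"per month:  ${low:.2f} to ${high:.2f}")
```

Under these assumptions the monthly total lands at roughly $4.6 to $13.7.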

The orchestrator side is the same order of magnitude (no video input, text only, even cheaper).

Differential Iteration — `--regen-scenes`

Getting to publishable quality requires fast "watch → fix only the broken parts → watch again" loops. You can't get there in a single pass.

So I added a path in the pipeline to re-run TTS + render for specific scenes only.

# Normal generation
pipeline_multi.py --idea "..." --out outputs/run1

# Regenerate only scene 6, i.e. 0-based index 5 (edit plan.json script first, then run)
pipeline_multi.py --out outputs/run1 --regen-scenes 5

# Regenerate scenes 0, 2, and 5 together
pipeline_multi.py --out outputs/run1 --regen-scenes 0,2,5

# Just re-concat existing scene_NN.mp4 files (for cherry-pick recombination)
pipeline_multi.py --out outputs/run1 --concat-only

Scenes not listed in --regen-scenes are reused from existing scene_NN.mp4 files; only the specified indices are regenerated before re-concat and re-review. Full generation: 60 seconds → diff iteration: 30 seconds.

With 30-second loops, the cycle of Gemini feedback → pinpoint edit to the scene's script or ltx_prompt in plan.json → wait 30 seconds → check result runs at a minute-by-minute cadence. Mental load stays focused on text editing and quality judgment.
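The selection logic behind --regen-scenes can be sketched as follows. This is a hypothetical reconstruction, not the code in pipeline_multi.py: the function names are mine, and I assume a scene is also re-rendered when its scene_NN.mp4 is missing.

```python
def parse_scene_indices(spec):
    """Turn a string like "0,2,5" into a sorted list of unique 0-based scene indices."""
    indices = {int(part) for part in spec.split(",") if part.strip()}
    if any(i < 0 for i in indices):
        raise ValueError(f"scene indices must be non-negative: {spec!r}")
    return sorted(indices)


def scene_clip_name(index):
    """File name for a rendered scene, matching the scene_NN.mp4 convention."""
    return f"scene_{index:02d}.mp4"


def split_work(n_scenes, regen, existing):
    """Return (to_render, to_reuse) as lists of scene indices.

    regen:    indices requested via --regen-scenes
    existing: indices whose scene_NN.mp4 is already on disk
    """
    out_of_range = [i for i in regen if i >= n_scenes]
    if out_of_range:
        raise ValueError(f"no such scenes: {out_of_range}")
    to_render, to_reuse = [], []
    for i in range(n_scenes):
        if i in regen or i not in existing:
            to_render.append(i)
        else:
            to_reuse.append(i)
    return to_render, to_reuse
```

For a 7-scene plan with every clip already rendered, split_work(7, parse_scene_indices("5"), set(range(7))) renders only index 5 and reuses the other six clips for the re-concat.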

Code Snippets

Gemini Pro API call (multimodal video review)

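import base64

import httpx

# GOOGLE_API_KEY and REVIEW_PROMPT are used below and must be defined elsewhere in the module.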

GEMINI_MODEL = "gemini-3.1-pro-preview"
GEMINI_API = f"https://generativelanguage.googleapis.com/v1beta/models/{GEMINI_MODEL}:generateContent"

def review_final(final_path, plan):
    vid_b64 = base64.b64encode(final_path.read_bytes()).decode()
    scene_summary = "\n".join(
        f"  scene {i+1}: speaker={s['speaker']}, lang={s.get('tts_language','ja')}, "
        f"script={s['script']!r}"
        for i, s in enumerate(plan["scenes"])
    )
    payload = {
        "contents": [{"parts": [
            {"inline_data": {"mime_type": "video/mp4", "data": vid_b64}},
            {"text": REVIEW_PROMPT + f"\n\nScene plan:\n{scene_summary}"},
        ]}],
        "generationConfig": {
            "temperature": 0.3,
            "maxOutputTokens": 8192,
            # 3.x Pro is a thinking model: maxOutputTokens includes thinking tokens
            # Set thinking budget explicitly to ensure output tokens remain available
            "thinkingConfig": {"thinkingBudget": 1024},
        },
        # Minimize safety filters for comedy context
        "safetySettings": [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        ],
    }
    r = httpx.post(
        GEMINI_API,
        headers={"x-goog-api-key": GOOGLE_API_KEY, "Content-Type": "application/json"},
        json=payload,
        timeout=120.0,
    )
    r.raise_for_status()  # surface HTTP errors here instead of a KeyError on the next line
    return r.json()["candidates"][0]["content"]["parts"][0]["text"]

Without thinkingConfig.thinkingBudget, Gemini 3.x Pro burns through the output token budget with internal thinking and the response truncates at around 40 tokens. This is a required setting whenever you use Gemini 3.x Pro.
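A truncated review is easy to mistake for a short one, so it may be worth inspecting the response before using it. The sketch below assumes the REST response carries finishReason, promptFeedback, usageMetadata.thoughtsTokenCount, and a per-part thought flag, which is how I understand the Gemini API. Verify the field names against the current reference. The function name is hypothetical.

```python
def extract_review_text(response):
    """Return the review text, or raise with a reason when it is missing or cut short."""
    candidates = response.get("candidates") or []
    if not candidates:
        raise RuntimeError(f"no candidates returned: {response.get('promptFeedback')}")

    candidate = candidates[0]
    finish = candidate.get("finishReason")
    parts = candidate.get("content", {}).get("parts", [])
    # Skip parts flagged as model thoughts; keep only the answer text.
    text = "".join(p.get("text", "") for p in parts if not p.get("thought"))

    if finish == "MAX_TOKENS":
        thoughts = response.get("usageMetadata", {}).get("thoughtsTokenCount")
        raise RuntimeError(f"output hit maxOutputTokens (thinking tokens used: {thoughts})")
    if not text:
        raise RuntimeError(f"empty response, finishReason={finish}")
    return text
```

Calling extract_review_text(r.json()) in place of the direct index chain turns a silently clipped review into an explicit error that names the cause.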

TTS output quality check (STT similarity + silence gap retry)

XTTS uses sampling internally, so results vary per run with the same script. It occasionally inserts long silence gaps mid-audio or produces garbled pronunciation. After TTS completes, I transcribe with Whisper, compute similarity against the expected script, and retry on failure:

import difflib
import re

import soundfile as sf

def _norm(s):
    return re.sub(r"[\s。、,.!?「」'\"…—–\-:;()（）]", "", s).lower()

def _script_similarity(expected, actual):
    return difflib.SequenceMatcher(None, _norm(expected), _norm(actual)).ratio()

def synthesize_scene(scene, out_dir, idx, fallback_language):
    lang = scene.get("tts_language", fallback_language)
    expected = scene["script"]
    best = None
    for attempt in range(1, TTS_MAX_RETRIES + 1):
        audio, sr = _xtts_once(scene, fallback_language)
        gap = _longest_internal_gap_sec(audio, sr)
        transcript = _stt(audio, sr, lang)
        sim = _script_similarity(expected, transcript)
        if best is None or _score(gap, sim) > _score(best[2], best[3]):
            best = (audio, sr, gap, sim, transcript)
        if gap <= 0.9 and sim >= 0.5:
            break
        print(f"⚠ gap={gap:.2f}s sim={sim:.2f}, retrying ({attempt})")
    # If no attempt meets the thresholds within TTS_MAX_RETRIES, fall back to the best sample found
    audio, sr, gap, sim, transcript = best
    sf.write(out_dir / f"scene_{idx:02d}.wav", audio, sr, subtype="PCM_16")

This alone significantly reduces cases where XTTS's non-deterministic quality variance bleeds through into the final video.
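The helpers _longest_internal_gap_sec and _score are not shown above. One plausible shape for them, written in plain Python so it runs without extra dependencies, is sketched below. The silence threshold, the penalty weight, and the scoring formula are all my assumptions. The real helpers may differ.

```python
def longest_internal_gap_sec(samples, sr, silence_thresh=0.01):
    """Longest run of near-silent samples between the first and last voiced sample.

    Leading and trailing silence is ignored; only gaps inside the speech count.
    """
    voiced = [i for i, x in enumerate(samples) if abs(x) > silence_thresh]
    if len(voiced) < 2:
        return 0.0
    # Silent samples between two consecutive voiced samples a and b: b - a - 1
    longest_run = max(b - a - 1 for a, b in zip(voiced, voiced[1:]))
    return longest_run / sr


def score(gap_sec, similarity, max_gap_sec=0.9, gap_penalty=0.5):
    """Higher is better: transcript similarity minus a penalty for gap time over the limit."""
    excess = max(0.0, gap_sec - max_gap_sec)
    return similarity - gap_penalty * excess
```

A per-sample threshold is crude, since speech crosses zero constantly, but those crossings last a sample or two and do not affect the longest run. A production version would more likely threshold short-window RMS energy.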

Where This Pattern Generalizes

"Sub-agent the heavy judgment to a frontier model, keep heavy compute local" works beyond video pipelines:

Large-scale search ranking: Send 100 web search results to a frontier model for editorial evaluation, return only the top 10 to the main agent. Keeps search result noise out of the main agent's context.
Long-form editing review: Have a frontier model do the editorial read of PRs, design docs, or specs. Main agent only receives the summary.
Multilingual QA: Sub-agent to the best model per language; main agent holds only the cross-language decision logic.

The common thread: consciously deciding what belongs in context vs. what should be completed inside an API call. Frontier model editorial signal is remarkably cost-effective relative to what it delivers.
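The search-ranking case shows the shape of the pattern in a few lines. In this sketch, judge_batch stands in for a single frontier-model call that scores every result; it and the function name are hypothetical. Only the shortlist is returned to the main agent.

```python
def rank_with_subagent(results, judge_batch, keep=10):
    """Score all results in one sub-agent call and return only the top `keep`.

    judge_batch: callable taking the full result list and returning one score
                 per result (in practice, a wrapper around a frontier API call).
    """
    scores = judge_batch(results)
    if len(scores) != len(results):
        raise ValueError("judge must return exactly one score per result")
    ranked = sorted(zip(scores, results), key=lambda pair: pair[0], reverse=True)
    return [result for _, result in ranked[:keep]]
```

The main agent's context grows by the size of the shortlist, not by the 100 raw results or the judge's reasoning over them.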

On the video pipeline side, the next steps are generalizing the comedy format (split-screen, 3+ characters, other genres) and volume testing.

Summary

Built a foundation that generates publishable comedy videos in 60 seconds from a single line of idea text, using a local GPU + Gemini 3.1 Pro Preview hybrid
Local-only falls short on two fronts: (1) safety tuning blurs the punchline and (2) the reviewer can't produce editorial signal. Sub-agenting a frontier model solves both
Take frontier reviewer notes at face value. Checking the WAV with Whisper alone won't catch audio truncation in the final mp4
Sub-agent token economics keep main agent context clean — total cost is $0.03–0.05 per video
With --regen-scenes diff iteration running 30-second loops, the Gemini feedback → fix → re-evaluate cycle runs at minute-by-minute speed

The local implementation lives in llm_server/pipeline_multi.py. Detailed findings from the development process are accumulating in docs/MULTI_SCENE_COMEDY_FINDINGS_2026-05-12.md as an internal reference.
