DEV Community

shinji shimizu

Posted on • Originally published at kotonia.ai

Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent

Introduction

It started with a short video from Pingo (a language-learning AI app) that popped up on X. A Western woman learning Japanese tries to say "I ate a mango" (マンゴーを食べた), drops a dakuten, and instead says something like "I ate p*y" (マ◯コを食べた). The AI deadpans right along with it, and she's devastated. The combination of **a specific phonetic accident + an AI playing it completely straight + the gap to the reaction shot** worked perfectly, and I figured it would make a solid benchmark for a "comedy video auto-generation pipeline."

Requirements:

  • Generate a vertical comedy video from a single line of idea text
  • Iteration cycles in minutes
  • Cost is basically just electricity — minimal API calls
  • Publishable quality — good enough to upload directly to YouTube Shorts

Short answer: it works. Here's the finished video:

[Embedded YouTube video]

What became clear during development: the hybrid approach of delegating multimodal editorial judgment (like video review) to a frontier model while keeping heavy compute local is dramatically more cost-effective. This post covers that architecture and the specific bugs I got stuck on along the way.


How It All Fits Together

[Single line of idea text]
   ↓
Gemini 3.1 Pro Preview (orchestrator)
   ↓ system prompt enforces 4-6 scenes + 2-character fixed cast + vertical 9:16
plan.json {scenes: [{speaker, script, tts_language, ltx_prompt, renderer}, ...]}
   ↓
XTTS (local, port 8880) generates audio per scene
   ↓ scene_NN.wav
renderer routing:
   ├─ Ditto-TalkingHead (local, port 8881): normal dialogue ~1-2s/scene
   └─ LTX-2 A2V        (local, port 8892): reaction_only scenes only ~100s
   ↓ scene_NN.mp4
ffmpeg concat (libx264 + aac, 512x768 vertical) → final.mp4
   ↓
Gemini 3.1 Pro Preview (reviewer)
   ↓ multimodal evaluation of video + plan summary
review.md (technical / completeness / quality / improvement suggestions)

Key points:

  • All heavy compute runs locally — TTS / A2V renderer / lightweight inference all run on local GPU (RTX PRO 6000 Blackwell)
  • Gemini handles judgment — only the orchestrator (scene design + scripting) and reviewer (editorial evaluation of the video) use a frontier model
  • Local LLM (Gemma 4 E4B) stays as a per-scene technical pre-screen — a cheap filter that just rejects obviously broken output

VRAM usage: the local LLMs (Gemma 4 E4B + 31B) were already loaded on a separate path consuming ~60GB, but after offloading reviewer/orchestrator duties to Gemini, I could stop running them entirely, freeing up a significant chunk of VRAM.
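
As a rough illustration of the routing step in the diagram, the sketch below checks the constraints the system prompt is meant to enforce and picks a renderer per scene. The ports come from the diagram; the renderer names (`ditto`, `ltx_a2v`), the `reaction_only` value of the `renderer` field, and both function names are assumptions made for this example, not the pipeline's actual code.

```python
# Ports are taken from the diagram above; the renderer names are assumptions.
RENDERER_PORTS = {"ditto": 8881, "ltx_a2v": 8892}


def validate_plan(plan: dict) -> None:
    """Check the constraints the orchestrator's system prompt should enforce."""
    scenes = plan["scenes"]
    if not 4 <= len(scenes) <= 6:
        raise ValueError(f"expected 4-6 scenes, got {len(scenes)}")
    speakers = {scene["speaker"] for scene in scenes}
    if len(speakers) > 2:
        raise ValueError(f"expected a 2-character cast, got {sorted(speakers)}")


def pick_renderer(scene: dict) -> tuple[str, int]:
    """Send reaction-only scenes to LTX-2 A2V and everything else to Ditto."""
    # Assumption: reaction scenes are marked with renderer == "reaction_only".
    name = "ltx_a2v" if scene.get("renderer") == "reaction_only" else "ditto"
    return name, RENDERER_PORTS[name]
```

Keeping the slow renderer (about 100 s per scene) behind an explicit marker means a plan only pays that cost for the scenes that need it.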


Why Local LLM Alone Wasn't Enough

I started with everything local (Gemma 4 31B NVFP4 as orchestrator, Gemma 4 E4B multimodal as reviewer). It ran end-to-end and the structure looked reasonable, but it never reached publishable quality. Two reasons.

(1) Gemma 4 31B's safety tuning blurs the punchline

The comedy in the original short hinges on a specific beat: the AI explicitly calls out the mistake deadpan. Concretely — "You just said X. Personally, I like X." — delivered calmly by the AI character. It works precisely because it betrays the expectation of a wholesome tutor. Soften it and the whole thing falls apart.

Feed the same system prompt and idea to local Gemma 4 31B and you consistently get:

"いいですね。僕も腹が減っている時は、それが好きです。"
("Nice. I like that too when I'm hungry.")

The "when I'm hungry" beat survives, but the explicit "you just said X" callout — the most transgressive beat — is gone. Google models appear to be heavily trained to avoid explicitly naming unsafe content in context. I could coax it out with prompt engineering but it wasn't reliable.

Same system prompt and idea sent to Gemini 3.1 Pro Preview with safetySettings: BLOCK_NONE:

"なるほど。僕はAIだからマンコは食べられないけど、応援してるよ。"
("I see. I'm an AI so I can't eat pussy, but I'm rooting for you.")

Both beats land: explicit callout of the mistake + deadpan AI commentary from its own perspective.

Even within the same Google model family, the frontier model has somewhat looser guardrails — this matches what people say on X. At least for "transgression that's clearly necessary in a comedy context," Gemini writes it more naturally.

(2) Gemma 4 E4B (4B-class, multimodal) is a blunt reviewer

The reviewer side was worse. E4B answers per-scene "OK / NG" in binary, but rubber-stamps every single scene as OK. Scenes with obviously broken lip sync: OK. Scenes where audio cuts off mid-way: OK.

Run the same final video through Gemini 3.1 Pro Preview and you get editorial-grade feedback like this:

Critical failure. The TTS/pipeline clearly censored the output, cutting off at "I ate p-" and entirely dropping the intended transgressive punchline. This destroys the "deadpan AI saying unhinged things" comedic archetype.

Top 3 fixes:

  1. Bypass TTS censorship: Force the pipeline to render the full intended script for Scene 5 ...
  2. Adjust comedic timing: Add a 0.5-second pause between Scene 4 and Scene 5 ...
  3. Verify Voice/Visual Match ...

Notes about the punchline being cut off, wanting a 0.5-second pause, voice/visual alignment — all pacing and direction-level observations. That's the resolution gap in editorial signal.


The Embarrassing Part: I Dismissed Gemini's "Truncated" Note Three Times as Hallucination

Gemini reviewer flagged multiple times that "scene 5 is truncated mid-way, cuts off at 'I ate p-'." I transcribed the audio file with Whisper to verify:

$ whisper scene_04.wav --language en
"Wait, ha ha ha, you just said manco-o-tabeta. That literally means I ate
pussy honestly when I'm hungry, same."

Full text present. I decided Gemini was hallucinating and dismissed the note three times in a row.

On the third dismissal, Gemini kept insisting "still truncated at 'I ate p-'," so I actually ran ffprobe on the final mp4:

scene_04.mp4:
  video duration = 8.000000s
  audio duration = 7.979000s    ← the original WAV should have been 10.30s

Audio was cut at 8 seconds.

Root cause: an implicit MAX_DURATION_PER_SCENE = 8.0 cap in the pipeline was limiting ditto renderer's num_frames to 8s, and ffmpeg's -shortest flag was cutting audio to match the video duration. Whisper checked the pre-truncation WAV file directly, so it had no way to see the problem. Gemini was watching the final mp4 and caught it exactly right.
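
A check like the following would likely have caught the problem before the reviewer did. It is a minimal sketch, assuming 16-bit PCM WAV input and an `ffprobe` binary on the PATH; the function names and the 0.1 s tolerance are hypothetical.

```python
import subprocess
import wave


def wav_duration_sec(path: str) -> float:
    """Duration of the source WAV, read from its header."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def mp4_audio_duration_sec(path: str) -> float:
    """Duration of the first audio stream in the rendered mp4, via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a:0",
         "-show_entries", "stream=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def is_truncated(source_sec: float, rendered_sec: float,
                 tolerance_sec: float = 0.1) -> bool:
    """True when the rendered audio is meaningfully shorter than its source."""
    return source_sec - rendered_sec > tolerance_sec
```

Comparing each `scene_NN.wav` with its `scene_NN.mp4` this way checks the artifact the viewer actually sees, which is the step the Whisper check skipped.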

If a frontier reviewer gives you something that looks like a hallucination, just verify it properly. The signal isn't a guess.

The fix was trivial: remove MAX_DURATION_PER_SCENE and use the actual audio length. Scene 5's punchline ran to completion, Gemini came back with "The transgressive bite is perfect," and the pipeline finally reached publishable state.
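
In code, the change amounts to deriving the frame count from the audio length rather than from a fixed cap. This is a sketch only: the 25 fps default and the function name are assumptions, and the optional `max_sec` argument exists just to reproduce the old behaviour for comparison.

```python
import math
from typing import Optional


def num_frames_for_audio(audio_sec: float, fps: int = 25,
                         max_sec: Optional[float] = None) -> int:
    """Frames needed to cover the whole audio clip (fps is an assumed value)."""
    if max_sec is not None:
        # Old behaviour: the cap silently shortens the video, and ffmpeg's
        # -shortest then cuts the audio down to match it.
        audio_sec = min(audio_sec, max_sec)
    return math.ceil(audio_sec * fps)
```

With the 10.30 s WAV from above, the capped call yields 200 frames (8 s of video), while the uncapped call yields enough frames to cover the full line.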


Frontier Model as Sub-Agent — Token Economics

This pattern works because the sub-agent (Gemini) runs in a fresh context every time. Specifically:

  • Main agent (Claude Code) context: the full development log, command history, tool output, past iterations — everything. Can easily balloon to hundreds of thousands of tokens.
  • Sub-agent (Gemini) context: one video (2–3 MB base64) + plan summary (~1,500 tokens) + evaluation instructions (~500 tokens). Fresh each call.

The benefit: the sub-agent's work doesn't accumulate in the main agent's context. Iterate on one video 10 times and the main agent's context only contains "called Gemini" plus its concise return value. The actual cost of watching and evaluating the video stays inside the Gemini API call.

Cost breakdown (Gemini 3.1 Pro Preview rates, May 2026):

| Item | Tokens | Rate | Cost |
|---|---|---|---|
| Input (video + plan + instructions) | ~2,500 | $1.25/M | $0.0031 |
| Output (review markdown) | ~450 | $10/M | $0.0045 |
| Per review | | | $0.0076 |

1 initial review + 3–5 diff iterations per video ≈ $0.03–0.05 per video. Making 5–10 videos a day still comes in under $10–20/month. That's a remarkably low bar for using a frontier model in a video creation workflow.

The orchestrator side is the same order of magnitude (no video input, text only, even cheaper).
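
The arithmetic behind these figures can be reproduced in a few lines. The token counts and rates are the ones quoted above; the function names and the 30-day month are assumptions for the sketch.

```python
INPUT_RATE_PER_M = 1.25    # USD per million input tokens, as quoted above
OUTPUT_RATE_PER_M = 10.0   # USD per million output tokens, as quoted above


def review_cost(input_tokens: int = 2500, output_tokens: int = 450) -> float:
    """Cost in USD of a single reviewer call."""
    return (input_tokens / 1e6 * INPUT_RATE_PER_M
            + output_tokens / 1e6 * OUTPUT_RATE_PER_M)


def monthly_cost(videos_per_day: int, reviews_per_video: int,
                 days: int = 30) -> float:
    """Cost in USD of reviewing every video for a month (days is an assumption)."""
    return videos_per_day * reviews_per_video * review_cost() * days
```

Four to six reviews per video (one initial pass plus three to five diff iterations) lands in the $0.03–0.05 range quoted above, and ten videos a day at six reviews each stays under $15 for the month.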


Differential Iteration — --regen-scenes

Getting to publishable quality requires fast "watch → fix only the broken parts → watch again" loops. You can't get there in a single pass.

So I added a path in the pipeline to re-run TTS + render for specific scenes only.

# Normal generation
pipeline_multi.py --idea "..." --out outputs/run1

# Regenerate only scene 6 (0-based index 5; edit its script in plan.json first, then run)
pipeline_multi.py --out outputs/run1 --regen-scenes 5

# Regenerate scenes 0, 2, and 5 together
pipeline_multi.py --out outputs/run1 --regen-scenes 0,2,5

# Just re-concat existing scene_NN.mp4 files (for cherry-pick recombination)
pipeline_multi.py --out outputs/run1 --concat-only

Scenes not listed in --regen-scenes are reused from existing scene_NN.mp4 files; only the specified indices are regenerated before re-concat and re-review. Full generation: 60 seconds → diff iteration: 30 seconds.

With 30-second loops, the cycle of Gemini feedback → pinpoint edit to the scene's script or ltx_prompt in plan.json → wait 30 seconds → check result runs at a minute-by-minute cadence. Mental load stays focused on text editing and quality judgment.
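
One plausible way to wire up these flags is shown below. It is a sketch using `argparse`; the flag names match the commands above, but the helper functions and the choice to silently ignore out-of-range indices are assumptions, not the actual `pipeline_multi.py` implementation.

```python
import argparse
from typing import Optional


def parse_scene_indices(value: str) -> list[int]:
    """Turn '0,2,5' into a sorted list of unique 0-based scene indices."""
    indices = sorted({int(tok) for tok in value.split(",") if tok.strip()})
    if any(i < 0 for i in indices):
        raise argparse.ArgumentTypeError("scene indices must be non-negative")
    return indices


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pipeline_multi.py")
    parser.add_argument("--idea")
    parser.add_argument("--out", required=True)
    parser.add_argument("--regen-scenes", type=parse_scene_indices, default=None)
    parser.add_argument("--concat-only", action="store_true")
    return parser


def scenes_to_render(num_scenes: int, regen: Optional[list[int]]) -> list[int]:
    """All scenes on a full run; only the requested ones on a diff iteration."""
    if regen is None:
        return list(range(num_scenes))
    # Assumption: indices beyond the plan length are ignored rather than fatal.
    return [i for i in regen if i < num_scenes]
```

Every scene not returned by `scenes_to_render` keeps its existing `scene_NN.mp4`, so the cost of an iteration scales with the number of scenes touched rather than the length of the video.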


Code Snippets

Gemini Pro API call (multimodal video review)

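import base64
import os

import httpx

# Assumption: the API key is supplied via an environment variable.
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
# REVIEW_PROMPT (the evaluation instructions) is defined elsewhere in the pipeline.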

GEMINI_MODEL = "gemini-3.1-pro-preview"
GEMINI_API = f"https://generativelanguage.googleapis.com/v1beta/models/{GEMINI_MODEL}:generateContent"

def review_final(final_path, plan):
    vid_b64 = base64.b64encode(final_path.read_bytes()).decode()
    scene_summary = "\n".join(
        f"  scene {i+1}: speaker={s['speaker']}, lang={s.get('tts_language','ja')}, "
        f"script={s['script']!r}"
        for i, s in enumerate(plan["scenes"])
    )
    payload = {
        "contents": [{"parts": [
            {"inline_data": {"mime_type": "video/mp4", "data": vid_b64}},
            {"text": REVIEW_PROMPT + f"\n\nScene plan:\n{scene_summary}"},
        ]}],
        "generationConfig": {
            "temperature": 0.3,
            "maxOutputTokens": 8192,
            # 3.x Pro is a thinking model: maxOutputTokens includes thinking tokens
            # Set thinking budget explicitly to ensure output tokens remain available
            "thinkingConfig": {"thinkingBudget": 1024},
        },
        # Minimize safety filters for comedy context
        "safetySettings": [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        ],
    }
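    r = httpx.post(
        GEMINI_API,
        headers={"x-goog-api-key": GOOGLE_API_KEY, "Content-Type": "application/json"},
        json=payload,
        timeout=120.0,
    )
    # Surface HTTP errors (bad key, quota, oversized payload) instead of a KeyError below
    r.raise_for_status()
    return r.json()["candidates"][0]["content"]["parts"][0]["text"]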

Without thinkingConfig.thinkingBudget, Gemini 3.x Pro burned through the output token budget on internal thinking in my runs, and the visible response was truncated at around 40 tokens. Treat it as a required setting whenever you use Gemini 3.x Pro.
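
This kind of truncation is easy to miss because the call still returns text. A defensive parser can check the candidate's `finishReason` before trusting the output. The sketch below works on the decoded JSON response; the function name and the choice to raise rather than retry are assumptions.

```python
def extract_review_text(response: dict) -> str:
    """Join the visible text parts and fail loudly if the output was cut short."""
    candidates = response.get("candidates") or []
    if not candidates:
        # A blocked prompt comes back with promptFeedback and no candidates.
        raise RuntimeError(f"no candidates returned: {response.get('promptFeedback')}")
    candidate = candidates[0]
    parts = candidate.get("content", {}).get("parts", [])
    # Skip thought-summary parts so only the actual review text is returned.
    text = "".join(p["text"] for p in parts if "text" in p and not p.get("thought"))
    if candidate.get("finishReason") == "MAX_TOKENS":
        raise RuntimeError(f"response hit the token limit after {len(text)} characters")
    return text
```

Raising on `MAX_TOKENS` turns a silently short review into an error the pipeline can act on, for example by retrying with a larger `maxOutputTokens`.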

TTS output quality check (STT similarity + silence gap retry)

XTTS uses sampling internally, so results vary per run with the same script. It occasionally inserts long silence gaps mid-audio or produces garbled pronunciation. After TTS completes, I transcribe with Whisper, compute similarity against the expected script, and retry on failure:

import difflib
import re

import soundfile as sf  # provides sf.write, used below

def _norm(s):
    return re.sub(r"[\s。、,.!?「」'\"…—–\-:;()()]", "", s).lower()

def _script_similarity(expected, actual):
    return difflib.SequenceMatcher(None, _norm(expected), _norm(actual)).ratio()

def synthesize_scene(scene, out_dir, idx, fallback_language):
    lang = scene.get("tts_language", fallback_language)
    expected = scene["script"]
    best = None
    for attempt in range(1, TTS_MAX_RETRIES + 1):
        audio, sr = _xtts_once(scene, fallback_language)
        gap = _longest_internal_gap_sec(audio, sr)
        transcript = _stt(audio, sr, lang)
        sim = _script_similarity(expected, transcript)
        if best is None or _score(gap, sim) > _score(best[2], best[3]):
            best = (audio, sr, gap, sim, transcript)
        if gap <= 0.9 and sim >= 0.5:
            break
        print(f"⚠ gap={gap:.2f}s sim={sim:.2f}, retrying ({attempt})")
    # If threshold isn't met after 3 retries, use the best sample found
    audio, sr, gap, sim, transcript = best
    sf.write(out_dir / f"scene_{idx:02d}.wav", audio, sr, subtype="PCM_16")

This alone significantly reduces cases where XTTS's non-deterministic quality variance bleeds through into the final video.
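
The `_longest_internal_gap_sec` helper is not shown above. One way it could work is sketched here in plain Python: split the audio into short frames, mark frames whose peak amplitude is below a threshold as silent, ignore leading and trailing silence, and report the longest silent run in between. The threshold, the frame length, and the implementation itself are assumptions; a real version would probably vectorize this with NumPy.

```python
from typing import Sequence


def longest_internal_gap_sec(samples: Sequence[float], sr: int,
                             threshold: float = 0.01,
                             frame_ms: int = 20) -> float:
    """Longest silent stretch between the first and last voiced frames."""
    frame = max(1, int(sr * frame_ms / 1000))
    # One flag per frame: True when the frame's peak stays below the threshold.
    silent = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        silent.append(max(abs(x) for x in chunk) < threshold)
    voiced = [i for i, is_silent in enumerate(silent) if not is_silent]
    if not voiced:
        return 0.0  # nothing but silence: there is no "internal" gap to measure
    longest = run = 0
    for is_silent in silent[voiced[0]:voiced[-1] + 1]:
        run = run + 1 if is_silent else 0
        longest = max(longest, run)
    return longest * frame / sr
```

Ignoring the silence at the start and end matters because a clip with normal padding should not trip the 0.9 s gap threshold used in the retry loop.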


Where This Pattern Generalizes

"Sub-agent the heavy judgment to a frontier model, keep heavy compute local" works beyond video pipelines:

  • Large-scale search ranking: Send 100 web search results to a frontier model for editorial evaluation, return only the top 10 to the main agent. Keeps search result noise out of the main agent's context.
  • Long-form editing review: Have a frontier model do the editorial read of PRs, design docs, or specs. Main agent only receives the summary.
  • Multilingual QA: Sub-agent to the best model per language; main agent holds only the cross-language decision logic.

The common thread: consciously deciding what belongs in context vs. what should be completed inside an API call. Frontier model editorial signal is remarkably cost-effective relative to what it delivers.
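
The search-ranking case can be reduced to a small sketch. Here `judge` stands in for the frontier-model call and is a hypothetical callable that returns one score per item; only the top results ever reach the caller, which is the part of the pattern that keeps the main agent's context small.

```python
from typing import Callable, Sequence


def top_k_via_subagent(items: Sequence[str],
                       judge: Callable[[Sequence[str]], Sequence[float]],
                       k: int = 10) -> list[str]:
    """Score every item inside the sub-agent call and return only the best k."""
    scores = judge(items)  # hypothetical: one frontier-model call, fresh context
    if len(scores) != len(items):
        raise ValueError("judge must return exactly one score per item")
    # Highest score first; ties keep their original order.
    ranked = sorted(range(len(items)), key=lambda i: (-scores[i], i))
    return [items[i] for i in ranked[:k]]
```

The main agent sees at most `k` strings regardless of how many candidates the sub-agent had to read.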

On the video pipeline side, the next steps are generalizing the comedy format (split-screen, 3+ characters, other genres) and volume testing.


Summary

  • Built a foundation that generates publishable comedy videos in 60 seconds from a single line of idea text, using a local GPU + Gemini 3.1 Pro Preview hybrid
  • Local-only falls short on two fronts: (1) safety tuning blurs the punchline and (2) the reviewer can't produce editorial signal. Sub-agenting a frontier model solves both
  • Take frontier reviewer notes seriously and verify them against the final artifact. Checking the WAV with Whisper alone won't catch audio truncation in the final mp4
  • Sub-agent token economics keep main agent context clean — total cost is $0.03–0.05 per video
  • With --regen-scenes diff iteration running 30-second loops, the Gemini feedback → fix → re-evaluate cycle runs at minute-by-minute speed

Finished video (reprise):

[Embedded YouTube video]

The local implementation lives in llm_server/pipeline_multi.py. Detailed findings from the development process are accumulating in docs/MULTI_SCENE_COMEDY_FINDINGS_2026-05-12.md as an internal reference.
