Two-host AI dialogue specs: how I structure YouTube longform scripts with A/B speaker JSON

#showdev #productivity #ai #webdev

The video pipeline I've been building for two YouTube channels running off this monorepo started with short-form vertical clips — a single narrator, a single slide, done. Longform is different. A ten-minute explainer with one voice and no conversational variation is hard to watch even when the content is good. I wanted something that felt like two people talking through a problem, not a text-to-speech audiobook.

The solution was a two-host dialogue spec: a JSON file where each line of audio is tagged with a speaker (A or B), and the build script renders it as a full-length video alternating between two neural voices.

What the spec looks like

The simplest possible spec:

{
  "title": "How Turso libSQL compares to Cloudflare D1",
  "description": "...",
  "tags": ["database", "cloudflare", "turso"],
  "privacy": "public",
  "segments": [
    {
      "speaker": "A",
      "text": "Turso and D1 look similar from the outside — both are SQLite-compatible edge databases.",
      "slide": { "kind": "title", "heading": "Turso vs D1" }
    },
    {
      "speaker": "B",
      "text": "Right. But they differ significantly on branching, replication topology, and cost model once you scale past the free tier."
    },
    {
      "speaker": "A",
      "text": "Let's go through each. First, branching.",
      "slide": { "kind": "section", "heading": "Branching" }
    }
  ]
}

A few things to notice. The slide field is optional — when omitted, the build script holds the previous slide while the audio plays. This means you don't need a new visual for every sentence, which would be exhausting to maintain and would produce a choppy video. A new slide appears only when there's something worth showing.
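The carry-forward rule for slides is easy to sketch. The function below is an illustration, not the code in build_longform.py: the name resolve_slides is hypothetical, and it assumes a segment before the first slide simply has nothing to show.

```python
from typing import Optional


def resolve_slides(segments: list[dict]) -> list[Optional[dict]]:
    """Return the slide that is on screen for each segment.

    A segment without a "slide" key inherits the most recent slide.
    Segments that come before the first slide resolve to None.
    """
    resolved = []
    current = None  # nothing on screen until the first slide appears
    for segment in segments:
        if "slide" in segment:
            current = segment["slide"]
        resolved.append(current)
    return resolved


segments = [
    {"speaker": "A", "text": "Intro.", "slide": {"kind": "title", "heading": "Turso vs D1"}},
    {"speaker": "B", "text": "Follow-up."},
    {"speaker": "A", "text": "Next.", "slide": {"kind": "section", "heading": "Branching"}},
]

for seg, slide in zip(segments, resolve_slides(segments)):
    print(seg["speaker"], slide["heading"])
```

Speaker B's segment has no slide of its own, so it resolves to the "Turso vs D1" title card that A introduced.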

The speaker field maps to a voice. In build_longform.py:

import os

# Voice per speaker tag; each can be overridden through an environment variable.
VOICE = {
    "A": os.environ.get("LF_VOICE_A", "en-US-GuyNeural"),
    "B": os.environ.get("LF_VOICE_B", "en-US-AvaNeural"),
}

Both are neural voices served through edge-tts, a Python wrapper around Microsoft Edge's text-to-speech service that exposes the same neural voices as the browser without requiring an Azure subscription. The A/B assignment came from testing: one lower, more measured voice for exposition, and one that sounds more conversational for follow-up and counterpoint. You can override both with environment variables, which matters if the default voices aren't available in a given edge-tts catalog version.
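For reference, here is a minimal sketch of how one segment could be synthesized with that mapping. The helper names voice_for and synthesize are hypothetical and the real build script may be organized differently; the only library call used is edge_tts.Communicate(text, voice).save(path), which needs the edge-tts package and network access.

```python
import os

VOICE = {
    "A": os.environ.get("LF_VOICE_A", "en-US-GuyNeural"),
    "B": os.environ.get("LF_VOICE_B", "en-US-AvaNeural"),
}


def voice_for(speaker: str) -> str:
    """Look up the voice for a speaker tag, failing loudly on unknown tags."""
    try:
        return VOICE[speaker]
    except KeyError:
        raise ValueError(f"unknown speaker {speaker!r}; expected one of {sorted(VOICE)}")


async def synthesize(text: str, speaker: str, out_path: str) -> None:
    """Write one segment's audio to out_path as mp3 (requires network access)."""
    import edge_tts  # imported lazily so voice_for() works without the package

    await edge_tts.Communicate(text, voice_for(speaker)).save(out_path)


# Usage, from a script:
#   import asyncio
#   asyncio.run(synthesize("Right. But they differ on branching.", "B", "seg_001.mp3"))
```

Failing on an unknown speaker tag is a design choice of this sketch: a typo such as "C" in a generated spec surfaces before any audio is requested.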

How the build works

build_longform.py processes the spec linearly:

1. For each segment with a slide, render the slide to PNG via slides.py
2. Synthesize the segment's text with edge-tts for the assigned speaker voice, writing to an mp3
3. Build a silent video clip from the PNG, then mux it with the audio
4. After all segments: concatenate all clips with ffmpeg

The result is a single output.mp4 where each visual change happens exactly when a new slide is specified in the spec — usually at section transitions, not on every sentence.

If a segment has no slide key, the previous slide's PNG is reused. The timing automatically matches the audio duration because each clip is built from its own audio file. No manual timestamp editing.
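The clip and concat steps can be sketched as ffmpeg invocations built in Python. This is an assumption about how such a build could look, not the actual contents of build_longform.py: it collapses "silent clip, then mux" into a single ffmpeg call per segment (a looped still image cut off when the audio ends), and the function names, encoder choices, and file names are all hypothetical.

```python
import subprocess
from pathlib import Path


def clip_command(png: Path, audio: Path, out: Path) -> list[str]:
    """ffmpeg arguments for one clip: a looped still image cut to the audio's length."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", str(png),  # repeat the still image indefinitely
        "-i", str(audio),
        "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                   # stop encoding when the audio ends
        str(out),
    ]


def concat_list(clips: list[Path]) -> str:
    """Contents of the concat demuxer's input file, one absolute path per line."""
    return "".join(f"file '{clip.resolve().as_posix()}'\n" for clip in clips)


def concat_command(list_file: Path, out: Path) -> list[str]:
    """ffmpeg arguments for joining the clips without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", str(out)]


def build(pairs: list[tuple[Path, Path]], workdir: Path) -> Path:
    """Render one clip per (png, audio) pair, then concatenate them in order."""
    clips = []
    for i, (png, audio) in enumerate(pairs):
        clip = workdir / f"clip_{i:03d}.mp4"
        subprocess.run(clip_command(png, audio, clip), check=True)
        clips.append(clip)
    list_file = workdir / "clips.txt"
    list_file.write_text(concat_list(clips), encoding="utf-8")
    output = workdir / "output.mp4"
    subprocess.run(concat_command(list_file, output), check=True)
    return output
```

Because every clip takes its length from its own audio file, no timestamps appear anywhere in the sketch. Stream-copy concatenation only works when all clips share the same codec settings, which holds here because they all come from the same clip_command.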

The CI step that runs this writes the output path to an environment file:

YT_OUTPUT_PATH=/tmp/longform/output.mp4

The downstream YouTube publish step reads that variable and uploads. Same pattern as the short-form video pipeline I wrote about.
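A minimal sketch of that handoff, assuming the environment file is a plain KEY=value text file. The function names are hypothetical, and the post does not say which CI system is in use; on GitHub Actions, for example, the file named by the GITHUB_ENV variable plays this role.

```python
import tempfile
from pathlib import Path


def export_output_path(output: Path, env_file: Path) -> None:
    """Append YT_OUTPUT_PATH=<output> to a KEY=value env file for later CI steps."""
    with env_file.open("a", encoding="utf-8") as fh:
        fh.write(f"YT_OUTPUT_PATH={output}\n")


def read_env_file(env_file: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key] = value
    return values


# Round trip in a temporary directory, standing in for the two CI steps.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / "ci.env"
    export_output_path(Path("/tmp/longform/output.mp4"), env_file)
    print(read_env_file(env_file)["YT_OUTPUT_PATH"])
```

Appending rather than overwriting keeps any variables that earlier steps wrote to the same file.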

What the spec generator produces

The specs aren't written by hand. A Claude call takes a topic and an outline and produces the full segment list, deciding where slides should appear, which speaker handles which part of the argument, and what heading text goes on each slide.

The prompt instructs the model to split responsibilities clearly: speaker A leads and introduces, speaker B challenges, adds nuance, or extends with examples. To my ear, this produces a conversational dynamic that is more engaging than a single narrator, even though neither voice is a real person.

One thing that took adjustment: Claude tends to generate very even A/B splits — roughly alternating every sentence. Real dialogue isn't that regular. I added an instruction to vary the run lengths: sometimes A speaks three sentences before B responds, sometimes B only adds a single sentence. That small change makes the output feel less mechanical.
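One way to check a generated spec for that mechanical feel is to measure speaker run lengths. The sketch below is not part of the pipeline described here; the function names and the threshold are hypothetical, and it counts consecutive segments per speaker rather than sentences within a segment.

```python
from itertools import groupby


def speaker_runs(segments: list[dict]) -> list[tuple[str, int]]:
    """Collapse consecutive segments by the same speaker into (speaker, count) runs."""
    return [
        (speaker, sum(1 for _ in group))
        for speaker, group in groupby(seg["speaker"] for seg in segments)
    ]


def looks_mechanical(segments: list[dict], min_distinct_lengths: int = 2) -> bool:
    """True when too few distinct run lengths occur, e.g. strict A/B/A/B alternation."""
    lengths = {count for _, count in speaker_runs(segments)}
    return len(lengths) < min_distinct_lengths


strict = [{"speaker": s} for s in "ABABAB"]
varied = [{"speaker": s} for s in "AAABAB"]

print(speaker_runs(varied))      # [('A', 3), ('B', 1), ('A', 1), ('B', 1)]
print(looks_mechanical(strict))  # True
print(looks_mechanical(varied))  # False
```

A check like this could trigger a regeneration when the model falls back to strict alternation despite the prompt instruction.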

What I haven't solved yet

The PNGtuber-style character art is wired into the build script through the _host_assets() function, but that function is asset-gated and currently returns None because I haven't made the visual assets for either host. The code path is there for when I do, but for now the video is slide-only with audio.
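Asset gating of this kind usually comes down to an existence check. The sketch below shows the general pattern only: find_host_assets, the file names, and the directory layout are all hypothetical, and the real _host_assets() may work differently.

```python
from pathlib import Path
from typing import Optional

# Hypothetical layout: one PNG per host. The post does not describe the
# real file names or directory.
HOST_FILES = {"A": "host_a.png", "B": "host_b.png"}


def find_host_assets(asset_dir: Path) -> Optional[dict[str, Path]]:
    """Return {speaker: png_path} only if every host image exists, else None."""
    paths = {speaker: asset_dir / name for speaker, name in HOST_FILES.items()}
    if all(path.is_file() for path in paths.values()):
        return paths
    return None
```

A caller can then branch on the None result and fall back to slide-only rendering, so the build keeps working until the art exists.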

The slide renderer (slides.py) is also limited to a few layouts: title cards, section headers, comparison tables, bullet lists. Richer layouts like code blocks with syntax highlighting or real diagrams would require more work in Pillow or a headless browser, which I'm deferring. The Mermaid + matplotlib diagram pipeline I built for short-form videos doesn't cleanly transfer to longform because the timing model is different.

The two-voice format is working for the content I'm producing. Whether it changes watch time compared with a single-voice format, I can't say yet: I don't have enough data for anything reliable. I'll publish numbers once there are 30+ videos in the channel.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.