DEV Community

tsensei
I open-sourced an AI pipeline that turns any topic into a YouTube Short

What is OpenReels?

OpenReels takes a topic and produces a YouTube Short. It handles the research, script, voiceover, visuals, music, captions, and assembly. You get a vertical MP4 at the end.

Pipeline Demo

It's MIT licensed, runs via Docker Compose, and costs about $0.68 per video. You bring your own API keys.

GitHub: github.com/tsensei/OpenReels

How it works

Give it a topic. Six stages run automatically:

| Stage | What happens |
| --- | --- |
| Research | Web search grounds the script in real facts |
| Script | An AI creative director writes a "DirectorScore": a per-scene production plan |
| Voiceover | TTS with word-level timestamps for karaoke-style captions |
| Visuals | AI images (Gemini, DALL-E), AI video (Veo, Kling), vision-verified stock footage |
| Music | AI-generated via Lyria 3 Pro, synced to the video's emotional arc |
| Assembly | Remotion composites everything with transitions and animated captions |

Every stage streams progress to a web UI. You can watch it work in real time.
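The stage flow above can be sketched as a sequential runner that emits progress events as it goes. This is an illustrative sketch, not the actual OpenReels code; the stage names come from the table, but `runPipeline` and the `ProgressListener` callback are hypothetical.

```typescript
// Hypothetical sketch of the six-stage pipeline with progress streaming.
type StageName = "research" | "script" | "voiceover" | "visuals" | "music" | "assembly";

type ProgressListener = (stage: StageName, status: "started" | "done") => void;

async function runPipeline(topic: string, onProgress: ProgressListener): Promise<StageName[]> {
  const stages: StageName[] = ["research", "script", "voiceover", "visuals", "music", "assembly"];
  const completed: StageName[] = [];
  for (const stage of stages) {
    onProgress(stage, "started");
    // ...real work (web search, LLM calls, TTS, rendering) would happen here,
    // each stage reading the outputs of the stages before it
    completed.push(stage);
    onProgress(stage, "done");
  }
  return completed;
}
```

In the real app, those events would be pushed over a socket to the web UI rather than to an in-process callback.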

The DirectorScore

This is the design choice that made everything else click. Early versions generated assets independently and the results felt disconnected. The fix: make the AI write a structured plan before generating anything.

```json
{
  "scenes": [
    {
      "sceneNumber": 1,
      "visual": {
        "type": "ai_image",
        "description": "Close-up of astronaut's visor reflecting Earth",
        "motion": { "type": "zoom_in", "intensity": "subtle" }
      },
      "voiceover": "On April 13, 1970, three men heard a bang that changed space history.",
      "transition": { "type": "crossfade", "durationMs": 500 }
    }
  ]
}
```

Every downstream stage reads from this score. The image generator follows the visual description, the music prompter maps the emotional arc, the caption renderer syncs to word timestamps. Same idea as film production: director writes the vision, departments execute against it.
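In TypeScript terms, the score is just a typed object that each stage projects down to the fields it cares about. The interfaces below mirror the JSON example; the real schema may have more fields, and `imagePrompts` is an illustrative consumer, not an actual OpenReels function.

```typescript
// Types mirroring the DirectorScore JSON shown above (illustrative).
interface Motion { type: string; intensity: string; }
interface Visual { type: "ai_image" | "ai_video" | "stock"; description: string; motion?: Motion; }
interface Transition { type: string; durationMs: number; }
interface Scene {
  sceneNumber: number;
  visual: Visual;
  voiceover: string;
  transition: Transition;
}
interface DirectorScore { scenes: Scene[]; }

// A downstream stage reads only what it needs; here, the image generator
// pulls the visual descriptions for every AI-image scene.
function imagePrompts(score: DirectorScore): string[] {
  return score.scenes
    .filter((s) => s.visual.type === "ai_image")
    .map((s) => s.visual.description);
}
```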

Stock footage verification

Stock footage search is bad. Like, really bad. Search "astronaut's visor reflecting Earth" on Pexels and you get generic space B-roll.

So there's a VLM (vision-language model) that reviews each stock result and checks if it actually matches what the scene needs. Mismatch? The pipeline rewrites the search query and tries again. If stock is totally exhausted, it falls back to AI image generation.

The query reformulation step is where most of the improvement comes from. The initial search terms are rarely what stock APIs want to hear.
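The verify-reformulate-retry loop can be sketched like this. Everything here is hypothetical scaffolding: `searchStock`, `vlmMatches`, and `reformulateQuery` stand in for the real stock API, VLM, and LLM calls, which the post doesn't show.

```typescript
interface Clip { url: string; }

// Hypothetical sketch: search, verify with a VLM, rewrite the query on a
// mismatch, and give up after a few attempts so the caller can fall back
// to AI image generation.
async function findVerifiedClip(
  need: string, // what the scene actually needs, from the DirectorScore
  query: string,
  searchStock: (q: string) => Promise<Clip | null>,
  vlmMatches: (clip: Clip, need: string) => Promise<boolean>,
  reformulateQuery: (q: string, need: string) => Promise<string>,
  maxAttempts = 3,
): Promise<Clip | null> {
  let q = query;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const clip = await searchStock(q);
    if (clip && (await vlmMatches(clip, need))) return clip;
    q = await reformulateQuery(q, need); // where most of the improvement comes from
  }
  return null; // caller falls back to AI image generation
}
```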

Music

I didn't want random background tracks. The music prompter writes a Lyria 3 Pro prompt with:

  • Per-scene timestamp sections
  • Intensity ratings (1-10)
  • Instrument specs
  • Dynamics ("sparse piano at 0:00, build strings at 0:15, full orchestra at 0:30, settle to solo cello at 0:45")

The track ducks under voiceover automatically.
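The structured prompt the bullets describe can be modeled as a list of timestamped sections serialized into one string. This is a guess at the shape, not the actual prompter; the real Lyria prompt format isn't shown in the post.

```typescript
// Illustrative section model: timestamp, intensity, instruments, dynamics.
interface MusicSection {
  startSec: number;
  intensity: number; // 1-10
  instruments: string[];
  dynamics: string;
}

// Serialize the sections into a single prompt string, one line per section.
function buildMusicPrompt(sections: MusicSection[]): string {
  return sections
    .map((s) => `[${s.startSec}s] intensity ${s.intensity}/10, ${s.instruments.join(" + ")}: ${s.dynamics}`)
    .join("\n");
}
```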

Archetypes

There are 14 visual archetypes. Each one is a config that controls the entire look and feel:

  • anime_illustration: fast cuts, vibrant, cel-shaded
  • moody_cinematic: dark, slow, atmospheric
  • editorial_caricature: satirical, exaggerated
  • infographic: clean, data-heavy, rapid
  • pastoral_watercolor: soft, painterly
  • surreal_dreamscape: ethereal, impossible geometry
  • comic_book: bold outlines, halftone dots

They control pacing (fast: 8-12 scenes, moderate: 7-10, cinematic: 5-8), color palette, caption style, the image generation style bible, and transition defaults.
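An archetype config might look roughly like this. The field names and values are inferred from the description above, not taken from the actual OpenReels schema.

```typescript
// Illustrative archetype config (fields inferred from the post).
interface Archetype {
  name: string;
  pacing: "fast" | "moderate" | "cinematic";
  sceneRange: [number, number]; // min/max scene count
  palette: string[];
  captionStyle: string;
  defaultTransition: string;
}

// Scene-count ranges from the post: fast 8-12, moderate 7-10, cinematic 5-8.
const SCENE_RANGES: Record<Archetype["pacing"], [number, number]> = {
  fast: [8, 12],
  moderate: [7, 10],
  cinematic: [5, 8],
};

const moodyCinematic: Archetype = {
  name: "moody_cinematic",
  pacing: "cinematic",
  sceneRange: SCENE_RANGES.cinematic,
  palette: ["#0b0c10", "#1f2833", "#45a29e"], // hypothetical dark palette
  captionStyle: "serif, low-contrast",
  defaultTransition: "crossfade",
};
```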

Available Archetype Grid

Try it

```bash
git clone https://github.com/tsensei/OpenReels.git
cd OpenReels
cp .env.example .env   # add your API keys
docker compose up      # starts Redis + API + Worker
# Open http://localhost:3000
```

Or, as a single command:

```bash
docker run --env-file .env --shm-size=2gb -v ./output:/output \
  ghcr.io/tsensei/openreels "the apollo 13 disaster"
```

Cost

A typical video costs about $0.68:

  • LLM calls: $0.003 (7 calls)
  • TTS: $0.017
  • AI images: $0.30 (3 images)
  • AI video clip: $0.30 (1 clip)
  • Music: $0.08 (Lyria generation)
  • Stock footage: free

You can also go cheaper: `--provider local` uses Kokoro for voiceover (free, no API key), there's a bundled music library with 25 tracks, and you can use stock footage only.

Stack

TypeScript, Mastra, Vercel AI SDK 6, Fastify 5, BullMQ + Redis, Remotion 4, React 19, Tailwind, shadcn/ui.

Web UI

Feedback welcome

I'm looking for thoughts on the architecture and the DirectorScore approach specifically. Contributors welcome too.

github.com/tsensei/OpenReels
