DEV Community

tsensei
I open-sourced an AI pipeline that turns any topic into a YouTube Short

What is OpenReels?

OpenReels takes a topic and produces a YouTube Short. It handles the research, script, voiceover, visuals, music, captions, and assembly. You get a vertical MP4 at the end.

Pipeline Demo

It's MIT licensed, runs via Docker Compose, and costs about $0.68 per video. You bring your own API keys.

GitHub: github.com/tsensei/OpenReels

How it works

Give it a topic. Six stages run automatically:

| Stage | What happens |
| --- | --- |
| Research | Web search grounds the script in real facts |
| Script | An AI creative director writes a "DirectorScore": a per-scene production plan |
| Voiceover | TTS with word-level timestamps for karaoke-style captions |
| Visuals | AI images (Gemini, DALL-E), AI video (Veo, Kling), vision-verified stock footage |
| Music | AI-generated via Lyria 3 Pro, synced to the video's emotional arc |
| Assembly | Remotion composites everything with transitions and animated captions |

Every stage streams progress to a web UI. You can watch it work in real time.
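The stage flow above can be sketched as a sequential runner that emits progress events as it goes. This is an illustrative sketch, not the actual OpenReels code; the stage names come from the table, but `runPipeline` and the `ProgressListener` callback are hypothetical.

```typescript
// Hypothetical sketch of the six-stage pipeline with progress streaming.
type StageName = "research" | "script" | "voiceover" | "visuals" | "music" | "assembly";

type ProgressListener = (stage: StageName, status: "started" | "done") => void;

async function runPipeline(topic: string, onProgress: ProgressListener): Promise<StageName[]> {
  const stages: StageName[] = ["research", "script", "voiceover", "visuals", "music", "assembly"];
  const completed: StageName[] = [];
  for (const stage of stages) {
    onProgress(stage, "started");
    // ...real work (web search, LLM calls, TTS, rendering) would happen here,
    // each stage reading the outputs of the stages before it
    completed.push(stage);
    onProgress(stage, "done");
  }
  return completed;
}
```

In the real app, those events would be pushed over a socket to the web UI rather than to an in-process callback.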

The DirectorScore

This is the design choice that made everything else click. Early versions generated assets independently and the results felt disconnected. The fix: make the AI write a structured plan before generating anything.

```json
{
  "scenes": [
    {
      "sceneNumber": 1,
      "visual": {
        "type": "ai_image",
        "description": "Close-up of astronaut's visor reflecting Earth",
        "motion": { "type": "zoom_in", "intensity": "subtle" }
      },
      "voiceover": "On April 13, 1970, three men heard a bang that changed space history.",
      "transition": { "type": "crossfade", "durationMs": 500 }
    }
  ]
}
```

Every downstream stage reads from this score. The image generator follows the visual description, the music prompter maps the emotional arc, the caption renderer syncs to word timestamps. Same idea as film production: director writes the vision, departments execute against it.
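In TypeScript terms, the score is just a typed object that each stage projects down to the fields it cares about. The interfaces below mirror the JSON example; the real schema may have more fields, and `imagePrompts` is an illustrative consumer, not an actual OpenReels function.

```typescript
// Types mirroring the DirectorScore JSON shown above (illustrative).
interface Motion { type: string; intensity: string; }
interface Visual { type: "ai_image" | "ai_video" | "stock"; description: string; motion?: Motion; }
interface Transition { type: string; durationMs: number; }
interface Scene {
  sceneNumber: number;
  visual: Visual;
  voiceover: string;
  transition: Transition;
}
interface DirectorScore { scenes: Scene[]; }

// A downstream stage reads only what it needs; here, the image generator
// pulls the visual descriptions for every AI-image scene.
function imagePrompts(score: DirectorScore): string[] {
  return score.scenes
    .filter((s) => s.visual.type === "ai_image")
    .map((s) => s.visual.description);
}
```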

Stock footage verification

Stock footage search is bad. Like, really bad. Search "astronaut's visor reflecting Earth" on Pexels and you get generic space B-roll.

So there's a VLM (vision-language model) that reviews each stock result and checks if it actually matches what the scene needs. Mismatch? The pipeline rewrites the search query and tries again. If stock is totally exhausted, it falls back to AI image generation.

The query reformulation step is where most of the improvement comes from. The initial search terms are rarely what stock APIs want to hear.
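The verify-reformulate-retry loop can be sketched like this. Everything here is hypothetical scaffolding: `searchStock`, `vlmMatches`, and `reformulateQuery` stand in for the real stock API, VLM, and LLM calls, which the post doesn't show.

```typescript
interface Clip { url: string; }

// Hypothetical sketch: search, verify with a VLM, rewrite the query on a
// mismatch, and give up after a few attempts so the caller can fall back
// to AI image generation.
async function findVerifiedClip(
  need: string, // what the scene actually needs, from the DirectorScore
  query: string,
  searchStock: (q: string) => Promise<Clip | null>,
  vlmMatches: (clip: Clip, need: string) => Promise<boolean>,
  reformulateQuery: (q: string, need: string) => Promise<string>,
  maxAttempts = 3,
): Promise<Clip | null> {
  let q = query;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const clip = await searchStock(q);
    if (clip && (await vlmMatches(clip, need))) return clip;
    q = await reformulateQuery(q, need); // where most of the improvement comes from
  }
  return null; // caller falls back to AI image generation
}
```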

Music

I didn't want random background tracks. The music prompter writes a Lyria 3 Pro prompt with:

  • Per-scene timestamp sections
  • Intensity ratings (1-10)
  • Instrument specs
  • Dynamics ("sparse piano at 0:00, build strings at 0:15, full orchestra at 0:30, settle to solo cello at 0:45")

The track ducks under voiceover automatically.
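The structured prompt the bullets describe can be modeled as a list of timestamped sections serialized into one string. This is a guess at the shape, not the actual prompter; the real Lyria prompt format isn't shown in the post.

```typescript
// Illustrative section model: timestamp, intensity, instruments, dynamics.
interface MusicSection {
  startSec: number;
  intensity: number; // 1-10
  instruments: string[];
  dynamics: string;
}

// Serialize the sections into a single prompt string, one line per section.
function buildMusicPrompt(sections: MusicSection[]): string {
  return sections
    .map((s) => `[${s.startSec}s] intensity ${s.intensity}/10, ${s.instruments.join(" + ")}: ${s.dynamics}`)
    .join("\n");
}
```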

Archetypes

There are 14 visual archetypes. Each one is a config that controls the entire look and feel:

  • anime_illustration: fast cuts, vibrant, cel-shaded
  • moody_cinematic: dark, slow, atmospheric
  • editorial_caricature: satirical, exaggerated
  • infographic: clean, data-heavy, rapid
  • pastoral_watercolor: soft, painterly
  • surreal_dreamscape: ethereal, impossible geometry
  • comic_book: bold outlines, halftone dots

They control pacing (fast: 8-12 scenes, moderate: 7-10, cinematic: 5-8), color palette, caption style, the image generation style bible, and transition defaults.
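An archetype config might look roughly like this. The field names and values are inferred from the description above, not taken from the actual OpenReels schema.

```typescript
// Illustrative archetype config (fields inferred from the post).
interface Archetype {
  name: string;
  pacing: "fast" | "moderate" | "cinematic";
  sceneRange: [number, number]; // min/max scene count
  palette: string[];
  captionStyle: string;
  defaultTransition: string;
}

// Scene-count ranges from the post: fast 8-12, moderate 7-10, cinematic 5-8.
const SCENE_RANGES: Record<Archetype["pacing"], [number, number]> = {
  fast: [8, 12],
  moderate: [7, 10],
  cinematic: [5, 8],
};

const moodyCinematic: Archetype = {
  name: "moody_cinematic",
  pacing: "cinematic",
  sceneRange: SCENE_RANGES.cinematic,
  palette: ["#0b0c10", "#1f2833", "#45a29e"], // hypothetical dark palette
  captionStyle: "serif, low-contrast",
  defaultTransition: "crossfade",
};
```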

Available Archetype Grid

Try it

```bash
git clone https://github.com/tsensei/OpenReels.git
cd OpenReels
cp .env.example .env   # add your API keys
docker compose up      # starts Redis + API + Worker
# Open http://localhost:3000
```

Or, as a single command:

```bash
docker run --env-file .env --shm-size=2gb -v ./output:/output \
  ghcr.io/tsensei/openreels "the apollo 13 disaster"
```

Cost

A typical video costs about $0.68:

  • LLM calls: $0.003 (7 calls)
  • TTS: $0.017
  • AI images: $0.30 (3 images)
  • AI video clip: $0.30 (1 clip)
  • Music: $0.08 (Lyria generation)
  • Stock footage: free

You can also go cheaper: `--provider local` uses Kokoro for voiceover (free, no API key), there's a bundled music library with 25 tracks, and you can use stock footage only.

Stack

TypeScript, Mastra, Vercel AI SDK 6, Fastify 5, BullMQ + Redis, Remotion 4, React 19, Tailwind, shadcn/ui.

Web UI

Feedback welcome

I'm looking for thoughts on the architecture and the DirectorScore approach specifically. Contributors welcome too.

github.com/tsensei/OpenReels
