
v. Splicer


ffmpeg-ai: A Free CLI That Turns a Prompt Into a Finished YouTube Short

GitHub “Finish-Up-A-Thon” Challenge Submission

This is a submission for the GitHub Finish-Up-A-Thon Challenge


What I Built

Short-form video has a tooling problem. Every step lives in a different window. Script in one app. Images in another. Voice in a third. Timeline in a fourth. Each one wants a subscription. Most of them are slow.

I got tired of it and built a pipeline instead.

ffmpeg-ai is a free Python CLI that takes a single prompt and produces a finished, upload-ready 1080x1920 MP4. Script, voiceover, captions, visuals, motion — all of it. One command, one output file.

ffmpeg-ai generate "the history of pager hacking"

That single command:

  1. Calls OpenRouter (free tier) to generate a structured short-form script
  2. Fetches scene images from Pollinations.ai — no auth required, no cost
  3. Synthesizes narration via edge-tts (Microsoft TTS, completely free)
  4. Transcribes the generated audio locally with faster-whisper
  5. Uses the transcription to produce word-timed ASS captions
  6. Composes everything through FFmpeg into a 30fps H.264/AAC vertical short

Output: 1080x1920, up to 60 seconds, burned-in captions. Ready for Shorts, Reels, or TikTok. Zero paid API calls: a free OpenRouter account is the only sign-up required.

The project means something personal beyond the tooling. I run cybersecurity content channels and spend more time fighting creator infrastructure than actually making content. This was the fix.


Demo

Repository: github.com/numbpill3d/ffmpeg-ai

Install and run:

git clone https://github.com/numbpill3d/ffmpeg-ai.git
cd ffmpeg-ai
uv pip install -e ".[dev]"
cp .env.example .env
# add your free OpenRouter key (from https://openrouter.ai) to .env

ffmpeg-ai generate "why analog radio still works when everything else fails"

Test the full pipeline without making any API calls first:

ffmpeg-ai generate --dry-run "any topic"

[Screenshots: command line interface]

Pipeline screenshot from the repo:

[Screenshots: pipeline running, example output]


The Comeback Story

Before this challenge, the repo had the right bones but didn't work end-to-end. The pieces existed as separate modules that had never been integrated into a real pipeline run. The synchronization layer — the part that makes the whole thing actually function — was the part I kept deferring.

What the project was: A collection of functional-but-disconnected modules. Script generation worked. Image fetching worked. TTS worked. They had never been wired together in a way that produced a real output file.

What was broken or missing:

  • No integration between the audio synthesis and caption generation steps
  • Timing was estimated from word count, which drifted badly on anything over 30 seconds
  • The FFmpeg compose step existed as loose, untested subprocess calls
  • No installable entrypoint — you couldn't actually run it as a CLI

What I finished:

The core fix was the synchronization architecture. Generated speech doesn't match expected durations. A 400-word script doesn't predictably produce a 45-second audio file. Early versions estimated scene durations from word count. That approach drifts, especially on longer scripts with uneven pacing.

The solution: derive all timing from the actual generated audio. faster-whisper transcribes the edge-tts output locally and returns word-level timestamps. Those timestamps feed directly into the ASS subtitle generator and the FFmpeg compose step. The audio became the source of truth — everything else conforms to it.

Prompt
  ↓
Script Generation (OpenRouter)
  ↓
Scene Extraction
  ↓
Image Generation (Pollinations)
  ↓
Voice Synthesis (edge-tts)
  ↓
Local Transcription (faster-whisper)  ← this was the missing link
  ↓
Caption Generation (ASS)
  ↓
FFmpeg Assembly
  ↓
Finished Short
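
To make the "audio is the source of truth" idea concrete, here is a minimal sketch of turning word-level timestamps into short caption cues. It is an illustration under stated assumptions, not the code in the repo: the `Word` and `Cue` classes, the `group_words` helper, the three-word limit, and the 0.6 second pause threshold are all hypothetical, and the sample timestamps are invented. `Word` simply mirrors the start, end, and text values that faster-whisper reports per word when word-level timestamps are enabled.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its start/end time in seconds (hypothetical shape)."""
    text: str
    start: float
    end: float


@dataclass
class Cue:
    """A caption shown on screen from `start` to `end`."""
    text: str
    start: float
    end: float


def _to_cue(chunk: list[Word]) -> Cue:
    # The cue spans from the first word's start to the last word's end.
    text = " ".join(w.text.strip() for w in chunk)
    return Cue(text=text, start=chunk[0].start, end=chunk[-1].end)


def group_words(words: list[Word], max_words: int = 3, max_gap: float = 0.6) -> list[Cue]:
    """Group consecutive words into short cues timed from the audio itself."""
    cues: list[Cue] = []
    current: list[Word] = []
    for word in words:
        # Start a new cue when the current one is full or the speaker paused.
        if current and (len(current) >= max_words or word.start - current[-1].end > max_gap):
            cues.append(_to_cue(current))
            current = []
        current.append(word)
    if current:
        cues.append(_to_cue(current))
    return cues


# Invented sample data standing in for a real transcription.
words = [
    Word("Pagers", 0.00, 0.42),
    Word("were", 0.42, 0.60),
    Word("never", 0.60, 0.95),
    Word("encrypted.", 0.95, 1.70),
    Word("Anyone", 2.60, 3.05),
    Word("could", 3.05, 3.30),
    Word("listen.", 3.30, 3.90),
]
cues = group_words(words)
for cue in cues:
    print(f"{cue.start:5.2f} -> {cue.end:5.2f}  {cue.text}")
```

Because every cue boundary comes from measured word times, a long pause or a fast sentence shifts the captions with it instead of accumulating drift the way a word-count estimate does.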

I also wrapped all FFmpeg subprocess calls into composer.py so nothing else in the pipeline touches raw filter graph syntax. A typo in an FFmpeg filter graph can silently corrupt the output or throw an error three minutes into a render. Centralizing it meant one place to fix, one place to test.
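
As a rough illustration of that centralizing idea, the sketch below builds the whole FFmpeg argument list in one function so it can be inspected and tested without starting a render. This is not the actual contents of composer.py: the `build_compose_command` name, the single-image input, and the file names are assumptions made for the example, and it leaves out the Ken Burns motion entirely. The flags and filters themselves (`-loop`, `-vf`, `scale`, `crop`, `ass`, `-shortest`) are standard FFmpeg options; the `ass` filter needs an FFmpeg build with libass.

```python
from pathlib import Path


def build_compose_command(image: Path, audio: Path, captions: Path, output: Path) -> list[str]:
    """Return an ffmpeg argv list for a single-image 1080x1920 short (hypothetical helper)."""
    video_filter = ",".join([
        # Fill the vertical frame, then crop the overflow.
        "scale=1080:1920:force_original_aspect_ratio=increase",
        "crop=1080:1920",
        # Burn in the ASS captions. Paths containing ':' or '\' would need
        # filter-graph escaping, which this sketch does not handle.
        f"ass={captions.as_posix()}",
    ])
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", str(image),   # repeat the still image as video
        "-i", str(audio),
        "-vf", video_filter,
        "-r", "30",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                       # stop when the audio ends
        str(output),
    ]


cmd = build_compose_command(
    Path("scene.png"), Path("voice.mp3"), Path("captions.ass"), Path("short.mp4")
)
print(" ".join(cmd))
# To actually render: subprocess.run(cmd, check=True)
```

Returning a list rather than a shell string keeps quoting out of the picture, and a unit test can assert on the filter string before any three-minute render is attempted.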

The repo went from fragmented modules to a complete, installable CLI that produces real output files.

What's still ahead: batch generation mode, local model support to remove the OpenRouter dependency entirely, custom voice profiles, and improved motion systems beyond basic Ken Burns.


My Experience with GitHub Copilot

Copilot was most useful on the parts of this project that are high-volume and low-creativity: the FFmpeg filter graph construction and the ASS subtitle format generation.

FFmpeg filter graphs for multi-input composition with motion effects and subtitle overlays are verbose by nature. The syntax is precise and the failure modes are opaque — a misplaced bracket or wrong pixel format string produces either silence or a corrupted render, not a useful error message. Copilot autocompleted filter graph segments accurately enough that I could iterate on the logic rather than debug syntax. That's the right use of it.

The ASS subtitle format has its own timestamp syntax and style block conventions. Rather than referring back to the spec constantly, I described what I needed in a comment and Copilot generated the correct format string. It was right on the first try, which is not something I can say for my own attempts at ASS format strings from memory.
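
For readers who have not met the format, this is roughly what that timestamp syntax involves. The sketch is mine, not the Copilot output described above: the `ass_timestamp` and `dialogue_line` helpers are hypothetical names, and the `Default` style is assumed to be defined in the file's style block. ASS times are written as `H:MM:SS.cc`, with centiseconds rather than milliseconds, and each caption is one `Dialogue:` event line.

```python
def ass_timestamp(seconds: float) -> str:
    """Format seconds as an ASS timestamp: H:MM:SS.cc (centiseconds)."""
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, rem = divmod(rem, 6_000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def dialogue_line(start: float, end: float, text: str, style: str = "Default") -> str:
    """Build one ASS event line for a caption (style name is an assumption)."""
    # Fields: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
    return f"Dialogue: 0,{ass_timestamp(start)},{ass_timestamp(end)},{style},,0,0,0,,{text}"


print(ass_timestamp(0.95))
print(dialogue_line(2.6, 3.9, "Anyone could listen."))
```

Doing the arithmetic in whole centiseconds avoids a rounding case where the fractional part would otherwise print as `.100`.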

Where I didn't use it: pipeline architecture, the timing synchronization approach, and anything involving the free service integration logic. Those decisions required understanding the actual constraints of edge-tts, Pollinations rate behavior, and faster-whisper's output format — context Copilot doesn't have. The structural thinking stayed mine. The boilerplate went faster.

Net result: the parts of the project that would have taken the most time to get syntactically right (FFmpeg, subtitle format) took the least time. That freed up the actual problem-solving time for the synchronization architecture, which is what makes the project work.
