Ever tried producing a 10-minute video solo?
Script. Voiceover. Visuals. Editing. Color. Music. Subtitles. Export.
That's not a weekend project — that's a full-time team. Four people minimum. Eight thousand dollars a month, easy.
I refused to accept that. So I wrote a Python script that does all of it.
## What This Thing Actually Does
You give it a topic. It gives you a finished MP4.
Topic → Script → Voice → Images → Video → BGM + Subtitles → MP4
No babysitting. No timeline dragging. No "just one more export." You run one command, walk away, and come back to a video.
```bash
python generate_plan.py "How quantum computing works" --produce
```
That's the whole interaction.
## The Pipeline: 5 Stages, Zero Clicks
### Stage 1 — Script
An LLM takes your topic and writes the full narration plus scene-by-scene image prompts. Plug in any OpenAI-compatible provider: Ollama (free, runs locally), DeepSeek, OpenAI, Gemini — your call.
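"OpenAI-compatible" just means a standard chat-completions payload, so swapping providers is a one-line URL change. A minimal sketch of what that request looks like; the model name, prompt wording, and `response_format` hint are illustrative assumptions, not the repo's exact code, and not every provider honors `response_format`:

```python
import json

def build_script_request(topic: str, model: str = "llama3") -> dict:
    """Build an OpenAI-compatible chat payload asking the LLM for a
    narration script plus one image prompt per scene."""
    return {
        "model": model,  # assumption: any model your provider serves
        "messages": [
            {"role": "system",
             "content": "You are a video scriptwriter. Return the narration "
                        "and one image prompt per scene as JSON."},
            {"role": "user", "content": f"Write a video script about: {topic}"},
        ],
        "response_format": {"type": "json_object"},  # ask for parseable output
    }

payload = build_script_request("How quantum computing works")
# POST this to any OpenAI-compatible endpoint, e.g. a local Ollama server:
#   http://localhost:11434/v1/chat/completions
body = json.dumps(payload)
```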
### Stage 2 — Voice
Edge-TTS turns the script into speech. It's Microsoft's free TTS service. Multi-language, decent quality, zero cost.
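If you want to poke at this stage on its own, edge-tts ships a CLI. A sketch of assembling that call from Python; the voice name is just one of Edge-TTS's many voices, and the helper is illustrative, not the repo's actual code:

```python
def build_tts_command(text: str, out_path: str,
                      voice: str = "en-US-AriaNeural") -> list[str]:
    """Assemble an edge-tts CLI call; run it with subprocess.run()
    to synthesize the narration to an audio file."""
    return [
        "edge-tts",
        "--voice", voice,          # pick any voice from `edge-tts --list-voices`
        "--text", text,
        "--write-media", out_path,
    ]

cmd = build_tts_command("Quantum bits can be zero and one at once.",
                        "narration.mp3")
# subprocess.run(cmd, check=True)  # uncomment once edge-tts is installed
```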
### Stage 3 — Visuals
ComfyUI + Flux generates every scene image on your local GPU. No cloud calls. No API bills. No rate limits.
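ComfyUI exposes a small HTTP API on localhost, and queuing a generation is one POST of an exported workflow graph to its `/prompt` endpoint. A minimal sketch; the port is ComfyUI's default, and the workflow contents here are a placeholder for a real node graph exported via "Save (API Format)":

```python
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188"  # ComfyUI's default local address

def queue_workflow(workflow: dict) -> urllib.request.Request:
    """Wrap a ComfyUI workflow (the JSON exported from the UI) in a
    POST request for the /prompt endpoint, which queues it for generation."""
    data = json.dumps({"prompt": workflow}).encode()
    return urllib.request.Request(
        f"{COMFYUI_URL}/prompt", data=data,
        headers={"Content-Type": "application/json"},
    )

# workflow = json.load(open("flux_scene.json"))  # a real exported graph
# urllib.request.urlopen(queue_workflow(workflow))
req = queue_workflow({"placeholder": "node graph goes here"})
```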
### Stage 4 — Motion (optional)
HunyuanVideo animates the static images into video clips. Requires 16GB VRAM. Don't have it? Skip this — static images still make a perfectly watchable video.
### Stage 5 — Assembly
BGM gets layered in. Subtitles get burned. Everything stitches together into a final MP4.
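If you're curious what assembly boils down to, it's essentially one ffmpeg pass: burn the subtitles, duck the background music under the narration, and mux. A sketch of such a command; the filter graph and the 0.2 BGM level are illustrative assumptions, not the repo's exact settings:

```python
def build_assembly_command(video: str, bgm: str, subs: str, out: str) -> list[str]:
    """One ffmpeg pass: burn subtitles, mix BGM under the narration,
    and mux everything into the final MP4."""
    graph = (
        f"[0:v]subtitles={subs}[v];"                 # burn subtitles into the video
        "[1:a]volume=0.2[bg];"                        # quiet the BGM (level is a guess)
        "[0:a][bg]amix=inputs=2:duration=first[a]"    # narration over BGM
    )
    return [
        "ffmpeg", "-y", "-i", video, "-i", bgm,
        "-filter_complex", graph,
        "-map", "[v]", "-map", "[a]",
        "-c:v", "libx264", "-c:a", "aac", out,
    ]

cmd = build_assembly_command("scenes.mp4", "bgm.mp3", "subs.srt", "final.mp4")
# subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
```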
Each stage is independent. Kill the process halfway through? Re-run the same command — it picks up exactly where it stopped. Checkpoint resume, built in.
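Checkpoint resume doesn't need anything fancy: a small state file listing finished stages is enough. A toy sketch of the idea, with the file name and stage names as illustrative stand-ins:

```python
import json
from pathlib import Path

STATE_FILE = Path("checkpoint.json")  # hypothetical state-file name
STAGES = ["script", "voice", "images", "motion", "assembly"]

def load_done() -> set[str]:
    """Read the set of stages already completed on a previous run."""
    if STATE_FILE.exists():
        return set(json.loads(STATE_FILE.read_text()))
    return set()

def run_pipeline(run_stage) -> list[str]:
    """Run each stage once; record it after it finishes, skip it next time."""
    done = load_done()
    ran = []
    for stage in STAGES:
        if stage in done:
            continue  # completed before the last interruption
        run_stage(stage)
        done.add(stage)  # checkpoint immediately after the stage succeeds
        STATE_FILE.write_text(json.dumps(sorted(done)))
        ran.append(stage)
    return ran
```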
## Inside the Repo
Under 20 files. Nothing hidden, nothing clever:
- `generate_plan.py` — topic in, production plan out
- `produce_from_plan.py` — plan in, video out
- `main.py` — the pipeline core
- `modules/` — one file per stage (LLM, TTS, image gen, video assembly, BGM)
- `setup.py` — interactive wizard, 3 questions, done
## Hardware? Lower Than You Think
| What | Minimum | Sweet Spot |
|---|---|---|
| GPU | 8GB VRAM (images only) | 16GB VRAM (images + motion) |
| RAM | 16GB | 32GB |
| Disk | 50GB free | 100GB+ |
A used RTX 2070 handles it fine.
## Getting Started
Three commands. That's the setup.
```bash
git clone https://github.com/counter-eng/ai-video-factory.git
cd ai-video-factory && pip install -r requirements.txt
python setup.py
```
The wizard asks three things: which LLM, where's ComfyUI, GPU or CPU encoding. It writes your config. You're done.
Then:
```bash
python generate_plan.py "How radar works" --produce
```
Go make coffee. Come back to a video.
## Why I Open-Sourced This
I built it because I needed it. Running a content channel solo means choosing between quality and quantity — unless you automate.
After months of running this pipeline, my output as one person matched a three-person team. That felt too useful to keep private.
So here it is. MIT license. Fork it, break it, improve it, ship it. If you hit a bug, open an issue.
The entire source is yours.
## Links
GitHub: https://github.com/counter-eng/ai-video-factory
YouTube: https://www.youtube.com/@CounterintuitiveEng
Star it if you find it useful. PRs welcome.
```bash
git clone https://github.com/counter-eng/ai-video-factory.git
```
Your code makes videos. You make ideas.


