Ever tried producing a 10-minute video solo?
Script. Voiceover. Visuals. Editing. Color. Music. Subtitles. Export.
That's not a weekend project — that's a full-time team. Four people minimum. Eight thousand dollars a month, easy.
I refused to accept that. So I wrote a Python script that does all of it.
## What This Thing Actually Does
You give it a topic. It gives you a finished MP4.
Topic → Script → Voice → Images → Video → BGM + Subtitles → MP4
No babysitting. No timeline dragging. No "just one more export." You run one command, walk away, and come back to a video.
```bash
python generate_plan.py "How quantum computing works" --produce
```
That's the whole interaction.
## The Pipeline: 5 Stages, Zero Clicks
### Stage 1 — Script
An LLM takes your topic and writes the full narration plus scene-by-scene image prompts. Plug in any OpenAI-compatible provider: Ollama (free, runs locally), DeepSeek, OpenAI, Gemini — your call.
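"OpenAI-compatible" just means a standard chat-completions payload, so swapping providers is a one-line URL change. A minimal sketch of what that request looks like; the model name, prompt wording, and `response_format` hint are illustrative assumptions, not the repo's exact code, and not every provider honors `response_format`:

```python
import json

def build_script_request(topic: str, model: str = "llama3") -> dict:
    """Build an OpenAI-compatible chat payload asking the LLM for a
    narration script plus one image prompt per scene."""
    return {
        "model": model,  # assumption: any model your provider serves
        "messages": [
            {"role": "system",
             "content": "You are a video scriptwriter. Return the narration "
                        "and one image prompt per scene as JSON."},
            {"role": "user", "content": f"Write a video script about: {topic}"},
        ],
        "response_format": {"type": "json_object"},  # ask for parseable output
    }

payload = build_script_request("How quantum computing works")
# POST this to any OpenAI-compatible endpoint, e.g. a local Ollama server:
#   http://localhost:11434/v1/chat/completions
body = json.dumps(payload)
```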
### Stage 2 — Voice
Edge-TTS turns the script into speech. It's Microsoft's free TTS service. Multi-language, decent quality, zero cost.
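If you want to poke at this stage on its own, edge-tts ships a CLI. A sketch of assembling that call from Python; the voice name is just one of Edge-TTS's many voices, and the helper is illustrative, not the repo's actual code:

```python
def build_tts_command(text: str, out_path: str,
                      voice: str = "en-US-AriaNeural") -> list[str]:
    """Assemble an edge-tts CLI call; run it with subprocess.run()
    to synthesize the narration to an audio file."""
    return [
        "edge-tts",
        "--voice", voice,          # pick any voice from `edge-tts --list-voices`
        "--text", text,
        "--write-media", out_path,
    ]

cmd = build_tts_command("Quantum bits can be zero and one at once.",
                        "narration.mp3")
# subprocess.run(cmd, check=True)  # uncomment once edge-tts is installed
```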
### Stage 3 — Visuals
ComfyUI + Flux generates every scene image on your local GPU. No cloud calls. No API bills. No rate limits.
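ComfyUI exposes a small HTTP API on localhost, and queuing a generation is one POST of an exported workflow graph to its `/prompt` endpoint. A minimal sketch; the port is ComfyUI's default, and the workflow contents here are a placeholder for a real node graph exported via "Save (API Format)":

```python
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188"  # ComfyUI's default local address

def queue_workflow(workflow: dict) -> urllib.request.Request:
    """Wrap a ComfyUI workflow (the JSON exported from the UI) in a
    POST request for the /prompt endpoint, which queues it for generation."""
    data = json.dumps({"prompt": workflow}).encode()
    return urllib.request.Request(
        f"{COMFYUI_URL}/prompt", data=data,
        headers={"Content-Type": "application/json"},
    )

# workflow = json.load(open("flux_scene.json"))  # a real exported graph
# urllib.request.urlopen(queue_workflow(workflow))
req = queue_workflow({"placeholder": "node graph goes here"})
```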
### Stage 4 — Motion (optional)
HunyuanVideo animates the static images into video clips. Requires 16GB VRAM. Don't have it? Skip this — static images still make a perfectly watchable video.
### Stage 5 — Assembly
BGM gets layered in. Subtitles get burned. Everything stitches together into a final MP4.
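If you're curious what assembly boils down to, it's essentially one ffmpeg pass: burn the subtitles, duck the background music under the narration, and mux. A sketch of such a command; the filter graph and the 0.2 BGM level are illustrative assumptions, not the repo's exact settings:

```python
def build_assembly_command(video: str, bgm: str, subs: str, out: str) -> list[str]:
    """One ffmpeg pass: burn subtitles, mix BGM under the narration,
    and mux everything into the final MP4."""
    graph = (
        f"[0:v]subtitles={subs}[v];"                 # burn subtitles into the video
        "[1:a]volume=0.2[bg];"                        # quiet the BGM (level is a guess)
        "[0:a][bg]amix=inputs=2:duration=first[a]"    # narration over BGM
    )
    return [
        "ffmpeg", "-y", "-i", video, "-i", bgm,
        "-filter_complex", graph,
        "-map", "[v]", "-map", "[a]",
        "-c:v", "libx264", "-c:a", "aac", out,
    ]

cmd = build_assembly_command("scenes.mp4", "bgm.mp3", "subs.srt", "final.mp4")
# subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
```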
Each stage is independent. Kill the process halfway through? Re-run the same command — it picks up exactly where it stopped. Checkpoint resume, built in.
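Checkpoint resume doesn't need anything fancy: a small state file listing finished stages is enough. A toy sketch of the idea, with the file name and stage names as illustrative stand-ins:

```python
import json
from pathlib import Path

STATE_FILE = Path("checkpoint.json")  # hypothetical state-file name
STAGES = ["script", "voice", "images", "motion", "assembly"]

def load_done() -> set[str]:
    """Read the set of stages already completed on a previous run."""
    if STATE_FILE.exists():
        return set(json.loads(STATE_FILE.read_text()))
    return set()

def run_pipeline(run_stage) -> list[str]:
    """Run each stage once; record it after it finishes, skip it next time."""
    done = load_done()
    ran = []
    for stage in STAGES:
        if stage in done:
            continue  # completed before the last interruption
        run_stage(stage)
        done.add(stage)  # checkpoint immediately after the stage succeeds
        STATE_FILE.write_text(json.dumps(sorted(done)))
        ran.append(stage)
    return ran
```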
## Inside the Repo
Under 20 files. Nothing hidden, nothing clever:
- `generate_plan.py` — topic in, production plan out
- `produce_from_plan.py` — plan in, video out
- `main.py` — the pipeline core
- `modules/` — one file per stage (LLM, TTS, image gen, video assembly, BGM)
- `setup.py` — interactive wizard, 3 questions, done
## Hardware? Lower Than You Think
| What | Minimum | Sweet Spot |
|---|---|---|
| GPU | 8GB VRAM (images only) | 16GB VRAM (images + motion) |
| RAM | 16GB | 32GB |
| Disk | 50GB free | 100GB+ |
A used RTX 2070 handles it fine.
## Getting Started
Three commands. That's the setup.
```bash
git clone https://github.com/counter-eng/ai-video-factory.git
cd ai-video-factory && pip install -r requirements.txt
python setup.py
```
The wizard asks three things: which LLM, where's ComfyUI, GPU or CPU encoding. It writes your config. You're done.
Then:
```bash
python generate_plan.py "How radar works" --produce
```
Go make coffee. Come back to a video.
## Why I Open-Sourced This
I built it because I needed it. Running a content channel solo means choosing between quality and quantity — unless you automate.
After months of running this pipeline, my output as one person matched a three-person team. That felt too useful to keep private.
So here it is. MIT license. Fork it, break it, improve it, ship it. If you hit a bug, open an issue.
The entire source is yours.
## Links
GitHub: https://github.com/counter-eng/ai-video-factory
YouTube: https://www.youtube.com/@CounterintuitiveEng
Star it if you find it useful. PRs welcome.
```bash
git clone https://github.com/counter-eng/ai-video-factory.git
```
Your code makes videos. You make ideas.


