Series: Zero to Autopilot — Building a Self-Improving AI Media Channel. Part 2 of 7. Part 1 covered the landscape and my $10 wake-up call. This one is the architecture: how a single line of text becomes an uploaded Short without me ever opening a video editor.
Data status (Part 2): real-now. Code, file layout, and measured costs straight from the repo. No audience metrics — those are sandbagged to Part 7.
⭐ The whole thing is open source: github.com/dasein108/slope-studio. Clone along — there's a zero-API-key smoke test at the bottom.
The mental model: a video is a Makefile
Most "AI video generator" tools are a single monolith — one giant button, one black box, and when scene 14 comes out cursed you get to regenerate all 14. I've shipped enough software to know that's the wrong shape.
So I stole the model from build systems: a video is a directed pipeline of stages, each stage is a pure function from files to files, and the whole thing is idempotent. Re-run a stage, it skips work that's already done. Blow away one artifact, only that stage (and its dependents) rebuild. It's make with a YouTube upload at the end.
Here's the pipeline, top to bottom:
idea ──► [1 script] ──► 01_script.json (timed scenes + narration)
│
├──► [2 visuals] ──► 02_visuals/scene_NN.png
│
├──► [2.5 narrate] ─► 05_voice/scenes/*.mp3 + timing.json + captions.srt
│
├──► [3 clips] ────► 03_clips/scene_NN.mp4 (animate the stills)
│
├──► [4 stitch] ───► 04_stitched.mp4 (transitions, no audio)
│
├──► [5 voice] ────► 05_voice/final.mp4 (TTS + music muxed)
│
├──► [6 save] ─────► 06_final.mp4 (platform master)
│
└──► [7 publish] ──► YouTube
Every arrow writes a file. Every file lives under one run directory. Which brings us to the most important design decision in the whole project.
Everything is a file under runs/<id>/
No database. No hidden state. One run = one directory, and the directory is the state:
runs/lobachevsky/
├── project.json # the manifest: provider + cost + done-flag per stage
├── 01_script.json # scenes, narration, title, hashtags
├── 02_visuals/scene_01..15.png
├── 03_clips/scene_NN.mp4
├── 04_stitched.mp4
├── 05_voice/
│ ├── scenes/*.mp3 # per-scene TTS
│ ├── timing.json # per-scene durations (drives clip lengths)
│ ├── captions.srt
│ └── final.mp4
├── 06_final.mp4 # the master you upload
├── 06_final.json # SEO title/description/tags
└── 07_publish.json # the YouTube video id, once live
This sounds almost too simple, but it buys you everything:
- Debuggability: something looks off? Open the PNG. Read the JSON. No "inspect the pipeline state" tooling needed; ls and an image viewer are the debugger.
- Resumability: kill the process at scene 9, restart, and it picks up at scene 9.
- Idempotency: stages check for their own output and skip it. Re-running visuals won't re-bill you for 15 images you already have (--force when you actually want to regenerate).
- Version control of *artifacts*: every authored video in the repo is a folder you can diff, copy, or hand-edit.
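The resumability and idempotency points above boil down to one check: does the output file already exist? Here is a minimal sketch of that check, not the repo's actual code. The names run_if_missing and fake_image are hypothetical, and the temp-file rename is my own assumption about how to keep a killed process from leaving a half-written artifact behind.

```python
from pathlib import Path
from typing import Callable


def run_if_missing(output: Path, build: Callable[[Path], None], force: bool = False) -> bool:
    """Run `build` only when `output` is absent (or `force` is set).

    Returns True if work was done, False if the step was skipped.
    """
    if output.exists() and not force:
        return False  # artifact already on disk: skip the work and the bill
    output.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first, then rename, so an interrupted run
    # never leaves a partial file that a later run would treat as finished.
    tmp = output.with_name(output.name + ".tmp")
    build(tmp)
    tmp.replace(output)
    return True


def fake_image(path: Path) -> None:
    # Hypothetical stand-in for a paid image call; writes placeholder bytes.
    path.write_bytes(b"not-a-real-png")
```

Calling run_if_missing(run_dir / "02_visuals" / "scene_01.png", fake_image) twice does the work once; the second call returns False without touching the provider.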
Canonical paths live in exactly one place (studio/paths.py), so no stage ever hardcodes a filename:
def scene_image(d: Path, sid: int) -> Path:
    return visuals_dir(d) / f"scene_{sid:02d}.png"

def master(d: Path) -> Path:
    return d / "06_final.mp4"
Each stage is a CLI subcommand (and they chain)
The pipeline is a Typer app. Every stage is its own subcommand, so you can run the whole thing or surgically poke one stage:
# the whole pipeline, one idea in, one Short out:
studio run "lobachevsky geometry explained in a fun way" --duration 150
# or drive it stage by stage and inspect between steps:
RID=$(studio init "lobachevsky..." --duration 150)
studio script $RID # → 01_script.json (read it! confirm the narration is real)
studio visuals $RID # → 02_visuals/*.png
studio status $RID # render the manifest: what's done, what it cost
The stage order is one list, and run just walks it:
STAGE_ORDER = ["script", "visuals", "narrate", "clips", "stitch", "audio", "voice", "save"]
Adding a stage = write a pure function in stages/, add a subcommand, drop its name in that list. Adding a provider (a new image model, a new TTS) doesn't touch the pipeline at all — more on that next.
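To make "run just walks it" concrete, here is a rough sketch of such a loop. It is an illustration under assumptions, not the repo's implementation: the STAGES registry, the stage decorator, and the done set (standing in for the manifest's done-flags) are all hypothetical names.

```python
from typing import Callable, Dict, List, Set

STAGE_ORDER = ["script", "visuals", "narrate", "clips", "stitch", "audio", "voice", "save"]

# Hypothetical registry: stage name -> function that takes the run id.
STAGES: Dict[str, Callable[[str], None]] = {}


def stage(name: str):
    """Decorator that registers a stage function under `name`."""
    def register(fn: Callable[[str], None]) -> Callable[[str], None]:
        STAGES[name] = fn
        return fn
    return register


def run(run_id: str, done: Set[str]) -> List[str]:
    """Walk STAGE_ORDER, skipping stages already marked done.

    `done` stands in for the manifest's per-stage done-flags.
    Returns the names of the stages that actually executed.
    """
    executed = []
    for name in STAGE_ORDER:
        # Skip finished stages; also skip names with no registered function
        # (a simplification so this sketch runs with a partial registry).
        if name in done or name not in STAGES:
            continue
        STAGES[name](run_id)
        done.add(name)
        executed.append(name)
    return executed
```

With this shape, adding a stage is one decorated function plus one entry in the list, and a second run over the same done set executes nothing.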
The provider contract: every model reports its own cost
Here's the design choice I'm proudest of, because it's what makes the whole rest of the series possible. Every media-producing provider — every LLM, image model, video model, TTS — returns the same dataclass:
@dataclass
class GenResult:
    path: Path | None = None
    cost_usd: float = 0.0  # the REAL cost, computed by the provider
    latency_s: float = 0.0
    provider: str = ""
    note: str = ""
That cost_usd is not an estimate I jotted in a spreadsheet. The Nano Banana provider returns $0.039 per image. The Kling provider computes seconds × $0.07. The Ken-Burns animator returns $0.00. So when a stage runs, the manifest records a measured cost, not a guess:
class StageRecord(BaseModel):
    done: bool = False
    provider: str = ""
    cost_usd: float = 0.0

class Manifest(BaseModel):
    # ...
    def total_cost_usd(self) -> float:
        return round(sum(s.cost_usd for s in self.stages.values()), 4)
This is the foundation. You can't optimize what you don't measure, and you definitely can't put a budget-aware bandit (Part 6) on top of costs you're guessing at. Every dollar in this series is a real dollar the system reported on itself.
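As a rough illustration of the contract, here is how two providers with very different pricing could return the same result type. This is a sketch under assumptions: the VideoProvider protocol, the animate method, and the PerSecondVideo and FreeAnimator classes are hypothetical names, and neither class calls a real API. Only the GenResult fields and the seconds × $0.07 rule come from the text above.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Protocol


@dataclass
class GenResult:
    path: Optional[Path] = None
    cost_usd: float = 0.0
    latency_s: float = 0.0
    provider: str = ""
    note: str = ""


class VideoProvider(Protocol):
    # Hypothetical interface; the repo's actual base class may look different.
    def animate(self, image: Path, out: Path, seconds: float) -> GenResult: ...


class PerSecondVideo:
    """Illustrative paid provider: cost = seconds x rate."""

    def __init__(self, name: str, usd_per_second: float):
        self.name = name
        self.usd_per_second = usd_per_second

    def animate(self, image: Path, out: Path, seconds: float) -> GenResult:
        t0 = time.perf_counter()
        # A real provider would call its API here and write `out`.
        cost = round(seconds * self.usd_per_second, 4)
        return GenResult(path=out, cost_usd=cost,
                         latency_s=time.perf_counter() - t0, provider=self.name)


class FreeAnimator:
    """Illustrative local animator: always reports zero cost."""
    name = "kenburns"

    def animate(self, image: Path, out: Path, seconds: float) -> GenResult:
        return GenResult(path=out, cost_usd=0.0, provider=self.name)


def stage_cost(results: List[GenResult]) -> float:
    """Sum the costs the providers reported for one stage."""
    return round(sum(r.cost_usd for r in results), 4)
```

The stage never needs to know which provider it got; it sums whatever cost_usd values come back and writes that number into the manifest.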
Watching it actually run
Here's the real log from the Lobachevsky run — note each stage announcing its provider and cost as it goes:
» visuals
visuals 15 images via fal-nanobanana $0.585
» clips
clips 15 clips via fal-i2v $0.75
» stitch
stitch 15 clips
» voice
voice captions=burn via edge $0.0
» save
save runs/lobachevsky/06_final.mp4
done lobachevsky total $1.335
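The numbers in that log can be checked by hand. A quick sketch of the arithmetic, using only figures quoted in this post (the per-clip average is derived here, not something the log prints):

```python
IMAGE_USD = 0.039   # per still, the Nano Banana price quoted earlier
N_SCENES = 15

visuals = round(N_SCENES * IMAGE_USD, 4)    # the visuals line in the log
clips = 0.75                                # the clips line in the log
voice = 0.0                                 # edge TTS reported zero cost
total = round(visuals + clips + voice, 4)   # the "done" line in the log
per_clip = round(clips / N_SCENES, 4)       # derived average per animated clip

print(visuals, total, per_clip)
```

Fifteen stills at $0.039 each gives the logged $0.585, adding the $0.75 for clips gives the logged $1.335 total, and the clips work out to an average of $0.05 each.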
Fifteen stills, fifteen animated clips, narration, captions, muxed and mastered: $1.34, fully automated, from one line of text. (That run used a bit of paid AI video; the all-Ken-Burns version of the same Short is $0.585, and the cheap-tier playbook from Part 1 gets a similar video to six cents. The cost knobs are Part 4.) Here's a frame from the finished thing:
[Image: a frame from the finished Short]
And the data shape underneath each scene — the script stage emits timed scenes the rest of the pipeline consumes:
// 01_script.json (one scene)
{
  "id": 1,
  "start_s": 0, "end_s": 8,
  "narration": "What if everything you were taught about parallel lines was secretly a lie?",
  "visual_prompt": "railroad tracks vanishing toward a glowing question mark, retro poster",
  "on_screen_text": "...a lie?",
  "motion_hint": "slow push-in toward the vanishing point"
}
narration drives the TTS (and therefore the clip length — audio leads, video follows, so nothing ever desyncs). visual_prompt drives the image model. motion_hint drives the free animator. One JSON object, three downstream stages.
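"Audio leads, video follows" can be sketched as a small scheduling step: take the measured length of each scene's narration and lay the scenes out back to back. This is an assumption-heavy illustration. The schedule function and its pad_s parameter are hypothetical, and the real timing.json layout may differ from the plain list of durations used here.

```python
from typing import Dict, List


def schedule(durations_s: List[float], pad_s: float = 0.0) -> List[Dict]:
    """Turn per-scene audio durations into back-to-back (start, end) windows.

    Each clip is exactly as long as its narration (plus optional padding),
    so the video track can never drift away from the audio track.
    """
    t = 0.0
    windows = []
    for sid, dur in enumerate(durations_s, start=1):
        end = t + dur + pad_s
        windows.append({"id": sid, "start_s": round(t, 3), "end_s": round(end, 3)})
        t = end
    return windows
```

For example, narration clips of 8.0 and 6.5 seconds produce windows of 0 to 8.0 and 8.0 to 14.5, which is what the clips stage would then be asked to fill.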
Try it yourself (zero API keys, zero dollars)
The repo ships an offline mode so you can watch the whole pipeline run without a single key or cent. Stub providers stand in for the paid ones; everything else is real ffmpeg:
git clone https://github.com/dasein108/slope-studio
cd slope-studio
uv venv && source .venv/bin/activate
uv pip install -e ".[fal]"
# free, offline, end-to-end smoke test:
studio run "how black holes bend time" --duration 12 \
--script-provider stub --image-provider stub \
--video-provider kenburns --voice-provider edge
You'll get a real runs/<id>/ folder with a stitched, narrated 06_final.mp4 — built entirely from free local tooling. (Heads up: stub is a wiring generator — it emits placeholder text so you can test the plumbing. Swap in a real LLM key before you spend money on visuals, or you'll lovingly render meaningless filler. Ask me how I know.)
What I'd tell another AI engineer
Takeaway: Resist the monolith. Model your AI pipeline as stages of pure file-to-file functions over a single run directory, make each one an independently runnable command, and give every provider a uniform result type that reports its own cost. You get free debuggability (ls is your inspector), free resumability, free idempotency, and, crucially, a measured cost ledger that everything smarter you build later (budgets, auto-strategies, bandits) gets to stand on. Boring architecture is a feature.
Next — Part 3: Free Motion. The fun part. AI video is $0.07/second; I'm going to take a single still image and give it real motion — drift, parallax with subject inpainting, kinetic type, atmospheric rain and embers — for $0.00, with a deep dive into the ffmpeg filtergraphs and the indie-game-dev tricks behind them. (Spoiler: it's all already running in the live effects gallery.)
▶ Live effects gallery: dasein108.github.io/slope-studio
⭐ Star the repo to follow along: github.com/dasein108/slope-studio
🔔 Subscribe to the channel to watch the experiment grow from zero: the Lobachevsky Short

