Maksims Gavrilovs

Posted on Jun 6 • Edited on Jun 12

Zero to Autopilot, Part 2: One Line of Text to a Published Short, in 7 Stages

#ai #python #architecture #video

Series: Zero to Autopilot — Building a Self-Improving AI Media Channel. Part 2 of 7. Part 1 covered the landscape and my $10 wake-up call. This one is the architecture: how a single line of text becomes an uploaded Short without me ever opening a video editor.

Data status (Part 2): real-now. Code, file layout, and measured costs straight from the repo. No audience metrics — those are sandbagged to Part 7.

⭐ The whole thing is open source: github.com/dasein108/slope-studio. Clone along — there's a zero-API-key smoke test at the bottom.

The mental model: a video is a Makefile

Most "AI video generator" tools are a single monolith — one giant button, one black box, and when scene 14 comes out cursed you get to regenerate all 14. I've shipped enough software to know that's the wrong shape.

So I stole the model from build systems: a video is a directed pipeline of stages, each stage is a pure function from files to files, and the whole thing is idempotent. Re-run a stage, it skips work that's already done. Blow away one artifact, only that stage (and its dependents) rebuild. It's make with a YouTube upload at the end.
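To make the make analogy concrete, here is a minimal sketch of a single rule. It isn't lifted from the repo: `run_stage` and its parameters are names invented for illustration, and the timestamp comparison is an assumption borrowed from make about how "dependents rebuild" could work.

```python
from pathlib import Path
from typing import Callable


def run_stage(output: Path, inputs: list[Path], build: Callable[[], None],
              force: bool = False) -> bool:
    """Run `build` only when `output` is missing or older than an input.

    Returns True if the stage did work, False if it was skipped.
    """
    if not force and output.exists():
        out_mtime = output.stat().st_mtime_ns
        # Up to date when every input exists and none is newer than the output.
        if all(p.exists() and p.stat().st_mtime_ns <= out_mtime for p in inputs):
            return False
    build()
    return True
```

Deleting an output forces that one rule to rebuild, and because the rebuilt file gets a fresh timestamp, any rule that lists it as an input rebuilds too.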

Here's the pipeline, top to bottom:

 idea ──► [1 script] ──► 01_script.json        (timed scenes + narration)
            │
            ├──► [2 visuals] ──► 02_visuals/scene_NN.png
            │
            ├──► [2.5 narrate] ─► 05_voice/scenes/*.mp3 + timing.json + captions.srt
            │
            ├──► [3 clips] ────► 03_clips/scene_NN.mp4   (animate the stills)
            │
            ├──► [4 stitch] ───► 04_stitched.mp4         (transitions, no audio)
            │
            ├──► [5 voice] ────► 05_voice/final.mp4      (TTS + music muxed)
            │
            ├──► [6 save] ─────► 06_final.mp4            (platform master)
            │
            └──► [7 publish] ──► YouTube

Every arrow writes a file. Every file lives under one run directory. Which brings us to the most important design decision in the whole project.

Everything is a file under `runs/<id>/`

No database. No hidden state. One run = one directory, and the directory is the state:

runs/lobachevsky/
├── project.json          # the manifest: provider + cost + done-flag per stage
├── 01_script.json        # scenes, narration, title, hashtags
├── 02_visuals/scene_01..15.png
├── 03_clips/scene_NN.mp4
├── 04_stitched.mp4
├── 05_voice/
│   ├── scenes/*.mp3       # per-scene TTS
│   ├── timing.json        # per-scene durations (drives clip lengths)
│   ├── captions.srt
│   └── final.mp4
├── 06_final.mp4          # the master you upload
├── 06_final.json         # SEO title/description/tags
└── 07_publish.json       # the YouTube video id, once live

This sounds almost too simple, but it buys you everything:

Debuggability: something looks off? Open the PNG. Read the JSON. No "inspect the pipeline state" tooling needed; ls and an image viewer are the debugger.
Resumability: kill the process at scene 9, restart, and it picks up at scene 9.
Idempotency: each stage checks for its own output and skips work that's already done. Re-running visuals won't re-bill you for the 15 images you already have (pass --force when you actually want to regenerate).
Version control of *artifacts*: every authored video in the repo is a folder you can diff, copy, or hand-edit.
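The resumability and idempotency points come down to one loop shape. Here is a rough sketch of it, assuming a hypothetical `generate(scene_id, path)` callback in place of a real image provider; `render_scenes` and the `.part` temp-file trick are illustrative choices, not necessarily what the repo does.

```python
from pathlib import Path
from typing import Callable


def render_scenes(run_dir: Path, scene_ids: list[int],
                  generate: Callable[[int, Path], None],
                  force: bool = False) -> list[int]:
    """Generate one image per scene, skipping files that already exist.

    Returns the ids that were actually generated on this call.
    """
    out_dir = run_dir / "02_visuals"
    out_dir.mkdir(parents=True, exist_ok=True)
    generated = []
    for sid in scene_ids:
        target = out_dir / f"scene_{sid:02d}.png"
        if target.exists() and not force:
            continue  # already paid for, so do not re-bill
        # Write to a temp name first so a killed process never leaves a
        # half-written file that a later run would mistake for finished work.
        tmp = target.with_name(target.name + ".part")
        generate(sid, tmp)
        tmp.replace(target)
        generated.append(sid)
    return generated
```

Because the only state is "does the file exist", a crash mid-run needs no recovery logic: the next call walks the same list and resumes at the first missing scene.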

Canonical paths live in exactly one place (studio/paths.py), so no stage ever hardcodes a filename:

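from pathlib import Path

# visuals_dir(d) is another helper in the same module; its definition is not shown in this excerpt.
def scene_image(d: Path, sid: int) -> Path:
    # Zero-padded scene id, e.g. scene_03.png
    return visuals_dir(d) / f"scene_{sid:02d}.png"

def master(d: Path) -> Path:
    # The platform master that gets uploaded
    return d / "06_final.mp4"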

Each stage is a CLI subcommand (and they chain)

The pipeline is a Typer app. Every stage is its own subcommand, so you can run the whole thing or surgically poke one stage:

# the whole pipeline, one idea in, one Short out:
studio run "lobachevsky geometry explained in a fun way" --duration 150

# or drive it stage by stage and inspect between steps:
RID=$(studio init "lobachevsky..." --duration 150)
studio script  $RID     # → 01_script.json   (read it! confirm the narration is real)
studio visuals $RID     # → 02_visuals/*.png
studio status  $RID     # render the manifest: what's done, what it cost

The stage order is one list, and run just walks it:

STAGE_ORDER = ["script", "visuals", "narrate", "clips", "stitch", "audio", "voice", "save"]

Adding a stage = write a pure function in stages/, add a subcommand, drop its name in that list. Adding a provider (a new image model, a new TTS) doesn't touch the pipeline at all — more on that next.

The provider contract: every model reports its own cost

Here's the design choice I'm proudest of, because it's what makes the whole rest of the series possible. Every media-producing provider — every LLM, image model, video model, TTS — returns the same dataclass:

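from dataclasses import dataclass
from pathlib import Path

@dataclass
class GenResult:
    path: Path | None = None  # the artifact written, if any
    cost_usd: float = 0.0     # the REAL cost, computed by the provider
    latency_s: float = 0.0    # wall-clock time of the call
    provider: str = ""
    note: str = ""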

That cost_usd is not an estimate I jotted in a spreadsheet. The Nano Banana provider returns $0.039. The kling provider computes seconds × $0.07. The Ken-Burns animator returns $0.00. So when a stage runs, the manifest records measured cost, not guessed:

from pydantic import BaseModel

class StageRecord(BaseModel):
    done: bool = False
    provider: str = ""
    cost_usd: float = 0.0

class Manifest(BaseModel):
    # ...
    def total_cost_usd(self) -> float:
        # Sum the per-stage costs the providers reported
        return round(sum(s.cost_usd for s in self.stages.values()), 4)

This is the foundation. You can't optimize what you don't measure, and you definitely can't put a budget-aware bandit (Part 6) on top of costs you're guessing at. Every dollar in this series is a real dollar the system reported on itself.

Six small LLMs, not one big one

A thing worth flagging early, because it shapes the whole design: there is no single "AI" in this system. There are six narrow LLM jobs, each doing one small thing, each with a deterministic fallback so the pipeline runs with zero API keys. Where each call sits:

idea
 └─► [scriptwriter LLM] ──► timed scenes + narration
        └─► [art-director LLM] picks each scene's motion + look (animator, fx, atmosphere)
              └─► [vision LLM] locates a face's mouth for lip-sync (only on talkinghead)
 visuals → clips → stitch → voice → save
        └─► [SEO LLM] polishes title / description / tags before publish
 (growth loop)
   [ideator LLM] next falsifiable bet (+ web-search trends) → produce → measure →
   [reflector LLM] turns measured results into an updated strategy ─┘

| Role | Where | Job | Fallback (keyless) |
| --- | --- | --- | --- |
| Scriptwriter | `stages/script.py` | idea → timed scenes + narration | offline `stub` split |
| Art director | `artdirect.py` | pick per-scene animator / fx / atmosphere / transition | heuristic rules |
| Vision / mouth locator | `animate._detect_mouth` | find a face's mouth (pos + size) for lip-sync | explicit coords / default |
| SEO metadata | `stages/metadata.py` | polish title / description / tags | script-derived |
| Ideator | `marketing/ideate.py` | next viral bet + trend signals | strategy seeds |
| Reflector | `marketing/learn.py` | measured bets → updated strategy | top/bottom heuristic |
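The "LLM job with a deterministic fallback" pattern in that table can be sketched generically. Everything here is an assumption for illustration: the `with_fallback` helper, the `LLM_API_KEY` variable name, and the choice to also fall back when the LLM call raises (the source only promises the keyless path).

```python
import os
from typing import Callable


def with_fallback(llm_call: Callable[[str], str],
                  fallback: Callable[[str], str],
                  key_env: str = "LLM_API_KEY") -> Callable[[str], tuple[str, str]]:
    """Wrap an LLM job so it degrades to a deterministic function.

    The returned callable yields (result, source), where source is
    "llm" or "fallback", so callers can record which path ran.
    """
    def run(prompt: str) -> tuple[str, str]:
        if os.environ.get(key_env):
            try:
                return llm_call(prompt), "llm"
            except Exception:
                pass  # provider or network failure: fall through to the heuristic
        return fallback(prompt), "fallback"
    return run
```

Returning the source alongside the result keeps the keyless path honest: a run that silently used heuristics is visible as such instead of passing for LLM output.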

And, deliberately, the parts that must be reproducible and auditable are not LLMs: the explore/exploit bandit (Part 6) is plain Thompson sampling, and virality scoring (Part 5) is a fixed formula. LLMs write and judge taste; statistics make the decisions. Keeping that line clean is most of what makes the system debuggable.

Watching it actually run

Here's the real log from the Lobachevsky run — note each stage announcing its provider and cost as it goes:

» visuals
visuals 15 images via fal-nanobanana  $0.585
» clips
clips 15 clips via fal-i2v  $0.75
» stitch
stitch 15 clips
» voice
voice captions=burn via edge  $0.0
» save
save runs/lobachevsky/06_final.mp4
done lobachevsky  total $1.335

Fifteen stills, fifteen animated clips, narration, captions, muxed and mastered — $1.34, fully automated, from one line of text. (That run used a bit of paid AI video; the all-Ken-Burns version of the same Short is $0.585, and the cheap-tier playbook from Part 1 gets a similar video to six cents. The cost knobs are Part 4.) Here's a frame from the finished thing:

[Image: a frame from the finished Lobachevsky Short]

And the data shape underneath each scene — the script stage emits timed scenes the rest of the pipeline consumes:

// 01_script.json (one scene)
{
  "id": 1,
  "start_s": 0, "end_s": 8,
  "narration": "What if everything you were taught about parallel lines was secretly a lie?",
  "visual_prompt": "railroad tracks vanishing toward a glowing question mark, retro poster",
  "on_screen_text": "...a lie?",
  "motion_hint": "slow push-in toward the vanishing point"
}

narration drives the TTS (and therefore the clip length — audio leads, video follows, so nothing ever desyncs). visual_prompt drives the image model. motion_hint drives the free animator. One JSON object, three downstream stages.
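"Audio leads, video follows" amounts to re-timing the scripted scenes from the measured narration lengths. A minimal sketch, assuming the durations arrive as a plain mapping from scene id to seconds; that shape is a guess, since the real timing.json format isn't shown here, and `retime_scenes` is a hypothetical helper.

```python
def retime_scenes(scenes: list[dict], durations: dict[int, float]) -> list[dict]:
    """Recompute start_s/end_s so each clip lasts as long as its narration."""
    cursor = 0.0
    retimed = []
    for scene in scenes:
        # Fall back to the scripted length when a scene has no measured audio.
        length = durations.get(scene["id"], scene["end_s"] - scene["start_s"])
        retimed.append({**scene,
                        "start_s": round(cursor, 3),
                        "end_s": round(cursor + length, 3)})
        cursor += length
    return retimed
```

If scene 1 is scripted for 8 seconds but its narration runs 6.4, the clip is cut to 6.4 and every later scene shifts earlier by the same amount, so picture and voice cannot drift apart.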

Try it yourself (zero API keys, zero dollars)

The repo ships an offline mode so you can watch the whole pipeline run without a single key or cent. Stub providers stand in for the paid ones; everything else is real ffmpeg:

git clone https://github.com/dasein108/slope-studio
cd slope-studio
uv venv && source .venv/bin/activate
uv pip install -e ".[fal]"

# free, offline, end-to-end smoke test:
studio run "how black holes bend time" --duration 12 \
  --script-provider stub --image-provider stub \
  --video-provider kenburns --voice-provider edge

You'll get a real runs/<id>/ folder with a stitched, narrated 06_final.mp4 — built entirely from free local tooling. (Heads up: stub is a wiring generator — it emits placeholder text so you can test the plumbing. Swap in a real LLM key before you spend money on visuals, or you'll lovingly render meaningless filler. Ask me how I know.)
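If you want to confirm the smoke test produced what it should, a few lines of Python will do. This is a hypothetical helper, not part of the repo: the artifact names come from the run-directory layout shown earlier, and whether the stub run writes every one of them is an assumption.

```python
from pathlib import Path

# Artifact names taken from the run-directory layout described earlier.
EXPECTED = [
    "01_script.json",
    "02_visuals",
    "03_clips",
    "04_stitched.mp4",
    "05_voice/final.mp4",
    "06_final.mp4",
]


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the expected artifacts that are absent from a run directory."""
    return [rel for rel in EXPECTED if not (run_dir / rel).exists()]
```

An empty list means every stage left its file behind; anything else names the first place to look.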

What I'd tell another AI engineer

Takeaway: Resist the monolith. Model your AI pipeline as stages of pure file-to-file functions over a single run directory, make each one an independently runnable command, and give every provider a uniform result type that reports its own cost. You get free debuggability (ls is your inspector), free resumability, free idempotency, and — crucially — a measured cost ledger that everything smarter you build later (budgets, auto-strategies, bandits) gets to stand on. Boring architecture is a feature.

Next — Part 3: Free Motion. The fun part. AI video is $0.07/second; I'm going to take a single still image and give it real motion — drift, parallax with subject inpainting, kinetic type, atmospheric rain and embers — for $0.00, with a deep dive into the ffmpeg filtergraphs and the indie-game-dev tricks behind them. (Spoiler: it's all already running in the live effects gallery.)

▶ Live effects gallery: dasein108.github.io/slope-studio
⭐ Star the repo to follow along: github.com/dasein108/slope-studio
🔔 Subscribe to the channel to watch the experiment grow from zero: the Lobachevsky Short
